Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread

doi:10.1371/journal.pone.0304889

Fig 1.

Base network characterization.

(A) Network diagram of our base social media network. Node size is proportional to community population, and edge thickness is proportional to the number of user edges between two community nodes. The labels are extracted by applying topic modeling to the recorded tweet history within each community. (B) A directed network diagram for a sample of users within the ‘Assange’ community, where each node represents a user within the community, node size is proportional to follower count, and edge transparency is proportional to node out-degree. (C) The in-degree and out-degree distributions of our base network.
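
As a rough illustration of panel (C), the sketch below tallies in-degree and out-degree distributions from a directed edge list. The function name `degree_distributions` and the edge convention (each edge points from a follower to the account they follow) are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in_hist, out_hist): how many users have each in/out-degree.

    edges: iterable of (src, dst) pairs. The direction convention
    (follower -> followee) is an assumption for this sketch.
    """
    in_deg, out_deg, nodes = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Users with no incoming (or outgoing) edges count as degree 0.
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return dict(in_hist), dict(out_hist)


# Toy edge list (hypothetical users).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_hist, out_hist = degree_distributions(edges)
```

Plotting these histograms on log-log axes is the usual way to inspect heavy-tailed degree distributions in social networks.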

Fig 2.

Data segmentation.

Diagram displaying how historical social media data from users in our base network is distributed among the development stages for the ABM, infection model, and mutation model.

Fig 3.

Schematic diagram of the ABM logic.

Illustrative diagram conveying the operating principle behind the ABM. A source user is infected when they share a source post. Their followers are exposed to the infection, and some of them become infected themselves by resharing the source post. This process continues across infection layers, with a fraction of users mutating the infection as they transmit it by adding commentary to their reshare post.
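
The layer-by-layer logic described in this caption can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: `simulate_cascade`, the `followers` mapping, the `p_infect` callback (standing in for the trained infection model), and the constant `p_mutate` (standing in for the mutation model) are all hypothetical names introduced here.

```python
import random


def simulate_cascade(followers, source, p_infect, p_mutate, rng, max_layers=10):
    """Spread an infection layer by layer from a source user.

    followers: dict mapping a user to the list of users who follow them.
    p_infect:  callable (infector, follower) -> reshare probability
               (a stand-in for the infection model; assumption).
    p_mutate:  constant probability that a reshare adds commentary
               (a stand-in for the mutation model; assumption).
    Returns (infections per layer, number of mutated reshares).
    """
    infected = {source}
    frontier = [source]
    layer_counts = [1]  # layer 0 is the source user
    mutations = 0
    for _ in range(max_layers):
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, []):
                if follower in infected:
                    continue  # each user is infected at most once
                if rng.random() < p_infect(user, follower):
                    infected.add(follower)
                    next_frontier.append(follower)
                    if rng.random() < p_mutate:
                        mutations += 1
        if not next_frontier:
            break
        layer_counts.append(len(next_frontier))
        frontier = next_frontier
    return layer_counts, mutations


# Toy follower graph (hypothetical users).
followers = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
layers, mutated = simulate_cascade(
    followers, "s", lambda u, f: 0.5, p_mutate=0.1, rng=random.Random(0)
)
```

Because each run is stochastic, a study like this one would repeat the simulation many times per source post and summarize the resulting traces.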

Fig 4.

Schematic diagram of the infection model training process.

Diagram describing the training process for the infection model, which predicts whether User A will retweet User B’s post. The core model is a gradient boosted classifier with three sets of input features: (i) transformer embeddings of User B’s post; (ii) transformer embeddings extracted from both historical tweets User B has authored and historical tweets User A has retweeted from others; and (iii) user metadata (such as number of followers and number of followees) from both User A and User B. Once the infection model is trained, it can be deployed to estimate the likelihood of infection spread.
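
To make the three feature sets concrete, the sketch below assembles a single input vector for one (User A, User B, post) triple. It covers only feature construction; the classifier itself is omitted. The names `mean_embedding` and `build_features` are hypothetical, and mean pooling of the historical tweet embeddings is an assumption, since the caption does not say how they are aggregated.

```python
def mean_embedding(vectors):
    """Average a non-empty list of equal-length embedding vectors.

    Mean pooling is an assumption made for this sketch.
    """
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_features(post_emb, b_authored_embs, a_retweeted_embs, meta_a, meta_b):
    """Concatenate the three feature sets into one flat vector.

    (i)   embedding of User B's post
    (ii)  pooled embeddings of B's authored tweets and A's past retweets
    (iii) metadata (e.g. follower and followee counts) for A and B
    """
    hist_b = mean_embedding(b_authored_embs)
    hist_a = mean_embedding(a_retweeted_embs)
    return list(post_emb) + hist_b + hist_a + list(meta_a) + list(meta_b)


# Toy 2-dimensional embeddings and (followers, followees) metadata.
features = build_features(
    post_emb=[0.1, 0.2],
    b_authored_embs=[[1.0, 3.0], [3.0, 5.0]],
    a_retweeted_embs=[[0.0, 2.0]],
    meta_a=[100, 50],
    meta_b=[2000, 10],
)
```

A vector of this form, paired with a binary label for whether User A retweeted the post, is the kind of row a gradient boosted classifier would be trained on.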

Fig 5.

Infection model and ABM characterization.

(A) The AUC-ROC curves for the infection model across the training set and set of hold-out test sets from different time periods that occurred after all recorded training set events. Slight overfitting between the training and test sets is observed; however, performance across test sets appears roughly consistent, suggesting Period I and II user behavior encoded during the training process is indicative of forward-looking information sharing behavior for multiple months. (B) The number of infections across infection layers for a set of ABM trials for a sample source post. The grey lines represent traces obtained from each of the 1000 trials. The blue bands denote the 68% percentile bands across these trials, with the red dashed line representing the median number of infections at each infection layer across all trials.
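
The summary statistics in panel (B) can be sketched as follows: for each infection layer, take the median and a 68% band across trials. The sketch assumes the band is the central interval between the 16th and 84th percentiles and that every trial trace has been padded to the same length; both are assumptions, and `percentile` and `layer_bands` are names introduced for this example.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def layer_bands(trials):
    """Return one (p16, median, p84) tuple per infection layer.

    trials: list of per-layer infection counts, all of equal length
    (shorter traces are assumed to be zero-padded beforehand).
    """
    bands = []
    for layer_vals in zip(*trials):
        vals = sorted(layer_vals)
        bands.append(
            (percentile(vals, 16), percentile(vals, 50), percentile(vals, 84))
        )
    return bands


# Three toy trials with two infection layers each.
bands = layer_bands([[1, 0], [1, 10], [1, 20]])
```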

Fig 6.

Comparison of infections in base and cloned networks.

(A) For a set of source posts sampled across all users in our base network, we plot the infection rates obtained by simulating these events within our ABM versus the infection rates measured in the base network (Pearson correlation in log-space equal to 0.81, p < 0.01). Infection rate, calculated as the number of infections divided by the number of source author followers, is used to provide a consistent scale across the observations. (B) A similar plot to (A), except all events are sampled from u_S (Pearson correlation in log-space equal to 0.08, p = 0.075). Since all author-level features are fixed for these events, the visualization conveys the extent to which the ABM can anticipate variations in virality arising solely from post text. In both plots, the blue solid line represents a linear fit to the data, with the bands denoting the 95% confidence intervals of the fit.
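
The two quantities named in this caption, infection rate and Pearson correlation in log-space, can be computed as in the sketch below. The function names are hypothetical, base-10 logarithms are an arbitrary choice (the correlation does not depend on the base), and dropping events with a zero rate before taking logs is an assumption; the caption does not say how such events are handled.

```python
import math


def infection_rate(n_infections, n_followers):
    """Infections divided by the number of source author followers."""
    return n_infections / n_followers


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def log_space_pearson(sim_rates, obs_rates):
    """Pearson correlation of log10 rates.

    Pairs with a zero rate are dropped (assumption), since log(0) is undefined.
    """
    pairs = [(s, o) for s, o in zip(sim_rates, obs_rates) if s > 0 and o > 0]
    return pearson(
        [math.log10(s) for s, _ in pairs],
        [math.log10(o) for _, o in pairs],
    )


# Toy simulated vs. observed infection rates.
sim = [0.001, 0.01, 0.1, 0.0]
obs = [0.002, 0.02, 0.2, 0.05]
r = log_space_pearson(sim, obs)
```

Working in log-space keeps a handful of highly viral events from dominating the correlation, which is useful when rates span several orders of magnitude.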

Fig 7.

ABM infections across communities.

(A) A comparison of the distribution of infection rates across communities for T_S between our base network and a simulation of the event with our ABM. (B) A heatmap presenting the community-to-community infection rates recorded when simulating T_S through our ABM, with each grid block representing the fraction of total infections originating from the associated infection pathway.
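
The heatmap values in panel (B) amount to counting infections by (infector community, infected community) pair and dividing by the total. The sketch below shows that bookkeeping; `pathway_fractions`, the `infections` list of user pairs, and the `community_of` lookup are hypothetical names introduced for illustration.

```python
from collections import Counter


def pathway_fractions(infections, community_of):
    """Fraction of total infections along each community-to-community pathway.

    infections:   list of (infector, infected) user pairs from one simulation.
    community_of: dict mapping each user to a community label.
    """
    counts = Counter(
        (community_of[src], community_of[dst]) for src, dst in infections
    )
    total = sum(counts.values())
    return {path: c / total for path, c in counts.items()}


# Toy example with two communities, X and Y.
community_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
infections = [("a", "b"), ("a", "c"), ("c", "d"), ("b", "c")]
fractions = pathway_fractions(infections, community_of)
```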

Fig 8.

Countermeasure evaluation and ABM topical sensitivity.

(A) Results for a set of simulations of T_S where we block a variable number of influential users (x-axis) and measure the corresponding effect on the total number of infections within the cloned network (y-axis). We run a base simulation of T_S to identify the users that generated the most infections. We then run additional simulations while blocking the top X most influential accounts, where X varies over a range of 0–1000. When a user is blocked in the ABM, they cannot infect other users. (B) We simulate an inoculation campaign within our ABM by running a set of simulations where a variable fraction of users within a community (x-axis) has their output infection probabilities decreased by ~20%. These simulations mimic the effect of inoculation campaigns that reduce the likelihood users will pass on misinformation. As can be seen in the plot, as the inoculation fraction increases, the total number of infections recorded within the cloned network (y-axis) decreases. The community chosen for inoculation here is the COVID-Vaccines community that generated the most infections within base simulations of T_S. (C) We seed our ABM with a set of posts on different common misinformation topics, as well as a baseline post on cooking. We notice large variations in the output infection numbers, indicating information spread within our cloned network is sensitive to topic of discussion. In all three plots, infection numbers are presented on a normalized [0,1] scale.
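
The blocking experiment in panel (A) can be sketched as a sweep: run a base simulation, rank users by the infections they caused, then rerun with the top X users blocked. This is a minimal sketch under stated assumptions rather than the authors' procedure. `run_cascade` and `blocking_sweep` are hypothetical names, a single constant reshare probability `p` replaces the infection model, the source user is never blocked, and totals are normalized by dividing by the largest total.

```python
import random
from collections import Counter


def run_cascade(followers, source, p, rng, blocked=frozenset()):
    """Return a Counter: infector -> number of users they infected."""
    infected, frontier, credit = {source}, [source], Counter()
    while frontier:
        nxt = []
        for user in frontier:
            if user in blocked:
                continue  # blocked users cannot infect other users
            for f in followers.get(user, []):
                if f not in infected and rng.random() < p:
                    infected.add(f)
                    credit[user] += 1
                    nxt.append(f)
        frontier = nxt
    return credit


def blocking_sweep(followers, source, p, x_values, seed=0):
    """Normalized infection totals when blocking the top-X infectors."""
    base = run_cascade(followers, source, p, random.Random(seed))
    # Rank by infections caused in the base run; keep the source unblocked
    # (assumption: blocking the source would trivially stop the event).
    ranked = [u for u, _ in base.most_common() if u != source]
    totals = []
    for x in x_values:
        blocked = frozenset(ranked[:x])
        credit = run_cascade(followers, source, p, random.Random(seed), blocked)
        totals.append(sum(credit.values()))
    top = max(totals) or 1  # avoid dividing by zero if nothing spreads
    return [t / top for t in totals]


# Toy follower graph (hypothetical users); p=1.0 makes the run deterministic.
followers = {"s": ["a", "b"], "a": ["c", "d", "e"], "b": ["f"]}
curve = blocking_sweep(followers, "s", p=1.0, x_values=[0, 1, 2])
```

An inoculation variant in the spirit of panel (B) would, instead of blocking, scale down the reshare probability for a chosen fraction of users in one community and rerun the same sweep.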
