Fig 1.
Diagram illustrating the data processing approach to extract corresponding user and tweet pairs for training the contrastive model.
The numbers of tweets (speech bubble icons) and users (network icons) remaining after each processing step are specified.
Table 1.
Identified set of controversial topics and viewpoints for evaluation.
The number of tweets for each topic is in parentheses. VP = Viewpoint, S = Support, O = Oppose, B = Believe, R = Refute.
Table 2.
Social network properties for the giant connected component in each of the three retweet networks in this study.
The Engaged Users property denotes the number of users who participated in both retweeting and generating original posts in Avax.
Fig 2.
UMAP latent space visualization of two baseline embedding methods (a and b) and a concatenated embedding combining DistilRoBERTa with node2vec embeddings (c) for a selected set of topics with manually annotated binary viewpoints.
(a) DistilRoBERTa. (b) CT-BERT. (c) DistilRoBERTa + n2v.
Table 3.
Correlation scores between shortest path length (social proximity) and tweet similarity (content locality) across the retweet networks extracted in Avax.
All correlation scores yield p-values below 0.05, indicating statistical significance. The tweet embeddings were extracted using DistilRoBERTa.
Fig 3.
ViewpointNN vs. TopicNN performance (k = 5) for the contrastive models and the two baselines.
We include performance for the following: three models trained on the active retweet network with different cut-off values of shortest path length (4, 3, and 2), four models trained with different pre-trained embeddings on the original network, and the best-performing model trained on the strong network. FT refers to fully fine-tuned models, and TA-FT refers to topic-aware models.
Table 4.
The number of sample pairs, tweets and users for the training, validation, and test data splits per dataset.
Table 5.
Model performance over the three types of retweet networks.
We report performance on the ViewpointNN and TopicNN metrics with k = 5, as well as their average (NN Mean) and the Pearson correlation metric. The baseline performance for the ViewpointNN and TopicNN metrics is the same across different retweet networks, as the baselines do not leverage the social network information. The correlation metric is computed on the corresponding test split for each network type. FT refers to fine-tuned contrastive models, and TA-FT refers to topic-aware contrastive models. The best results for each metric are shown in bold.
Fig 4.
Model performance in ViewpointNN and TopicNN at different values of k for topic-aware contrastive models trained with CT-BERT embeddings.
TA-FT refers to models leveraging both fine-tuned embeddings and static topic-aware embeddings. Models shown here are trained using the original, active, and strong retweet networks. (a) ViewpointNN. (b) TopicNN.
Fig 5.
ViewpointNN performance with k = 15 of the best contrastive model (TA-FT Strong) and baselines across the different topics in the annotated set of tweets.
Fig 6.
(Left) UMAP visualizations of the embeddings from the best performing topic-aware fine-tuned model trained with the strong retweet network.
(Right) UMAP visualization of the annotated set of tweets with their corresponding node2vec embeddings. (a) TA-FT CT-BERT Strong. (b) Node2Vec.
Fig 7.
UMAP visualizations of baseline embeddings and the embeddings from the best performing topic-aware fine-tuned model across different annotated topics.
The TA-FT CT-BERT model shown uses the strong network for training. (a) Base DistilRoBERTa (Depopulation Agenda). (b) Base CT-BERT (Depopulation Agenda). (c) TA-FT CT-BERT (Depopulation Agenda). (d) Base DistilRoBERTa (Miscarriage Agenda). (e) Base CT-BERT (Miscarriage Agenda). (f) TA-FT CT-BERT (Miscarriage Agenda). (g) Base DistilRoBERTa (Vaccines). (h) Base CT-BERT (Vaccines). (i) TA-FT CT-BERT (Vaccines).
Fig 8.
Stacked bar plot showing the distribution of topic categories across BERTopic-identified topics.
The y-axis displays the BERTopic topics, while the x-axis represents the number of tweets belonging to the 12 annotated categories, which are listed in the legend. Viewpoints are shaded blue or red based on the two groups of viewpoints more likely to be associated according to the social network polarization. (a) Baseline DistilRoBERTa. (b) Baseline CT-BERT. (c) TA-FT CT-BERT Strong.