Fig 1.
Diagram illustrating the data processing approach to extract corresponding user and tweet pairs for training the contrastive model.
The numbers of tweets (speech bubble icons) and users (network icons) remaining after each processing step are specified.
Table 1.
Identified set of controversial topics and viewpoints for evaluation.
The number of tweets for each topic is in parentheses. VP = Viewpoint, S = Support, O = Oppose, B = Believe, R = Refute.
Table 2.
Social network properties for the giant connected component in each of the three retweet networks in this study.
The Engaged Users property denotes the number of users who participated in both retweeting and generating original posts in Avax.
Fig 2.
UMAP latent space visualization of two baseline embedding methods (a and b) and a concatenated embedding combining DistilRoBERTa with node2vec embeddings (c) for a selected set of topics with manually annotated binary viewpoints.
(a) DistilRoBERTa. (b) CT-BERT. (c) DistilRoBERTa + n2v.
Table 3.
Correlation scores between shortest path length (social proximity) and tweet similarity (content locality) across the retweet networks extracted in Avax.
All correlation scores yield p-values below 0.05, indicating statistical significance. The tweet embeddings were extracted using DistilRoBERTa.
Fig 3.
ViewpointNN vs. TopicNN performance (k = 5) for the contrastive models and the two baselines.
We include performance for the following: three models trained on the active retweet network with different cut-off values of shortest path length (4, 3, and 2), four models trained with different pre-trained embeddings on the original network, and the best-performing model trained on the strong network. FT refers to fully fine-tuned models, and TA-FT refers to topic-aware models.
Table 4.
The number of sample pairs, tweets and users for the training, validation, and test data splits per dataset.
Table 5.
Model performance over the three types of retweet networks.
We report performance on the ViewpointNN and TopicNN metrics with k = 5, as well as their average (NN Mean) and the Pearson correlation metric. The baseline performance for the ViewpointNN and TopicNN metrics is the same across different retweet networks, as the baselines do not leverage the social network information. The correlation metric is computed on the corresponding test split for each network type. FT refers to fine-tuned contrastive models, and TA-FT refers to topic-aware contrastive models. The best results for each metric are shown in bold.
Fig 4.
Model performance in ViewpointNN and TopicNN at different values of k for topic-aware contrastive models trained with CT-BERT embeddings.
TA-FT refers to models leveraging both fine-tuned embeddings and static topic-aware embeddings. Models shown here are trained using the original, active, and strong retweet networks. (a) ViewpointNN. (b) TopicNN.
Fig 5.
ViewpointNN performance with k = 15 of the best contrastive model (TA-FT Strong) and baselines across the different topics in the annotated set of tweets.
Fig 6.
(Left) UMAP visualizations of the embeddings from the best performing topic-aware fine-tuned model trained with the strong retweet network.
(Right) UMAP visualization of the annotated set of tweets with their corresponding node2vec embeddings. (a) TA-FT CT-BERT Strong. (b) Node2Vec.
Fig 7.
UMAP visualizations of baseline embeddings and the embeddings from the best performing topic-aware fine-tuned model across different annotated topics.
The TA-FT CT-BERT model shown uses the strong network for training. (a) Base DistilRoBERTa (Depopulation Agenda). (b) Base CT-BERT (Depopulation Agenda). (c) TA-FT CT-BERT (Depopulation Agenda). (d) Base DistilRoBERTa (Miscarriage Agenda). (e) Base CT-BERT (Miscarriage Agenda). (f) TA-FT CT-BERT (Miscarriage Agenda). (g) Base DistilRoBERTa (Vaccines). (h) Base CT-BERT (Vaccines). (i) TA-FT CT-BERT (Vaccines).
Fig 8.
Stacked bar plot showing the distribution of topic categories across BERTopic-identified topics.
The y-axis displays the BERTopic topics, while the x-axis represents the number of tweets belonging to the 12 annotated categories, which are listed in the legend. Viewpoints are shaded blue or red based on the two groups of viewpoints more likely to be associated according to the social network polarization. (a) Baseline DistilRoBERTa. (b) Baseline CT-BERT. (c) TA-FT CT-BERT Strong.