Emotional Dynamics in the Age of Misinformation

According to the World Economic Forum, the diffusion of unsubstantiated rumors on online social media is one of the main threats to our society. The disintermediated paradigm of content production and consumption on online social media might foster the formation of homogeneous communities (echo-chambers) around specific worldviews. Such a scenario has been shown to be a fertile environment for the diffusion of false claims. Not rarely, viral phenomena trigger naive (and funny) social responses, e.g., the recent case of Jade Helm 15, where a simple military exercise came to be perceived as the beginning of a civil war in the US. In this work, we address the emotional dynamics of collective debates around distinct kinds of information (i.e., science and conspiracy news), both inside and across their respective polarized communities. We find that for both kinds of content, the longer the discussion, the more negative the sentiment. We show that comments on conspiracy posts tend to be more negative than those on science posts. However, the more engaged users are, the more negative their comments tend to be (on both science and conspiracy). Finally, zooming in on the interaction among polarized communities, we find a general negative pattern: as the number of comments increases (i.e., the discussion becomes longer), the sentiment of the post becomes more and more negative.

Unfortunately, despite the enthusiastic rhetoric about collective intelligence [22][23][24], the direct and undifferentiated access to the knowledge production process is causing opposite effects, e.g., the recent case of Jade Helm 15 [25], where a simple military exercise came to be perceived as the beginning of a civil war in the US. Unsubstantiated rumors often jump [...] often disagreements between humans, and even individuals are not consistent with themselves.
In this study, as is common in the sentiment analysis literature [39], we have approximated the sentiment with an ordinal scale of three values: negative (−), neutral (0), and positive (+). Even with this rough approximation, and despite disagreements on single cases, it turns out that on a large scale, when one deals with thousands of sentiment assignments, the aggregated sentiment converges to stable values [40].
Our approach to automatic sentiment classification of texts is based on supervised machine learning. There are four steps: (i) a sample of texts is manually annotated with sentiment, (ii) the labeled set is used to train and tune a classifier, (iii) the classifier is evaluated on an independent test set or by cross-validation, and (iv) the classifier is applied to the whole set of texts.
We have collected over one million Facebook comments. About 20k were randomly selected for manual annotation. We have engaged 22 native Italian speakers, active on Facebook, to manually annotate the comments by sentiment. The annotation was supported by the web-based platform Goldfinch (provided by Sowa Labs, http://www.sowalabs.com) and was completed in two months. About 20% of the comments were intentionally duplicated, in order to measure the mutual (dis)agreement of human annotators.
There are several measures to evaluate the inter-annotator agreement and the performance of classification models. We argue that the inter-annotator agreement provides an upper bound on what the best classification model can achieve. In practice, however, different learning algorithms have various limitations, and, most importantly, only a limited amount of training data is available. In order to compare the classifier performance to the inter-annotator agreement, we have selected three measures which are applied to evaluate both performance and agreement: Accuracy, $F_1$, and Accuracy±1. Exact definitions are in the Methods section; here we just briefly summarize them. Accuracy is the fraction of correctly classified examples for all three sentiment classes; no ordering between the classes is taken into account, and all three are treated equally. $F_1$ is the harmonic mean of precision and recall for a selected class. $F_1(-,+)$ is the average of $F_1$ for the negative and positive class only, ignoring the neutral class. It is a standard measure of performance for sentiment classification [41]. The idea is that the misclassification of neutral sentiment can be ignored as it is less important than the extremes, i.e., negative or positive sentiment (however, it still affects their precision and recall). Accuracy±1 (an abbreviation for Accuracy within 1) completely ignores the neutral class. It counts as errors just the negative sentiment examples predicted as positive, and vice versa. It takes into account the fact that the neutral class is between the negative and the positive, and tolerates misclassifications within neighbouring classes. Table 1 gives the evaluation results. In the case of the inter-annotator agreement, 3,262 examples were labeled twice by two different annotators, and the measures assess their agreement. In the case of the sentiment classifier evaluation, we applied 10-fold cross-validation.
The results in Table 1 are the average of 10 classifiers, with 95% confidence interval. One can see that the average classifier has reached a performance close to human agreement. In terms of extreme errors, i.e., 1 − Accuracy±1, the performance of the classifier is as good as the agreement between the annotators. However, in terms of Accuracy and $F_1$, there is still some room for improvement. We speculate that the main reason for the gap is the relatively low number of annotated examples. Based on our experience in training SVM classifiers in other domains (such as stock market, elections, generic tweets, etc.), we estimate that about 50,000 to 100,000 training examples are needed to reach the level of the inter-annotator agreement. Fig 1 gives the distribution of sentiment values after applying the classification model to the entire set of over one million comments. We assume that the sentiment values are ordered, and that the difference from the neutral value to both extremes, negative and positive, is the same. Thus one can map the sentiment values from the ordinal scale to the real-valued interval [−1, +1]. The mean sentiment over the entire set is −0.34, prevailingly negative.

Sentiment on science and conspiracy posts
The sentiment analysis and classification task allowed us to associate each comment in our dataset with a sentiment value, i.e., −1 if negative, 0 if neutral, and 1 if positive. Taking all the comments on science and conspiracy posts, we can simply divide them into negative, neutral, and positive (Fig 2, left) and analyze their proportions. We find that 70% of the comments on science pages are neutral or positive, compared with 51% on conspiracy pages. Moreover, positive comments are twice as frequent on science pages (20%) as on conspiracy pages (10%).
To measure the effect induced on users by a post, we compute the average sentiment of all its comments. We grouped post sentiment by defining three thresholds in order to divide the space equally; in particular, we say a post is negative if the average sentiment ∈ [−1, [...] When focusing on users, the approach is analogous. We define the sentiment of a user as the mean of the sentiment of all her comments. The mean sentiment for each user is then classified as negative, neutral, or positive by means of the same thresholds used for posts. Fig 2 (right) shows the aggregated sentiment for both science and conspiracy users. We find that the sentiment of users commenting on conspiracy pages is mainly negative (55%), while the sentiment of a small fraction of users (10%) is positive. On the contrary, the sentiment of users commenting on science pages is predominantly neutral (45%), and negative for only 29% of users. Almost the same percentage (26%) is represented by positive sentiment.
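As a minimal sketch of this aggregation, the snippet below averages comment sentiments for a post (or a user) and bins the mean into three classes. The exact cut-offs are not recoverable from the text above, so the symmetric value `cutoff = 1/3` (an equal split of [−1, +1]) is an assumption made only for illustration, as are the function name and the toy data.

```python
from statistics import mean


def classify_mean_sentiment(comment_sentiments, cutoff=1 / 3):
    """Average a list of sentiments in {-1, 0, +1} and bin the mean.

    ASSUMPTION: the cut-offs -cutoff and +cutoff are illustrative; the
    thresholds actually used in the study are not given in this text.
    """
    m = mean(comment_sentiments)
    if m < -cutoff:
        label = "negative"
    elif m > cutoff:
        label = "positive"
    else:
        label = "neutral"
    return m, label


# Toy data (invented): sentiments of the comments on one post.
post_mean, post_label = classify_mean_sentiment([-1, -1, 0, -1, 1])
# The same function applies to all comments written by one user.
user_mean, user_label = classify_mean_sentiment([1, 1, 0])
```

The same function serves posts and users because the text states that user sentiment is binned with the thresholds used for posts.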

Sentiment and virality
Now we focus on the interplay between the virality of a post and the sentiment it generates. In particular, we want to understand how the sentiment varies for increasing levels of comments, likes, and shares. Notice that each of these actions has a particular meaning [42][43][44]. A like stands for positive feedback on the post; a share expresses the will to increase the visibility of a given piece of information; and a comment is the way in which online collective debates take form around the topic promoted by posts. Comments may contain negative or positive feedback with respect to the post. Fig 3 shows the aggregated sentiment of a post as a function of its number of comments (top), likes (center), and shares (bottom), both for science (left) and conspiracy (right) posts. The sentiment has been regressed w.r.t. the logarithm of the number of comments (resp., likes, shares). We do not show confidence intervals, since they are defined (C.I. 95%) as $\bar{X} \pm \mathrm{S.E.} = \bar{X} \pm 1.96\,\frac{s}{\sqrt{n}}$, and when $n \to \infty$, $\mathrm{S.E.} \to 0$. We notice that the sentiment decreases both for science and conspiracy when the number of comments of the post increases. However, we also note that it becomes more positive for science posts when the number of likes and shares increases, differently from conspiracy posts.
To assess the direct relationship between the number of comments and the negativity of the sentiment, a randomization test was performed. In particular, we took all the comments of science (resp., conspiracy) posts and randomly reassigned the original sentiments. Then, we regressed the sentiment w.r.t. the number of comments and compared the obtained slope with the one shown in Fig 3 (top). Over 10k randomized tests, the obtained slope was always greater than the original one. More precisely, while the slope for the original comments for Science is equal to −0.051 (resp., −0.070 for Conspiracy), the quantiles of the distribution of the slopes in the randomized test are: $Q_0$ = −0.010, $Q_1$ = −0.002, $Q_2$ = −0.00002, $Q_3$ = 0.002, $Q_4$ = 0.010 (resp., $Q_0$ = −0.004, $Q_1$ = −0.0008, $Q_2$ = −0.000004, $Q_3$ = 0.0008, $Q_4$ = 0.005, for Conspiracy). Therefore, given that the negative relationship between the sentiment and the length of the discussion disappears when the comment sentiments are randomized, we conclude that the length of the discussion is a relevant dimension when considering the negativity of the sentiment.
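The randomization test can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: each post is represented by the list of its comment sentiments, the per-post mean sentiment is regressed on the logarithm of the number of comments by ordinary least squares, and the toy data and function names are invented. The study reports 10k randomized runs; the sketch uses 200 to stay fast.

```python
import math
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def sentiment_slope(posts):
    """Slope of mean post sentiment vs. log(number of comments)."""
    xs = [math.log(len(comments)) for comments in posts]
    ys = [sum(comments) / len(comments) for comments in posts]
    return ols_slope(xs, ys)


def randomized_slopes(posts, n_runs, rng):
    """Shuffle sentiments across all comments, keeping post sizes fixed."""
    pooled = [s for comments in posts for s in comments]
    sizes = [len(comments) for comments in posts]
    slopes = []
    for _ in range(n_runs):
        rng.shuffle(pooled)
        reshuffled, start = [], 0
        for size in sizes:
            reshuffled.append(pooled[start:start + size])
            start += size
        slopes.append(sentiment_slope(reshuffled))
    return slopes


# Toy data (invented): short threads are positive, long threads negative.
toy_posts = [[1, 1] for _ in range(20)] + [[-1] * 50 for _ in range(20)]
original_slope = sentiment_slope(toy_posts)
null_slopes = randomized_slopes(toy_posts, n_runs=200, rng=random.Random(0))
always_greater = all(s > original_slope for s in null_slopes)
```

If the observed slope lies below every slope obtained after shuffling, the negative trend cannot be attributed to the overall mix of sentiments alone.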
Summarizing, we found that comments, posts, and users of conspiracy pages tend to be much more negative than those of science pages. Interestingly, the sentiment becomes more and more negative when the number of comments of the post increases (i.e., the discussion becomes longer), both on science and conspiracy pages. However, differently from conspiracy posts, when the number of likes and shares increases, the aggregated sentiment of science posts becomes more and more positive.

Sentiment and user activity
In this section we aim to understand in more depth how the sentiment changes with respect to users' engagement in one of the two communities. Previous works [17,19,20] showed that the distribution of user activity across the different contents is highly polarized. Therefore, we now focus on the sentiment of polarized users. More precisely, we say a user is polarized on science (respectively, on conspiracy) if she left more than 95% of her likes on science (respectively, on conspiracy) posts (for further details about the effect of the thresholding, refer to the Methods section).
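The polarization criterion can be sketched as a simple threshold on the share of likes. The function name and the toy counts are assumptions for illustration; the 95% threshold and the strict inequality ("more than 95%") follow the text.

```python
def polarization_label(likes_science, likes_conspiracy, theta=0.95):
    """Return 'science', 'conspiracy', or None for a non-polarized user.

    A user is polarized on a category if strictly more than a fraction
    theta of all her likes fall on posts of that category.
    """
    total = likes_science + likes_conspiracy
    if total == 0:
        return None  # no like activity, so no label can be assigned
    if likes_science / total > theta:
        return "science"
    if likes_conspiracy / total > theta:
        return "conspiracy"
    return None


# Toy counts (invented) of likes on science and conspiracy posts.
labels = [
    polarization_label(97, 3),
    polarization_label(1, 99),
    polarization_label(60, 40),
]
```

With theta above 0.5, the two conditions are mutually exclusive, so the order of the checks does not matter.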
Therefore, we take all polarized users having commented at least twice, i.e., 14,887 out of 33,268 users polarized on science and 67,271 out of 135,427 users polarized on conspiracy. Fig 4 shows the Probability Density Function (PDF) of the mean sentiment of polarized users with at least two comments. In Table 2 we compare the mean sentiment of all users and of polarized users having commented at least twice. Our results show that the overall negativity increases w.r.t. all users; this feature is more evident on the conspiracy side.
We now want to investigate how the mean sentiment of a user changes with respect to her commenting activity, i.e., when her total number of comments increases. In Fig 5 we show the mean sentiment of polarized users as a function of their number of comments. The more active a polarized user is, the more she tends toward negative values both on science and conspiracy posts. The sentiment has been regressed w.r.t. the logarithm of the number of comments. Interestingly, the sentiment of science users decreases faster than that of conspiracy users. We performed a randomization test taking all comments on both categories and then randomly reassigning the original sentiments. Then, we regressed the sentiment w.r.t. the number of comments and compared the obtained slope with the one shown in Fig 5. The obtained slope over 10k randomized tests was always greater than the original one. In particular, while the slope for the original comments for Science is equal to −0.070 (resp., −0.037 for Conspiracy), the quantiles of the distribution of the slopes in the randomized test are: $Q_0$ = −0.006, $Q_1$ = −0.001, $Q_2$ = 0.00001, $Q_3$ = 0.001, $Q_4$ = 0.006 (resp., $Q_0$ = −0.003, $Q_1$ = −0.0005, $Q_2$ = 0.00001, $Q_3$ = 0.0005, $Q_4$ = 0.003, for Conspiracy). Therefore, user activity is a relevant dimension when considering the value of the sentiment, which becomes more and more negative in both categories as user activity increases.

Interaction across communities
In this section we investigate the sentiment when usual consumers of science and conspiracy news meet. To do this, we pick all posts representing the arena where the debate between science and conspiracy users takes place. In particular, we select all posts commented on at least once by both a user polarized on science and a user polarized on conspiracy. We find 7,751 such posts (out of 315,567), reinforcing the fact that the two communities of users are strictly separated and do not often interact with one another. In Fig 6 we show the proportions of negative, neutral, and positive comments (left) and posts (right). The aggregated sentiment of such posts is slightly more negative (60%) than for general posts (54% for conspiracy, 27% for science, see Fig 2). When focusing on comments, we have similar percentages of neutral (42%) and negative (48%) comments, while a small part (10%) is represented by positive comments. We want to understand whether the sentiment correlates with the length of the discussion. Hence, we analyze how the sentiment changes when the number of comments of the post increases, as we previously did for general posts (Fig 3). Also in this case we performed a randomization test taking all the comments and randomly reassigning the original sentiments. Then, we regressed the sentiment w.r.t. the number of comments and compared the obtained slope with the one shown in Fig 7. Over 10k randomized tests, the obtained slope was always greater than the original one. In particular, while the slope for the original comments is equal to −0.048, the quantiles of the distribution of the slopes in the randomized test are: $Q_0$ = −0.009, $Q_1$ = −0.002, $Q_2$ = 0.00004, $Q_3$ = 0.002, $Q_4$ = 0.009. Therefore, we conclude that the length of the discussion does affect the negativity of the sentiment.
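The selection of posts where the two communities meet amounts to a set intersection. A minimal sketch, assuming a mapping from post identifiers to the set of users who commented on each post (the data layout, identifiers, and function name are hypothetical):

```python
def cross_community_posts(commenters_by_post, science_users, conspiracy_users):
    """Posts commented on by at least one user from each polarized group."""
    return [
        post_id
        for post_id, commenters in commenters_by_post.items()
        if commenters & science_users and commenters & conspiracy_users
    ]


# Toy data (invented): post id -> set of commenting user ids.
commenters_by_post = {
    "p1": {"anna", "luca"},
    "p2": {"anna", "marco"},
    "p3": {"luca", "sara", "gino"},
}
science_users = {"anna"}
conspiracy_users = {"luca", "gino"}

arena = cross_community_posts(commenters_by_post, science_users, conspiracy_users)
```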

Conclusions
In this work we analyzed the emotional dynamics on pages of opposite worldviews, science and conspiracy. Previous works [17,19,20] showed that users are strongly polarized towards the two narratives. Moreover, we found that users of both categories seem not to distinguish between verified contents and unintentional false claims. In this manuscript we focused on the emotional behavior of the same users on Facebook. In general, we noticed that the sentiment on conspiracy pages tends to be more negative than that on science pages. In addition, by focusing on polarized users, we identified an overall increase in the negativity of the sentiment. In particular, the more active polarized users are, the more they tend to be negative, both on science and conspiracy. Furthermore, the sentiment of polarized users is negative also when they interact with one another. Also in this case, as the number of comments increases (i.e., the discussion becomes longer), the sentiment of the post becomes more and more negative. This work provides important insights into the emotional dynamics in a disintermediated environment. Indeed, recent studies [32,33] pointed out that reading the comments of other users may affect the discussion. Our findings confirm such a phenomenon and make explicit that the longer the discussion, the more negative the sentiment. In particular, discussions around conspiracy news degenerate faster than those around science news. This latter point raises interesting questions about the quasi-religious mentality of conspiracists [45] and the way in which such an echo-chamber digests and debates news and events.

Ethics statement
The entire data collection process has been carried out exclusively through the Facebook Graph API [46], which is publicly available, and for the analysis (according to the specification settings of the API) we used only publicly available data (users with privacy restrictions are not included in the dataset). The pages from which we downloaded data are public Facebook entities (they can be accessed by anyone). User content contributing to such pages is also public unless the user's privacy settings specify otherwise, and in that case it is not available to us.

Data collection
We identified two main categories of pages: conspiracy news (i.e., pages promoting contents neglected by mainstream media) and science news. The first category includes all pages diffusing conspiracy information, that is, pages which disseminate controversial information, most often lacking supporting evidence and sometimes contradictory to the official news (i.e., conspiracy theories). The second category is that of scientific dissemination, including scientific institutions and scientific press whose main mission is to diffuse scientific knowledge. Note that we do not focus on the truth value of information but rather on the possibility of verifying the content of the page. While the latter is an easy task for scientific news (e.g., by identifying the authors of the study or whether the paper passed a peer review process), it usually becomes more difficult for conspiracy-like information, if not unfeasible. We defined the space of our investigation with the help of Facebook groups very active in debunking conspiracy theses (Protesi di Complotto, Che vuol dire reale, La menzogna diventa verità e passa alla storia). We categorized pages according to their contents and their self-description. The resulting dataset, downloaded over a timespan of four years (2010 to 2014), is composed of 73 public Italian Facebook pages and is the same one used in [19] and [20]. To the best of our knowledge, the final dataset is the complete set of all scientific and conspiracy information sources active in the Italian Facebook scenario. Table 3 summarizes the details of our data collection.

Classification and annotator agreement measures
Our approach to sentiment classification of texts is based on supervised machine learning, where a sample of texts is first manually annotated with sentiment and then used to train and evaluate a classifier. The classifier is then applied to the whole corpus. The measures to assess the agreement between annotators and the quality of the classifier are based on coincidence and confusion matrices, respectively.
Annotators were asked to label each text with negative ≺ neutral ≺ positive sentiment. When two annotators are given the same text, they can either agree (both give the same label) or disagree (they give different labels). The annotators can disagree in two ways: one label is neutral while the other is extreme (negative or positive), or both are extreme, one negative and one positive; we call the latter severe disagreement. A convenient way to represent the overall (dis)agreement between the annotators is a coincidence matrix, where each text that is annotated twice appears in the table twice. Table 4 gives a generic 3 × 3 annotator agreement table, while the actual data are in Tables 5 and 6. All agreements are on the diagonal of the table. As the labels are ordered (negative ≺ neutral ≺ positive), the further the cell from the diagonal, the more severe the error. From such a table one can calculate the annotator agreement (the sum of the main diagonal divided by the number of all the elements in the table) and the severe disagreement (the sum of the top right and bottom left corners divided by the number of all the elements in the table).
To compare the predictions of a classifier to a gold standard (manually annotated data, in our case), a confusion matrix is used. Table 4 also represents a generic 3 × 3 confusion matrix for the (ordered) sentiment classification case. Each element ⟨x, y⟩ represents the number of examples from the actual class x, predicted as class y. All agreements/correct predictions are on the diagonal of the table. In the ordinal classification case, the further the cell from the diagonal, the more severe the error.
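A minimal sketch of how a coincidence matrix and the two agreement figures can be computed from doubly annotated texts. The label encoding (−1, 0, +1), the function names, and the toy label pairs are assumptions for illustration; the definitions of agreement and severe disagreement follow the text.

```python
LABELS = (-1, 0, 1)  # negative, neutral, positive


def coincidence_matrix(label_pairs):
    """Build a symmetric 3 x 3 matrix; each pair is counted in both orders."""
    index = {label: i for i, label in enumerate(LABELS)}
    matrix = [[0] * 3 for _ in range(3)]
    for first, second in label_pairs:
        matrix[index[first]][index[second]] += 1
        matrix[index[second]][index[first]] += 1
    return matrix


def agreement(matrix):
    """Sum of the main diagonal divided by the sum of all cells."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(3)) / total


def severe_disagreement(matrix):
    """Negative-vs-positive cells (the two corners) over all cells."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][2] + matrix[2][0]) / total


# Toy data (invented): labels given to eight texts by two annotators.
pairs = [(-1, -1), (-1, 0), (0, 0), (1, 1), (-1, 1), (0, 1), (1, 1), (-1, -1)]
matrix = coincidence_matrix(pairs)
```

For these eight pairs the matrix holds 16 entries: five agreeing pairs contribute 10 diagonal entries, and the single negative/positive pair contributes the two corner entries.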
Accuracy is the fraction of correctly classified examples, where N is the total number of examples:
$$\mathrm{Accuracy} = \frac{\langle -,- \rangle + \langle 0,0 \rangle + \langle +,+ \rangle}{N}$$
Accuracy within n [47] allows for a wider range of predictions to be considered correct. We use Accuracy within 1 (Accuracy±1), where only misclassifications from negative to positive and vice versa are considered incorrect:
$$\mathrm{Accuracy}{\pm}1(-,+) = 1 - \frac{\langle +,- \rangle + \langle -,+ \rangle}{N}$$
$F_1(+,-)$ is the macro-averaged F-score of the positive and negative classes, a standard evaluation measure [41] used also in the SemEval competition (http://alt.qcri.org/semeval2015/) for sentiment classification tasks:
$$F_1(+,-) = \frac{F_1(+) + F_1(-)}{2}$$
$F_1$ is the harmonic mean of Precision and Recall for each class [48]:
$$F_1(x) = \frac{2 \cdot \mathrm{Precision}_x \cdot \mathrm{Recall}_x}{\mathrm{Precision}_x + \mathrm{Recall}_x}$$
Precision for class x is the fraction of correctly predicted examples out of all the predictions with class x:
$$\mathrm{Precision}_x = \frac{\langle x,x \rangle}{\langle -,x \rangle + \langle 0,x \rangle + \langle +,x \rangle}$$
Recall for class x is the fraction of correctly predicted examples out of all the examples with actual class x:
$$\mathrm{Recall}_x = \frac{\langle x,x \rangle}{\langle x,- \rangle + \langle x,0 \rangle + \langle x,+ \rangle}$$
From the above tables and definitions, one can see that the annotator agreement is equivalent to Accuracy and that severe disagreement is equivalent to 1 − Accuracy±1. $F_1$ has no counterpart among the annotator agreement measures, but is a standard measure in the evaluation of sentiment classifiers. On the other hand, Cohen's kappa [49] is a standard measure of inter-rater agreement, but is rarely used to evaluate classification models. The original Cohen's kappa is applicable to categorical (unordered) classes, and weighted kappa was devised for ordered classes. We use Cohen's weighted kappa [50] to compare the inter-annotator agreement and self-agreement.
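The measures described in this section can be computed directly from a 3 × 3 confusion matrix. The sketch below assumes rows are actual classes and columns are predicted classes, both ordered negative, neutral, positive; the function names and the toy matrix are invented for illustration and are not the data of Tables 8 or 9.

```python
NEG, NEU, POS = 0, 1, 2  # row/column order: negative, neutral, positive


def accuracy(cm):
    """Fraction of examples on the main diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(3)) / total


def accuracy_within_1(cm):
    """Only negative<->positive confusions count as errors."""
    total = sum(sum(row) for row in cm)
    return 1 - (cm[POS][NEG] + cm[NEG][POS]) / total


def precision(cm, k):
    """Correct predictions of class k over all predictions of class k."""
    return cm[k][k] / sum(cm[i][k] for i in range(3))


def recall(cm, k):
    """Correct predictions of class k over all actual examples of class k."""
    return cm[k][k] / sum(cm[k])


def f1(cm, k):
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r)


def f1_pos_neg(cm):
    """Macro-average of F1 over the positive and negative classes only."""
    return (f1(cm, POS) + f1(cm, NEG)) / 2


# Toy confusion matrix (invented); rows = actual, columns = predicted.
cm = [
    [50, 10, 5],
    [10, 30, 10],
    [5, 5, 25],
]
scores = (accuracy(cm), accuracy_within_1(cm), f1_pos_neg(cm))
```

The sketch omits guards against empty rows or columns, which would be needed if a class were never predicted or never present.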

Data annotation
Data annotation is a process in which some predefined labels are assigned to each data point. A subset of 19,642 comments from the Facebook dataset of one million (Table 3) was selected for manual sentiment annotation and later used to train a sentiment classifier. A user-friendly annotation platform for web and mobile devices, Goldfinch (provided by Sowa Labs, http://www.sowalabs.com/), was used. Trustworthy Italian native speakers, active on Facebook, were engaged for the annotations. The annotation task was to label each Facebook comment, isolated from its context, as negative, neutral, or positive. The guideline given to the annotators was to estimate the emotional attitude of the user when posting a comment to Facebook. The exact question an annotator should answer was: 'Is the user happy (pleased, satisfied), or unhappy (angry, sad, frustrated), or neutral?' A dedicated Facebook group was formed to dispatch detailed annotation instructions, to provide a forum for discussion, and to post ongoing annotation results, which stimulated the annotators to contribute. During the annotation process, which lasted for about two months, the annotator performance was monitored in terms of the inter-annotator agreement and self-agreement, based on 20% of the comments which were intentionally duplicated. No compensation, other than gratitude and personal satisfaction for contributing to interesting scientific research, was awarded.
The annotation process resulted in 19,642 sentiment-labeled comments, 3,902 of them annotated twice. Out of the 3,902 duplicates, 3,262 were presented to two different annotators and are used to assess the inter-annotator agreement, and 640 were presented twice to the same annotator and are used to assess the annotators' self-agreement. The coincidence matrices with the inter-annotator agreement and self-agreement are in Tables 5 and 6, respectively.
Note that, in a coincidence matrix, each annotated example appears twice (once for each of the two annotators), thus the matrix is symmetric. This is in contrast to a confusion matrix where one knows the ground truth, and the matrix values are the number of examples in the actual and predicted classes.
The four evaluation measures, defined above, were used to quantify the inter-annotator and the annotators' self-agreement. The results are in Table 7.
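Cohen's weighted kappa can be computed from a coincidence matrix as sketched below. The weighting scheme is not stated in the text, so linear disagreement weights |i − j| are an assumption here, as are the function name and the toy matrix.

```python
def weighted_kappa(matrix):
    """Cohen's weighted kappa for ordered classes.

    ASSUMPTION: linear disagreement weights |i - j|; the study may have
    used a different weighting.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j)
            observed += weight * matrix[i][j]
            # Count expected by chance from the marginal totals.
            expected += weight * row_sums[i] * col_sums[j] / total
    return 1.0 - observed / expected


# Toy coincidence matrix (invented), ordered negative, neutral, positive.
toy_matrix = [
    [4, 1, 1],
    [1, 2, 1],
    [1, 1, 4],
]
kappa = weighted_kappa(toy_matrix)
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than expected by chance given the label frequencies.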

Classification
Ordinal classification, also known as ordinal regression, is a form of multi-class classification where there is a natural ordering between the classes, but no meaningful numeric difference between them [47]. We treat sentiment classification as an ordinal regression task with three ordered classes. We apply the wrapper approach, described in [51], with two linear-kernel Support Vector Machine (SVM) [34] classifiers. SVM is a state-of-the-art supervised learning algorithm, well suited for large scale text categorization tasks, and robust on large feature spaces. The two SVM classifiers were trained to distinguish the extreme classes (negative and positive) from the rest (neutral plus positive, and neutral plus negative, respectively). During prediction, if both classifiers agree, they yield the common class, otherwise, if they disagree, the assigned class is neutral.
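The prediction rule for the two binary classifiers can be sketched as follows. This is one reading of the description above, stated as an assumption: the classifiers "agree" when exactly one extreme class is claimed, and any other combination yields neutral. The SVM models themselves are not reproduced; the function takes their binary decisions as inputs, and its name is hypothetical.

```python
def combine_ordinal(says_negative, says_positive):
    """Combine two one-vs-rest decisions into -1, 0, or +1.

    says_negative: decision of the 'negative vs. rest' classifier.
    says_positive: decision of the 'positive vs. rest' classifier.
    ASSUMPTION: conflicting decisions (both extremes claimed, or neither)
    are resolved to the neutral class.
    """
    if says_negative and not says_positive:
        return -1
    if says_positive and not says_negative:
        return 1
    return 0


# All four combinations of the two binary decisions.
decisions = {
    (neg, pos): combine_ordinal(neg, pos)
    for neg in (True, False)
    for pos in (True, False)
}
```

Falling back to neutral keeps a conflicting pair of decisions from ever producing an extreme label.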
The sentiment classifier was trained and tuned on the training set of 15,714 annotated comments. The comments were processed into the standard Bag-of-Words (BoW) representation, with the following settings: the lemmatized BoW includes unigrams and bigrams, the minimum n-gram frequency is five, TF-IDF weighting is applied, no stop words are removed, and vectors are normalized. Additional features and settings were chosen based on the results of 10-fold stratified cross-validation on the training set: normalization of diacritical characters, URL replacement, length of text, presence of upper-cased words, negation (language specific), swearing (language specific), positive words from a predefined dictionary (language specific), unusual punctuation (several exclamation or question marks, etc.), unusually repeated characters, happy or sad emoticons in the text, and their presence at the end of the sentence.
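Some of the language-independent surface features listed above can be sketched with regular expressions. Everything concrete here is an assumption for illustration: the patterns, the emoticon lists, the `URL` placeholder, the feature names, and the choice to check emoticons at the end of the whole text rather than of each sentence. The BoW representation and the language-specific dictionaries are not covered.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MULTI_PUNCT_RE = re.compile(r"[!?]{2,}")      # e.g. "!!!" or "?!"
REPEATED_CHAR_RE = re.compile(r"(\w)\1{2,}")  # same character 3+ times
HAPPY_EMOTICONS = (":)", ":-)", ":D")         # hypothetical short list
SAD_EMOTICONS = (":(", ":-(")                 # hypothetical short list


def surface_features(text):
    """Extract a few hand-crafted features from one comment."""
    has_url = URL_RE.search(text) is not None
    text = URL_RE.sub("URL", text)  # URL replacement
    cores = [token.strip(".,;:!?()") for token in text.split()]
    stripped = text.rstrip()
    return {
        "length": len(text),
        "has_url": has_url,
        "upper_word": any(
            len(c) > 1 and c.isalpha() and c.isupper() and c != "URL"
            for c in cores
        ),
        "unusual_punctuation": MULTI_PUNCT_RE.search(text) is not None,
        "repeated_chars": REPEATED_CHAR_RE.search(text) is not None,
        "happy_emoticon": any(e in text for e in HAPPY_EMOTICONS),
        "sad_emoticon": any(e in text for e in SAD_EMOTICONS),
        "happy_at_end": stripped.endswith(HAPPY_EMOTICONS),
        "sad_at_end": stripped.endswith(SAD_EMOTICONS),
    }


# Toy comments (invented).
angry = surface_features("NON ci credo!!! Noooo :(")
friendly = surface_features("guarda http://example.com ora :)")
```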
The trained sentiment classifier was then evaluated on a disjoint test set of the remaining 3,928 comments. The confusion matrix between the annotators (actual classes) and the classifier (predicted classes) is in Table 8. The sentiment class distribution, after applying the classifier to the whole set of one million Facebook comments, is in Fig 1. Another evaluation was performed by 10-fold cross-validation on the complete set of 19,642 annotated examples. The confusion matrix between the annotators and the 10 classifiers is in Table 9. The evaluation measures averaged over the 10 classifiers, with 95% confidence intervals, are in Table 1.
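Averaging a measure over the cross-validation folds and attaching a 95% confidence interval can be sketched as below. The normal-approximation formula (mean ± 1.96 · s/√n) is an assumption about how the intervals in Table 1 were obtained, and the per-fold values are invented, not results from the study.

```python
import math
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, half-width) of a normal-approximation 95% interval."""
    half_width = z * stdev(values) / math.sqrt(len(values))
    return mean(values), half_width


# Hypothetical per-fold accuracies from a 10-fold cross-validation.
fold_accuracy = [0.60, 0.62, 0.61, 0.63, 0.59, 0.60, 0.62, 0.61, 0.64, 0.58]
avg, half = mean_with_ci(fold_accuracy)
```

With only ten folds, a Student-t quantile in place of 1.96 would give a slightly wider interval.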

Statistical tools
To characterize random variables, a main tool is the probability density function (PDF) $f$, which gives the probability that a random variable $X$ assumes a value in the interval [a, b], i.e., $P(a \le X \le b) = \int_a^b f(x)\,dx$.
Labeling algorithm. The labeling algorithm may be described as a thresholding strategy on the total number of a user's likes. Consider the total number of likes $L_u$ of a user $u$ on posts $P$ in both categories $S$ and $C$. Let $l_s$ and $l_c$ denote the number of likes of user $u$ on $P_s$ and $P_c$, the posts from scientific and conspiracy pages, respectively. Then, the like activity of a user on one category, relative to her total activity, is given by $l_s / L_u$ (respectively, $l_c / L_u$). Fixing a threshold θ, we can discriminate users with enough activity on one category. More precisely, the condition for a user to be labeled as polarized in one category can be described as $\frac{l_s}{L_u} \vee \frac{l_c}{L_u} > \theta$. In Fig 8 we show the number of polarized users as a function of θ. Both curves decrease at a comparable rate. Fig 9 shows the Probability Density Function (PDF) of the mean sentiment of all polarized users (top) and polarized

List of pages
In this section, we provide the full list of Facebook pages of our dataset. Table 10 lists scientific pages, while Table 11 lists conspiracy pages.
[...] help in annotating the dataset for the sentiment classification task. We wish to thank Sašo Rutar for running some of the sentiment classifier evaluation experiments.