Fig 1.
Comparison of the annotators' self-agreement (green), the inter-annotator agreement (blue), and an automated sentiment classifier (TwoPlaneSVMbin, red) in terms of Krippendorff's Alpha.
On the left-hand side are the 13 language datasets, and on the right-hand side are the four application datasets. The datasets are ordered by decreasing self-agreement. The error bars indicate estimated 95% confidence intervals.
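Krippendorff's Alpha, the agreement measure used throughout these figures, can be sketched for nominal labels with a short stdlib-only function. This is a minimal illustration, not the paper's actual implementation (which also supports other difference functions):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.
    `units` is a list of label lists, one per annotated item
    (each inner list holds the labels assigned to that item)."""
    # Build the coincidence matrix from all ordered label pairs within a unit.
    coincidences = Counter()
    totals = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # singly-coded items carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
            totals[a] += 1 / (m - 1)
    n = sum(totals.values())
    # Observed disagreement: mass off the diagonal of the coincidence matrix.
    d_o = sum(v for (a, b), v in coincidences.items() if a != b) / n
    # Expected disagreement under chance, from the marginal label frequencies.
    d_e = sum(totals[a] * totals[b]
              for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1 - d_o / d_e

# Two annotators labeling five tweets with -, 0, +; one item disagrees:
units = [['+', '+'], ['0', '0'], ['-', '0'], ['+', '+'], ['-', '-']]
print(round(krippendorff_alpha_nominal(units), 3))  # -> 0.727
```

Alpha is 1 for perfect agreement and 0 for chance-level agreement, which is what makes it comparable across datasets with different label distributions.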
Fig 2.
The English (left) and Russian (right) datasets.
For English, there is still a gap (Alpha = 0.097) between the classifier (in red) and the inter-annotator agreement (in blue).
Fig 3.
The classifier’s peak performance (in red, Alpha = 0.536) is at 150,000 labeled tweets.
Fig 4.
The Slovenian (left) and Bulgarian (right) datasets.
The Slovenian classifier peak is at 70,000 tweets (Alpha = 0.459). The Bulgarian classifier peak is at 40,000 tweets (Alpha = 0.378).
Fig 5.
The complete dataset (left), and separate datasets labeled by the two main annotators (middle and right).
Fig 6.
Joint Serbian/Croatian/Bosnian dataset.
Performance oscillates, with high variability.
Fig 7.
Separate Serbian (left), Croatian (middle), and Bosnian (right) datasets.
Here, the lower-quality Serbian set has no adverse effect on the higher-quality Croatian and Bosnian sets.
Fig 8.
There are two peaks (at 50,000 tweets, Alpha = 0.394, and at 160,000 tweets, Alpha = 0.391), and a large drop in between, due to a topic shift.
Fig 9.
There is a consistent drop in performance and high variability.
Fig 10.
The sentiment distribution of the environmental tweets in the training and application sets.
Negative tweets are shown in red, neutral in yellow, and positive in green. The grey bar denotes the sentiment score (the mean) of each dataset.
Table 1.
The number and distribution of sentiment-annotated posts, and the time period of the posts.
The top part of the table refers to the 13 language datasets, and the bottom to the four application datasets.
Table 2.
Sentiment distributions of the application datasets as predicted by the sentiment classifiers.
The rightmost column shows the sentiment score (the mean) of the application and training datasets (the latter from Table 1), respectively.
Table 3.
The number of annotators, and the number and fraction of posts annotated twice.
The self-agreement column gives the number of posts annotated twice by the same annotator, and the inter-agreement column gives the number of posts annotated by two different annotators.
Table 4.
The self- and inter-annotator agreement measures.
The 95% confidence intervals for Alpha are computed by bootstrapping. Albanian and Spanish (in bold) have very low agreement values.
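The bootstrapped confidence intervals in Table 4 can be sketched as a percentile bootstrap: resample the annotated items with replacement, recompute the statistic, and take the 2.5th/97.5th percentiles. The demo statistic below is raw percent agreement (a stand-in; Krippendorff's Alpha would plug in the same way), and `bootstrap_ci` is a hypothetical helper, not from the paper:

```python
import random

def bootstrap_ci(units, statistic, n_boot=2000, alpha_level=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an agreement statistic.
    Resamples the annotated items with replacement n_boot times."""
    rng = random.Random(seed)
    stats = sorted(statistic([rng.choice(units) for _ in units])
                   for _ in range(n_boot))
    lo = stats[int((alpha_level / 2) * n_boot)]
    hi = stats[int((1 - alpha_level / 2) * n_boot) - 1]
    return lo, hi

# Demo statistic: percent agreement between the two labels of each item.
def percent_agreement(units):
    return sum(a == b for a, b in units) / len(units)

# 100 doubly-annotated items, 80% of them labeled consistently:
pairs = [('+', '+'), ('0', '0'), ('-', '0'), ('+', '+'), ('-', '-')] * 20
lo, hi = bootstrap_ci(pairs, percent_agreement)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

Because items (not individual labels) are resampled, each bootstrap replicate preserves the pairing between the two annotations of the same post.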
Table 5.
Differences between the three sentiment classes (−,0,+).
The differences are measured in terms of Alpha, for the union of self- and inter-annotator agreements. The second column shows the relative difference between the Alpha_int (interval) and Alpha_nom (nominal) agreement measures. The third and fourth columns show the distances of the negative (−) and positive (+) classes to the neutral class (0), respectively, normalized by the distance between the negative and positive classes. The last row is the average difference, computed without the low-quality Albanian and Spanish datasets and the subsumed Emojis dataset (in bold). Only the numbers in bold do not support the thesis that the sentiment classes are ordered.
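The gap between the two Alpha variants comes down to their difference functions. A minimal sketch, assuming the conventional mapping of (−, 0, +) to the scores (−1, 0, 1):

```python
# Nominal Alpha treats all disagreements equally; interval Alpha squares the
# gap between class scores, so confusing - with + costs four times as much
# as confusing either extreme with the neutral class.
SCORE = {'-': -1.0, '0': 0.0, '+': 1.0}

def delta_nominal(a, b):
    return 0.0 if a == b else 1.0

def delta_interval(a, b):
    return (SCORE[a] - SCORE[b]) ** 2

print(delta_nominal('-', '+'), delta_interval('-', '+'))  # -> 1.0 4.0
print(delta_nominal('-', '0'), delta_interval('-', '0'))  # -> 1.0 1.0
```

If annotators indeed confuse adjacent classes more often than the extremes, Alpha_int exceeds Alpha_nom, which is the ordering evidence the table tests.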
Fig 11.
Comparison of six classification models in terms of Alpha.
Results are the average of 10-fold cross-validations.
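The 10-fold cross-validation behind these averages can be sketched with the standard library. The majority-class baseline and accuracy below are stand-ins for the paper's classifiers and Alpha, used only to keep the sketch self-contained:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle item indices and deal them into k equal-as-possible folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Stand-in for one train/evaluate round: the "model" is a majority-class
# baseline and the score is accuracy on the held-out fold.
def evaluate_fold(train_labels, test_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return sum(y == majority for y in test_labels) / len(test_labels)

labels = ['+'] * 60 + ['0'] * 30 + ['-'] * 10
folds = kfold_indices(len(labels))
scores = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    scores.append(evaluate_fold([labels[j] for j in train_idx],
                                [labels[j] for j in test_idx]))
cv_score = sum(scores) / len(scores)
print(f"10-fold CV score: {cv_score:.2f}")  # -> 0.60
```

Each item is held out exactly once, so the averaged score uses every annotation for evaluation while never testing a model on its own training data.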
Table 6.
Evaluation results of the sentiment classifiers.
The dataset sizes in the second column, used for training the classifiers, are the result of merging tweets annotated more than once (there was no merging for the Facebook(it) and DJIA30 datasets). The 95% confidence intervals for Alpha are estimated from 10-fold cross-validations. The last row is an evaluation of the general English language model (trained from the English dataset in row 3) on the Environment dataset.
Fig 12.
Results of the Friedman-Nemenyi test of classifiers ranking.
The six classifiers are compared in terms of their ranking using two evaluation measures, Alpha (left) and (right). The ranks of classifiers within the critical distance (2.09) are not statistically significantly different.
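The critical distance in Fig 12 follows the standard Nemenyi formula CD = q_a * sqrt(k(k+1)/(6N)), where q_a is the critical value based on the Studentized range (2.850 for alpha = 0.05 and k = 6, per Demšar's tables). Assuming the ranking is computed over the 13 language datasets (N = 13), this reproduces the reported value:

```python
import math

def nemenyi_cd(k, n_datasets, q_alpha=2.850):
    """Nemenyi critical distance: CD = q_a * sqrt(k(k+1) / (6N)).
    q_alpha defaults to the alpha = 0.05 critical value for k = 6 models."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Six classifiers ranked across 13 datasets:
print(round(nemenyi_cd(6, 13), 2))  # -> 2.09
```

Two classifiers whose average ranks differ by less than this distance cannot be declared significantly different at the chosen alpha level.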