
Fig 1.

Comparison of the annotators' self-agreement (green), the inter-annotator agreement (blue), and an automated sentiment classifier (TwoPlaneSVMbin, red) in terms of Krippendorff's Alpha.

On the left-hand side are the 13 language datasets, and on the right-hand side the four application datasets. The datasets are ordered by decreasing self-agreement. The error bars indicate estimated 95% confidence intervals.
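
For reference, Krippendorff's Alpha is defined as 1 − D_o/D_e, the observed disagreement normalized by the disagreement expected by chance. Below is a minimal sketch of the nominal Alpha for posts carrying exactly two annotations; it is illustrative only (not the paper's evaluation code), and the function name and example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(pairs):
    """Krippendorff's Alpha (nominal) for items annotated exactly twice.

    pairs -- list of (label_a, label_b) tuples, one per doubly annotated post.
    """
    values = [v for pair in pairs for v in pair]   # all individual labels
    n = len(values)
    # Observed disagreement: fraction of annotation pairs whose labels differ.
    d_o = sum(a != b for a, b in pairs) / len(pairs)
    # Expected disagreement from the overall label distribution.
    counts = Counter(values)
    d_e = sum(counts[c] * counts[k] for c, k in permutations(counts, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Hypothetical pairs of labels from the set {-1, 0, +1}.
print(nominal_alpha([(-1, -1), (0, 0), (1, 0), (1, 1), (-1, 0), (0, 0)]))
```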


Fig 2.

The English (left) and Russian (right) datasets.

For English, there is still a gap (Alpha = 0.097) between the classifier (in red) and the inter-annotator agreement (in blue).


Fig 3.

The Polish dataset.

The classifier’s peak performance (in red, Alpha = 0.536) is at 150,000 labeled tweets.


Fig 4.

The Slovenian (left) and Bulgarian (right) datasets.

The Slovenian classifier peak is at 70,000 tweets (Alpha = 0.459). The Bulgarian classifier peak is at 40,000 tweets (Alpha = 0.378).


Fig 5.

The German datasets.

The complete dataset (left) and the separate datasets labeled by the two main annotators (middle and right).


Fig 6.

The joint Serbian/Croatian/Bosnian dataset.

Performance oscillates, with high variability.


Fig 7.

Separate Serbian (left), Croatian (middle), and Bosnian (right) datasets.

Here, the lower-quality Serbian set has no adverse effect on the higher-quality Croatian and Bosnian sets.


Fig 8.

The Portuguese dataset.

There are two peaks (at 50,000 tweets, Alpha = 0.394, and at 160,000 tweets, Alpha = 0.391), and a large drop in between, due to a topic shift.


Fig 9.

The Spanish dataset.

There is a consistent drop in performance and high variability.


Fig 10.

The sentiment distribution of the environmental tweets in the training and application sets.

Negative tweets are shown in red, neutral in yellow, and positive in green. The grey bar denotes the sentiment score (the mean) of each dataset.
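
A note on the sentiment score: assuming the usual encoding negative = −1, neutral = 0, positive = +1 (an assumption; the captions do not spell out the encoding), the score is simply the mean label. The helper name below is hypothetical.

```python
def sentiment_score(n_negative, n_neutral, n_positive):
    """Mean sentiment with labels encoded as -1, 0, +1 (encoding assumed)."""
    total = n_negative + n_neutral + n_positive
    return (n_positive - n_negative) / total

# Hypothetical distribution: 20% negative, 50% neutral, 30% positive -> 0.10
print(sentiment_score(200, 500, 300))
```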


Table 1.

The number and distribution of sentiment-annotated posts, and the time period of the posts.

The top part of the table refers to the 13 language datasets, and the bottom to the four application datasets.


Table 2.

Sentiment distributions of the application datasets as predicted by the sentiment classifiers.

The rightmost column shows the sentiment score (the mean) of the application and training datasets (the latter from Table 1), respectively.


Table 3.

The number of annotators, and the number and fraction of posts annotated twice.

The self-agreement column gives the number of posts annotated twice by the same annotator, and the inter-agreement column the number of posts annotated twice by two different annotators.


Table 4.

The self- and inter-annotator agreement measures.

The 95% confidence intervals for Alpha are computed by bootstrapping. Albanian and Spanish (in bold) have very low agreement values.
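
The bootstrap interval can be sketched as follows: resample the doubly annotated posts with replacement, recompute Alpha on each resample, and take the 2.5th and 97.5th percentiles. This is a generic percentile bootstrap, not necessarily the paper's exact procedure; bootstrap_ci and alpha_fn are hypothetical names, and the nominal_alpha sketch above could serve as alpha_fn.

```python
import random

def bootstrap_ci(pairs, alpha_fn, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for an agreement measure alpha_fn(pairs)."""
    rng = random.Random(seed)
    # Resample the annotated pairs with replacement and recompute the statistic.
    stats = sorted(alpha_fn([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot))
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]
```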


Table 5.

Differences between the three sentiment classes (−,0,+).

The differences are measured in terms of Alpha, on the union of the self- and inter-annotator agreements. The second column shows the relative difference between the Alpha_int (interval) and Alpha_nom (nominal) agreement measures. The third and fourth columns show the distances of the negative (−) and positive (+) classes to the neutral class (0), respectively, normalized by the distance between the negative and positive classes. The last row gives the average differences, excluding the low-quality Albanian and Spanish datasets and the subsumed Emojis dataset (in bold). Only the numbers in bold do not support the thesis that the sentiment classes are ordered.
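
The contrast between Alpha_int and Alpha_nom comes down to the difference function used inside Alpha: the nominal metric treats every disagreement equally, while the interval metric penalizes −/+ confusions more than confusions with the neutral class. Below is a sketch of the standard generalized Alpha with a pluggable difference function; it is illustrative only, not the paper's code, and the names and example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations

def general_alpha(pairs, delta):
    """Krippendorff's Alpha with a pluggable, symmetric difference function delta."""
    values = [v for pair in pairs for v in pair]
    n = len(values)
    d_o = sum(delta(a, b) for a, b in pairs) / len(pairs)          # observed disagreement
    counts = Counter(values)
    d_e = sum(counts[c] * counts[k] * delta(c, k)
              for c, k in permutations(counts, 2)) / (n * (n - 1))  # expected disagreement
    return 1.0 - d_o / d_e

# Nominal metric: every disagreement counts the same.
nominal = lambda c, k: 0.0 if c == k else 1.0
# Interval metric on the ordered classes -1, 0, +1: a -/+ confusion
# costs four times as much as a confusion with the neutral class.
interval = lambda c, k: float((c - k) ** 2)

pairs = [(-1, -1), (0, 1), (1, 1), (-1, 1), (0, 0), (1, 0)]   # hypothetical labels
print(general_alpha(pairs, nominal), general_alpha(pairs, interval))
```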


Fig 11.

Comparison of six classification models in terms of Alpha.

Results are averaged over 10-fold cross-validation.
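
A hedged sketch of the evaluation loop: split the annotated posts into 10 folds, train on nine, predict the held-out fold, score it with Alpha, and average. The train_fn/predict_fn arguments are placeholders standing in for the paper's actual models and features, which are not reproduced here; the fold construction is a simple round-robin split, assumed for illustration.

```python
import statistics

def cross_validated_alpha(posts, labels, train_fn, predict_fn, alpha_fn, k=10):
    """Average Alpha over k folds; train_fn/predict_fn are placeholders."""
    folds = [list(range(i, len(posts), k)) for i in range(k)]   # round-robin folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(posts)) if i not in held_out]
        model = train_fn([posts[i] for i in train_idx], [labels[i] for i in train_idx])
        predicted = predict_fn(model, [posts[i] for i in test_idx])
        gold = [labels[i] for i in test_idx]
        scores.append(alpha_fn(list(zip(gold, predicted))))
    return statistics.mean(scores), statistics.stdev(scores)
```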


Table 6.

Evaluation results of the sentiment classifiers.

The dataset sizes in the second column, used for training the classifiers, result from merging tweets that were annotated more than once (there was no merging for the Facebook(it) and DJIA30 datasets). The 95% confidence intervals for Alpha are estimated from 10-fold cross-validations. The last row is an evaluation of the general English language model (trained on the English dataset in row 3) on the Environment dataset.


Fig 12.

Results of the Friedman-Nemenyi test of classifier rankings.

The six classifiers are compared in terms of their rankings under two evaluation measures, Alpha (left) and a second evaluation measure (right). Classifiers whose average ranks differ by less than the critical distance (2.09) are not statistically significantly different.
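
The critical distance quoted here is consistent with the standard Nemenyi formula CD = q_alpha · sqrt(k(k+1)/(6N)) (Demšar, 2006): assuming the six classifiers are compared over the 13 language datasets and taking q_0.05 ≈ 2.850 for k = 6, the formula gives CD ≈ 2.09. The function name below is hypothetical.

```python
import math

def nemenyi_critical_distance(q_alpha, k_classifiers, n_datasets):
    """Critical distance for average ranks in the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k_classifiers * (k_classifiers + 1) / (6.0 * n_datasets))

# q_0.05 for six classifiers is about 2.850 (Demšar, 2006);
# with 13 datasets this gives roughly the 2.09 quoted above.
print(round(nemenyi_critical_distance(2.850, 6, 13), 2))   # -> 2.09
```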
