## Abstract

Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample datasets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.

**Citation:** Mozetič I, Torgo L, Cerqueira V, Smailović J (2018) How to evaluate sentiment classifiers for Twitter time-ordered data? PLoS ONE 13(3): e0194317. https://doi.org/10.1371/journal.pone.0194317

**Editor:** Frank Emmert-Streib, Tampere University of Technology, FINLAND

**Received:** December 15, 2017; **Accepted:** February 28, 2018; **Published:** March 13, 2018

**Copyright:** © 2018 Mozetič et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All Twitter data were collected through the public Twitter API and are subject to the Twitter terms and conditions. The 15 Twitter language datasets are available in three separate files from CLARIN.SI at http://hdl.handle.net/11356/1054. For each language and each labeled tweet, there is the tweet ID (as provided by Twitter), the sentiment label (negative, neutral, or positive), and the annotator ID (anonymized). Note that the Twitter terms do not allow the original tweets to be published openly; they have to be fetched through the Twitter API. Precise details on how to fetch the tweets, given tweet IDs, are provided in the Twitter API documentation at https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-lookup. The TwoPlaneSVMbin classifier and several other machine learning algorithms are implemented in the open source LATINO library, available at https://github.com/latinolib. The R code and data files needed to reproduce all the figures and tables in the paper are available at http://ltorgo.github.io/TwitterDS/. The criteria by which the Twitter data were acquired are described in http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0155036. Non-English tweets were acquired through the Twitter Search API by specifying the geolocations of the largest cities. For English tweets, we used the Twitter Streaming API (a random sample of 1% of all public tweets) and retained only the English posts. There are elements of indeterminism and randomness in this process, so one cannot reconstruct the data by repeating these steps. However, each tweet ever posted is assigned a unique identifier (by Twitter), and a set of tweet IDs precisely identifies the set of tweets. The set of tweet IDs (with our sentiment annotations) that exactly corresponds to the data analyzed in the submitted draft is available here: http://hdl.handle.net/11356/1054.

**Funding:** Igor Mozetič and Jasmina Smailović acknowledge financial support from the H2020 FET project DOLFINS (grant no. 640772), and the Slovenian Research Agency (research core funding no. P2-0103). Luis Torgo and Vitor Cerqueira acknowledge financing by project "Coral - Sustainable Ocean Exploitation: Tools and Sensors/NORTE-01-0145-FEDER-000036," financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Online social media are becoming increasingly important in our society. Platforms such as Twitter and Facebook influence the daily lives of people around the world. Their users create and exchange a wide variety of content on social media, which presents a valuable source of information about public sentiment regarding social, economic or political issues. In this context, it is important to develop automatic methods to retrieve and analyze information from social media.

In this paper we address the task of sentiment analysis of Twitter data. The task encompasses identification and categorization of opinions (e.g., negative, neutral, or positive) written in the quasi-natural language used in Twitter posts. We focus on procedures for estimating the predictive performance of machine learning models used to address this task. Performance estimation procedures are key to understanding the generalization ability of the models since they approximate how these models will behave on unseen data. In the particular case of sentiment analysis of Twitter data, high volumes of content are continuously being generated and there is no immediate feedback about the true class of instances. In this context, it is fundamental to adopt appropriate estimation procedures in order to get reliable estimates of the performance of the models.

The complexity of Twitter data raises some challenges on how to perform such estimations, as, to the best of our knowledge, there is currently no settled approach to this. Sentiment classes are typically ordered and unbalanced, and the data itself is time-ordered. Taking these properties into account is important for the selection of appropriate estimation procedures.

The Twitter data shares some characteristics of time series and some of static data. A time series is an array of observations at regular or equidistant time points, and the observations are in general dependent on previous observations [1]. On the other hand, Twitter data is time-ordered, but the observations are short texts posted by Twitter users at any time and frequency. It can be assumed that original Twitter posts are not directly dependent on previous posts. However, there is a potential indirect dependence, demonstrated in important trends and events, through influential users and communities, or individual user’s habits. These long-term topic drifts are typically not taken into account by the sentiment analysis models.

We study different performance estimation procedures for sentiment analysis in Twitter data. These estimation procedures are based on (**i**) cross-validation and (**ii**) sequential approaches typically adopted for time series data. On one hand, cross-validations explore all the available data, which is important for the robustness of estimates. On the other hand, sequential approaches are more realistic in the sense that estimates are computed on a subset of data always subsequent to the data used for training, which means that they take time-order into account.

Our experimental study is performed on a large collection of nearly 1.5 million Twitter posts, which are domain-free and in 13 different languages. A realistic scenario is emulated by partitioning the data into 138 datasets by language and time window. Each dataset is split into an in-sample set (a training plus test set), where estimation procedures are applied to approximate the performance of a model, and an out-of-sample set used to compute the gold standard. Our goal is to understand the ability of each estimation procedure to approximate the true error incurred by a given model on the out-of-sample data.

The paper is structured as follows. Related work provides an overview of the state-of-the-art in estimation methods. In section Methods and experiments we describe the experimental setting for an empirical comparison of estimation procedures for sentiment classification of time-ordered Twitter data. We describe the Twitter sentiment datasets, a machine learning algorithm we employ, performance measures, and how the gold standard and estimation results are produced. In section Results and discussion we present and discuss the results of comparisons of the estimation procedures along several dimensions. Conclusions provide the limitations of our work and give directions for the future.

## Related work

In this section we briefly review typical estimation methods used in sentiment classification of Twitter data. In general, for time-ordered data, the estimation methods used are variants of cross-validation, or are derived from the methods used to analyze time series data. We examine the state-of-the-art of these estimation methods, pointing out their advantages and drawbacks.

Several works in the literature on sentiment classification of Twitter data employ standard cross-validation procedures to estimate the performance of sentiment classifiers. For example, Agarwal et al. [2] and Mohammad et al. [3] propose different methods for sentiment analysis of Twitter data and estimate their performance using 5-fold and 10-fold cross-validation, respectively. Bermingham and Smeaton [4] produce a comparative study of sentiment analysis between blogs and Twitter posts, where models are compared using 10-fold cross-validation. Saif et al. [5] assess binary classification performance on nine Twitter sentiment datasets by 10-fold cross-validation. Other, similar applications of cross-validation are given in [6, 7].

On the other hand, there are also approaches that use methods typical for time series data. For example, Bifet and Frank [8] use the prequential (predictive sequential) method to evaluate a sentiment classifier on a stream of Twitter posts. Moniz et al. [9] present a method for predicting the popularity of news from Twitter data and sentiment scores, and estimate its performance using a sequential approach in multiple testing periods.

The idea behind *K*-fold cross-validation is to randomly shuffle the data and split it into *K* equally-sized folds. Each fold is a subset of the data randomly picked for testing. Models are trained on the other *K* − 1 folds and their performance is estimated on the left-out fold. *K*-fold cross-validation has several practical advantages, such as an efficient use of all the data. However, it is also based on the assumption that the data is independent and identically distributed [10], which is often not true. For example, in time-ordered data, such as Twitter posts, the data are to some extent dependent due to the underlying temporal order of tweets. Therefore, using *K*-fold cross-validation means that one uses future information to predict past events, which might hinder the generalization ability of models.

There are several methods in the literature designed to cope with dependence between observations. The most common are sequential approaches typically used in time series forecasting tasks. Some variants of *K*-fold cross-validation which relax the independence assumption were also proposed. For time-ordered data, an estimation procedure is sequential when testing is always performed on the data subsequent to the training set. Typically, the data is split into two parts, where the first is used to train the model and the second is held out for testing. These approaches are also known in the literature as the out-of-sample methods [11, 12].

Within sequential estimation methods one can adopt different strategies regarding train/test splitting, growing or sliding window setting, and eventual update of the models. In order to produce reliable estimates and test for robustness, Tashman [11] recommends employing these strategies in multiple testing periods. One should either create groups of data series according to, for example, different business cycles [13], or adopt a randomized approach, such as in [14]. A more complete overview of these approaches is given by Tashman [11].

In stream mining, where a model is continuously updated, the most commonly used estimation methods are holdout and prequential [15, 16]. The prequential strategy uses an incoming observation to first test the model and then to train it.
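The test-then-train order of the prequential strategy can be illustrated with a minimal Python sketch. The `MajorityClassModel` class and the `prequential_accuracy` function are hypothetical names introduced only for this illustration; the toy model is a stand-in and is not a classifier used in this study.

```python
class MajorityClassModel:
    """Toy incremental model (illustrative stand-in): predicts the most
    frequent label seen so far, or a default before any training."""

    def __init__(self, default=0):
        self.counts = {}
        self.default = default

    def predict(self, x):
        if not self.counts:
            return self.default
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, model):
    """Each incoming (x, y) pair is first used to test the model,
    and only afterwards to update it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first ...
            correct += 1
        total += 1
        model.learn(x, y)           # ... then train on the same observation
    return correct / total if total else 0.0
```

Because every observation is scored before the model has seen its label, the running estimate never uses future information.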

Besides sequential estimation methods, some variants of *K*-fold cross-validation were proposed in the literature that are specially designed to cope with dependency in the data and enable the application of cross-validation to time-ordered data. For example, blocked cross-validation (the name is adopted from Bergmeir [12]) was proposed by Snijders [17]. The method derives from a standard *K*-fold cross-validation, but there is no initial random shuffling of observations. This renders *K* blocks of contiguous observations.
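The difference between random and blocked fold construction can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in this study; the function names are hypothetical, and stratification is deliberately left out.

```python
import random


def blocked_folds(n, k=10):
    """Split the time-ordered indices 0..n-1 into k contiguous blocks
    (no shuffling, so each fold is a run of consecutive observations)."""
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def random_folds(n, k=10, seed=0):
    """Standard K-fold: shuffle the indices first, then cut into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    bounds = [i * n // k for i in range(k + 1)]
    return [sorted(idx[bounds[i]:bounds[i + 1]]) for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index lists; each fold is the test set once."""
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Note that even in the blocked variant, the training part of most splits contains observations both before and after the test block; only the mixing of neighbouring observations between training and test folds is avoided.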

The problem of data dependency for cross-validation is addressed by McQuarrie and Tsai [18]. The modified cross-validation removes observations from the training set that are dependent with the test observations. The main limitation of this method is its inefficient use of the available data since many observations are removed, as pointed out in [19]. The method is also known as non-dependent cross-validation [12].

The applicability of variants of cross-validation methods in time series data, and their advantages over traditional sequential validations are corroborated by Bergmeir et al. [12, 20, 21]. The authors conclude that in time series forecasting tasks, the blocked cross-validations yield better error estimates because of their more efficient use of the available data. Cerqueira et al. [22] compare performance estimation of various cross-validation and out-of-sample approaches on real-world and synthetic time series data. The results indicate that cross-validation is appropriate for the stationary synthetic time series data, while the out-of-sample approaches yield better estimates for real-world data.

Our contribution to the state-of-the-art is a large scale empirical comparison of several estimation procedures on Twitter sentiment data. We focus on the differences between the cross-validation and sequential validation methods, to see how important the violation of data independence is in the case of Twitter posts. We consider longer-term time-dependence between the training and test sets, and completely ignore finer-scale dependence at the level of individual tweets (e.g., retweets and replies). To the best of our knowledge, there is no settled approach yet regarding proper validation of models for Twitter time-ordered data. This work provides some results which contribute to bridging that gap.

## Methods and experiments

The goal of this study is to recommend appropriate estimation procedures for sentiment classification of Twitter time-ordered data. We assume a static sentiment classification model applied to a stream of Twitter posts. In a real-case scenario, the model is trained on historical, labeled tweets, and applied to the current, incoming tweets. We emulate this scenario by exploring a large collection of nearly 1.5 million manually labeled tweets in 13 European languages (see subsection Data and models). Each language dataset is split into pairs of the in-sample data, on which a model is trained, and the out-of-sample data, on which the model is validated. The performance of the model on the out-of-sample data gives an estimate of its performance on the future, unseen data. Therefore, we first compute a set of 138 out-of-sample performance results, to be used as a gold standard (subsection Gold standard). In effect, our goal is to find the estimation procedure that best approximates this out-of-sample performance.

Throughout our experiments we use only one training algorithm (subsection Data and models), and two performance measures (subsection Performance measures). During training, the performance of the trained model can be estimated only on the in-sample data. However, there are different estimation procedures which yield these approximations. In machine learning, a standard procedure is cross-validation, while for time-ordered data, sequential validation is typically used. In this study, we compare three variants of cross-validation and three variants of sequential validation (subsection Estimation procedures). The goal is to find the in-sample estimation procedure that best approximates the out-of-sample gold standard. The error an estimation procedure makes is defined as the difference to the gold standard.

### Data and models

We collected a large corpus of nearly 1.5 million Twitter posts written in 13 European languages. This is, to the best of our knowledge, by far the largest set of sentiment labeled tweets publicly available. We engaged native speakers to label the tweets based on the sentiment expressed in them. The sentiment label has three possible values: negative, neutral or positive. It turned out that the human annotators perceived the values as ordered. The quality of annotations varies though, and is estimated from the self- and inter-annotator agreements. All the details about the datasets, the annotator agreements, and the ordering of sentiment values are in our previous study [23]. The sentiment distribution and quality of the individual language datasets are given in Table 1. The tweets in the datasets are ordered by tweet IDs, which corresponds to ordering by the time of posting.

**Table 1.** The last column is a qualitative assessment of the annotation quality, based on the levels of the self- and inter-annotator agreement.

There are many supervised machine learning algorithms suitable for training sentiment classification models from labeled tweets. In this study we use a variant of Support Vector Machine (SVM) [24]. The basic SVM is a two-class, binary classifier. In the training phase, SVM constructs a hyperplane in a high-dimensional vector space that separates one class from the other. In the classification phase, the side of the hyperplane determines the class. A two-class SVM can be extended into a multi-class classifier which takes the ordering of sentiment values into account, and implements ordinal classification [25]. Such an extension consists of two SVM classifiers: one classifier is trained to separate the negative examples from the neutral-or-positives; the other separates the negative-or-neutrals from the positives. The result is a classifier with two hyperplanes, which partitions the vector space into three subspaces: negative, neutral, and positive. During classification, the distances from both hyperplanes determine the predicted class. A further refinement is a **TwoPlaneSVMbin** classifier. It partitions the space around both hyperplanes into bins, and computes the distribution of the training examples in individual bins. During classification, the distances from both hyperplanes determine the appropriate bin, but the class is determined as the majority class in the bin.

The vector space is defined by the features extracted from the Twitter posts. The posts are first pre-processed by standard text processing methods, i.e., tokenization, stemming/lemmatization (if available for a specific language), unigram and bigram construction, and elimination of terms that do not appear at least 5 times in a dataset. Twitter-specific pre-processing is then applied, i.e., replacing URLs, Twitter usernames and hashtags with common tokens, adding emoticon features for different types of emoticons in tweets, handling of repetitive letters, etc. The feature vectors are then constructed by the Delta TF-IDF weighting scheme [26].

In our previous study [23] we compared five variants of the SVM classifiers and Naive Bayes on the Twitter sentiment classification task. TwoPlaneSVMbin was always among the top-performing classifiers, which were statistically indistinguishable from each other. It turned out that monitoring the quality of the annotation process has a much larger impact on the performance than the type of the classifier used. In this study we fix the classifier, and use TwoPlaneSVMbin in all the experiments.

### Performance measures

Sentiment values are ordered, and the distribution of tweets between the three sentiment classes is often unbalanced. In such cases, *accuracy* is not the most appropriate performance measure [8, 23]. In this context, we evaluate performance with the following two metrics: Krippendorff’s *Alpha* [27], and $\overline{F_1}$ [28].

*Alpha* was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. *Alpha* is defined as follows:

$$Alpha = 1 - \frac{D_o}{D_e} \tag{1}$$

where *D*_{o} is the observed disagreement between models, and *D*_{e} is a disagreement, expected by chance. When models agree perfectly, *Alpha* = 1, and when the level of agreement equals the agreement by chance, *Alpha* = 0. Note that *Alpha* can also be negative. The two disagreement measures are defined as:

$$D_o = \frac{1}{N} \sum_{c, c'} N(c, c') \cdot \delta^2(c, c') \tag{2}$$

$$D_e = \frac{1}{N(N-1)} \sum_{c, c'} N(c) \cdot N(c') \cdot \delta^2(c, c') \tag{3}$$

The arguments, *N*, *N*(*c*, *c*′), *N*(*c*), and *N*(*c*′), refer to the frequencies in a coincidence matrix, defined below. *c* (and *c*′) is a discrete sentiment variable with three possible values: *negative* (−1), *neutral* (0), or *positive* (+1). *δ*(*c*, *c*′) is a difference function between the values of *c* and *c*′, for ordered variables defined as:

$$\delta(c, c') = |c - c'|, \quad c, c' \in \{-1, 0, +1\} \tag{4}$$

Note that disagreements *D*_{o} and *D*_{e} between the extreme classes (*negative* and *positive*) are four times larger than between the neighbouring classes.

A coincidence matrix tabulates all pairable values of *c* from two models. In our case, we have a 3-by-3 coincidence matrix, and compare a model to the gold standard. The coincidence matrix is then the sum of the confusion matrix and its transpose. Each labeled tweet is entered twice, once as a (*c*, *c*′) pair, and once as a (*c*′, *c*) pair. *N*(*c*, *c*′) is the number of tweets labeled by the values *c* and *c*′ by different models, *N*(*c*) and *N*(*c*′) are the totals for each value, and *N* is the grand total.
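As an illustration, *Alpha* can be computed from a 3-by-3 confusion matrix along the lines described above. The following Python sketch assumes the usual form of Krippendorff's measure, 1 − *D*_{o}/*D*_{e}, with the squared difference between the class values −1, 0, +1 as the disagreement weight; the function name and the row/column convention are choices made for this example, not taken from the study's code.

```python
def krippendorff_alpha(confusion, values=(-1, 0, 1)):
    """Alpha for one model compared against a gold standard.

    confusion[i][j] = number of tweets with gold class values[i]
    that the model labeled as values[j] (assumed convention).
    """
    k = len(values)
    # Coincidence matrix: confusion matrix plus its transpose, so every
    # labeled tweet is entered twice, as (c, c') and as (c', c).
    coinc = [[confusion[i][j] + confusion[j][i] for j in range(k)]
             for i in range(k)]
    totals = [sum(row) for row in coinc]      # N(c)
    n = sum(totals)                           # grand total N
    # Squared difference between ordered class values.
    delta2 = [[(values[i] - values[j]) ** 2 for j in range(k)]
              for i in range(k)]
    d_o = sum(coinc[i][j] * delta2[i][j]
              for i in range(k) for j in range(k)) / n
    d_e = sum(totals[i] * totals[j] * delta2[i][j]
              for i in range(k) for j in range(k)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("expected disagreement is zero (only one class present)")
    return 1.0 - d_o / d_e
```

With this weighting, confusing *negative* with *positive* contributes four times as much disagreement as confusing either of them with *neutral*.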

$\overline{F_1}$ is an instance of the *F* score, a well-known performance measure in information retrieval [29] and machine learning. We use an instance specifically designed to evaluate the 3-class sentiment models [28]. $\overline{F_1}$ is defined as follows:

$$\overline{F_1} = \frac{F_1(-1) + F_1(+1)}{2} \tag{5}$$

$\overline{F_1}$ implicitly takes into account the ordering of sentiment values, by considering only the extreme labels, *negative* (−1) and *positive* (+1). The middle, *neutral*, is taken into account only indirectly. *F*_{1}(*c*) is the harmonic mean of precision and recall for class *c*, *c* ∈ {−1, +1}. $\overline{F_1} = 1$ implies that all negative and positive tweets were correctly classified, and as a consequence, all neutrals as well. $\overline{F_1} = 0$ indicates that all negative and positive tweets were incorrectly classified. $\overline{F_1}$ does not account for correct classification by chance.
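A minimal Python sketch of this measure, taken here as the average of the *F*_{1} scores of the negative and positive classes, is given below. The function names and the confusion matrix convention (rows are gold labels, columns are predicted labels, in the order negative, neutral, positive) are assumptions made for the example.

```python
def f1_for_class(confusion, idx):
    """F1 (harmonic mean of precision and recall) for the class at index idx."""
    tp = confusion[idx][idx]
    predicted = sum(row[idx] for row in confusion)   # column sum
    actual = sum(confusion[idx])                     # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_neg_pos(confusion):
    """Average of the F1 scores of the negative (index 0) and
    positive (index 2) classes; neutral enters only indirectly."""
    return (f1_for_class(confusion, 0) + f1_for_class(confusion, 2)) / 2
```

Misclassified neutral tweets still lower the score, because they reduce the precision of the negative or positive class.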

### Gold standard

We create the gold standard results by splitting the data into the in-sample datasets (abbreviated as in-set), and out-of-sample datasets (abbreviated as out-set). The terminology of the in- and out-set is adopted from Bergmeir et al. [12]. Tweets are ordered by the time of posting. To emulate a realistic scenario, an out-set always follows the in-set. From each language dataset (Table 1) we create *L* in-sets of varying length in multiples of 10,000 consecutive tweets, where *L* = ⌊*N*/10000⌋. The out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of each language dataset. This is illustrated in Fig 1.

**Fig 1.** Each labeled language dataset (Table 1) is partitioned into *L* in-sets and corresponding out-sets. The in-sets always start at the first tweet and are progressively longer in multiples of 10,000 tweets. The corresponding out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of the language dataset.
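The partitioning into in-sets and out-sets can be sketched as follows. The function name `in_out_sets` is hypothetical, and the handling of a dataset whose size is an exact multiple of 10,000 (where the last out-set would be empty) is an assumption of this sketch, since the text does not cover that case.

```python
def in_out_sets(n_tweets, step=10_000):
    """Return a list of (in_set, out_set) index ranges over tweets
    ordered by time of posting.

    In-sets start at the first tweet and grow in multiples of `step`;
    each out-set is the next `step` tweets, or the remainder at the end.
    """
    splits = []
    n_insets = n_tweets // step                  # L = floor(N / step)
    for k in range(1, n_insets + 1):
        in_end = k * step
        out_end = min(in_end + step, n_tweets)
        if out_end > in_end:                     # assumption: skip an empty out-set
            splits.append((range(0, in_end), range(in_end, out_end)))
    return splits
```

For example, a dataset of 25,000 tweets yields two pairs: the first 10,000 tweets with the next 10,000 as the out-set, and the first 20,000 tweets with the remaining 5,000 as the out-set.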

The partitioning of the language datasets results in 138 in-sets and corresponding out-sets. For each in-set, we train a TwoPlaneSVMbin sentiment classification model, and measure its performance, in terms of *Alpha* and $\overline{F_1}$, on the corresponding out-set. The results are in Tables 2 and 3. Note that the performance measured by *Alpha* is considerably lower in comparison to $\overline{F_1}$, since the baseline for *Alpha* is classification by chance.

**Table 2.** The baseline, *Alpha* = 0, indicates classification by chance.

**Table 3.** The baseline, $\overline{F_1} = 0$, indicates that all negative and positive examples are classified incorrectly.

The 138 in-sets are used to train sentiment classification models and estimate their performance. The goal of this study is to analyze different estimation procedures in terms of how well they approximate the out-set gold standard results shown in Tables 2 and 3.

### Estimation procedures

There are different estimation procedures, some more suitable for static data, while others are more appropriate for time-series data. Time-ordered Twitter data shares some properties of both types of data. When training an SVM model, the order of tweets is irrelevant and the model does not capture the dynamics of the data. When applying the model, however, new tweets might introduce new vocabulary and topics. As a consequence, the temporal ordering of training and test data has a potential impact on the performance estimates.

We therefore compare two classes of estimation procedures: cross-validation, commonly used in machine learning for model evaluation on static data, and sequential validation, commonly used for time-series data. There are many variants and parameters for each class of procedures. Our datasets are relatively large and an application of each estimation procedure takes several days to complete. We have selected three variants of each procedure to provide answers to some relevant questions.

First, we apply 10-fold cross-validation where the training:test set ratio is always 9:1. Cross-validation is *stratified* when the fold partitioning is not completely random, but each fold has roughly the same class distribution. We also compare standard *random* selection of examples to the *blocked* form of cross-validation [12, 17], where each fold is a block of consecutive tweets. We use the following abbreviations for cross-validations:

- **xval(9:1, strat, block)**: 10-fold, stratified, blocked;
- **xval(9:1, no-strat, block)**: 10-fold, not stratified, blocked;
- **xval(9:1, strat, rand)**: 10-fold, stratified, random selection of examples.

In sequential validation, a sample consists of the training set immediately followed by the test set. We vary the ratio of the training and test set sizes, and the number and distribution of samples taken from the in-set. The number of samples is 10 or 20, and they are distributed equidistantly or semi-equidistantly. In all variants, samples cover the whole in-set, but they are overlapping. See Fig 2 for illustration. We use the following abbreviations for sequential validations:

- **seq(9:1, 20, equi)**: 9:1 training:test ratio, 20 equidistant samples;
- **seq(9:1, 10, equi)**: 9:1 training:test ratio, 10 equidistant samples;
- **seq(2:1, 10, semi-equi)**: 2:1 training:test ratio, 10 samples randomly selected out of 20 equidistant points.

**Fig 2.** A sample consists of a training set, immediately followed by a test set. We consider two scenarios: (A) The ratio of the training and test set is 9:1, and the sample is shifted along 10 or 20 equidistant points. (B) The training:test set ratio is 2:1 and the sample is positioned at 10 randomly selected points out of 20 equidistant points.
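One possible way to lay out such samples is sketched below in Python. The text does not state how long each sample (training plus test set) is, so the `window` parameter is a purely hypothetical choice of this sketch, as are the function names; the actual sample lengths used in the study may differ.

```python
import random


def sequential_samples(n, n_points, train_parts, test_parts, window):
    """Place n_points samples of length `window` (hypothetical parameter)
    at equidistant offsets over the in-set indices 0..n-1.

    Each sample is a training range immediately followed by a test range,
    with sizes in the ratio train_parts:test_parts.
    """
    train_len = window * train_parts // (train_parts + test_parts)
    samples = []
    for i in range(n_points):
        start = i * (n - window) // (n_points - 1) if n_points > 1 else 0
        train = range(start, start + train_len)
        test = range(start + train_len, start + window)
        samples.append((train, test))
    return samples


def semi_equidistant(samples, k, seed=0):
    """Randomly keep k of the equidistant samples, preserving time order."""
    picked = sorted(random.Random(seed).sample(range(len(samples)), k))
    return [samples[i] for i in picked]
```

Under these assumptions, `sequential_samples(n, 20, 9, 1, window)` corresponds to the 20-point equidistant layout, and `semi_equidistant(sequential_samples(n, 20, 2, 1, window), 10)` to the semi-equidistant one. When the window is longer than the distance between neighbouring offsets, consecutive samples overlap.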

## Results and discussion

We compare six estimation procedures in terms of different types of errors they incur. The error is defined as the difference to the gold standard. First, the magnitude and sign of the errors show whether a method tends to underestimate or overestimate the performance, and by how much (subsection Median errors). Second, relative errors give fractions of small, moderate, and large errors that each procedure incurs (subsection Relative errors). Third, we rank the estimation procedures in terms of increasing absolute errors, and estimate the significance of the overall ranking by the Friedman-Nemenyi test (subsection Friedman test). Finally, selected pairs of estimation procedures are compared by the Wilcoxon signed-rank test (subsection Wilcoxon test).

### Median errors

An estimation procedure estimates the performance (abbreviated *Est*) of a model in terms of *Alpha* and $\overline{F_1}$. The error it incurs is defined as the difference to the gold standard performance (abbreviated *Gold*): *Err* = *Est* − *Gold*. The validation results show high variability of the errors, with skewed distribution and many outliers. Therefore, we summarize the errors in terms of their medians and quartiles, instead of the averages and variances.
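The error summary can be illustrated with a short Python sketch based on the standard library. The function name and the choice of the "inclusive" quantile method are assumptions of this example, not details taken from the study.

```python
from statistics import median, quantiles


def summarize_errors(estimates, gold):
    """Signed errors Err = Est - Gold, summarized by median and quartiles.

    A positive median means the procedure tends to overestimate the
    performance, a negative median that it tends to underestimate it.
    """
    errors = [e - g for e, g in zip(estimates, gold)]
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    return {"median": median(errors), "q1": q1, "q3": q3}
```

Medians and quartiles are less sensitive than means and variances to the outliers and skew mentioned above.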

The median errors of the six estimation procedures are in Tables 4 and 5, measured by *Alpha* and $\overline{F_1}$, respectively.

Fig 3 depicts the errors with box plots. The band inside the box denotes the median, the box spans the second and third quartile, and the whiskers extend to 1.5 times the interquartile range. The dots correspond to the outliers. Fig 3 shows high variability of errors for individual datasets. This is most pronounced for the Serbian/Croatian/Bosnian (scb) and Portuguese (por) datasets where variation in annotation quality (scb) and a radical topic shift (por) were observed. Higher variability is also observed for the Spanish (spa) and Albanian (alb) datasets, which have poor sentiment annotation quality (see [23] for details).

**Fig 3.** Errors are measured in terms of *Alpha*.

The differences between the estimation procedures are easier to detect when we aggregate the errors over all language datasets. The results are in Figs 4 and 5, for *Alpha* and $\overline{F_1}$, respectively. In both cases we observe that the cross-validation procedures (xval) consistently overestimate the performance, while the sequential validations (seq) underestimate it. The largest overestimation errors are incurred by the random cross-validation, and the largest underestimations by the sequential validation with the training:test set ratio 2:1. We also observe high variability of errors, with many outliers. The conclusions are consistent for both measures, *Alpha* and $\overline{F_1}$.

**Fig 4.** Errors are measured in terms of *Alpha*.

**Fig 5.** Errors are measured in terms of $\overline{F_1}$.

### Relative errors

Another useful analysis of estimation errors is provided by a comparison of relative errors. The relative error is the absolute error an estimation procedure incurs divided by the gold standard result: *RelErr* = |*Est* − *Gold*|/*Gold*. We chose two, rather arbitrary, thresholds of 5% and 30%, and classify the relative errors as small (*RelErr* < 5%), moderate (5% ≤ *RelErr* ≤ 30%), and large (*RelErr* > 30%).
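The classification of relative errors can be written down directly. In this Python sketch the function names are illustrative, and the gold standard value is assumed to be positive, since the formula divides by *Gold*.

```python
def relative_error(est, gold):
    """RelErr = |Est - Gold| / Gold (assumes gold > 0)."""
    return abs(est - gold) / gold


def error_class(rel_err, small=0.05, large=0.30):
    """Classify a relative error using the 5% and 30% thresholds."""
    if rel_err < small:
        return "small"
    if rel_err <= large:
        return "moderate"
    return "large"
```

For instance, an estimate of 0.52 against a gold standard of 0.50 is a relative error of about 4%, which falls in the small class, while an estimate of 0.70 against the same gold standard is about 40%, which is large.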

Fig 6 shows the proportion of the three types of errors, measured by *Alpha*, for individual language datasets. Again, we observe a higher proportion of large errors for languages with poor annotations (alb, spa), annotations of different quality (scb), and different topics (por).

**Fig 6.** Small errors (< 5%) are in blue, moderate ([5, 30]%) in green, and large errors (> 30%) in red.

Figs 7 and 8 aggregate the relative errors across all the datasets, for *Alpha* and $\overline{F_1}$, respectively. The proportion of errors is consistent between *Alpha* and $\overline{F_1}$, but there are more large errors when the performance is measured by *Alpha*. This is due to smaller error magnitude when the performance is measured by *Alpha* in contrast to $\overline{F_1}$, since *Alpha* takes classification by chance into account. With respect to individual estimation procedures, there is a considerable divergence of the random cross-validation. For both performance measures, *Alpha* and $\overline{F_1}$, it consistently incurs a higher proportion of large errors and a lower proportion of small errors in comparison to the rest of the estimation procedures.

Fig 7. Small errors (< 5%) are in blue, moderate ([5, 30]%) in green, and large errors (> 30%) in red.

Fig 8. Small errors (< 5%) are in blue, moderate ([5, 30]%) in green, and large errors (> 30%) in red.

### Friedman test

The Friedman test is used to compare multiple procedures over multiple datasets [30–33]. For each dataset, it ranks the procedures by their performance. It tests the null hypothesis that the average ranks of the procedures across all the datasets are equal. If the null hypothesis is rejected, one applies the Nemenyi post-hoc test [34] on pairs of procedures. The performance of two procedures is significantly different if their average ranks differ by at least the critical difference. The critical difference depends on the number of procedures to compare, the number of different datasets, and the selected significance level.

In our case, the performance of an estimation procedure is taken as the absolute error it incurs: *AbsErr* = |*Est* − *Gold*|. The estimation procedure with the lowest absolute error gets the lowest (best) rank. The results of the Friedman-Nemenyi test are in Figs 9 and 10, for *Alpha* and $\overline{F_1}$, respectively.
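The ranking step can be illustrated with a minimal Python sketch. It ranks procedures by absolute error within each dataset, averages the ranks, and computes the basic Friedman chi-square statistic as given by Demšar [33]. The data and function names are hypothetical, and the statistic below applies no tie correction, so this is an illustration rather than a reproduction of the published analysis.

```python
def rank_with_ties(values):
    """Return 1-based ranks (lowest value gets rank 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def average_ranks(abs_errors):
    """abs_errors: dict mapping dataset -> list of absolute errors, one per procedure."""
    per_dataset = [rank_with_ties(errs) for errs in abs_errors.values()]
    n, k = len(per_dataset), len(per_dataset[0])
    return [sum(r[j] for r in per_dataset) / n for j in range(k)]


def friedman_statistic(avg_ranks, n):
    """Basic Friedman chi-square statistic for k procedures over n datasets (no tie correction)."""
    k = len(avg_ranks)
    return 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)


# Hypothetical absolute errors of three procedures on three datasets.
abs_errors = {
    "A": [0.01, 0.03, 0.02],
    "B": [0.02, 0.05, 0.02],
    "C": [0.01, 0.04, 0.03],
}
avg = average_ranks(abs_errors)
stat = friedman_statistic(avg, n=len(abs_errors))
print(avg, stat)
```

A lower average rank means the procedure tends to incur smaller absolute errors across datasets.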

Fig 9. The average ranks are computed from absolute errors, measured by *Alpha*. The black bars connect ranks that are not significantly different at the 5% level.

Fig 10. The average ranks are computed from absolute errors, measured by $\overline{F_1}$. The black bar connects ranks that are not significantly different at the 5% level.

For both performance measures, *Alpha* and $\overline{F_1}$, the Friedman rankings are the same. For six estimation procedures, 13 language datasets, and the 5% significance level, the critical difference is 2.09. In the case of $\overline{F_1}$ (Fig 10) all six estimation procedures are within the critical difference, so their ranks are not significantly different. In the case of *Alpha* (Fig 9), however, the two best methods are significantly better than the random cross-validation.
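The critical difference quoted above can be checked with the Nemenyi formula from Demšar [33], CD = q_α · sqrt(k(k+1)/(6N)). The sketch below assumes the tabulated critical value q_0.05 = 2.850 for k = 6 procedures from that reference; the function name is hypothetical.

```python
import math


def nemenyi_critical_difference(q_alpha: float, k: int, n: int) -> float:
    """Critical difference of average ranks for k procedures compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


# Assumed critical value for k = 6 at the 5% level (Demšar's table for the Nemenyi test).
Q_ALPHA_005_K6 = 2.850

cd = nemenyi_critical_difference(Q_ALPHA_005_K6, k=6, n=13)
print(round(cd, 2))  # 2.09
```

Two procedures are then considered significantly different when their average ranks differ by at least this value.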

### Wilcoxon test

The Wilcoxon signed-rank test is used to compare two procedures on related data [33, 35]. It ranks the differences in performance of the two procedures, and compares the ranks for the positive and negative differences. Greater differences count more, but the absolute magnitudes are ignored. It tests the null hypothesis that the differences follow a symmetric distribution around zero. If the null hypothesis is rejected, one can conclude that one procedure outperforms the other at the selected significance level.

In our case, the performance of pairs of estimation procedures is compared at the level of language datasets. The absolute errors of an estimation procedure are averaged across the in-sets of a language. The average absolute error is then *AvgAbsErr* = ∑|*Est* − *Gold*|/*L*, where *L* is the number of in-sets. The results of the Wilcoxon test, for selected pairs of estimation procedures, for both *Alpha* and $\overline{F_1}$, are in Fig 11.
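As a rough illustration of how the signed-rank comparison works, the following Python sketch computes the rank sums of positive and negative differences between two procedures. The input values are hypothetical average absolute errors, one per language dataset. Zero differences are dropped here, which is one common convention and an assumption of this sketch; no p-value is computed.

```python
def signed_rank_sums(errors_a, errors_b):
    """Return (W+, W-): rank sums of positive and negative paired differences a - b.

    Differences are ranked by absolute value; ties share the average rank;
    zero differences are dropped (an assumed convention).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical average absolute errors of two procedures on five language datasets.
errors_a = [0.05, 0.07, 0.04, 0.09, 0.06]
errors_b = [0.03, 0.04, 0.05, 0.05, 0.06]
w_plus, w_minus = signed_rank_sums(errors_a, errors_b)
print(w_plus, w_minus)
```

The smaller of the two sums is the test statistic that is compared against the critical value for the given number of non-zero differences. A large imbalance between the sums, as in this toy example, indicates that one procedure tends to incur larger errors.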

Fig 11. Compared are the average absolute errors, measured by *Alpha* (top) and $\overline{F_1}$ (bottom). Thick solid lines denote significant differences at the 1% level, normal solid lines significant differences at the 5% level, and dashed lines insignificant differences. Arrows point from a procedure which incurs smaller errors to a procedure with larger errors.

The Wilcoxon test results confirm and reinforce the main results of the previous sections. Among the cross-validation procedures, blocked cross-validation is consistently better than the random cross-validation, at the 1% significance level. The stratified approach is better than the non-stratified one, but significantly (5% level) only for $\overline{F_1}$. The comparison of the sequential validation procedures is less conclusive. The training:test set ratio 9:1 is better than 2:1, but significantly (at the 5% level) only for *Alpha*. With the ratio 9:1 fixed, 20 samples yield better performance estimates than 10 samples, but significantly (5% level) only for $\overline{F_1}$. We found no significant difference between the best cross-validation and sequential validation procedures in terms of the average absolute errors they incur.

### Data and code availability

All Twitter data were collected through the public Twitter API and are subject to the Twitter terms and conditions. The Twitter language datasets are available in a public language resource repository clarin.si at http://hdl.handle.net/11356/1054, and are described in [23]. There are 15 language files, where the Serbian/Croatian/Bosnian dataset is provided as three separate files for the constituent languages. For each language and each labeled tweet, there is the tweet ID (as provided by Twitter), the sentiment label (negative, neutral, or positive), and the annotator ID (anonymized). Note that the Twitter terms do not allow the original tweets to be published openly; they have to be fetched through the Twitter API. Precise details on how to fetch the tweets, given tweet IDs, are provided in the Twitter API documentation at https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-lookup. However, upon request to the corresponding author, a bilateral agreement on the joint use of the original data can be reached.

The TwoPlaneSVMbin classifier and several other machine learning algorithms are implemented in an open source LATINO library [36]. LATINO is a light-weight set of software components for building text mining applications, openly available at https://github.com/latinolib.

All the performance results, for gold standard and the six estimation procedures, are provided in a form which allows for easy reproduction of the presented results. The **R** code and data files needed to reproduce all the figures and tables in the paper are available at http://ltorgo.github.io/TwitterDS/.

## Conclusions

In this paper we present an extensive empirical study of performance estimation procedures for sentiment analysis of Twitter data. Currently, there is no settled approach to properly evaluating models in such a scenario. Twitter time-ordered data shares some properties of static data for text mining, and some of time series data. Therefore, we compare estimation procedures developed for both types of data.

The main result of the study is that standard, random cross-validation should not be used when dealing with time-ordered data. Instead, one should use blocked cross-validation, a conclusion already corroborated by Bergmeir et al. [12, 20]. Another result is that we find no significant differences between the blocked cross-validation and the best sequential validation. However, we do find that cross-validations typically overestimate the performance, while sequential validations underestimate it.

The results are robust in the sense that we use two different performance measures, several comparisons and tests, and a very large collection of data. To the best of our knowledge, we analyze and provide by far the largest set of manually sentiment-labeled tweets publicly available.

There are some biased decisions in our creation of the gold standard though, which limit the generality of the results reported, and should be addressed in future work. An out-set always consists of 10,000 tweets, and immediately follows the in-sets. We do not consider how the performance drops over longer out-sets, nor how frequently a model should be updated. More importantly, we intentionally ignore the issue of dependent observations, between the in- and out-sets, and between the training and test sets. In the case of tweets, short-term dependencies are demonstrated in the form of retweets and replies. Medium- and long-term dependencies are shaped by periodic events, influential users and communities, or individual users' habits. When this is ignored, the model performance is likely overestimated. Since we do this consistently, our comparative results still hold. The issue of dependent observations was already addressed for blocked cross-validation [21, 37] by removing adjacent observations between the training and test sets, thus effectively creating a gap between the two. Finally, it should be noted that different Twitter language datasets are of different sizes and annotation quality, belong to different time periods, and that there are time periods in the datasets without any manually labeled tweets.
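The gap idea mentioned above can be sketched as follows. This is a minimal illustration of blocked cross-validation over time-ordered items with adjacent observations removed around the test block. It is not the procedure evaluated in this study, which uses no gap; the function name and parameters are hypothetical.

```python
def blocked_cv_with_gap(n: int, k: int, gap: int):
    """Yield (train_indices, test_indices) for k contiguous blocks over n time-ordered items.

    The `gap` items immediately before and after each test block are excluded
    from training, to reduce dependence between training and test observations.
    """
    bounds = [i * n // k for i in range(k + 1)]  # block boundaries in time order
    for i in range(k):
        start, stop = bounds[i], bounds[i + 1]
        test = list(range(start, stop))
        train = [j for j in range(n) if j < start - gap or j >= stop + gap]
        yield train, test


# Example: 20 time-ordered items, 4 blocks, a gap of 2 items on each side.
folds = list(blocked_cv_with_gap(n=20, k=4, gap=2))
for train, test in folds:
    print(test[0], test[-1], len(train))
```

With gap = 0 this reduces to plain blocked cross-validation, where each fold is a contiguous block of the time-ordered data.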

## Acknowledgments

We thank Miha Grčar and Sašo Rutar for valuable discussions and implementation of the LATINO library.

## References

- 1. Anderson OD. More effective time-series analysis and forecasting. Journal of Computational and Applied Mathematics. 1995;64(1-2):117–147.
- 2. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau R. Sentiment analysis of Twitter data. In: Proc. Workshop on Languages in Social Media. ACL; 2011. p. 30–38.
- 3. Mohammad SM, Kiritchenko S, Zhu X. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242; 2013.
- 4. Bermingham A, Smeaton AF. Classifying sentiment in microblogs: is brevity an advantage? In: Proc. 19th ACM Intl. Conference on Information and Knowledge Management. ACM; 2010. p. 1833–1836.
- 5. Saif H, Fernández M, He Y, Alani H. Evaluation datasets for Twitter sentiment analysis: A survey and a new dataset, the STS-Gold. In: Proc. 1st Intl. Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and Perspectives from AI (ESSEM); 2013.
- 6. Saif H, He Y, Alani H. Semantic sentiment analysis of Twitter. In: Proc. Intl. Semantic Web Conference (ISWC). Springer; 2012. p. 508–524.
- 7. Wang X, Wei F, Liu X, Zhou M, Zhang M. Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach. In: Proc. 20th ACM Intl. Conference on Information and Knowledge Management. ACM; 2011. p. 1031–1040.
- 8. Bifet A, Frank E. Sentiment knowledge discovery in Twitter streaming data. In: Proc. 13th Intl. Conference on Discovery Science; 2010. p. 1–15.
- 9. Moniz N, Torgo L, Rodrigues F. Resampling approaches to improve news importance prediction. In: Proc. Advances in Intelligent Data Analysis XIII (IDA). Springer; 2014. p. 215–226.
- 10. Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Statistics Surveys. 2010;4:40–79.
- 11. Tashman LJ. Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting. 2000;16(4):437–450.
- 12. Bergmeir C, Benítez JM. On the use of cross-validation for time series predictor evaluation. Information Sciences. 2012;191:192–213.
- 13. Fildes R. Evaluation of aggregate and individual forecast method selection rules. Management Science. 1989;35(9):1056–1065.
- 14. Torgo L. An infra-structure for performance estimation and experimental comparison of predictive models in R. arXiv preprint arXiv:1412.0436; 2014.
- 15. Bifet A, Kirkby R. Data stream mining: a practical approach. The University of Waikato, New Zealand; 2009.
- 16. Ikonomovska E, Gama J, Džeroski S. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery. 2011;23(1):128–168.
- 17. Snijders TAB. On cross-validation for predictor evaluation in time series. In: Proc. Workshop On Model Uncertainty and its Statistical Implications. Springer; 1988. p. 56–69.
- 18. McQuarrie AD, Tsai CL. Regression and Time Series Model Selection. Singapore: World Scientific Publishing; 1998.
- 19. Bergmeir C, Hyndman RJ, Koo B, et al. A Note on the Validity of Cross-Validation for Evaluating Time Series Prediction. Monash University, Department of Econometrics and Business Statistics, Working Paper. 2015;10.
- 20. Bergmeir C, Benítez JM. Forecaster performance evaluation with cross-validation and variants. In: Proc. 11th Intl. Conference on Intelligent Systems Design and Applications (ISDA). IEEE; 2011. p. 849–854.
- 21. Bergmeir C, Costantini M, Benítez JM. On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis. 2014;76:132–143.
- 22. Cerqueira V, Torgo L, Smailović J, Mozetič I. A comparative study of performance estimation methods for time series forecasting. In: Proc. 4th Intl. Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2017. p. 529–538.
- 23. Mozetič I, Grčar M, Smailović J. Multilingual Twitter sentiment classification: the role of human annotators. PLoS ONE. 2016;11(5):e0155036. pmid:27149621
- 24. Vapnik VN. The Nature of Statistical Learning Theory. New York, USA: Springer; 1995.
- 25. Gaudette L, Japkowicz N. Evaluation methods for ordinal classification. In: Advances in Artificial Intelligence; 2009. p. 207–210.
- 26. Martineau J, Finin T. Delta TFIDF: An improved feature space for sentiment analysis. In: Proc. 3rd AAAI Intl. Conference on Weblogs and Social Media (ICWSM); 2009. p. 258–261.
- 27. Krippendorff K. Content Analysis, An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA, USA: Sage Publications; 2013.
- 28. Kiritchenko S, Zhu X, Mohammad SM. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research. 2014;50:723–762.
- 29. Van Rijsbergen CJ. Information Retrieval. 2nd ed. Newton, MA, USA: Butterworth; 1979.
- 30. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association. 1937;32(200):675–701.
- 31. Friedman M. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics. 1940;11(1):86–92.
- 32. Iman RL, Davenport JM. Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods. 1980;9(6):571–595.
- 33. Demšar J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006;7(Jan):1–30.
- 34. Nemenyi PB. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, USA; 1963.
- 35. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bulletin. 1945;1(6):80–83.
- 36. Grčar M. Mining text-enriched heterogeneous information networks. PhD thesis, Jozef Stefan International Postgraduate School, Ljubljana, Slovenia; 2015.
- 37. Racine J. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics. 2000;99(1):39–61.