How to evaluate sentiment classifiers for Twitter time-ordered data?

Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, stock market, etc. In this paper we focus on sentiment classification of Twitter data. Construction of sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample datasets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than the blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.


Introduction
Online social media are becoming increasingly important in our society. Platforms such as Twitter and Facebook influence the daily lives of people around the world. Their users create and exchange a wide variety of contents on social media, which presents a valuable source of information about public sentiment regarding social, economic or political issues. In this context, it is important to develop automatic methods to retrieve and analyze information from social media.
In the paper we address the task of sentiment analysis of Twitter data. The task encompasses identification and categorization of opinions (e.g., negative, neutral, or positive) written in quasi-natural language used in Twitter posts. We focus on estimation procedures of the predictive performance of machine learning models used to address this task. Performance estimation procedures are key to understanding the generalization ability of the models since they present approximations of how these models will behave on unseen data. In the particular case of sentiment analysis of Twitter data, high volumes of content are continuously being generated and there is no immediate feedback about the true class of instances. In this context, it is fundamental to adopt appropriate estimation procedures in order to get reliable estimates of the performance of the models. The complexity of Twitter data raises some challenges on how to perform such estimations, as, to the best of our knowledge, there is currently no settled approach to this. Sentiment classes are typically ordered and unbalanced, and the data itself is time-ordered. Taking these properties into account is important for the selection of appropriate estimation procedures.
The Twitter data shares some characteristics of time series and some of static data. A time series is an array of observations at regular or equidistant time points, and the observations are in general dependent on previous observations [1]. On the other hand, Twitter data is time-ordered, but the observations are short texts posted by Twitter users at any time and frequency. It can be assumed that original Twitter posts are not directly dependent on previous posts. However, there is a potential indirect dependence, demonstrated in important trends and events, through influential users and communities, or individual user's habits. These long-term topic drifts are typically not taken into account by the sentiment analysis models.
We study different performance estimation procedures for sentiment analysis in Twitter data. These estimation procedures are based on (i) cross-validation and (ii) sequential approaches typically adopted for time series data. On one hand, cross-validations explore all the available data, which is important for the robustness of estimates. On the other hand, sequential approaches are more realistic in the sense that estimates are computed on a subset of data always subsequent to the data used for training, which means that they take time-order into account.
Our experimental study is performed on a large collection of nearly 1.5 million Twitter posts, which are domain-free and in 13 different languages. A realistic scenario is emulated by partitioning the data into 138 datasets by language and time window. Each dataset is split into an in-sample (a training plus test set), where estimation procedures are applied to approximate the performance of a model, and an out-of-sample used to compute the gold standard. Our goal is to understand the ability of each estimation procedure to approximate the true error incurred by a given model on the out-of-sample data.
The paper is structured as follows. Related work provides an overview of the state-of-the-art in estimation methods. In section Methods and experiments we describe the experimental setting for an empirical comparison of estimation procedures for sentiment classification of time-ordered Twitter data. We describe the Twitter sentiment datasets, a machine learning algorithm we employ, performance measures, and how the gold standard and estimation results are produced. In section Results and discussion we present and discuss the results of comparisons of the estimation procedures along several dimensions. Conclusions provide the limitations of our work and give directions for the future.

Related work
In this section we briefly review typical estimation methods used in sentiment classification of Twitter data. In general, for time-ordered data, the estimation methods used are variants of cross-validation, or are derived from the methods used to analyze time series data. We examine the state-of-the-art of these estimation methods, pointing out their advantages and drawbacks. Several works in the literature on sentiment classification of Twitter data employ standard cross-validation procedures to estimate the performance of sentiment classifiers. For example, Agarwal et al. [2] and Mohammad et al. [3] propose different methods for sentiment analysis of Twitter data and estimate their performance using 5-fold and 10-fold cross-validation, respectively. Bermingham and Smeaton [4] produce a comparative study of sentiment analysis between blogs and Twitter posts, where models are compared using 10-fold cross-validation. Saif et al. [5] assess binary classification performance of nine Twitter sentiment datasets by 10-fold cross-validation. Other, similar applications of cross-validation are given in [6,7].
On the other hand, there are also approaches that use methods typical for time series data. For example, Bifet and Frank [8] use the prequential (predictive sequential) method to evaluate a sentiment classifier on a stream of Twitter posts. Moniz et al. [9] present a method for predicting the popularity of news from Twitter data and sentiment scores, and estimate its performance using a sequential approach in multiple testing periods.
The idea behind the K-fold cross-validation is to randomly shuffle the data and split it into K equally-sized folds. Each fold is a subset of the data randomly picked for testing. Models are trained on the K − 1 folds and their performance is estimated on the left-out fold. K-fold cross-validation has several practical advantages, such as an efficient use of all the data. However, it is also based on an assumption that the data is independent and identically distributed [10], which is often not true. For example, in time-ordered data, such as Twitter posts, the data are to some extent dependent due to the underlying temporal order of tweets. Therefore, using K-fold cross-validation means that one uses future information to predict past events, which might hinder the generalization ability of models.
There are several methods in the literature designed to cope with dependence between observations. The most common are sequential approaches typically used in time series forecasting tasks. Some variants of K-fold cross-validation which relax the independence assumption were also proposed. For time-ordered data, an estimation procedure is sequential when testing is always performed on the data subsequent to the training set. Typically, the data is split into two parts, where the first is used to train the model and the second is held out for testing. These approaches are also known in the literature as the out-of-sample methods [11,12].
Within sequential estimation methods one can adopt different strategies regarding train/test splitting, growing or sliding window setting, and eventual update of the models. In order to produce reliable estimates and test for robustness, Tashman [11] recommends employing these strategies in multiple testing periods. One should either create groups of data series according to, for example, different business cycles [13], or adopt a randomized approach, such as in [14]. A more complete overview of these approaches is given by Tashman [11].
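The growing and sliding window settings with multiple testing periods can be illustrated with a minimal Python sketch. The function name `sequential_splits` and its parameters are hypothetical and chosen for illustration only; the sketch is not the exact procedure of any of the cited works.

```python
def sequential_splits(n_items, test_size, n_periods, window=None):
    """Yield (train, test) index ranges over time-ordered items.

    The test block always follows the training block.
    window=None  -> growing window: training always starts at item 0.
    window=w     -> sliding window: training keeps only the last w items.
    """
    first_test_start = n_items - n_periods * test_size
    if first_test_start <= 0:
        raise ValueError("not enough items for the requested testing periods")
    for period in range(n_periods):
        test_start = first_test_start + period * test_size
        train_start = 0 if window is None else max(0, test_start - window)
        yield (range(train_start, test_start),
               range(test_start, test_start + test_size))


# Three testing periods of 10 items each over 100 time-ordered items.
growing = list(sequential_splits(100, test_size=10, n_periods=3))
sliding = list(sequential_splits(100, test_size=10, n_periods=3, window=50))
```

In the growing setting each later period trains on all earlier data, while the sliding setting discards the oldest items so that the training set has a fixed length.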
In stream mining, where a model is continuously updated, the most commonly used estimation methods are holdout and prequential [15,16]. The prequential strategy uses an incoming observation to first test the model and then to train it.
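The test-then-train order of the prequential strategy can be sketched as follows. The `MajorityClassModel` is a deliberately trivial, hypothetical stand-in for an incrementally updated classifier, and accuracy is used only to keep the example short.

```python
from collections import Counter


class MajorityClassModel:
    """Toy incremental model: predicts the most frequent label seen so far."""

    def __init__(self, default_label=0):
        self.counts = Counter()
        self.default_label = default_label

    def predict(self, item):
        if not self.counts:
            return self.default_label
        return self.counts.most_common(1)[0][0]

    def learn(self, item, label):
        self.counts[label] += 1


def prequential_accuracy(model, stream):
    """Each observation is first used to test the model, then to train it."""
    correct = total = 0
    for item, label in stream:
        correct += int(model.predict(item) == label)  # test first
        model.learn(item, label)                      # then train
        total += 1
    return correct / total if total else 0.0


# Hypothetical labeled stream with sentiment labels -1, 0, +1.
stream = [("t1", 1), ("t2", 1), ("t3", 0), ("t4", 1)]
accuracy = prequential_accuracy(MajorityClassModel(), stream)
```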
Besides sequential estimation methods, some variants of K-fold cross-validation were proposed in the literature that are specially designed to cope with dependency in the data and enable the application of cross-validation to time-ordered data. For example, blocked cross-validation (the name is adopted from Bergmeir [12]) was proposed by Snijders [17]. The method derives from a standard K-fold cross-validation, but there is no initial random shuffling of observations. This renders K blocks of contiguous observations. The problem of data dependency for cross-validation is addressed by McQuarrie and Tsai [18]. The modified cross-validation removes observations from the training set that are dependent on the test observations. The main limitation of this method is its inefficient use of the available data since many observations are removed, as pointed out in [19]. The method is also known as non-dependent cross-validation [12].
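The only difference between standard (random) and blocked K-fold cross-validation is the initial shuffling, which a short Python sketch makes explicit. The helper name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_items, k, shuffle, seed=0):
    """Split indices 0..n_items-1 into k folds.

    shuffle=True  -> standard K-fold: indices are shuffled before splitting.
    shuffle=False -> blocked K-fold: each fold is a contiguous block.
    """
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Distribute the remainder so that fold sizes differ by at most one.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


blocked_folds = kfold_indices(10, k=5, shuffle=False)
random_folds = kfold_indices(10, k=5, shuffle=True)
```

With blocked folds, a test fold in the middle of the data is still preceded and followed by training data, so the temporal order is respected only within blocks, not between training and test sets.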
The applicability of variants of cross-validation methods in time series data, and their advantages over traditional sequential validations are corroborated by Bergmeir et al. [12,20,21]. The authors conclude that in time series forecasting tasks, the blocked cross-validations yield better error estimates because of their more efficient use of the available data. Cerqueira et al. [22] compare performance estimation of various cross-validation and out-of-sample approaches on real-world and synthetic time series data. The results indicate that cross-validation is appropriate for the stationary synthetic time series data, while the out-of-sample approaches yield better estimates for real-world data.
Our contribution to the state-of-the-art is a large scale empirical comparison of several estimation procedures on Twitter sentiment data. We focus on the differences between the cross-validation and sequential validation methods, to see how important the violation of data independence is in the case of Twitter posts. We consider longer-term time-dependence between the training and test sets, and completely ignore finer-scale dependence at the level of individual tweets (e.g., retweets and replies). To the best of our knowledge, there is no settled approach yet regarding proper validation of models for Twitter time-ordered data. This work provides some results which contribute to bridging that gap.

Methods and experiments
The goal of this study is to recommend appropriate estimation procedures for sentiment classification of Twitter time-ordered data. We assume a static sentiment classification model applied to a stream of Twitter posts. In a real-case scenario, the model is trained on historical, labeled tweets, and applied to the current, incoming tweets. We emulate this scenario by exploring a large collection of nearly 1.5 million manually labeled tweets in 13 European languages (see subsection Data and models). Each language dataset is split into pairs of the in-sample data, on which a model is trained, and the out-of-sample data, on which the model is validated. The performance of the model on the out-of-sample data gives an estimate of its performance on the future, unseen data. Therefore, we first compute a set of 138 out-of-sample performance results, to be used as a gold standard (subsection Gold standard). In effect, our goal is to find the estimation procedure that best approximates this out-of-sample performance.
Throughout our experiments we use only one training algorithm (subsection Data and models), and two performance measures (subsection Performance measures). During training, the performance of the trained model can be estimated only on the in-sample data. However, there are different estimation procedures which yield these approximations. In machine learning, a standard procedure is cross-validation, while for time-ordered data, sequential validation is typically used. In this study, we compare three variants of cross-validation and three variants of sequential validation (subsection Estimation procedures). The goal is to find the in-sample estimation procedure that best approximates the out-of-sample gold standard. The error an estimation procedure makes is defined as the difference to the gold standard.

Data and models
We collected a large corpus of nearly 1.5 million Twitter posts written in 13 European languages. This is, to the best of our knowledge, by far the largest set of sentiment labeled tweets publicly available. We engaged native speakers to label the tweets based on the sentiment expressed in them. The sentiment label has three possible values: negative, neutral or positive. It turned out that the human annotators perceived the values as ordered. The quality of annotations varies though, and is estimated from the self- and inter-annotator agreements. All the details about the datasets, the annotator agreements, and the ordering of sentiment values are in our previous study [23]. The sentiment distribution and quality of individual language datasets are given in Table 1. The tweets in the datasets are ordered by tweet ids, which corresponds to ordering by the time of posting.
There are many supervised machine learning algorithms suitable for training sentiment classification models from labeled tweets. In this study we use a variant of Support Vector Machine (SVM) [24]. The basic SVM is a two-class, binary classifier. In the training phase, SVM constructs a hyperplane in a high-dimensional vector space that separates one class from the other. In the classification phase, the side of the hyperplane determines the class. A two-class SVM can be extended into a multi-class classifier which takes the ordering of sentiment values into account, and implements ordinal classification [25]. Such an extension consists of two SVM classifiers: one classifier is trained to separate the negative examples from the neutral-or-positives; the other separates the negative-or-neutrals from the positives. The result is a classifier with two hyperplanes, which partitions the vector space into three subspaces: negative, neutral, and positive. During classification, the distances from both hyperplanes determine the predicted class. A further refinement is a TwoPlaneSVMbin classifier. It partitions the space around both hyperplanes into bins, and computes the distribution of the training examples in individual bins. During classification, the distances from both hyperplanes determine the appropriate bin, but the class is determined as the majority class in the bin.
The vector space is defined by the features extracted from the Twitter posts. The posts are first pre-processed by standard text processing methods, i.e., tokenization, stemming/lemmatization (if available for a specific language), unigram and bigram construction, and elimination of terms that do not appear at least 5 times in a dataset. The Twitter specific pre-processing is then applied, i.e., replacing URLs, Twitter usernames and hashtags with common tokens, adding emoticon features for different types of emoticons in tweets, handling of repetitive letters, etc. The feature vectors are then constructed by the Delta TF-IDF weighting scheme [26].
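How two binary separators can yield three ordered classes is sketched below. The text does not specify how the two distances are combined, so the decision rule here, including the tie-breaking for contradictory sides, is purely an assumption for illustration and is not the TwoPlaneSVMbin rule, which uses bins and majority classes. The function name and arguments are hypothetical.

```python
def two_plane_predict(d_neg_vs_rest, d_rest_vs_pos):
    """Map two signed hyperplane distances to a sentiment label.

    d_neg_vs_rest: signed distance from the plane separating negative (< 0)
                   from neutral-or-positive (>= 0).
    d_rest_vs_pos: signed distance from the plane separating
                   negative-or-neutral (< 0) from positive (>= 0).
    Returns -1 (negative), 0 (neutral), or +1 (positive).
    ASSUMPTION: this simple rule is illustrative only.
    """
    if d_neg_vs_rest < 0 and d_rest_vs_pos < 0:
        return -1
    if d_neg_vs_rest >= 0 and d_rest_vs_pos >= 0:
        return 1
    if d_neg_vs_rest >= 0 and d_rest_vs_pos < 0:
        return 0
    # Contradictory sides (negative by the first plane, positive by the
    # second): trust the plane from which the example lies further away.
    return -1 if abs(d_neg_vs_rest) > abs(d_rest_vs_pos) else 1
```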
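Part of the Twitter-specific pre-processing can be sketched with regular expressions. The placeholder tokens (`<url>`, `<user>`, `<hashtag>`), the whitespace tokenization, and the choice to shorten runs of three or more identical letters to two are assumptions made for this example; the text does not give these details.

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
REPEAT_RE = re.compile(r"(\w)\1{2,}")  # a letter repeated 3 or more times


def normalize_tweet(text):
    """Lowercase a tweet, replace URLs, usernames and hashtags with common
    tokens, shorten repeated letters, and split on whitespace."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)       # URLs first: they may contain # or @
    text = USER_RE.sub("<user>", text)
    text = HASHTAG_RE.sub("<hashtag>", text)
    text = REPEAT_RE.sub(r"\1\1", text)    # "sooooo" -> "soo"
    return text.split()


def with_bigrams(tokens):
    """Return the unigrams followed by all bigrams of adjacent tokens."""
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


tokens = normalize_tweet("@Bob this is sooooo good http://t.co/x #win")
```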
In our previous study [23] we compared five variants of the SVM classifiers and Naive Bayes on the Twitter sentiment classification task. TwoPlaneSVMbin was always among the top, but statistically indistinguishable, best performing classifiers. It turned out that monitoring the quality of the annotation process has a much larger impact on the performance than the

Performance measures
Sentiment values are ordered, and distribution of tweets between the three sentiment classes is often unbalanced. In such cases, accuracy is not the most appropriate performance measure [8,23]. In this context, we evaluate performance with the following two metrics: Krippendorff's Alpha [27], and F1 [28].
Alpha was developed to measure the agreement between human annotators, but can also be used to measure the agreement between classification models and a gold standard. It generalizes several specialized agreement measures, takes ordering of classes into account, and accounts for the agreement by chance. Alpha is defined as follows:
$$\mathit{Alpha} = 1 - \frac{D_o}{D_e},$$
where $D_o$ is the observed disagreement between models, and $D_e$ is the disagreement expected by chance. When models agree perfectly, Alpha = 1, and when the level of agreement equals the agreement by chance, Alpha = 0. Note that Alpha can also be negative. The two disagreement measures are defined as:
$$D_o = \frac{1}{N} \sum_{c,c'} N(c,c') \, \delta^2(c,c'), \qquad D_e = \frac{1}{N(N-1)} \sum_{c,c'} N(c) \, N(c') \, \delta^2(c,c').$$
Here $\delta(c,c')$ denotes the difference function between the values $c$ and $c'$; the factor of four noted next corresponds to $\delta(c,c') = |c - c'|$ for $c, c' \in \{-1, 0, +1\}$. Note that disagreements $D_o$ and $D_e$ between the extreme classes (negative and positive) are four times larger than between the neighbouring classes.
A coincidence matrix tabulates all pairable values of c from two models. In our case, we have a 3-by-3 coincidence matrix, and compare a model to the gold standard. The coincidence matrix is then the sum of the confusion matrix and its transpose. Each labeled tweet is entered twice, once as a (c, c') pair, and once as a (c', c) pair. N(c, c') is the number of tweets labeled by the values c and c' by different models, N(c) and N(c') are the totals for each value, and N is the grand total.
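The computation of Alpha from a confusion matrix can be sketched in Python as follows. The sketch assumes the standard form Alpha = 1 − D_o/D_e with a squared difference between the class values −1, 0, +1, which is consistent with the extreme classes disagreeing four times more than neighbouring ones. The function name and the example matrix are hypothetical.

```python
def krippendorff_alpha(confusion):
    """Alpha for a 3x3 confusion matrix with classes ordered -1, 0, +1.

    confusion[i][j] = number of tweets with gold class i and predicted class j.
    ASSUMPTION: squared difference of class values as the disagreement weight.
    """
    values = [-1, 0, 1]
    k = len(values)
    # Coincidence matrix: confusion matrix plus its transpose,
    # so that each labeled tweet is entered twice.
    coincidence = [[confusion[i][j] + confusion[j][i] for j in range(k)]
                   for i in range(k)]
    totals = [sum(row) for row in coincidence]   # N(c)
    n = sum(totals)                              # grand total N

    def delta2(i, j):
        return (values[i] - values[j]) ** 2

    d_o = sum(coincidence[i][j] * delta2(i, j)
              for i in range(k) for j in range(k)) / n
    d_e = sum(totals[i] * totals[j] * delta2(i, j)
              for i in range(k) for j in range(k)) / (n * (n - 1))
    return 1.0 - d_o / d_e


# Hypothetical confusion matrix (rows: gold, columns: predicted).
example = [[8, 2, 0],
           [2, 6, 2],
           [0, 2, 8]]
alpha = krippendorff_alpha(example)
```

The sketch does not guard against the degenerate case where all tweets fall into one class, in which the expected disagreement is zero.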
F1 is an instance of the F score, a well-known performance measure in information retrieval [29] and machine learning. We use an instance specifically designed to evaluate the 3-class sentiment models [28]. F1 is defined as follows:
$$F_1 = \frac{F_1(-1) + F_1(+1)}{2}.$$
F1 implicitly takes into account the ordering of sentiment values, by considering only the extreme labels, negative (−1) and positive (+1). The middle, neutral, is taken into account only indirectly. F1(c) is the harmonic mean of precision and recall for class c, c ∈ {−1, +1}. F1 = 1 implies that all negative and positive tweets were correctly classified, and as a consequence, all neutrals as well. F1 = 0 indicates that all negative and positive tweets were incorrectly classified. F1 does not account for correct classification by chance.
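A minimal Python sketch of this measure follows, assuming that the overall score is the unweighted mean of the per-class scores for the negative and positive classes. The function names and the index convention (0 = negative, 1 = neutral, 2 = positive) are choices made for this example.

```python
def f1_for_class(confusion, c):
    """Harmonic mean of precision and recall for class index c.

    confusion[i][j] = number of tweets with gold class i and predicted class j.
    """
    true_positive = confusion[c][c]
    predicted = sum(row[c] for row in confusion)   # column total
    actual = sum(confusion[c])                     # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_sentiment(confusion):
    """Mean of the F1 scores of the negative (0) and positive (2) classes;
    the neutral class (1) enters only through the misclassifications."""
    return (f1_for_class(confusion, 0) + f1_for_class(confusion, 2)) / 2


# Hypothetical confusion matrix (rows: gold, columns: predicted).
example = [[8, 2, 0],
           [2, 6, 2],
           [0, 2, 8]]
score = f1_sentiment(example)
```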

Gold standard
We create the gold standard results by splitting the data into the in-sample datasets (abbreviated as in-set), and out-of-sample datasets (abbreviated as out-set). The terminology of the in- and out-set is adopted from Bergmeir et al. [12]. Tweets are ordered by the time of posting. To emulate a realistic scenario, an out-set always follows the in-set. From each language dataset (Table 1) we create L in-sets of varying length in multiples of 10,000 consecutive tweets, where L = ⌊N/10000⌋ and N is the number of tweets in the language dataset. The out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of each language dataset. This is illustrated in Fig 1. The partitioning of the language datasets results in 138 in-sets and corresponding out-sets. For each in-set, we train a TwoPlaneSVMbin sentiment classification model, and measure its performance, in terms of Alpha and F1, on the corresponding out-set. The results are in Tables 2 and 3. Note that the performance measured by Alpha is considerably lower in comparison to F1, since the baseline for Alpha is classification by chance.
The 138 in-sets are used to train sentiment classification models and estimate their performance. The goal of this study is to analyze different estimation procedures in terms of how well they approximate the out-set gold standard results shown in Tables 2 and 3.
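The partitioning of one language dataset into in-sets and out-sets can be sketched as below. The function name `in_out_sets` is hypothetical, and skipping an in-set whose out-set would be empty (when the dataset size is an exact multiple of 10,000) is an assumption, since the text does not cover that case.

```python
def in_out_sets(n_tweets, step=10_000):
    """Return (in_set, out_set) index ranges over time-ordered tweets.

    In-sets start at the first tweet and grow in multiples of `step`;
    each out-set is the next `step` tweets, or the remainder at the end.
    """
    pairs = []
    for i in range(1, n_tweets // step + 1):
        in_end = i * step
        out_end = min(in_end + step, n_tweets)
        if out_end > in_end:  # ASSUMPTION: drop pairs with an empty out-set
            pairs.append((range(0, in_end), range(in_end, out_end)))
    return pairs


# A hypothetical language dataset of 25,000 tweets gives two pairs.
pairs = in_out_sets(25_000)
```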

Estimation procedures
There are different estimation procedures, some more suitable for static data, while others are more appropriate for time-series data. Time-ordered Twitter data shares some properties of both types of data. When training an SVM model, the order of tweets is irrelevant and the model does not capture the dynamics of the data. When applying the model, however, new tweets might introduce new vocabulary and topics. As a consequence, the temporal ordering of training and test data has a potential impact on the performance estimates.
We therefore compare two classes of estimation procedures: cross-validation, commonly used in machine learning for model evaluation on static data, and sequential validation, commonly used for time-series data. There are many variants and parameters for each class of procedures. Our datasets are relatively large and an application of each estimation procedure takes several days to complete. We have selected three variants of each procedure to provide answers to some relevant questions.
Fig 1. Each language dataset (Table 1) is partitioned into L in-sets and corresponding out-sets. The in-sets always start at the first tweet and are progressively longer in multiples of 10,000 tweets. The corresponding out-set is the subsequent 10,000 consecutive tweets, or the remainder at the end of the language dataset. https://doi.org/10.1371/journal.pone.0194317.g001
In sequential validation, a sample consists of the training set immediately followed by the test set. We vary the ratio of the training and test set sizes, and the number and distribution of samples taken from the in-set. The number of samples is 10 or 20, and they are distributed equidistantly or semi-equidistantly. In all variants, samples cover the whole in-set, but they are overlapping. See Fig 2 for illustration. We use the following abbreviations for sequential validations:

Results and discussion
We compare six estimation procedures in terms of different types of errors they incur. The error is defined as the difference to the gold standard. First, the magnitude and sign of the errors show whether a method tends to underestimate or overestimate the performance, and by how much (subsection Median errors). Second, relative errors give fractions of small, moderate, and large errors that each procedure incurs (subsection Relative errors). Third, we rank the estimation procedures in terms of increasing absolute errors, and estimate the significance of the overall ranking by the Friedman-Nemenyi test (subsection Friedman test). Finally, selected pairs of estimation procedures are compared by the Wilcoxon signed-rank test (subsection Wilcoxon test).

Median errors
An estimation procedure estimates the performance (abbreviated Est) of a model in terms of Alpha and F1. The error it incurs is defined as the difference to the gold standard performance (abbreviated Gold): Err = Est − Gold. The validation results show high variability of the errors, with skewed distribution and many outliers. Therefore, we summarize the errors in terms of their medians and quartiles, instead of the averages and variances. The median errors of the six estimation procedures are in Tables 4 and 5, measured by Alpha and F1, respectively. Fig 3, in which dots correspond to the outliers, shows high variability of errors for individual datasets. This is most pronounced for the Serbian/Croatian/Bosnian (scb) and Portuguese (por) datasets where variation in annotation quality (scb) and a radical topic shift (por) were observed. Higher variability is also observed for the Spanish (spa) and Albanian (alb) datasets, which have poor sentiment annotation quality (see [23] for details). The differences between the estimation procedures are easier to detect when we aggregate the errors over all language datasets. The results are in Figs 4 and 5, for Alpha and F1, respectively. In both cases we observe that the cross-validation procedures (xval) consistently overestimate the performance, while the sequential validations (seq) underestimate it. The largest overestimation errors are incurred by the random cross-validation, and the largest underestimations by the sequential validation with the training:test set ratio 2:1. We also observe high variability of errors, with many outliers. The conclusions are consistent for both measures, Alpha and F1.
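Summarizing signed errors by their median and quartiles can be sketched with the Python standard library. The numbers below are made up for illustration and are not results from this study; the function name is hypothetical.

```python
import statistics


def error_summary(estimates, gold):
    """Quartiles of the signed errors Err = Est - Gold.

    A positive median means the procedure tends to overestimate the
    performance; a negative median means it tends to underestimate it.
    """
    errors = [est - g for est, g in zip(estimates, gold)]
    q1, median, q3 = statistics.quantiles(errors, n=4)
    return {"q1": q1, "median": median, "q3": q3}


# Made-up estimates and gold standard values for five in-sets.
est_values = [0.50, 0.42, 0.61, 0.35, 0.48]
gold_values = [0.45, 0.44, 0.50, 0.36, 0.40]
summary = error_summary(est_values, gold_values)
```

Medians and quartiles are less sensitive than means and variances to the skew and outliers described above.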

Relative errors
Another useful analysis of estimation errors is provided by a comparison of relative errors. The relative error is the absolute error an estimation procedure incurs divided by the gold standard result: RelErr = |Est − Gold|/Gold. We chose two, rather arbitrary, thresholds of 5% and 30%, and classify the relative errors as small (RelErr < 5%), moderate (5% ≤ RelErr ≤ 30%), and large (RelErr > 30%). Fig 6 shows the proportion of the three types of errors, measured by Alpha, for individual language datasets. Again, we observe a higher proportion of large errors for languages with poor annotations (alb, spa), annotations of different quality (scb), and different topics (por). Figs 7 and 8 aggregate the relative errors across all the datasets, for Alpha and F1, respectively. The proportion of errors is consistent between Alpha and F1, but there are more large errors when the performance is measured by Alpha. This is due to smaller error magnitude when the performance is measured by Alpha in contrast to F1, since Alpha takes classification by chance into account. With respect to individual estimation procedures, there is a considerable divergence of the random cross-validation. For both performance measures, Alpha and F1, it consistently incurs a higher proportion of large errors and a lower proportion of small errors in comparison to the rest of the estimation procedures.
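The three-way classification of relative errors can be written as a short Python function. The function name is hypothetical, and the example values are made up.

```python
def classify_relative_error(est, gold, small=0.05, large=0.30):
    """Classify RelErr = |Est - Gold| / Gold as small, moderate, or large.

    The sketch assumes a positive gold standard value; a zero or negative
    value (possible for Alpha) would need separate handling.
    """
    rel_err = abs(est - gold) / gold
    if rel_err < small:
        return "small"
    if rel_err <= large:
        return "moderate"
    return "large"


labels = [classify_relative_error(est, 0.50) for est in (0.51, 0.55, 0.70)]
```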

Friedman test
The Friedman test is used to compare multiple procedures over multiple datasets [30-33]. For each dataset, it ranks the procedures by their performance. It tests the null hypothesis that the average ranks of the procedures across all the datasets are equal. If the null hypothesis is rejected, one applies the Nemenyi post-hoc test [34] on pairs of procedures. The performance of two procedures is significantly different if their average ranks differ by at least the critical difference. The critical difference depends on the number of procedures to compare, the number of different datasets, and the selected significance level. In our case, the performance of an estimation procedure is taken as the absolute error it incurs: AbsErr = |Est − Gold|. The estimation procedure with the lowest absolute error gets the lowest (best) rank. The results of the Friedman-Nemenyi test are in Figs 9 and 10, for Alpha and F1, respectively.
For both performance measures, Alpha and F1, the Friedman rankings are the same. For six estimation procedures, 13 language datasets, and the 5% significance level, the critical difference is 2.09. In the case of F1 (Fig 10) all six estimation procedures are within the critical difference, so their ranks are not significantly different. In the case of Alpha (Fig 9), however, the two best methods are significantly better than the random cross-validation.
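The critical difference of 2.09 can be reproduced with the usual Nemenyi formula CD = q · sqrt(k(k + 1)/(6n)), where k is the number of procedures and n the number of datasets. The critical value q = 2.850 for k = 6 at the 5% level is taken from commonly tabulated values for the Nemenyi test; the text does not state which value was used, so it is an assumption here.

```python
import math


def nemenyi_critical_difference(q_alpha, n_procedures, n_datasets):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    k = n_procedures
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# ASSUMPTION: q_alpha = 2.850 is the tabulated value for k = 6, alpha = 0.05.
critical_difference = nemenyi_critical_difference(2.850, n_procedures=6,
                                                  n_datasets=13)
```

Two procedures whose average ranks differ by less than this value are not considered significantly different.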

Wilcoxon test
The Wilcoxon signed-rank test is used to compare two procedures on related data [33,35]. It ranks the differences in performance of the two procedures, and compares the ranks for the positive and negative differences. Greater differences count more, but the absolute magnitudes are ignored. It tests the null hypothesis that the differences follow a symmetric distribution around zero. If the null hypothesis is rejected one can conclude that one procedure outperforms the other at a selected significance level.
In our case, the performance of pairs of estimation procedures is compared at the level of language datasets. The absolute errors of an estimation procedure are averaged across the in-sets of a language. The average absolute error is then AvgAbsErr = ∑|Est − Gold|/L, where L is the number of in-sets. The results of the Wilcoxon test, for selected pairs of estimation procedures, for both Alpha and F1, are in Fig 11. The Wilcoxon test results confirm and reinforce the main results of the previous sections. Among the cross-validation procedures, blocked cross-validation is consistently better than the random cross-validation, at the 1% significance level. The stratified approach is better than the non-stratified one, but significantly (5% level) only for F1. The comparison of the sequential validation procedures is less conclusive. The training:test set ratio 9:1 is better than 2:1, but significantly (at the 5% level) only for Alpha. With the ratio 9:1 fixed, 20 samples yield better performance estimates than 10 samples, but significantly (5% level) only for F1. We found no significant difference between the best cross-validation and sequential validation procedures in terms of their average absolute errors.
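The rank sums at the core of the Wilcoxon signed-rank test can be sketched in plain Python. The sketch drops zero differences and assigns average ranks to ties, which is a common convention assumed here; it returns only the two rank sums and does not compute a p-value. The function name and the toy inputs are hypothetical.

```python
def wilcoxon_rank_sums(errors_a, errors_b):
    """Return (W_plus, W_minus) for paired samples.

    Differences are ranked by absolute value (ties get average ranks);
    W_plus sums the ranks of positive differences, W_minus of negative ones.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Toy paired values (scaled to integers to keep ties exact).
rank_sums = wilcoxon_rank_sums([10, 20, 30, 40], [12, 15, 30, 30])
```

In the setting above, the paired values would be the average absolute errors of two estimation procedures over the language datasets.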

Data and code availability
All Twitter data were collected through the public Twitter API and are subject to the Twitter terms and conditions. The Twitter language datasets are available in a public language resource repository CLARIN.SI at http://hdl.handle.net/11356/1054, and are described in [23]. There are 15 language files, where the Serbian/Croatian/Bosnian dataset is provided as three separate files for the constituent languages. For each language and each labeled tweet, there is the tweet ID (as provided by Twitter), the sentiment label (negative, neutral, or positive), and the annotator ID (anonymized). Note that the Twitter terms do not allow the original tweets to be published openly; they have to be fetched through the Twitter API. Precise details on how to fetch the tweets, given tweet IDs, are provided in the Twitter API documentation https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-lookup. However, upon request to the corresponding author, a bilateral agreement on the joint use of the original data can be reached.
The TwoPlaneSVMbin classifier and several other machine learning algorithms are implemented in an open source LATINO library [36]. LATINO is a light-weight set of software components for building text mining applications, openly available at https://github.com/latinolib.
All the performance results, for gold standard and the six estimation procedures, are provided in a form which allows for easy reproduction of the presented results. The R code and data files needed to reproduce all the figures and tables in the paper are available at http://ltorgo.github.io/TwitterDS/.

Conclusions
In this paper we present an extensive empirical study about the performance estimation procedures for sentiment analysis of Twitter data. Currently, there is no settled approach on how to properly evaluate models in such a scenario. Twitter time-ordered data shares some properties of static data for text mining, and some of time series data. Therefore, we compare estimation procedures developed for both types of data.
The main result of the study is that standard, random cross-validation should not be used when dealing with time-ordered data. Instead, one should use blocked cross-validation, a conclusion already corroborated by Bergmeir et al. [12,20]. Another result is that we find no significant differences between the blocked cross-validation and the best sequential validation. However, we do find that cross-validations typically overestimate the performance, while sequential validations underestimate it.
The results are robust in the sense that we use two different performance measures, several comparisons and tests, and a very large collection of data. To the best of our knowledge, we analyze and provide by far the largest set of manually sentiment-labeled tweets publicly available. There are some biased decisions in our creation of the gold standard though, which limit the generality of the results reported, and should be addressed in future work. An out-set always consists of 10,000 tweets, and immediately follows the in-set. We do not consider how the performance drops over longer out-sets, nor how frequently a model should be updated. More importantly, we intentionally ignore the issue of dependent observations, between the in- and out-sets, and between the training and test sets. In the case of tweets, short-term dependencies are demonstrated in the form of retweets and replies. Medium- and long-term dependencies are shaped by periodic events, influential users and communities, or individual user's habits. When this is ignored, the model performance is likely overestimated. Since we do this consistently, our comparative results still hold. The issue of dependent observations was already addressed for blocked cross-validation [21,37] by removing adjacent observations between the training and test sets, thus effectively creating a gap between the two. Finally, it should be noted that different Twitter language datasets are of different sizes and annotation quality, belong to different time periods, and that there are time periods in the datasets without any manually labeled tweets.
Fig 11. Results of the Wilcoxon test for selected pairs of estimation procedures, with F1 in the bottom panel. Thick solid lines denote significant differences at the 1% level, normal solid lines significant differences at the 5% level, and dashed lines insignificant differences. Arrows point from a procedure which incurs smaller errors to a procedure with larger errors. https://doi.org/10.1371/journal.pone.0194317.g011