Tweet sentiment quantification: An experimental re-evaluation

Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called “prevalence”) of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of “classify and count” (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimentation carried out in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We here re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. Due to the above-mentioned presence, in the test data, of samples characterised by class prevalence values very different from those of the training set, the results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.


Introduction
Quantification (also known as supervised prevalence estimation, or learning to quantify) is the task of training (by means of supervised learning) a predictor that estimates the relative frequency (also known as prevalence, or prior probability) of the classes of interest in a set (here often called a "sample") of unlabelled data items, where the data used to train the predictor are a set of labelled data items [21]. Quantification finds applications in fields (such as the social sciences [24], epidemiology [26], market research [8], and ecological modelling [3]) that inherently deal with aggregate (rather than individual) data, but is also relevant to other applications such as resource allocation [18], word sense disambiguation [7], and improving classifier fairness [5].
In the realm of textual data, one important domain to which quantification is applied is sentiment analysis [27,33]. In fact, as argued by Esuli et al. [9], many applications of sentiment classification are such that the final goal is not determining the class label (e.g., Positive, or Neutral, or Negative) of an individual unlabelled text (for example, a blog post, a response to an open question, or a comment on a product), but is that of determining the relative frequencies of the classes of interest in a set of unlabelled texts. In a 2016 paper, Gao and Sebastiani [20] (hereafter: [GS2016]) argued that, when the objects of analysis are tweets, the vast majority of sentiment classification efforts actually have quantification as their final goal, since hardly anyone who engages in sentiment classification of tweets is interested in the sentiment conveyed by a specific tweet. We call the resulting task tweet sentiment quantification [1,9,20].
It is well-known (see e.g., [15,16,18,21,23,28,35,37,40,42]) that solving quantification by means of "classify and count" (i.e., by classifying all the unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Driven by these considerations, [GS2016] presented an experimental comparison of 8 important quantification methods on 11 Twitter datasets annotated by sentiment, with the goal of assessing the strengths and weaknesses of the various methods for tweet sentiment quantification. That paper then became influential, and a standard reference on this problem, and describes what is currently the largest comparative experimentation on tweet sentiment quantification.
In this paper we argue that the experimental results obtained in [GS2016] are unreliable, as a result of the fact that the experimental protocol used in that paper was weak. We thus present new experiments in which we re-test all 8 quantification methods originally tested in [GS2016] (plus some additional ones that have been proposed since then) on the same 11 datasets used in [GS2016], this time using a now consolidated and much more robust experimental protocol. These new experiments (whose number is 5,775 times larger than the number of experiments conducted in [GS2016], even without counting the experiments on new quantification methods that had not been considered in [GS2016]) return results dramatically different from those obtained in [GS2016], and thus give us a new, more reliable picture of the relative merits of the various methods on the tweet sentiment quantification task.
The rest of this paper is structured as follows. In Section 2 we discuss experimental protocols for quantification, and argue why the experimentation carried out in [GS2016] is, in hindsight, weak. In Section 3 we present the new experiments we have run, briefly discussing the quantification methods and the datasets we use, and explaining in detail the experimental protocol we use. Section 4 discusses the results and the conclusions that they allow drawing, also pointing at how they differ from the ones of [GS2016]. Section 5 is devoted to concluding remarks.
We make all the code we use for our experiments available. Together with the fact that [GS2016] made available (in vector form) all their 11 datasets, this allows our experiments to be easily reproduced by other researchers.
2 Experimental Protocols for Quantification

Notation
In this paper we use the following notation. By x we indicate a document drawn from a domain X of documents, while by y we indicate a class drawn from a set of classes (also known as a codeframe) Y = {y_1, ..., y_{|Y|}}. Given x ∈ X and y ∈ Y, a pair (x, y) thus denotes a document with its true class label. Symbol σ denotes a sample, i.e., a non-empty set of (labelled or unlabelled) documents drawn from X. By p_σ(y) we indicate the true prevalence of class y in sample σ, by p̂_σ(y) we indicate an estimate of this prevalence, and by p̂^M_σ(y) we indicate the estimate of this prevalence obtained by means of quantification method M. Since 0 ≤ p_σ(y) ≤ 1 and 0 ≤ p̂_σ(y) ≤ 1 for all y ∈ Y, and since Σ_{y∈Y} p_σ(y) = Σ_{y∈Y} p̂_σ(y) = 1, the p_σ(y)'s and the p̂_σ(y)'s form two probability distributions across the same codeframe.
By D(p, p̂) we denote an evaluation measure for quantification; these measures are typically divergences, i.e., functions that measure the amount of discrepancy between two probability distributions. By L we denote a set of labelled documents, that we typically use as a training set, while by U we denote a set of unlabelled documents, that we typically use as a sample to quantify on. We take a hard classifier to be a function h : X → Y, and a soft classifier to be a function s : X → [0, 1]^{|Y|}, where s(x) is a vector of |Y| posterior probabilities (each indicated as Pr(y|x)), such that Σ_{y∈Y} Pr(y|x) = 1; Pr(y|x) indicates the probability of membership in y of item x as estimated by the soft classifier s. By δ_σ(y) we denote the set of documents in sample σ that have been assigned to class y by a hard classifier.

Why do we need quantification?
Quantification may be seen as the task of approximating a true distribution p_σ, where p_σ is defined on a sample σ and across the classes in a codeframe Y = {y_1, ..., y_{|Y|}}, by means of a predicted distribution p̂_σ; in other words, in quantification one needs to generate estimates p̂_σ(y_1), ..., p̂_σ(y_{|Y|}) of the true (and unknown) class prevalence values p_σ(y_1), ..., p_σ(y_{|Y|}), where Σ_{y∈Y} p̂_σ(y) = Σ_{y∈Y} p_σ(y) = 1. In this paper we consider a ternary sentiment quantification task (an example of single-label multiclass quantification) in which the codeframe is Y = {Positive, Neutral, Negative}, and where these three class labels will be indicated, for brevity, by the symbols ⊕, ⊙, and ⊖, respectively. All the 11 datasets discussed in Section 3.5 use this codeframe.
The reason why true quantification methods (i.e., methods different from the trivial "classify and count" mentioned in Section 1) are needed is the fact that many applicative scenarios suffer from distribution shift, the phenomenon according to which the distribution p_L(y) in the training set L may substantially differ from the distribution p_U(y) in the unlabelled data U that one needs to label [30,38]. The presence of distribution shift means that the well-known IID assumption, on which most learning algorithms for training classifiers are based, does not hold; in turn, this means that "classify and count" will perform less than optimally on sets of unlabelled items that exhibit distribution shift with respect to the training set, and that the higher the amount of shift, the worse we can expect "classify and count" to perform.

The APP and the NPP
There are two main experimental protocols that have been used in the literature for evaluating quantification; we will here call them the artificial-prevalence protocol (APP) and the natural-prevalence protocol (NPP).
The APP consists of taking a standard dataset 6, split into a training set L of labelled items and a set U of unlabelled items, and conducting repeated experiments in which either the training set prevalence values or the test set prevalence values of the classes are artificially varied by means of subsampling (i.e., by removing random elements of specific classes until the desired class prevalence values are obtained). In other words, subsampling is used either to generate s training samples L_1 ⊆ L, ..., L_s ⊆ L, or to generate t test samples U_1 ⊆ U, ..., U_t ⊆ U, or both, where the class prevalence values of the generated samples are predetermined and set in such a way as to generate a wide array of distribution shift values. This is meant to test the robustness of a quantifier (i.e., of an estimator of class prevalence values) in scenarios characterized by class prevalence values very different from the ones the quantifier has been trained on. For instance, in the binary quantification experiments carried out in [16], given codeframe Y = {y_1, y_2}, repeated experiments are conducted in which examples of either y_1 or y_2 are removed at random from the test set in order to generate predetermined prevalence values for y_1 and y_2 in the samples U_1, ..., U_t thus obtained. In this way, the different samples are characterised by a different prevalence of y_1 (e.g., p_U(y_1) ∈ {0.00, 0.05, ..., 0.95, 1.00}) and, as a result, by a different prevalence of y_2. This can be repeated, thus generating multiple random samples for each chosen pair of class prevalence values. Analogously, random removal of examples of either y_1 or y_2 can be performed on the training set, thus bringing about training samples with different values of p_L(y_1) and p_L(y_2).
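As an illustration, the subsampling step of the APP can be sketched as follows (a minimal NumPy sketch of the general idea, not QuaPy's actual implementation; all function and variable names are ours):

```python
import numpy as np

def app_sample(labels, target_prevs, size, rng):
    # Draw one APP sample: for each class, pick round(p * size) items at
    # random (without replacement) so the sample matches target_prevs.
    idx = []
    for c, p in target_prevs.items():
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=int(round(p * size)), replace=False))
    return np.array(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 500 + [1] * 500 + [2] * 500)   # a balanced "test set" U
sample = app_sample(y, {0: 0.7, 1: 0.2, 2: 0.1}, size=100, rng=rng)
prevs = [np.mean(y[sample] == c) for c in (0, 1, 2)]   # -> [0.7, 0.2, 0.1]
```

Repeating this extraction over a grid of target prevalence values yields samples spanning the whole range of possible distribution shifts.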
This protocol had been criticised (see [11]) because it may generate samples exhibiting class prevalence values very different from those of the set from which the sample was extracted, i.e., class prevalence values that might be hardly plausible in practice. As a result, one may resort to the NPP, which consists instead of conducting experiments on "real" datasets only, i.e., datasets consisting of a training set L and a test set U that have been sampled IID from the data distribution. In other words, no extraction of samples from the dataset is performed by perturbing the original class prevalence values; instead, a single train-and-test run is performed, using the original training set as the training sample L and the original test set as the test sample U.
The experimentation conducted by [GS2016] on tweet sentiment quantification is indeed an example of the NPP, since it relies on 11 "original" datasets of tweets annotated by sentiment, i.e., no extraction of samples at prespecified values of class prevalence was performed. However, [GS2016] probably failed to realise that, while in classification an experiment involving 11 different datasets probably counts as large and robust, this does not hold in quantification if only one test per dataset is conducted. The reason is that the objects of quantification are sets of documents, in the same way that the objects of classification are individual documents; testing a quantifier on just 11 sets of documents should thus be considered, from an experimental point of view, a drastically insufficient experimentation, akin to testing a classifier on 11 documents only.
Unfortunately, finding a large enough set (say, 1,000 or more) of datasets sampled IID from the respective data distributions is nearly impossible; this indicates that extracting a large enough number of samples from the same dataset is probably the only way to go for evaluating quantification. 7 Indeed, most recent quantification works (e.g., [6,12,14,23,28,29,35,36,40]) adopt the APP, and not the NPP.

6 By "a standard dataset" we here mean any dataset that has originally been assembled for testing classification systems; any such dataset can be used for testing quantification systems too.

7 An example set of experiments that use the NPP on a large enough set of test sets is the one reported in [11], where the authors test quantifiers on 52 × 99 = 5,148 binary test sets. This results from the fact that, in using the RCV1-v2 test collection, they consider the 99 RCV1-v2 classes and bin the 791,607 RCV1-v2 test documents into 52 bins (each corresponding to a week's worth of data, since the RCV1-v2 data span one year) of 15,212 documents each on average. However, it is not always easy to find test collections with such a large amount of classes and annotated data, and this limits the applicability of the NPP. It should also be mentioned that, as Card and Smith [6] noted, the vast majority of the 5,148 RCV1-v2 binary test sets used in [11] exhibit very little distribution shift, which makes the testbed used in [11] unchallenging for quantification methods.
As a result, we should conclude that the experimentation conducted in [GS2016] is weak, and that the results of that experimentation are thus unreliable. We thus re-evaluate the same quantification methods that [GS2016] tested (plus some other more recent ones) on the same datasets, this time following the by now consolidated and much more robust APP; in our case, this turns out to involve 5,775 times as many experiments as were run in the original study, even without considering the experiments on quantification methods that had not been considered in [GS2016].
It might be argued that the APP is unrealistic because it generates test samples whose class prevalence values are too far away from the values seen in the test set from which they have been extracted, and that such scenarios are thus unlikely to occur in real applicative settings. However, in the absence of any prior knowledge about how the class prevalence values are allowed or expected to change in future data, the APP turns out to be not only the fairest protocol, since it relies on no assumptions that could penalize or benefit any particular method, but also the most interesting for quantification, since quantification is especially useful in cases of distribution shift. 8

Experiments
In this section we describe the experiments we have carried out in order to re-assess the merits of different quantification methods under the lens of the APP. We have conducted all these experiments using QuaPy, a software framework for quantification written in Python that we have developed and made available through GitHub.

Evaluation measures
As the measures of quantification error we use Absolute Error (AE) and Relative Absolute Error (RAE), defined as

AE(p, p̂) = (1/|Y|) Σ_{y∈Y} |p̂(y) − p(y)|  (1)

RAE(p, p̂) = (1/|Y|) Σ_{y∈Y} (|p̂(y) − p(y)| / p(y))  (2)

where p is the true distribution, p̂ is the estimated distribution, and Y is the set of classes of interest (Y = {⊕, ⊙, ⊖} in our case). Note that RAE is undefined when at least one of the classes y ∈ Y is such that its prevalence in the sample U is 0. To solve this problem, in computing RAE we smooth the p(y)'s and the p̂(y)'s, i.e., we compute

p̄(y) = (ε + p(y)) / (ε|Y| + Σ_{y'∈Y} p(y'))  (3)
8 Yet another way of saying this comes from the observation that, should we adopt the NPP instead of the APP, a method that trivially returns, as the class prevalence estimates for every test sample, the class prevalence values of the training set (this trivial method is commonly known in the quantification literature as the maximum likelihood prevalence estimator, or MLPE) would probably perform well, and might even beat all genuinely engineered quantification methods. The reason why it would probably perform well is that the expectations of the class prevalence values of samples drawn IID from the test set coincide with the class prevalence values of the test set, and these, again by virtue of the IID assumption, are likely to be close to those of the training set. In other words, the reason why MLPE typically performs well when evaluated according to the NPP does not lie in the (nonexistent) qualities of MLPE as a quantification method, but in the fact that the NPP is a weak evaluation protocol.
where p̄(y) denotes the smoothed version of p(y) and the denominator is just a normalising factor (same for the p̂(y)'s); following [18], we use the quantity ε = 1/(2|U|) as the smoothing factor. We then use the smoothed versions of p(y) and p̂(y) in place of their original non-smoothed versions in Equation 2; as a result, RAE is now always defined.
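The two evaluation measures, including the smoothing used for RAE, can be implemented in a few lines (a sketch of ours, consistent with the definitions above; not QuaPy's implementation):

```python
import numpy as np

def absolute_error(p_true, p_hat):
    # AE: mean absolute difference between true and estimated prevalences
    return float(np.mean(np.abs(np.asarray(p_hat) - np.asarray(p_true))))

def relative_absolute_error(p_true, p_hat, test_size):
    # smooth both distributions with eps = 1/(2|U|), then take the mean
    # of |p_hat(y) - p(y)| / p(y); smoothing keeps RAE always defined
    eps = 1.0 / (2 * test_size)
    pt = (np.asarray(p_true) + eps) / (eps * len(p_true) + 1.0)
    ph = (np.asarray(p_hat) + eps) / (eps * len(p_hat) + 1.0)
    return float(np.mean(np.abs(ph - pt) / pt))

# a sample of |U| = 100 tweets in which one class happens to be empty
ae = absolute_error([0.5, 0.5, 0.0], [0.4, 0.5, 0.1])   # -> 0.0666...
rae = relative_absolute_error([0.5, 0.5, 0.0], [0.4, 0.5, 0.1], test_size=100)
```

Note that without smoothing the RAE of this example would involve a division by zero, since the third class has true prevalence 0.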
The reason why we use AE and RAE is that, from a theoretical standpoint, they are, as has been recently argued [41], the most satisfactory evaluation measures for quantification. This means that we do not consider other measures used in [GS2016], such as KLD, NAE, NRAE, and NKLD, since [41] shows them to be inadequate for evaluating quantification.

Quantification methods used in [GS2016]
We now briefly describe the quantification methods used in [GS2016], which we also use in this paper.
The simplest quantification method (and the one that acts as a lower-bound baseline for all quantification methods) is the above-mentioned Classify and Count (CC), which, given a hard classifier h, consists of computing

p̂^CC_U(y_i) = (Σ_{y_j∈Y} C^h_{ij}) / |U|  (4)

where C^h_{ij} indicates the number of documents classified as y_i by h and whose true label is y_j. CC is an example of an aggregative quantification method, i.e., a method that requires the (hard or soft) classification of all the unlabelled items as an intermediate step. All the methods discussed in this section are aggregative.
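CC amounts to nothing more than counting the classifier's assignments; a minimal sketch (ours, for illustration):

```python
import numpy as np

def classify_and_count(predictions, classes):
    # CC: the estimated prevalence of a class is simply the fraction of
    # unlabelled items that the hard classifier assigned to it
    predictions = np.asarray(predictions)
    return np.array([np.mean(predictions == c) for c in classes])

# hard predictions of a hypothetical classifier on 10 documents
prevs = classify_and_count([0, 0, 1, 2, 2, 2, 2, 2, 1, 2], classes=[0, 1, 2])
# prevs -> [0.2, 0.2, 0.6]
```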
The Adjusted Classify and Count (ACC) quantification method (see [14,18]) derives from the observation that, by the law of total probability, it holds that

p̂^CC_U(y_i) = Σ_{y_j∈Y} Pr(δ(y_i) | y_j) · p_U(y_j)  (5)

where δ(y_i) denotes (see Section 2.1) the set of documents that have been assigned to class y_i by the hard classifier h. Equation 5 can be more conveniently rewritten as

p̂^CC_U(y_i) = Σ_{y_j∈Y} (C^h_{ij} / Σ_{y_x∈Y} C^h_{xj}) · p_U(y_j)  (6)

Note that the leftmost factor of Equation 6 is known (it is the fraction of documents that the classifier has assigned to class y_i, i.e., p̂^CC_U(y_i)), and that C^h_{ij} / Σ_{y_x∈Y} C^h_{xj} (which represents the disposition of the classifier to assign y_i when y_j is the true label), while unknown, can be estimated by k-fold cross-validation on L. Note also that p_U(y_j) is unknown (it is the goal of quantification to estimate it), and that there are |Y| instances of Equation 5, one for each y_i ∈ Y. We are then in the presence of a system of |Y| linear equations in |Y| unknowns (the p_U(y_j)'s); ACC thus consists of estimating these latter (i.e., computing p̂^ACC_U(y_j)) by solving this system of linear equations by means of standard techniques.
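The ACC adjustment thus amounts to solving a small linear system; a sketch of ours, with hypothetical misclassification rates:

```python
import numpy as np

# M[i, j] = Pr(classifier assigns y_i | true class is y_j), estimated in
# practice by k-fold cross-validation on L; the values below are hypothetical
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.7]])      # each column sums to 1

p_cc = np.array([0.5, 0.3, 0.2])    # prevalence estimates produced by CC

# ACC: solve the linear system M @ p = p_cc for the true prevalences p
p_acc = np.linalg.solve(M, p_cc)
p_acc = np.clip(p_acc, 0, None)     # negative solutions can occur in practice
p_acc = p_acc / p_acc.sum()         # project back onto the probability simplex
# p_acc -> approximately [0.571, 0.262, 0.167]
```

The clipping and renormalisation step is one common way of dealing with solutions that fall outside the probability simplex; other strategies exist.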
CC and ACC use the predictions generated by the hard classifier h, as is evident from the fact that both Equations 4 and 6 depend on factors of type C^h_{ij}. Since most classifiers can be configured to return "soft predictions" in the form of posterior probabilities Pr(y|x) (from which hard predictions are obtained by choosing the y for which Pr(y|x) is maximised), and since posterior probabilities contain richer information than hard predictions, it makes sense to try and generate probabilistic versions of the CC and ACC methods [4] by replacing "hard" counts C^h_{ij} with their expected values, i.e., with C^s_{ij} = Σ_{(x,y_j)∈U} Pr(y_i|x). One can thus define Probabilistic Classify and Count (PCC) as

p̂^PCC_U(y_i) = (Σ_{y_j∈Y} C^s_{ij}) / |U|  (7)

and Probabilistic Adjusted Classify and Count (PACC), which consists of estimating p_U(y_j) (i.e., computing p̂^PACC_U(y_j)) by solving the system of |Y| linear equations in |Y| unknowns

p̂^PCC_U(y_i) = Σ_{y_j∈Y} (C^s_{ij} / Σ_{y_x∈Y} C^s_{xj}) · p_U(y_j)  (8)

The fact that PCC is a probabilistic version of CC is evident from the structural similarity between Equations 4 and 7, which differ only in the fact that the hard classifier h of Equation 4 is replaced by a soft classifier s in Equation 7; the same goes for ACC and PACC, as is evident from the structural similarity of Equations 6 and 8.
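PCC simply replaces counted hard assignments with averaged posteriors; a minimal sketch (ours):

```python
import numpy as np

def probabilistic_classify_and_count(posteriors):
    # PCC: average the posterior probabilities Pr(y|x) over the sample,
    # instead of counting hard assignments
    return np.asarray(posteriors).mean(axis=0)

# hypothetical posteriors for 4 documents over {Positive, Neutral, Negative}
P = [[0.7, 0.2, 0.1],
     [0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
prevs = probabilistic_classify_and_count(P)   # -> [0.4, 0.3, 0.3]
```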
A further method that [GS2016] uses is the one proposed in [39] (which we here call SLD, from the names of its proposers, and which was called EMQ in [GS2016]); it consists of training a probabilistic classifier and then using the EM algorithm (i) to update (in an iterative, mutually recursive way) the posterior probabilities that the classifier returns, and (ii) to re-estimate the class prevalence values of the test set, until mutual consistency, defined as the situation in which

p̂_U(y) = (1/|U|) Σ_{x∈U} Pr(y|x)  (9)

is achieved for all y ∈ Y. Quantification methods SVM(KLD), SVM(NKLD), and SVM(Q) belong instead to the "structured output learning" camp. Each of them is the result of instantiating the SVMperf structured output learner [25] to optimise a different loss function. SVM(KLD) [11] minimises the Kullback-Leibler Divergence (KLD); SVM(NKLD) [10] minimises a version of KLD normalised by means of the logistic function; SVM(Q) [2] minimises Q, the harmonic mean of a classification-oriented loss (recall) and a quantification-oriented loss (RAE). Each of these learners generates a "quantification-oriented" classifier, and the quantification method consists of performing CC by using this classifier. These three learners inherently generate binary quantifiers (since SVMperf is an algorithm for learning binary predictors only), but we adapt them to work on single-label multiclass quantification. This adaptation consists of training one binary quantifier for each class in Y = {⊕, ⊙, ⊖} by applying a one-vs-all strategy. Once applied to a sample, these three binary quantifiers produce a vector of three estimated prevalence values, one for each class in Y; we then L1-normalize this vector so as to make the three class prevalence estimates sum up to one (this is also the strategy followed in [GS2016]).
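The EM iteration underlying SLD can be sketched as follows (our minimal rendition of the method of [39]; all names are ours, and we assume all training prevalences are nonzero):

```python
import numpy as np

def sld(posteriors, train_prevs, n_iter=100):
    # SLD/EMQ sketch: (E-step) rescale the classifier's posteriors by the
    # ratio between the current prevalence estimate and the training
    # prevalence; (M-step) re-estimate the prevalences as the mean of the
    # updated posteriors; repeat until (approximate) mutual consistency.
    P = np.asarray(posteriors, dtype=float)            # shape (|U|, |Y|)
    train_prevs = np.asarray(train_prevs, dtype=float)
    prevs = train_prevs.copy()
    for _ in range(n_iter):
        weighted = P * (prevs / train_prevs)           # E-step
        weighted /= weighted.sum(axis=1, keepdims=True)
        prevs = weighted.mean(axis=0)                  # M-step
    return prevs

est = sld([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]], train_prevs=[0.5, 0.5])
```

In practice a convergence check on the change in `prevs` would replace the fixed iteration count.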

Additional quantification methods
From the "structured output learning" camp we also consider SVM(AE) and SVM(RAE), i.e., variants of the above-mentioned methods that minimise (instead of KLD, NKLD, or Q) the AE and RAE measures, since these latter are, for the reasons discussed in Section 3.1, the evaluation measures used in this paper for evaluating the quantification accuracy of our systems. We consider SVM(AE) only when using AE as the evaluation measure, and we consider SVM(RAE) only when using RAE as the evaluation measure; this obeys the principle that a sensible user, after deciding on the evaluation measure to use for their experiments, would instantiate SVMperf with that measure, and not with others. These methods have never been used before in the literature, but are obvious variants of the last three methods we have described.
We also include two methods based on the notion of a quantification ensemble [35,36]. Each such ensemble consists of n base quantifiers, trained from randomly drawn samples of q documents each, where these samples are characterised by different class prevalence values. At testing time, class prevalence values are estimated as the average of the estimates returned by the base members of the ensemble. We include two ensemble-based methods recently proposed by Pérez-Gállego et al. [36]; in both methods, a selection of members for inclusion in the final ensemble is performed before computing the final estimate. The first method we consider is E(PACC)_Ptr, a method based on an ensemble of PACC-based quantifiers to which a dynamic selection policy is applied. This policy consists of selecting the n/2 base quantifiers that have been trained on the n/2 samples characterised by the prevalence values most similar to the one being tested upon (where similarity was previously estimated using all members of the ensemble). We further consider E(PACC)_AE, a method which performs a static selection of the n/2 members that deliver the smallest absolute error on the training samples. In our experiments we use n = 50 and q = 1,000.
We also report results for HDy [22], a probabilistic binary quantification method that views quantification as the problem of minimising the divergence (measured in terms of the Hellinger Distance) between two cumulative distributions of posterior probabilities returned by the classifier, one coming from the unlabelled examples and the other coming from a validation set. HDy looks for the mixture parameter α that best fits the validation distribution (consisting of a mixture of a "positive" and a "negative" distribution) to the unlabelled distribution, and returns α as the estimated prevalence of the positive class. We adapt the method to the single-label multiclass scenario by using the one-vs-all strategy described above for the methods based on SVMperf.
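The binary mixture search at the heart of HDy can be sketched as follows (our illustration, using equal-width histograms of the positive-class posteriors and a simple grid search over α; synthetic Beta-distributed scores stand in for real classifier outputs):

```python
import numpy as np

def hdy(val_pos, val_neg, test_scores, bins=10):
    # histogram the positive-class posterior in each of the three sets
    edges = np.linspace(0, 1, bins + 1)
    h_pos = np.histogram(val_pos, bins=edges)[0] / len(val_pos)
    h_neg = np.histogram(val_neg, bins=edges)[0] / len(val_neg)
    h_test = np.histogram(test_scores, bins=edges)[0] / len(test_scores)
    # grid search for the mixture weight alpha minimising the Hellinger
    # distance between the mixed validation histogram and the test histogram
    best_alpha, best_dist = 0.0, np.inf
    for alpha in np.linspace(0, 1, 101):
        mix = alpha * h_pos + (1 - alpha) * h_neg
        dist = np.sqrt(np.sum((np.sqrt(mix) - np.sqrt(h_test)) ** 2))
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha   # estimated prevalence of the positive class

rng = np.random.default_rng(0)
val_pos = rng.beta(5, 1, 1000)    # posteriors of validation positives
val_neg = rng.beta(1, 5, 1000)    # posteriors of validation negatives
test = np.concatenate([rng.beta(5, 1, 300), rng.beta(1, 5, 700)])  # 30% positive
alpha = hdy(val_pos, val_neg, test)
```

The original method additionally averages over several bin sizes; we fix one bin size here for brevity.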
ACC and PACC define two simple linear adjustments to be applied to the aggregated scores returned by general-purpose classifiers. We also use a more recently proposed adjustment method based on deep learning, called QuaNet [12]. QuaNet models a neural, non-linear adjustment by taking as input (i) all the class prevalence values as estimated by CC, ACC, PCC, PACC, and SLD; (ii) the posterior probabilities Pr(y|x) for each document x and for each class y ∈ Y; and (iii) embedded representations of the documents. As the method for generating the document embeddings we simply perform principal component analysis and retain the 100 most informative components. QuaNet relies on a recurrent neural network (a bidirectional LSTM) to produce "sample embeddings" (i.e., dense, multi-dimensional representations of the test samples as observed from the input data), which are then concatenated with the class prevalence estimates obtained by CC, ACC, PCC, PACC, and SLD, and then used to generate the final prevalence estimates by transforming this vector through a set of feed-forward layers (of size 1,024 and 512), followed by ReLU activations and dropout (with drop probability set to 0.5).

Underlying classifiers
Consistently with [GS2016], as the classifier underlying CC, ACC, PCC, PACC, and SLD we use one trained by means of L2-regularised logistic regression (LR); we also do the same for E(PACC)_Ptr, E(PACC)_AE, HDy, and QuaNet. The reasons for this choice are the same as described in [GS2016], i.e., the fact that logistic regression is known to deliver very good classification accuracy across a variety of application domains, and the fact that a classifier trained by means of LR returns posterior probabilities that tend to be fairly well-calibrated, a fact which is of fundamental importance for methods such as PCC, PACC, SLD, HDy, and QuaNet. By using the same learner used in [GS2016] we also allow a more direct comparison of results.

Datasets
The datasets on which we run our experiments are the same 11 datasets on which the experiments of [GS2016] were carried out, and whose characteristics are described succinctly in Table 1. As already noted at the end of Section 1, [GS2016] makes these datasets available already in vector form; we refer to [GS2016] for a fuller description of these datasets.
Note that [GS2016] had generated these vectors by using state-of-the-art, tweet-specific preprocessing, which included, e.g., URL normalisation, detection of exclamation and/or question marks, emoticon recognition, and computation of "the number of all-caps tokens, (...), the number of hashtags, the number of negated contexts, the number of sequences of exclamation and/or question marks, and the number of elongated words" [GS2016, §4.1]; in other words, every effort was made in [GS2016] to squeeze every little bit of information from these tweets, in a tweet-specific way, in order to enhance accuracy as much as possible.
In the experiments described in this paper we perform feature selection by discarding all features that occur in fewer than 5 training documents.
According to the principles of the APP, as described in Section 2.3, for each of the 11 datasets we here extract multiple samples from the test set, according to the following protocol. For each different triple (p(⊕), p(⊙), p(⊖)) of class prevalence values such that each class prevalence is in the finite set P = {0.00, 0.05, ..., 0.95, 1.00} and such that the three values sum up to 1, we extract m random samples of q documents each such that the extracted samples exhibit the class prevalence values described by the triple. In these experiments we use m = 25 and q = 100. For each label y ∈ {⊕, ⊙, ⊖} and for each sample, the extraction is carried out by means of sampling without replacement. It is easy to verify that there exist |P|(|P| + 1)/2 = 231 different triples with values in P. Our experimentation of a given quantification method M on a given dataset thus consists of using the validation samples for optimising the hyperparameters, retraining M on the entire labelled set L ≡ L_Tr ∪ L_Va using the optimal hyperparameter values, and testing the trained system on each of the 25 × 231 = 5,775 samples extracted from the test set U. This is sharply different from [GS2016], where the experimentation of a quantification method M on a given dataset consists of testing the trained system on one sample only, i.e., on the entire set U.

Table 1. Datasets used in this work and their main characteristics. Columns L_Tr, L_Va, and U contain the numbers of tweets in the training set, held-out validation set, and test set, respectively. Column "Shift" contains the values of distribution shift between L ≡ L_Tr ∪ L_Va and U, measured in terms of absolute error; columns p_L(⊕), p_L(⊙), and p_L(⊖) contain the class prevalence values of our three classes of interest in the training set L, while columns p_U(⊕), p_U(⊙), and p_U(⊖) contain the class prevalence values for the unlabelled set U.
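The combinatorics of the protocol are easy to check programmatically (a quick sketch; we encode each prevalence value p on the grid as the integer p/0.05):

```python
# Grid P = {0.00, 0.05, ..., 1.00} has 21 points; represent prevalence i*0.05
# by the integer i, and enumerate the triples that sum to 1 (i.e., to 20)
triples = [(i, j, 20 - i - j) for i in range(21) for j in range(21) if i + j <= 20]
assert len(triples) == 21 * 22 // 2 == 231   # |P|(|P| + 1)/2 = 231

# with m = 25 samples per triple, each method is tested on 25 * 231 = 5,775
# samples per dataset
n_samples = 25 * len(triples)   # -> 5775
```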

Parameter optimisation
Parameter optimisation is an important factor, one that could bias, if not carried out properly, a comparative experimentation of different quantification methods. As we have argued elsewhere [31], when the quantification method is of the aggregative type, for this experimentation to be unbiased, not only is it important to optimise the hyperparameters of the classifier that underlies the quantification method, but it is also important that this optimisation is carried out using a quantification-oriented loss, and not a classification-oriented loss.
In order to optimise a quantification-oriented loss it is necessary to test each hyperparameter setting on multiple samples extracted from the held-out validation set, in the style of the evaluation described in Section 3.5. In order to do this, for each combination of class prevalence values we extract, from the held-out validation set of each dataset, m samples of q documents each, again using class prevalence values in P = {0.00, 0.05, ..., 0.95, 1.00}. Here we use m = 5 and q = 100; we use a value of m five times smaller than in the evaluation phase (see Section 3.5) in order to keep the computational cost of the parameter optimisation phase within acceptable bounds.
For each label y ∈ {⊕, ⊙, ⊖} and for each sample, the extraction is carried out by sampling without replacement if the validation set contains at least p_y · q examples, and by sampling with replacement otherwise. In the experiments that we report in this paper, hyperparameter optimisation is carried out in terms of either the AE measure (Table 2) or the RAE measure (Table 3). We evaluate the former batch of experiments only in terms of AE and the latter batch only in terms of RAE, following the principle that, once a user knew the measure to be used in the evaluation, they would carry out the parameter optimisation phase in terms of exactly that measure.
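For reference, the two evaluation measures can be written down compactly. The additive smoothing constant ε = 1/(2q) used below for RAE is the one commonly adopted in the quantification literature; the section does not restate it, so its exact value here is our assumption.

```python
def ae(p_true, p_hat):
    """Absolute error: mean absolute difference between true and
    estimated class prevalence values."""
    return sum(abs(h - t) for t, h in zip(p_true, p_hat)) / len(p_true)

def rae(p_true, p_hat, q=100):
    """Relative absolute error, with additive smoothing (eps = 1/(2q))
    so that classes with zero true prevalence do not cause a
    division by zero."""
    eps = 1.0 / (2 * q)
    n = len(p_true)
    smooth = lambda p: (p + eps) / (1 + eps * n)
    return sum(abs(smooth(h) - smooth(t)) / smooth(t)
               for t, h in zip(p_true, p_hat)) / n
```

Note that AE reaches its maximum of 2(1 − min_y p(y))/|Y| when all the estimated mass is placed on the least prevalent class, e.g., ae((1, 0, 0), (0, 1, 0)) = 2/3 in the ternary case.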
Hereafter, with the notation M D we will indicate quantification method M with the parameters of the learner optimised using measure D.

Results
Table 2 reports the AE results obtained by the quantification methods of Sections 3.2 and 3.3 as tested on the 11 datasets of Section 3.5, while Table 3 does the same for RAE. The tables also report the results of a paired sample, two-tailed t-test that we have run, at different confidence levels, in order to check whether the other methods differ, in a statistically significant sense, from the best-performing one.
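In practice this test can be run with `scipy.stats.ttest_rel`; a dependency-free sketch of the underlying paired t statistic, computed over the per-sample error scores of two methods (function name is ours), is:

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """t statistic of a paired-sample, two-tailed t-test comparing the
    per-sample error scores of two quantification methods on the same
    test samples."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

The two-tailed p-value is then obtained from the Student t distribution with n − 1 degrees of freedom and compared against the 0.05 and 0.001 thresholds used in Tables 2 and 3.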
An important aspect that emerges from these tables is that the behaviour of the different quantifiers is fairly consistent across our 11 datasets; in other words, when a method is a good performer on one dataset, it tends to be a good performer on all datasets. Together with the fact that we test on a large set of samples, and that these are characterised by values of distribution shift across the entire range of all possible such shifts, this allows us to be fairly confident in the conclusions that we draw from these results.

Table 3. Values of RAE obtained in our experiments; each value is the average across 5,775 values, each obtained on a different sample. Boldface indicates the best method for a given dataset. Superscripts † and ‡ denote the methods (if any) whose scores are not statistically significantly different from the best one according to a paired sample, two-tailed t-test at different confidence levels: symbol † indicates that 0.001 < p-value < 0.05, while symbol ‡ indicates that 0.05 ≤ p-value. The absence of any such symbol indicates that p-value ≤ 0.001 (i.e., that the performance of the method is statistically significantly different from that of the best method). For ease of readability, for each dataset we colour-code cells in intense green for the best result, intense red for the worst result, and an interpolated tone for the scores in-between.

A second observation is that three methods (ACC, PACC, and SLD) stand out, since they perform consistently well across all datasets and for both evaluation measures. In particular, SLD is the best method for 7 out of 11 datasets (and is not different, in a statistically significant sense, from the best method on yet another dataset) when testing with AE, and for all 11 datasets when testing with RAE. PACC also performs very well, and is the best performer for 3 out of 11 datasets when testing with AE. The fact that both ACC and PACC tend to perform well shows that the intuition according to which CC predictions should be "adjusted", by estimating the disposition of the classifier to assign class y_i when class y_j is the true label, is valuable and robust to varying levels of distribution shift. The same goes for SLD, although SLD "adjusts" the CC predictions differently, i.e., by enforcing the mutual consistency (described by Equation 9) between the posterior probabilities and the class prevalence estimates.
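The "adjustment" intuition behind ACC is easiest to see in the binary case. The sketch below uses our own naming; in practice tpr and fpr would be estimated on held-out data, e.g., via k-fold cross-validation.

```python
def adjusted_classify_and_count(p_cc, tpr, fpr):
    """ACC (binary case): invert the expected relation
    p_cc = tpr * p + fpr * (1 - p) to recover the true prevalence p
    from the raw classify-and-count estimate p_cc."""
    p = (p_cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to the valid [0, 1] range
```

For instance, a classifier with tpr = 0.8 and fpr = 0.2 applied to a sample with true prevalence 0.7 yields, in expectation, p_cc = 0.8 · 0.7 + 0.2 · 0.3 = 0.62, which the adjustment maps back to 0.7; PACC applies the same correction to the expected (soft) counts derived from the posterior probabilities.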
By contrast, these results show a generally disappointing performance on the part of all methods based on structured output learning, i.e., on the SVMperf learner. Note that the fact that SVM(KLD), SVM(NKLD), and SVM(Q) optimise a performance measure different from the one used in the evaluation (AE or RAE) cannot be the cause of this suboptimal performance, since the latter also characterises SVM(AE) when tested with AE as the evaluation measure, and SVM(RAE) when tested with RAE.
CC and PCC do not perform well either. If this was somehow to be expected for CC, it is surprising for PCC, which always performs worse than CC in our experiments, on all datasets and for both performance measures. It would be tempting to conjecture that this might be due to a supposedly insufficient quality of the posterior probabilities returned by the underlying classifier; however, this conjecture is implausible, since these same posterior probabilities did not prevent SLD from displaying sterling performance, and PACC from performing very well.
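SLD's mutual-consistency adjustment (Equation 9) is an instance of expectation maximisation, and can be sketched as follows; this is our own minimal implementation, not the authors' code.

```python
def sld(posteriors, train_prev, n_iter=100):
    """Saerens-Latinne-Decaestecker EM. Iteratively rescales the
    training-time posterior probabilities by the ratio between the
    current prevalence estimates and the training prevalences (E-step),
    then re-estimates the prevalences as the mean of the rescaled
    posteriors (M-step), until the two are mutually consistent."""
    prev = list(train_prev)
    for _ in range(n_iter):
        rescaled = []
        for post in posteriors:
            w = [p * prev[y] / train_prev[y] for y, p in enumerate(post)]
            z = sum(w)
            rescaled.append([v / z for v in w])
        prev = [sum(r[y] for r in rescaled) / len(rescaled)
                for y in range(len(prev))]
    return prev
```

When the test prevalences match the training ones the procedure is a fixed point; when they differ, the rescaling pulls the estimates away from the training prior, which is what makes SLD robust to distribution shift.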
Contrary to the observations reported in [36], the E(PACC)_Ptr and E(PACC)_AE ensemble methods fail to improve over the base quantifier (PACC) upon which they are built. The likely reason for this discrepancy is that, while Pérez-Gállego et al. [36] trained the base quantifiers on training samples of the same size as the original training set (i.e., they used q = |L|), we use smaller training samples (i.e., we use q = 1,000) in order to keep training times within reasonable bounds (this is also due to the fact that the datasets we consider in this study are much larger than those used in [36], not only in terms of the number of instances but especially in terms of the number of features).

We now turn to comparing the results of our experiments with the ones reported in [GS2016]. In order to do this, for each dataset we rank, in terms of their performance, the 8 quantification methods used in both batches of experiments, and compare the rank positions obtained by each method in the two batches. The results of this comparison are reported in Table 4 (for AE) and Table 5 (for RAE). Something that jumps to the eye when observing these tables is that our experiments lead to conclusions that are dramatically different from those drawn in [GS2016]. First, SLD now unquestionably emerges as the best performer, while it was often ranked among the worst performers in [GS2016]. Conversely, PCC was the winner on most (dataset, measure) combinations in [GS2016], while our experiments have shown it to be a bad performer. Other methods too see their merits disconfirmed by our experiments; in particular, ACC and PACC have climbed up the ranked list, while all other methods (especially SVM(KLD)) have lost ground.
The reason why these two batches of experiments lead to such different conclusions is, in all evidence, the amount of distribution shift that the methods have had to confront in the two scenarios. In the experiments of [GS2016] this shift was very moderate, since the only test sample used (which coincided with the entire test set) usually displayed class prevalence values not too different from the class prevalence values of the training set. This is shown in the last column of Table 1, where the shift between training set and test set (expressed in terms of absolute error) is reported for each dataset; shift values range between 0.0020 and 0.1055, with an average value across all datasets of 0.0301, which is a very low value. In our experiments, instead, the quantification methods need to confront class prevalence values that are sometimes very different from the ones of the training set; shift values range between 0.0000 and 0.6666, with an average value across all samples of 0.2350. This means that the quantification methods that have emerged in our experiments are the ones that are robust to possibly radical changes in these class prevalence values, while the ones that had fared well in the experiments of [GS2016] are the methods that tend to perform well merely in scenarios where these changes are bland.
This situation is well depicted in the plots of Figures 1 and 2. For generating these plots we have computed, for each of the 11 × 5,775 = 63,525 test samples, the distribution shift between the training set and the test sample (measured as the absolute error between the training distribution and the distribution of the test sample), and we have binned these 63,525 samples into bins characterised by approximately the same amount of distribution shift, using bins of width 0.05 (i.e., [0.00, 0.05], (0.05, 0.10], etc.). The plots show, for a given quantification method and for a given bin, the quantification error of the method, measured (by means of AE in the top figure and by means of RAE in the bottom figure) as the average error across all the samples in the bin. The plots clearly show that, for CC, PCC, SVM(KLD), SVM(NKLD), SVM(Q), as well as for the newly added SVM(AE) and SVM(RAE), this error increases in a very substantial manner as distribution shift increases. A common characteristic of this group of methods, which we will dub the "unadjusted" methods, is that none of them attempts to correct, or adjust, the counts resulting from the classification of data items; this results in quantification systems that behave reasonably well for test set class prevalence values close to the ones of the training set (i.e., for low values of distribution shift), but that tend to generate large errors for higher values of shift. The obvious conclusion is that failing to adjust makes a method not robust to high amounts of distribution shift, and that the reason why some of the unadjusted methods were successful in the evaluation of [GS2016] is that the latter confronted the methods with very low amounts of distribution shift. In fact, it is immediate to note from Figures 1 and 2 that, when distribution shift is between 0.0020 and 0.1055 (the values of distribution shift that the experiments of [GS2016] tackled; the region of Figures 1 and 2 between the two vertical dotted lines encloses values of shift up to that level), the difference in performance between different quantification methods is small.
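The binning used for these plots can be sketched as follows. This is a minimal version under our own naming (`shifts` holds the per-sample distribution shift values, `errors` the corresponding AE or RAE scores); for simplicity it uses left-closed bins, whereas the text describes right-closed ones, so edge handling may differ marginally.

```python
from collections import defaultdict

def average_error_by_shift_bin(shifts, errors, width=0.05):
    """Group per-sample quantification errors into distribution-shift
    bins of the given width and return the average error per bin,
    keyed by bin index (bin 0 covers [0.00, 0.05), bin 1 covers
    [0.05, 0.10), and so on)."""
    bins = defaultdict(list)
    for s, e in zip(shifts, errors):
        bins[int(s // width)].append(e)
    return {b: sum(es) / len(es) for b, es in sorted(bins.items())}
```

Plotting the resulting per-bin averages for each method, against the bin midpoints, reproduces the kind of curves shown in Figures 1 and 2.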
In our plots, by contrast, methods ACC, PACC, and SLD, along with the newly added HDy, QuaNet, E(PACC)_AE, and E(PACC)_Ptr, form a second group of methods, which we will dub the "adjusted" methods, since they all implement, in one way or another, strategies for post-processing the class prevalence estimations returned by the base classifiers. The quantification error displayed by the "adjusted" methods remains fairly stable across the entire range of distribution shift values, which is clearly the reason for their success in the APP-based evaluation we have presented here.
Figure 3 shows the estimated class prevalence value (y axis) that each method delivers, on average across all test samples and all datasets, for each true prevalence (x axis); results are displayed separately for each of the three target classes and for methods optimised according to either AE or RAE. Note that the ideal quantifier (i.e., one that makes zero-error predictions) would be represented by the diagonal (0,0)-(1,1), here displayed as a dotted line. These plots support our observation that two groups of methods, the "adjusted" vs. the "unadjusted", exist (this is especially evident for the ⊕ and ⊖ classes, where they originate two quite distinct bundles of curves), and show how the unadjusted methods fail to produce good estimates across the entire range of prevalence values. As could be expected, all methods intersect at approximately the same point, which corresponds to the average training prevalence of the class across all datasets (p_L(⊕) = 0.278, p_L(⊙) = 0.426, p_L(⊖) = 0.296), given that all methods tend to produce low error (hence similar values) for test class prevalence values close to the training ones.

Figure 4 displays box-plot diagrams for the error bias (i.e., for the signed error between the estimated prevalence value and the true prevalence value) for all methods and independently for each class, as averaged across all datasets and test samples. The "adjusted" methods show lower error variance, as witnessed by the fact that their box-plots (indicating the first and third quartiles of the distribution) tend to be squashed and their whiskers (indicating the maximum and minimum, disregarding outliers) tend to be shorter. Some methods tend to produce many outliers (see, e.g., ACC and PACC for the ⊖ class), which might be due to the fact that the adjustments that these methods perform may become unstable in some cases.
Overall, PACC and SLD, the two strongest methods among the quantification systems we have tested, seem also to be the methods displaying the smallest bias across the three classes.
As a final note, the reader might wonder why, for certain well-performing methods, quantification error even seems to decrease for particularly high values of distribution shift (see, e.g., ACC, PACC, SLD in Figure 1, or SLD and ACC in Figure 2). The answer is that quantification error values for very high levels of shift are, in our experiments, not terribly reliable, because (as clearly shown by the green histograms in Figures 1 and 2) they are averages across very few data points. To see this, note that the values of AE range (see [41]) between 0 (best) and 2(1 − min_{y∈Y} p(y))/|Y| (worst), which in our ternary case means 2(1 − 0)/3 ≈ 0.67 (because we indeed have test samples in which the prevalence of at least one class is 0). However, there are many more samples with extremely low AE values than samples with extremely high AE values; for instance, out of the 11 × 5,775 = 63,525 samples that we have generated in our experiments (see Section 2.3), there are only 25 whose value of distribution shift lies in the interval [0.60, 0.66], while there are no fewer than 3,300 whose value lies in the interval [0.00, 0.06], even though the two intervals have the same width. To see why, note for instance that we can reach an AE value of 2/3 only when one of the classes in the training set has a prevalence value of 0 (see Equation 10), while an AE value of 0 can be reached for all training sets. As a result, the average AE values at the extreme right of the plots in Figures 1 and 2 (say, those beyond x = 0.55) are averages across very few data points, and are thus unstable and unreliable. This does not invalidate our general observations, though, since each quantification method we test displays, on the [0.00, 0.55] interval, a very clear, unmistakable behaviour.
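The bound invoked above is easy to check numerically (helper names are ours):

```python
def ae(p_true, p_hat):
    """Absolute error between a true and an estimated distribution."""
    return sum(abs(h - t) for t, h in zip(p_true, p_hat)) / len(p_true)

def ae_upper_bound(p_true):
    """Maximum attainable AE for a given true distribution:
    2 * (1 - min_y p(y)) / |Y|, attained by an estimate that places all
    the probability mass on the least prevalent class."""
    return 2 * (1 - min(p_true)) / len(p_true)
```

For a ternary distribution (1, 0, 0) the bound is 2/3 ≈ 0.67, attained, e.g., by the estimate (0, 1, 0); for distributions with no zero-prevalence class the bound is strictly lower, which is consistent with extreme shift values being rare in our samples.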

Conclusions
The results of our experiments show that a re-evaluation of the relative merits of different quantification methods on the tweet sentiment quantification task was necessary. We have shown that the experimentation previously conducted in [GS2016] was weak, since the experimental protocol that was followed led the authors of this study to conduct their evaluation on a radically insufficient amount of test data points. We have then conducted a re-evaluation of the same methods on the same datasets according to a more robust, and now widely accepted, experimental protocol, which has led to an experimentation on a number of data points 5,775 times larger than that of [GS2016]. In addition to these experiments, we have also tested some further methods, some of which had appeared after [GS2016] was published. This experimentation has proven necessary for at least two reasons. The first reason is that some of the evaluation functions (such as KLD and NKLD) that had been used in [GS2016] are now known to be unsatisfactory, and their use should thus be deprecated in favour of functions such as AE and RAE. The second reason, and probably the most important one, is that the results of our re-evaluation have radically disconfirmed the conclusions originally drawn by the authors of [GS2016], showing that the methods (e.g., PCC) that had emerged as the best performers in [GS2016] tend to behave well only in situations characterised by very low distribution shift; on the contrary, when distribution shift increases, other methods (such as SLD) are to be preferred. In particular, our experiments do justice to the SLD method, which had obtained fairly bland results in the experiments of [GS2016], and which now emerges as the true leader of the pack, thanks to consistently good performance across the entire spectrum of distribution shift values.

Fig 1 .
Fig 1. Performance of the various quantification methods, represented by the coloured lines and measured in terms of AE (lower is better), as a function of the distribution shift between training set and test sample; the results are averages across all samples in the same bin, i.e., characterised by approximately the same amount of shift, independently of the dataset they were sampled from. The two vertical dotted lines indicate the range of distribution shift values exhibited by the experiments of [GS2016] (i.e., in those experiments, the AE values of distribution shift range between 0.0020 and 0.1055). The green histogram in the background shows how the samples we have tested upon are distributed across the different bins.

Fig 2 .
Fig 2. Performance of the various quantification methods, represented by the coloured lines and measured in terms of RAE (lower is better), as a function of the distribution shift between training set and test sample; the results are averages across all samples in the same bin, i.e., characterised by approximately the same amount of shift, independently of the dataset they were sampled from. Unlike in Figure 1, for better clarity these results are displayed on a logarithmic scale. The two vertical dotted lines indicate the range of distribution shift values exhibited by the experiments of [GS2016] (i.e., in those experiments, the AE values of distribution shift range between 0.0020 and 0.1055). The green histogram in the background shows how the samples we have tested upon are distributed across the different bins.

Fig 3 .
Fig 3. Estimated prevalence as a function of true prevalence according to the various quantification methods. Results are displayed separately for classes ⊕ (top), ⊙ (middle), and ⊖ (bottom), with methods optimised according to AE (left) and RAE (right).

Fig 4 .
Fig 4. Box-plots of the error bias (signed error). Results are displayed separately for classes ⊕ (top), ⊙ (middle), and ⊖ (bottom), with methods optimised according to AE (left) and RAE (right).

Table 2 .
Values of AE obtained in our experiments; each value is the average across 5,775 values, each obtained on a different sample. Boldface indicates the best method for a given dataset. Superscripts † and ‡ denote the methods (if any) whose scores are not statistically significantly different from the best one according to a paired sample, two-tailed t-test at different confidence levels: symbol † indicates that 0.001 < p-value < 0.05, while symbol ‡ indicates that 0.05 ≤ p-value. The absence of any such symbol indicates that p-value ≤ 0.001 (i.e., that the performance of the method is statistically significantly different from that of the best method). For ease of readability, for each dataset we colour-code cells in intense green for the best result, intense red for the worst result, and an interpolated tone for the scores in-between.

Table 4 .
Rank positions of the quantification methods in our AE experiments, and (between parentheses) the rank positions obtained by the same methods in the evaluation of [GS2016].Boldface indicates the best method in terms of average rank in our APP-based experiments, while underline is used to indicate the same for the NPP-based experiments of [GS2016].

Table 5 .
Rank positions of the quantification methods in our RAE experiments, and (between parentheses) the rank positions obtained by the same methods in the evaluation of [GS2016].Boldface indicates the best method in terms of average rank in our APP-based experiments, while underline is used to indicate the same for the NPP-based experiments of [GS2016].