Predictability of Extreme Events in Social Media

It is part of our daily social-media experience that seemingly ordinary items (videos, news, publications, etc.) unexpectedly gain an enormous amount of attention. Here we investigate how unexpected these extreme events are. We propose a method that, given some information on the items, quantifies the predictability of events, i.e., the potential of identifying in advance the most successful items. Applying this method to different data, ranging from views of YouTube videos to posts in Usenet discussion groups, we invariably find that the predictability increases for the most extreme events. This indicates that, despite the inherently stochastic collective dynamics of users, efficient prediction is possible for the most successful items.


Datasets
We worked with four different datasets. Each is composed of a collection of items, each of which has an associated time series of activity that we take as a proxy for attention; in the case of Stack-Overflow and Usenet, each post or vote has its own timestamp, while in the case of YouTube and PLOS all views are collected in time bins (day and month, respectively). All the data is available at Ref. [1] in a format prepared to be used with a Python script (also included in the fileset) that computes the Predictability measure we propose. In Fig. S1 we show the distributions of activity at a given time t*.
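To make the preprocessing concrete, the sketch below shows how timestamped events (as in the Stack-Overflow and Usenet data) can be aggregated into the binned activity series used for YouTube and PLOS. The function name and the one-day bin width are illustrative choices of ours, not taken from the released script.

```python
from collections import Counter

def activity_series(timestamps, bin_width, n_bins):
    """Aggregate raw event timestamps (e.g., posts or votes) into a
    fixed-width-bin activity time series of length n_bins."""
    counts = Counter(int(t // bin_width) for t in timestamps)
    return [counts.get(b, 0) for b in range(n_bins)]

# Daily activity (bin_width = 86400 s) of one item over its first 5 days
series = activity_series([100, 3600, 90000, 90060, 200000, 400000],
                         bin_width=86400, n_bins=5)
# series == [2, 2, 1, 0, 1]
```

The activity at a given time t* (as in Fig. S1) is then simply the cumulative sum of the series up to the bin containing t*.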

YouTube
The YouTube dataset is an unbiased collection (16.2 million) of publicly available videos published between Jan. 2012 and Apr. 2013. The data was retrieved through the YouTube API provided by Google (https://developers.google.com/youtube/code).

Usenet
The Usenet dataset is a collection (0.8 million) of publicly available threads in online discussion groups (they can be retrieved directly through Google Groups, which holds a large archive of them; for example, to view the group comp.os.linux, visit https://groups.google.com/forum/#!forum/comp.os.linux). All posts published in the groups indicated in Sec. SI 2.3 were retrieved before March 2008; see Ref. [2] for details.

Stack-Overflow
The Stack-Overflow dataset is a collection (4.

PLOS ONE
The PLOS ONE dataset is a collection (72,246) of publicly available scientific publications of the journal PLOS ONE, published between Dec. 2006 and Aug. 2013. The data was retrieved through the API provided by PLOS (http://api.plos.org) and is identical to the dataset published in Ref. [3].

Extreme Value statistics
The cumulative Generalized Pareto Distribution is given by [4]

F(x) = 1 - \left(1 + \frac{\xi (x - x_p)}{\sigma}\right)^{-1/\xi} \qquad \text{for } x > x_p. \tag{1}

In our analysis it is essential to account for the discretization of the observations (especially for small values), and therefore we used a discretized version of the Generalized Pareto Distribution, which has probability mass function

\rho(x) = \frac{\left(x - x_p + \sigma/\xi\right)^{-(1 + 1/\xi)}}{\zeta\!\left(1 + 1/\xi,\ \sigma/\xi\right)}, \qquad x = x_p, x_p + 1, \ldots, \tag{2}

where \zeta(s, q) is the Hurwitz zeta function. The parameters of Eq. (2) are estimated by a maximum-likelihood fit [5]. The goodness of the fit is measured by the p-value, computed as the probability that the Kolmogorov-Smirnov statistic between randomly sampled data (drawn from the fitted distribution) and their fit is larger than the one measured between the real data and their fit [6]. Fits with p-values larger than 0.05 are regarded as acceptable.
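As an illustration of this fitting pipeline, the sketch below fits the one-parameter discrete power law p(x) = x^{-α}/ζ(α, x_min), the simplest Hurwitz-zeta-normalized distribution of this family, used here as a stand-in for the two-parameter discretized fit, and evaluates the Kolmogorov-Smirnov distance; the bootstrap p-value described above repeats the fit on samples drawn from the fitted model. Function names and the use of scipy are our own choices, not those of the released script.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

def fit_discrete_powerlaw(x, xmin=1):
    """MLE of alpha for p(x) = x**-alpha / zeta(alpha, xmin), x >= xmin."""
    x = np.asarray(x, dtype=float)
    nll = lambda a: a * np.log(x).sum() + x.size * np.log(zeta(a, xmin))
    return minimize_scalar(nll, bounds=(1.01, 10.0), method="bounded").x

def ks_statistic(x, alpha, xmin=1):
    """KS distance between the empirical CDF and the fitted model CDF."""
    x = np.sort(np.asarray(x))
    support = np.arange(xmin, x[-1] + 1)
    model_cdf = np.cumsum(support ** -alpha / zeta(alpha, xmin))
    empirical_cdf = np.searchsorted(x, support, side="right") / x.size
    return np.abs(empirical_cdf - model_cdf).max()

# Synthetic check: sample from a (truncated) discrete power law, alpha = 2.5
rng = np.random.default_rng(1)
support = np.arange(1, 10001)
pmf = support ** -2.5
sample = rng.choice(support, size=5000, p=pmf / pmf.sum())
alpha_hat = fit_discrete_powerlaw(sample)  # close to 2.5
```

The bootstrap p-value then draws many samples of size len(sample) from the fitted model, refits each, and reports the fraction whose KS statistic exceeds the observed one.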
For each database we considered the fit of all data (see SI Sec. 2.1) and of the data partitioned into groups as described below. For the partitioned data, we report the results for the lowest threshold that guarantees statistical significance for at least 80% of the groups analyzed in each database. This allows us to obtain a good estimate of the AUC for a wide range of x* values (see Fig. 3).

Overall distributions
Here we report the fits to each database as a whole.

YouTube
The property used to partition the dataset was the Category: each uploaded video belongs to a pre-defined Category selected by the user (15 in total). The threshold used in the reported results is

Usenet
The dataset is partitioned in 9 different Discussion Groups, which cover different topics. The threshold used is

Stack Overflow
The Stack Overflow dataset is composed of questions and answers on computer-science topics. We use the questions and divide them into groups g according to their tags. Since each question has many tags, a classification procedure was performed. Programming languages are the most common type of tag, and we therefore selected the 10 most common tags of this type (see list in the table below). Tags which contained one of these 10 tags as a substring were considered equivalent to the short tags. Similarly, tags that could be associated with a single programming language were also merged. The remaining tags were collected in a group labelled rest, which also included all cases in the intersection of two or more programming-language groups. The complete grouping of the tags can be seen at Ref. [1] in the file "Stack-Overflow Lemmas". The threshold used for the fits is
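A minimal sketch of this kind of grouping follows. The language list and the longest-substring-match rule are illustrative simplifications of ours; the actual mapping is the "Stack-Overflow Lemmas" file at Ref. [1].

```python
# Assumed top-10 language tags (illustrative; the real list is in the
# "Stack-Overflow Lemmas" file at Ref. [1]).
LANGUAGE_TAGS = ["javascript", "objective-c", "python", "ruby",
                 "java", "php", "sql", "c++", "c#", "c"]

def classify(tags):
    """Map a question's tag list to a single language group, or 'rest'.

    A tag containing a language tag as a substring counts as that
    language (longest match wins, so 'javascript' is not 'java');
    questions matching two or more languages fall into 'rest'.
    """
    groups = set()
    for tag in tags:
        matches = [lang for lang in LANGUAGE_TAGS if lang in tag]
        if matches:
            groups.add(max(matches, key=len))  # longest match wins
    return groups.pop() if len(groups) == 1 else "rest"

# classify(["python-3.x", "list"]) -> "python"
# classify(["python", "java"])     -> "rest" (intersection of two languages)
```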

PLOS ONE
For the PLOS ONE dataset, we chose the number of authors of each paper as the grouping factor; the group labelled 12 actually contains all papers with more than 11 authors. The threshold used for the fits is

Quality of binary predictions
Comparing binary predictions and observations gives four possible results, given by the combination of the prediction (positive or negative) and its success (true or false). If A denotes the prediction of an event (an alarm), the hit rate (or True Positive Rate) and the false-alarm rate (or False Positive Rate) are defined as

\text{hit rate} \equiv \frac{\text{number of true positives}}{\text{number of positives}} = P(A|E), \qquad \text{false-alarm rate} \equiv \frac{\text{number of false positives}}{\text{number of negatives}} = P(A|\bar{E}). \tag{3}

These are analogous to measures like Accuracy and Specificity or Precision and Recall. Prediction strategies typically have a specificity parameter (e.g., controlling the rate of false positives). Varying this parameter, a prediction curve going from (0, 0) to (1, 1) is traced out in the hit×false-alarm space.
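For concreteness, a sketch (naming is our own) of how such a prediction curve is traced out by sweeping an alarm threshold over per-item scores:

```python
import numpy as np

def prediction_curve(score, event):
    """Hit rate P(A|E) vs. false-alarm rate P(A|not-E), obtained by
    sweeping the alarm threshold on `score` from highest to lowest."""
    order = np.argsort(np.asarray(score))[::-1]
    e = np.asarray(event)[order]
    hit = np.concatenate([[0.0], np.cumsum(e) / e.sum()])
    false_alarm = np.concatenate([[0.0], np.cumsum(1 - e) / (1 - e).sum()])
    return false_alarm, hit  # both run from (0, 0) to (1, 1)

fa, hit = prediction_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
# fa  == [0.0, 0.0, 0.5, 0.5, 1.0]
# hit == [0.0, 0.5, 0.5, 1.0, 1.0]
```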

Demonstration that strategy LD (Bayes classifier) is dominant
A strategy is dominant when, for any given false-alarm rate, the hit rate is maximized. Following definition (3), we write the y and x coordinates of the hit×false-alarm plot as

\text{hit rate} \equiv P(A|E) = \sum_{g=1}^{G} P(A|g) P(g|E) = \sum_{g=1}^{G} \pi_g y_g \equiv y, \qquad \text{false-alarm rate} \equiv P(A|\bar{E}) = \sum_{g=1}^{G} P(A|g) P(g|\bar{E}) = \sum_{g=1}^{G} \pi_g x_g \equiv x, \tag{4}

where for notational convenience y_g ≡ P(g|E), x_g ≡ P(g|Ē), and π_g ≡ P(A|g). Since predictions are issued based only on the information about the groups, strategies (both deterministic and stochastic) are defined uniquely by the π_g, while x_g and y_g are estimated from data. Computing the dominant strategy corresponds to finding the π_g that maximize y under the constraint \sum_{g=1}^{G} \pi_g x_g = x. This problem can be solved exactly by applying the simplex method. Order the G groups by decreasing P(E|g) and define h such that \sum_{g<h} x_g < x \le \sum_{g \le h} x_g; we write Eq. (4) as

y = \sum_{g<h} \pi_g y_g + \pi_h y_h + \sum_{g>h} \pi_g y_g, \qquad x = \sum_{g<h} \pi_g x_g + \pi_h x_h + \sum_{g>h} \pi_g x_g. \tag{5}

Isolating π_h in the second equation and introducing it in the first we obtain

y = \frac{y_h}{x_h}\, x + \sum_{g<h} \pi_g x_g \left(\frac{y_g}{x_g} - \frac{y_h}{x_h}\right) + \sum_{g>h} \pi_g x_g \left(\frac{y_g}{x_g} - \frac{y_h}{x_h}\right). \tag{6}

Notice that y_g/x_g is the contribution of group g to the slope of the prediction curve in the hit×false-alarm space. Since the groups are ordered by decreasing P(E|g), y_g/x_g also decreases with g. Therefore (y_g/x_g − y_h/x_h) > 0 for g < h and (y_g/x_g − y_h/x_h) < 0 for g > h, and Eq. (6) is maximized by making the second term as large as possible and the last term vanish. This is achieved choosing

\pi_g = 1 \ \text{for } g < h, \qquad \pi_g = 0 \ \text{for } g > h, \tag{7}

which corresponds to issuing positive predictions only to the groups with largest P(E|g) and is equivalent to strategy (LD) mentioned in the main text. Positive events are predicted for the marginal group h,

\pi_h = \frac{x - \sum_{g<h} x_g}{x_h}, \tag{8}

as much as needed to reach the required false-positive rate x.
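The resulting alarm probabilities can be written down directly; a sketch (naming is our own) that computes the π_g of Eqs. (7)-(8) for a required false-alarm rate x:

```python
def ld_strategy(x_g, y_g, x_target):
    """Alarm probabilities pi_g of the dominant strategy (LD):
    pi_g = 1 for the groups with largest slope y_g/x_g (equivalently,
    largest P(E|g)), a partial alarm for the marginal group h, and
    pi_g = 0 below it, so that sum(pi_g * x_g) == x_target."""
    order = sorted(range(len(x_g)), key=lambda g: y_g[g] / x_g[g], reverse=True)
    pi = [0.0] * len(x_g)
    remaining = x_target
    for g in order:
        pi[g] = min(1.0, remaining / x_g[g])
        remaining -= pi[g] * x_g[g]
        if remaining <= 0.0:
            break
    return pi

# Three groups already ordered by decreasing P(E|g): slopes 2.0, 1.0, 0.5
pi = ld_strategy([0.25, 0.25, 0.5], [0.5, 0.25, 0.25], x_target=0.375)
# pi == [1.0, 0.5, 0.0]; achieved hit rate = 1*0.5 + 0.5*0.25 = 0.625
```

By construction, no other choice of π_g reaches a higher hit rate at the same false-alarm rate, which is the dominance property shown above.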

Computation of Π for the optimal strategy
As illustrated in Fig. 2(b), the partition performed by the optimal strategy defines G different intervals on the hit and false-alarm axes (the points for which P(E|g) = P*, g ∈ {1, …, G}) and therefore G² rectangles in the hit×false-alarm space. The (g, h) rectangle has height P(h)P(E|h)/P(E) = P(h|E) and width P(g|Ē) (where Ē is the complement of E, i.e., P(Ē|g) = 1 − P(E|g)), and therefore it has an area A_{g,h} = P(h|E)P(g|Ē). The curve of strategy (LD) is the union of the diagonals of the g = h rectangles (which are obtained by increasing P*). Π is two times the sum of the rectangles and triangles under this curve minus half of all the area:

\Pi = 2\left[\sum_g \sum_{h<g} A_{g,h} + \frac{1}{2}\sum_g A_{g,g} - \frac{1}{2}\sum_g \sum_h A_{g,h}\right]
= \sum_g \sum_{h<g} A_{g,h} - \sum_g \sum_{h>g} A_{g,h}
= \sum_g \sum_{h<g} \left(A_{g,h} - A_{h,g}\right)
= \sum_g \sum_{h<g} \left[P(h|E)P(g|\bar{E}) - P(h|\bar{E})P(g|E)\right]
= \sum_g \sum_{h<g} \frac{P(g)P(h)\left[P(E|h) - P(E|g)\right]}{P(E)\left[1 - P(E)\right]},

where we used \sum_g \sum_h A_{g,h} = 1. This finishes our demonstration of Eq. (2) of the main text.
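The final expression is straightforward to evaluate numerically. A sketch (naming is our own), assuming groups given by weights P(g) and conditional event probabilities P(E|g):

```python
def predictability(p_g, pe_g):
    """Evaluate Pi as the double sum over group pairs derived above,
    with groups sorted by decreasing P(E|g) so that h < g implies
    P(E|h) >= P(E|g)."""
    pairs = sorted(zip(p_g, pe_g), key=lambda t: t[1], reverse=True)
    p_e = sum(p * q for p, q in pairs)  # overall event probability P(E)
    total = sum(pairs[g][0] * pairs[h][0] * (pairs[h][1] - pairs[g][1])
                for g in range(len(pairs)) for h in range(g))
    return total / (p_e * (1.0 - p_e))

# Two equally likely groups with P(E|g) = 0.3 and 0.1:
pi_value = predictability([0.5, 0.5], [0.3, 0.1])
# pi_value == 0.3125, i.e. 2*AUC - 1 for the corresponding two-segment curve
```

When all groups have the same P(E|g) the curve is the diagonal and Π = 0, while full separation of events across groups drives Π toward 1.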