Link prediction on Twitter

With over 300 million active users, Twitter is among the largest online news and social networking services in existence today. Open access to information on Twitter makes it a valuable source of data for research on social interactions, sentiment analysis, content diffusion, link prediction, and the dynamics behind human collective behaviour in general. Here we use Twitter data to construct co-occurrence language networks based on hashtags and on all the words in tweets, and we use these networks to study link prediction by means of different methods and evaluation metrics. In addition to using five known methods, we propose two effective weighted similarity measures, and we compare how the outcomes depend on the selected semantic context of topics on Twitter. We find that hashtag networks yield largely the same results as all-word networks, thus supporting the claim that hashtags alone robustly capture the semantic context of tweets, and as such are useful and suitable for studying content and categorization. We also introduce ranking diagrams as an efficient tool for comparing the performance of different link prediction algorithms across multiple datasets. Our research indicates that successful link prediction algorithms correctly predict highly probable links even if the information about the network structure is incomplete, and they do so even if the semantic context is reduced to hashtags.


Introduction
This is the supplementary material for the paper "Link Prediction on Twitter". In the first section we present the network analysis measures used and the results obtained for the networks constructed from the content of tweets. In the second section we list the results for the seven link prediction measures.

Network Analysis Measures and Results
We first characterize the networks with a standard set of network measures. Table 1 in the supplementary material shows the results for the emo-net datasets, while Table 2 shows the results for the SC datasets. The characterization of the hashtag networks is given in Table 3 for the emo-net datasets and in Table 4 for the SC datasets. All results are reported for the full networks with 100% of links and for subnetworks with 75%, 50% and 25% of links.
For the characterization of complex networks we use the measures defined below. A network $G = (V, E)$ is a pair of a set of nodes $V$ (or vertices) and a set of links $E$ (or edges), where $N$ is the number of nodes and $K$ is the number of links. The number of network components is denoted by $\omega$. In weighted networks every link connecting two nodes $u$ and $v$ has an associated weight $w_{uv}$. The node degree $\deg(u)$ is the number of links directly connected (or incident) to node $u$, and the set of nodes adjacent to node $u$ (its neighbors) is denoted by $\Gamma(u)$. The strength $s_u$ of a node $u$ is the sum of the weights of all links incident to $u$.
The average network strength $\langle s \rangle$ is the sum of all link weights in a network divided by the number of nodes $N$:

$$\langle s \rangle = \frac{1}{N} \sum_{(u,v) \in E} w_{uv}.$$

The average network degree $\langle k \rangle$ is the ratio of the number of links to the number of nodes. For undirected networks we multiply this ratio by 2, since every undirected link has two incident nodes:

$$\langle k \rangle = \frac{2K}{N}.$$

Node selectivity, originally proposed by Masucci and Rodgers in 2006, for a node $v$ is the sum of the weights of all incident links divided by that node's degree $\deg(v)$:

$$e_v = \frac{s_v}{\deg(v)} = \frac{1}{\deg(v)} \sum_{u \in \Gamma(v)} w_{uv}.$$

Network density is the ratio between the number of existing links and the number of all possible links, which for an undirected network without self-loops is $N(N-1)/2$:

$$d = \frac{2K}{N(N-1)}.$$
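As a rough illustration of the definitions above, the following minimal sketch computes degree, strength, selectivity, average degree, average strength and density for a small undirected weighted network. The function names `build_weighted_network` and `basic_measures`, the adjacency-dictionary representation and the toy link list are assumptions made for this example; they are not taken from the paper.

```python
from collections import defaultdict


def build_weighted_network(cooccurrences):
    """Build an undirected weighted network as {node: {neighbor: weight}}.

    `cooccurrences` is an iterable of (u, v, weight) triples (hypothetical
    input format); repeated pairs accumulate their weights.
    """
    adj = defaultdict(dict)
    for u, v, w in cooccurrences:
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    return dict(adj)


def basic_measures(adj):
    """Compute N, K, average degree, average strength, density, selectivity.

    Assumes an undirected network without self-loops in which every node
    has at least one link.
    """
    n = len(adj)
    # Every undirected link appears twice in the adjacency dictionary.
    k = sum(len(nbrs) for nbrs in adj.values()) // 2
    total_weight = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    strength = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    selectivity = {u: strength[u] / degree[u] for u in adj}
    return {
        "N": n,
        "K": k,
        "avg_degree": 2 * k / n,             # <k> = 2K / N
        "avg_strength": total_weight / n,    # sum of link weights / N
        "density": 2 * k / (n * (n - 1)),    # d = 2K / (N (N - 1))
        "selectivity": selectivity,          # e_v = s_v / deg(v)
    }


# Toy input, invented for illustration only.
toy_links = [("a", "b", 2), ("a", "c", 1), ("b", "c", 3), ("c", "d", 4)]
measures = basic_measures(build_weighted_network(toy_links))
```

For the toy input, node `d` has a single link of weight 4, so its selectivity equals that weight, while the average degree follows directly from the four links on four nodes.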
The average path length of a network, where $d_{uv}$ denotes the number of links on the shortest path between $u, v \in V$, is computed as follows:

$$L = \frac{1}{N(N-1)} \sum_{u \neq v} d_{uv}.$$

The network radius $R$ is the smallest eccentricity $\epsilon(v)$, where $\epsilon(v)$ is defined as the maximum distance between $v \in V$ and any other node:

$$R = \min_{v \in V} \epsilon(v).$$
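The path-based measures can be sketched with a breadth-first search from every node. The example below is a minimal sketch that assumes an unweighted, undirected and connected network stored as a dictionary of neighbor sets; the helper names `bfs_distances` and `path_measures` are hypothetical. It also returns the diameter as the largest eccentricity, which is the standard definition.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return the number of links on the shortest path from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_measures(adj):
    """Average path length L, radius R and diameter D of a connected network."""
    n = len(adj)
    total = 0
    eccentricity = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != n:
            raise ValueError("the sketch assumes a connected network")
        total += sum(dist.values())           # sums d_uv over all v != u
        eccentricity[u] = max(dist.values())  # largest distance from u
    return {
        "L": total / (n * (n - 1)),
        "R": min(eccentricity.values()),
        "D": max(eccentricity.values()),
    }


# Toy path network a - b - c - d, invented for illustration only.
toy_path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
result = path_measures(toy_path)
```

In the toy path network the two inner nodes have eccentricity 2 and the two end nodes have eccentricity 3, so the radius is 2 and the diameter is 3.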
Network transitivity is the fraction of possible triangles that are present in the network, where the possible triangles are identified by the number of triads (two links with a shared node):

$$T = \frac{3 \times \text{number of triangles}}{\text{number of triads}}.$$

The average clustering coefficient, where $c(v)$ is the clustering coefficient of a node $v$, sums all the individual clustering coefficients and divides the sum by the number of nodes:

$$C = \frac{1}{N} \sum_{v \in V} c(v).$$

The global network efficiency is the average of the reciprocal shortest-path lengths over all pairs of distinct nodes, and is therefore inversely related to the network's average path length:

$$E = \frac{1}{N(N-1)} \sum_{u \neq v} \frac{1}{d_{uv}}.$$

The assortativity coefficient is the Pearson correlation coefficient of degree between pairs of linked nodes:

$$A = \frac{\sum_{uv} uv \, (e_{uv} - a_u b_v)}{\sigma_a \sigma_b},$$

where $e_{uv}$ is the joint probability distribution (mixing matrix) of the degrees, $a_u$ and $b_v$ are the fractions of links that start and end at nodes of degree $u$ and $v$, and $\sigma_a$ and $\sigma_b$ are the standard deviations of the distributions $a_u$ and $b_v$. The mixing matrix $e_{uv}$ satisfies the sum rules

$$\sum_{uv} e_{uv} = 1, \qquad \sum_{v} e_{uv} = a_u, \qquad \sum_{u} e_{uv} = b_v.$$

Table 1 presents the network measures for the emo-net datasets constructed from the co-occurrence of all words in tweets. Table 2 presents the network measures for the SC datasets constructed from the co-occurrence of all words in tweets. Table 3 presents the network measures for the emo-net datasets constructed from the co-occurrence of hashtags in tweets. Table 4 presents the network measures for the SC datasets constructed from the co-occurrence of hashtags in tweets. In all four tables the results are listed for subnetworks constructed from 25%, 50%, 75% and 100% of links.
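As a rough illustration of the transitivity and average clustering coefficient defined earlier in this section, the following minimal sketch counts, for every node, the pairs of neighbors (triads centered on that node) and how many of those pairs are themselves linked. The function names and the toy network are assumptions made for this example, and nodes with fewer than two neighbors are assigned a clustering coefficient of zero, which is a common convention rather than something stated in the text.

```python
from itertools import combinations


def closed_and_total_triads(adj):
    """For each node return (linked neighbor pairs, all neighbor pairs)."""
    counts = {}
    for v, nbrs in adj.items():
        pairs = list(combinations(nbrs, 2))
        closed = sum(1 for a, b in pairs if b in adj[a])
        counts[v] = (closed, len(pairs))
    return counts


def transitivity(adj):
    """T = 3 * triangles / triads; each triangle closes three triads."""
    counts = closed_and_total_triads(adj)
    closed = sum(c for c, _ in counts.values())  # equals 3 * number of triangles
    triads = sum(t for _, t in counts.values())
    return closed / triads if triads else 0.0


def average_clustering(adj):
    """C = mean of the local clustering coefficients c(v)."""
    counts = closed_and_total_triads(adj)
    local = [c / t if t else 0.0 for c, t in counts.values()]
    return sum(local) / len(local)


# Toy network: a triangle a-b-c with a pendant node d attached to c
# (invented for illustration only).
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
t_value = transitivity(toy)
c_value = average_clustering(toy)
```

The two measures differ on the toy network because transitivity weights every triad equally, whereas the average clustering coefficient weights every node equally, so the low-degree node `d` pulls the average down.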
The SC dataset, prepared in 2009 and annotated for polarity, is available at http://help.sentiment140.com/for-students/.

The results are expressed for $N$: number of nodes, $K$: number of links, $\langle k \rangle$: average network degree, $\langle s \rangle$: average network strength, $\langle e \rangle$: average network selectivity, $d$: network density, $\omega$: number of components, $L$: average path length, $D$: network diameter, $R$: network radius, $T$: network transitivity, $C$: average clustering coefficient, $A$: network degree assortativity (not weighted) and $E$: global network efficiency. Measurements are reported for the 25%, 50%, 75% and 100% hashtag subnetworks of the four SC datasets. The SC $10^4$ networks are constructed from 10,000 tweets, while the SC $10^5$ networks are constructed from 100,000 tweets.

Ranking diagrams for precision in all-words networks
In Fig. 1 of the supplementary materials we show ranking diagrams for precision for the 25% (a), 50% (b) and 75% (c) networks constructed from all words in tweets over all datasets. The seven tested link prediction measures, namely the weighted common neighbors (CN), the weighted Jaccard coefficient (JC), the weighted preferential attachment (PA), the weighted Adamic-Adar (AA), the resource allocation index (RA), selectivity (SE) and inverse selectivity (IS), are ranked according to their precision values over the eight datasets.

Ranking diagrams for precision in hashtag networks
In Fig. 2 of the supplementary materials we show ranking diagrams for precision for the 25% (top), 50% (middle) and 75% (bottom) networks constructed from hashtags in tweets over all datasets. The seven tested link prediction measures, namely the weighted common neighbors (CN), the weighted Jaccard coefficient (JC), the weighted preferential attachment (PA), the weighted Adamic-Adar (AA), the resource allocation index (RA), selectivity (SE) and inverse selectivity (IS), are ranked according to their precision values over the eight datasets.
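One plausible way to prepare the data behind such a ranking diagram is to order the measures by precision within each dataset and then count how often each measure occupies each rank. The sketch below shows this under that assumption; the exact construction used for the figures is not described in this supplement, the function names are hypothetical, and the precision values are made-up placeholders rather than results from the paper.

```python
def rank_measures(precision_by_dataset):
    """Map each dataset to its measures ordered from highest to lowest precision."""
    return {
        dataset: sorted(scores, key=scores.get, reverse=True)
        for dataset, scores in precision_by_dataset.items()
    }


def rank_counts(rankings):
    """Count, per measure, how many datasets place it at each rank (1 = best)."""
    counts = {}
    for ordered in rankings.values():
        for position, measure in enumerate(ordered, start=1):
            per_measure = counts.setdefault(measure, {})
            per_measure[position] = per_measure.get(position, 0) + 1
    return counts


# Made-up placeholder values for two hypothetical datasets and three of the
# seven measures; these are NOT results from the paper.
toy_precision = {
    "toy_A": {"CN": 0.30, "JC": 0.10, "PA": 0.20},
    "toy_B": {"CN": 0.25, "JC": 0.15, "PA": 0.40},
}
rankings = rank_measures(toy_precision)
counts = rank_counts(rankings)
```

Ties are broken here by the order in which the measures are listed, which is a simplification; a real comparison might assign tied measures a shared rank.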

Results for the networks from all-words in tweets
In this section we list the results for the 25%, 50% and 75% networks constructed from the co-occurrence of all words in tweets in all eight datasets. The results are reported for precision, the F1 score and the area under the receiver operating characteristic curve (AUC) for each of the tested link prediction measures: the weighted common neighbors in Table 5, the weighted Jaccard coefficient in Table 6, the weighted preferential attachment in Table 7, the weighted Adamic-Adar in Table 8, the resource allocation index in Table 9, selectivity in Table 10 and inverse selectivity in Table 11. Each of these tables gives the precision, the F1 score and the AUC of one measure for the 25%, 50% and 75% of links in the all-words networks on the eight datasets.
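To make the three evaluation metrics concrete, the following minimal sketch scores all unlinked node pairs of an observed network and evaluates the scores against a set of held-out links. It rests on several assumptions that are not stated in this supplement: the subnetwork with a fraction of links is treated as the observed network, the remaining links are the ones to predict, precision and F1 are computed on the `top_n` highest-scored pairs, and the AUC is computed as the probability that a held-out link scores higher than a non-link (ties count one half). The scorer is the plain unweighted common-neighbors count, used only as a stand-in for the weighted measures of the paper, and all names are hypothetical.

```python
from itertools import combinations


def common_neighbors_scores(adj):
    """Score every unlinked pair (u, v), u < v, by its number of shared neighbors.

    Unweighted stand-in; the measures tested in the paper are weighted variants.
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(set(adj[u]) & set(adj[v]))
    return scores


def precision_f1(scores, held_out, top_n):
    """Precision and F1 of the top_n highest-scored pairs against held-out links."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    hits = sum(1 for pair in ranked if pair in held_out)
    precision = hits / top_n
    recall = hits / len(held_out)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, f1


def auc(scores, held_out):
    """Probability that a held-out link outscores a non-link (ties count 0.5)."""
    pos = [s for pair, s in scores.items() if pair in held_out]
    neg = [s for pair, s in scores.items() if pair not in held_out]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy observed network and held-out links, invented for illustration only.
# Held-out pairs are written as sorted tuples to match the score keys.
observed = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
held_out = {("a", "d"), ("c", "e")}
scores = common_neighbors_scores(observed)
precision, f1 = precision_f1(scores, held_out, top_n=2)
auc_value = auc(scores, held_out)
```

Computing the AUC by comparing every held-out link with every non-link is quadratic in the number of candidate pairs, so larger networks would typically rely on sampling pairs or on a rank-based formula instead.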

Results for the networks from hashtags in tweets
In this section we list the results for the 25%, 50% and 75% networks constructed from the co-occurrence of hashtags in tweets in all eight datasets. The results are reported for precision, the F1 score and the AUC for each of the tested link prediction measures: the weighted common neighbors in Table 12, the weighted Jaccard coefficient in Table 13, the weighted preferential attachment in Table 14, the weighted Adamic-Adar in Table 15, the resource allocation index in Table 16, selectivity in Table 17 and inverse selectivity in Table 18. Each of these tables gives the precision, the F1 score and the AUC of one measure for the 25%, 50% and 75% of links in the hashtag networks on the eight datasets.

In addition, we list the results for the 25%, 50% and 75% hashtag networks of the SC $10^5$ pos dataset for the top 200 and the top 500 hashtags. The results are reported for precision, the F1 score and the AUC for each of the tested link prediction measures: the weighted common neighbors in Table 19, the weighted Jaccard coefficient in Table 20, the weighted preferential attachment in Table 21, the weighted Adamic-Adar in Table 22, the resource allocation index in Table 23, selectivity in Table 24 and inverse selectivity in Table 25. Each of these tables gives the precision, the F1 score and the AUC of one measure for the 25%, 50% and 75% of links in the top 200 and the top 500 hashtag networks of the SC $10^5$ pos dataset.