Statistically Validated Networks in Bipartite Complex Systems

Many complex systems present an intrinsic bipartite structure where elements of one set link to elements of the second set. In these complex systems, such as the system of actors and movies, elements of one set are qualitatively different than elements of the other set. The properties of these complex systems are typically investigated by constructing and analyzing a projected network on one of the two sets (for example the actor network or the movie network). Complex systems are often very heterogeneous in the number of relationships that the elements of one set establish with the elements of the other set, and this heterogeneity makes it very difficult to discriminate links of the projected network that are just reflecting system's heterogeneity from links relevant to unveil the properties of the system. Here we introduce an unsupervised method to statistically validate each link of a projected network against a null hypothesis that takes into account system heterogeneity. We apply the method to a biological, an economic and a social complex system. The method we propose is able to detect network structures which are very informative about the organization and specialization of the investigated systems, and identifies those relationships between elements of the projected network that cannot be explained simply by system heterogeneity. We also show that our method applies to bipartite systems in which different relationships might have different qualitative nature, generating statistically validated networks in which such difference is preserved.

In recent years, many complex systems have been described and modeled in terms of bipartite networks [1][2][3][4][5]. Examples include movies and actors [1, 2,4], authors and scientific papers [6][7][8][9], email accounts and emails [10], mobile phones and phone calls [11], plants and animals that pollinate them [12,13]. One ubiquitous property of bipartite complex systems is their heterogeneity. For example, in a given period of time, some actors play in many movies, whereas others play in a few, some authors write a few papers, whereas others write many. Movies are also heterogeneous because of the size of cast, as well as papers because of the number of authors. Heterogeneity is also a common feature of biological complex systems. The genome of some organisms might contain a small set of proteins performing a given class of biological functions whereas the corresponding set of proteins is large for other organisms. Bipartite networks are composed by two disjoint sets of nodes such that every link connects a node in the first set with a node of the second set. The properties of bipartite complex systems are often investigated by considering the one-mode projection of the bipartite network, i.e. one creates a network of nodes belonging to one of the two sets and two nodes are connected when they have at least one common neighboring node of the other set. When one constructs a projected network with nodes from only one set, the system heterogeneity makes it very difficult to identify preferential links between the elements. Preferential links are links carrying information about the structure and organization of complex systems. It is therefore of great use to devise a method allowing to statistically validate whether a given link in the projected network is consistent or not with a null hypothesis of random connectivity between elements of the bipartite network .
Here we introduce an unsupervised method to statistically validate each link of the projected network. A schematic summary of our method is provided in Fig. 1. The key ingredients of our method are (i) the selection of a null hypothesis of random connectivity between elements in the bipartite network consistent with the degree heterogeneity of both sets of elements, (ii) the identification of an analytical or computationally feasible procedure, which is able to associate a p-value testing the presence of a link against the selected null hypothesis and (iii) the appropriate correction of the statistical significance level in the presence of multiple hypothesis testing [14,15] of links across the network. It is worth noting that a direct simulation approach cannot be used to achieve point (ii) for large networks. In fact the statistical correction needed in the presence of multiple hypothesis testing is usually so low that a numerical validation of the null hypothesis would require the simulation of an unrealistically large number of different random realizations of the system. Therefore the resolution of point (ii) is essential for the practical application of our method.
We apply our method to three different systems, namely the set of clusters of orthologous genes (COG) detected in completely sequenced genomes [16,17], a set of daily returns of 500 US financial stocks, and the set of world movies of the IMDb database [18]. The Let us now discuss in detail the first example of the COG database [16,17], where one set is composed by 66 organism's genomes (13 Archaea, 50 Bacteria and 3 unicellular Eukaryota) and the other by 4,873 COGs. The number of COGs in a genome is heterogeneous, ranging from 362 to 2,243. Similarly, the number of genomes in which a specific COG is present ranges from 3 to 66. Here we consider the projected network of genomes. In this system we are able to associate a p-value to each link in an analytical way by following a two steps procedure. The procedure followed is illustrated in Fig. 1 of the Supplementary Information.
First we consider subsets of the bipartite system. We call these subsets COG-k subsets. A COG k is a COG present in k = 3, · · · , 66 genomes. In each subset of COG k s we are therefore left only with the heterogeneity of organisms. We identify a preferential relationship between each pair of organisms by statistically validating the co-occurrence of COG k s against a null hypothesis that takes into account such heterogeneity. Specifically, given two organisms A and B, let N A be the number of COG k s in organism A and N B the number of COG k s in organism B. The total number of COG k s is N k and the observed number of COG k s belonging to both A and B is N AB . Under the null hypothesis of random co-occurrence, the probability of observing X co-occurrences is given by the hypergeometric distribution [19] (1) We can therefore associate a p-value to the observed N AB as p(N AB ) = 1 − We refer to the network obtained by using the Bonferroni threshold as the Bonferroni network. A less stringent correction for multiple hypothesis testing is the False Discovery Rate (FDR) [15]. The threshold of the FDR correction linearly increases with the number of tests in which the null hypothesis is rejected.
We also consider the FDR correction and we refer to the network obtained by using it as the FDR network. By construction, the Bonferroni network is a subgraph of the FDR network, which is a subgraph of the adjacency network.
Let us now analyze the statistically validated networks obtained for this biological system. The Bonferroni network of organisms includes 58 non isolated nodes connected by clear biological interpretation in terms of organisms' lineage. The FDR network of organisms includes all the 66 organisms and the number of weighted links in this network is 369 (Fig. 2B). Thus the entire set is covered and the additional links provides relations among the groups already observed in the Bonferroni network. Note that the adjacency network of this system is a complete graph.
It is worth noting that both the Bonferroni and the FDR network display a clear cluster structure and the clusters have a direct biological interpretation in terms of lineages.
Even if the FDR network is completely connected, the application of community detection algorithms [20], such as Infomap [21], to the statistically validated networks gives a clear community structure (see Fig. 2). This is not true for the adjacency network and shows that the statistically validated networks are able to identify the many relevant links inside communities and the few relevant links between different communities of organisms.
As a second example we consider the collective dynamics of the daily return of N s = 500 highly capitalized US financial stocks in the period 2001-2003 (748 trading days). Here, the two sets of the bipartite system are the stocks (with categorical information on their returns) and the trading days, and we consider here the projected network of stocks. The interest in this example is that we (i) generalize our procedure to complex systems where the elements are monitored by continuous variables, (ii) show how to simplify the above procedure when the second source of heterogeneity (in the above example the COG frequency in different organisms) is small, and (iii) show how to classify links according to the type of relation between the two nodes.
Since we want to identify similarities and differences among stock returns not due to the global market behavior, we investigate the excess return of each stock i with respect to the average daily return of all the stocks in our set. The excess return of each stock i at day t is then converted into a categorical variable with 3 states: up, down, and null. For each stock we introduce a daily varying threshold σ i (t) as the average of the absolute excess return (a measure of volatility) of stock i over the previous 20 days. State up (down) is assigned when the excess return of stock i at day t is larger (smaller) than σ i (t) (-σ i (t)).
The state null is assigned to the remaining days. We study the co-occurrence of states The last system we investigate is the bipartite system of movies and actors of the IMDb. We choose this system because (i) it is a large system (89,605 movies and 412,143 actors), (ii) it has a large heterogeneity both in movies and in actors, and (iii) it allows a sophisticated cluster characterization analysis based on the characteristics of the movie, namely genre, language, country, etc..
The actors heterogeneity ranges between 1 and 247 and it is so pronounced that there is no practical way to eliminate it when constructing statistically validated networks of movies. The approach of the k-subsets is not feasible due to lack of sufficient statistics.
Therefore, only a statistical validation against a null hypothesis fully taking into account the movies heterogeneity but not describing the heterogeneity of actors can be performed.
In the presence of this limitation a number of false positive links can be expected. In spite of this limitation, the results obtained for the statistically validated networks are very informative about several aspects of the movie industry.
We construct the statistically validated networks of movies by testing the co-occurrence of actors in the cast of each movie pair. A schematic representation of the procedure used to validate links is provided in Fig. 2 of Supplementary Information. The null hypothesis we use is again given by the hypergeometric distribution, which naturally takes into account the heterogeneity due to the number of actors performing in each movie. Table I  In the Supplementary Information we analyze the movie communities detected when Infomap is applied to different networks. In the community detection of adjacency and statistically validated networks we also weight links according to Ref.
[29] to heuristically take into account actors' heterogeneity in performing movies. We find that, both in weighted and unweighted networks, the clusters of movies obtained from the Bonferroni and FDR networks have a high homogeneity in terms of production country, language, genre, and filming location.
In summary, our method allows to validate links describing preferential relationships among the heterogeneous elements of complex systems with intrinsic bipartite nature. Our method is very robust with respect to the presence of links, which are false positive. In fact, for all the investigated systems we verified that by randomly rewiring the bipartite network the associated Bonferroni network was empty. By applying the method to three different systems, we showed that it is extremely flexible since it can be applied to systems with different degree of heterogeneity and to systems described by binary, categorical, or continuous variables.
Supplementary Information is linked to this manuscript.
Software A R code computing statistically validated networks of bipartite systems is available from the authors on request.   The procedure used to statistically validate each link of the projected network at the Bonferroni level for each subset of size k is summarized in Fig. 1 where an illustrative simple example is provided. In the example, we show as set A the elements we select to construct the projected network (these are organisms They are the link between elements 4 and 5 and the link between 6 and 7. All the other links are consistent with a null hypothesis of random connection preserving set A degree heterogeneity.

B. Single source of heterogeneity
When the degree heterogeneity of the set dual to the projected network is negligible the link validation proceeds as illustrated in Fig. 2. This is the case when Set A is the set of stock/states and Set B is the set of trading days. Again, the p-value is obtained by using the hypergeometric distribution neglecting the limited degree heterogeneity of Set B. In the specific example, the link 4-5 turns out not to be consistent with the approximate null hypothesis and therefore it is a statistically validated link. It is worth noting that the same procedure is also followed when the degree heterogeneity of the dual set is too large to be taken into account with the k-subset methodology discussed in the previous subsection. In fact we also use it as an approximately valid procedure for the movie/actor system.

II. CLUSTER DETECTION AND CHARACTERIZATION
In the present study we perform community detection 1,2 on the adjacency and statistically validated networks, in order to put in evidence the different community structure of these networks. We obtain a partition of the vertices of networks by using the Infomap method by Rosvall and Bergstrom 3 . This algorithm is considered one of the best 2 available today and it allows to efficiently investigate both weighted and unweighted networks. The method uses the probability flow of random walks in the considered network to identify the community structure of the system. This approach implies that two independent applications of the method to the same network may produce (typically slightly) different partitions of vertices.
For each investigated network, we run the Infomap 10 3 times and we select the best partition according to the minimal "code length" 3 . The obtained partition depends on whether the network is weighted or not, and eventually on how weights are selected. Therefore, in the following we discuss case by case the way in which cluster detection is performed. Once clusters of elements are detected, there still remains the problem of cluster interpretation.
We address this problem by comparing the partition of the system as produced by the Infomap with an a priori classification of the elements of the system. For instance movies can be characterized by their genre, and stocks can be characterized according to their economic sector.

A. Cluster characterization
Let us consider a system of N elements and a specific detected cluster C of N C elements to be characterized. Each element of the system has a certain number of attributes according to the considered a priori classification, e.g. a movie can be classified as "thriller" and "drama". We indicate the total number of different attributes over all the elements of the system with N A . For each attribute Q of the system, e.g. the Financial sector for stocks or the genre Comedy for movies, we test if Q is over-expressed in cluster C. In other words, we test if the number N C,Q of elements in cluster C that have the attribute Q is larger than what expected by randomly selecting the N C elements in the cluster from the total N elements of the system. The probability that X elements in cluster C have the attribute Q, under the null hypothesis that elements in the cluster are randomly selected, is given by the hypergeometric distribution H(X|N, N C , N Q ), where N Q is the total number of elements in the system with attribute Q. Therefore, we can associate a p-value with the observed number N C,Q of elements in cluster C that are classified with the attribute Q according to the equation If p(N C,Q ) is smaller than a given statistical threshold p b we say that the attribute Q is overexpressed in cluster C, and therefore the attribute Q is a characterizing aspect of cluster C. We separately test all the possible N A attributes for each detected cluster C. So, also in this case, we perform multiple comparisons, and we use a statistical threshold p b corrected for multiple comparisons by using the Bonferroni correction, i.e. we set p b = 0.01/N A .
A pretty similar approach to the one described in this subsection is used in Gene Ontology network which presents 2,451 distinct clusters. The cluster size decreases smoothly from the largest value of 13,608 down to the smallest value of 2. We will see in the following discussion that the partitioning of the network presents a certain degree of informativeness about the system. In fact the obtained clusters present a certain degree of homogeneity with respect to the main country of production, the language and some classes of genre of movies.
The FDR network is characterized by a largest connected component of 30,934 movies. The Infomap algorithm makes a partition of this and other components of the network into 3,967 clusters whose size is decreasing from 1,478 to 2 movies. However, the application of the Infomap algorithm refines the natural partitioning of the network by detecting 2,782 clusters whose size is ranging from 577 to 2 movies. We also note that the number of connected components in the Bonferroni network (2,456) is roughly equal to the number of the Infomap clusters in the adjacency network (2,451).

A. Community detection in weighted movie networks
Our method provides a full control of the statistical validation of links against a random null hypothesis taking into account the fact that different movies have a different number of actors. The system also presents a second source of heterogeneity. In fact, different actors typically play a different number of movies. In our sample the number of movies played by a single actor is ranging from 1 to 247. As discussed in the main text, we do not have a rigorous and computationally feasible way to also take into account this second source of heterogeneity in our statistical validation procedure. We therefore use a heuristic approach, and take into account this heterogeneity by following M. E. J. Newman. Specifically, we adapt the procedure proposed in the paper 6 to our system by weighting the link present between movie a and movie b with a value w ab that takes into account the number of movies played by the actors playing both the movies. Specifically, where the sum is taken over all the Q actors who play both movie a and b, and N i is the total number of movies played by actor i.
By performing community detection on the weighted adjacency movie network, we obtain a more refined partitioning of the 78,686 movies present in the network. Specifically, the clusters obtained with the Infomap algorithm present 3,386 clusters whose size is decreasing from 6,523 to 2. By performing community detection on the weighted statistically validated networks, we obtain results that are very similar to those obtained for the corresponding unweighted networks. The impact of considering link weights in community detection is quantitatively discussed in the following subsection.

B. How link weights affect the community structure of networks
The Infomap method allows to take into account link weights. This feature implies that results of community detection in a given network may significantly change when weights of links are considered. For each network, we quantify the difference between the partition obtained without using link weights and the partition obtained by taking link weights into account by calculating the (normalized) mutual information between the two partitions 7 .
The mutual information takes the maximum value of 1 for two identical partitions of the network. We observe a value of 0.798, 0.913 and 0.976 for the adjacency, FDR and Bonferroni networks, respectively. We therefore observe a net increase of the mutual information when we consider statistically validated networks. Furthermore, the mutual information reaches a value very close to 1 for the Bonferroni network, which is obtained under the most restrictive statistical requirements. In other words, we observe that the source of heterogeneity of the actors' productivity has a minor impact into the partitioning of the statistically validated networks, especially for the Bonferroni network.

C. Cluster size and inclusiveness
In the following, we separately discuss the results obtained for the partitioning of the In the majority of cases, the clusters detected in the Bonferroni network correspond to the strongest interconnected parts of larger clusters detected in the FDR network. The clusters of the FDR network correspond in turn to sets of movies which in large majority are present in bigger clusters observed in the weighted adjacency network. This general observation should not be seen as a strict inclusive relation but we wish to point out that a sort of "typical" inclusiveness is observed for most of the detected clusters.
In the next subsection we describe results about the characterization of clusters in the different networks.

D. Cluster characterization
We analyze the clusters of movies we obtain for the three different weighted networks, by considering the over-expression of specific characteristics of the movies contained in each cluster. By using the information about movies as provided by IMDb, we consider 4 different classifications of movies. Indeed the IMDb reports for each movie indication about (i) country or countries of production, (ii) language or languages used in the movie, (iii) movie genre (or genres) and (iv) location or locations where the movie was shot. Only in a limited number of cases some of these classifications are not available. When this happens we indicate the missing attribute about the movie as "not available". We characterize clusters obtained for all the networks by separately testing the over-expression of each attribute present in each one of the above mentioned 4 classifications.
In different networks we observe a different profile of over-expression. The degree of specificity is higher for smaller clusters and therefore a higher specificity is observed for the Bonferroni and for the FDR networks. This is especially true for the genre and the location classifications. In general, the country and language over-expression is quite specific for most clusters in the investigated networks. Exceptions are clusters containing movies produced in former Yugoslavia and Soviet Union. This is due to the fact that during the investigated period these countries have split into several independent countries.
In Table I   We discuss the case of the largest cluster observed in the partition of the adjacency weighted network. This is a cluster of 6,523 movies mainly in English and mainly produced in the USA. In the caption of Table II,  in California but cities of many other states are also observed.
We observe that 3,600 movies of the above described adjacency network cluster are split in many clusters of the weighted FDR network. In Table II we report some information about the seven FDR nework clusters having the largest intersection with the considered adjacency network cluster. Cluster labels in the figure are those provided by the Infomap.    The second case study concerns a cluster of Indian movies. The weighted adjacency movie network presents five large clusters of Indian movies. Here we discuss the properties of the second largest cluster of these five Indian movies clusters. We do not consider the largest one, because it is already very homogeneous according to the language: 90% of indian movies in this cluster are in Hindi. We indicate the selected cluster of Indian movies as cluster 24 of the weighted adjacency network. This cluster consists of 648 movies. In the caption of Table III we report all the over-expressions observed for this cluster. The over-expressed production country is indeed India and over-expressions are also observed for four languages spoken in India, five distinct movie genres, and eight filming locations. By comparing the clusters of the weighted FDR network with this cluster of the weighted adjacency network, we observe a large overlapping of movies. For example, 515 movies of cluster 24 are present in clusters 5 and 43 of the weighted FDR network. In other words, the Indian cluster 24 detected in the weighted adjacency network splits into two distinct clusters in the weighted FDR network. The first cluster (cluster 5) comprises movies where the language spoken is mainly Telugu, whereas the second cluster (cluster 43) mainly comprises movies in Tamil.
The characterization of genre and filming location is more specific than the one observed for cluster 24 of the adjacency network, but the degree of specificity is not too high (see the first two columns of Table III).
A higher degree of specificity is observed when we consider the clusters of the weighted Bonferroni network. In Table III, we show the over-expression characterization of the five largest clusters of Bonferroni network overlapping with cluster 24 of the weighted adjacency network (last five columns of Table III). There is a unique language characterization per cluster at this level. The filming location characterization is poor due to the fact that this information is often absent for Indian movies recorded in the IMDb. In fact the "not available" (NA) over-expression is the most frequent one.
In conclusion, we also notice for Indian movies the ability of statistically validated networks to describe communities of movies that are smaller but more homogeneous, according to the considered classifications, with respect to the communities of movies in the adjacency network.