Finding Statistically Significant Communities in Networks

Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a strong need for multi-purpose techniques, able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs. Several applications to real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.


Numerical estimation of the internal connection probability
The assessment of a cluster's significance given the null (configuration) model relies on the estimation of the probability described in Eq. 1 of the main text. This function has to be evaluated many times during the execution of OSLOM, both to clean up each cluster and to evaluate the clusters at the different hierarchical levels. We explain here how the values of the distribution function can be estimated or approximated in a practical implementation of OSLOM.
For convenience, we rewrite the equation here (S1). When estimating the probability of Eq. S1 for a given $k_i^{in}$, the most computationally expensive part is the evaluation of the normalization factor $A$: it would force us to evaluate the rest of the formula for all the allowed values of $k_i^{in}$ and add up the results. A simple way out of this problem is to approximate the distribution by another one whose normalization factor is known. To do so, we consider a slightly different null model, in which the edges are still drawn at random but the formation of self-loops is allowed. This is actually the null model on which the definition of modularity is based [1]. In such a model, the equivalent of Eq. S1 becomes a hypergeometric function, which is much easier to estimate (see [2]). The two distributions, that of Eq. S1 and the hypergeometric one, give close numerical values for the same $k_i^{in}$, unless the probability of generating self-loops in the null model is high. The probability that, when the connections are reshuffled at random, a stub of vertex $i$ connects to another stub of the same vertex is given by $k_i^2/2M$. In the software implementation of OSLOM, the hypergeometric approximation of Eq. S1 is used as long as $k_i^2/2M < 1$. Otherwise, $A$ is computed directly from Eq. S1.
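The decision rule described above can be illustrated with a minimal Python sketch. It is not the OSLOM implementation: the function names are hypothetical, and the parametrization of the hypergeometric distribution (the $k_i$ stubs of vertex $i$ drawn without replacement from a pool of `total_stubs` stubs, of which `m_c` belong to the cluster) is an assumption made only for illustration; the text refers to [2] for the details.

```python
from math import comb


def self_loop_probability(k_i: int, M: int) -> float:
    """Quantity k_i^2 / 2M quoted in the text (M = number of edges)."""
    return k_i ** 2 / (2 * M)


def use_hypergeometric(k_i: int, M: int) -> bool:
    """Rule from the text: approximate Eq. S1 while k_i^2 / 2M < 1."""
    return self_loop_probability(k_i, M) < 1.0


def hypergeom_upper_tail(k_in: int, k_i: int, m_c: int, total_stubs: int) -> float:
    """P(X >= k_in) for a hypergeometric variable X.

    ASSUMED parametrization (illustrative only): k_i stubs are drawn without
    replacement from total_stubs stubs, m_c of which belong to the cluster.
    """
    denom = comb(total_stubs, k_i)
    upper = min(k_i, m_c)
    # Exact integer arithmetic for the numerator avoids rounding errors.
    numer = sum(
        comb(m_c, x) * comb(total_stubs - m_c, k_i - x)
        for x in range(k_in, upper + 1)
    )
    return numer / denom
```

The normalization is known in closed form here (`comb(total_stubs, k_i)`), which is precisely what makes this distribution cheaper to evaluate than one whose normalization factor has to be summed explicitly.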

Extension of the method to weighted networks
The main text briefly discusses how to extend OSLOM to weighted graphs. It also mentions that some of the technical issues, such as combining $r_w$ and $r_t$, are not trivial. This procedure is described here in further detail.
Recall that we start from an ansatz for the distribution of the weights in the null model. The weight on the edge joining vertices $i$ and $j$ was assumed to be exponentially distributed, with an expected value determined by the average weights of the two endvertices. The idea behind this expression is that the weight of an edge is proportional to the average weight of its endvertices ($\langle w_i \rangle = s_i/k_i$ and $\langle w_j \rangle = s_j/k_j$). We proposed the harmonic average because it is more sensitive to small values of $\langle w_i \rangle$.
Our goal is to define a fitness function $r$ which is a uniform random variable on our randomized weighted network. We want to combine the fitness function depending on the topology with one depending on the weight distribution, in order to detect meaningful fluctuations in either of them. Let us consider a vertex $i$ which has $l$ connections with a given subgraph $C$ (not including $i$). For the topological part, we have already computed the probability that $i$ shares $l$ or more edges with vertices of $C$ (Eq. S1). We call this number $r_t$.
Each of the $l$ edges joining $i$ with $C$ carries a weight. We consider the corresponding normalized weight $\omega_s = w_s/\langle w_s \rangle$, where $w_s$ is the weight on the $s$-th edge and $\langle w_s \rangle$ its expected value in the null model, with $s = 1, 2, \ldots, l$. Since we want a single number taking into account all the weights in the set, we can simply consider the sum of all the $\omega_s$: $\Omega = \sum_{s=1}^{l} \omega_s$. $\Omega$ is the sum of $l$ exponentially distributed variables (with rate equal to one) and therefore it follows the Erlang distribution [3]. Let us call $r_w$ the cumulative of $\Omega$.
In this way, we have defined two variables, $r_t$ and $r_w$, which are both uniformly distributed in the null model. Now, we would like to combine these two scores into a final score for our vertex $i$. Unfortunately this is not so simple. Recall that $r_w$ is defined only on the $N_n$ neighbors of subgraph $C$, while $r_t$ is defined for all the $N^* = N - n_C \geq N_n$ vertices outside $C$, so the two variables are in general defined on samples of different size.
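The weight score defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OSLOM code: the function names are hypothetical, and $r_w$ is taken here to be the upper tail of the Erlang distribution (the probability that the sum of $l$ unit-rate exponentials is at least $\Omega$), by analogy with $r_t$, which is the probability of observing $l$ or more edges. The harmonic average is used as the expected weight of an edge, with no extra proportionality constant, which is also an assumption.

```python
from math import exp


def harmonic_mean(w_i: float, w_j: float) -> float:
    """Harmonic average of the mean weights of the two endvertices."""
    return 2.0 * w_i * w_j / (w_i + w_j)


def erlang_survival(omega: float, l: int) -> float:
    """P(sum of l unit-rate exponentials >= omega).

    Closed form: exp(-omega) * sum_{n=0}^{l-1} omega**n / n!
    """
    term = 1.0   # omega**0 / 0!
    total = 1.0
    for n in range(1, l):
        term *= omega / n
        total += term
    return exp(-omega) * total


def weight_score(weights, expected_weights) -> float:
    """ASSUMED definition of r_w: upper-tail probability of Omega.

    weights[s] is the weight on the s-th edge between the vertex and the
    subgraph; expected_weights[s] is its null-model expectation.
    """
    omega = sum(w / e for w, e in zip(weights, expected_weights))
    return erlang_survival(omega, len(weights))
```

With this convention a small $r_w$ signals edges that are heavier than expected, in the same way that a small $r_t$ signals more edges than expected.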
A way to overcome this difficulty is to rescale $r_t$ to an equivalent random variable defined on the smaller sample (we keep the symbol $r_t$ for the rescaled variable). This amounts to mapping each index $i$ in the set $1, 2, \ldots, N^*$ of the old variable onto an index $j$ in the set $1, 2, \ldots, N_n$ of the new variable. Given $i$, the natural solution is to pick the index $j$ such that the cumulative probability $\Omega^t_q$ on the sample of $N^*$ vertices coincides (within the approximation allowed by the specific numerics involved) with the cumulative probability $\Omega^w_q$ on the smaller sample of $N_n$ vertices. It can be shown that this can be achieved to a good approximation (in the limit of $j$ close to $N_n$) through a rescaling of $r_t$.
Once $r_t$ and $r_w$ have been computed, we need to combine them into a single score to rank the vertices. We consider the product $x = r_t \cdot r_w$ and define the final score as $r_{tw} = p(r_t \cdot r_w < x) = x(1 - \log x)$. The last expression follows from the assumption that the two variables are uniform and independent. The set of variables $\{r_{tw}\}$ is then used to rank the vertices and to compute the cumulative probabilities $\Omega^{tw}_q$, with $N_n$ instead of $N^*$.
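The closed form $x(1 - \log x)$ for the distribution of the product of two independent uniform variables can be checked numerically. The following Python sketch (function names are hypothetical, chosen for illustration) computes the combined score and compares it with a simple Monte Carlo estimate.

```python
import math
import random


def combined_score(r_t: float, r_w: float) -> float:
    """r_tw = x * (1 - log x), with x = r_t * r_w and natural logarithm."""
    x = r_t * r_w
    if x <= 0.0:
        return 0.0  # limit of x * (1 - log x) as x -> 0
    return x * (1.0 - math.log(x))


def empirical_product_cdf(x: float, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(U * V < x) for independent uniforms U, V."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples) if rng.random() * rng.random() < x)
    return hits / samples
```

For instance, `combined_score(0.5, 0.5)` and `empirical_product_cdf(0.25)` should agree up to sampling noise, since both estimate the probability that the product of two uniform variables falls below 0.25.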

Girvan-Newman benchmark
The benchmark by Girvan and Newman [4] (GN benchmark) is a class of graphs with 128 vertices each, divided into four equal-sized groups of 32 vertices. Every vertex has expected degree 16 (with a distribution sharply peaked around 16). The (average) number of neighbors of a vertex within its group is $k_{in}$, whereas the (average) number of external neighbors is $k_{out}$. By construction, $k_{in} + k_{out} = 16$. In the language of the planted partition model [5], the probability that a vertex is linked to another vertex of its group is $p = k_{in}/31$, and the probability that it is linked to a vertex outside its group is $q = k_{out}/96$. The condition $p > q$ for the four groups to be communities is then equivalent to $k_{out} \lesssim 12$ (this does not account for random fluctuations, though [2,6]). Fig. S1 shows the normalized mutual information (in the version devised in Ref. [7]) between the planted partition of the GN benchmark and the partition found by the algorithm, as a function of $k_{out}$. As a term of comparison we again used Infomap [8]. Fig. S1 shows that Infomap is more accurate than OSLOM for low values of $k_{out}$, but its performance drops rapidly for $k_{out} \gtrsim 6$, whereas OSLOM shows a slower decay.
OSLOM is slightly worse than Infomap because it finds several homeless vertices, as we explained in the main text (Section 3.1.1).
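The link probabilities of the GN benchmark and the threshold implied by $p > q$ follow from simple arithmetic, sketched below in Python (function and parameter names are hypothetical). Setting $p = q$, i.e. $(16 - k_{out})/31 = k_{out}/96$, gives $k_{out} = 16 \cdot 96/127 \approx 12.09$.

```python
def gn_link_probabilities(k_out: float, k_total: float = 16,
                          group_size: int = 32, n_groups: int = 4):
    """Return (p, q) of the planted partition model for the GN benchmark."""
    k_in = k_total - k_out
    internal_partners = group_size - 1                # 31 other vertices in the group
    external_partners = group_size * (n_groups - 1)   # 96 vertices in other groups
    return k_in / internal_partners, k_out / external_partners


def gn_threshold(k_total: float = 16, group_size: int = 32,
                 n_groups: int = 4) -> float:
    """Value of k_out at which p = q (communities are defined for p > q)."""
    internal_partners = group_size - 1
    external_partners = group_size * (n_groups - 1)
    return k_total * external_partners / (external_partners + internal_partners)
```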

Weighted LFR benchmark
In Figs. S2 and S3 we report the comparative analysis of OSLOM and Infomap on weighted LFR graphs. To build the weighted benchmark graphs [9] one needs two additional parameters: the exponent $\beta$ of the relation between the strength of a vertex and its degree (the strength of a vertex is the sum of the weights of the edges incident on it), and the weighted mixing parameter $\mu_w$, which is the natural extension to weighted networks of the topological $\mu$ (here called $\mu_t$), i.e. the ratio between the sum of the weights on the edges joining a vertex to its neighbors in different communities and the strength of the vertex. In the analysis, we fix the value of the topological mixing parameter $\mu_t$ and see how the normalized mutual information varies as a function of $\mu_w$. In Fig. S2 the benchmark graphs consist of 5000 vertices, and we consider the usual two ranges of community sizes (S and B). In Fig. S3 the graphs consist of 50000 vertices, and we consider a single, but much wider, range of community sizes (from 20 to 1000).
When $\mu_t = 0.5$ or $\mu_t = 0.6$ and $N = 5000$, we find that OSLOM detects the right clusters for any value of $\mu_w$, which is truly remarkable, while Infomap is unable to find the partition for $\mu_w \gtrsim 0.6$. OSLOM's striking result comes from the fact that the score $r_{tw}$ of a vertex on weighted graphs is given by the product of two numbers, the topological score $r_t$ and the weight score $r_w$ (Section 2). If $\mu_t$ is not too large, the topological term $r_t$ is very low and brings down the whole score $r_{tw}$, which remains significant for any choice of the weighted mixing parameter $\mu_w$. Basically, OSLOM is able to recognize the right clusters from the topology alone. When $\mu_t = 0.5$ or $\mu_t = 0.6$ and $N = 50000$, OSLOM maintains an excellent performance over the whole range of $\mu_w$, while Infomap again fails for $\mu_w \gtrsim 0.6$.
For $\mu_t = 0.7$ the performance of both algorithms worsens; OSLOM is still superior, though the results are essentially comparable for both network sizes. For $\mu_t = 0.8$ Infomap is more accurate than OSLOM when $N = 5000$, while neither method performs well when $N = 50000$. However, from Figs. S2 and S3 it is apparent that OSLOM works better the larger the network. So, on very large networks ($N$ larger than 50000) we expect OSLOM to have a performance comparable or superior to that of Infomap for every pair of values ($\mu_t$, $\mu_w$). We also infer that the performance of both algorithms worsens if clusters are on average larger.
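The two mixing parameters used in this section can be computed per vertex as in the Python sketch below. The function name and the data layout (a dictionary of neighbor weights and a dictionary of community labels, with each vertex in exactly one community) are assumptions made for illustration and are not part of the benchmark code of [9].

```python
def mixing_parameters(neighbor_weights: dict, community: dict, vertex):
    """Return (mu_t, mu_w) for one vertex.

    neighbor_weights: maps each neighbor of `vertex` to the weight of the
        edge joining it to `vertex` (hypothetical layout).
    community: maps every vertex to its (single) community label.
    """
    own = community[vertex]
    degree = len(neighbor_weights)
    strength = sum(neighbor_weights.values())  # sum of incident edge weights
    external = [v for v in neighbor_weights if community[v] != own]
    # Topological mixing: fraction of neighbors in other communities.
    mu_t = len(external) / degree
    # Weighted mixing: fraction of the strength carried by external edges.
    mu_w = sum(neighbor_weights[v] for v in external) / strength
    return mu_t, mu_w
```

For example, a vertex with one internal edge of weight 3 and two external edges of weight 1 each has $\mu_t = 2/3$ but only $\mu_w = 0.4$, which shows how the two parameters can be tuned independently.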

Directed LFR benchmark
Figs. S4 and S5 show the results of the test on directed LFR graphs [9]. This time we have to distinguish between the in-degree (number of incoming edges) and the out-degree (number of outgoing edges) of a vertex. The in-degree distribution is taken to be a power law, with exponent $\tau_{in}$, whereas the out-degree is the same for all vertices, for simplicity. The mixing parameter $\mu$ is the ratio between the number of in-neighbors of a vertex belonging to different clusters and the total number of in-neighbors of the vertex. An in-neighbor of a vertex $i$ is any vertex $j$ connected to $i$ by an edge going from $j$ to $i$. Figs. S4 and S5 tell us that OSLOM outperforms Infomap, especially when communities span a broader range of sizes. The performance of both algorithms worsens slightly on larger networks.

Zachary's karate club
The famous karate club network of Zachary [10] is a standard benchmark in community detection. Vertices are members of a karate club in the United States, who were monitored during a period of three years. Edges connect members who had social interactions outside the club. After some time, a conflict between the club president and the instructor caused the fission of the club into two separate groups, supporting the instructor and the president, respectively. In Fig. S6 we see the community structure found by OSLOM. It indeed finds two communities, plus a homeless vertex (12). Vertex 3 is shared between the two clusters, as it has several neighbors in both groups. Overlapping and homeless vertices are drawn as stars and triangles, respectively. The communities coincide with the ones observed by Zachary, with the exception of vertices 3 and 12, which Zachary put with the squares. However, vertex 3 is overlapping, so it belongs to both clusters, which seems quite reasonable by looking at the figure. Also, vertex 12 is homeless due to its loose relationship with its group (it has only one neighbor).

Lusseau's social network of bottlenose dolphins
Fig. S7 shows the application of OSLOM to Lusseau's social network of bottlenose dolphins [11]. Vertices of the network are dolphins, and two dolphins are connected if they were seen together more often than expected by chance. The dolphins separated into two groups after one of them left the place for some time. OSLOM finds two communities, with five overlapping vertices (2, 8, 20, 29, 31), plus two homeless vertices (40, 61), which are very loosely connected to the rest of the graph. All vertices which are uniquely assigned to the same group (indicated by the same symbol, square or circle, in the figure) are classified in the same community by Lusseau as well.

American college football
Another well-known benchmark in community detection is the network of American college football teams, compiled by Girvan and Newman [4]. It comprises 115 vertices, representing Division I-A colleges. Edges correspond to games played by the teams against each other during the regular season of fall 2000. The teams are divided into 12 conferences. Games between teams in the same conference are usually (but not always) more frequent than games between teams of different conferences, so there is an organization in clusters, where communities correspond to conferences. In Fig. S8 we see that OSLOM finds three hierarchical levels. The lowest level consists of 11 clusters and 5 homeless vertices. There are no overlapping vertices. Six clusters correspond exactly to conferences, three others match their conferences up to one vertex and one up to two vertices, while the last cluster, along with the homeless vertices, mostly mixes teams of the Sun Belt and Independents conferences. The latter is not a proper conference, whereas Sun Belt includes colleges which are geographically very spread out, so they often play games against the other teams and end up much more mixed with them than teams of other conferences. Interestingly, in the second hierarchical level we find two large communities (plus four homeless teams), corresponding quite well to a geographical separation of the colleges into East and West.
Figure S6. Application of OSLOM to real networks: Zachary's karate club.

C. elegans metabolic network
Figure S7. Application of OSLOM to real networks: Lusseau's social network of bottlenose dolphins.
Figure S8. Application of OSLOM to real networks: American college football network. Panels: level 1, level 2, level 3.
Figure S9. Application of OSLOM to real networks: metabolic network of C. elegans. Panels: level 1, level 2.