Discovering Communities through Friendship

We introduce a new method for detecting communities of arbitrary size in an undirected weighted network. Our approach is based on tracing the path of closest-friendship between nodes in the network using the recently proposed Generalized Erdös Numbers. This method does not require the choice of any arbitrary parameters or null models, and does not suffer from a system-size resolution limit. Our closest-friend community detection accurately reconstructs the true network structure for a large number of real-world and artificial benchmarks, and can be adapted to study the multi-level structure of hierarchical communities as well. We also use the closeness between nodes to develop a degree of robustness for each node, which assesses how robustly that node is assigned to its community. To test the efficacy of these methods, we deploy them on a variety of well-known benchmarks, a hierarchically structured artificial benchmark with a known community and robustness structure, and real-world networks of coauthorships between the faculty at a major university and of citations of articles published in Physical Review. In all cases we observe microcommunities, a hierarchy of communities, and variable node robustness, providing insight into the structure of the network.


Closeness Measures and Community Detection
A schematic diagram of a network with easily detected community structure is shown in Fig. S1(a). In this network, a pair of communities with |c| = N/2 nodes each is connected by exactly one edge (between α and β). For any reasonable measure of closeness, a node will feel closer to other nodes within its community than to those in a different community; the Generalized Erdös numbers (GENs), Jaccard coefficients (JCs), and overlap are explicitly shown to have this property. Resistance distance, the mean first passage time between nodes, and the Adamic/Adar coefficient [1] all behave in a similar manner (not shown). There is a clear separation between in-community and out-of-community closenesses for the network in Fig. S1(a), which can be used to determine the correct community structure: each node i is constrained to be in the same community as its closest friend, f(i). This is similar in spirit to the resistance-distance approach of Wu and Huberman [2], but does not require an arbitrary threshold for defining communities. Each measure of closeness will behave differently when fuzzy communities are detected, with some outperforming others in their ability to detect communities (as discussed in the main text). We note that no node can be in a community by itself using this approach, since every connected node necessarily has a closest friend. It may be possible to remove this restriction by introducing self-loops into the network, but we leave this to later work.
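The closest-friend rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results: the closeness measure is the Jaccard coefficient computed on closed neighbourhoods (each node's neighbours plus the node itself, a convention assumed here), ties go to the lowest node index, and the function names are hypothetical.

```python
from itertools import combinations


def closed_neighbourhoods(edges):
    """Map each node to its neighbours plus itself (assumed convention)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)
    return nbrs


def jaccard(nbrs, i, j):
    """Jaccard coefficient of the two neighbourhoods (larger means closer)."""
    return len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])


def closest_friend_communities(edges):
    """Place every node in the same community as its closest friend f(i)."""
    nbrs = closed_neighbourhoods(edges)
    nodes = sorted(nbrs)
    # max() returns the first maximum, so ties resolve to the lowest index.
    friend = {
        i: max((j for j in nodes if j != i), key=lambda j: jaccard(nbrs, i, j))
        for i in nodes
    }

    # Union-find: joining i with f(i) yields the connected groups of friends.
    parent = {i: i for i in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, f in friend.items():
        parent[find(i)] = find(f)

    groups = {}
    for i in nodes:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two 4-node cliques joined by a single edge, in the spirit of Fig. S1(a).
edges = (list(combinations(range(0, 4), 2))
         + list(combinations(range(4, 8), 2))
         + [(3, 4)])
print(closest_friend_communities(edges))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Swapping `jaccard` for another closeness function (for example one built on the GENs, where smaller values mean closer and `max` would become `min`) leaves the grouping step unchanged.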
In Fig. S1(a), it is important to note that it is not possible to continuously tune an arbitrary parameter to find different partitions. Using modularity maximization with resolution parameter γ as an example, at γ = 1 we expect to detect the correct partition of two communities. For γ → 0, we expect to find only a single community that includes all nodes in the network. No reasonable measure of closeness will ever produce this coarsest partition of Fig. S1(a), since it would require a node in c_1 to feel closer to nodes to which it has fewer connections than to those to which it has many connections. Regardless of the closeness measure chosen (and even if a tunable free parameter is included in the measure), a single community cannot be detected using the CF approach so long as nodes feel closer to their neighbors than to their non-neighbors. The coarser partition of a single community is, however, readily detected using the hierarchical approach described in the text.
In Fig. S1(b), we show a pathological network topology for which the CF method will fail: two distinct communities with each node connected to a single, central node (δ). Most measures of closeness will find that all nodes feel closest to δ (and all reasonable measures will, so long as the intra-community edges are sufficiently sparse and the network sufficiently large), so the CF approach will assign all nodes to the same community as δ. In such a case, only a single community will be (incorrectly) detected. This can be avoided by searching for the closest unpopular friend (CUF): after sorting the other nodes from closest to farthest as felt by node i, the closest node whose degree is less than or equal to that of the next-closest node is selected as f(i). Note that for the closest unpopular friend algorithm on a weighted graph, we still search for lower degree k_i rather than strength W_i, which avoids nodes that are connected to many other nodes (k_i ≫ 1) but not nodes that are strongly connected to a few nodes (W_i ≫ 1). The node δ is assigned to the community of its own closest unpopular friend, which will depend on the details of the network topology and the choice of closeness measure.
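The closest-unpopular-friend selection can be sketched as a single pass over the ranked candidates. This is an illustrative reading of the rule, assuming the candidates are already ordered from closest to farthest; the fallback to the last candidate when no node qualifies is an assumption made here, and the function name is hypothetical.

```python
def closest_unpopular_friend(ranked, degree):
    """Select f(i) from candidates ranked closest-first.

    Returns the first candidate whose degree is less than or equal to the
    degree of the next-closest candidate. If no candidate qualifies, the
    last one is returned (an assumed fallback, not taken from the text).
    """
    for current, following in zip(ranked, ranked[1:]):
        if degree[current] <= degree[following]:
            return current
    return ranked[-1]


# The hub "delta" is closest but has a much higher degree than the
# next-closest node, so it is skipped in favour of "a".
degree = {"delta": 10, "a": 3, "b": 3, "c": 4}
print(closest_unpopular_friend(["delta", "a", "b", "c"], degree))  # a
```

Only the degree k_i enters the comparison, so a node with a few very heavy edges is not treated as popular, in line with the weighted-graph remark above.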

Fractured Communities
While the CF and CUF algorithms provide an intuitive method for detecting communities in an arbitrary graph, it is possible for a correct community to be fractured into two or more parts due to the local variability of the density of edges. As pictured in Fig. S2, an intended community A may be split into two groups, A_1 and A_2, due to the asymmetric connections each of these sub-communities has to the communities B and C. As pictured, either there are more edges leading from A_1 to B than there are between A_1 and C, or the total weight between A_1 and B is larger than that between A_1 and C. While useful information could be found in the structure of the fractured communities, it is also desirable to recover the 'correct' communities despite these local variations. In order to produce a more useful method for community detection, we must supplement both of these approaches with an algorithm that merges fractured communities to better recover the 'correct' partition.
The fracture of communities can be due to two aspects of the detection: first, the decisions are purely local (even if the closeness measure incorporates the global topology). Because the decisions are not made with a global quality function, the splitting of a community into two pieces is not penalized. Second, the random nature of the networks allows for variability in the local density of edges. These fluctuations in density will affect all community detection methods [3], and in some cases may call into question the 'correctness' of the intended partition.
In Fig. S2, the detected groups A_1 and A_2 (which are in the same community in the 'correct' partition) will likely have a large number of edges between them. If we imagine that a single community is mistakenly broken into two sub-communities of the same size (n nodes apiece), the number of edges between the sub-communities should scale as n^2 (with a uniform density of edges in the correct community A). This allows us to build a relatively simple greedy search for communities to merge. Once a CF or CUF partition has been determined, we search for the pair of communities g and h with the largest value of k_{g→h}/max(n_g, n_h)^2, where k_{g→h} is the number of edges leading from group g to group h and n_g is the number of nodes in group g. Before we merge the communities g and h, we check that k_{g→h} ≥ min(k_{g→g}, k_{h→h}). If the inequality is not satisfied (i.e. there are fewer edges between g and h than there are within either g or h alone), the greedy search is halted; otherwise g and h are merged and the search repeats.
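The greedy merge can be sketched as follows for an unweighted edge list. This is a minimal sketch, not the code used for the figures: it assumes the halting test compares the number of edges between g and h against the smaller of the two internal edge counts (as the parenthetical explanation describes), and the function name is hypothetical.

```python
from itertools import combinations


def merge_fractured(communities, edges):
    """Greedily merge communities until the halting test fails."""
    comms = [set(c) for c in communities]
    while len(comms) > 1:
        label = {v: idx for idx, c in enumerate(comms) for v in c}
        # counts[(a, b)] with a <= b: edges between (or within) communities.
        counts = {}
        for u, v in edges:
            a, b = sorted((label[u], label[v]))
            counts[(a, b)] = counts.get((a, b), 0) + 1

        # Pair with the largest k_gh / max(n_g, n_h)^2.
        best, best_score = None, 0.0
        for (a, b), k in counts.items():
            if a == b:
                continue
            score = k / max(len(comms[a]), len(comms[b])) ** 2
            if score > best_score:
                best, best_score = (a, b), score
        if best is None:
            break  # no edges between any pair of communities

        g, h = best
        k_gg = counts.get((g, g), 0)
        k_hh = counts.get((h, h), 0)
        if counts[best] < min(k_gg, k_hh):
            break  # fewer edges between g and h than within either: halt
        comms[g] |= comms[h]
        del comms[h]
    return comms


# A 6-node clique wrongly split into two triangles, plus a separate triangle
# attached by a single edge.
edges = (list(combinations(range(6), 2))
         + list(combinations(range(6, 9), 2))
         + [(5, 6)])
merged = merge_fractured([{0, 1, 2}, {3, 4, 5}, {6, 7, 8}], edges)
print(sorted(sorted(c) for c in merged))  # [[0, 1, 2, 3, 4, 5], [6, 7, 8]]
```

In the example the two triangles share 9 edges against 3 internal edges each, so they are merged; the remaining pair shares a single edge and the search halts.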
In Fig. S3, we use the GENs to compare the CF and CUF methods with and without community merging (averaged over 100 realizations of the network). We see that the CUF approach with merging gives the best overall results, with the largest normalized mutual information [4,5] (as defined in eq. 2 of the main text) for all values of k_out, although with only a moderate improvement over the CF approach. However, the CUF approach is more prone to community fracture (the black circles do not converge to I = 1 as k_out → 0), and greedy merging is therefore essential for reliable reconstruction of the network. We note that the merging of fractured communities as implemented here could also be used with modularity-maximizing methods and may reduce the spurious splitting of communities in some cases.
As noted in Fig. S2, variability in the local density of edges can lead to fractured partitions. The propensity of modularity maximization for finding spurious sub-communities can perhaps be most clearly seen by considering a random network without any community structure. We generate networks in which the probability of an edge between any pair of nodes is p_edge, with 0.04 ≤ p_edge ≤ 1. In Fig. S4, we show the number of communities detected in these networks using both greedy modularity maximization (triangles) and the CUF approach (squares and circles). The CUF performs far better than greedy modularity maximization for p_edge ≳ 0.1, while modularity maximization consistently finds more than one community for all p_edge < 1. For small p_edge, we expect that fluctuations in the edges will randomly produce regions with a locally higher density of edges, which may be detected by any community detection method [6]. However, as p_edge increases, these fluctuations should become less significant, and the CUF result of a single community may be preferable.

A Common Hierarchical Benchmark
In order to detect a hierarchy of community structure in the network, it is natural to define a coarse-grained network in which the communities of the higher-resolution partition act as nodes in a new, lower-resolution network. However, it is not immediately obvious how to choose the edges in the new network. Rather than attempt to define coarse-grained edges at this resolution, we take the closeness between the coarse-grained nodes to be the average closeness between communities in the high-resolution partition. In the particular case of the GENs, the harmonic mean is the appropriate way to average the closeness, as the closeness between communities should be dominated by the nodes in each that feel close to one another (small E_ij, thus more significant in the harmonic mean), rather than the nodes that do not feel close to one another (large E_ij, less significant in the harmonic mean). For other closeness measures, a linear mean may be the more reasonable choice for averaging the closeness between communities.
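The effect of the choice of average can be seen in a short sketch. It assumes the node-level GEN values are available as a nested mapping `E[i][j]` (a hypothetical data layout) and that the average runs over all pairs with one node in each community; the function names are illustrative only.

```python
def harmonic_mean_closeness(E, g, h):
    """Harmonic mean of E[i][j] over all pairs i in g, j in h."""
    pairs = [(i, j) for i in g for j in h]
    return len(pairs) / sum(1.0 / E[i][j] for i, j in pairs)


def linear_mean_closeness(E, g, h):
    """Arithmetic mean over the same pairs, shown for comparison."""
    pairs = [(i, j) for i in g for j in h]
    return sum(E[i][j] for i, j in pairs) / len(pairs)


# Toy values (hypothetical): node 0 feels close to node 1 and far from node 2.
E = {0: {1: 1.0, 2: 3.0}}
print(harmonic_mean_closeness(E, [0], [1, 2]))  # approximately 1.5
print(linear_mean_closeness(E, [0], [1, 2]))    # 2.0
```

The harmonic mean lies nearer the small (close) value than the linear mean does, which is the behaviour argued for above when small E_ij should dominate.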
It is worth re-emphasizing that we do not expect to be able to continuously tune the resolution of the coarse-grained network with a free parameter. Our ability to detect hierarchical structure of course depends on the accuracy of the higher-resolution partition (an inaccurate partition is unlikely to yield the correct macrocommunities) and on the existence of a 'correct' hierarchical partition. If nodes in communities g and h are very close to one another in the original network, a reasonable method of averaging should ensure that they are close in the coarse-grained network. While the detected partition will depend weakly on the method of coarse graining, it is not possible to tune the averaging as it is in the case of modularity maximization (or other approaches), where choosing a resolution γ ≪ 1 will assign all coarse-grained nodes to the same community.
We apply our coarse-graining approach to detect the community structure of the benchmark presented by Reichardt and Bornholdt [7], depicted in Fig. S5. The network of N = 512 nodes is composed of 16 microcommunities with on average k_in = 16 edges internally per node. Four of these microcommunities form a macrocommunity, with on average k_out edges per node within a macrocommunity and k_mix edges per node between macrocommunities. Note that this is the benchmark that is modified in order to produce the benchmark of variable robustness described in the main text. The mutual information between the correct and detected partitions of the microcommunities (using the CUF approach with the GENs as the closeness measure) is shown in Fig. S6(a) for varying k_out and k_mix. The microcommunities are detected accurately for small k_out, with the transition from 'good' to 'bad' detection occurring for k_out + k_mix ≈ 34 (the point at which I = 0.5, averaged over the four curves shown), more than twice the value of k_in = 16. It is worth noting that for the larger values of k_out (12 or 14), the failure to saturate to I = 1 at k_mix = 0 often occurs because the method detects the macrocommunity structure of 4 macrocommunities rather than the microcommunity structure of 16 communities. For sufficiently dense connections within the macrocommunities, the CUF method does fail to detect the finest resolution of the network. However, so long as the microcommunities are accurately detected, the macrocommunity structure is also correctly determined (as shown in Fig. S6(b)). For k_out = 16, we generally fail to find the macrocommunity structure because of the poor detection of the fine-resolution structure, while for k_out = 12 or 14, the macrocommunity structure is not reliably found as k_mix → 0. Modularity-based methods or other approaches may outperform these results [7,8] if the correct (but a priori unknown) resolution parameter is chosen. However, our approach gives a single partition for each scale (both micro- and macro-), and performs very well so long as the microcommunities are not too fuzzy (k_out is sufficiently small), without using an unknown parameter.

Common Real-World Community Benchmarks
Modularity maximization performs quite well on the artificial GN benchmark precisely because of the modular structure inherent in the test: the correct solution is also the modularity-maximizing one. This may not be the case in real-world networks, where the 'correct' partition is determined from external information and is independent of the partition's modularity. To see the utility of the CF or CUF methods, we examine three simple real-world benchmarks with an a priori known partition in the main text. The football network [9] is comprised of nodes representing American football teams, with edges denoting games played between them in 2000. The 'correct' partition groups each team within its externally defined division. The political blogs network [10] is a set of blogs in the leadup to the 2006 US midterm election, with an edge representing a link from one blog to another (we use an undirected version of this network). The political books network [11] is a set of books purchased on amazon.com around the 2004 US presidential elections, with an edge representing a co-purchase of a pair of books. In the political blogs and books networks, the 'correct' partition is the node's apparent political leaning: liberal vs. conservative in the former and liberal, independent, or conservative in the latter. All of these benchmarks are unweighted networks (with w_ij = 0 or 1).
One common benchmark with a known community structure not mentioned in the main text is Zachary's Karate club [12]. This is a very small network of 34 nodes representing members of a karate club at an unnamed university, with edges denoting the out-of-club interactions between individuals. The club split into two parts due to a disagreement over the club's leadership, and the 'correct' partition denotes which individuals fell on a particular side of the disagreement. The Karate club is partitioned using a number of approaches in Fig. S7, showing, from left to right, modularity maximization and CF/CUF using the GENs, the JCs, and overlap. For the Karate club benchmark, we find surprisingly that both overlap and the GENs perform extremely poorly while the Jaccard coefficients (JCs) perfectly reconstruct the correct partition. However, if the closest friend (CF) approach is used (rather than the closest unpopular friend approach, which avoids high-degree nodes when assigning communities and is implemented throughout the main text), the GENs perfectly reconstruct the network, followed by overlap and then by the JCs. This illustrates that pathological networks do indeed exist that have not been fully accounted for in the CUF methodology, and that it is difficult to predict a priori exactly which method will be optimal. We also note that a CF partition can be generated rapidly when generating a CUF partition, and by examining a global quality function (such as modularity), one can easily distinguish which partition better represents the structure of the network. Thus, despite the unexpected behavior of our approach on the Karate club network, we conclude that (a) the GENs remain a reasonable choice for the closeness measure and (b) it may be necessary to compare the results of the CUF approach using a global quality function (such as modularity) to determine whether the partition is reasonable.

Simulated Annealing of the Benchmark
In order to generate the network used in benchmarking the community detection and robustness measure in the main text, we used simulated annealing to produce a network with the desired properties. The desired in-, out-, and mixing-degrees of each node were computed: k^{in,0}_i, the desired number of edges from node i to nodes in its microcommunity; k^{out,0}_i, the desired number of edges leading from i to any node in its macrocommunity (but not in its microcommunity); and k^{mix,0}_i, the desired number of edges leading from i out of its macrocommunity. From these the total number of edges M = (1/2) Σ_i (k^{in,0}_i + k^{out,0}_i + k^{mix,0}_i) was determined, and a network of N = 512 nodes was generated having precisely M randomly distributed edges. The network was then randomly rewired, with a new trial configuration generated by removing one edge connecting the randomly chosen nodes i and j and drawing a new edge between i and k. This trial configuration was accepted using a Metropolis criterion, p_acc = min(1, e^{−β(E_trial − E_old)}), with E the energy of a configuration; the first term of E is minimized if the in-, out-, and mix-degrees of each node satisfy our desired conditions. The inverse-temperature parameter β is set to β = 1 initially, and incrementally increased by 2×10^−5 at each attempted rewiring. A total of 500,000 rewiring attempts were made, with each edge on average experiencing ≈ 975 attempted rewirings.
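The acceptance rule and annealing schedule can be sketched generically. This is a hedged illustration only: the energy function is supplied by the caller because its explicit form is not reproduced here, the acceptance rule follows the usual Metropolis convention that moves lowering E are always accepted, and `anneal`, `propose`, and the toy energy in the usage lines are hypothetical names and choices.

```python
import math
import random


def acceptance_probability(e_old, e_trial, beta):
    """Metropolis rule: always accept downhill moves, uphill with exp(-beta*dE)."""
    if e_trial <= e_old:
        return 1.0
    return math.exp(-beta * (e_trial - e_old))


def beta_schedule(step, beta0=1.0, increment=2e-5):
    """Inverse temperature after `step` attempted moves (linear ramp)."""
    return beta0 + increment * step


def anneal(state, energy, propose, steps, seed=0):
    """Generic annealing loop; `energy` and `propose` are user-supplied."""
    rng = random.Random(seed)
    e_old = energy(state)
    for step in range(steps):
        trial = propose(state, rng)
        e_trial = energy(trial)
        if rng.random() < acceptance_probability(e_old, e_trial, beta_schedule(step)):
            state, e_old = trial, e_trial
    return state, e_old


# Toy usage: a one-dimensional walker with energy |x| (not the network energy).
final_state, final_energy = anneal(
    5, energy=abs, propose=lambda x, rng: x + rng.choice([-1, 1]), steps=1000
)
print(beta_schedule(500_000))  # approximately 11 after 500,000 attempts
```

For the network case, `propose` would remove one edge (i, j) and draw a new edge (i, k), and `energy` would penalize deviations of each node's in-, out-, and mix-degree from its target.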

How Ties are Handled
Unlike many real-world networks, the network in Fig. S1(a) is highly symmetric, and the closeness between nodes in groups A or B is likewise symmetric, so that there is not a unique closest friend. In this case, we must develop a rule for handling ties in the closeness. In the case of a tie, we randomly but consistently select the 'closest' neighbor of i, f(i). This is accomplished by initially randomizing the node index, and choosing the node with the lowest (random) index as closest. In practice, the importance of ties in artificial or real-world networks depends on the choice of closeness measure. The Jaccard coefficient JC_ij = |C_i ∩ C_j| / |C_i ∪ C_j| can easily produce ties [13] for complex networks, whereas the GENs require highly symmetric networks to see a tie. The lack of ties is an additional advantage of measures that incorporate the global topology of the network, rather than purely local information.
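The random-but-consistent tie rule can be sketched as a one-time shuffle of the node indices followed by a lowest-index choice among tied candidates. The sketch assumes a closeness score where larger means closer (as with the JCs); the function name and the fixed seed are illustrative choices.

```python
import random


def make_tie_breaker(nodes, seed=0):
    """Randomize the node index once; ties then go to the lowest random index."""
    order = list(nodes)
    random.Random(seed).shuffle(order)
    rank = {v: r for r, v in enumerate(order)}

    def pick(closeness):
        """closeness: dict mapping candidate node -> score (larger = closer)."""
        best = max(closeness.values())
        tied = [v for v, c in closeness.items() if c == best]
        return min(tied, key=rank.__getitem__)

    return pick


pick = make_tie_breaker(["a", "b", "c", "d"])
scores = {"a": 1.0, "b": 1.0, "c": 0.5}
# The same tied candidate is returned on every call.
print(pick(scores) == pick(scores))  # True
```

Because the shuffle happens once, repeated queries on the same network always resolve a given tie the same way, which keeps the resulting partition reproducible for a fixed seed.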

Details of the DASH robustness
The DASH database, downloaded in June 2010, contained N_0 = 918 journals and 2404 articles published by 3385 unique author names, not all of whom work at Harvard. Because of the interdisciplinary and highly connected nature of the journals Science, Nature, and Proc. Natl. Acad. Sci., these three journals are removed from the network. This removal does not alter the shape of either the degree or weight distributions (although the removal of edges does affect their particular fitting parameters).
While briefly discussed in the text, it is worthwhile to examine the structure of the DASH network in detail, to determine the power of the degree of robustness in finding complex topologies or incorrectly assigned nodes. When we examine the degrees of robustness observed in the network, nodes with few edges connecting them to their community have a correspondingly low degree of robustness, reflecting the fact that they are only weakly connected to their assigned community. Low values of the degree of robustness D^(1)_i for these weakly connected nodes are unsurprising. We can also use the degree of robustness to find nodes that are on the boundary between communities (i.e. that are strongly connected both to their assigned community and to a different community to which they are not assigned). We find 142 nodes with D^(1)_i ≤ 2, 53% of which have k^in_i ≤ 2 (indicating that they are simply of low degree, rather than on the boundary of a community). However, there are a few nodes that have D^(1)_i ≤ 2 but are strongly connected to their respective communities (having high degree and weight directed into c_i). Due to their large values of k^in_i, these nodes are most likely on the boundary of their respective communities. The five journals with the smallest non-zero values of D^(1)_i/k^in_i are listed in Table S1. Some of these journals have k^in_i ≪ k_i (so many edges lead from i to different communities), while others have k^in_i ≈ k_i (so most of the edges from i are within its assigned community). Examining the topology of the DASH network around these boundary-like nodes shows two distinct causes of high in-degree and low degree of robustness.
Cognition, the second journal in Table S1, has more than twice as many out-edges as in-edges, but these out-edges are distributed amongst a wide range of communities. In Table S1, Cognition has the most weight (W^in_i = 14) directed towards its community (Phys. Sci. 4, primarily focused on Oceanography and Atmospheric Science), but has a large weight of 12 directed towards the Phys. Sci. 3 community (focused primarily on Psychology and Neuroscience, a more natural choice of community assignment for Cognition). It is likely that this node was incorrectly assigned, but the fact that the highest weight points towards Phys. Sci. 4 makes the misassignment understandable. The degree of robustness has allowed us to locate this possible error with ease, while the in-degree (k^in_i = 8), the total degree, the ratio of in- to total degree (k^in_i/k_i = 0.32, the 17th worst of all journals), or the ratio of in- to total strength (W^in_i/W_i = 0.34, the 10th worst of all journals) would not highlight Cognition as a particularly troublesome node.
The other journals in Table S1 all have a low degree of robustness for a different reason. For these, the largest number of edges point towards their assigned communities, and in all but one case (the Journal of Economic History) the largest weights are also pointed towards their respective communities. However, in each case the journal is connected to the 'core' of a different community: nodes in a different community with both high in-weights or in-degrees and high robustness. While the assignment of each node in Table S1 to its respective community is often reasonable (since the majority of edges are within its assigned community), each of these nodes is also connected to one or more nodes that effectively define a neighboring community. These journals act as a bridge between the (generally less robust) communities to which they are assigned and the core of a robust, strongly connected community.
It is also of interest to determine the quality of the assignment of each microcommunity to its macrocommunity. The thin black lines in Fig. 2 of the main text denote the macrocommunity robustness r^(2)_c = ⟨D^(2)_i⟩_{i∈c} of each assignment. We note that a robust microcommunity (with high r_c = ⟨D^(1)_i⟩_{i∈c}) does not necessarily imply a robust assignment to its macrocommunity, and that many well-formed microcommunities have a very low value of r^(2)_c. Table S2 shows that the lowest values of r^(2)_c typically occur for communities that have relatively few out-edges (and thus their assignment to their macrocommunity is expected to be fragile). However, the assignment of the Philosophy and History 1 (PH1) microcommunity to its macrocommunity is surprising, as it has a very low ratio of in- to out-degree and of in- to out-strength. While the placement of PH1 in the Philosophy and History macrocommunity may appear to be an error, the surprising assignment is due to the fact that 75% of the out-of-macrocommunity edges and 84% of the out-of-macrocommunity weight are due to only two journals: the strong connections that Social Studies of Science and Annual Review of Sociology have towards Mathematical Sciences 3 (also focused on the Social Sciences). There are three journals in PH1 that are connected to the Philosophy and History macrocommunity: Isis, Perspectives on Science, and Journal of the History of Ideas. Two of these journals are in the 'core' of PH1 (with D^(1) = 17), while only one of the journals strongly connected to Math. Sci. is in the core (with D^(1) = 16). Thus, the assignment of PH1 to the Philosophy and History macrocommunity is due to the fact that, while more weight is directed out of the assigned macrocommunity, the core journals of PH1 are more strongly connected to Philosophy and History journals. PH1 is clearly boundary-like, and our robustness measure r^(2)_c accurately detects this fragile assignment.
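The community-level robustness can be computed from node-level values in one line per community. The sketch below takes the average over members to be an unweighted arithmetic mean, which is an assumption; the node names and robustness values are invented for illustration and are not taken from the DASH data.

```python
from statistics import mean


def community_robustness(node_robustness, communities):
    """Average the node-level degree of robustness over each community."""
    return {
        name: mean(node_robustness[i] for i in members)
        for name, members in communities.items()
    }


# Hypothetical values: one robust community and one fragile one.
D = {"j1": 17, "j2": 16, "j3": 1}
communities = {"robust": ["j1", "j2"], "fragile": ["j3"]}
print(community_robustness(D, communities))  # {'robust': 16.5, 'fragile': 1}
```

The same function applies at either level of the hierarchy, depending on whether the node-level values passed in refer to the micro- or macrocommunity assignment.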

Additional Details of the Phys. Rev. Network
The Physical Review network included over 462,000 articles published in any Physical Review journal up to July 2010. Due to the size of the network, we consider only the subset of articles that have garnered at least 100 citations, with the largest connected component including 3651 articles and over 16,000 edges. While the network is unweighted (one citation is neither stronger nor weaker than another, thus w_ij = 0 or 1) and directed (article i cites article j, but not vice versa), we consider the undirected version (with w_ij = w_ji = 0 or 1). The community structure at one resolution of the Phys. Rev. network up to 2007 has previously been determined [14]. The detected communities are similar in many respects to the community structure we have detected, although the earlier work did not report an examination of any additional hierarchical structure, as we discuss in the main text.

Figure Captions and Tables
Figure S1: Detecting communities with the CF and CUF methods. (a) An example of two clearly defined communities (c_1 and c_2), each of size N/2, with exactly one edge connecting them. Any plausible measure of closeness based on the network topology will clearly distinguish between intra- and inter-community connections. The closeness between nodes within the community is shown as measured by the GENs (E^in), JCs (J^in), and overlap (O^in), with |c| the number of nodes in each community. Likewise, the closeness between nodes in different communities is shown with the superscript 'out'. (b) A schematic network of a single highly connected node (δ) to which all nodes in the network will feel closest. Assigning each node to the same community as its closest friend (the CF approach) will assign all nodes to the same community as δ, thus detecting only one community. By avoiding high-degree nodes (the CUF approach), the two communities are correctly detected, with δ assigned to one or the other.
Figure S2: Merging of fractured communities. Community A is fractured into two communities, A_1 and A_2, due to the fact that A_1 is more strongly connected to B (connections labelled 'stronger') than to C (connections labelled 'weak'), while community A_2 is more strongly connected to C than to B. In this coarse-grained schematic, 'stronger' may represent either high weight or many edges between them. Because A_1 and A_2 are truly subsets of the same community in the 'correct' partition, we expect a large number of edges between them.
Figure S3: Improvements in the methods using fracture merging. A comparison of the approaches for community detection using the GENs on the Newman-Girvan benchmark. Red squares denote the CUF after community merging, which gives the best overall results. Black circles denote the result of the CUF without merging, which has a low mutual information with the expected partition due to fracture (even for clear communities, with low k_mix). The blue up and purple down triangles are the results for the CF algorithm with and without fracture correction, respectively.
Figure S4: Community detection in unstructured networks. The number of communities n_c detected using greedy modularity maximization (up and down triangles) or the CUF method (squares and circles) for a randomly linked network (with no intended community structure) as a function of the probability of an edge between two nodes, p_edge. Greedy modularity maximization is shown in the purple down triangles for N = 100 nodes and black up triangles for N = 200 nodes, while the blue circles show CUF detection for N = 100 and the red squares show CUF detection for N = 200. When there is no intended structure in the network, modularity maximization tends to find a relatively large number of communities, while the CUF method typically finds only one community (for sufficiently large p_edge ≳ 0.1).
Figure S5: The adjacency matrix of the Reichardt-Bornholdt hierarchical benchmark. Each node is a member of a microcommunity of 32 nodes, with k_in = 16 connections to the other nodes in its microcommunity on average. Each microcommunity is a member of one of four macrocommunities, and each node in a macrocommunity has k_out edges internally on average. Each node has on average k_mix edges to nodes outside of its macrocommunity.
Figure S6: Accuracy of the hierarchical benchmark. The detection of (a) micro- and (b) macrocommunities averaged over 100 realizations of the network. For all samples, k_in = 16 is held fixed. k_out is varied as k_out = 8 (blue circles), 10 (red squares), 12 (black up triangles), and 14 (purple down triangles). The mixing between macrocommunities is varied with 2 ≤ k_mix ≤ 30. The CUF approach accurately detects the microcommunities over a wide range of values of k_out and k_mix, and is clearly able to detect the microcommunity structure for sufficiently clear communities. So long as the microcommunity structure is accurately detected, the macrocommunity structure seems reliably determined as well.
Figure S7: The Karate club network. Mutual information (a) and modularity (b) of the partitions of the Karate club [12] detected by a variety of approaches, with the a priori correct partition known. The leftmost results show the partitions found by greedy (striped red) and Potts model (striped blue) modularity maximization. For the remainder, red denotes the closest friend (CF) approach while blue denotes the closest unpopular friend (CUF) approach, with the GENs, JCs, and overlap shown. Surprisingly, the GENs implemented using the CUF method perform the worst in all respects (in contrast to most other benchmarks, where they perform the best). For the Karate club network, the GENs do reconstruct the exact 'correct' partition if the CF method is used.

Name  Community  D
Table S1: The five most boundary-like nodes (with the lowest non-zero values of D^(1)_i/k^in_i). The first, J. Econ. Hist., has a high degree and strength and large k^in and k^out. Similarly, Cognition has the smallest ratio of in-edges to total node degree, and is connected to a large number of other communities. The last three elements in the table are surprising in that they have only a few connections outside of their communities (k^out_i = 1 or 2 compared to k^in_i = 8 or 10) but still have low degrees of robustness. This is because, while they have many in-community connections, their few out-of-community connections lead to strong, central nodes in other communities. These boundary-like nodes would not be easily detected by simply looking at the in-degrees or the in-out ratio.

Community  Focus  r
High En. Physics 0.38 3 6 0 0
Table S2: The five least robust macrocommunity assignments. k^in_c and W^in_c denote the total number of edges and total weight from the microcommunity to other microcommunities in its macrocommunity, respectively, while k^out_c and W^out_c denote the total number and weight of edges into any other macrocommunity. Philosophy and History 1 (PH1) is the worst, and lies on the boundary of the Philosophy and History macrocommunity and the Mathematical Sciences macrocommunity. The other macrocommunity assignments are very fragile due to the very small number of connections, and are peripheral microcommunities.