Characterizing the Community Structure of Complex Networks

Background: Community structure is one of the key properties of complex networks and plays a crucial role in their topology and function. While an impressive amount of work has been done on the issue of community detection, very little attention has so far been devoted to the investigation of communities in real networks.
Methodology/Principal Findings: We present a systematic empirical analysis of the statistical properties of communities in large information, communication, technological, biological, and social networks. We find that the mesoscopic organization of networks of the same category is remarkably similar. This is reflected in several characteristics of community structure, which can be used as “fingerprints” of specific network categories. While community size distributions are always broad, certain categories of networks consist mainly of tree-like communities, while others have denser modules. Average path lengths within communities initially grow logarithmically with community size, but the growth saturates or slows down for communities larger than a characteristic size. This behaviour is related to the presence of hubs within communities, whose roles differ across categories. Also, the community embeddedness of nodes, measured in terms of the fraction of links within their communities, has a characteristic distribution for each category.
Conclusions/Significance: Our findings, verified by the use of two fundamentally different community detection methods, allow for a classification of real networks and pave the way to a realistic modelling of networks' evolution.


I. INTRODUCTION
The modern science of complex systems has experienced a significant advance after the discovery that the graph representation of such systems, despite its simplicity, reveals a set of crucial features that suffice to disclose their general structural properties, function and evolution mechanisms [1][2][3][4][5][6][7]. Representing a complex system as a graph means turning the elementary units of the system into nodes, while links between nodes indicate their mutual interactions or relations. Many complex networks are characterized by a broad distribution of the number of neighbors of a node, i.e. its degree. This is responsible for peculiar properties such as high robustness against random failures [8] and the absence of a threshold for the spreading of epidemics [9].
Another important feature of complex networks is represented by their mesoscopic structure, characterized by the presence of groups of nodes, called communities or modules, with a high density of links between nodes of the same group and a comparatively low density of links between nodes of different groups [10][11][12][13]. This compartmental organization of networks is very common in systems of diverse origin. It was remarked already in the 1960s that a hierarchical modular structure is necessary for the robustness and stability of complex systems, and gives them an evolutionary advantage [14].
Exploring network communities is important for three main reasons: 1) to reveal network organization at a coarse level, which may help to formulate realistic mechanisms for its genesis and evolution; 2) to better understand dynamic processes taking place on the network (e.g., spreading processes of epidemics and innovation), which may be considerably affected by the modular structure of the graph; 3) to uncover relationships between the nodes which are not apparent by inspecting the graph as a whole and which can typically be attributed to the function of the system. Therefore it is not surprising that recent years have witnessed an explosion of research on community structure in graphs. The main problem, of course, is how to detect communities in the first place, and this is the essential issue tackled by most papers on the topic which have appeared in the literature. A huge number of methods and techniques have been designed, but the scientific community has not yet agreed on which methods are most reliable and when a method should or should not be adopted. This is due to the fact that the concept of community is ill-defined. Since the focus has been on method development, very little has been done so far to address a fundamental question of this endeavor: what do communities in real networks look like? This is what we will try to assess in this paper.
Previous investigations have shown that across a wide range of networks, the distribution of community sizes is broad, with many small communities coexisting with some much larger ones [11][15][16][17][18]. The tail of the distribution can often be fitted quite well by a power law. Leskovec et al. [19] have carried out a thorough investigation of the quality of communities in real networks, measured by the conductance score [20]. They found that the lowest conductance, indicating well-defined modules, is attained for communities of a characteristic size of ∼ 100 nodes, whereas much larger communities are more "mixed" with the rest of the network. For this reason they suggest that the mesoscopic organization of networks may have a core-periphery structure, where the periphery consists of small well-defined communities and the core comprises larger modules, which are more densely connected to each other and therefore harder to detect. Guimerà and Amaral have proposed a classification of the nodes based on their roles within communities [21].
However, the fundamental properties of communities in real networks are still mostly unknown. Uncovering such properties is the main goal of this paper. For this purpose, we have performed an extensive statistical analysis of the community structure of many real networks from nature, society and technology. The main conclusion is that communities are characterized by distinctive features, which are common for networks of the same class but differ from one class to another. Remarkably, such characterization is independent of the specific method adopted to find the communities.

II. DATA AND METHODS
As our target is to study the statistical features of communities, we need to employ data sets on large networks containing high numbers of communities of varying size. Our data sets contain $\sim 10^5$ to $10^6$ nodes, with the exception of protein interaction networks (PINs), where the largest available data sets are of the order of $10^4$ nodes.
Table I lists the network datasets we have used, along with some basic statistics. Most of them have been downloaded from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/). Some networks are originally directed (e.g., the Web graph), but we have treated them as undirected. Further details on all networks can be found in the Appendix.
Overall, we have considered five categories of networks:
• Communication networks. This class comprises the email network of a large European research institution, and a set of relationships between Wikipedia users communicating via their discussion pages. Note that in both cases, communication is not necessarily personal but involves, e.g., mass emails, and thus these networks cannot be considered as social networks.
• Internet. Here we have two maps of the Internet at the Autonomous Systems (AS) level, produced by the two main projects exploring the topology of the Internet: CAIDA (http://www.caida.org/) and DIMES (http://www.netdimes.org/).
• Information networks. This class includes a citation network of online preprints in www.arxiv.org, a co-purchasing network of items sold by www.amazon.com and two samples of the Web graph, one representing the domains berkeley.edu and stanford.edu (Web-BS), the other released by Google (Web-G).
• Biological networks. This class contains the PINs of three organisms: fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae) and human (Homo sapiens).
• Social networks. Here we considered four datasets: a network of friendship relationships between users of the on-line community LiveJournal (www.livejournal.com); the set of trust relationships between users of the consumer review site epinions.com; the friendship network of users of slashdot.org; the friendship network of users of www.last.fm.
The problem of choosing a method for detecting communities is a very delicate one. First, very efficient algorithms are needed, because the networks we study are large. This requirement rules out the majority of existing methods. Second, as discussed above, there is no common agreement on an all-purpose community detection method. This is because of the absence of a shared definition of community, which is justified by the nature of the problem itself. Consequently, there is also arbitrariness in defining reliable testing procedures for the algorithms. Nevertheless, there is a wide consensus on the definition of community originally introduced in a paper by Condon and Karp [22]. The idea is that a network has communities if the probability that two nodes of the same community are connected exceeds the probability that nodes of different communities are connected. This concept of community has been implemented to create classes of benchmark graphs with communities, such as those introduced by Girvan and Newman [10] and the graphs recently designed by Lancichinetti et al. [23], which integrate the benchmark by Girvan and Newman with realistic distributions of degree and community size (LFR benchmark). Recent work indicates that some algorithms perform very well on the LFR benchmark [24].
In particular, the Infomap method introduced by Rosvall and Bergstrom [25] has an outstanding performance, and it is also fast and thus suitable for large networks. However, as every community detection method has its own "flavor" and preference towards labeling certain types of structure as communities, relying on a single method is not enough if general conclusions on community structure are to be presented. Therefore we have cross-checked the results obtained by Infomap with those produced by a very different algorithm, the Label Propagation Method (LPM) proposed by Leung et al. [26]. The latter has proven to be reliable on the LFR benchmark and is also fast enough to handle the largest systems of our collection. Detailed descriptions of Infomap and the LPM are given in the Appendix. Here we just point out the profound differences between the two techniques. Infomap is a global optimization method, which aims to optimize a quality function expressing the code length of an infinitely long random walk taking place on the graph. The LPM is a local method instead, where nodes are attributed to the same community where most of their neighbors are. The partitions obtained by both methods for the same network are in general different. However, the general statistical features of community structure do not appear to depend much on the details of partitions.
In the following, only Infomap results will be presented; for LPM, see Appendix.
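The Condon and Karp notion of community referred to above (links are more probable inside groups than between them) can be illustrated with a small planted-partition generator. This is only a sketch: the function names and the parameters `p_in`, `p_out` and `seed` are assumptions made for the example, and it does not reproduce the Girvan-Newman or LFR benchmark constructions.

```python
import itertools
import random


def planted_partition(n_groups, group_size, p_in, p_out, seed=0):
    """Return (edges, membership) for a graph with equally sized planted groups.

    Pairs in the same group are linked with probability p_in,
    pairs in different groups with probability p_out.
    """
    rng = random.Random(seed)
    n = n_groups * group_size
    membership = [v // group_size for v in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, membership


def link_fractions(edges, membership):
    """Fraction of realized links among same-group and among different-group pairs."""
    n = len(membership)
    pairs_in = sum(1 for u, v in itertools.combinations(range(n), 2)
                   if membership[u] == membership[v])
    pairs_out = n * (n - 1) // 2 - pairs_in
    links_in = sum(1 for u, v in edges if membership[u] == membership[v])
    links_out = len(edges) - links_in
    return links_in / pairs_in, links_out / pairs_out


edges, membership = planted_partition(4, 25, p_in=0.3, p_out=0.01)
print(link_fractions(edges, membership))
```

In this picture a graph "has communities" whenever the first fraction clearly exceeds the second.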

III. RESULTS
We begin the analysis by briefly discussing the distribution of community sizes (Fig. 1). We see that, as expected, for each system there is a wide range of community sizes, spanning several orders of magnitude for the largest systems. This is in agreement with earlier studies [11][15][16][17][18]. The overall shapes of the distributions are similar across systems of the same class. Distributions for biological networks show the largest differences, which, however, are likely to result from noise as the networks are smaller. For biological networks, analysis performed with the LPM shows slightly different, well overlapping distributions (see Appendix).

Figure 3: Visualized examples of communities in networks of different classes. Communication networks (a: email, b: Wiki Talk) contain very sparse communities with star-like hubs. These hubs give rise to very low shortest path lengths within communities (see Fig. 4). Star-like hubs are also present in Internet communities (c: DIMES, d: CAIDA), which are relatively sparse as well. The CAIDA community displays a "merged-star" structure fairly typical for these networks (see Appendix). On the contrary, information networks contain dense communities up to large cliques (e: Amazon, f: Web-BS). In biological networks, the larger the community, the less tree-like it is (g: D. melanogaster, h: H. sapiens). Finally, communities in social networks appear on average fairly homogeneous (i: Slashdot, j: Epinions).
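Broad community size distributions such as those of Fig. 1 are usually summarized with logarithmic bins. The sketch below shows one common way to do this; the function name and the `bins_per_decade` parameter are assumptions for illustration, and the exact binning used for the figures is not specified in the text.

```python
import math
from collections import defaultdict


def log_binned_distribution(sizes, bins_per_decade=1):
    """Return a sorted list of (bin_lower_edge, density) pairs.

    Each size s >= 1 falls into the bin [10^(b/B), 10^((b+1)/B)) with
    B = bins_per_decade. Counts are divided by the bin width and by the
    number of samples, so that sum(density * width) equals 1.
    """
    counts = defaultdict(int)
    for s in sizes:
        counts[math.floor(math.log10(s) * bins_per_decade)] += 1
    total = len(sizes)
    result = []
    for b in sorted(counts):
        lo = 10 ** (b / bins_per_decade)
        hi = 10 ** ((b + 1) / bins_per_decade)
        result.append((lo, counts[b] / (total * (hi - lo))))
    return result


# Hypothetical community sizes: many small modules, a few large ones.
sizes = [2, 3, 5, 20, 50, 300]
for lower_edge, density in log_binned_distribution(sizes):
    print(lower_edge, density)
```

Dividing by the bin width is what makes densities comparable across bins whose widths grow by a factor of ten.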
Next, we turn to the topology of the communities, and study the link density of communities and its dependence on community size. The link density of a subgraph is defined as the fraction of existing links to possible links, $\rho = 2t/[s(s-1)]$, where $t$ is the number of its internal links and $s$ its size measured in nodes. Here, we use the scaled link density $\tilde{\rho} = \rho s = 2t/(s-1)$, which also approximately amounts to the average community-internal degree of nodes in the community. We have chosen this measure since it clearly points out the nature of subgraphs. For trees, there are always $s-1$ links, and hence $\tilde{\rho}_{tree} = 2$. On the other hand, for full cliques $\rho = 1$ and hence $\tilde{\rho}_{clique} = s$. In Fig. 2, the dashed lines indicate these limiting cases ($\tilde{\rho}_{tree} = 2$, $\tilde{\rho}_{clique} = s$). We see that the link densities in the communication and Internet networks are very close to the lower limit, which means that their communities are tree-like and contain only few or no loops. In communication networks, the scaled link density does not depend on community size, whereas in Internet graphs large communities appear somewhat denser. Networks in these two classes are the sparsest in our collection, as their very small average degree indicates that they are overall not much denser than trees (see Table I). It should be noted that in general, the intuitive view on communities is that they are "dense" compared to the rest of the network. However, as the methods applied here yield partitions, the communities of a tree-like network are also necessarily tree-like. Contrary to the above, the much denser information networks reveal a different picture, where communities are fairly dense objects, with the scaled density increasing with $s$. Especially in the Amazon network, communities with $s \lesssim 10$ are almost cliques. Social networks show yet another pattern: the scaled density of the modules grows quite regularly with the size $s$, approximately as a power law. Communities in social networks are mostly far from the two limiting cases: they are denser than trees, but much sparser than cliques, with the exception of small communities which appear more tree-like. Finally, the biological networks are characterized by two regimes: for $s \lesssim 10$, communities are very tree-like; for larger values of $s$ the scaled density increases with $s$. In Fig. 3 the characteristic communities of each network class are illustrated.
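The two limiting cases of the scaled link density, 2t/(s − 1), can be checked directly on small graphs. The following is a minimal sketch; the helper names `build_adjacency` and `scaled_link_density` are illustrative and not taken from the original analysis.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def scaled_link_density(community, adj):
    """Scaled link density 2t/(s - 1) of the subgraph induced by `community`."""
    members = set(community)
    s = len(members)
    if s < 2:
        raise ValueError("scaled link density needs at least two nodes")
    # Each internal link is seen from both of its endpoints.
    t = sum(len(adj[v] & members) for v in members) // 2
    return 2 * t / (s - 1)


path = build_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)])
clique = build_adjacency([(u, v) for u in range(5) for v in range(u + 1, 5)])
print(scaled_link_density(range(5), path))    # 2.0, the tree limit
print(scaled_link_density(range(5), clique))  # 5.0, the clique limit (equals s)
```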
The compactness of communities can be measured using the average shortest path length within each community. Fig. 4 displays the average shortest path length as a function of community size s. For all networks, the average shortest path lengths are very small, below 3, with the exception of social networks. Interestingly, all plots reveal the same basic pattern, independently of the network class. For very small communities, the average shortest path length grows approximately as the logarithm of the community size (indicated by the dashed line), which is the "small-world" property typically observed in complex networks [27]. We call these modules microcommunities. For sizes s of the order of 10, however, the increase of the average shortest path length suddenly becomes less pronounced, and several curves reach a plateau. Modules with $\gtrsim 10$ nodes are macrocommunities. The stabilization of the average shortest path length in macrocommunities can be attributed to the presence of nodes with high degree, i.e. hubs, which make geodesic paths on average short. We remark that, since most of our systems have broad degree distributions, shortest path lengths are very short [28], but the sharp transition we observe is unexpected and appears as an entirely novel feature.
For communication networks, there is a plateau at an average shortest path length of ∼ 2 for s > 10. As these communities are tree-like, this indicates that they have a star-like structure where most nodes are connected to a central hub only and thus their distance equals two. For the Internet networks, the joint presence of low density and low distances also means that hubs dominate the structure; here, "merged-star" structures consisting of two or more hubs sharing many of their neighbors were observed (see Fig. 3d). This structure guarantees an efficient communication between the systems' units. On the contrary, information, social, and biological networks have a higher density and hence their short path lengths are due to both the density and the presence of hubs. Hubs play the least dominant role in social networks, as the average shortest path lengths keep slowly increasing also for large s.
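The plateau near 2 for star-like communities follows from simple counting: in a star with one hub and n leaves, hub-leaf pairs are at distance 1 and leaf-leaf pairs at distance 2, so the average is 2n/(n + 1). A minimal breadth-first-search sketch (helper names are illustrative) reproduces this:

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_internal_path_length(community, adj):
    """Mean shortest path length over connected node pairs, using only
    links inside the community."""
    members = set(community)
    total = 0
    pairs = 0
    for source in members:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u] & members:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    if pairs == 0:
        raise ValueError("community has no connected node pairs")
    return total / pairs


# Star: node 0 is the hub, nodes 1..n are leaves.
for n in (9, 99):
    star = build_adjacency([(0, leaf) for leaf in range(1, n + 1)])
    print(n, average_internal_path_length(range(n + 1), star))
```

For n = 9 the value is 1.8 and for n = 99 it is 1.98, approaching 2 as the star grows.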
The above picture is further corroborated by Fig. 5, which displays the ratio of the maximal observed community-internal degree of nodes, $\max(k_{in})$, to $s-1$ as a function of the community size s. This ratio equals unity if at least one node is connected to all other nodes in its community, and thus it quantifies the dominance of hubs within communities. For communication networks, $\max(k_{in})/(s-1)$ is close to unity even for large s, in accordance with the above observations on star-like communities. For Internet, this quantity somewhat decreases with s, as communities may contain multiple hubs which do not connect to all other nodes. In information networks, there are some differences. In the Web graphs, the largest communities contain nodes connecting (almost) the entire community. As the edge density in these communities is high, there may be several such nodes; in a clique, all nodes have degree $s-1$. For biological and social networks, there is a decreasing trend. Especially in social networks, there are few or no dominant hubs in large communities.
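The hub-dominance ratio max(k_in)/(s − 1) is straightforward to compute for a given community. The sketch below (with illustrative helper names) contrasts a star, a path and a clique:

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def hub_dominance(community, adj):
    """Largest community-internal degree divided by s - 1."""
    members = set(community)
    s = len(members)
    if s < 2:
        raise ValueError("hub dominance needs at least two nodes")
    return max(len(adj[v] & members) for v in members) / (s - 1)


star = build_adjacency([(0, leaf) for leaf in range(1, 6)])
path = build_adjacency([(0, 1), (1, 2), (2, 3), (3, 4)])
clique = build_adjacency([(u, v) for u in range(5) for v in range(u + 1, 5)])
print(hub_dominance(range(6), star))    # 1.0: the hub reaches every other node
print(hub_dominance(range(5), path))    # 0.5: no dominant node
print(hub_dominance(range(5), clique))  # 1.0: every node has degree s - 1
```

The ratio alone does not distinguish a star from a clique, which is why it is read together with the link density.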
Let us next take a closer look at the relationship between individual nodes and community structure. Here, the most natural property to investigate is the internal degree $k_{in}$, indicating the number of neighbors of a node in its community. We measure the embeddedness of a node in its community with the ratio $k_{in}/k$, characterizing the extent to which the node's neighborhood belongs to the same community as the node itself. The probability distribution of the embeddedness ratio of all nodes in their respective networks is displayed in Fig. 6. One would straightforwardly assume that on average, the embeddedness of nodes would be fairly large, and a substantial fraction of their neighbors should reside inside their respective communities. However, Fig. 6 shows a more intricate pattern, where smaller values of $k_{in}/k$ are not at all rare. All of our networks are characterized by a substantial fraction of nodes which are entirely internal to their communities, i.e. they have no links outside their community and thus $k_{in}/k = 1$. These correspond to the rightmost data points in each plot, and such nodes typically amount to over 50% of all nodes. These nodes have mostly a low degree (such as the degree-one nodes connected to hubs in communication networks). Networks in the same class follow essentially a very similar pattern. Communication networks and the Internet have very similar-looking profiles, where the distribution has a peak around $k_{in}/k \sim 0.5$. Information networks, instead, have a rather different profile, with an initial smooth increase reaching a plateau at about $k_{in}/k \sim 0.4$. The biological networks, despite the inevitable noise, also show a consistent picture across datasets. They somewhat resemble the communication and Internet networks, with an initial rise until $k_{in}/k \sim 0.5$, followed by a slow descent for larger values. Social networks have a rather flat distribution over the whole range, with little variation from one system to another. This means that there are many nodes with most of their neighbors outside their own community. Most community detection techniques, including the ones we have adopted, tend to assign each node to the community which contains the largest fraction of its neighbors. This implies that if a node has only a few neighbors within its own community, it will have even fewer neighbors within other individual communities. Such nodes act as "intermediates" between many different modules, and are shared between many communities rather than belonging to a single community only. Hence it would be more correct to assign them to more than one community. Overlapping communities are known to be very common in social networks, and dedicated techniques for their detection have been introduced [15][29][30][31][32].
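The embeddedness ratio, internal degree over total degree, can be computed per node once a partition is given. The following sketch uses a toy graph of two triangles joined by one link; the helper names and the example partition are assumptions for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def embeddedness(adj, membership):
    """Map each node with degree > 0 to k_in / k, the fraction of its
    neighbors that share its community."""
    ratios = {}
    for v, nbrs in adj.items():
        if nbrs:
            k_in = sum(1 for w in nbrs if membership[w] == membership[v])
            ratios[v] = k_in / len(nbrs)
    return ratios


# Two triangles {0, 1, 2} and {3, 4, 5} connected by the link 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
ratios = embeddedness(build_adjacency(edges), membership)
print(ratios)  # nodes 2 and 3 have ratio 2/3, all others 1.0
```

A histogram of these ratios over all nodes gives the kind of distribution discussed for Fig. 6.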

IV. DISCUSSION AND CONCLUSIONS
Since the advent of the science of complex networks, its focus has shifted from understanding the emergence and importance of system-level characteristics to mesoscopic properties of networks. These are manifested in communities, i.e. densely connected subgraphs. Communities are ubiquitous in networks and typically play an important role in the function of a complex system: modules in protein-interaction networks relate to specific biological functions, and communities in social networks represent the fundamental level of organization in a society. The dual problem of formally defining and accurately detecting communities has so far attracted most of the attention, at the cost of a lack of understanding of the fundamental structural properties of communities. Our aim in this paper has been to uncover some of these properties.
Our results indicate that communities detected in networks of the same class display surprisingly similar structural characteristics. This is remarkable, as some classes are really broad and comprise systems of different origin (e.g. the class of information networks, which includes graphs of citation, co-purchasing and the Web). The result is verified by two different community detection methods which are both partition-based but rely on entirely different principles. In accordance with earlier results, community size distributions are broad for all systems we have studied. Link densities within communities depend strongly on the network class. The average shortest path length displays similar behavior across all classes, initially increasing logarithmically as a function of community size (microcommunities) and then slowing down or saturating for communities of size $s \gtrsim 10$ (macrocommunities). In combination with our results on link density in communities, the behavior of path lengths reveals a picture where high-degree nodes are very dominant in communities of certain classes (communication, Internet) and play a less important role in the connectivity of others, especially social networks. This picture is corroborated by the analysis of maximal community-internal degrees of nodes. Finally, the probability distribution of the fraction of internal links for nodes also displays a clear signature for each of the considered classes.
The signatures we have found are a sort of network ID, and could be used both to classify other systems and to identify new network classes. Moreover, they could become essential elements of network models, with the advantage of more accurate descriptions of real networks and predictions of their evolution.
Although our results have been obtained using two different methods, their general validity merits some discussion. As the concept of "community" is ill-defined, every method for detecting communities is based on a specific interpretation of the concept. Furthermore, the underlying philosophies of methods can largely differ. Methods requiring that communities are "locally" very dense, such as clique percolation [15], would detect only a few communities in the communication and Internet networks, as they do not consider trees or stars as communities; nevertheless, this result would be consistent for networks of the same class. On the other hand, it is evident that partition-based methods neglect the fact that nodes may participate in multiple communities. However, it is worth noting that whichever method is used, the resulting communities are actual subgraphs of the network under study, i.e. its building blocks. Thus their statistical properties reflect the mesoscopic organization of networks, and our results indicate that this organization is similar within classes of networks.
In Fig. 7 we show the degree distributions for all the networks. The degree distribution spans several orders of magnitude. In Fig. 8 the clustering coefficient [27] of nodes with degree k is plotted as a function of k. The clustering coefficient of a node is defined as the number of links t between its neighbors divided by the maximum possible number of such links, $c = t/[\frac{1}{2}k(k-1)]$. As we can see, the shape of the clustering spectrum is basically the same across all networks, with a rapid decrease of the clustering coefficient with k, except for the Web graphs, which are known to include very dense subgraphs and cliques, for which the clustering coefficient can be appreciably high also for nodes of degree ∼ 100. In Fig. 9 we report the average degree $k_{nn}$ of the neighbors of nodes with degree k, again as a function of k [33]. Communication networks, the Internet and the Web graphs are clearly disassortative; the other networks are either moderately disassortative or do not exhibit a particular correlation. Only the LiveJournal friendship network has an assortative pattern for intermediate degree values.
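The degree correlation measure described above, the average degree of the neighbors of nodes with degree k, can be sketched as follows. The helper names are illustrative; a star is used as the extreme disassortative case, where low-degree nodes attach only to the high-degree hub.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_neighbor_degree_by_degree(adj):
    """Map each degree k to the mean, over nodes of degree k, of the
    average degree of their neighbors."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k == 0:
            continue
        sums[k] += sum(len(adj[w]) for w in nbrs) / k
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


star = build_adjacency([(0, leaf) for leaf in range(1, 6)])
print(average_neighbor_degree_by_degree(star))  # leaves (k=1) see degree 5, the hub (k=5) sees degree 1
```

A curve that decreases with k, as here, is the disassortative pattern; an increasing curve would indicate assortative mixing.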

Appendix B: The community detection methods
In this section, we briefly explain the two community detection algorithms. For a detailed description, the reader is referred to the original publications.
Infomap [25] is based on the idea that a random walker exploring the network should get trapped inside dense modules for a fairly long time, and cross the boundaries of modules only infrequently. This simple idea is formalized by considering the problem of finding the optimal description of the path of the walker, which can be achieved by labelling every node with a prefix given by a unique name for the module it belongs to and a suffix given by a unique name within its module. The labels of nodes, while unique within their module, can be recycled in different modules to achieve the most compressed description. According to such a two-level description, given a partition of the graph, one can compute the amount of information needed to describe the path of the walker. If the network has a well-defined community structure, the code length of the two-level description may be shorter than the code length of the one-level description, in which each node has a unique name, as the walker will perform most of its steps within each module and comparatively few between the modules. In this way, the recycling of the labels leads to a more compact description of the process. The problem Infomap solves is then to find the partition which gives the smallest description length. This optimization problem is solved using a greedy optimization algorithm in order to obtain the results in reasonable time. The use of random walks makes the method naturally generalizable to the case of directed and weighted graphs. For directed graphs, due to the possibility of having dangling ends, which are sinks for the diffusion process, it is necessary to introduce a teleportation factor, similarly to Google's PageRank algorithm [34].
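To make the comparison between one-level and two-level descriptions concrete, the sketch below evaluates the two-level code length for a given partition of an undirected, unweighted graph, using the commonly quoted expanded form of the map equation. It is only an illustration under stated assumptions: node visit rates are taken as degree/2m, exit rates as the fraction of link ends leaving a module, there is no teleportation, and the greedy search that Infomap performs is not included. Function names are hypothetical.

```python
import math
from collections import defaultdict


def plogp(x):
    """x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def map_equation(edges, membership):
    """Two-level description length, in bits per random-walk step."""
    degree = defaultdict(int)
    exit_links = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if membership[u] != membership[v]:
            exit_links[membership[u]] += 1
            exit_links[membership[v]] += 1
    two_m = 2 * len(edges)
    p_node = {v: k / two_m for v, k in degree.items()}  # stationary visit rates
    p_mod = defaultdict(float)
    for v, p in p_node.items():
        p_mod[membership[v]] += p
    q_mod = {i: exit_links[i] / two_m for i in p_mod}   # module exit rates
    q = sum(q_mod.values())
    return (plogp(q)
            - 2 * sum(plogp(qi) for qi in q_mod.values())
            - sum(plogp(p) for p in p_node.values())
            + sum(plogp(q_mod[i] + p_mod[i]) for i in p_mod))


# Two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
one_level = map_equation(edges, {v: 0 for v in range(6)})
two_level = map_equation(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
print(one_level, two_level)
```

For this toy graph the two-module partition needs about 2.32 bits per step against about 2.56 bits for a single module, so recycling node names within modules does pay off.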
The Label Propagation Method [35] basically simulates the spreading of labels based on the simple rule that at each iteration a given node takes the most frequent label in its neighborhood. The starting configuration is chosen such that every node is given a different label and the procedure is iterated until convergence. This method has the drawback that it can produce partitions with very big clusters, because a few labels may propagate over large portions of the graph. The LPM version that we used in our analysis is a modification by Leung et al. [26] that handles this problem by introducing a hop score which tells how far a certain label is from its origin. The hop score is decreased while the label spreads through the network and this improves the quality of the partitions found by the method.
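The basic label-spreading rule can be sketched in a few lines. Note the assumptions: this is the plain rule only, without the hop score of Leung et al. [26]; nodes are updated one at a time in sorted order; and ties are broken deterministically (keep the current label if it is among the most frequent, otherwise take the smallest) instead of randomly, purely to make the example reproducible.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def label_propagation(adj, max_sweeps=100):
    """Return a node -> label map after iterating the majority rule."""
    labels = {v: v for v in adj}          # every node starts with its own label
    for _ in range(max_sweeps):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue
            counts = Counter(labels[w] for w in adj[v])
            best = max(counts.values())
            top = {lab for lab, c in counts.items() if c == best}
            # Deterministic tie-breaking (an assumption of this sketch).
            new_label = labels[v] if labels[v] in top else min(top)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:                   # converged: a full sweep changed nothing
            break
    return labels


clique_a = [(u, v) for u in range(4) for v in range(u + 1, 4)]
clique_b = [(u, v) for u in range(4, 8) for v in range(u + 1, 8)]
print(label_propagation(build_adjacency(clique_a + clique_b)))
print(label_propagation(build_adjacency(clique_a + clique_b + [(3, 4)])))
```

With two disconnected 4-cliques the procedure returns two labels. With this particular tie-breaking rule, adding the single bridge 3-4 lets one label take over both cliques, a small-scale instance of the large-cluster problem that the hop score is meant to mitigate.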
Appendix C: Main results from the Label Propagation Method
In order to verify that our results are not due to the method alone, but represent real features of the mesoscopic organization of the networks, we have carried out the analysis presented for Infomap in the main paper with the Label Propagation Method as well. The following plots show the characteristics presented in the main paper obtained via the Label Propagation Method. Results are consistent with those obtained with Infomap.

Appendix D: Further Statistics on Community Properties
In this section, we want to show some other statistical properties of the modules. All figures display the results obtained using Infomap (upper panel) and the Label Propagation Method (lower panel).
As in the main article only the average values of link densities are shown, we first want to show what the probability distribution of the link density ρ of communities looks like. Fig. 15 shows that in all the systems there are dense modules together with sparser modules. Nevertheless, there is a dependency on the size of the modules: Fig. 16 shows what happens if we discard very small communities, with fewer than 3 nodes (s < 3), and Fig. 17 displays what is left when we consider fairly big modules, s > 10; only social and information networks include dense modules even after this filtering.
Next, we show the average internal clustering coefficient as a function of the module size s, Fig. 18. The clustering coefficient c is a node property defined as the number of links t between the neighbors of the node divided by the maximum possible number of such links for a node with the same degree k: $c = t/[\frac{1}{2}k(k-1)]$. For nodes with degree smaller than two we consider the clustering coefficient to be undefined and leave them out of the calculations of the averages. Here, "internal" means that the clustering coefficient is computed by only considering the subgraph of the community, which includes only the internal links in the community. For communication systems and the Internet, the average internal clustering coefficient of large communities can reach fairly high values although the corresponding densities ρ are low. This can be explained in terms of "merged-star" structures, where two (or more) high-degree nodes are connected, their neighbours have a low degree (approximately the number of hubs) and are connected to all hubs. As then the clustering coefficient for these nodes is typically unity and their number is large, they dominate the average clustering coefficient within the community.
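The merged-star argument can be verified on a toy example: two connected hubs and n leaves, each leaf linked to both hubs. Every leaf has clustering coefficient 1, while each hub has 2/(n + 1). The sketch below uses illustrative helper names and computes the internal clustering exactly as defined above, skipping nodes with internal degree below two.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_internal_clustering(community, adj):
    """Mean clustering coefficient using only links inside the community.
    Returns None if no node has internal degree of at least two."""
    members = set(community)
    values = []
    for v in members:
        nbrs = adj[v] & members
        k = len(nbrs)
        if k < 2:
            continue                      # clustering undefined, left out
        t = sum(len(adj[u] & nbrs) for u in nbrs) // 2
        values.append(t / (k * (k - 1) / 2))
    return sum(values) / len(values) if values else None


# Merged star: hubs 0 and 1 are linked; leaves 2..11 attach to both hubs.
n_leaves = 10
edges = [(0, 1)]
for leaf in range(2, 2 + n_leaves):
    edges += [(0, leaf), (1, leaf)]
adj = build_adjacency(edges)
nodes = range(2 + n_leaves)
s, t = 2 + n_leaves, len(edges)
print(average_internal_clustering(nodes, adj))  # about 0.86
print(2 * t / (s * (s - 1)))                    # link density, about 0.32
```

The average clustering is high because the many leaves dominate the mean, even though the link density of the community stays low.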

Figure 1: Distribution of community sizes. All distributions are broad, and similar for systems in the same category. Data points are averages within logarithmic bins of the module size s.

Figure 2: Scaled link density of communities as a function of the community size. Communication and Internet networks consist of essentially tree-like communities, while communities of social and information networks are much denser. Small modules in biological networks are often tree-like, while larger modules are denser. Data points are averages within logarithmic bins of the module size s.

Fig. 2 displays the average scaled link densities $\tilde{\rho}$ as a function of community size for different networks.

Figure 4: Average shortest path lengths within communities as a function of community size s. After an initial logarithmic "small-world" regime (dashed diagonal line), the average shortest path grows much more slowly or saturates for communities with $s \gtrsim 10$ nodes (dotted vertical line). Data points are averages within logarithmic bins of module size s.

Figure 5: The maximal observed internal degree of nodes, relative to its maximum possible value s − 1, as a function of the community size s. This ratio equals one if at least one node is linked to all other nodes of its community, and thus quantifies the dominance of hubs within communities.

Figure 10: Distribution of community sizes.

Figure 11: Scaled link density of communities as a function of the community size.

Figure 12: Average shortest path of a community as a function of the community size s.

Figure 13: Ratio between the maximum internal degree $\max(k_{in})$ of a node and the maximum possible number of internal neighbors s − 1 as a function of s, the module size.

Figure 14: Distribution of the fraction of neighbors of a node belonging to the community of the node.

Figure 17: Distribution of the link density for s > 10.

Figure 18: Internal clustering coefficient as a function of the module size.

Table I: List of the network datasets used for our analysis. For each network we specify the number of nodes and links, the average and maximum degree.

Table II: Power-law exponents of the degree distribution and the minimum degree from which the fit holds. We used maximum likelihood fitting [36]. Columns: category, name, exponent, min degree, exp error, p-value.

Table III: Power-law exponents of the community size distribution derived from Infomap. We used maximum likelihood fitting [36].

Table IV: Power-law exponents of the community size distribution derived from the LPM. We used maximum likelihood fitting [36].
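For orientation, a maximum likelihood estimate of a power-law exponent can be sketched with the standard continuous-variable estimator, exponent = 1 + n / Σ ln(x_i / x_min), with standard error (exponent − 1)/√n. Whether the fits in Tables II to IV use exactly this form, a discrete variant, or a particular procedure for choosing the minimum value is not stated here, so the code below is an assumption-laden illustration with hypothetical names, not the fitting procedure of [36].

```python
import math


def power_law_mle(values, x_min):
    """Continuous maximum likelihood estimate of a power-law exponent.

    Only values >= x_min enter the fit. Returns (exponent, standard_error).
    """
    tail = [x for x in values if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    if n == 0 or log_sum == 0:
        raise ValueError("need values strictly above x_min to fit")
    exponent = 1 + n / log_sum
    return exponent, (exponent - 1) / math.sqrt(n)


# Hypothetical community sizes; x_min is chosen by hand in this sketch.
sizes = [3, 3, 4, 5, 5, 6, 8, 10, 15, 22, 40, 95, 310]
print(power_law_mle(sizes, x_min=3))
```

In practice the lower cutoff is itself estimated, and a goodness-of-fit test yields the p-value reported alongside the exponent.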