
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JGY AA LHD LJD. Performed the experiments: JGY. Analyzed the data: JGY AA LHD LJD. Wrote the paper: JGY AA LHD LJD. Designed the software used in analysis: JGY.

Current address: Departament de Física Fonamental, Universitat de Barcelona, Carrer de Martí i Franquès 1, 08028 Barcelona, Spain

Current address: Santa Fe Institute, Santa Fe, NM 87501, United States of America

Community detection is the process of assigning nodes and links to significant communities (e.g. clusters, functional modules), and its development has led to a better understanding of complex networks. When applied to sizable networks, we argue that most detection algorithms correctly identify prominent communities, but fail to do so across multiple scales. As a result, a significant fraction of the network is left uncharted. We show that this problem stems from larger or denser communities overshadowing smaller or sparser ones, and that this effect accounts for most of the undetected communities and unassigned links. We propose a generic cascading approach to community detection that circumvents the problem. Using real and artificial network datasets with three widely used community detection algorithms, we show how a simple cascading procedure allows for the detection of the missing communities. This work highlights a new detection limit of community structure, and we hope that our approach can inspire better community detection algorithms.

Over the course of the last decade, network science has attracted an ever-growing interest, since it provides important insights into a large class of interacting complex systems. One of the features that has drawn much attention is the structure of interactions highlighted by the network representation. Indeed, it has become increasingly clear that global structural patterns emerge in most real networks [

While the exact definition of communities is still not agreed upon [

By developing techniques to extract this organization, one assumes that communities are encoded in the way nodes are interconnected, and that their structure may be recovered from limited and/or incomplete topological information. Various algorithms and models have been proposed to tackle the problem, each featuring a different definition of the community structure while sharing the same general objective. Although these tools have been used with success in several different contexts [

The paper is organized as follows. First, we argue that current algorithms tend to overlook small communities found in the neighborhood of larger, denser ones, under very general conditions. In the following sections, we investigate the exact mechanisms that cause this so-called

It is known that a resolution limit exists for a large class of community detection algorithms that rely on the optimization of a quality function over

In the majority of real world applications, the optimal covering of a network should include

In essence, shadowing occurs when large or dense communities act as screens preventing the detection of smaller or sparser adjacent communities. To illustrate this phenomenon, we study three detection algorithms based on two different paradigms of community structure, namely node communities and link communities. Note that while improved versions of these algorithms have been proposed [

The clique percolation algorithm (CPA) [

We expect shadowing since the size of the cliques,

The Greedy Clique Expansion (GCE) algorithm [

The pervasive overlap and the fact that

The link clustering algorithm (LCA) [_{ab} the link between nodes _{ik} and _{jk} (attached to a same node _{+}(_{+}(_{ik}, _{jk}).

The sets _{+}(_{+}(_{ik}, _{jk}) = 6/13. Note that apart from nodes _{ik}, _{jk}).

Once the similarity has been calculated for all adjacent pairs of links, communities are built by iteratively aggregating adjacent links whose similarity exceeds a given threshold _{c}. This algorithm naturally allows communities to overlap (to share nodes) since a node can belong to as many communities as its degree.
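The aggregation step described above can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes the usual link similarity of the link clustering literature (the Jaccard index of the inclusive neighborhoods of the two non-shared endpoints), and it merges adjacent links with a union-find structure, which yields the same groups as cutting a single-linkage dendrogram at the threshold. The function names are illustrative.

```python
from itertools import combinations


def inclusive_neighborhoods(edges):
    """Map each node to the set holding the node itself and its neighbors."""
    hood = {}
    for u, v in edges:
        hood.setdefault(u, {u}).add(v)
        hood.setdefault(v, {v}).add(u)
    return hood


def link_similarity(hood, i, j):
    """Jaccard index of the inclusive neighborhoods of nodes i and j."""
    return len(hood[i] & hood[j]) / len(hood[i] | hood[j])


def link_communities(edges, threshold):
    """Group adjacent links whose similarity exceeds the threshold.

    Assumes a simple undirected graph (no self-loops, no duplicate links).
    Returns a list of sets of links.
    """
    edges = [tuple(sorted(e)) for e in edges]
    hood = inclusive_neighborhoods(edges)
    parent = {e: e for e in edges}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if len(shared) != 1:
            continue  # the two links are not adjacent
        (i,) = set(e1) - shared
        (j,) = set(e2) - shared
        if link_similarity(hood, i, j) > threshold:
            parent[find(e1)] = find(e2)

    groups = {}
    for e in edges:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())


def nodes_of(link_group):
    """Nodes touched by a group of links."""
    return {n for link in link_group for n in link}


# Two triangles sharing node 0: the shared node ends up in both communities.
bowtie = [(0, 1), (0, 2), (1, 2), (0, 3), (0, 4), (3, 4)]
communities = link_communities(bowtie, threshold=0.5)
```

With a threshold of 0.5 the two triangles are recovered as separate link communities that overlap on node 0, whereas a much lower threshold merges all six links into a single group.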

Again, a shadowing effect is expected, since the similarity threshold _{c} acts as a global resolution parameter. To elucidate the global aspect of _{c}, one must describe how its value is chosen (as proposed in [_{j} of community _{j} and _{j} are the number of links and nodes in community _{j} computes the fraction of potential “excess links” that are present in the community. The similarity threshold _{c} is chosen to maximize the overall density of the communities
_{c}, and where _{c}) is typically a well-behaved function of _{c} that displays a single maximal plateau [_{c} corresponding to this plateau is selected since it leads,
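A minimal sketch of this density criterion is given below. It assumes the usual partition density of the link clustering literature: a connected community with m links and n nodes has density (m - (n - 1)) / (n(n - 1)/2 - (n - 1)), and the overall density is the average of these values weighted by the number of links. The function names are illustrative.

```python
def community_density(links):
    """Fraction of the potential excess links present in one link community.

    A connected community with n nodes needs n - 1 links to hold together;
    every additional link is an excess link. Assumes the community is connected.
    """
    nodes = {n for link in links for n in link}
    m, n = len(links), len(nodes)
    max_excess = n * (n - 1) / 2 - (n - 1)
    if max_excess == 0:
        return 0.0  # a single link cannot hold any excess link
    return (m - (n - 1)) / max_excess


def partition_density(communities):
    """Average community density, weighted by the number of links."""
    total_links = sum(len(c) for c in communities)
    if total_links == 0:
        return 0.0
    weighted = sum(len(c) * community_density(c) for c in communities)
    return weighted / total_links
```

Under this definition a clique has density 1 while any tree, including a simple chain of links, has density 0, so a threshold that maximizes the overall density favors the densest groups of links.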

Following an analysis similar to that presented in the CPA case, we expect small communities to be left undetected as they are eclipsed by larger and denser ones. This is mainly due to the use of a resolution parameter (_{c}) that cannot be adjusted locally. For instance, links in a small community could exhibit vanishing similarities because some of the associated nodes are hubs (nodes of high degree). This is especially true in the vicinity of large and dense clusters whose nodes are typically of high degree (see

The pairwise unions of the three sets _{+}(_{+}(_{+}(_{ab}, _{bc} and _{ac} share lower similarities—namely _{ac}, _{bc}) = _{ab}, _{bc}) = 3/22 and _{ab}, _{ac}) = 3/17—than if the triangle had been completely isolated (_{ac}, _{bc}) = _{ab}, _{bc}) = _{ab}, _{ac}) = 1). It is therefore likely that these three links will be left unassigned.
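The drop in similarity can be reproduced on a toy configuration. The sketch below assumes the Jaccard similarity over inclusive neighborhoods, and the network it builds is hypothetical: it is not the configuration of the figure, so the resulting fractions differ from those quoted above. A triangle (a, b, c) is first considered in isolation, then with a and b also tied to members of a 6-clique.

```python
from fractions import Fraction


def inclusive_neighborhoods(edges):
    """Map each node to the set holding the node itself and its neighbors."""
    hood = {}
    for u, v in edges:
        hood.setdefault(u, {u}).add(v)
        hood.setdefault(v, {v}).add(u)
    return hood


def similarity(hood, i, j):
    """Exact Jaccard index of the inclusive neighborhoods of i and j."""
    return Fraction(len(hood[i] & hood[j]), len(hood[i] | hood[j]))


def triangle_similarities(hood):
    """Similarities of the three pairs of links of the triangle (a, b, c)."""
    return {
        ("ac", "bc"): similarity(hood, "a", "b"),  # links share node c
        ("ab", "bc"): similarity(hood, "a", "c"),  # links share node b
        ("ab", "ac"): similarity(hood, "b", "c"),  # links share node a
    }


triangle = [("a", "b"), ("b", "c"), ("a", "c")]

# Hypothetical dense neighbor: a 6-clique h0..h5, with a and b tied to it.
clique = [(f"h{p}", f"h{q}") for p in range(6) for q in range(p + 1, 6)]
ties = [("a", "h0"), ("a", "h1"), ("a", "h2"), ("b", "h3"), ("b", "h4")]

isolated = triangle_similarities(inclusive_neighborhoods(triangle))
embedded = triangle_similarities(inclusive_neighborhoods(triangle + clique + ties))
```

In isolation the three similarities equal 1. Once a and b gain neighbors in the clique, their inclusive neighborhoods grow while the intersections stay fixed at three nodes, so every similarity falls well below 1 even though the triangle itself is unchanged.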

Figs

1. identify large or dense communities by tuning the resolution parameter;

2. remove the internal links of the communities identified in step 1;

3. repeat until no new significant communities are found.

The first iteration of this algorithm detects the communities that are normally targeted by detection algorithms, thus ensuring that the cascading approach retains the main features of the “canonical” community structure. After removal of links involved in the detected communities, a new iteration of the detection algorithm is then performed on a sparser network in which previously hidden communities are now apparent. This process is repeated until a final and more thorough covering of the network into overlapping communities is obtained. Note that the resulting cover is not necessarily hierarchical, but simply more complete.
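The cascading procedure can be sketched generically as follows. The detector used here, `toy_detect`, is a hypothetical stand-in and is none of the CPA, GCE or LCA: it keeps only the links whose endpoints share the maximal number of common neighbors, which mimics a resolution parameter tuned globally to the densest structure. Any detection function that returns a list of node sets could be passed instead.

```python
from itertools import combinations


def adjacency(edges):
    """Map each node to the set of its neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(edges):
    """Connected components (as node sets) of the graph spanned by the edges."""
    adj = adjacency(edges)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def toy_detect(edges):
    """Hypothetical detector with a single, globally tuned resolution."""
    adj = adjacency(edges)
    shared = {(u, v): len(adj[u] & adj[v]) for u, v in edges}
    top = max(shared.values(), default=0)
    if top == 0:
        return []  # only tree-like structures remain: nothing significant
    return components([e for e, s in shared.items() if s == top])


def cascade(edges, detect, max_iter=10):
    """Detect communities, remove their internal links, and repeat."""
    remaining, cover = set(edges), []
    for _ in range(max_iter):
        found = detect(remaining)
        if not found:
            break
        cover.extend(found)
        internal = {(u, v) for u, v in remaining
                    if any(u in c and v in c for c in found)}
        remaining -= internal
    return cover, remaining


# A 5-clique (nodes 0..4) overshadowing a triangle (0, 5, 6) attached to it.
edges = set(combinations(range(5), 2)) | {(0, 5), (0, 6), (5, 6)}
first_pass = toy_detect(edges)
cover, unassigned = cascade(edges, toy_detect)
```

On this example a single pass of the toy detector returns only the clique, whereas the cascade also recovers the triangle at the second iteration and leaves no link unassigned. The two communities overlap on node 0, and the final cover is not hierarchical.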

For example, in the case of the CPA, a high value of

A similar strategy is employed to uncover hidden communities with GCE. Seeds of at least _{c}) ≃ 0 only yields chains of links (the keystone ensures a non-vanishing similarity), which in general are not classified as significant communities.

It is worth mentioning that conducting this repeated analysis does not increase the computational cost significantly, because the cascading algorithm scales exactly like the community detection algorithm used at each iteration, and because the number of iterations that can be carried out is small (typically fewer than 10). Moreover, the size of the network (number of links and nodes) effectively decreases after each iteration, further reducing the cost (numerical evidence is presented in the next section).

To investigate the efficiency and the behavior of the cascading detection, we first apply our approach to 8 small real network datasets: arXiv cond-mat circa 2004 (hereafter:

First and foremost, our results show that cascading detection

(_{largest} < _{2nd largest}, with

Although the increasing sparseness of the network hinders the performance of the CPA and GCE, it also reduces the cost of the subsequent detection steps.


Visual inspection of the detected communities not only verifies the quality of the hidden communities, but also confirms our intuition of the shadowing effect. A look at


More complex structures and correlations are also brought to light using this approach.

Important insights can be gained by applying detection algorithms to networks that are accompanied by meta-information. In real networks, meta-information such as declared affiliations (e.g. individuals in social networks) can be used to define

We study 3 large networks downloaded from the SNAP database [

Because the considerable time and storage space requirements of clique-based algorithms make them unsuitable for such large (^{6}) and clustered networks, we have restricted ourselves to the LCA algorithm for the remainder of the section.

Complete detection is achieved with 6 iterations for

First and most importantly, we find that the cascading approach either increases the

The nature of the functional communities can explain these 2 contrasting behaviors. Indeed, the very definition of what qualifies as a link is closely related to the definition of the functional communities of the

In contrast, the

Second, the evolution of the structure under cascading detection is also of interest. In all cases, the network shrinks rapidly, both in terms of assignable edges (

The previous sub-section indicates that the cascading approach performs better when functional communities are dense, i.e. similar to structural communities. This hypothesis can be investigated with artificial networks generated specifically to exhibit known structural communities (called built-in communities). We apply the cascading LCA to Lancichinetti-Fortunato [_{2} of the community size distribution (the distribution follows a power-law).

These parameters can be used to tune the difficulty of the community detection problem. The mixing parameter dictates the fraction of

In the standard implementation of [

The last parameter of interest is _{2}, the scale exponent of the community size distribution. Since this distribution follows a power-law, its average is only well-defined for _{2} > 2, whereas its variance goes to infinity whenever _{2} < 3. Thus, as _{2} decreases, increasingly large communities appear. Because LF networks do not control explicitly for size correlation in neighboring communities, some of the large communities happen, through pure chance, to share nodes with very small communities. In fact, it can be verified that some of these large communities neighbor much smaller communities, despite some trace of assortative mixing based on community sizes (see
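The moment conditions quoted above follow from a short calculation, sketched here for an idealized continuous power law $p(s) \propto s^{-\tau}$ on $s \geq s_{\min}$. The symbols $\tau$ (the size exponent) and $s_{\min}$ (the smallest community size) are notation chosen for this sketch. Benchmark generators usually also impose a maximal community size, in which case the moments remain finite but become dominated by that cutoff.

```latex
\begin{align}
  p(s) &= (\tau - 1)\, s_{\min}^{\tau - 1}\, s^{-\tau},
  \qquad s \geq s_{\min},\ \tau > 1, \\
  \langle s \rangle &= \int_{s_{\min}}^{\infty} s\, p(s)\, \mathrm{d}s
  = \frac{\tau - 1}{\tau - 2}\, s_{\min}
  \qquad (\tau > 2), \\
  \langle s^{2} \rangle &= \int_{s_{\min}}^{\infty} s^{2}\, p(s)\, \mathrm{d}s
  = \frac{\tau - 1}{\tau - 3}\, s_{\min}^{2}
  \qquad (\tau > 3).
\end{align}
```

Both integrals diverge at the upper limit when the stated condition on $\tau$ is violated, so the variance $\langle s^{2} \rangle - \langle s \rangle^{2}$ is unbounded for exponents at or below 3, and the mean itself is unbounded for exponents at or below 2.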

In _{2} ∈ [1.25, 3.75], with

Relative change in normalized mutual information obtained by comparing the structure detected by the pure LCA and the cascading LCA, when applied to LF networks. All results are discrete points, but solid curves are added to guide the eye. (_{2} ranges from 1.25 to 3.75 (x-axis). Averaged data is shown on the color map (10 different networks for each point), while the distribution of raw data is shown in the plots to the right and below (black dots). Within raw data plots, the solid curve shows the average and the gray area indicates the standard deviation.

It is useful to split the joint _{2} space into 4 qualitatively different regions of interest to analyze the results of

1. high _{2}: overlapping with heterogeneous community sizes;

2. high _{2}: overlapping with homogeneous community sizes;

3. low _{2}: non-overlapping with heterogeneous community sizes;

4. low _{2}: non-overlapping with homogeneous community sizes.

Shadowing is

The results of regions 2, 3 and 4 highlight the fact that the cascading approach is not a silver bullet. Because shadowing happens less frequently in regions 3 and 4 (no or little overlap), repeated applications of the detection algorithm sometimes decrease the quality of the final partition, through overfitting. Furthermore, since the effects of shadowing are more pronounced when community sizes are heterogeneous, the cascading detection approach is also prone to overfitting in regions 2 and 4. Nonetheless, it is important to realize that the changes in NMI are

In a normal situation, where built-in communities are not known, the results shown in _{v} of the size distribution, i.e. the ratio of its standard deviation to its mean. If we maximize the average NMI of the final outcome based on the variability of the size distribution detected at the first iteration, then _{v} ≈ 1.15 is the optimal threshold for the low mixing (
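A minimal sketch of this diagnostic is given below. It assumes that the coefficient of variation is computed with the population standard deviation of the community sizes found at the first iteration (the choice of estimator is not specified here), and it uses the low-mixing threshold of 1.15 quoted above. The function `suggests_cascading` is a hypothetical helper, not part of any published implementation.

```python
from statistics import mean, pstdev

# Threshold quoted in the text for the low-mixing case.
LOW_MIXING_THRESHOLD = 1.15


def coefficient_of_variation(sizes):
    """Standard deviation of the community sizes divided by their mean.

    Assumption: the population standard deviation is used.
    """
    return pstdev(sizes) / mean(sizes)


def suggests_cascading(sizes, threshold=LOW_MIXING_THRESHOLD):
    """Hypothetical rule of thumb: cascade when first-pass sizes are heterogeneous."""
    return coefficient_of_variation(sizes) > threshold


# Community sizes as they might come out of a first detection pass.
homogeneous = [10, 12, 11, 9, 10]
heterogeneous = [3, 3, 4, 5, 60]
```

Sizes that are nearly uniform give a coefficient close to zero, while a single community much larger than the others pushes it above the threshold. This is the situation in which shadowing, and hence a gain from cascading, is expected.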

Coefficient of variation _{v} of the community size distribution for (

Nonetheless, we can assess the prevalence of shadowed _{v} suggests that shadowing of structural communities definitely occurs in the _{v} = 5.51), and _{v} = 2.42). No definitive conclusion can be drawn for the other networks, since their calculated variability of _{v} > 0.7 lie slightly below the optimal threshold for low mixing cases, except for the _{v} = 0.48). These results must be considered with care since LF networks do not capture all the structural complexity of real networks [

Finally, we find once again that the cascading approach involves a modest increase in running time (see

Raw running times are measured to millisecond precision and averaged over 10 independent and complete iterations. The network generating process is not included in the total running time. (

In conclusion, we have defined the shadowing effect in community detection and have illustrated the types of scenarios where it might arise. This effect calls for a simple solution: a cascading use of detection algorithms. This meta-approach has been shown to reduce the hidden portion of a network, and to find relevant communities in real and benchmark networks when shadowing does occur.

We have shown that a simple implementation of the cascading philosophy can indeed unveil communities that were initially overshadowed by larger and/or denser communities. Interestingly, in both real datasets and benchmarks, the cascading approach appears more likely to detect meaningful remaining communities whenever subsequent iterations of the detection algorithm still require significant computing time. As observed in the LF benchmarks, this situation occurs with networks that are, by definition, more sensitive to shadowing effects: those with strongly overlapping communities of heterogeneous sizes.

The current implementation is meant to be the first level of cascading approaches, opening the way to more subtle meta-algorithms. For instance, one could construct an extreme version, where communities are detected one by one (in the spirit of Ref. [

Any significant improvement in community detection will help shrink the gap between analytical models and their real network counterparts. The difficult problem of accurately modeling the dynamical properties of real networks might be better tackled if one includes complex community structure through comprehensive distributions or solved motifs [

Finally, in addition to the technical developments presented in this paper, perhaps the most insightful observation can be simply stated: since community structure occurs at all scales, global partitioning of overlapping communities must be done sequentially, cascading through the organizational layers of the network.


The mixing parameter is given by the average of the ratio _{i} is the degree of node _{v} is defined as the standard deviation of a distribution, normalized by the mean.

(PDF)

This figure shows the average value of the NMI

(PDF)

This array samples 25 points of the parameter space, i.e. fractions of overlapping nodes set to _{2} = [1.5, 2, 2.5, 3, 3.5] (left to right). All networks consist of

(PDF)

For each network, correlations are computed using the community structure detected by the cascading version of CPA (left), GCE (center) and LCA (right). Each subplot shows the number of communities of

(PDF)

The authors wish to thank the Gephi development team for their visualization tool [