A Stochastic Model for Detecting Overlapping and Hierarchical Community Structure

Community detection is a fundamental problem in the analysis of complex networks. Recently, many researchers have concentrated on the detection of overlapping communities, where a vertex may belong to more than one community. However, most current methods require the number (or the size) of the communities as a priori information, which is usually unavailable in real-world networks. A practical algorithm should therefore not only find the overlapping community structure, but also automatically determine the number of communities. Furthermore, it is preferable if such a method can reveal the hierarchical structure of networks as well. In this work, we first propose a generative model that employs a nonnegative matrix factorization (NMF) formulation with an l2,1 norm regularization term, balanced by a resolution parameter. NMF naturally provides overlapping community structure by assigning soft membership variables to each vertex; the l2,1 regularization term is a group sparsity technique that automatically determines the number of communities by penalizing too many nonempty communities; and the resolution parameter enables us to explore the hierarchical structure of networks. We then derive a multiplicative update rule to learn the model parameters, and prove its correctness. Finally, we test our approach on a variety of synthetic and real-world networks, and compare it with several state-of-the-art algorithms. The results validate the superior performance of our new method.


Introduction
Many real-world systems can be represented as complex networks, where vertices represent the components of the systems and links represent the interactions between them. Complex networks share some common properties, such as the small-world property [1] and power-law degree distributions [2]. Recently, community structure, one of the most important inherent properties of complex networks, has attracted much attention. A community is regarded as a group of vertices with denser connections inside the group and sparser connections to the rest of the network [3]. Communities are believed to play important roles in different real-world systems, revealing meaningful topological structure in a wide variety of networks, e.g., friendship communities in a social network or protein functional structures in a protein interaction network. Discovering communities is therefore crucial for understanding how different units in a complex system communicate with each other and work together. Furthermore, it provides valuable insight into the organization and behavior of the network, and offers clues for further investigation. It is thus not surprising that community detection in networks is essential and has been widely investigated over the last few years.
An important problem in community detection is finding overlapping communities, i.e., vertices may belong to more than one community [4]. Overlapping communities are pervasive in real-world networks. For example, in a social network, an individual may participate in several social communities, depending on profession, friends, etc. In recent decades, a number of approaches for overlapping community detection have been proposed [5][6][7]. Among these, the stochastic blockmodel, instead of directly detecting communities, is a form of statistical inference for networks that describes how community structures are generated [7]. Here we give a simple example of using a stochastic blockmodel to formulate community structure. In the blockmodel, we specify a set of probabilities p_ck, where p_ck is the probability of a link between any two vertices in communities c and k, respectively. We can set p_ck to a large value when c = k, and a small value otherwise. The generative process then creates a network with many links within communities and few between them. We fit the model to the observed network, and its community structure can then be inferred by parameter learning. Thus, the stochastic blockmodel is theoretically solid, relying on statistical inference. Besides, the model generates an expected network similar to the original network, which helps us better understand the structure of the network, and hence it often yields interpretable and more trustworthy results. Because of this property, if the model fits a network well, we can further apply it to predict missing links from the viewpoint of generating the network.
More broadly, this property also means that the model is not limited to detecting traditional community structure (i.e., a set of communities with dense internal connections and sparse external ones), but can handle any type of community structure (such as hierarchical, bipartite, or k-partite structures, among many others [8]) that can be formulated as a model [7]. Therefore, the stochastic model has become a promising approach for overlapping community detection. Specifically, this approach explicitly models the network and then infers the community memberships of vertices. Along this line, some recent methods are based on the blockmodel or its variations, and employ nonnegative matrix factorization (NMF) to learn their models [9][10][11][12][13][14]. For example, Zarei et al. [9] introduced a vertex-vertex correlation matrix to represent the relationships between vertices, instead of the adjacency matrix. They then applied NMF to analyze this feature matrix and obtained the overlapping communities. Psorakis et al. [10] generated the expected network using two nonnegative matrices. They utilized a Bayesian NMF, which combines the Kullback-Leibler (KL) divergence with a prior model on the two nonnegative matrices, to infer overlapping communities from a network while determining the number of communities. Wang et al. [11] proposed three NMF techniques (Symmetric NMF, Asymmetric NMF, and Joint NMF) to detect communities in undirected, directed, and compound networks, respectively. Zhang et al. [12] modeled the expected network with two nonnegative matrices in a three-factor form; in this way, they could learn the community membership of each vertex as well as the interactions among communities. Zhang et al. [13] developed a symmetric binary matrix factorization model to identify overlapping communities, which could also distinguish outliers from overlapping vertices. Cao et al.
[14] proposed a model consisting of the centrality matrix of vertices and the degree matrix of communities. Based on NMF, they then inferred the two types of parameters to identify overlapping communities, hubs, and outliers simultaneously.
However, a model-based method often has to solve the model selection problem, i.e., automatically inferring the correct number of communities in a network. In some cases, the number of communities can be obtained in advance. But in most situations, we do not know how many communities a network contains, because we often lack background or domain knowledge, which makes it very difficult to determine the number of communities properly. Model selection and scalability are two common drawbacks of almost all methods based on stochastic blockmodels. Traditional statistical model selection strategies (e.g., minimum description length [15,16] or consensus clustering [17]) applied to stochastic models may, in principle, find the number of communities K in a consistent and satisfactory manner, but they need to scan K over a large range, which makes them too computationally expensive to apply to large real networks. Bayesian model selection [10,18] uses priors that penalize the model for including too many nonzero parameter values, and hence balances the number of communities against the goodness of fit to the network data, which avoids scanning K. However, the priors themselves contain undetermined parameters whose values can influence the number of communities, so the problem is not completely solved by this approach. Besides, these methods do not consider the detection of hierarchical structure, which often appears in real networks.
Another key problem in community detection is finding hierarchical communities, where small communities are nested inside larger ones. Different resolutions determine the average size of the communities [19], which enables us to explore the hierarchical levels of a network. If the resolution is low, the whole network is divided into a few large communities; in the extreme, at a sufficiently low resolution the whole network may be regarded as a single community. Conversely, if the resolution is high, the network is divided into many small communities. Taking the hierarchical structure of a school network as an example: at a low scale, the whole school can be considered one community; at a higher scale, each class may be a community; and between these two scales, each grade may be a community. The communities detected at different scales may represent different functional units. Thus, hierarchy can show not only the macroscopic view of the network, but also more detailed community information. Furthermore, since one usually has no prior knowledge of how large the communities are, it is necessary to compare the detected communities at different scales.
Recently, some research has focused on hierarchical community detection. For instance, Blondel et al. [20] proposed a heuristic method based on modularity optimization, merging small communities whenever doing so increases the modularity; in this way, a hierarchical community structure is built. In [19], Lancichinetti et al. introduced a fitness function with a resolution parameter; by tuning the parameter, they were able to obtain hierarchical communities. [21] defines a tightness function of local communities; similarly, by adjusting a resolution parameter of this function, they obtained communities at different scales. The hierarchical community detection methods mentioned above all rely on the heuristic optimization of a quality function. Generally, because many real networks have pervasively overlapping groups, a global hierarchy of vertices cannot capture the relationships between overlapping vertices. In addition, these methods cannot reconcile the antagonistic organizing principles of overlapping communities and hierarchy [22]. The combination of soft community memberships and hierarchy may be a viable solution.
To deal with the above problems, we present a novel model-based approach for overlapping and hierarchical community detection. This approach does not require the number of communities to be fixed in advance (it can be specified by users or determined automatically), which benefits real-world application scenarios. The generative model contains two parts, a loss function and an l2,1 norm regularization term, balanced by a resolution parameter. The loss function measures the distance between the real network and the expected network generated from the vertex memberships, while the regularization term controls the group sparsity of the model. We then derive an update rule for the model parameters to infer the membership matrix of vertices. The resolution parameter balances the loss function and the regularization term; by tuning it, we can obtain communities at different scales. When the number of communities K is known, we can specify K directly. Otherwise, we can specify a large initial K0, and the l2,1 norm regularization term then drives the redundant communities to become empty, i.e., no vertices participate in them. In this way, we obtain a more accurate number of communities from the large K0 by discarding the empty communities.
The rest of the paper is organized as follows. First, we formalize the problem as a generative model and give the update rule for the model parameters. We then verify our approach on various networks, including artificial and real-world ones. Finally, we summarize the discussion and conclude with our future work.

Methods
In this section, we first describe our generative model, and then present an algorithm based on nonnegative matrix factorization to learn the parameters of the model. Finally, we offer an illustrative example to depict the main idea of our method.

Generative model
We consider an undirected and weighted network G = (V, E), where V denotes the set of vertices and E the set of links. Usually, we use the adjacency matrix A to represent G, where A_ij equals the weight of the link between vertices i and j if they are connected, and 0 otherwise. Thus, A is an N × N matrix, where N is the number of vertices. In our model, if we know the number of communities K in advance, we can set K directly. Otherwise, we can set a relatively large initial K0, and our model will automatically determine a suitable number of communities.
The first step in the model is to generate an expected adjacency matrix Â, which has the same size as A. The element Â_ij of Â denotes the expected weight of the links between vertices i and j. Here Â_ij is specified by a set of parameters U_ik, each representing the propensity of vertex i to belong to community k. We therefore define the membership matrix U ∈ R^{N×K}, which consists of all the elements U_ik. To be specific, U_ik U_jk denotes the expected weight of the links between i and j within community k. Summing over the communities, the expected weight of the links between vertices i and j in the network is

Â_ij = Σ_{k=1}^{K} U_ik U_jk. (1)

Eq (1) can also be rewritten in matrix form as

Â = UU^T. (2)

Under this model, if vertices i and j belong to the same community k, meaning U_ik and U_jk are both relatively large, the value of Â_ij will be large. This implies that vertices i and j have a high propensity of being connected in community k, so a set of such vertices tends to be connected relatively densely within community k. Otherwise, Â_ij will be small. Furthermore, when the number of communities K is unknown, it is preferable to determine it automatically. In practice, it is easy to set an initial maximum number K0; what we then need is to determine K from K0. If a community k is "redundant" or unnecessary in the community structure of a network, no vertex participates in it. Accordingly, the entry U_ik should be zero for all i, which means all values in the kth column of the membership matrix U will be zero. The problem thus becomes how to make some "redundant" columns of U zero and select the remaining non-zero columns. Generally speaking, l1 norm regularization is commonly used to promote a sparse solution and improve generalization, but it cannot produce the wide class of solutions known to have a certain "group sparsity" structure.
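The generative step above can be sketched in a few lines of NumPy. The membership matrix below is a hypothetical toy example (not from the paper); it only illustrates how Â = UU^T concentrates expected weight between vertices that share a community.

```python
import numpy as np

# Hypothetical membership matrix for a 4-vertex network with K = 2 communities.
# U[i, k] is the propensity of vertex i to belong to community k.
U = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.1, 0.9],
              [0.0, 1.0]])

# Expected adjacency: A_hat[i, j] = sum_k U[i, k] * U[j, k], i.e. A_hat = U U^T.
A_hat = U @ U.T

# Vertices 0 and 1 share community 0, so their expected weight is high;
# vertices 0 and 3 share no community, so it is zero.
print(A_hat[0, 1])  # 0.9
print(A_hat[0, 3])  # 0.0
```

Note that Â is symmetric by construction, matching the undirected network assumption.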
Incorporating group information via mixed-norm regularization has been discussed previously in statistics and machine learning, where a favored approach is the mixed l2,1 norm [23,24]. Inspired by this idea, we add a constraint to U using the l2,1 norm, defined as

||U||_{2,1} = Σ_{k=1}^{K} ||U_k||_2 = Σ_{k=1}^{K} sqrt(Σ_{i=1}^{N} U_ik²), (3)

where U_k is the kth column of U. As (3) shows, the l2,1 norm of a matrix is the sum of the vector l2 norms of its columns, which can be viewed as the l1 norm of the vector of column l2 norms. Penalizing with the l2,1 norm therefore promotes as many zero columns as possible in the matrix, achieving group sparsity.
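Definition (3) can be written as a one-line helper. This is a minimal sketch: the function name and example matrix are ours, not from the paper.

```python
import numpy as np

def l21_norm(U):
    """l2,1 norm: sum of the Euclidean (l2) norms of the columns of U."""
    return float(np.sum(np.linalg.norm(U, axis=0)))

# A zero column contributes nothing to the penalty, which is why the
# l2,1 norm favours emptying whole communities rather than single entries.
U = np.array([[3.0, 0.0],
              [4.0, 0.0]])
print(l21_norm(U))  # 5.0  (||col0||_2 = 5, ||col1||_2 = 0)
```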
There are many options for fitting the adjacency matrix A to the expected adjacency matrix Â. Least square loss and the generalized Kullback-Leibler (KL) divergence are the most widely used [25]. Similar to other NMF-based methods [11][12][13][14], we adopt the least square loss between A and Â for simplicity. Here we motivate the least square loss from the viewpoint of likelihood. Usually, the observed weight A_ij between vertices i and j can be represented as the expected weight Â_ij plus additive noise ε_ij,

A_ij = Â_ij + ε_ij. (4)

Suppose the noise ε_ij follows a zero-mean normal distribution with standard deviation τ, that is, A_ij ~ N(Â_ij, τ²). In general, we assume each weight A_ij is independent; the log likelihood over all weights can therefore be written as

log P(A | Â) = −(1/(2τ²)) Σ_{i,j} (A_ij − Â_ij)² + const. (5)

Maximizing the log likelihood is then equivalent to minimizing

Σ_{i,j} (A_ij − Â_ij)². (6)

Thus we have

min_U ||A − UU^T||_F². (7)

This means the least square loss underlies Gaussian additive observation noise [18]. Finally, adding the l2,1 norm regularization term to control the group sparsity of U, the formulation of our model is

min_{U ≥ 0} ||A − UU^T||_F² + λ ||U||_{2,1}, (8)

where λ is a positive real-valued parameter that controls the degree of group sparsity. In other words, the coefficient λ controls the number and average size of the communities. If λ is small, we obtain more but smaller communities; if it is large, we obtain fewer but larger communities; and if it is large enough, the whole network becomes a single community. Hence the parameter λ tunes the resolution of the network, and different values of λ correspond to different scales.
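Objective (8) is straightforward to evaluate. The sketch below (function name ours) combines the Frobenius-norm loss with the l2,1 penalty; when A = UU^T exactly, only the penalty term remains.

```python
import numpy as np

def objective(A, U, lam):
    """Value of ||A - U U^T||_F^2 + lam * ||U||_{2,1} for nonnegative U."""
    residual = A - U @ U.T
    loss = np.sum(residual ** 2)                  # least square loss
    penalty = np.sum(np.linalg.norm(U, axis=0))   # l2,1 group-sparsity term
    return loss + lam * penalty

# Sanity check: a perfect factorization leaves only the penalty.
U = np.eye(2)
A = U @ U.T
print(objective(A, U, 2.0))  # 4.0  (loss 0, penalty 1 + 1 = 2, times lam = 2)
```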

Model learning
First we derive the gradient of (8) with respect to U_ik:

∂f/∂U_ik = 4(UU^T U)_ik − 4(AU)_ik + λ U_ik/||U_k||_2. (9)

We then split (9) into its positive term

[∂f/∂U_ik]^+ = 4(UU^T U)_ik + λ U_ik/||U_k||_2 (10)

and its negative term

[∂f/∂U_ik]^− = 4(AU)_ik. (11)

According to the gradient descent algorithm, one can use [·]^+ and [·]^− to define an iterative update rule as follows:

U_ik ← U_ik + η_ik ([∂f/∂U_ik]^− − [∂f/∂U_ik]^+), (12)

where η_ik is a positive learning rate. If we choose

η_ik = U_ik / [∂f/∂U_ik]^+ (13)

according to the Oja rule [26], the update rule becomes multiplicative. Finally, we simply update U_ik by multiplying its current value by the ratio of the negative term to the positive term:

U_ik ← U_ik · 4(AU)_ik / (4(UU^T U)_ik + λ U_ik/||U_k||_2). (14)

It is worth noting that this update rule converges to a local optimum. To obtain a better result, we take a strategy similar to that of other NMF methods: we randomly initialize ten "seeds", i.e., ten U matrices. When the algorithm converges, the ten matrices lead to ten results, corresponding to ten values of the loss function. We then select the matrix U with the minimum loss as the final solution, which will not be too far from the global optimum.
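The iteration in (14) can be sketched as follows. This is a minimal illustration under our own choices of function name, random initialization, fixed iteration count, and a small eps to avoid division by zero; the paper's actual implementation (convergence test, ten random seeds) is not reproduced here.

```python
import numpy as np

def fit_membership(A, K0, lam, n_iter=500, seed=0, eps=1e-12):
    """Multiplicative updates for min ||A - U U^T||_F^2 + lam*||U||_{2,1}, U >= 0.

    Each entry of U is multiplied by the ratio of the negative gradient part
    4(AU)_ik to the positive part 4(U U^T U)_ik + lam * U_ik / ||U_k||_2,
    as in update rule (14).
    """
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    U = rng.random((N, K0))                       # random nonnegative start
    for _ in range(n_iter):
        col_norms = np.linalg.norm(U, axis=0) + eps  # ||U_k||_2 per column
        neg = 4.0 * (A @ U)                          # negative part [.]^-
        pos = 4.0 * (U @ (U.T @ U)) + lam * U / col_norms  # positive part [.]^+
        U *= neg / (pos + eps)
    return U
```

On a network with clear blocks and a suitable λ, several columns of the returned U shrink toward zero, which is the group-sparsity effect the model relies on.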
Both disjoint and overlapping communities can be derived from the obtained membership matrix U. To be specific, for a disjoint partition, vertex i is assigned to community r = argmax_k {U_ik, k = 1, 2, ..., K}. To construct a structure with overlapping communities, we take a strategy similar to [12] and [27]. We first find the maximum and minimum values in the vector (U_i1, U_i2, ..., U_iK),

U_i^max = max_k U_ik, U_i^min = min_k U_ik, (15)

and then rescale each entry of this vector to [0, 1] as follows:

Ũ_ik = (U_ik − U_i^min) / (U_i^max − U_i^min). (16)

In this way, the maximum value in each row rescales to 1, the minimum rescales to 0, and the remaining entries fall in between. We then vary a threshold from 0 to 1, setting all entries of Ũ that exceed the threshold to 1, and to 0 otherwise. Different thresholds lead to different communities. To select the desired communities, we use modularity Q (or minimum description length L) as the quality metric, depending on the scenario: the proper threshold is the one whose community structure attains the maximum Q-value (or the minimum L-value). Now, inspired by the proof in [11], we give the proof of correctness of the update rule (14). THEOREM 1. At convergence, the converged solution U of the update rule (14) satisfies the Karush-Kuhn-Tucker (KKT) conditions of optimization theory [28].
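The row-wise rescaling (16) and thresholding step can be sketched as below. The function name, toy matrix, and threshold value are our own illustrative choices; note the sketch assumes no row of U is constant (otherwise (16) would divide by zero).

```python
import numpy as np

def overlapping_communities(U, threshold):
    """Rescale each row of U to [0, 1] via (16), then binarise with a threshold.

    Row i of the result marks the communities vertex i belongs to; a row
    with several 1-entries identifies an overlapping vertex.
    """
    u_min = U.min(axis=1, keepdims=True)
    u_max = U.max(axis=1, keepdims=True)
    scaled = (U - u_min) / (u_max - u_min)   # row-wise min-max rescaling
    return (scaled >= threshold).astype(int)

U = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.6, 0.1]])
print(overlapping_communities(U, 0.75))
# vertex 0 -> only community 0; vertex 1 -> communities 0 and 1 (overlapping)
```

In practice one would sweep the threshold over (0, 1) and keep the assignment with the best Q-value or L-value, as described above.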
PROOF. Because of the nonnegativity constraint on U, we introduce Lagrangian multipliers δ for U and construct the Lagrangian function

L(U, δ) = ||A − UU^T||_F² + λ ||U||_{2,1} − Σ_{i,k} δ_ik U_ik, (17)

where δ is an N × K nonnegative matrix whose element δ_ik is the Lagrangian multiplier of U_ik. We then have the KKT conditions

∂L/∂U_ik = 0, U_ik ≥ 0, δ_ik U_ik = 0, δ_ik ≥ 0. (18)

Following the first KKT condition, we obtain

δ_ik = 4(UU^T U)_ik + λ U_ik/||U_k||_2 − 4(AU)_ik. (19)

Further, by the third KKT condition, we have

(4(UU^T U)_ik + λ U_ik/||U_k||_2 − 4(AU)_ik) U_ik = 0. (20)

On the other hand, once U converges, the update rule (14) gives

U_ik = U_ik · 4(AU)_ik / (4(UU^T U)_ik + λ U_ik/||U_k||_2), (21)

which can be written as

(4(UU^T U)_ik + λ U_ik/||U_k||_2 − 4(AU)_ik) U_ik = 0. (22)

This is identical to (20). Besides, the update rule guarantees the nonnegativity of U, and δ consists of Lagrangian multipliers, so the second and fourth KKT conditions are satisfied. The update rule (14) thus satisfies all the above KKT conditions, which completes the proof of its correctness.
An illustrative example for l2,1 norm regularization

In this section, we do not show the whole procedure of our method, but depict the main effect of adding the l2,1 norm regularization. Here we use the well-known karate club network and initially set the number of communities to 17. Fig 1(a) is the color mapping of U (normalized to [0, 1]) obtained by standard NMF [25], where colors close to red indicate a strong propensity of vertex i to belong to community k, and colors close to blue indicate a weak propensity. Fig 1(b) is the color mapping of U (normalized to [0, 1]) obtained by our method. Here we set the resolution parameter λ = 1.7; in later sections we will describe in detail how to determine λ. Both standard NMF and our method produce a membership matrix U of size 34 × 17. However, for our method only three columns are non-zero, and the remaining columns are all zeros. This suggests that, although we set the number of communities to a large value at the beginning, not all communities are essential for this network: some communities end up with no participating vertices, and the three non-zero columns are the three communities we are looking for. As a result, after removing the unnecessary communities, we obtain K = 3, corresponding to 3 significant communities. In contrast, standard NMF without group sparsity cannot select the suitable communities when we set K = 17: the vertices are assigned across all 17 communities, and we cannot recover the significant split of the karate club network.
It is worth noting that we can obtain both overlapping and disjoint communities from U. The strategies for using the membership matrix U to determine the communities and for choosing a suitable threshold were introduced in the Model learning section, and are similar to those of other NMF methods. Besides, we can also obtain multi-scale structures by varying the resolution parameter λ.

Results
In this section, we test the performance of our approach on artificial and real-world networks in terms of the results of hard-partitions, overlapping structures, as well as hierarchical structures. We also give the analysis of the resolution parameter in this model.
The methods compared include: the Louvain method [20] and Infomap [29], two of the most popular hard-partitioning methods; CPM [4], one of the most widely used overlapping community detection methods; Fuzzy Infomap (F-Infomap) [30], an extension of Infomap that detects overlapping communities; and SNMF [11] and BNMTF [12], which are model-based methods built on NMF. Except in some special comparison scenarios, we make a full comparison against all of these methods.
There are various quality metrics for evaluating the goodness of community structures. However, each metric is applicable only to certain evaluation scenarios, so we use different metrics in different cases. They include: normalized mutual information (NMI) [31], used to evaluate hard-partition results on networks with known community structures; generalized normalized mutual information (GNMI) [19], used to evaluate overlapping results on networks with known community structures; modularity Q [32], used to evaluate hard-partition results on real networks without ground truth; and the extended map equation L [30], used to evaluate overlapping results on real networks without ground truth.

Test on synthetic networks
To evaluate the performance of our approach, we conduct experiments on three types of synthetic benchmarks.
We first evaluate the hard-partitioning results. Here we adopt the widely used Newman model proposed in [3]. The graph consists of 128 vertices divided into four communities of 32 vertices each. Each vertex has an expected degree of 16, comprising on average z_in edges connecting it to vertices within the same community and z_out edges to vertices in other communities. As z_out increases, the community structure becomes more and more ambiguous. In this experiment, we first compare against the Louvain method, Infomap, SNMF, and BNMTF, using NMI as the accuracy metric; the higher the NMI value, the better the result. Because CPM and F-Infomap only provide overlapping community results, we cannot compute their NMI values and hence cannot include them in the NMI comparison. For this reason, we additionally use GNMI, which is suitable for evaluating both overlapping and disjoint structures, as the accuracy metric to compare against all of these methods.
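The benchmark just described can be sampled as a planted-partition graph. The sketch below (function name and edge-probability derivation ours) converts the expected within- and between-community degrees z_in and z_out into per-pair edge probabilities, under the stated setting of 4 groups of 32 vertices and expected degree 16.

```python
import numpy as np

def newman_benchmark(z_out, seed=0):
    """Sample a 128-vertex planted-partition graph: 4 groups of 32 vertices,
    expected degree 16, with z_out expected inter-community edges per vertex."""
    rng = np.random.default_rng(seed)
    n, groups, size = 128, 4, 32
    z_in = 16 - z_out
    p_in = z_in / (size - 1)      # within-community edge probability
    p_out = z_out / (n - size)    # between-community edge probability
    labels = np.repeat(np.arange(groups), size)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                A[i, j] = A[j, i] = 1
    return A, labels
```

Sampling with, say, z_out = 4 yields a graph whose mean degree fluctuates around 16, with communities that blur as z_out grows toward 8.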
The comparisons in terms of NMI and GNMI are shown in Figs 2 and 3, respectively. When z_out is small, the community structures are very clear, so the NMI and GNMI accuracies of all methods are very high, close to 1. As z_out increases, the community structures become less clear, which challenges all the methods. In particular, when z_out > 6, the NMI (and GNMI) accuracy begins to decrease quickly. When z_out > 8, each vertex has more between-community edges than within-community edges and there is almost no community structure in the network, leading to very low NMI (and GNMI) values for most methods. In general, however, the performance of our method is better than (or competitive with) that of the other methods in terms of both NMI and GNMI.
We then evaluate the overlapping community results. Here we adopt a newer type of benchmark proposed by Lancichinetti, Fortunato, and Radicchi [33], named LFR. Compared with the Newman model, the LFR model can not only generate overlapping communities, but also possesses the heterogeneous distributions of degree and community size that often appear in real-world networks.
Following the experiment designed by Lancichinetti et al. [33], the settings of the LFR model parameters are shown in Table 1. Here, N denotes the number of vertices; k the average vertex degree; maxk the maximum vertex degree; u the mixing parameter, i.e., each vertex shares a fraction u of its links with vertices in other communities; t1 the minus exponent of the degree sequence; t2 the minus exponent of the community size distribution; minc and maxc the minimum and maximum community sizes; on the fraction of overlapping vertices; and om the number of memberships of each overlapping vertex. In Table 1, S denotes the benchmark networks with smaller communities, B the benchmark networks with larger communities, and 0.1 and 0.3 denote networks with mixing parameters of 0.1 and 0.3, respectively. We thus produced four types of LFR benchmarks.
In this experiment, we compare against the Louvain method, CPM, Infomap, F-Infomap, and SNMF, using GNMI as the accuracy metric. Because BNMTF could not produce results within 100 h for each set of tests, we did not include it here. The comparison results are shown in Fig 4. As we can see, F-Infomap and our method perform best on networks with small communities (see Fig 4(a) and 4(b)). On large communities (see Fig 4(c) and 4(d)), the performance of our method is competitive with that of F-Infomap and SNMF, and much better than that of the other three methods. As the fraction of overlapping vertices on increases, the overlapping community structure becomes more and more indistinct, and hence the curves generally decrease. Compared with the other methods, however, our approach remains relatively stable as on changes, and performs well. Its performance is also relatively stable when the mixing parameter u changes from 0.1 to 0.3. This further demonstrates the effectiveness of our method on the LFR benchmarks.
Finally, we evaluate the hierarchical community results. We adopt the hierarchical community model proposed by Lancichinetti et al. [19], which extends Newman's basic model to accommodate hierarchical structures. This model contains 512 vertices and two levels in all. At the second level, the 512 vertices are arranged in 16 communities of 32 vertices each. The 16 communities in turn form 4 super-communities of 128 vertices each at the first level. On average, each vertex has k1 links to the other 31 vertices within the same second-level community, k2 links to the other 96 vertices within the same first-level super-community, and k3 links to the rest of the network. As k2 and k3 increase, the communities at each level become less and less clear, which challenges all the methods. Following the parameter configuration in [19], we set k1 = k2 = 16 and vary k3 from 16 to 36 in steps of 2.
As this model only provides the ground truth of hierarchical structure with disjoint communities, we compare against the Louvain method, which can provide hierarchical community structure. Besides, because Infomap has been extended to detect hierarchical structure [34], we also include it, denoted H-Infomap for short. Fig 5 shows the comparison results at the first level. As we can see, the performance of our approach is better than that of H-Infomap and competitive with that of the Louvain method. In particular, when k3 > 26, the NMI accuracy of the Louvain method is slightly better than that of our method, although both are very high and close to 1. When k3 > 22, H-Infomap detects 512 communities, far more than the ground truth, so its NMI value decreases to around 0.35. Furthermore, Fig 6 shows the comparison results at the second level. Here the performance of our approach is better than that of both the Louvain method and H-Infomap. The reason may be that the Louvain method is based on modularity optimization, which suffers from the resolution limit [35], and thus cannot find a high-resolution solution such as the second-level structure with 16 communities. H-Infomap also fails to find the real number of communities (16) at the second level; e.g., it often finds 4 or 512 communities on these networks. In contrast, our method can flexibly adjust the resolution parameter λ, and hence can accurately detect community structures at different levels.
To sum up, we adopt three types of artificial benchmark networks to test the performance of our approach. The results not only show the superior performance of our approach, but also validate its flexibility in detecting communities in different settings: disjoint communities, overlapping communities, and hierarchical structures.

Test on real-world networks

As real networks may have topological properties that differ from those of artificial networks, here we use various real-world networks to further evaluate the performance of our algorithm. We select 8 widely used real networks, summarized in Table 2, where N denotes the number of vertices, M the number of links, and K the number of communities; "-" indicates that the number of communities is unknown. In the following, we test the performance of the different algorithms in terms of disjoint and overlapping communities, respectively. First, we evaluate the disjoint community results of the different algorithms. Because Table 2 contains networks both with and without known communities, we take modularity Q as a unified quality metric. In this experiment, we compare against the Louvain method, Infomap, BNMF, and BNMTF. CPM and F-Infomap cannot provide hard-partitioning results, so we cannot compute Q-values for them and did not include these two methods here. The comparison results are shown in Table 3, whose last row gives the average Q-value of each method over all the networks. Because the Louvain method directly optimizes modularity Q, it is not surprising that it performs best on almost all the networks. For clarity, we therefore show the results of the Louvain method separately in the right column of the table. Excluding the Louvain method, we mark the best results in boldface and the second best in italic.
As we can see, our approach achieves the best performance among the methods that do not directly optimize modularity Q. Also of note, our approach has the advantage of providing overlapping and hierarchical solutions, while the other methods do not.
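As a reference for how the quality metric in these comparisons is computed, below is a minimal sketch of Newman's modularity Q for a hard partition; the adjacency matrix and labels are a toy example, not one of the benchmark networks:

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a hard partition.

    Q = (1/2m) * sum_ij [A_ij - k_i * k_j / 2m] * delta(c_i, c_j),
    where A is a symmetric adjacency matrix, k_i the degree of vertex i,
    and delta(c_i, c_j) = 1 when i and j share a community.
    """
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                  # vertex degrees
    two_m = A.sum()                    # 2m for an undirected network
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]    # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / two_m) * same).sum() / two_m)

# Two triangles joined by a single edge: a clear two-community split.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 4))  # → 0.3571
```

Putting all six vertices in one community gives Q = 0, the baseline against which the two-community split is measured.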
Next, we evaluate the quality of the overlapping communities obtained by the different algorithms, using the extended map equation L as the quality metric. Here all the methods are included in the comparison. The results are shown in Table 4; the last row gives the average L-value of each method over all the networks. Because both F-Infomap and Infomap directly optimize the minimum description length L, it is not surprising that they achieve the best and second best performance, respectively. Therefore, for clarity, we put their results on the right side of the table separately. As we can see, on average our approach achieves the best performance among the methods that are not based on the optimization of L. To sum up, these two experiments, in terms of disjoint and overlapping communities, both validate the effectiveness of our method.

In particular, here we take the dolphins social network as an example to test the performance of our approach in more detail. The dolphins social network was reported by Lusseau [37]; the vertices represent dolphins, and a link was created if two dolphins were observed together more often than expected by chance between 1994 and 2001. In the usual setting, the network is divided into two communities, one consisting mainly of male dolphins and the other mainly of female dolphins, marked by square and circle vertices, respectively (see Figs 7 and 8). Fig 7 illustrates the two communities detected by our approach at the scale λ = 1.7. Our result, marked by different colors, successfully recovers the two communities, with four overlapping vertices: PL, Oscar, DN63, and SN100. Moreover, as mentioned in [10], one can measure the uncertainty in assigning vertices to communities by the mean entropy, which the authors used to monitor the allocation confidence. [11] takes a similar strategy, also using the mean entropy to infer how active (i.e., how uncertain or fuzzy) a vertex is. Following these works, since each row of U in our model represents the propensity of a vertex to belong to each community, we can normalize each row of U to sum to 1, so that this propensity can be treated as a probability distribution. We can then compute the entropy of each vertex as a measure of the uncertainty of its community memberships. A high entropy stands for high uncertainty, which also means the vertex is more active than others.
Therefore, here we monitor the allocation confidence via the entropy (in bits), which for vertex i is defined as H(i) = -Σ_k ũ_ik log2 ũ_ik, where ũ_ik is the row-normalized membership of vertex i in community k. Fig 9 plots this entropy for the two-community result of Fig 7; the highest spikes correspond to the four overlapping vertices. This suggests that these overlapping vertices are more active than others; they are also located at the junction of the two communities, so it makes sense that they are allocated to both. Furthermore, Fig 8 illustrates the four communities at the scale λ = 1.5. As we can see, it reveals the community structure at a higher resolution, in which the larger community of Fig 7 is further divided. The corresponding entropies are shown in Fig 10, where the four highest spikes correspond to SN100, Double, TR99, and Kringel, which are again all overlapping vertices. In particular, we find that the entropy in Fig 9 is much sparser than that in Fig 10. This may imply that the vertices in smaller communities are usually more active than those in larger communities.
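The entropy diagnostic described above can be sketched as follows; the membership matrix U here is a toy example for illustration, not the fitted dolphins result:

```python
import numpy as np

def vertex_entropy(U):
    """Entropy (in bits) of each vertex's community memberships.

    Each row of the nonnegative membership matrix U is normalized to a
    probability distribution; high entropy marks 'active' (fuzzy) vertices.
    """
    U = np.asarray(U, dtype=float)
    P = U / U.sum(axis=1, keepdims=True)            # row-normalize
    with np.errstate(divide="ignore"):
        logs = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return -(P * logs).sum(axis=1)                  # H(i) = -sum_k p_ik log2 p_ik

# Toy membership matrix: vertex 0 belongs firmly to community 0,
# vertex 1 is shared equally between two communities (an overlap).
U = np.array([[0.9, 0.1],
              [0.5, 0.5]])
H = vertex_entropy(U)
print(H)   # the overlapping vertex has the higher entropy
```

A uniform membership over two communities gives the maximum of 1 bit, while a near-deterministic membership gives an entropy close to 0, matching the spikes at the overlapping dolphins.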

Analysis of the resolution parameter
Here we analyze the resolution parameter λ in the model. This parameter controls the weight of the regularization term in (8): as λ increases, the importance of the regularization term grows, and when λ equals 0 the regularization term has no effect. In particular, the larger the resolution parameter, the sparser the membership matrix U. Here we take four real-world networks as examples: the polbooks, dolphins, football, and karate networks. Firstly, we present the relationship between the resolution parameter λ and the number of communities in Fig 11. As we can see, the number of communities decreases as λ increases. This is because the regularization term penalizes U, making it sparser as λ grows; as a result, more "redundant" communities are abandoned, and the network ends up with only a few large communities. In particular, when λ > 3, only one community remains for each of the networks. As with other hierarchical methods, fixing a resolution yields the community structure at that scale. In general, one can first set a large resolution parameter to obtain the community structure at a coarse scale, and then decrease it gradually to explore the structures at finer scales, until the natural hierarchy of the network is revealed.
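To make the role of λ concrete, here is a simplified sketch of symmetric NMF with a column-wise l2,1 penalty, using a generic multiplicative update for an objective of the form ||A − UUᵀ||²_F + λ‖U‖₂,₁. This illustrates the mechanism only; it is not the paper's exact update rule, and the network and parameter values are toy choices:

```python
import numpy as np

def l21_nmf(A, K, lam, n_iter=500, eps=1e-10, seed=0):
    """Fit A ≈ U U^T under a column-wise l2,1 penalty (sketch).

    Objective: ||A - U U^T||_F^2 + lam * sum_k ||U[:, k]||_2.
    The penalty drives whole columns of U (whole communities) to zero,
    so the number of surviving columns is learned from lam.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((A.shape[0], K))
    for _ in range(n_iter):
        col_norms = np.linalg.norm(U, axis=0) + eps
        numer = 4 * A @ U                                   # -grad of error term
        denom = 4 * U @ (U.T @ U) + lam * U / col_norms + eps
        U *= numer / denom                                  # multiplicative step
    return U

def n_communities(U, tol=1e-3):
    """Count nonempty communities: columns whose l2 norm exceeds tol."""
    return int((np.linalg.norm(U, axis=0) > tol).sum())

# Two 4-cliques joined by one edge; start with K = 6 candidate communities.
A = np.zeros((8, 8))
for i in range(4):
    for j in range(i + 1, 4):
        A[i, j] = A[j, i] = A[i + 4, j + 4] = A[j + 4, i + 4] = 1
A[3, 4] = A[4, 3] = 1
U = l21_nmf(A, K=6, lam=2.0)
print(n_communities(U))
```

Rerunning with a larger λ prunes more columns, which is exactly the behavior shown in Fig 11: the number of nonempty communities shrinks as the resolution parameter grows.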
Furthermore, we check the quality of the community structures at different scales. Fig 12 shows the results for these networks in terms of NMI and modularity Q. In general, the quality curve for each network is parabola-like: it rises from a low value to a peak and then falls again. This phenomenon is intuitive. A small resolution parameter leads to many very small communities, which often do not meet the criteria of well-defined communities; as λ increases, the quality of the community structure improves. Eventually, however, a large resolution parameter so strongly enforces the sparsity of U that the whole network collapses into one community, which again leads to low values of modularity and NMI. Also, the actual number of communities often corresponds to the peak NMI accuracy. For example, the largest NMI accuracy on the dolphins network appears when the resolution parameter is around 1.6, where the number of communities is 2, as shown in Figs 11 and 12(a). The largest NMI accuracy on the karate network appears when the resolution parameter is around 2.6, where the number of communities is also 2, as shown in Figs 11 and 12(b). The same phenomenon occurs on the other two networks.
As discussed in [19] and [21], to obtain the hierarchical community structure one often has to scan across the resolution parameter. Here, however, we aim to find a "robust" region for this parameter that gives satisfactory results in most cases. Several strategies exist for determining resolution parameters, such as cross validation and consensus clustering. Of particular interest is the viewpoint proposed in [42][43][44], which uses a dynamic process to explore networks: it provides a physical interpretation of the resolution parameter (the inverse of time t) and assesses the robustness of partitions at time t. However, these strategies are not suitable for our model, so we give a new one. According to our experiments, as the resolution parameter λ increases, the error between the adjacency matrix A and the expected adjacency matrix UU^T (the first term of (8)) usually increases (see Figs 13(a) and 14(a)), while the l2,1 norm regularization (the second term of (8)) usually decreases (see Figs 13(b) and 14(b)). That is, as λ increases, the regularization term makes U sparser, which leads to a smaller l2,1 norm of U and more zero columns in U; on the other hand, a sparser U results in a larger reconstruction error between A and UU^T. We therefore seek a good balance between these two terms. We depict the ratio of the l2,1 norm regularization term to the error term in Figs 13(c) and 14(c), and the number of communities obtained at different scales in Figs 13(d) and 14(d).
As shown, when this ratio is around 0.5 (marked by the red ellipse), i.e., when the value of the l2,1 norm regularization term in (8) is about half the value of the error term, we obtain a number of communities close to the real one, and the community results are often satisfactory. Thus, in general we recommend as the "robust" region for the resolution parameter those values that make the ratio of the second term to the first term in (8) around 0.5. This is also the setting used in our experiments.
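The heuristic above can be checked numerically for any fitted U; a minimal sketch, where the matrices are toy values chosen so the ratio is hand-checkable:

```python
import numpy as np

def term_ratio(A, U, lam=1.0):
    """Ratio of the regularization term lam * ||U||_{2,1} to the
    reconstruction error ||A - U U^T||_F^2; values near 0.5 mark the
    recommended 'robust' region for the resolution parameter."""
    err = np.linalg.norm(A - U @ U.T, "fro") ** 2
    l21 = np.linalg.norm(U, axis=0).sum()   # sum of column l2 norms
    return lam * l21 / err

# Toy case: A is a single edge, U the identity, lam = 1.
# err = ||[[-1,1],[1,-1]]||_F^2 = 4 and ||U||_{2,1} = 2, so the ratio is 0.5.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
U = np.eye(2)
print(term_ratio(A, U))  # → 0.5
```

In practice one would evaluate this ratio across a sweep of λ values and keep the scale where it sits near 0.5, as in Figs 13(c) and 14(c).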

Application
In this section, we use the network of word associations as an application to demonstrate the performance of our approach on real-world problems. This network was created by the University of South Florida and the University of Kansas [6] and includes 5,019 stimulus words. In all, more than 6,000 participants joined the project; each was asked to write down the first word that came to mind upon hearing a given word. In this way, a word association network containing 5,017 vertices and 29,148 links was constructed. Originally the network was weighted, with the weight of a link representing how frequently two given words were associated; we simplified it to an unweighted network by ignoring the weights, following the method in [22].
The network possesses rich metadata describing the structural and functional roles of each vertex. Therefore, we can evaluate the performance of different methods by measuring how well the detected communities reflect the metadata. We choose CPM for comparison in terms of overlapping community detection, and the Louvain method in terms of disjoint community detection. CPM does not assign every vertex in a network to a community; some are considered "background" vertices belonging to no community. Thus, when comparing with CPM, we filter out the "background" vertices and use only the remaining subnetwork, whereas when comparing with the Louvain method we use the whole network. Because the number of communities K also affects the evaluation results, we first set K in our model to the value obtained by each compared method: when comparing with CPM we use the K obtained by CPM, and when comparing with the Louvain method we use the K obtained by Louvain. Furthermore, we also use the method introduced in "Analysis of the resolution parameter" to determine K automatically. Specifically, we obtained K = 958 communities on the subnetwork filtered by CPM, and K = 961 communities on the whole network.
We use the WordNet database for the metadata [45], which is specifically built for semantic analysis. This database assigns a set of meanings/definitions, known as Synsets, to each word, and a unique ID to each detailed meaning, which enables quantitative analysis. In principle, a pair of words can be considered similar if they belong to the same Synset. We can measure the quality of a detected community structure by the enrichment of vertex pair similarity [22], defined as

E = ⟨μ(i, j)⟩_(pairs sharing a community) / ⟨μ(i, j)⟩_(all pairs),

where μ(i, j) = 1 if words i and j belong to the same Synset, and 0 otherwise. In other words, the enrichment is the average metadata similarity between all pairs of vertices that share a community, divided by the average metadata similarity between all pairs of vertices. The denominator serves as a baseline, and the larger the enrichment, the more similar the vertices within communities, indicating a better community structure. When we set the number of communities K to that of CPM, the enrichment obtained by CPM and by our method is 30.75 and 56.31, respectively. Despite using the filtered network, which favors CPM, our result is still better than that of CPM in terms of real semantics. Furthermore, on the whole network, when we set K to that of the Louvain method, the enrichment of the Louvain method is 15.97 and that of our method is 53.32, which also indicates the superiority of our method from the point of view of semantic analysis. Besides, when there is no prior information about the number of communities, the enrichment of our method is 63.42 on the filtered network and 80.45 on the whole network, which again shows its superiority over CPM and the Louvain method.
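The enrichment defined above can be computed directly from a community cover and a pairwise similarity function; a minimal sketch with toy metadata (the Synset dictionary below is invented for illustration, not taken from WordNet):

```python
from itertools import combinations

def enrichment(communities, mu):
    """Enrichment of vertex-pair similarity: average mu(i, j) over pairs
    sharing at least one community, divided by the average over all pairs."""
    vertices = sorted({v for c in communities for v in c})
    member = {v: {k for k, c in enumerate(communities) if v in c}
              for v in vertices}
    shared, everyone = [], []
    for i, j in combinations(vertices, 2):
        s = mu(i, j)
        everyone.append(s)
        if member[i] & member[j]:          # the pair shares a community
            shared.append(s)
    return (sum(shared) / len(shared)) / (sum(everyone) / len(everyone))

# Toy metadata: words 0 and 1 share a Synset, as do words 2 and 3.
synsets = {0: {"a"}, 1: {"a"}, 2: {"b"}, 3: {"b"}}
mu = lambda i, j: 1.0 if synsets[i] & synsets[j] else 0.0

# A cover that matches the metadata: within-community similarity is 1,
# the baseline over all 6 pairs is 1/3, so the enrichment is 3.
print(enrichment([[0, 1], [2, 3]], mu))  # → 3.0
```

A cover that cuts across the metadata lowers the numerator toward the baseline, so its enrichment drops toward 1 or below; this is the sense in which larger enrichment indicates better community structure.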
Here we first analyze all the communities in our result that overlap at the popular word "DAY" when K is set to the same value as for CPM; they are shown in Table 5. As we can see, our method provides much more semantic information than CPM. The reason is that CPM is a k-clique percolation method, which imposes a strong constraint on networks; in practice, real-world networks do not always consist of cliques, so CPM may lose some useful information. For example, in the first community we detect "WEEKEND" while CPM does not, although "WEEKEND" is clearly similar to the other words in this community from a semantic perspective. Therefore, our approach is free from the clique constraint and provides more comprehensive semantic information than CPM. Furthermore, we also analyze the communities associated with "DAY" when the number of communities K is determined automatically. As shown in Table 6, we obtain community results similar to those of CPM. In summary, in both situations our method finds significant communities with similar semantics.

Discussion
In this work, we propose a novel generative model to detect overlapping and hierarchical community structure, based on nonnegative matrix factorization with an l2,1 norm regularization term balanced by a resolution parameter. In this approach, the NMF formulation provides the overlapping community solution, and the l2,1 norm regularization term solves the model selection problem, i.e., it learns the number of communities automatically. Besides, by varying the resolution parameter, we obtain the hierarchical organization of networks and thus reveal more comprehensive information. All of these problems must be addressed in real-world applications, and here we provide a unified framework for them. Furthermore, we derive the update rule for the model parameters and prove its correctness. In addition, because our approach is based on NMF, it can not only capture the membership of a vertex in multiple communities, but also measure how strongly a vertex participates in each of them. Finally, experiments on both synthetic and real-world networks demonstrate the effectiveness of our approach. However, our method is not perfect and still has room for improvement. Our generative model was originally designed for static networks, but in many cases a static network can be considered a snapshot in the process of network evolution. Therefore, we wish to extend our approach to detect communities, as well as their evolution, in dynamic situations. A natural extension may be to design an effective regularization term that smooths between the current membership matrix and the membership matrix at the next time stamp. This will be our main direction in the future.

Author Contributions
Conceived and designed the experiments: XC XW DJ. Performed the experiments: XW XT. Analyzed the data: XW DJ. Contributed reagents/materials/analysis tools: XG. Wrote the paper: XW DJ.