Link Community Detection Using Generative Model and Nonnegative Matrix Factorization

Discovery of communities in complex networks is a fundamental data analysis problem with applications in various domains. While most of the existing approaches have focused on discovering communities of nodes, recent studies have shown the advantages and uses of link community discovery in networks. Generative models provide a promising class of techniques for the identification of modular structures in networks, but most generative models mainly focus on the detection of node communities rather than link communities. In this work, we propose a generative model, which is based on the importance of each node when forming links in each community, to describe the structure of link communities. We proceed to fit the model parameters by taking it as an optimization problem, and solve it using nonnegative matrix factorization. Thereafter, in order to automatically determine the number of communities, we extend the above method by introducing a strategy of iterative bipartition. This extended method not only finds the number of communities all by itself, but also obtains high efficiency, and thus it is more suitable to deal with large and unexplored real networks. We test this approach on both synthetic benchmarks and real-world networks including an application on a large biological network, and compare it with two highly related methods. Results demonstrate the superior performance of our approach over competing methods for the detection of link communities.


Introduction
Many complex systems in the real world exist in the form of networks, such as social networks, biological networks, Web networks, etc., which are collectively referred to as complex networks. One of the main problems in the study of complex networks is the detection of community structure [1], a subject that keeps attracting a great deal of interest. Although no common definition has been agreed upon, a community within a network is usually defined as a group of nodes that are densely connected with respect to the rest of the network. In the past few years, many different approaches have been proposed to uncover community structure in networks. For review, the interested readers can refer to Ref. [2,3].
Although previous research towards community detection mainly focused on the community of nodes, several recent works begin to switch the attention to community of links [4][5][6][7][8][9][10]. The motivation is that link communities are more intuitive than node communities in many real-world networks. This is due to the link usually having a unique identity, while the node tends to have multiple roles. In a social network, for instance, most individuals belong to multiple communities such as families, friends, and coworkers, while the link between a pair of individuals often exists for a dominant reason which may represent family ties, friendship, or professional relationships, et al. Furthermore, the links connected to a single node may belong to several different link communities, thus the node can be assigned to multiple communities of links. Accordingly, overlapping communities of nodes, which is another attractive topic in community detection [11], could be detected as a natural byproduct of link communities.
Recently, a number of approaches to the detection of link communities in graphs have been proposed. For instance, Ahn et al. [4] used hierarchical clustering with a similarity metric between links to build a dendrogram of link communities, which provides a rich hierarchy of structures. Further, in order to obtain the most relevant communities, they introduced a link density function to determine the best level at which to cut the tree. Evans et al. [5,6] transformed the targeted network into the corresponding line graph based on several types of random walks, and then they detected link communities by applying the existing algorithms for node partitioning on this generated line graph. Kim et al. [7] extended the map equation method [8] originally developed for node communities, by assigning the first level code to each link community while still assigning the second level codes to the nodes, so as to find link communities in networks. Pan et al. [9] detected link communities by a local-based method, which finds each natural community through expanding a selected seed by optimizing a proposed local function. He et al. [10] presented a stochastic process based on a link-node-link random walk to unfold the community structure of links, and then used the local mixing properties of the Markov chain to extract the emerged link communities.
Moreover, in face of the good performance and sound theoretical principles, generative models form a promising class of techniques for identifying communities from networks. Techniques that are being actively researched and developed [12]. Recently, several model-based methods have been proposed, which are based on a blockmodel or its variations, and employ different types of inference algorithms, such as expectationmaximization, nonnegative matrix factorization, and others. However, most of them are focused on the detection of node communities [13][14][15][16][17][18][19]. We are aware of only one exception, which is the algorithm designed by Ball et al. [20] that considers the detection of link communities. In Ball's method, the model is parameterized by a set of parameters h iz 's, where h iz denotes the propensity of node i to have links in the z-th community. Then, they take h iz h jz as the expected number of links in the z-th community connecting nodes i and j. Finally, they fit the parameters of this model using a method of maximum likelihood evaluation based on an expectation-maximization algorithm.
In this work, based on Ball's model, we propose a new generative model to describe the community structure of links. This model is parameterized by two sets of parameters v z 's and Q iz 's, where v z denotes the size of a link community z, and Q iz describes the degree of importance when node i forms links in this community. Then, the expected number of links in the z-th community connecting nodes i and j is denoted by v z Q iz Q jz . Compared with Ball's model, here we introduce an additional set of parameters v z 's to characterize the sizes of different communities, aiming to make it more flexible when describing link communities. Thereafter, in order to fit the parameters of this model, we define it as an optimization problem based on squared loss, and solve it by using a technique of nonnegative matrix factorization (NMF). At last, we extend the above method by introducing a strategy of iterative bipartition, namely NMFIB, which can not only determine the number of communities all by itself but also get these results with a high efficiency. Therefore, this combined approach is better suited for discovery in unexplored and large networks. Also of note is that, this iterative bipartition process can be used to improve other model-based methods, such as Ball's method.

Methods
In this section, we first introduce a model for the description of link communities in networks, and then present a method based on nonnegative matrix factorization to fit the model parameters. Thereafter, we offer an example to illustrate the method. At last, we extend the above method to a new one that automatically determines the number of communities.

Generative Model
We define the generative model of link communities, which produces networks with a given number n of vertices and m undirected edges. Assume that the links can be partitioned into c communities using a soft community membership variable R, where R ij z denotes the probability that link (i, j) belongs to community z, subject to P z R z ij~1 . Then the model is parameterized by two sets of parameters v z 's and Q iz 's, where v z denotes the size of community z and is defined as twice the expected number of links (or weight) in this community, and Q iz denotes the probability that community z selects node i when it generates edges. Thus, we have P z v z~2 m and P i w iz~1 . Based on the above model, an edge (i, j) can be generated as follows. First, we choose a community z randomly with an expected size v z . Then by using probabilities Q iz , Q jz , community z selects nodes i, j as a pair. Consequently, the expected number of links in community z, that lies between nodes i and j, can be evaluated asÂ Summing over communities z, the expected number of links between i and j can be written aŝ Note that multiple links and self-edges are both allowed here, which is typical for simple random graph models [18,20].
Under this model, the link communities will naturally form with the generation of networks. Intuitively, two nodes i and j which have large values of Q iz and Q jz , for some given community z with a large size v z , should have a high probability of being connected by a link with index z. Thus, groups of such nodes will tend to be connected by relatively dense webs of z-links, and these sets of edges correctly form the link communities we expect to see.
Formally speaking, assume that the community assignments are represented by a set of variables R ij z 's, where R ij z denotes the fraction by which a link (i, j) belongs to community z. Then we have As the soft membership of communities cannot be used directly, we can simply assign each link (i, j) to community r satisfying r = argmax z {R ij z , z = 1,2,…,c}, and then get the hard partition of links.

Parameter Fitting
The above model is specified by two sets of parameters v z 's and Q iz 's, depicting, respectively, the constraints P z v z~2 m and P i w iz~1 . These parameters have to be fitted from the data of the given network G to be analyzed. The problem of fitting the model to the data of G can be cast as the following optimization problem, where L sq is a squared loss function. The best fit between the expected graph with adjacency matrixÂ A~(Â A ij ) n|n and the given network G with adjacency matrix A = (A ij ) n6n can be achieved by optimizing (4). In the rest of this section, a method based on nonnegative matrix factorization (NMF) is developed to solve the optimization problem in (4). We first introduce an auxiliary matrix X, where X iz is defined as The loss function in (4) can be rewritten as a constrained nonnegative matrix factorization problem, where ||.|| F denotes the Frobenius norm, and m~1 2 P ij A ij . When we get the optimal X, using (5) v z can be given as since we have P i w iz~1 , then Qiz can be given as It is nontrivial to directly optimize (6) with the hard constraints. We relax this optimization problem by introducing a penalty term that represents the hard constraints into the objective function, arriving at minimizing the following objective function, where l is a hyper-parameter that reflects the importance of the hard constraints. Violation to stronger hard constraints incurs a higher penalty to the objective function. In our experiments, we first get an initial value of X 0 by setting l = 0. Then we restart the optimization with X = X 0 and let l to be a relatively large number, e.g. 1000, to minimize the chance of violating the parameter constraints. The purpose of the initialization is to restrict the model search so as to start from some good approximation. Similar to other forms of NMF, the objective function in (9) is not convex w.r.t. X, so that it is computationally intractable to find a global minimum. Therefore, a heuristic, gradient descent strategy is adopted to search for local minima. This gradient descent strategy can be implemented in a multiplicative updating algorithm similar to the method for SNMF [13]. In order to derive the update rule, a Lagrange multiplier matrix H for the nonnegative constraints on X is introduced to (9), resulting in the following equivalent objective function, For any stationary state, we have Using complementary slackness condition (H) iz (X) iz = 0, we have the following equation, This leads to the following update rule for X: When the above iteration rule converges, the converged solution satisfies the Karush-Kuhn-Tucker (KKT) conditions [21].
The convergence of the iterative updating rules follows the theorem below.
Theorem 1: The objective function O in (9) is non-increasing under the update rule in (10). O is invariant under these updates if and only if X becomes stationary.
Proof: We adopt the auxiliary function approach used in Expectation-Maximization and NMF. The basic idea is to construct an auxiliary function C(X ,X X ) such that: Note that, The equality clearly holds when X~X X . Then C(X ,X X ) satisfied the conditions of being an auxiliary function for O(X). We can define the series of update rules as: So we get the update rule for X as in (10).
Notice that, the time to calculate AX in (10) is mc, where m is the number of edges and c is the number of communities. The time to calculate 1 n (1 n T X) is nc, where n is the number of nodes. The time of calculating X(X T X) is nc 2 , and the time of calculating 1 n (1 n T X)(X T 1 n (1 n T X)) is also nc 2 . Consequently, the time to evaluate (10) once is O(mc+nc 2 ), and hence the time complexity of our method is O(T(mc+nc 2 )), where T is the iteration number for convergence. Also, according to [20] the time complexity of Ball's method is O(Tmc). Therefore, the time complexity of our method is competitive with that of Ball's since nc is often competitive with m.

An Illustrative Example
Here we depict the main idea of our method using a simple example, illustrated by Table 1 and Figure 1.
First, we fit the expected graph to the given network G by optimizing (4), and get the best parameters of the model v z 's and Q iz 's which are shown as Table 1. Then using v z 's and Q iz 's, we can form the expected graphs of all the link communities in G according to (1), which are shown as Figure 1(b) and (c), respectively. Further, we can form the expected graph of the whole network G according to (2), which can be regarded as an ensemble of the expected graphs of all its communities, shown as Figure 1(d). Notice that, a value marked between a pair of nodes denotes the expected number (or weight) of links between them. Finally, we can infer the community structure of links according to (3), which is equivalent to dividing the expected graph of the red/ blue link community shown in Figure 1(b)/(c) into the expected graph of G shown in Figure 1(d). As expected, the result perfectly matches the ground-truth given in Figure 1(a).

Determining the Number of Communities Automatically
Nonetheless, the method mentioned above can still be improved. The main drawback is that, our model offers no criteria for determining the value of parameter c, i.e., the number   of communities in a network. This is also a common drawback suffered by almost all methods based on generative models. The statistical model selection applied to generative models may, in principle, be able to find the number of communities in a consistent and satisfactory manner [22,23], but it is, at present, too computationally demanding to be applied to any but some small networks [20]. Still, it is an open problem whether a reliable method can be developed that runs in reasonable time on the large networks of interest to today's scientists [18]. Furthermore, even if the number c of communities is given, as large networks often have large values of c, the convergence rate of the core optimization algorithms (such as expectation-maximization algorithm, nonnegative matrix factorization, et al.) will necessarily become very slow. This is also an important limitation from existing model-based methods when dealing with large-scale networks in the real world.
To mitigate the above problems, we extend our original NMF method proposed above to a more practical one, namely NMFIB, meaning ''NMF with iterative bipartition''. In NMFIB, we first divide a network into two link modules using NMF with the number of communities c = 2, and then recursively subdivide the two parts. In dividing a subnetwork, we isolate it from the rest of the network and perform a 'nested' NMFIB on it, resulting in a link partition of the subnetwork with two smaller link communities. Subsequently, we decide whether to accept this bipartition by a special method based on the link partition quality. We summarize the algorithm NMFIB using the following recursive algorithm: Algorithm P = NMFIB(G)//G is a network, P is a link partition of G. 3. If the link partition quality cannot be improved by this bipartition, return P; //the quality function is to be introduced later. 4. P 1 = NMFIB(N 1 ); 5. P 2 = NMFIB(N 2 ); 6. Return P = P 1| P 2 . Subsequently, our remaining task for NMFIB is focused on determining the termination condition for the repetitive process of subdividing the links of network G, so as to obtain a superior link community structure. There are several measures for community structures, but most of them are defined for node communities [2,24,25]. Fortunately, partition density D [4], which is based on a type of link density, is specially designed for link communities. Here we use it as our quality metric, which is introduced as follows.
For a network with m links and n nodes, P = {P 1 , P 2 , …, P c } is a partition of the links into c communities. The number of links in community z, P z , is m z = |P z |. The number of induced nodes, all nodes that those links touch, is n z = |< eijMPz, {i, j}|. The link density D z of P z is This is m z normalized by the minimum and maximum numbers of links among n z connected nodes. Thus, D z = 1 when P z is a clique, or D z = 0 when P z is a tree. In particular, we assume that D z = 0 if n z = 2 without loss of generality. In essence, D z measures how 'clique-ish' versus 'tree-ish' P z is. Then, the partition density, D, is the average of D z , weighted by the fraction of links that are present: It is worthy of note that, when using function D, the determination of acceptance of the bipartitions for link communities will be independent on the order in which the bipartitioning queue happens in the above iterative procedure. Let we divide an arbitrary community r into two sub-communities r 1 and r 2 . The community structure will become P' = {P 1 , …, P r-1 , P r1 , P r2 , P r+1 , …, P c }, and the its D-value will be and then the variation of the partition density, denoted by DD, will be Then, considering a bipartition for an arbitrary community, the variation of its D-value only depends on this community and its bipartition result, and thus the determination of acceptance of the bipartitions for link communities will be independent on the order of the bipartitioning queue.

Results and Discussion
In order to evaluate the performance of our above method, we tested it both on benchmark synthetic networks and on some widely used real-world networks. The synthetic networks allow us to test the ability of different methods to detect known communities under controlled conditions, while the real networks allow us to observe its performance under practical conditions.  Further, we applied our method on a large biological network derived from real world data. Furthermore, we compared our method with two well-known and highly related methods. The first is a model-based method for link communities proposed by Ball et al [20], and the other is the notable method of link communities proposed by Ahn et al [4]. To the best of our knowledge, our method and Ball's method are the only two methods based on generative models that handle link communities, and our method and Ahn's method are the only two hierarchical methods considering partition density D [4] as the cut metric to detect link communities.  All experiments are done on a single Dell Server (Intel(R) Xeon(R) CPU 5130 @ 2.00 GHz 2.00 GHz processor with 4 Gbytes of main memory). The source code of the algorithms used here can all be obtained from the authors. Especially, the code of our methods, which are written as two functions NMF (our original method) and NMFIB (our method with iterative bipartition) in Matlab, is available in [26]. Also, interested researchers can contact us directly if interested on the code and instructions.

Synthetic Networks
Recently, several types of synthetic benchmarks have been proposed for node communities [1,27,28]. However, there is only one benchmark, to our knowledge, designed for testing the fitness of algorithms with respect to link community detection [20], and thus it is the one selected for use in this evaluation. Furthermore, we employed two accuracy measures introduced in [20], namely ''Fraction of Vertices Classified Correctly (FVCC)'' and ''Jaccard index'', to compare the planted community structure of the network and the one delivered by the algorithm. Notice that Anh's method does not appear here, this because it often finds very small communities, and fails to detect the communities defined in this benchmark.
As done in the experiment designed by Ball et al. in [20], the parameter setting for this benchmark is given as follows. The networks have n = 10000 nodes each, divided into two overlapping (link) communities. We placed x nodes in the first community only, i.e., these nodes have connections exclusively within the community, y nodes in the second community only, and the remaining z = n-x-y nodes in both communities, with equal numbers of connections to nodes in these two communities on average. We set the expected degree of all nodes to a fixed value ,k.. We also varied the parameters x, y, z, and ,k. to generate networks with stark community structures or no structure at all, so as to vary the difficulty of the network instances posed to the algorithms.
We performed three sets of tests. In the first set of experiments, we fixed the size of the overlap between the communities at z = 500, divided the remaining nodes evenly (i.e., x = y = 4750), and varied the value of ,k. from 1 to 16 with an increment of 1. For the second set of tests, we again set the overlap at z = 500 but fixed ,k. = 10 and varied the ratio between x and y. Finally, for the third set of tests, we set ,k. = 10, constrained x and y to be equal, and varied the amount of overlap z.
As the actual number of communities of the benchmark networks used here is known to be 2, for fairness we make it as a priori information for both our NMF method and Ball's method in this comparison. In Figure 2, we show the fraction of correctly classified nodes by the two algorithms for each of the three sets of experiments. To be considered correctly classified, a node's membership in both communities must be reported correctly by an algorithm. As shown in Figure 2, our NMF method outperforms Ball's method in terms of FVCC accuracy in all the three tests.
Furthermore, we adopted the Jaccard index to compare the two algorithms' ability for identifying overlapping (link) communities using the same sets of network instances. Let S be the set of truly overlapping nodes and V be the set of predicted overlapping nodes, the Jaccard index is J = |S>V|/|S<V|. This index is a standard measure of similarity between sets that rewards accurate identification of the overlap while penalizes both false positives and false negatives. Figure 3 shows the result of comparing the two algorithms using the Jaccard index. As shown, our NMF method is also superior to Ball's method in all the three sets of experiments. This result is similar to the results in Figure 2, and they both confirm the validity of our method.

Real Networks
As real-world networks may have some unique topological properties not present in synthetic ones, we now consider some widely used real networks to further evaluate the performance of these algorithms. All the networks we used here are obtained from Newman's website [29], except that 'protein-protein interaction' and 'word association' introduced by [11] are got from Palla's website [30]. Besides, as some of the compared methods partly optimized the partition density D [4], it seems to be unreasonable if we adopt D as the quality metric to compare their results. Fortunately, the extended map equation [7], which is based on the principle of minimum description length (MDL) [31], can naturally measure link communities. Therefore, we used it to evaluate community structures obtained by different methods. Under this measure, the shorter the MDL of an overall community structure, the better the structure is.
In the following, in order to evaluate these methods fairly and completely, we perform three sets of tests. First, we use the number of communities attained by Ahn's method as a priori information needed by Ball's method and our NMF method, and compare these three methods under the condition that the number of communities is the same. The compared results are shown in Table 2. As we can see, our NMF method has the best performance on 12 of the 15 networks in terms of MDL, and Ahn's method performs best on the other 3 networks. Note that our NMF method and Ball's method both do not get the result on the largest network 'word association' within the limited time and memory.
Further, we compared the performance of our NMF method with iterative bipartition (NMFIB), Ball's method with iterative bipartition, as well as Ahn's method. At this time, all these methods can determine the number of communities automatically. The comparison of these algorithms is shown in Table 3. We find out that, our method NMFIB has the best performance on 13 of the 16 networks in terms of MDL, Ball's method performs best on one network, and Ahn's method performs best on the two remaining networks. Finally, we compare our NMF method with iterative bipartition (and also Ball's method with IB) and our original NMF method (and Ball's original method) with the given number of communities c got by the corresponding iterative bipartition method. The compared results are shown in Table 4. As we can see, the clustering quality of the iterative bipartition method is competitive with that of the original method for both our NMF method and Ball's method. But notice that, the efficiency of the iterative bipartition method is much higher than that of the original one.
To sum up, our method with iterative bipartition not only has a higher clustering quality compared with other methods, but also it can determine the number of communities automatically. Thus, it may be more suitable for use when detecting link communities on unexplored real networks.

Application
The large real network we selected for a particular application is the protein-protein interaction (PPI) network of budding yeast Saccharomyces cerevisiae [11,32]. It contains 2,640 nodes (proteins) and 6,600 links (physical interactions between pairs of proteins).
We used the Gene Ontology (GO) terms [33], the most elaborate gene function annotations, as domain metadata for quality assessment. The GO terms include information on functions and cellular locations of a gene and biological pathways that a gene may be involved in. The biological significance of a community of genes (nodes) can be measured by the GO terms enriched in the genes in the community. Enrichment of GO terms can be evaluated by a hyper-geometric test [34], providing a GO term a p-value to quantify the significance of the term. To quantify the biological significance of a community structure, we used as quality metric the average number of significantly enriched GO terms with p-values less than a given threshold for all communities. The larger this average number of significant GO terms, the more biologically significant the community structure is.
Here we compared the results of our method NMFIB, Ball's method with iterative bipartition and Ahn's method, since all of them can automatically determine the number of communities. The results are shown in Figure 4. As we can see, our NMFIB identified PPI community structures with many more significant GO terms than Ball's method and Ahn's method under all 10 different p-value thresholds tested. It stands as an example of the consistent superior performance of our method over all compared competing methods.

Conclusions
In this work, we proposed a generative model for link communities based on Ball's model [20]. Compared with Ball's model, we included an additional set of parameters, v z 's, to characterize the sizes of different communities, which may enable our model to be more flexible in describing link communities. Then, we fitted the model parameters by taking it as an optimization problem, and used an approach of nonnegative matrix factorization to solve it. Thereafter, we extended the above method by introducing a strategy of iterative bipartition. This leads to a new method, namely NMFIB, that we show to be more suitable for structure discovery in unexplored and large networks. Also, this iterative bipartition process is suitable to be used to improve other model-based methods, such as Ball's method. We tested our method both on synthetic benchmarks and on real-world networks including an application on a large biological network, and compared it with two well-known and highly related methods. Experimental results demonstrated the superior performance of our method over the competing methods in the detection of link communities.
Our model was mainly designed to describe link communities, but it can be also extended to find node communities. A simple and approximate method is assigning each node solely to the community to which it most strongly belongs in the overlapping (link) community structure. But in the future, we wish to improve our model from some other viewpoints, and make it able to describe link communities as well as node communities, naturally and in a similar way, rather than being a simple extension. Furthermore, as discussed, the efficiency of our method is competitive with that of Ball's method, and it will become more efficient when introducing the strategy of iterative bipartition. Nevertheless, in order to deal with some very large networks such as the WWW, the Internet, etc, its efficiency needs still to be improved. Possibly, we can devise a pruning strategy that sets to zero any X iz (elements in the auxiliary nonnegative matrix X) that falls below a predetermined threshold, and improve the efficiency of our NMF method by using a technique of sparse matrix calculations. This is one of the directions for our future work. Moreover, in the current work, we only used partition density D [4] as the metric to determine the acceptance of each bipartition. However, there are some other quality metrics for link communities, such as the extended modularity [5,6] or the extended map equation [7], which may be also suitable for our iterative bipartition procedure. Thus in the future, we wish to include in our software the option of choosing different quality metrics, which may make our method more powerful.