Overlapping community detection in networks based on link partitioning and partitioning around medoids

In this paper, we present LPAM (Link Partitioning Around Medoids), a new method for detecting overlapping communities in networks when the number of clusters is predefined. The overlapping communities of the graph are obtained by detecting disjoint communities in the associated line graph: the links are partitioned around medoids using a distance function defined on the set of nodes. We consider both the commute distance and the amplified commute distance as distance functions. The performance of the LPAM method is evaluated with computational experiments on real-life instances as well as on synthetic network benchmarks. For small and medium-size networks the exact solution was found, while for large networks we found solutions with a heuristic version of the LPAM method.


Introduction
Detection of overlapping communities in a network is the task of grouping the nodes of the network into a family of subsets called clusters, so that each cluster contains nodes which are similar with respect to the overall network structure. Overlapping means that clusters can intersect each other, so that a node can belong to several clusters, in contrast with disjoint community detection where the clusters form a partition of the node set.
To this day there is no widely accepted formal definition for the notion of community in a network. This leads to different community definitions and allows for the existence of a variety of graph clustering methods that can be compared only with respect to their computational complexity and the empirical evaluation of their proposed communities. The common approach to formalize the notion of community in a network is through the use of quality functions which attempt to quantify the degree of community structure captured by a given partition of the nodes. That is, a quality function will in principle attain extreme values for clusterings of the nodes which best reflect the community structure of the graph. Given such a quality function then, community detection translates to an optimization problem.
One of the best-known quality functions is modularity [1], which has been used by many methods that solve the related optimization problem with varying success. However, what properties a good quality function should have is still an open question [2].
Community detection in networks is still an actively developing area connected to many fields of science that need tools for complex network analysis, including molecular biology, sociology, data mining and unsupervised machine learning. Network clustering methods can be classified according to the approaches they are based on.
There is a plethora of different methods and approaches for overlapping community detection in graphs, a fact which can be partially attributed to the absence of a well-defined and widely accepted quality function for overlapping communities, as is also the case for non-overlapping community detection. An attempt to axiomatize quality functions for non-overlapping graph clustering, in the form of intuitive properties that any such function should satisfy, is presented in [2]. The authors of [2], driven by similar results on distance-based clustering, present six such properties. For instance, the value of a clustering quality function should not decrease if, for a given clustering, we add edges between nodes in the same cluster. Moreover, they showed that modularity does not satisfy some of these properties. In a more recent and related work, the authors of [3] compiled a survey of the currently known families of quality functions, or metrics as they call them, for both non-overlapping and overlapping graph clustering. The authors of [3] also present computational experiments on sets of benchmark instances with known community structure, to compare these quality functions in terms of how well they identify the communities. The most recent overview and classification of the state-of-the-art methods for overlapping community detection, as well as a computational comparison of existing methods and an evaluation on benchmark instances, can be found in [4]. In [4] the authors present fourteen different algorithms and propose a unified framework for testing them.
One approach to overlapping community detection is link partitioning, also known as link community identification. The idea of this approach is the following. If we assume that the nodes of a network represent the entities of a system and the edges the binary relations between them, then instead of partitioning the nodes, which yields non-overlapping communities, we partition the edges, in the sense that the relations between the nodes define the community structure and not the nodes themselves. A node then belongs to the communities defined by its incident edges. For example, a person may play soccer with a group of playmates on the weekends and go to work with coworkers on other days. That person has two types of relations with other persons: "plays soccer with" and "works with". Thus, the person belongs to two communities, the community of soccer players and the community of colleagues, and can be considered an overlapped node. Given that a coworker can also be a playmate, we have overlapping communities. Although link partitioning seems a very natural approach to overlapping community detection, historically the methods that exploit it appeared relatively late. In 2009 and later in 2010, Evans and Lambiotte [5], [6] were the first to use a node partition of a line graph to obtain an edge partition of the original graph. They projected the network onto a weighted line graph whose nodes are the links of the original graph, and then applied a disjoint community detection algorithm. In 2010, Evans [9] extended the line graph approach to clique graphs, wherein cliques of a given order are represented as nodes in a weighted graph; the membership strength of a node i to community c is given by the fraction of cliques containing i which are assigned to c. In 2011, Kim and Jeong [7] proposed a modified version of the map equation method (also known as Infomap [8]) to detect link communities under the Minimum Description Length (MDL) principle.
In the present paper we study a combination of methods that has not previously been examined in the literature. The paper is organized as follows. In the section Materials and Methods we give a formal definition of the clustering problem, describe the proposed method, and describe the datasets and the compared methods that were used in the computational experiments. In the section Discussion we briefly discuss how to choose the input parameters and estimate the computational cost of the proposed method. The paper ends with the Conclusion section.

Problem statement
Let $G(V, E)$ be a graph with $n$ nodes $V = \{v_1, v_2, \ldots, v_n\}$ and $m$ edges $E \subseteq V \times V$. For a given natural number $k$, define a cover as a family of $k$ subsets of nodes $C = \{C_1, C_2, \ldots, C_k\}$, $C_i \subseteq V$, where each $C_i$ is called a cluster or community. The goal in community detection is to find a cover $C$ which best describes the community structure of the graph, in the sense that nodes within a cluster are more densely connected to each other than to nodes in other clusters. We can also associate with $C$ an affiliation matrix $F_C \in \mathbb{R}^{|V| \times |C|}$, where $F_{vc}$ corresponds to the degree of affiliation of vertex $v \in V$ with community $c \in C$. If we impose the following constraints
\[ \sum_{c \in C} F_{vc} = 1 \quad \forall v \in V, \tag{1} \]
\[ 0 \le F_{vc} \le 1 \quad \forall v \in V,\ c \in C, \tag{2} \]
then the values of the affiliation matrix are also known as belonging coefficients [10]. In the case of non-overlapping community detection, $C$ must be a partition of $V$ or, equivalently, constraint (2) is replaced by the binary constraint $F_{vc} \in \{0, 1\}$.

Proposed method
The proposed method is based on non-overlapping link partitioning. Thus, the task of overlapping community detection is reduced to the problem of finding non-overlapping communities in the set of edges. This corresponds to the problem of finding non-overlapping communities in the line graph $L(G)$, whose vertices correspond to the edges of the original graph $G$. Two vertices are connected by an edge in $L(G)$ if the corresponding edges in $G$ have a common node. In order to determine disjoint communities in the line graph $L(G)$ we build a distance matrix $D = (d_{ij}) \in \mathbb{R}^{m \times m}$ based on the structure of $L(G)$, and to do this we use a distance function on the nodes of a graph. For this purpose we tested two distance functions: the commute distance [11] and the amplified commute distance [12].
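The line graph construction described above can be sketched as follows. This is a minimal illustration for a simple undirected graph given as an edge list; the function name `line_graph` and the dictionary-of-sets representation are assumptions made for the example and are not taken from the authors' Java implementation.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph L(G) of an undirected simple graph G.

    `edges` is a list of 2-tuples of hashable node labels. Vertex j of L(G)
    stands for edges[j]; two vertices of L(G) are adjacent when the
    corresponding edges of G share an endpoint.
    """
    incident = {}  # node of G -> indices of the edges touching it
    for j, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(j)
        incident.setdefault(v, []).append(j)

    adjacency = {j: set() for j in range(len(edges))}
    for edge_ids in incident.values():
        # All edges meeting at one node of G are pairwise adjacent in L(G).
        for a, b in combinations(edge_ids, 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Path a-b-c-d: consecutive edges share a node, so L(G) is a path on 3 vertices.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
path_line_graph = line_graph(path_edges)
```

Each node of degree d in G contributes a clique on d vertices to L(G), so high-degree nodes make the line graph considerably denser than the original graph.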

Given that we seek $k$ overlapping communities in the original graph $G$, we compute a set $S = \{s_1, s_2, \ldots, s_k\}$ of vertices of $L(G)$ which can be considered the medians with respect to the distances in $D$, that is
\[ S = \arg\min_{S \subseteq V(L(G)),\ |S| = k} \ \sum_{j \in V(L(G))} \min_{c \in S} d_{jc}, \tag{3} \]
\[ x_{jc} = \begin{cases} 1, & \text{if } c = \arg\min_{s \in S} d_{js}, \\ 0, & \text{otherwise.} \end{cases} \tag{4} \]
Thus, $x_{jc}$ is an indicator variable which takes the value 1 when edge $j$ of the original graph $G$ belongs to cluster $c$ and 0 otherwise. Together, expressions (3) and (4) constitute the k-median problem, also known as the facility location problem, which is known to be NP-hard [13,14].
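To make the k-median step concrete, the following sketch solves it by exhaustive enumeration on a toy distance matrix. It only illustrates the objective (the sum of distances from every item to its nearest median); it is not the exact mixed integer model used in our experiments, and it is practical only for very small instances. The function name `k_median_exact` and the toy data are assumptions made for the example.

```python
from itertools import combinations


def k_median_exact(dist, k):
    """Exhaustively solve the k-median problem on a square distance matrix.

    Returns (medoids, labels, cost), where labels[j] is the medoid that
    item j is assigned to and cost is the sum of distances to the nearest
    medoid. The search enumerates all C(n, k) candidate medoid sets.
    """
    n = len(dist)
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[j][s] for s in medoids) for j in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Assign every item to its closest medoid (the indicator x_jc).
    labels = [min(best_medoids, key=lambda s: dist[j][s]) for j in range(n)]
    return list(best_medoids), labels, best_cost


# Toy instance: six points on a line forming two well separated groups.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels, cost = k_median_exact(toy_dist, 2)
# medoids == [1, 4], labels == [1, 1, 1, 4, 4, 4], cost == 4
```

In LPAM the items are the vertices of the line graph, so the labels returned here play the role of the edge-to-cluster assignment.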
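The matrix of belonging coefficients for the final covering is calculated as follows:
\[ F_{ic} = \begin{cases} \frac{1}{d_i} \sum_{j \in E(i)} x_{jc}, & \text{if } \frac{1}{d_i} \sum_{j \in E(i)} x_{jc} \ge \theta, \\ 0, & \text{otherwise,} \end{cases} \tag{5} \]
for every $i \in V(G)$ and $c \in S$, where $E(i)$ is the set of edges of $G$ incident to $i$, $\theta$ is a threshold parameter, and $d_i$ is the degree of vertex $i$ in the graph $G$. Thus, the belonging coefficient $F_{ic}$ of node $i$ to cluster $c$ is proportional to the number of edges incident to $i$ that belong to cluster $c$.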
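The step from an edge partition to an overlapping cover can be sketched as below. The sketch assumes that the belonging coefficient of a node to a cluster is the fraction of its incident edges assigned to that cluster, and that coefficients below the threshold θ are set to zero; the function names and the toy path graph are illustrative assumptions.

```python
def belonging_coefficients(edges, edge_labels, theta):
    """Node-to-cluster coefficients from a partition of the edges.

    edges[j] is a pair of nodes and edge_labels[j] the cluster of edge j.
    A coefficient is kept only when the fraction of incident edges in the
    cluster is at least theta (assumed thresholding rule).
    """
    degree = {}
    counts = {}  # node -> {cluster: number of incident edges in cluster}
    for (u, v), c in zip(edges, edge_labels):
        for node in (u, v):
            degree[node] = degree.get(node, 0) + 1
            per_node = counts.setdefault(node, {})
            per_node[c] = per_node.get(c, 0) + 1
    return {
        node: {
            c: cnt / degree[node]
            for c, cnt in per_node.items()
            if cnt / degree[node] >= theta
        }
        for node, per_node in counts.items()
    }


def cover_from_coefficients(coefficients):
    """Collect, for every cluster, the nodes with a non-zero coefficient."""
    cover = {}
    for node, per_node in coefficients.items():
        for c in per_node:
            cover.setdefault(c, set()).add(node)
    return cover


# Path a-b-c-d-e whose first two edges are in cluster 0 and last two in cluster 1.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
toy_labels = [0, 0, 1, 1]
cover_half = cover_from_coefficients(belonging_coefficients(toy_edges, toy_labels, 0.5))
cover_strict = cover_from_coefficients(belonging_coefficients(toy_edges, toy_labels, 0.6))
```

With θ = 0.5 the middle node c belongs to both clusters, while with θ = 0.6 it is left unassigned, which shows how the threshold controls the amount of overlap.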
In summary, in order to find $k$ overlapping communities in a graph $G$, our proposed method Link Partitioning Around Medoids (LPAM) consists of the following steps: (a) compute the distance matrix $D$ between each pair of vertices of the line graph $L(G)$, based on the commute distance or the amplified commute distance; (b) solve the k-median problem based on the distances in $D$ and compute the medians $S = \{s_1, s_2, \ldots, s_k\}$; (c) build a cover for the original graph $G$ based on the affiliation matrix $F_C$, which is constructed from $S$ and a threshold value $\theta$.

Distance functions
There are various options when it comes to choosing a distance function on the nodes of a graph. Intuitively, we would like a distance function that reflects the community structure, so that two vertices from the same cluster are closer to one another than two vertices that belong to different clusters. In this paper we employed two distance functions, namely the commute distance and the amplified commute distance.

Commute distance
The commute distance [15] is closely related to what is known in the literature as the resistance distance [16]. The resistance distance can be thought of as the effective resistance between two nodes of a graph, if we consider this graph to be an electrical circuit. It is defined as
\[ d_r(i,j) = \frac{K^{(i,j)}}{K^{(ij)}}, \]
where $K^{(ij)}$ is a first-order cofactor of the Kirchhoff matrix, and $K^{(i,j)}$ is a second order algebraic complement, that is, the determinant of the matrix obtained from the Kirchhoff matrix by deleting the two rows and the two columns $i$, $j$.
The commute distance $d_{cm}(i, j)$ and the resistance distance $d_r(i, j)$ are connected by the relation
\[ d_{cm}(i,j) = \mathrm{vol}(G)\, d_r(i,j), \]
where $\mathrm{vol}(G) = \sum_{v \in V(G)} d_v$ is the volume of the graph $G$, and $d_v$ is the degree of vertex $v$. The value of the commute distance $d_{cm}(v, w)$ between node $v$ and node $w$ of the graph $G$ can be interpreted as the expected number of steps that a random walk needs in order to reach node $w$ from $v$ and return. Intuitively, the commute distance seems like a good candidate for capturing the communities in a graph, in the sense that nodes within the same community should have a higher probability of reaching each other than nodes from different communities. The commute distance between two nodes decreases as the number of paths connecting them grows, and one should expect pairs of nodes within the same community to be connected by more paths than pairs from different communities. However, the commute distance is theoretically flawed when it comes to large graphs. When the graph becomes sufficiently large, the expected time to reach one node from another depends essentially only on the degree of the destination node, as was proven in [12]. The authors of [12] call this effect "lost in space". In order to overcome this drawback, in the same paper the authors proposed the amplified commute distance as a possible improvement.
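As a worked illustration, the sketch below computes resistance and commute distances for a small connected, unweighted simple graph. Instead of evaluating the determinant ratio directly, it uses the equivalent grounded-Laplacian approach: one node is fixed as a reference, the Kirchhoff matrix without that node's row and column is inverted, and the resistance is read from the inverse. The function names and the use of exact fractions are choices made for the example, not part of the authors' implementation.

```python
from fractions import Fraction


def resistance_distances(n, edges):
    """Resistance distance between all pairs of nodes 0..n-1.

    Assumes a connected, unweighted simple graph. Node n-1 is grounded;
    with M the inverse of the reduced Kirchhoff matrix,
    R_ij = M_ii + M_jj - 2 M_ij, and entries involving node n-1 are zero.
    """
    size = n - 1
    lap = [[Fraction(0)] * size for _ in range(size)]
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a < size:
                lap[a][a] += 1
                if b < size:
                    lap[a][b] -= 1
    # Gauss-Jordan inversion of the reduced matrix in exact arithmetic.
    aug = [row + [Fraction(int(i == j)) for j in range(size)]
           for i, row in enumerate(lap)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    inv = [row[size:] for row in aug]

    def entry(i, j):
        return inv[i][j] if i < size and j < size else Fraction(0)

    return [[entry(i, i) + entry(j, j) - 2 * entry(i, j) for j in range(n)]
            for i in range(n)]


def commute_distances(n, edges):
    """Commute distance: vol(G) times the resistance distance."""
    volume = 2 * len(edges)  # sum of degrees of a simple graph
    return [[volume * r for r in row] for row in resistance_distances(n, edges)]


# Path 0-1-2: resistance 2 between the endpoints, commute distance 4 * 2 = 8.
path_resistance = resistance_distances(3, [(0, 1), (1, 2)])
path_commute = commute_distances(3, [(0, 1), (1, 2)])
```

The value 8 for the endpoints of the path agrees with the random walk interpretation: the walk needs 4 expected steps to go from one end to the other and 4 more to return. In LPAM the same computation would be applied to the line graph rather than to the original graph.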

Amplified commute distance
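The amplified commute distance can be expressed as
\[ d_{amp}(i,j) = \begin{cases} d_r(i,j) - \frac{1}{d_i} - \frac{1}{d_j} + \frac{2 w_{ij}}{d_i d_j} - \frac{w_{ii}}{d_i^2} - \frac{w_{jj}}{d_j^2}, & i \ne j, \\ 0, & i = j, \end{cases} \]
where $w_{ij}$ is the weight of the edge between $i$ and $j$ (zero if the nodes are not adjacent) and $d_i$ is the degree of node $i$. The purpose of the negative terms is to reduce the influence of the edges adjacent to $i$ and $j$, which completely dominate the behavior of the resistance distance. The term amplify is intended to emphasize the general role of the first term. Like the original commute distance, the amplified commute distance is Euclidean [12].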
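A minimal sketch of the correction follows, assuming the form of the amplified commute distance given in [12] specialized to unweighted simple graphs, where there are no self-loops and an edge has weight 1: the resistance distance minus $1/d_i$ and $1/d_j$, plus $2 a_{ij}/(d_i d_j)$ with $a_{ij} = 1$ for adjacent nodes. The function name and the precomputed resistance matrix passed as input are assumptions made for the example.

```python
from fractions import Fraction


def amplified_commute(resistance, edges):
    """Amplified commute distance for an unweighted simple graph.

    `resistance` is the matrix of resistance distances between nodes
    0..n-1 and `edges` the edge list of the same graph. Every node is
    assumed to have at least one incident edge.
    """
    n = len(resistance)
    degree = [0] * n
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))

    amplified = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the distance of a node to itself stays zero
            a_ij = 1 if (i, j) in adjacent else 0
            amplified[i][j] = (
                resistance[i][j]
                - Fraction(1, degree[i])
                - Fraction(1, degree[j])
                + Fraction(2 * a_ij, degree[i] * degree[j])
            )
    return amplified


# Triangle: the resistance between any two nodes is 2/3 (1 in parallel with 2).
two_thirds = Fraction(2, 3)
triangle_resistance = [
    [0, two_thirds, two_thirds],
    [two_thirds, 0, two_thirds],
    [two_thirds, two_thirds, 0],
]
triangle_amplified = amplified_commute(triangle_resistance, [(0, 1), (1, 2), (0, 2)])
# Each off-diagonal entry is 2/3 - 1/2 - 1/2 + 2/4 = 1/6.
```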

Benchmarks and evaluation
To evaluate the quality of the tested algorithms we employ an implementation of the Normalized Mutual Information measure for sets of overlapping clusters (ONMI) [17]. We used it to measure the difference between the covering produced by the examined algorithm and the known ground truth. In the recent literature, ONMI values have become one of the most widely used measures to calculate the difference between two coverings. Given that many papers in overlapping clustering (e.g. [4,18,19]) include the ONMI values for comparison purposes with benchmark graph instances with known ground truth, it enables us to compare our proposed method without necessarily implementing the other methods, given that we use the same benchmark instance set.

An example
In order to help readers gain some intuition about the kind of covering to expect from the proposed method, we created a pedagogical example, which is illustrated in Figure 5. Given the regular 8 × 8 lattice, which naturally does not contain any community structure, we applied our method with k = 4 overlapping communities. As can be seen in Figure 5, our method, which produced the same results for both the commute and the amplified commute distance, identifies four equal overlapping communities such that each community overlaps with exactly two others. The medoids on the line graph are represented by big circles, while the communities in the lattice are identified by colors and the corresponding medoids by bold edges. Similarly, if we choose k = 2 we get two equal overlapping communities.

Compared Methods
Although the performance of the proposed method could in principle be compared with other published methods based solely on the ONMI value, for the sake of consistency we used the publicly available implementations of the following three overlapping community detection methods.

• Greedy Clique Expansion method (GCE) [20]. A method for detecting highly overlapping community structure by greedy clique expansion.
• OSLOM [21]. This method uses a measure of the statistical significance of a cluster. The algorithm optimizes this measure by adding new vertices to the cluster or deleting them. The method is able to detect overlapping clusters as well as to build cluster hierarchies.
• COPRA [22]. An iterative method based on the idea of multi-label propagation, with computational complexity close to linear.

Datasets
As real-world datasets we used the following four well-known network benchmarks with known ground truth.
• Zachary's karate club: social network of friendships between 34 members of a karate club at a US university in the 1970s [23]. confederations [25].
Moreover, we constructed a set of synthetic networks with known ground truth, named with the prefix bench, using the instance generation tool based on the algorithm published in [26]. The parameters that were used for generating these networks can be found in the flags.dat file in the corresponding dataset directory in our git repository. We also used the recently published FARZ network generator [27]. All the relevant information for the above-mentioned benchmark networks can be seen in Table 1.

Implementation
The main code of the LPAM algorithm is implemented in Java and is accompanied by a set of Jupyter notebooks with Python scripts for running the experiments, as well as the implementations of the compared methods (OSLOM, GCE, COPRA).
Recall that the LPAM method requires the solution of a k-median problem. For large graphs we solve the k-median problem using an implementation of the CLARANS heuristic [28] from the Smile library [29]. For comparison we also solved the benchmark problems with an exact solution of the k-median problem, by employing an efficient mixed integer linear programming model by Goldengorin [30]. The exact solutions of this model were found using the publicly available lp_solve solver.
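To give an idea of how a CLARANS-style randomized search approaches the k-median problem, a simplified sketch is shown below. It is not the Smile implementation used in our experiments: the function names, the parameter names `num_local` and `max_neighbor`, and their default values are assumptions made for the example. The search repeatedly tries to swap one medoid with one non-medoid, accepts the swap when the cost decreases, and declares a local optimum after `max_neighbor` consecutive failed attempts.

```python
import random


def total_cost(dist, medoids):
    """Sum of distances from every item to its nearest medoid."""
    return sum(min(dist[j][s] for s in medoids) for j in range(len(dist)))


def clarans(dist, k, num_local=5, max_neighbor=50, seed=0):
    """Simplified CLARANS-style randomized search for k medoids."""
    n = len(dist)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_local):  # independent restarts
        current = rng.sample(range(n), k)
        current_cost = total_cost(dist, current)
        failures = 0
        while failures < max_neighbor:
            # A neighbor differs from the current solution in one medoid.
            out_pos = rng.randrange(k)
            new_medoid = rng.choice([j for j in range(n) if j not in current])
            neighbor = current[:out_pos] + [new_medoid] + current[out_pos + 1:]
            cost = total_cost(dist, neighbor)
            if cost < current_cost:
                current, current_cost, failures = neighbor, cost, 0
            else:
                failures += 1
        if current_cost < best_cost:
            best_medoids, best_cost = sorted(current), current_cost
    return best_medoids, best_cost


# Six points on a line forming two well separated groups.
points = [0, 1, 2, 10, 11, 12]
line_dist = [[abs(a - b) for b in points] for a in points]
heuristic_medoids, heuristic_cost = clarans(line_dist, 2)
```

Unlike an exact solver, the search only guarantees a solution that none of the sampled swaps could improve, so its cost is an upper bound on the optimal k-median cost.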
All the source code, including all Jupyter notebooks and data sets, is publicly available on GitHub at https://github.com/aponom84/lpam-clustering.

Results
We have implemented four versions of the LPAM method, given that we solve the k-median problem both exactly and with a heuristic, and that we use both the commute distance and the amplified commute distance. For each method tested, the ONMI value was calculated between the computed clustering and the ground truth. For non-randomized methods we selected the best ONMI values over the corresponding parameters. For randomized methods we ran a sequence of experiments for 10 randomly selected values of the random seed parameter while fixing the other parameters, collected the ONMI values and took the average (see S6 Appendix). The summary of the results with the ONMI values for each method is presented in Table 2.
As can be seen from Table 2, no method dominates the rest with respect to proximity to the ground truth covering on all data sets. We can see, however, a superiority of the OSLOM method and of LPAM with the amplified commute distance. The LPAM method with the amplified commute distance gets the best results for the Politics Book graph and for the four synthetic networks: bench 30, bench 40 and bench 60. Also, in our experiments the same combination gives the second best result for the School Friendship, Karate club, bench 60 and bench 60 dense networks. An example of the method output for the School Friendship instance can be seen in Figure 2, and the associated line graph in Figure 3.
We should also note that for the regular lattice example the LPAM method naturally produces a covering that matches the intuitive separation, in contrast to the GCE method, which does not produce any result for this example, and the OSLOM method, which gives a good result only accidentally, for an appropriate choice of the input parameters.
The ONMI values with an asterisk in Table 2 correspond to cases where the LPAM method with the heuristic solution of the k-median problem produced a slightly better result than the LPAM method with the exact solution. Apparently this happens because the ground truth community structure sometimes does not match the neighborhoods of the medoids. Thus, failing to identify the global minimum of the k-median problem may lead to solutions which are closer to the ground truth. Moreover, the ground truth may not necessarily correspond to the true community structure, given that we do not have an exact definition of a community in a network.

Tuning parameters
The behavior of the ONMI value as a function of the threshold parameter θ for the LPAM method is shown in Figure 4. As can be seen, in most cases the maximum ONMI value is reached when the threshold value θ lies between 0.3 and 0.6. This can be attributed to the fact that, with respect to proximity to the ground truth covering, it is usually better to assign a vertex either to one cluster or to no cluster rather than to several clusters. The final coverings for the four combinations of the exact and heuristic versions of the LPAM method with the commute and amplified commute distances, for all datasets, are presented in S1 Appendix, S2 Appendix, S3 Appendix and S4 Appendix. The resulting pictures for the best ONMI values and the full study of the dependence of the ONMI value on the input parameters for the GCE, OSLOM and COPRA methods can be found in S5 Appendix, S6 Appendix and S7 Appendix, respectively.

Computational Complexity
Aside from the solution of the k-median problem which is NP-Hard, the computational complexity of the method is the following. To build line graph L(G(V, E)) we need Θ(|E|) time. Computing the distance matrix depends on the type

Conclusion
In this paper we propose a new method for the detection of overlapping communities in networks with a predefined number of clusters. The proposed method is based on finding disjoint communities in the line graph of the original network by partitioning around medoids. The resulting link partitioning naturally produces an overlapping community structure for the original graph. The link partitioning is done using the commute distance and its variation, the amplified commute distance, which produced more accurate results in our experiments.
Experimental results on a set of well-known benchmark instances, as well as on artificially generated instances with known ground truth, demonstrate that the proposed method has competitive performance with respect to existing methods in the literature, which provides a motivation to further improve the method.