Overlapping community detection in networks based on link partitioning and partitioning around medoids

Abstract

In this paper, we present a new method, called LPAM (Link Partitioning Around Medoids), for detecting overlapping communities in networks with a predefined number of clusters. The overlapping communities in the graph are obtained by detecting disjoint communities in the associated line graph by means of link partitioning and partitioning around medoids, both of which rely on a distance function defined on the set of nodes. We consider both the commute distance and the amplified commute distance as distance functions. The performance of the LPAM method is evaluated with computational experiments on real-life instances as well as synthetic network benchmarks. For small and medium-size networks, the exact solution was found, while for large networks we found solutions with a heuristic version of the LPAM method.

Introduction

Detection of overlapping communities in a network is the task of grouping the nodes of the network into a family of subsets called clusters so that each cluster contains nodes that are similar with respect to the overall network structure. Overlapping means that one node can belong to multiple clusters, in contrast to disjoint community detection where the clusters form a partition of the set of nodes. To this day, there is no widely accepted formal definition for the notion of community in a network. This leads to different community definitions and allows for the existence of a variety of graph clustering methods that can only be compared based on their computational complexity and the empirical evaluation of their proposed communities. A common approach to formalizing the notion of community in a network is through the use of quality functions that attempt to quantify the degree of the community structure captured by a given partition of the nodes. That is, a quality function will, in principle, attain extreme values for the clustering of the nodes that best reflects the community structure of a graph. Given such a quality function, community detection translates into an optimization problem. Modularity [1] is one of the best-known quality functions. It has been used by many methods that solve the related optimization problem with varying success. However, it is still an open question what the properties of a good quality function are [2].

Community detection in networks is an actively developing area connected to many fields of science that need tools for complex network analysis, including molecular biology, sociology, data mining, and unsupervised machine learning. Network clustering methods can be classified according to the approaches on which they are based.

There is a plethora of different methods and approaches for overlapping community detection in graphs, which can be partially attributed to the absence of a well defined and widely accepted quality function for overlapping communities, as is the case with non-overlapping community detection. An attempt to axiomatize quality functions for non-overlapping graph clustering, in the form of intuitive properties that any such function should satisfy, is presented in [2]. The authors, driven by similar results in distance-based clustering, propose six such properties. For instance, the value of a clustering quality function should not decrease if, for a given clustering, we add edges between nodes in the same clusters. Moreover, they showed that modularity does not satisfy some of these properties.

In a more recent work [3], the authors compiled a survey of the currently known families of quality functions for both non-overlapping and overlapping graph clustering. The authors also present computational experiments on a set of benchmark instances with known community structure, then compare the quality functions in terms of performance in identifying the communities. The most recent overview and classification of the state-of-the-art methods for overlapping community detection, as well as a computational comparison of existing solutions and benchmark instance evaluation, can be found in [4]. In the paper, the authors present 14 different algorithms and propose a unified framework for testing them.

One approach to overlapping community detection is link partitioning, also known as link community identification, which involves splitting the set of edges instead of partitioning the nodes. It is based on the idea that the relations between nodes define the community structure, not the nodes themselves. In the case of link partitioning, a node belongs to a community if it has adjacent edges that belong to that community. For example, a person may play soccer with a group of playmates on weekends and go to work with coworkers on weekdays. Given that a coworker can also be a playmate, we have overlapping communities. That person has two types of relations with other people: “plays soccer with” and “works with.” Thus, the person belongs to two communities, “soccer players” and “colleagues,” and we can consider this person an overlapping node.

Even though link partitioning for overlapping community detection seems very natural, the methods that exploit it appeared relatively late. In 2009 and, later, in 2010, Evans and Lambiotte [5, 6] were the first to perform node partitioning of a line graph to obtain an edge partition of the original graph. They projected the network onto a weighted line graph whose nodes are the links of the original graph and then applied one of the disjoint community detection algorithms. In 2011, Kim and Jeong [7] proposed a modified version of the map equation method (also known as Infomap [8]) to detect link communities under the Minimum Description Length (MDL) principle. Also, in 2010, Evans [9] extended the line graph approach to clique graphs, wherein cliques of a given order are represented as nodes in a weighted graph.

The spectrum of the graph Laplacian has also been exploited in some papers to reveal overlapping communities. Selecting the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix allows embedding each node into a k-dimensional space, with the expectation that nodes from the same cluster will have a small distance to one another relative to nodes outside the cluster. To apply the spectral approach to revealing the overlapping community structure, the authors in [10] suggest using the K-medians algorithm, instead of the regular K-means, for clustering in the k-dimensional spectral domain. A Gaussian Mixture Model in the spectral domain was proposed in [11], and a fuzzy c-means algorithm was used in [12] to obtain a soft assignment.
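As an illustration of this spectral embedding step (a minimal NumPy/NetworkX sketch, not taken from any of the cited papers), the sign pattern of the second Laplacian eigenvector already separates two weakly connected cliques:

```python
import numpy as np
import networkx as nx

def spectral_embedding(G, k):
    """Embed each node into R^k using the eigenvectors of the graph
    Laplacian corresponding to its k smallest eigenvalues."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns ascending eigenvalues
    return eigvecs[:, :k]

# Two triangles joined by a single edge: the second eigenvector
# (the Fiedler vector) separates the two natural clusters by sign.
G = nx.barbell_graph(3, 0)
X = spectral_embedding(G, 2)
fiedler = X[:, 1]
signs = np.sign(fiedler)
```

Clustering the rows of `X` with K-means, K-medians, or a soft assignment then yields the partitions or covers discussed above.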

In this paper, we present a novel approach to overlapping community detection. The paper is organized as follows. In Materials and methods, we give a formal definition of the clustering problem and describe the proposed method, the datasets, and the methods compared in the computational experiments. We briefly discuss how to choose the input parameters and estimate the computational cost of the proposed method in the Discussion section. Traditionally, we finish the paper with the Conclusion section.

Materials and methods

Problem statement

Let G = (V, E) be a graph with n nodes V = {v1, v2, …, vn} and m edges, E ⊆ V × V. For a given natural number k, define a cover C = {C1, C2, …, Ck} as a family of k subsets of nodes, where each Ci is called a cluster or community. The goal in community detection is to find a cover which best describes the community structure of the graph, in the sense that nodes within clusters are more densely connected to one another than to nodes of other clusters. We can also associate with C an affiliation matrix F, where F_{vc} corresponds to the degree of affiliation of vertex v with community c. If we impose the following constraints

(1) \sum_{c=1}^{k} F_{vc} = 1 \quad \forall v \in V,

(2) 0 \le F_{vc} \le 1,

then the values of the affiliation matrix are also known as belonging coefficients [13]. In the case of non-overlapping community detection, C must be a partition of V, or, equivalently, Eq (2) is replaced by the binary constraint F_{vc} ∈ {0, 1}. The proposed method is based on non-overlapping link partitioning. Thus, the task of overlapping community detection is reduced to the problem of finding non-overlapping communities in the set of edges, which is equivalent to finding non-overlapping communities in the line graph L(G), whose vertices correspond to the edges of the original graph G; two vertices are connected by an edge in L(G) if the corresponding edges in G share a node. In order to determine disjoint communities in the line graph L(G), we build a distance matrix D based on the structure of L(G), and for doing this we utilize a distance function on the nodes of a graph. For this purpose we tested two distance functions: the commute distance [14] and the amplified commute distance [15]. Given that we seek k overlapping communities in the original graph G, we compute a set S = {s1, s2, …, sk} of vertices from L(G) which can be considered the medians with respect to the distances in D, that is,

(3) S = \arg\min_{T \subseteq E(G),\, |T| = k} \sum_{j \in E(G)} \min_{s \in T} D(j, s),

(4) x_{jc} = 1 \text{ if } s_c = \arg\min_{s \in S} D(j, s), \text{ and } x_{jc} = 0 \text{ otherwise}.

Thus, x_{jc} is an indicator variable which takes the value 1 when edge j of the original graph G belongs to cluster c, and 0 otherwise. The arg min in Eq (3) runs over all possible subsets T of E(G) of size k. Together, Eqs (3) and (4) constitute the k-median problem, also known as the facility location problem, which is known to be NP-hard [16, 17].
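For very small instances, the k-median problem of Eqs (3) and (4) can be solved by brute force; the following sketch (a hypothetical toy example on a hand-made distance matrix, not the MILP model used in the paper) makes the objective explicit:

```python
from itertools import combinations

def k_median_exact(D, k):
    """Exhaustively solve the k-median problem for a small symmetric
    distance matrix D: pick the k medians S minimising the sum, over
    all points, of the distance to the nearest median (Eq 3); each
    point is then assigned to its nearest median (Eq 4)."""
    n = len(D)
    best_S, best_cost = None, float("inf")
    for S in combinations(range(n), k):
        cost = sum(min(D[j][s] for s in S) for j in range(n))
        if cost < best_cost:
            best_S, best_cost = S, cost
    # x[j] = index of the median (cluster) that point j is assigned to
    x = [min(best_S, key=lambda s: D[j][s]) for j in range(n)]
    return best_S, x, best_cost

# Toy distance matrix with two obvious groups {0, 1, 2} and {3, 4}
D = [[0, 1, 1, 9, 9],
     [1, 0, 1, 9, 9],
     [1, 1, 0, 9, 9],
     [9, 9, 9, 0, 1],
     [9, 9, 9, 1, 0]]
S, x, cost = k_median_exact(D, 2)
```

The enumeration over all subsets of size k is exactly what makes the problem hard; the paper's exact version replaces it with a mixed-integer linear program, and the heuristic version with CLARANS.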

The matrix of belonging coefficients for the final cover is calculated as follows:

(5) F_{ic} = \frac{1}{d_i} \sum_{j \in E(G):\, i \in j} x_{jc}, \quad \text{with node } i \text{ assigned to community } c \text{ whenever } F_{ic} \ge \theta,

for every i ∈ V(G) and c ∈ S, where θ is a threshold parameter and d_i is the degree of vertex i in the graph G. Thus, the belonging coefficient F_{ic} of node i to cluster c is proportional to the number of its adjacent edges belonging to cluster c. As the value of θ decreases, the degree of overlap between the communities increases.
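This thresholding step can be sketched as follows; we assume here the reading of Eq (5) under which node i is affiliated with community c when at least a fraction θ of its incident edges belongs to c:

```python
import networkx as nx

def belonging(G, edge_cluster, theta=0.5):
    """Compute node-community affiliations from an edge partition:
    node i joins community c when the fraction of its incident edges
    assigned to c is at least theta (our reading of Eq 5)."""
    cover = {}
    for i in G.nodes():
        d = G.degree(i)
        counts = {}
        for e in G.edges(i):  # edges incident to i, as (i, neighbour)
            c = edge_cluster[frozenset(e)]
            counts[c] = counts.get(c, 0) + 1
        cover[i] = {c for c, cnt in counts.items() if cnt / d >= theta}
    return cover

# Triangle 0-1-2 plus triangle 2-3-4: each triangle is one edge
# community; node 2 has half its edges in each, so it overlaps.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)])
ec = {frozenset(e): 0 for e in [(0, 1), (1, 2), (0, 2)]}
ec.update({frozenset(e): 1 for e in [(2, 3), (3, 4), (2, 4)]})
cover = belonging(G, ec, theta=0.5)
```

With θ = 0.5, a node can end up in at most two communities, which matches the behaviour described for the School Friendship example later in the paper.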

In summary, in order to find k overlapping communities in a graph G, our proposed method, Link Partitioning Around Medoids (LPAM), consists of the following steps:

  1. Build the line graph L(G)
  2. Find k disjoint communities in L(G):
    1. Compute the distance matrix D between each pair of nodes based on commute distance or amplified commute distance
    2. Solve the k-median problem based on the distances in D and compute the medians S = {s1, s2, …, sk}
  3. Build a cover for the original graph G based on the affiliation matrix which is constructed from S and a threshold value θ.

The value of the parameter θ should be chosen according to the application. In our experiments, we have found that in most cases a threshold value of 0.5 for θ produces the best result. The main reason for this is that most networks with known cluster assignment have just a few nodes belonging to more than two communities. When θ is large, the algorithm tends to assign nodes to a single cluster. Conversely, with small values of θ, nodes are assigned to many clusters.

It must be noted that the LPAM method is applicable only to unweighted graphs. Since the method works with a line graph where the nodes are the edges of the original graph, it does not take into account the edge weights of the original graph.

Distance functions

There are various options when it comes to choosing a distance function on the nodes of a graph. Intuitively, a proper distance function should reflect the community structure, so that vertices from the same cluster are at a shorter distance from one another than vertices from different clusters. In this paper, we employed two distance functions, namely the commute distance and the amplified commute distance.

Commute distance.

Commute distance [18] is closely related to resistance distance [19]. The resistance distance can be thought of as the effective resistance between two nodes in a graph if we consider this graph to be an electrical circuit. It is defined as

(6) d_r(i, j) = \frac{K(i, j)}{K(ij)},

where K(ij) is a minor of the Kirchhoff matrix, and K(i, j) is a second-order algebraic complement, that is, the determinant of the matrix obtained from the Kirchhoff matrix by deleting the two rows and two columns i, j. Commute distance d_cm(i, j) and resistance distance d_r(i, j) are related as follows:

(7) d_{cm}(i, j) = \mathrm{vol}(G) \cdot d_r(i, j),

where vol(G) = ∑_{v ∈ V(G)} d_v is the volume of the graph G, and d_v is the degree of vertex v. The value of the commute distance d_cm(v, w) between nodes v and w of the graph G can be interpreted as the expected number of steps that a random walker needs to take to reach node w from v and return. Intuitively, commute distance seems like a good candidate for capturing the community structure of a graph, in the sense that nodes within the same community should have a higher probability of reaching one another than nodes from different communities. The commute distance between two nodes decreases as the number of paths between them grows, and one should expect pairs of nodes within the same community to be connected by more paths than pairs from different communities. However, commute distance is theoretically flawed when it comes to large graphs. In [15], the authors proved that when the size of the graph becomes sufficiently large, the probability of reaching a node from another node becomes dependent only on the degree of the destination node. The authors called this effect getting lost in space. To overcome this drawback, the authors proposed the amplified commute distance as a possible improvement in the same paper.
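In practice the commute distance is convenient to compute from the Moore-Penrose pseudoinverse L+ of the Laplacian, via d_r(i, j) = L+_ii + L+_jj − 2 L+_ij together with Eq (7). A quick sanity check on a triangle, where the effective resistance between adjacent nodes is 1 Ω in parallel with 2 Ω, i.e. 2/3:

```python
import numpy as np
import networkx as nx

G = nx.cycle_graph(3)  # a triangle
L = nx.laplacian_matrix(G).toarray().astype(float)
Lp = np.linalg.pinv(L)                    # Laplacian pseudoinverse
vol = 2 * G.number_of_edges()             # sum of degrees = 6

d_r = Lp[0, 0] + Lp[1, 1] - 2 * Lp[0, 1]  # effective resistance = 2/3
d_cm = vol * d_r                          # commute distance (Eq 7) = 4
```

The value d_cm = 4 agrees with the random-walk interpretation: on a triangle, a walker starting at v needs on average four steps to visit an adjacent node w and come back.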

Amplified commute distance.

Amplified commute distance can be expressed as

(8) d_{amp}(i, j) = d_r(i, j) - \frac{1}{d_i} - \frac{1}{d_j} + \frac{2 w_{ij}}{d_i d_j},

where w_{ij} = 1 if nodes i and j are adjacent and 0 otherwise; the purpose of the negative terms is to reduce the influence of the edges adjacent to i and j, which completely dominate the behavior of the resistance distance. The term amplified is intended to emphasize the leading role of the first term. Like the original commute distance, the amplified commute distance is Euclidean [15].

Benchmarks and evaluation

To evaluate the quality of the tested algorithms, we employ an implementation of the Normalized Mutual Information measure for sets of overlapping clusters (ONMI) [20]. We used it to measure the difference between the covering produced by the examined algorithm and the known labels. Here we need to mention that for synthetic networks the known labels can be considered a ground truth. However, in the case of real-world networks we cannot know the ground truth; we only know the node attributes. We point the reader to the paper [21], where this question is treated in more detail.

In recent literature, ONMI has become one of the most widely used measures for calculating the difference between two coverings. Since many papers on overlapping clustering (e.g. [4, 22, 23]) include ONMI values for comparison purposes on benchmark graph instances with known cluster assignment, we can evaluate our proposed method without necessarily implementing other methods, as long as we use the same benchmark instance set. In addition to ONMI, we used the F1 score and the omega index as scoring functions for measuring the similarity between the partitioning produced by a method and the known cluster assignment.

Lattice 8×8 example.

To help readers gain some intuition about the proposed method, we created a pedagogical example illustrated in Fig 1. Given a regular 8×8 lattice, which naturally does not contain any community structure, we applied our method for k = 4 overlapping communities. As can be seen in Fig 1, our method, which produced the same results for both commute and amplified commute distance, identifies four equal and overlapping communities in such a way that each community overlaps with exactly two others. The medoids on the line graph are represented by large circles, while the communities in the lattice are identified with colors and the corresponding medoids with bold edges. Similarly, if we choose k = 2, we get two equal overlapping communities.
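The example is easy to reproduce: the 8×8 lattice has 64 nodes and 2 · 8 · 7 = 112 edges, so the line graph on which the medoids are sought has 112 nodes:

```python
import networkx as nx

# The 8x8 lattice from Fig 1 and its line graph.
G = nx.grid_2d_graph(8, 8)      # 64 nodes, 112 edges
LG = nx.line_graph(G)           # one node per edge of G

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
n_line = LG.number_of_nodes()
```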

thumbnail
Fig 1. The 8×8 lattice on the right and its line graph on the left.

4 clusters are highlighted by the rectangles with corresponding colours. The medoids on the line graph are presented by big circles, while in the original graph medoids are identified by bold edges.

https://doi.org/10.1371/journal.pone.0255717.g001

Compared methods.

Although the performance of the proposed method could in principle be compared with other published methods based solely on the ONMI value, for the sake of consistency we used the publicly available implementations of the following three overlapping community detection methods.

Greedy clique extension [24]. The Greedy Clique Extension algorithm (GCE) can be considered a heuristic for the optimization problem of finding community structure according to the Lancichinetti community quality function F_S [25]:

(9) F_S = \frac{k_{in}^S}{\left(k_{in}^S + k_{out}^S\right)^{\alpha}},

where k_{in}^S is the internal degree and k_{out}^S the external degree of the node subset S, and α is a resolution parameter. The algorithm has four stages.

  1. The set of maximal cliques is obtained by using a heuristic algorithm. The cliques of size at least r are considered seeds;
  2. Each seed is expanded by a greedy algorithm according to the quality function F_S. For a given seed, the expansion process continues for as long as the quality function F_S increases;
  3. The expanded seeds are merged with the help of a symmetric distance function, defined for a pair of communities S and S′ as

     (10) \delta_E(S, S') = 1 - \frac{|S \cap S'|}{\min(|S|, |S'|)}.

     Communities S and S′ are merged if the distance δE(S, S′) is less than a threshold ϵ. The authors recommend using the values r = 4 and ϵ = 0.25. Overlap appears naturally because one vertex can be covered by several seed extensions;
  4. Calculate the final community covering.
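The merge criterion of stage 3 can be sketched directly from Eq (10), as we read it:

```python
def delta_E(S, S2):
    """Symmetric merge distance between two communities:
    1 - |S intersect S'| / min(|S|, |S'|)."""
    S, S2 = set(S), set(S2)
    return 1 - len(S & S2) / min(len(S), len(S2))

# Communities are merged when delta_E < eps (the authors suggest 0.25):
eps = 0.25
a, b = {1, 2, 3, 4}, {2, 3, 4, 5, 6}
merge = delta_E(a, b) < eps   # 1 - 3/4 = 0.25, so these are not merged
```

Note that a community fully contained in another is at distance 0 and is therefore always absorbed.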

Order Statistics Local Optimization Method (OSLOM) [26]. The main feature of OSLOM is that it uses the statistical significance of a community as its fitness function. In turn, a statistically significant community is defined with the help of the configuration model [27] as a null hypothesis. Similar to GCE, OSLOM is based on optimizing the fitness function by local search, via adding or removing vertices from the cluster. At the final stage, OSLOM builds a hierarchical clustering structure. Every cluster is considered a single vertex. New vertices are connected if the corresponding clusters have common edges. The weights of the new edges are assigned proportionally to the number of edges between the original clusters.

Summarizing, the algorithm has the following steps:

  1. Find clusters via local search to maximize the fitness function. Repeat until convergence;
  2. Unite or split clusters based on their internal structure;
  3. Consider clusters as vertices. Build a hierarchical structure of clusters iteratively.

COPRA [28]. This is an iterative method based on the idea of multi-label propagation, with computational complexity close to linear. It extends the label propagation algorithm (LPA) [29] with the ability for every node to carry multiple labels. One of the drawbacks of COPRA is that a node can belong to at most a fixed number of communities v, which is a parameter of the algorithm. To avoid this problem, the BMLPA method was proposed [30]; however, its authors do not provide an implementation, which makes it hard to compare against this method. The non-deterministic nature of the COPRA algorithm, with its high variance, is another drawback that makes the results hard to interpret. The randomness mainly comes from two factors. The first is the random assignment of labels at the initial stage. The second is part of the label propagation process: if multiple labels have the same maximum belonging coefficient below the threshold, COPRA retains one of them, chosen at random.

For consistency, we briefly introduce below the remaining methods involved in the comparison.

PercoMVC. The PercoMVC approach consists of two steps [31]. In the first step, the algorithm attempts to determine all communities that the clique percolation algorithm may find. In the second step, the algorithm applies eigenvector centrality to the output of the first step to measure the influence of network nodes and reduce the rate of unclassified nodes.

DANMF. The procedure uses telescopic non-negative matrix factorization in order to learn a cluster membership distribution over nodes. The method can be used in an overlapping and non-overlapping way [32].

SLPA. An overlapping community discovery method that extends the LPA [33]. SLPA consists of three stages: 1) initialization, 2) evolution, and 3) post-processing.

Egonet splitter. The method first creates the ego-nets of the nodes. A persona graph is then created, which is clustered by the Louvain method [34].

Demon. A node-centric, bottom-up overlapping community discovery algorithm. It leverages ego-network structures and overlapping label propagation to identify micro-scale communities that are subsequently merged into mesoscale ones [35].

k-clique [36]. Finds k-clique communities in a graph using the percolation method. A k-clique community is the union of all cliques of size k that can be reached through adjacent k-cliques, i.e., k-cliques sharing k − 1 nodes.
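NetworkX ships an implementation of clique percolation; on two triangles sharing an edge, the 3-cliques are adjacent (they share k − 1 = 2 nodes) and percolate into a single community:

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Two triangles sharing the edge (1, 2): the 3-cliques {0,1,2} and
# {1,2,3} share two nodes, so percolation merges them.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)])
comms = [set(c) for c in k_clique_communities(G, 3)]
```

Nodes that belong to no k-clique are left out of every community, which is one source of partial covers for this method.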

LAIS2. An overlapping community discovery algorithm based on a density function, where the density of a group is defined as the average density of the communication exchanges between the actors of the group. LAIS2 is composed of two procedures, LA (Link Aggregate Algorithm) and IS2 (Iterative Scan Algorithm) [37].

Angel. A node-centric, bottom-up community discovery algorithm. It leverages ego-network structures and overlapping label propagation to identify micro-scale communities that are subsequently merged into mesoscale ones. Angel is the faster successor of Demon [38].

The Leiden algorithm is an improvement of the Louvain algorithm [39]. The Leiden algorithm consists of three phases:

  1. local moving of nodes;
  2. refinement of the partition;
  3. aggregation of the network based on the refined partition, using the non-refined partition to create an initial partition for the aggregate network.

The Label Propagation algorithm (LPA) detects communities using network structure alone [40]. The algorithm doesn’t require a pre-defined objective function or prior information about the communities. It works as follows:

  1. Every node is initialized with a unique label (an identifier)
  2. These labels propagate through the network
  3. At every iteration of propagation, each node updates its label to the one shared by the maximum number of its neighbours. Ties are broken uniformly at random.
  4. LPA reaches convergence when each node has the majority label of its neighbours.
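A minimal run of label propagation, using the NetworkX implementation; whatever labels the random tie-breaking settles on, the output is always a partition of the node set:

```python
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# Two triangles joined by one edge; label propagation usually settles
# on one label per triangle.
G = nx.barbell_graph(3, 0)
comms = [set(c) for c in label_propagation_communities(G)]
covered = set().union(*comms)
```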

Datasets.

As real-world datasets, we used the following five well-known network instances with known node attributes.

  • School Friendship: a high school friendship network with 6 communities [41].
  • Zachary’s karate club: a social network of friendships between 34 members of a karate club at a US university in the 1970s [42].
  • Word adjacencies: an adjacency network of common adjectives and nouns in the novel David Copperfield by Charles Dickens [43].
  • Books about US politics: A network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com. The edges between the books represent frequent co-purchasing of books by the same buyers. The dataset can be found on Valdis Krebs’ website http://www.orgnet.com
  • American College Football: a graph of the games between college football teams, which belong to 12 different conferences [44].

We constructed synthetic networks of several types with known ground truth. The graphs created by the recently published network generator “FARZ” [45] have the prefix “farz”. A float number in the name is the value of the parameter beta. In all cases we used n = 200, m = 5, k = 5, alpha = 0.2, gamma = 0.5. Planted partition graphs (PP-graphs) [46] have the prefix “PP”. In turn, the networks generated by the stochastic block model [47] (SBM-graphs) have the prefix “SBM”. The first float number in the name of an SBM/PP-graph is the intra-cluster edge probability; the second is the probability of an edge between clusters. We use the prefix “bench” for Lancichinetti networks [48]. A float number after the prefix is the mixing parameter (mu). The remaining parameters are N = 200, k = 15, maxK = 50, minC = 5, maxC = 50, on = 20, om = 2. The parameters used for generating the networks bench_30,…, bench_60_dense can be found in S8 Appendix. All instances of the generated graphs can be found in our GitHub repo. In addition, all parameters that were used for graph generation can be found inside the script exp_CDLIB.py. Finally, all random seeds used in the experiments were fixed, so all computational results can be reproduced.
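For reference, a planted partition graph in the spirit of the “PP” instances can be generated directly with NetworkX (the sizes and probabilities below are illustrative, not the exact parameters of our instances):

```python
import networkx as nx

# 4 groups of 25 nodes: dense inside each group (p_in), sparse
# between groups (p_out); the seed makes the instance reproducible.
G = nx.planted_partition_graph(4, 25, p_in=0.5, p_out=0.02, seed=42)
blocks = G.graph["partition"]   # the ground-truth groups
```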

All the relevant information about the above-mentioned benchmark networks is presented in Table 1.

thumbnail
Table 1. Basic statistics of the networks used in the computational experiments.

https://doi.org/10.1371/journal.pone.0255717.t001

Implementation

For historical reasons, we made two implementations of the LPAM method, in Java and in Python. Both can be found in our repository on GitHub at https://github.com/aponom84/lpam-clustering. The Java implementation includes both the exact and the heuristic versions of the LPAM method; the Python version is available only for the heuristic. The exact version solves the k-median problem by employing an efficient mixed-integer linear programming model by Goldengorin [49]. The Java heuristic implementation solves the k-median problem using an implementation of the CLARANS heuristic [50] from the Smile library [51]. The Python implementation comes as two methods (lpam_python_amp for amplified commute distance and lpam_python_cm for commute distance) included in the script exp_CDLIB.py. The script performs all the computational experiments presented in the paper except those with the exact version. The source code of the experiments with the exact version is in Jupyter notebooks which call the Java code. Most of the code depends on the CDLIB [52], PyClustering [53], and NetworkX [54] libraries.

Results

In Table 2 we provide a comparison between the exact and heuristic versions of the LPAM method. The ONMI values marked with an asterisk correspond to cases where the heuristic solution of the k-median problem produces slightly better results than the exact solution. This is caused by the fact that sometimes the ground-truth community structure does not match the neighbourhoods of the medoids; thus, missing the global minimum of the k-median problem can lead to a solution that is closer to the given cluster assignment.

thumbnail
Table 2. Comparison of the exact and heuristic versions of the LPAM method in terms of ONMI.

https://doi.org/10.1371/journal.pone.0255717.t002

The computational times for the exact and the heuristic Java-versions of the LPAM method can be seen in Table 3.

thumbnail
Table 3. Running times of the exact version of the LPAM method in comparison with the heuristic.

https://doi.org/10.1371/journal.pone.0255717.t003

We studied the median and maximum values of the F1, omega index, and overlapping normalized mutual information (ONMI) scoring functions for measuring the similarity between the partitioning produced by a method and the known cluster assignment. We also studied the values of intra-cluster edge density and normalized cut [55]. As can be seen from Tables 4–6, the LPAM method gives the best median value of the F1 score for several instances of the planted partition model. The GCE and Leiden methods have the best results in most cases. However, the results for the maximum F1 score (Table 7), maximum omega index (Table 8), and maximum ONMI (Table 9) show that the LPAM method equipped with the amplified commute distance gives the best scores for most PP-graphs, the football network, and lattice8x8. Interestingly, the LPAM method with commute distance is able to find the exact solution (F1, omega index, and ONMI scores of 1) for the karate club.

thumbnail
Table 4. F1 median score results table.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t004

thumbnail
Table 5. Median values of omega index.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t005

thumbnail
Table 6. Median values of overlapping normalized mutual information (LFK).

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t006

thumbnail
Table 7. F1 max score results table.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t007

thumbnail
Table 8. Maximum values of omega index for each clustering method.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t008

thumbnail
Table 9. Maximum values of overlapping normalized mutual information (LFK) produced by the clustering algorithms.

Maximum values in a row are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t009

Tables 10 and 11 show that, in terms of internal cluster density, the best solutions for PP/SBM-graphs are found by the LPAM-amp method. The LPAM method with commute distance gives the densest partitioning for FARZ networks, while the k-clique method produces the densest solutions for Lancichinetti and FARZ networks.

thumbnail
Table 10. Median values of internal cluster edge density.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t010

thumbnail
Table 11. Maximum values of internal cluster edge density.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t011

A covering whose communities have no outgoing edges, for example because boundary nodes are left unassigned to any community, has a normalized cut of zero. That is why there are many zero values in Tables 12 and 13.

thumbnail
Table 12. Median values of normalized cut.

Maximum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t012

thumbnail
Table 13. Minimum values of normalized cut produced by a method.

Minimum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t013

Generally, based on our computational experiments, we can conclude that no single method dominates the rest with respect to the proximity to the ground-truth covering across all data sets.

The average, and maximum execution times for each compared clustering method are presented in Tables 14 and 15 respectively.

thumbnail
Table 14. The average execution time (seconds).

Minimum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t014

thumbnail
Table 15. Maximum execution time (seconds).

Minimum values are highlighted with a green background.

https://doi.org/10.1371/journal.pone.0255717.t015

An example of the method's output for the School Friendship instance can be seen in Fig 2, with the associated line graph in Fig 3.

thumbnail
Fig 2. Clustering results of the exact version of the LPAM method with amplified commute distance for the School Friendship network.

θ = 0.5, k = 7. The paired numbers, separated by a colon inside the nodes, denote the cluster ID predicted by the LPAM method and the node attribute, respectively. In addition, the LPAM algorithm's coverings are denoted by colors. As seen in the picture, the algorithm correctly revealed communities 3, 5, and 6. Also, because of the parameter θ = 0.5, the algorithm does not assign nodes that have connections with more than two clusters to any cluster. The LPAM method falsely separates community 4 into two clusters (4, yellow, and 3, purple). Moreover, the algorithm almost correctly identifies the small community 1 (green); however, it incorrectly assigns a node from community 2 to this cluster.

https://doi.org/10.1371/journal.pone.0255717.g002

Fig 3. A line graph which is produced by the exact version of the LPAM method with amplified commute distance for the School Friendship network (θ = 0.5, k = 7).

https://doi.org/10.1371/journal.pone.0255717.g003

We should also note that for the regular lattice example, the LPAM method naturally produces a covering that matches the natural separation. In contrast, the GCE method produces no result for this example, and the OSLOM method can give a good result only accidentally, with appropriately chosen input parameters.

Discussion

School friendship example

For the school friendship network, the LPAM method produced quite accurate results.

In Fig 2, the pairs of numbers separated by a colon inside the nodes denote the cluster ID predicted by the LPAM method and the ground truth, respectively. In addition, the covering produced by the LPAM algorithm is denoted by colors. As seen in the picture, the algorithm correctly finds communities 3, 5, and 6. The two nodes that belong to three clusters have also been detected correctly. The LPAM method wrongly separates community 4 into two clusters (yellow and purple). It also wrongly assigns one node from community 2 to the green cluster, although it almost correctly identifies the small community 1. Lastly, it should be noted that the community structure produced by the LPAM algorithm for 7 clusters (k = 7) has a larger ONMI score than for 6 clusters (k = 6).

The behaviour of the ONMI value depending on the threshold parameter θ for the LPAM method is shown in Fig 4. As can be seen, in most cases the maximum ONMI value is reached when the threshold value θ lies between 0.3 and 0.6. This can be attributed to the fact that, for proximity to the ground-truth covering, it is usually better to assign a vertex to one cluster or to no cluster rather than to several clusters.
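The link-to-node covering step that this threshold controls can be sketched as follows. This is a plausible reconstruction of the rule discussed above (a node joins every cluster that holds at least a fraction θ of its incident links), not the authors' released code, and the function and parameter names are illustrative:

```python
from collections import Counter, defaultdict

def links_to_cover(link_labels, theta=0.5):
    """Convert a partition of links into an overlapping node cover.

    link_labels: dict mapping an edge (u, v) to its cluster id.
    A node joins every cluster holding at least a fraction `theta`
    of its incident links; with theta = 0.5, a node whose links are
    split among three or more clusters joins none of them."""
    incident = defaultdict(Counter)
    for (u, v), c in link_labels.items():
        incident[u][c] += 1
        incident[v][c] += 1
    cover = defaultdict(set)
    for node, counts in incident.items():
        deg = sum(counts.values())
        for c, cnt in counts.items():
            if cnt / deg >= theta:
                cover[c].add(node)
    return dict(cover)
```

For example, with links (1,2) and (2,3) in cluster A and (3,4) in cluster B, node 3 splits its two links evenly and, at θ = 0.5, belongs to both A and B.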

Fig 4. The ONMI results for the exact version of the LPAM method with amplified commute distance depending on the threshold parameter θ.

https://doi.org/10.1371/journal.pone.0255717.g004

The final coverings for the four combinations of the exact/heuristic versions of the LPAM method with commute and amplified commute distances, for all datasets, are presented in the S1–S4 Appendices. The resulting pictures for the best ONMI values, as well as the full study of the dependence of the ONMI value on the input parameters for the GCE, OSLOM, and COPRA methods, can be found in the S5–S7 Appendices, respectively. The clustering results of the heuristic version of the LPAM method with amplified commute distance for the FARZ networks with 200 nodes and 5 communities can be found in Fig 5.

Fig 5. The clustering results of the heuristic version of the LPAM method with amplified commute distance for the FARZ networks with 200 nodes and 5 communities.

(a) β = 1, (b) β = 0.95, (c) β = 0.9, (d) β = 0.85, (e) β = 0.8, (f) β = 0.75, (g) β = 0.7, (h) β = 0.65, (i) β = 0.6, (j) β = 0.55, (k) β = 0.5.

https://doi.org/10.1371/journal.pone.0255717.g005

Computational complexity

The computational complexity of the LPAM method consists of three parts: TIME(building line graph) + TIME(calculating distance matrix) + TIME(solving k-median problem). The first term, TIME(building line graph), requires Θ(|E|) time. The second term, TIME(calculating distance matrix), depends on the distance used. A matrix of shortest-path distances can be calculated with the Floyd–Warshall algorithm [56] in time cubic in the number of nodes of the line graph, that is, in Θ(|E|^3) time. Both the commute distance and the amplified commute distance require calculating the Moore–Penrose pseudo-inverse of a matrix. Naively, this also takes Θ(|E|^3) time, because we need to obtain |E| eigenvectors. Theoretically, however, an algorithm with computational complexity close to quadratic is possible [57].
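As an illustration, the first two stages (building the line graph and computing the commute-distance matrix via the Moore–Penrose pseudo-inverse of the line-graph Laplacian) can be sketched with NetworkX and NumPy. This is our own sketch, not the implementation used in the experiments; it relies on the standard identity c(i, j) = vol * (L+_ii + L+_jj - 2 L+_ij), where L+ is the pseudo-inverse of the Laplacian:

```python
import networkx as nx
import numpy as np

def commute_distance_matrix(G):
    """Commute distances between the edges of G (nodes of its line graph)."""
    LG = nx.line_graph(G)                       # Theta(|E|) construction
    nodes = sorted(LG.nodes())                  # edges of G, in a fixed order
    L = nx.laplacian_matrix(LG, nodelist=nodes).toarray().astype(float)
    # Dense pseudo-inverse: the Theta(|E|^3) step discussed above
    Lp = np.linalg.pinv(L)
    vol = 2 * LG.number_of_edges()              # volume of the line graph
    d = np.diag(Lp)
    # c(i, j) = vol * (Lp_ii + Lp_jj - 2 Lp_ij)
    C = vol * (d[:, None] + d[None, :] - 2 * Lp)
    return nodes, C

nodes, C = commute_distance_matrix(nx.karate_club_graph())
```

The resulting matrix C is symmetric with a zero diagonal and can be fed directly to a k-median solver over the line-graph nodes.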

Exact version.

Because the k-median problem is NP-hard, the exact version has exponential worst-case complexity; in the case of link partitioning, the complexity is O(2^|E|). Therefore, the total complexity of the exact LPAM method is dominated by the third, exponential term, TIME(solving k-median problem).
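A brute-force exact k-median solver over a precomputed distance matrix makes the exponential term concrete: the loop enumerates all C(n, k) medoid subsets. This is a sketch feasible only for small instances, not the exact solver used in our experiments:

```python
from itertools import combinations
import numpy as np

def k_median_exact(D, k):
    """Exact k-median by exhaustive search over medoid subsets.

    D: symmetric (n x n) distance matrix. Returns the best medoid
    subset and its cost (sum of distances to the nearest medoid)."""
    n = D.shape[0]
    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        # each point is served by its nearest medoid
        cost = D[:, medoids].min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost
```

For four points 0, 1, 10, 11 on a line with k = 2, the optimum picks one medoid per pair, giving a total cost of 2.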

Heuristic version.

In the heuristic case, we obtain a solution with the CLARANS method, which can be considered a kind of randomized local search or variable neighbourhood search. CLARANS is an iterative heuristic: on each iteration it tries to improve the current solution, and it stops when no further improvement is found. It is hard to determine how many iterations the algorithm makes before it stops; a rough upper bound is the size of the solution space, which is exponential. Moreover, the CLARANS method makes several attempts to reach different local minima. Thus, it is hard to establish tight computational complexity bounds.
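A CLARANS-style search can be sketched as follows. The parameter names numlocal and maxneighbor follow the original CLARANS paper [50], but the code is an illustrative simplification, not the implementation used in our experiments:

```python
import random
import numpy as np

def clarans_like(D, k, numlocal=4, maxneighbor=50, seed=0):
    """CLARANS-style k-medoid search over a distance matrix D.

    Restarts local search numlocal times; each search examines random
    medoid swaps and stops after maxneighbor consecutive failures."""
    rng = random.Random(seed)
    n = D.shape[0]
    cost = lambda med: D[:, med].min(axis=1).sum()
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(range(n), k)
        current_cost = cost(current)
        fails = 0
        while fails < maxneighbor:
            # random neighbour: swap one medoid for a random non-medoid
            i = rng.randrange(k)
            cand = rng.choice([v for v in range(n) if v not in current])
            neighbour = current[:i] + [cand] + current[i + 1:]
            nc = cost(neighbour)
            if nc < current_cost:
                current, current_cost, fails = neighbour, nc, 0
            else:
                fails += 1
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
    return best, best_cost
```

On the same four-point example as above, every non-optimal medoid pair has only improving neighbours, so the search reaches the optimal cost of 2 regardless of the seed.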

Conclusion

In this paper, we propose a new method for the detection of overlapping communities in networks with a predefined number of clusters. The proposed method finds disjoint communities in the line graph of the original network by partitioning around medoids. The resulting link partition naturally induces an overlapping community structure on the original graph. The link partitioning uses the commute distance and a variation of it, the amplified commute distance, which produces more accurate results.

Experimental results on a set of well-known benchmark instances, as well as artificially generated instances with known ground truth, demonstrate that the proposed method can compete with existing methods in the literature, which motivates us to improve it further. The computational results also demonstrate that the heuristic version produces results very close to those of the exact version.

Supporting information

S1 Appendix. Computational results of LPAM-AMP-Exact.

The computational results and the clustering of the exact LPAM method.

https://doi.org/10.1371/journal.pone.0255717.s001

(PDF)

S2 Appendix. Computational results of LPAM-AMP-Heuristic.

The computational results and the clustering of the heuristic LPAM method with amplified commute distance, using the CLARANS heuristic to solve the k-median problem.

https://doi.org/10.1371/journal.pone.0255717.s002

(PDF)

S3 Appendix. Computational results of LPAM-CM-Exact.

The computational results for the exact version of the LPAM method with commute distance.

https://doi.org/10.1371/journal.pone.0255717.s003

(PDF)

S4 Appendix. Computational results of LPAM-CM-Heuristic.

The computational results for the heuristic version of the LPAM method with commute distance.

https://doi.org/10.1371/journal.pone.0255717.s004

(PDF)

S5 Appendix. Computational results of GCE.

The computational results for the Greedy Clique Expansion method.

https://doi.org/10.1371/journal.pone.0255717.s005

(PDF)

S6 Appendix. Computational results of OSLOM.

The computational results for the OSLOM method.

https://doi.org/10.1371/journal.pone.0255717.s006

(PDF)

S7 Appendix. Computational results of COPRA.

The computational results for the COPRA method.

https://doi.org/10.1371/journal.pone.0255717.s007

(PDF)

S8 Appendix. Benchmark networks generator flags.

The set of flags used for the benchmark networks generator by Andrea Lancichinetti and Santo Fortunato.

https://doi.org/10.1371/journal.pone.0255717.s008

(PDF)

Acknowledgments

The article was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics.

The authors would like to thank Anna Yaushkina and Nikita Putikhin for their help with the implementation of the amplified commute distance function and for implementing the heuristic Java version of the LPAM method. A special thanks goes to Eldar Yusupov, who set up the computational cluster of the LATNA laboratory at the HSE. We also thank Alexey Malafeev, who helped to improve the language of the paper, and Peter Miasnikof, who found a bug in the Python implementation of the amplified commute distance.

References

1. Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Physical Review E. 2004;69(2):026113.
2. van Laarhoven T, Marchiori E. Axioms for graph clustering quality functions. Journal of Machine Learning Research. 2014;15:193–215.
3. Chakraborty T, Dalmia A, Mukherjee A, Ganguly N. Metrics for Community Analysis: A Survey. ACM Computing Surveys. 2017;50(4):54:1–54:37.
4. Xie J, Kelley S, Szymanski BK. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys. 2013;45(4):43.
5. Evans TS, Lambiotte R. Line graphs, link partitions, and overlapping communities. Physical Review E. 2009;80:016105.
6. Evans TS, Lambiotte R. Line graphs of weighted networks for overlapping communities. The European Physical Journal B. 2010;77(2):265–272.
7. Kim Y, Jeong H. Map equation for link communities. Physical Review E. 2011;84(2):026110.
8. Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences. 2008;105(4):1118–1123.
9. Evans TS. Clique graphs and overlapping communities. Journal of Statistical Mechanics: Theory and Experiment. 2010;2010(12):P12037.
10. Zhang Y, Levina E, Zhu J. Detecting overlapping communities in networks using spectral methods. arXiv preprint arXiv:1412.3432. 2014.
11. Magdon-Ismail M, Purnell JT. SSDE-Cluster: Fast Overlapping Clustering of Networks Using Sampled Spectral Distance Embedding and GMMs. In: 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing; 2011. p. 756–759.
12. Zhang S, Wang RS, Zhang XS. Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A: Statistical Mechanics and its Applications. 2007;374(1):483–490.
13. Shen HW, Cheng XQ, Guo JF. Quantifying and identifying the overlapping community structure in networks. Journal of Statistical Mechanics: Theory and Experiment. 2009;2009(07):P07042.
14. Yen L, Vanvyve D, Wouters F, Fouss F, Verleysen M, Saerens M. Clustering using a random walk based distance measure. In: ESANN; 2005. p. 317–324.
15. von Luxburg U, Radl A, Hein M. Getting lost in space: Large sample analysis of the resistance distance. In: Advances in Neural Information Processing Systems; 2010. p. 2622–2630.
16. Fowler RJ, Paterson MS, Tanimoto SL. Optimal packing and covering in the plane are NP-complete. Information Processing Letters. 1981;12(3):133–137.
17. Gonzalez TF. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science. 1985;38:293–306.
18. Lovász L. Random Walks on Graphs: A Survey. In: Miklós D, Sós VT, Szőnyi T, editors. Combinatorics, Paul Erdős is Eighty. vol. 2. Budapest: János Bolyai Mathematical Society; 1996. p. 353–398.
19. Klein DJ, Randić M. Resistance distance. Journal of Mathematical Chemistry. 1993;12(1):81–95.
20. McDaid AF, Greene D, Hurley N. Normalized Mutual Information to evaluate overlapping community finding algorithms. 2011.
21. Peel L, Larremore DB, Clauset A. The ground truth about metadata and community detection in networks. Science Advances. 2017;3(5):e1602548.
22. Gates AJ, Wood IB, Hetrick WP, Ahn YY. On comparing clusterings: an element-centric framework unifies overlaps and hierarchy. arXiv preprint arXiv:1706.06136. 2017.
23. Cheraghchi HS, Zakerolhosseini A. Mining Dynamic Communities based on a Novel Link-Clustering Algorithm. International Journal of Information & Communication Technology Research. 2017;9(1):45–51.
24. Lee C, Reid F, McDaid A, Hurley N. Detecting highly overlapping community structure by greedy clique expansion. arXiv preprint arXiv:1002.1827. 2010.
25. Lancichinetti A, Fortunato S, Kertész J. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics. 2009;11(3):033015.
26. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S. Finding statistically significant communities in networks. PLoS ONE. 2011;6(4):e18961.
27. Molloy M, Reed B. A critical point for random graphs with a given degree sequence. Random Structures & Algorithms. 1995;6(2-3):161–180.
28. Gregory S. Finding overlapping communities using disjoint community detection algorithms. Complex Networks. 2009. p. 47–61.
29. Zhu X, Ghahramani Z. Learning from labeled and unlabeled data with label propagation. 2002.
30. Wu ZH, Lin YF, Gregory S, Wan HY, Tian SF. Balanced multi-label propagation for overlapping community detection in social networks. Journal of Computer Science and Technology. 2012;27(3):468–479.
31. Kasoro N, Kasereka S, Mayogha E, Vinh HT, Kinganga J. PercoMCV: A hybrid approach of community detection in social networks. Procedia Computer Science. 2019;151:45–52.
32. Ye F, Chen C, Zheng Z. Deep autoencoder-like nonnegative matrix factorization for community detection. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management; 2018. p. 1393–1402.
33. Xie J, Szymanski BK, Liu X. SLPA: Uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: 2011 IEEE 11th International Conference on Data Mining Workshops. IEEE; 2011. p. 344–349.
34. Epasto A, Lattanzi S, Paes Leme R. Ego-splitting framework: From non-overlapping to overlapping clusters. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2017. p. 145–154.
35. Coscia M, Rossetti G, Giannotti F, Pedreschi D. Uncovering hierarchical and overlapping communities with a local-first approach. ACM Transactions on Knowledge Discovery from Data. 2014;9(1):1–27.
36. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435(7043):814–818.
37. Baumes J, Goldberg M, Magdon-Ismail M. Efficient identification of overlapping communities. In: International Conference on Intelligence and Security Informatics. Springer; 2005. p. 27–36.
38. Rossetti G. Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery. In: International Conference on Complex Networks and Their Applications. Springer; 2019. p. 152–163.
39. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports. 2019;9(1):1–12.
40. Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E. 2007;76(3):036106.
41. Ding Z, Zhang X, Sun D, Luo B. Overlapping community detection based on network decomposition. Scientific Reports. 2016;6:24115.
42. Zachary WW. An information flow model for conflict and fission in small groups. Journal of Anthropological Research. 1977;33(4):452–473.
43. Newman MEJ. Finding community structure in networks using the eigenvectors of matrices. Physical Review E. 2006;74(3):036104.
44. Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826.
45. Fagnan J, Abnar A, Rabbany R, Zaiane OR. Modular Networks for Validating Community Detection Algorithms. arXiv preprint arXiv:1801.01229. 2018.
46. Fortunato S. Community detection in graphs. Physics Reports. 2010;486(3-5):75–174.
47. Holland PW, Laskey KB, Leinhardt S. Stochastic blockmodels: First steps. Social Networks. 1983;5(2):109–137.
48. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Physical Review E. 2008;78(4):046110.
49. AlBdaiwi BF, Ghosh D, Goldengorin B. Data aggregation for p-median problems. Journal of Combinatorial Optimization. 2011;21(3):348–363.
50. Ng RT, Han J. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering. 2002;14(5):1003–1016.
51. Smile: Statistical Machine Intelligence and Learning Engine. https://haifengl.github.io/smile/.
52. Rossetti G, et al. GiulioRossetti/cdlib: Beeblebrox Zaphod; 2021. Available from: https://doi.org/10.5281/zenodo.4575156.
53. Novikov A. PyClustering: Data Mining Library. Journal of Open Source Software. 2019;4(36):1230.
54. Hagberg A, Swart P, Schult D. Exploring network structure, dynamics, and function using NetworkX. Los Alamos National Laboratory (LANL), Los Alamos, NM, United States; 2008.
55. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2000;22(8):888–905.
56. Floyd RW. Algorithm 97: shortest path. Communications of the ACM. 1962;5(6):345.
57. Demmel J, Dumitriu I, Holtz O. Fast linear algebra is stable. Numerische Mathematik. 2007;108(1):59–91.