Overlapping communities detection through weighted graph community games

  • Stefano Benati,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Dipartimento di Sociologia e Ricerca Sociale, Università di Trento, Trento, Italy

  • Justo Puerto,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation IMUS, Universidad de Sevilla, Sevilla, Spain

  • Antonio M. Rodríguez-Chía,

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Faculty of Sciences, Universidad de Cádiz, Puerto Real (Cádiz), Spain

  • Francisco Temprano

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ftgarcia@us.es

    Affiliation IMUS, Universidad de Sevilla, Sevilla, Spain

Correction

2 Jan 2024: Benati S, Puerto J, Rodríguez-Chía AM, Temprano F (2024) Correction: Overlapping communities detection through weighted graph community games. PLOS ONE 19(1): e0296580. https://doi.org/10.1371/journal.pone.0296580

Abstract

We propose a new model to detect the overlapping communities of a network that is based on cooperative games and mathematical programming. More specifically, communities are defined as stable coalitions of a weighted graph community game and they are revealed as the optimal solution of a mixed-integer linear programming problem. Exact optimal solutions are obtained for small and medium sized instances and it is shown that they provide useful information about the network structure, improving on previous contributions. Next, a heuristic algorithm is developed to solve the largest instances and used to compare two variations of the objective function.

Introduction

The community detection problem consists of partitioning the node set of a network, or graph, in such a way that the node subsets can be meaningfully interpreted as communities. The methods proposed in the literature so far differ in two main aspects: how the notion of community is translated into mathematical terms, and which algorithm is used to compute the communities. For example, the classic contribution of [1] defines a community as a group of nodes whose arc density is greater than what would be expected from random pairing of nodes, and then proposes a method based on spectral decomposition to find communities. It is not possible to mention all the contributions and developments that followed that seminal paper (see [2] for a comprehensive survey), so we focus on the two most important lines of research that motivate our contribution. The first innovation recognizes that in some cases it is too restrictive to impose a strict partition of the nodes, as some nodes may realistically belong to more than one community. Thus, communities can overlap, and the solution structure is an assignment of nodes to communities rather than a strict partition. A seminal contribution on overlapping communities can be found in [3], and a summary of early findings in [4]. The second innovation is to formulate community detection as an optimization problem, with a clearly stated objective function and well-defined constraints. For example, in [5], the modularity model is developed into a quadratic integer program, corresponding to the well-known maximum clique partitioning problem. Other contributions can be found in [6–8].

The objective function is simply a statistic that evaluates partitions or node assignments. As such, it can be used to compare alternative community structures and to decide which is the most meaningful. One of the most popular statistics is modularity, see [1]. Modularity is an index that, for a given partition, compares the arc density of a subset with the density that would be obtained under random pairing of nodes. The higher the modularity, the more connected the nodes within a community are, which gives a clear and meaningful definition of what a community is. The extension of modularity to overlapping communities was proposed in [9], using fuzzy membership functions that are optimized with the fuzzy c-means algorithm. This method has been elaborated further in [10–13], where the standard modularity function is modified by node or arc weights representing node affinity, fuzzy memberships, or other features. An alternative version of the objective function proposed in [9] is presented in [14], which fixes some biases of the original one. In [7], it is proposed to maximize the modularity function with additional constraints that allow some nodes to belong to more than one community. These nodes are referred to as bridges.

In [15], communities are defined as stable coalitions of a cooperative game. In a cooperative game, a coalition is stable if no member can obtain a better payoff by leaving it, so a community is based on the concept of a common interest. There is ample room to define this common interest through the characteristic function of a game, such as market, voting, or matching games. To take into account only the topological properties of the network, such as the arc density and the common neighbors of the nodes, [15] proposes a weighted graph community game, with arc weights defined through specific topological indicators. Next, an objective function is proposed to discriminate between alternative community structures, and a constructive heuristic is implemented to find them.

In our contribution, we formulate the problem of finding communities as stable coalitions, as proposed in [15], as a mixed-integer linear programming problem. In this way, taking advantage of existing software, we can calculate the optimal communities of that model without resorting to any heuristic, and therefore evaluate its optimal solutions free of the biases that a heuristic introduces. Indeed, we found that the communities proposed in [15] are far from the optimal ones and, unfortunately, the optimal ones are inconsistent too, in the sense that they do not correspond to what one empirically expects to find. As will be discussed, we argue that the inconsistency is caused by the way the costs of the weighted graph community game are defined, and we therefore propose a correction to them. Our correction follows the spirit of the modularity function [1], in which the actual value of a statistic is compared with its expected value in the absence of any community structure. Our computational tests show that the correction is reliable and effective, as our method recognizes the hidden community structure of the networks. As a by-product, we note that our cost definition relies on the expected value of some network statistics under the assumption that no community is embedded in the network. To obtain an accurate cost estimate, we derived a new theorem that gives the exact value of these statistics. This theorem may be of independent interest for other applications in which exact probabilities can be used, such as the seminal paper [1] itself.

To summarize, the contributions of our paper are the following:

  1. We provide a mathematical formulation of the method proposed by [15] to detect the overlapping communities of a network.
  2. We show that the communities obtained with this methodology are not the real communities embedded in the network, and we propose an amendment to the game cost function that corrects the bias.
  3. We propose a heuristic algorithm that can calculate the optimal communities when the exact method fails because of the network size.
  4. We apply our new mathematical model to real and artificial test problems and we show its effectiveness and reliability.

The paper is organized in four sections. In the Introduction, we motivate the purpose of the paper and summarize its contributions. In the Material and methods section, we formally introduce the overlapping community detection problem and the methods proposed by [15]. There, we design the exact optimization model and observe that it finds inconsistent communities. In the subsection Detecting overlapping communities as stable coalitions of a cooperative game, we propose an alternative definition of the costs of the weighted graph community game that leads to a different objective function of the optimization model. In the Local Stability Exploration subsection, we present a heuristic algorithm for solving our model when the network is too large to compute the exact solution in a reasonable amount of time. In the Results and discussion section, we compare the exact and heuristic algorithms, report computational results of a controlled experiment on graphs generated according to the method proposed in [16], and show that our method correctly recovers the community structure. The paper ends with concluding remarks and an outline of future research in the final section, Conclusion.

Material and methods

Detecting overlapping communities as stable coalitions of a cooperative game

In [15], a cooperative game on a weighted graph is defined to characterize overlapping communities. The nodes of the graph are considered the players of a network game, and the Shapley value is used to characterize stable coalitions, i.e. subsets of nodes in which no player has any incentive to leave. Specifically, the cooperative game (V, φ) is defined on the weighted graph G = (V, E), with V = {1, …, n}, i.e. players are nodes labeled from 1 to n, and weights $W_{ij} \ge 0$ are defined for every edge (i, j) ∈ E. The game characteristic function is then:

$$\varphi(S) = \sum_{(i,j) \in E:\ i, j \in S} W_{ij}, \qquad S \subseteq V. \tag{1}$$

That is, the value of coalition S is the weights sum of the edges of the subgraph induced by S. The model has been called Weighted Graph Community (WGC) Game in the aforementioned paper.

When a coalition $S \subseteq V$ is about to form, its members $i \in S$ can calculate the gain they obtain from it, i.e. the share of the payoff φ(S) that they receive. A standard result for cooperative games is that this share is the Shapley value of the game restricted to S. For player i and coalition S, with $i \in S$, the Shapley value is:

$$\varphi_i(S) = \frac{1}{2} \sum_{j \in S \setminus \{i\}} W_{ij}.$$

Hence, the profit of player i from coalition S depends on the total weight of its connection with the other members of S.
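As a small illustration of the two definitions above, the following sketch evaluates the characteristic function and checks that the closed-form Shapley value (half the weight linking i to the rest of S) agrees with the textbook permutation definition. The toy weights and all function names are hypothetical and chosen only for this example; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical toy weights W_ij >= 0 on an undirected graph (illustration only).
W = {(1, 2): 3, (1, 3): 1, (2, 3): 2, (3, 4): 5}


def weight(i, j):
    """Symmetric lookup of W_ij; pairs without an edge have weight 0."""
    return W.get((i, j), W.get((j, i), 0))


def phi(S):
    """Characteristic function: total weight of the edges induced by S."""
    S = sorted(S)
    return sum(weight(i, j) for a, i in enumerate(S) for j in S[a + 1:])


def shapley_closed_form(i, S):
    """Half the total weight linking player i to the other members of S."""
    return Fraction(sum(weight(i, j) for j in S if j != i), 2)


def shapley_by_permutations(i, S):
    """Average marginal contribution of i over all orderings of S."""
    total, count = Fraction(0), 0
    for order in permutations(S):
        before = order[:order.index(i)]
        total += phi(before + (i,)) - phi(before)
        count += 1
    return total / count


S = (1, 2, 3, 4)
for i in S:
    print(i, shapley_closed_form(i, S), shapley_by_permutations(i, S))
```

The brute-force version enumerates all orderings, so it is only usable for very small coalitions; it is included purely as a cross-check of the closed form.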

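In [15], a coalition S is defined as stable if no member of S gains by switching from coalition S to coalition V \ S. Writing $\varphi_i(S)$ for the Shapley value of player i in coalition S, this occurs if and only if:

$$\varphi_i(S) \ge \varphi_i\big((V \setminus S) \cup \{i\}\big) \quad \text{for all } i \in S. \tag{2}$$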

In fact, different definitions of stable coalitions can be found in the literature: stable coalition structures are defined in [17, 18], while in [19, 20] condition (2) is called the internal stability property. Moreover, the latter notion of stability imposes an additional property, requiring that no member of S gains by switching from S to any other subset S′ contained in V \ S. Writing $\varphi_i(S)$ for the Shapley value of player i in coalition S, this can be formalized as:

$$\varphi_i(S) \ge \varphi_i(S' \cup \{i\}) \quad \text{for all } i \in S \text{ and all } S' \subseteq V \setminus S. \tag{3}$$

However, we are not developing this issue further and we will remain with definition (2).
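A minimal sketch of the stability test of definition (2), assuming the Shapley value of a player is half the weight linking it to the rest of its coalition (so the factor 1/2 cancels on both sides). The graph, two unit-weight triangles joined by one edge, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical example: two triangles {1,2,3} and {4,5,6} joined by edge (3, 4).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
W = {frozenset(e): 1.0 for e in EDGES}
V = frozenset(range(1, 7))


def weight(i, j):
    """W_ij for the pair {i, j}; 0 when there is no edge."""
    return W.get(frozenset((i, j)), 0.0)


def is_stable(S, V=V):
    """Condition (2): every i in S gets at least as much from S as from (V \\ S) + {i}."""
    S = frozenset(S)
    outside = V - S
    return all(
        sum(weight(i, j) for j in S if j != i) >= sum(weight(i, j) for j in outside)
        for i in S
    )


def stable_coalitions(V=V):
    """Enumerate every non-empty stable coalition (exponential; tiny graphs only)."""
    nodes = sorted(V)
    return [
        frozenset(c)
        for r in range(1, len(nodes) + 1)
        for c in combinations(nodes, r)
        if is_stable(c, V)
    ]


print(is_stable({1, 2, 3}), is_stable({3, 4}), is_stable(V))
```

With non-negative weights the grand coalition always passes the test, since the outside sum is empty; this is the same effect that the exact model exhibits later in the text.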

Formulating a WGC game allows a formal definition of the feasible overlapping communities of a network: as a node can belong to more than one stable coalition, communities can overlap. However, a crucial feature of the model is the way in which the weights $W_{ij}$ are defined. In [15], the following formula is proposed. Let $k_i$ be the adjacency degree of node i (i.e. the number of nodes to which i is connected through an arc), let $P_{ij}$ be the partition ratio, and let CNij = (|common neighbors of i and j| + 1)Pij be the neighbourhood ratio of $i, j \in V$; then the weight of the arc (i, j), $i \ne j$, is (4)

The formula was proposed in [15] so that node similarity depends on both the direct and the indirect links between i and j. It is straightforward to observe that Wij ≥ 0, but this property has important consequences for the structure of the stable coalitions, as will be discussed later. For the moment, we focus on the methodology to find all the stable coalitions of a network. While [15] proposes a constructive method, that is, a heuristic technique with some ad-hoc adjustments to find stable coalitions, here we propose a mathematical programming approach in which all the considerations about stability discussed in [15] are translated into an objective function and mathematical constraints. We will show that stable coalitions can be represented by linear constraints involving binary variables and that, using an appropriate objective function, they can be determined by integer linear programming.

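Let nc be the maximum number of communities (this is not a binding constraint of the model, since nc can be chosen large enough to include all the feasible stable communities). For i = 1, …, n and k = 1, …, nc, the model variables are:

$$x_{ik} = \begin{cases} 1 & \text{if node } i \text{ belongs to community } S_k, \\ 0 & \text{otherwise.} \end{cases}$$

For any i, j = 1, …, n such that i < j and k = 1, …, nc:

$$z_{ijk} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ both belong to community } S_k, \\ 0 & \text{otherwise.} \end{cases}$$

The relationship between x- and z-variables is given by the logical/quadratic constraints $z_{ijk} = x_{ik} x_{jk}$ for all $i, j \in V$, i < j, and all k = 1, …, nc. The quadratic constraint can then be replaced by the linear constraints:

$$z_{ijk} \le x_{ik}, \tag{5}$$
$$z_{ijk} \le x_{jk}, \tag{6}$$
$$z_{ijk} \ge x_{ik} + x_{jk} - 1. \tag{7}$$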
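The replacement of the product of two binary variables by linear constraints can be checked exhaustively. The sketch below assumes the standard three-inequality linearization (z ≤ x_i, z ≤ x_j, z ≥ x_i + x_j − 1); whether this is exactly the form printed as (5)–(7) is an assumption, and the function names are hypothetical.

```python
from itertools import product


def satisfies_linearization(x_i, x_j, z):
    """Standard linearization of z = x_i * x_j for binary x_i, x_j."""
    return z <= x_i and z <= x_j and z >= x_i + x_j - 1


def feasible_binary_z(x_i, x_j):
    """All binary values of z allowed by the three linear constraints."""
    return [z for z in (0, 1) if satisfies_linearization(x_i, x_j, z)]


def z_interval(x_i, x_j):
    """Range of z when z is relaxed to the continuous interval [0, 1]."""
    lower = max(0, x_i + x_j - 1)
    upper = min(1, x_i, x_j)
    return lower, upper


for x_i, x_j in product((0, 1), repeat=2):
    print(x_i, x_j, feasible_binary_z(x_i, x_j), z_interval(x_i, x_j))
```

For every binary pair the interval collapses to a single point equal to the product, which is why such auxiliary variables can be declared continuous without changing the feasible set.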

Next, using the binary x-variables, the stability condition (2) can be characterized by linear constraints too. First, for fixed i and k, consider the quadratic inequality:

$$x_{ik} \Big( \frac{1}{2} \sum_{j \ne i} W_{ij} x_{jk} - \frac{1}{2} \sum_{j \ne i} W_{ij} (1 - x_{jk}) \Big) \ge 0.$$

If $x_{ik} = 1$, then i belongs to coalition $S_k$, so $S_k$ must be stable. For stability, player i's Shapley value from coalition $S_k$ must be greater than or equal to its Shapley value from the opposite coalition $(V \setminus S_k) \cup \{i\}$. The first term in the parentheses is the Shapley value of i in coalition $S_k$, as the j's such that $x_{jk} = 1$ are all the other players of coalition $S_k$. Conversely, the j's such that $(1 - x_{jk}) = 1$ are the players excluded from $S_k$. Consequently, the second term is the Shapley value of i in the opposite coalition, $(V \setminus S_k) \cup \{i\}$. Finally, their difference must be greater than or equal to 0 for $S_k$ to be stable. Next, the above quadratic inequality can be simplified to the following linear one: (8)

Next, it must be imposed that overlapping coalitions/communities have a non-empty difference, i.e. the same coalition is not selected more than once and a coalition is not contained in a different one. To prevent inclusion, additional variables h are introduced for i = 1, …, n and pairs k, r such that 1 ≤ k < r ≤ nc, where $h_{ikr} = 1$ if node i belongs to coalition $S_r$ but not to coalition $S_k$, and $h_{ikr} = 0$ otherwise.

The relation between x- and h-variables is given by the quadratic constraint: hikr = xir(1 − xik), that can be replaced by three linear constraints as done for z-variables in expressions (5)–(7).

To prevent the inclusion of Sr in Sk, it must be that: (9)

The constraint is binding when $x_{ir} = 1$. In that case, coalition $S_r$ must contain at least one element j that is contained in $S_r$ but not in $S_k$, guaranteeing that $S_r \not\subseteq S_k$.

To conclude, we also introduce inequalities that avoid symmetric solutions. Symmetric solutions decrease the efficiency of the Integer Linear Programming solver, as the same structural solution can be obtained by multiple assignments of the variables x, z, h, simply by giving different labels to the coalitions. Note that constraints (9) prevent the same coalition from being replicated, so it is sufficient that, after ranking the communities from the largest to the smallest, they are assigned labels k in that order. The following constraints do the task: (10)

Every stable coalition corresponds to a point of the polytope described by the equations and inequalities introduced so far. To determine the most meaningful overlapping communities, the Shapley value of the nodes is used in the objective function. If a coalition Sk is established, then player i's Shapley value from coalition Sk is half the total weight of the edges joining i to the other members of Sk. Therefore, for a set of overlapping communities Sk, k = 1, …, nc, the total Shapley value of a player i is the sum of the values it gets from every coalition, that is: (11)

In [15], the most important overlapping coalitions are determined by maximizing the sum of the Shapley values of all nodes. Therefore, this index will be used as the objective function of the following integer programming formulation: (12) s.t.: (5)–(10), (13) (14) (15) (16) (17) (18) (19)

The objective function (12) represents the sum of the Shapley values for all nodes and communities. Constraints (13) guarantee that every node belongs to at least one community. Constraints (14)–(16) are the linear representations of the h-variables. Finally, constraints (17) define binary variables. Note that in (18) and (19), we can relax the z- and h-variables to be continuous, since the constraints on the x-variables force both to be binary.

FShJK is the exact Integer Programming formulation of the model proposed in [15]. However, in that seminal paper the overlapping communities were computed through a heuristic constructive procedure, in which the search for optimal solutions is combined with various ad-hoc adjustments to induce sufficient diversification of coalitions. The advantage of Integer Programming is that the output coalitions of FShJK are exactly the optimal ones, without any bias due to constructive rule-of-thumb procedures. As we will see, this allows us to point out a drawback of the game definition and to suggest a method to adjust it.

We apply formulation FShJK to Zachary's karate club network, fixing nc = 3. The optimal overlapping communities are shown in Fig 1. As can be seen, each selected community is the grand coalition (all the nodes belong to the same coalition) except one node. That is, communities are subsets S such that |S| = n − 1, in which the discarded node is one with few connections. It is hard to believe that those sets are of any interest to researchers, as they are far from the communities usually identified in Zachary's network. The same occurs with all the other problems we tested: the overlapping communities are the grand coalition except one node. The reason for this disappointing result is neither the solution method, i.e. exact vs heuristic, nor the community definition, i.e. using cooperative games and the Shapley value. Rather, the reason is the way in which the weights W are formulated in (4). As recognized in [15], if Wij ≥ 0 for all i, j, then the cooperative game (V, φ) is convex, that is, for two coalitions S, T such that $S \subseteq T$ and $i \notin T$, it always holds that:

$$\varphi(S \cup \{i\}) - \varphi(S) \le \varphi(T \cup \{i\}) - \varphi(T).$$

Fig 1. Zachary’s karate club structure obtained by FShJk with nc = 3.

(a) Community 1, (b) Community 2, (c) Community 3.

https://doi.org/10.1371/journal.pone.0283857.g001

This property establishes that the marginal gain that player i gets from joining a coalition is never smaller when the coalition is larger. Therefore, the Shapley values are always greatest for the largest coalitions, which is why the proposed method is doomed to mistake the largest subsets for communities. As we have pointed out, the weakness lies not in using cooperative games to define stable coalitions, but in using convex cooperative games. In this section, we provide a simple and effective way to correct this weakness. Our proposal is based on determining stability using a non-convex cooperative game.
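The role of the sign of the weights can be checked by brute force on a tiny example. The sketch below tests the convexity (supermodularity) inequality, marginal gain to S at most the marginal gain to T whenever S ⊆ T and i ∉ T, for every such triple; the two weight sets and the function names are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)


def phi(S, W):
    """Total weight of the edges with both endpoints in S."""
    return sum(w for (i, j), w in W.items() if i in S and j in S)


def is_convex(V, W):
    """Check phi(S+i) - phi(S) <= phi(T+i) - phi(T) for all S <= T, i not in T."""
    V = frozenset(V)
    for T in subsets(V):
        for S in subsets(T):
            for i in V - T:
                gain_small = phi(S | {i}, W) - phi(S, W)
                gain_large = phi(T | {i}, W) - phi(T, W)
                if gain_small > gain_large:
                    return False
    return True


V = {1, 2, 3, 4}
W_nonneg = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0, (3, 4): 0.5}  # all weights >= 0
W_mixed = {(1, 2): 1.0, (1, 3): -2.0, (2, 3): 1.0, (3, 4): 0.5}  # one negative weight

print(is_convex(V, W_nonneg), is_convex(V, W_mixed))
```

In the second weight set, joining node 3 costs node 1 two units, so its marginal gain to the empty coalition (zero) exceeds its marginal gain to {3}, and convexity fails.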

The computation of the expected weight on an arc.

As discussed in the previous section, weighted graph community games in which the arc weights satisfy Wij ≥ 0 are convex games, so the Shapley values increase with coalition size and only large communities tend to be detected. A straightforward way of avoiding convexity is to consider an alternative set of weights, not necessarily non-negative, so that optimal stable coalitions of small size may emerge as well. Here, we propose to combine the weights defined by (4) with modularity, so that the weights are normalized by their expected values and may take both negative and positive values. As a consequence, the resulting game is non-convex.

The modularity function, see [1], is a well-known index to detect communities in networks. The index compares the edge density of the empirical graph G = (V, E) (unweighted and undirected), |E| = m, with the expected edge density of a theoretical graph G′ = (V, E′) in which there are no communities by assumption. The expected edge density of G′ is calculated using a null hypothesis, i.e. an assumption about the edge distribution, called the configuration model [21]. If the graph does not contain communities, then for any two nodes i and j with degrees $k_i$ and $k_j$, the expected number of edges between i and j is approximated by $k_i k_j / (2m)$. Let $A_{ij} = 1$ if (i, j) ∈ E and $A_{ij} = 0$ otherwise (so that $A = [A_{ij}]$ is the adjacency matrix of G). Moreover, let Π be a partition of V and let δ(i, j) be the Kronecker delta: δ(i, j) = 1 if $i, j \in V$ belong to the same community, δ(i, j) = 0 otherwise. Then the modularity function of a partition Π is:

$$Q(\Pi) = \frac{1}{2m} \sum_{i, j \in V} \Big( A_{ij} - \frac{k_i k_j}{2m} \Big) \delta(i, j). \tag{20}$$
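For concreteness, the following sketch evaluates the standard (unweighted) modularity of a partition, summing $A_{ij} - k_i k_j/(2m)$ over all ordered pairs of nodes in the same community and dividing by 2m. The example graph, two triangles joined by a bridge, and the function name are hypothetical.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of a partition of a simple undirected graph.

    edges: list of (i, j) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    adjacent = set(edges) | {(j, i) for i, j in edges}

    q = 0.0
    for i in degree:
        for j in degree:
            if community[i] == community[j]:  # Kronecker delta
                a_ij = 1.0 if (i, j) in adjacent else 0.0
                q += a_ij - degree[i] * degree[j] / (2 * m)
    return q / (2 * m)


# Hypothetical example: two triangles joined by the edge (3, 4).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
two_triangles = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_block = {node: "a" for node in range(1, 7)}

print(modularity(EDGES, two_triangles))  # 5/14, about 0.357
print(modularity(EDGES, one_block))      # 0: a single community carries no information
```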

In the case under study, weights are defined through expression (4), in which the adjacency between nodes i and j is weighted by their common neighbors. However, modularity can be defined for weighted graphs as well. In the summation terms $A_{ij} - k_i k_j/(2m)$, the entries $A_{ij}$ are replaced by the weights $W_{ij}$, $k_i$ is replaced by the weight sum $W_i = \sum_j W_{ij}$, and m is replaced by $W = \sum_{(i,j) \in E} W_{ij}$, as described in [22]. In this way, modularity is still a function that compares the actual indices of an empirical graph with the expected indices of a random graph. Using modularity, we can define the modularity game (V, φ) as a weighted graph community game in which the characteristic function φ is defined as in (1), but with each weight $W_{ij}$ replaced by the modular term $W_{ij} - W_i W_j/(2W)$: (21)

In this case, the modified weights can take both positive and negative values, so that the game resulting from the characteristic function (1) is non-convex.

We elaborate this model further by noting that the modular term (21) should represent the difference between the empirical value $W_{ij}$ and its expected value under the assumption that the graph does not contain any communities. Unfortunately, the term $W_i W_j/(2W)$ is only an approximation of the true expectation, and this can cause unexpected biases. For example, when the weights $W_{ij}$ correspond to the adjacency matrix $A_{ij} \in \{0, 1\}$, the term $k_i k_j/(2m)$ is an estimate of the probability of an arc between i and j, but, if the graph is unbalanced, the term can be greater than 1, which is a nonsensical estimate of a probability. In our application, expression (4) contains specific terms about the graph structure, such as the arcs and the common neighbours between two nodes, and the bias between the true expectation and its approximation can potentially be large. For this reason, we made a special effort to calculate the exact expected values of expression (4) under the assumption that there are no communities in the graph.
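A quick numerical illustration of the problem: in the hypothetical graph below, two hubs are joined to each other and to ten private leaves each, so the usual null-model term $k_i k_j/(2m)$ for the pair of hubs exceeds 1 and cannot be read as a probability. The graph and the function name are assumptions made for this example only.

```python
def null_model_term(edges, i, j):
    """Approximate expected number of edges between i and j: k_i * k_j / (2m)."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree[i] * degree[j] / (2 * m)


# Hypothetical unbalanced graph: hubs "a" and "b", each with 10 leaves, plus edge a-b.
edges = [("a", "b")]
edges += [("a", f"a{t}") for t in range(10)]
edges += [("b", f"b{t}") for t in range(10)]

term = null_model_term(edges, "a", "b")
print(term)  # 11 * 11 / 42, roughly 2.88, which is larger than 1
```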

In [21], the random occurrence of a graph with no communities is calculated through the configuration model. The configuration model can be interpreted as the process of building a random graph with no communities through the following operations. Every arc e = (i, j) of the empirical graph G = (V, E) is cut into two parts, say l1 and l2, called stubs, with l1 incident to i and l2 incident to j. Next, two different stubs are selected at random and paired, and this is repeated until all stubs are paired. If l1 and l2 are two paired stubs, we say that (l1, l2) is a match, i.e. an arc of the random graph G′ = (V, E′). The way in which G′ is built implies that the adjacency degree ki remains unchanged for all i, but any existing communities are broken by the random pairing of stubs. Note that, by construction, we can interpret any occurrence of G′ as a matching of 2m stubs. The process is exemplified in Fig 2.
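The stub-matching process can be simulated in a few lines. The sketch below draws one uniformly random matching of the stubs by shuffling them and pairing consecutive entries; this sampling shortcut, the degree sequence, and the function names are choices made for illustration. Self-loops and parallel arcs may appear, as in any stub matching.

```python
import random


def configuration_model(degrees, rng):
    """Return one random stub matching as a list of arcs (u, v).

    degrees: dict mapping node -> adjacency degree k_i (the sum must be even).
    rng: a random.Random instance, so that runs are reproducible.
    """
    stubs = [node for node, k in degrees.items() for _ in range(k)]
    if len(stubs) % 2:
        raise ValueError("the sum of the degrees must be even")
    rng.shuffle(stubs)  # a uniform shuffle followed by consecutive pairing
    return [(stubs[t], stubs[t + 1]) for t in range(0, len(stubs), 2)]


def degrees_of(arcs):
    """Recompute the degree of every node from a list of arcs."""
    counts = {}
    for u, v in arcs:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


degrees = {"a": 3, "b": 2, "c": 2, "d": 1}  # hypothetical degree sequence, m = 4
arcs = configuration_model(degrees, random.Random(0))
print(arcs, degrees_of(arcs) == degrees)
```

Whatever matching is drawn, the recomputed degrees coincide with the input degrees, which is the invariance property stated in the text.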

Here we show how to compute exactly the expected values of expression (4) using the configuration model. Expected weights depend on the partition ratio Pij and the neighbourhood ratio CNij of the random graphs obtained from the configuration model. By construction, the partition ratio Pij of the random graph is the same as that of the empirical graph, but the neighbourhood ratio CNij is different.

To calculate CNij, we introduce some notation. Recall that ki is the adjacency degree of node i and assume that the graph has m edges. Let Padjacency(ki, kj, m) be the probability that nodes i and j are connected by an arc, let Pcommon neighbour(ki, kj, kr, m) be the probability that i and j are both connected by an arc to r, so that r is a common neighbor, and let Ptriangle(ki, kj, kr, m) be the probability that i and j are connected by an arc and are also both connected to r, so that the three arcs form a triangle. The notation emphasizes that the probabilities depend on the adjacency degrees ki, kj, kr and the total number of edges m. In the following proposition, we derive closed-form expressions for the above probabilities.

Proposition 1. Let i, j, r be three nodes with adjacency degrees ki, kj, kr, respectively. Then, in the random graph configuration model (22) (23) (24)

Proof. Applying the configuration model to G = (V, E), we obtain two stubs l1 and l2, adjacent to i and j, respectively, for every arc e(i, j) ∈ E. Then, we select two stubs at random and pair them until a random graph G′ is obtained. Note that, from construction, we can interpret any occurrence of G′ as a matching of 2m stubs.

Given $i, j \in V$, let $S_i = \{l_i(1), …, l_i(k_i)\}$ be the set of stubs adjacent to i and $S_j = \{l_j(1), …, l_j(k_j)\}$ be the set of stubs adjacent to j. A set of 2m elements admits $(2m)!/(2^m m!)$ different matchings, see [23]. Therefore, if two stubs $l_1 \in S_i$ and $l_2 \in S_j$ are matched, there are $(2m-2)!/(2^{m-1}(m-1)!)$ different matchings of the remaining stubs, because there are still 2m − 2 stubs to pair. Hence, the probability that two stubs $l_1$ and $l_2$ are joined, connecting nodes i and j, is:

$$\frac{(2m-2)!}{2^{m-1}(m-1)!} \cdot \frac{2^m m!}{(2m)!} = \frac{1}{2m-1}. \tag{25}$$
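The counting argument can be verified by enumeration on a small stub set. The sketch below lists every perfect matching of 2m labelled stubs, compares the count with $(2m)!/(2^m m!)$, and checks that a fixed pair of stubs is matched in a fraction 1/(2m − 1) of them. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def count_matchings(n_stubs):
    """Number of perfect matchings of an even number of labelled stubs."""
    m = n_stubs // 2
    return factorial(n_stubs) // (2 ** m * factorial(m))


def enumerate_matchings(stubs):
    """Yield every perfect matching as a list of pairs (recursive enumeration)."""
    stubs = list(stubs)
    if not stubs:
        yield []
        return
    first, rest = stubs[0], stubs[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in enumerate_matchings(remaining):
            yield [(first, partner)] + tail


def pair_probability(n_stubs, l1, l2):
    """Exact probability that stubs l1 and l2 are matched to each other."""
    matchings = list(enumerate_matchings(range(n_stubs)))
    hits = sum(1 for match in matchings if (l1, l2) in match or (l2, l1) in match)
    return Fraction(hits, len(matchings))


print(count_matchings(6), pair_probability(6, 0, 1))  # 15 matchings, probability 1/5
```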

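Next, for every pair of stubs $l_1 \in S_i$ and $l_2 \in S_j$, we introduce the random variable $X_{l_1 l_2}$, equal to 1 if $l_1$ and $l_2$ are matched and 0 otherwise.

Obviously, the probability of $X_{l_1 l_2} = 1$ is $1/(2m-1)$, as stated in (25). We can express the number of edges between two nodes i and j as the sum:

$$\sum_{l_1 \in S_i} \sum_{l_2 \in S_j} X_{l_1 l_2}.$$

The above expression is the sum of the variables whose indices are one stub adjacent to i and another stub adjacent to j. Thus, the expected number of edges between i and j is:

$$\sum_{l_1 \in S_i} \sum_{l_2 \in S_j} E[X_{l_1 l_2}] = \frac{k_i k_j}{2m-1}.$$

Note that in the modularity function (20), this value is approximated by $k_i k_j/(2m)$.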

As explained above, the expected number of edges is different from the probability of adjacency. Two nodes i and j are adjacent if there is at least one arc between them, so adjacency can be expressed as the union of the events $A_{l_1 l_2}$, "stubs $l_1$ and $l_2$ are matched", with $l_1 \in S_i$ and $l_2 \in S_j$. So, the adjacency probability of two nodes i and j is:

$$P_{\text{adjacency}}(k_i, k_j, m) = P\Big( \bigcup_{(l_1, l_2) \in S_i \times S_j} A_{l_1 l_2} \Big). \tag{26}$$

Let $\mathcal{T}_t$ be the set of all the different subsets of $S_i \times S_j$ with size t. Applying the inclusion-exclusion principle for the probability of a union of events to expression (26), it follows that:

$$P_{\text{adjacency}}(k_i, k_j, m) = \sum_{t=1}^{k_i k_j} (-1)^{t+1} \sum_{T \in \mathcal{T}_t} P\Big( \bigcap_{(l_1, l_2) \in T} A_{l_1 l_2} \Big). \tag{27}$$

By construction of the random graph G′, observe that the intersection of t different events $A_{l_1 l_2}$, each representing the match between stubs $l_1$ and $l_2$, is empty if the same stub, $l_1$ or $l_2$, appears in more than one match. Therefore, for each t, the non-empty intersections that appear in (27) are matchings with t matches. As a consequence, the summation over t is bounded by $\min\{k_i, k_j\}$, because any collection of more than $\min\{k_i, k_j\}$ different events must repeat some stub, and so its intersection is empty. Moreover, applying the same argument used to calculate the probability of joining two stubs (25), the probability of joining t stubs from $S_i$ with t stubs from $S_j$ is:

$$\prod_{s=1}^{t} \frac{1}{2m - 2s + 1}.$$

Finally, we need to calculate the number of different subsets of $S_i \times S_j$ of size t that do not repeat any stub. We have to consider t stubs from $S_i$ and t from $S_j$, and then all the possible matchings between the stubs of the two sets. There are $\binom{k_i}{t}$ different subsets of t stubs from $S_i$ and $\binom{k_j}{t}$ different subsets of t stubs from $S_j$. We can match the t stubs of one set with the t stubs of the other set in t! different ways, obtaining the following expression for the probability that nodes i and j are connected:

$$P_{\text{adjacency}}(k_i, k_j, m) = \sum_{t=1}^{\min\{k_i, k_j\}} (-1)^{t+1} \binom{k_i}{t} \binom{k_j}{t}\, t! \prod_{s=1}^{t} \frac{1}{2m - 2s + 1}. \tag{28}$$

This is the expression in (22) for Padjacency(ki, kj, m).
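The inclusion-exclusion argument can be cross-checked numerically. The sketch below evaluates the alternating sum obtained from the derivation above (t matched stub pairs chosen in C(k_i, t)·C(k_j, t)·t! ways, each with probability 1/((2m−1)(2m−3)⋯(2m−2t+1))) and compares it with an exhaustive enumeration of all stub matchings. The formula is written as reconstructed from the argument in the text, and the degree sequence and function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def p_adjacency(k_i, k_j, m):
    """Adjacency probability of two nodes via inclusion-exclusion over stub pairs."""
    total = Fraction(0)
    for t in range(1, min(k_i, k_j) + 1):
        denom = 1
        for s in range(1, t + 1):
            denom *= 2 * m - 2 * s + 1
        ways = comb(k_i, t) * comb(k_j, t) * factorial(t)
        total += Fraction((-1) ** (t + 1) * ways, denom)
    return total


def enumerate_matchings(stubs):
    """Yield every perfect matching of the stubs as a list of pairs."""
    stubs = list(stubs)
    if not stubs:
        yield []
        return
    first, rest = stubs[0], stubs[1:]
    for idx, partner in enumerate(rest):
        for tail in enumerate_matchings(rest[:idx] + rest[idx + 1:]):
            yield [(first, partner)] + tail


def p_adjacency_bruteforce(degrees, i, j):
    """Fraction of stub matchings containing at least one arc between i and j."""
    owner = [node for node, k in degrees.items() for _ in range(k)]
    hits = total = 0
    for matching in enumerate_matchings(range(len(owner))):
        total += 1
        if any({owner[a], owner[b]} == {i, j} for a, b in matching):
            hits += 1
    return Fraction(hits, total)


degrees = {"i": 2, "j": 2, "r": 1, "s": 1}  # hypothetical degrees, m = 3
print(p_adjacency(2, 2, 3), p_adjacency_bruteforce(degrees, "i", "j"))  # both 2/3
```

The enumeration grows as (2m − 1)!!, so it is only a sanity check for very small m; the closed form is what would be used in practice.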

Now, we use (28) and the previous arguments to obtain the probability that i and j are both connected with a different node r, namely Pcommon neighbour(ki, kj, kr, m); that is, we compute the probability of the intersection of the event "nodes i and r are connected" with the event "nodes j and r are connected": (29)

Finally, proceeding as before, the probability that three nodes i, j and r are all connected to each other, namely Ptriangle(ki, kj, kr, m), is: (30)

The above probabilities are necessary to determine the exact value of the expected weight E[Wij], when weights are defined as in formula (4) and the graph is obtained by the configuration model.

Define the following random variables: for every pair of nodes i, j ∈ V, let Yij = 1 if nodes i and j are adjacent in the random graph, and Yij = 0 otherwise.

Theorem 1. Assume that weights between nodes i and j are defined as in (4), then the expected weight E[Wij] between nodes i and j of the random graph configuration model is given by the following expressions:

  1. If ki = 1 or kj = 1, (31)
  2. If ki > 1 and kj > 1, (32)

Proof. We can express the weights (4) depending on the cases as follows.

If ki = 1 or kj = 1:

Observe that if the term ∑rV\{i,j} YirYjr = 0 then since the adjacency degree of i or j is one, i and j must be connected and therefore Yij = 1. Thus, the expression above results in YijPij. Otherwise, if ∑rV\{i,j} YirYjr ≠ 0 again since the adjacency degree of i or j is one, Yij = 0 and the expression above simplifies to . Hence, we obtain that

Next, we compute the expected values of the previous expression: (33) and the result follows because the expression above coincides with (31).

If ki > 1 and kj > 1:

Then, the expected value of the expression above is: (34)

Now, we observe that

Finally, substituting the probabilities that appear in (33) and (34) with the expressions in (22), (23) and (24), one obtains the result.

New models for detecting communities using weighted graph modularity games.

In the previous section, we showed that the optimal solution provided by formulation FShJK for the analyzed instances was the grand coalition except one node. Since solutions of this type are meaningless for detecting overlapping communities, in this section we provide an alternative model that takes advantage of Theorem 1. Specifically, we propose to define another modularity game (V, φ), in which the characteristic function φ is as in (1), but the weights are defined as:

$$W^*_{ij} = W_{ij} - E[W_{ij}], \tag{35}$$

where $E[W_{ij}]$ is the expected weight given by Theorem 1. Observe that the game is non-convex, as $W^*_{ij}$ can take both positive and negative values.

To calculate the overlapping communities through the coalition stability of a modularity game, the objective function of formulation FShJK must be modified according to Eq (35). Moreover, to avoid double counting in the new objective function (induced by pairs of nodes that belong together to more than one community), for any 1 ≤ i < j ≤ n the following binary variables are introduced: $y_{ij} = 1$ if nodes i and j belong to at least one common community, and $y_{ij} = 0$ otherwise.

Observe that, had we used y-variables in model FShJK, the same solution would have been obtained, because all the weights are non-negative and the grand coalition would again have been the optimal solution.

The final formulation of this model is: (36) s.t.: (5)–(7), (10), (13), (17), (18) (37) (38) (39) (40) (41)

The objective function (36) sums the weights between nodes of the same community only once. In this way, it cannot be the case that a community is a proper subset of another, because its profit would be null. Then, constraints (9), (14), (15), (16) and (19) that were discussed previously are not necessary. With (37) we guarantee that communities are stable for the new weights W*. If xik = 1, then (37) is equivalent to (8). Constraints (38) impose that each node cannot belong to more than p different communities, with p a fixed parameter established by the user. Constraints (39) and (40) impose that yij = 1 if and only if there is a community k to which i and j belong to. Finally, constraints (41) defines our variables as binary, but, from the arithmetic of the model, we can relax them as continuous variables (yij ∈ [0, 1]) because in any case they can take only 0,1 values. The notation stands for the fact that the condition of stability is determined by the Shapley value of a modularity game with weights . In some experimental cases, it is interesting to compare the contribution of Theorem 1 over the approximations , see (21), and therefore, we will refer as to the model in which are replaced by .

The following experiments will highlight differences between models and , and differences between overlapping and non-overlapping communities models. The experiments are run in Python using the Gurobi solver.

In the first two examples we will show that models , i.e. the exact model, and models , i.e. the approximation, compute different communities, even though they are run with the same parameters and the network size is small. From the tests, we can argue that the contribution of Theorem 1 is substantial.

We apply models and to Zachary’s karate club network [24] and compare the results with those obtained in [15]. That paper finds three overlapping communities, so we fix nc = 3 and p = 2. In Fig 3, each community is represented by the color grey, black or blue and the intersection nodes by red.

Fig 3. Zachary’s karate club community structures.

Community structures obtained by (a) [15], (b) with parameters nc = 3, p = 2, (c) with parameters nc = 3, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g003

Fig 3a and 3c are similar. The only difference is that model detects node 12 as an intersection. This is reasonable, because node 12 is only connected to the other intersection node and shares neighbours with both communities, black and blue. The structure obtained by model is also similar, but detects more intersection nodes, which have connections with different communities and share neighbours with them. The results highlight that there can be differences between the exact and the approximate models, even when they are applied to small graphs.

Next, we analyze models and with other parameters. First, we fix p = 1, so that communities cannot overlap, and we obtain the results in Fig 4.

Fig 4. Zachary’s karate club disjoint community structures.

Community structures obtained by (a) with parameters nc = n, p = 1, (b) with parameters nc = n, p = 1.

https://doi.org/10.1371/journal.pone.0283857.g004

As can be seen, in both cases nodes that belong to the same community have high edge density between them and many common neighbours, even though the two communities in Fig 4a can be further split, as seen in Fig 4b. There, communities have higher edge density, but fewer common neighbours. This highlights the fact that equation (4) combines two criteria, namely density of common neighbours and number of connections, and the researcher must consider a trade-off between them. Letting communities overlap partially avoids this trade-off: with parameters nc = 4 and p = 2, we obtain the results in Fig 5.

Fig 5. Zachary’s karate club community structures.

Community structures obtained by (a) with parameters nc = 4, p = 2, (b) with parameters nc = 4, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g005

Figs 3b and 5a are similar. The intersection nodes found previously (Fig 3b) are also intersection nodes in Fig 5a with the new parameters. Nevertheless, some other intersection nodes appear that are brought about by the new fourth community of the clustering. Note that the communities in Fig 5a are quite different from those of Fig 5b, especially as regards the intersection nodes. As remarked before, this implies that the differences between the exact and the approximate model are substantial.

Next, we apply models and to the zebra communication network, see [25]. First, model is run with p = 1 and results are in Fig 6a. Results of model are the same. Results of models and with parameters p = 2 and nc = 3 are in Fig 6b and 6c respectively. The former model does not detect any overlapping community, suggesting that they are well separated, while the latter model identifies node 20 as belonging to two communities. Since this model is actually an approximation of the real data, it is likely that the role of node 20 has been mistaken, since the communities seem to be separated.

Fig 6. Zebra community structures.

Community structures obtained by (a) with parameters nc = n, p = 1, (b) with parameters nc = 3, p = 2, (c) with parameters nc = 3, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g006

The following two examples compare the communities found by model when communities i) cannot overlap (p = 1); ii) can overlap (p > 1). It will be seen that allowing overlapping communities reveals nodes that are structurally different from others, forming the bulk of a core/periphery separation.

First, we apply the model to the Highland tribes network, see [26]. First, model is run with p = 1 and results are in Fig 7a. There, it can be seen that, if no overlapping communities are allowed, then the model detects one community composed of all the nodes. Conversely, model is run with parameters nc = 3 and p = 2, and results are reported in Fig 7b. It can be seen that the different roles of the nodes emerge. There, three communities of different size have been detected, with some nodes (the red ones) belonging to more than one community and forming the core of the system of alliances.

Fig 7. Highland tribes community structures.

Community structures obtained by (a) with parameters nc = n, p = 1, (b) with parameters nc = 3, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g007

Next, we apply model to the Windsurfers network, see [27]. Run with parameter p = 1, the model detected the two communities reported in Fig 8a. Run with parameters nc = 2 and p = 2, the model detected the communities reported in Fig 8b. As can be seen, the results with overlapping communities are a refinement of the disjoint communities. Nodes that are in the border between the two groups are highlighted as members of both, forming the bulk of a core/periphery network segmentation.

Fig 8. Windsurfers community structures.

Community structures obtained by (a) with parameters nc = n, p = 1, (b) with parameters nc = 2, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g008

To summarize our findings, the test of models on four typical benchmark networks revealed:

  • Results between and are different. As the latter is an approximation of the former, it reveals that the contribution of Theorem 1 to model development is substantial.
  • Results between non-overlapping and overlapping community models are different. The latter can reveal not only group membership, but also nodes that could act as potential bridges between communities.

Local Stability Exploration: A heuristic algorithm to detect overlapping communities

Problems and are Integer Linear Programming (ILP) models whose solution times can be impractical when the instances are large. This is to be expected when dealing with an NP-hard problem such as community detection. Nevertheless, for large instances the ILP formulation can be used to devise heuristic algorithms that approximate the optimal solution in short computing time. Here we propose a method based on local search, which we call Local Stability Exploration (LSE). Suppose that a set of feasible communities is given; we call such Π an incumbent solution. Π is feasible when it satisfies the ILP model constraints, that is: i) every node belongs to at least one community; ii) there is no strict inclusion between communities, that is, there are no k ≠ r such that Sk ⊂ Sr; iii) no node exceeds the maximum number of communities to which it can belong, i.e. ∀ i ∈ V the inequality |{k = 1, …, nc: i ∈ Sk}| ≤ p is fulfilled; and iv) all communities are stable. Next, we try to modify Π to obtain a new feasible solution Π′ with an improved objective function. We consider three possible modifications of Π, obtained by moves that are called Add, Remove, and Swap. Add joins a node to a community, allowing in this way multiple community assignments. Remove takes a node away from a community. Swap switches two nodes between two communities. These moves are applied if and only if the resulting Π′ is feasible. That is, after a move it must not occur that 1) a node does not belong to any community; 2) a node belongs to more than the maximum allowed number p of communities; 3) one community is included in another; or 4) a modified community is not stable.
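The four feasibility conditions can be checked directly on a candidate cover. The sketch below is a minimal illustration under stated assumptions: communities are Python sets, and stability is delegated to a hypothetical callback `is_stable`, because the stability test itself depends on the Shapley-value constraints of the model and is not reproduced here.

```python
def is_feasible(communities, nodes, p, is_stable=lambda community: True):
    """Check conditions i)-iv) for a cover given as a list of sets.

    is_stable is a hypothetical placeholder for the model's stability test.
    """
    nodes = set(nodes)

    # i) every node belongs to at least one community
    if set().union(*communities) != nodes:
        return False

    # ii) no community is strictly included in another one
    for a_idx, a in enumerate(communities):
        for b_idx, b in enumerate(communities):
            if a_idx != b_idx and a < b:  # '<' on sets means proper subset
                return False

    # iii) no node belongs to more than p communities
    for i in nodes:
        if sum(i in s for s in communities) > p:
            return False

    # iv) every community is stable (delegated to the callback)
    return all(is_stable(s) for s in communities)


ok = is_feasible([{0, 1, 2}, {2, 3}], range(4), p=2)
too_many = is_feasible([{0, 1, 2}, {2, 3}], range(4), p=1)
nested = is_feasible([{0, 1, 2}, {1, 2}, {3}], range(4), p=2)
uncovered = is_feasible([{0, 1}], range(3), p=1)
```

A move (Add, Remove or Swap) would be accepted only if the cover it produces passes this check.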

For a feasible starting solution, the procedure is summarized in Algorithm 1. There, the triplet (i, k, 1) is the move of adding node i to community k, the triplet (i, k, 2) is the move of removing node i from community k, and the 5-tuple (i, k, i′, k′, 3) is the move of swapping nodes i and i′ between communities k and k′. From Line 9 to Line 22 all feasible moves are considered. In Lines 12, 15 and 20 the increases of the objective function are calculated using the following notation: let Ci = {k ∈ {1, …, nc}: i ∈ Sk}, that is, Ci is the index set of the communities to which i belongs; then the objective function can be written as:

Note that Ci ∩ Cj ≠ ∅ is the condition that there is at least one community to which both i and j belong. However, for computational efficiency it is better to calculate just the increase of the objective function, as is done in Lines 12, 15 and 20. The new solution Π′ is the one that obtains the maximum increase. The algorithm stops when the condition of Line 42 applies, as there are no improvements and a local optimum has been reached.
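To make the idea of computing only the increase concrete, the following Python sketch derives the change in the objective for the Add and Remove moves from the condition Ci ∩ Cj ≠ ∅. The formulas are our own derivation from that condition, not a transcription of Lines 12 and 15; the names `delta_add`, `delta_remove` and the toy data are hypothetical.

```python
from itertools import combinations


def full_objective(w, C):
    """Sum w[i][j] over pairs i < j whose index sets C[i], C[j] intersect."""
    return sum(w[i][j] for i, j in combinations(range(len(C)), 2) if C[i] & C[j])


def delta_add(w, C, S, i, k):
    """Change in the objective when node i (not in S[k]) joins community k.

    Only pairs (i, j) with j in S[k] that share no community yet contribute.
    """
    return sum(w[i][j] for j in S[k] if j != i and not (C[i] & C[j]))


def delta_remove(w, C, S, i, k):
    """Change in the objective when node i (in S[k]) leaves community k.

    Only pairs (i, j) whose single shared community is k are lost.
    """
    return -sum(w[i][j] for j in S[k] if j != i and (C[i] & C[j]) == {k})


# Hypothetical toy instance: S[k] are communities, C[i] their index sets.
w = [[0, 2, 1, -1],
     [2, 0, 3, -2],
     [1, 3, 0, 1],
     [-1, -2, 1, 0]]
S = [{0, 1, 2}, {2, 3}]
C = [{0}, {0}, {0, 1}, {1}]

gain_add = delta_add(w, C, S, i=1, k=1)        # add node 1 to community 1
gain_remove = delta_remove(w, C, S, i=2, k=0)  # remove node 2 from community 0
```

Each delta touches only the members of one community, whereas recomputing the objective from scratch visits every pair of nodes.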

Algorithm 1 Local stability exploration algorithm

1: procedure Local stability exploration
2:       ⊳ Π is obtained by peculiar subroutines
3:  for i in V do
4:   Ci = {k ∈ {1, …, nc}: i ∈ Sk}
5:  end for
6:          ⊳ Objective function
7:  local_opt = FALSE        ⊳ Condition for a local optimum
8:  while local_opt = FALSE do
9:   Δ ← Feasible_Moves(Π)        ⊳ Δ: list of admissible moves for Π.
10:   for (i, k, d) in Δ do
11:    if d = 1 then
12:     
13:    end if
14:    if d = 2 then
15:     
16:    end if
17:   end for
18:   for (i, k, i′, k′, 3) ∈ Δ do
19:    if d = 3 then
20:     
21:    end if
22:   end for
23:   (i*, k*, d*) ∈ argmax{δikd | (i, k, d) ∈ Δ}        ⊳ Select the move that increases the most
24:   (i*, k*, i′*, k′*, d*) ∈ argmax{δiki′k′d | (i, k, i′, k′, d) ∈ Δ}        ⊳ Select the move that increases the most
25:   if δi*k*i′*k′*d* > max{0, δi*k*d*} then
26:    f ← f + δi*k*i′*k′*d*        ⊳ Update f
27:    Sk* ← Sk* ∪ {i′*} \ {i*}
28:    Sk′* ← Sk′* ∪ {i*} \ {i′*}        ⊳ Update Π
29:    Ci* ← Ci* ∪ {k′*} \ {k*}
30:    Ci′* ← Ci′* ∪ {k*} \ {k′*}
31:   else
32:    if δi*k*d* > 0 then
33:     f ← f + δi*k*d*       ⊳ Update f
34:     if d* = 1 then
35:      Sk* ← Sk* ∪ {i*}       ⊳ Update Π
36:      Ci* ← Ci* ∪ {k*}
37:     else
38:      Sk* ← Sk* \ {i*}       ⊳ Update Π
39:      Ci* ← Ci* \ {k*}
40:     end if
41:    else
42:     local_opt = TRUE
43:    end if
44:   end if
45:  end while
46:  return Π        ⊳ Return the local optimum
47: end procedure

It remains to comment on how feasible starting solutions can be obtained in Line 2 of Algorithm LSE. Depending on the problem, we tested various procedures. The first possibility is to start with a solution Π that is infeasible because it contains unstable communities. Algorithm LSE is then run without imposing that new solutions Π′ be stable, but once a feasible solution has been found, all subsequent solutions must remain feasible too. The first infeasible Π can be a random assignment to communities, but another possibility is to solve the problem for p = 1, that is, when overlapping is not allowed, as this problem is usually solved faster than the cases in which p > 1. Another possibility, used for the largest problems, is to solve by branch-and-bound, stop the search when the first feasible solution has been found, and use it as the starting solution in Line 2. All methods can be combined with any multi-start strategy, that is, repeating Algorithm 1 many times with different starting solutions to obtain sufficient diversification and exploration of the solution space. Finally, Algorithm LSE has been explained to solve model , but it can be applied to with straightforward modifications.
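A multi-start strategy of this kind can be sketched as a thin wrapper around any local search. The code below is a generic illustration, not the authors' implementation: `multi_start`, `make_start`, `local_search` and `objective` are hypothetical names, and the toy one-dimensional problem merely stands in for Algorithm 1 and the community objective.

```python
import random


def multi_start(make_start, local_search, objective, t_max=10, seed=0):
    """Run local_search from t_max starting solutions and keep the best result."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best, best_value = None, float("-inf")
    for _ in range(t_max):
        candidate = local_search(make_start(rng))
        value = objective(candidate)
        if value > best_value:
            best, best_value = candidate, value
    return best, best_value


# Toy stand-ins (hypothetical): maximise -(x - 3)^2 over the integers.
def toy_objective(x):
    return -(x - 3) ** 2


def toy_local_search(x):
    """Best-improvement hill climbing on the neighbours x - 1 and x + 1."""
    while True:
        neighbour = max((x - 1, x + 1), key=toy_objective)
        if toy_objective(neighbour) <= toy_objective(x):
            return x  # no improving move: local optimum
        x = neighbour


best, best_value = multi_start(
    lambda rng: rng.randint(-10, 10), toy_local_search, toy_objective
)
```

In the setting of the paper, `make_start` would correspond to one of the starting procedures described above (random assignment, the p = 1 solution, or a truncated branch-and-bound), and `t_max` to the number of starting solutions.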

A preliminary test of the quality of the LSE algorithm has been run on the previous networks. We run a multi-start version allowing tmax = 10 starting solutions each run. Results about computational times and solution quality for different parameters configurations are reported in Table 1. It can be seen that the LSE heuristic algorithm reduces the computing time significantly with respect to the ILP solution for both models and , while the optimal solution has been achieved in all the cases but one.

Moreover, we applied the LSE algorithm to some large-scale real data sets that are impractical for any ILP model, in order to test the scalability of our heuristic. The solved data sets are the American college football network with 115 nodes, see [28], the Jazz musician network with 198 nodes, see [29], and C. metabolic network with 453 nodes, see [30]. These real data sets are commonly used in the literature. The exact expected weights cannot be computed for graphs with a large number of edges, so we used the approximated expected weights . We report the results of our methods in Figs 9–11.

Fig 9. American college football community structure.

Community structure obtained by LSE heuristic with weights and parameters nc = 7, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g009

Fig 10. Jazz music community structure.

Community structure obtained by LSE heuristic with weights and parameters nc = 6, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g010

Fig 11. C. metabolic community structure.

Community structure obtained by LSE heuristic with weights and parameters nc = 10, p = 2.

https://doi.org/10.1371/journal.pone.0283857.g011

Results and discussion

We now analyze the main features of the ILP models , and the heuristic Algorithm 1 when they are applied to medium and large size networks, more precisely, whether they can detect the true overlapping communities of randomly generated networks, as is done in [31]. Random networks are generated using the procedure proposed in [16], with some variations to allow for communities that overlap. In particular, in our simulation we must distinguish between bridge and non-bridge nodes, the former being the nodes that belong to more than one community. The main parameters characterizing the simulated networks are:

  • N: the number of nodes.
  • nc: the number of communities.
  • p: the maximum number of communities to which a node can belong.
  • No: the number of nodes that belong to more than one community, that is, the bridges.

Next, communities are defined by the probability by which community nodes can establish a link between themselves. Those probabilities are controlled by parameters:

  • 1 − μ: fraction of links between non-bridge nodes belonging to the same community.
  • 1 − μo: fraction of links between bridge nodes and other nodes of the communities to which the bridge node belongs.
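The two fractions above can be measured on a generated network as a sanity check. The following Python sketch computes, for one node, the share of its links that fall inside its own communities, which plays the role of 1 − μ for a non-bridge node and of 1 − μo for a bridge node. The function name, the adjacency dictionary and the membership dictionary are hypothetical, and the generator itself (described in S1 Appendix) is not reproduced.

```python
def internal_fraction(adj, memberships, node):
    """Fraction of the node's links that reach nodes sharing a community with it."""
    neighbours = adj[node]
    if not neighbours:
        return 0.0  # convention for isolated nodes
    inside = sum(1 for v in neighbours if memberships[node] & memberships[v])
    return inside / len(neighbours)


# Hypothetical toy graph: communities A = {0, 1, 2} and B = {2, 3, 4};
# node 2 is a bridge and the edge 1-3 is the only cross-community link.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3, 4},
    3: {1, 2, 4},
    4: {2, 3},
}
memberships = {0: {"A"}, 1: {"A"}, 2: {"A", "B"}, 3: {"B"}, 4: {"B"}}

non_bridge_share = internal_fraction(adj, memberships, 1)  # two of three links are internal
bridge_share = internal_fraction(adj, memberships, 2)      # all links are internal
```

Averaging this quantity over the non-bridge nodes, and separately over the bridge nodes, gives empirical counterparts of 1 − μ and 1 − μo.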

There are other parameters characterizing the simulated networks, such as the number of arcs, the node degrees, the community sizes and so on, whose purpose is to simulate networks with the same characteristics as the empirical ones. We report all these features in S1 Appendix, with the pseudo-code describing our implementation of the algorithm of Lancichinetti et al. [16].

The solution quality of our models is measured comparing their results with the true community structures (known by simulation). True and estimated structure may differ for:

  • The community composition;
  • The identification of the bridge nodes.

The statistics to compare the community composition are:

  • the Normalized Mutual Information (NMI) index for overlapping partitions, presented in [32];
  • the Omega index (OI), presented in [33].

Both statistics range between 0 and 1, with values closer to 1 indicating strong correspondence between true and estimated communities.
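For readers unfamiliar with the Omega index, the sketch below implements its usual chance-corrected form: the fraction of node pairs that appear together in the same number of communities in both covers, adjusted by the agreement expected at random. We assume this is the definition of [33]; the function names and the toy covers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_counts(cover):
    """Count, for each pair i < j, the communities of the cover containing both."""
    counts = Counter()
    for community in cover:
        for pair in combinations(sorted(community), 2):
            counts[pair] += 1
    return counts  # pairs that never co-occur implicitly have count 0


def omega_index(cover_a, cover_b, nodes):
    """Chance-corrected agreement between two covers of the same node set."""
    pairs = list(combinations(sorted(nodes), 2))
    m = len(pairs)
    a, b = shared_counts(cover_a), shared_counts(cover_b)

    # Observed agreement: pairs sharing the same number of communities.
    observed = sum(1 for pair in pairs if a[pair] == b[pair]) / m

    # Expected agreement under independence of the two covers.
    hist_a = Counter(a[pair] for pair in pairs)
    hist_b = Counter(b[pair] for pair in pairs)
    expected = sum(hist_a[j] * hist_b[j] for j in hist_a) / (m * m)

    if expected == 1.0:
        return 1.0  # degenerate case: both covers give every pair the same count
    return (observed - expected) / (1.0 - expected)


truth = [{0, 1, 2}, {2, 3, 4}]
same = omega_index(truth, truth, range(5))
coarse = omega_index(truth, [{0, 1, 2, 3, 4}], range(5))
```

In this form, a value of 1 means perfect agreement and a value near 0 means agreement no better than chance.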

The statistics to compare the identification of bridge nodes are based on a set of indices which depend on the values of the confusion matrix associated to the identification of bridge nodes. Each element of the confusion matrix is defined as follows

  • True Positive (TP): Nodes successfully detected as bridge.
  • True Negative (TN): Nodes successfully detected as non-bridge.
  • False Positive (FP): Nodes wrongly detected as bridge.
  • False Negative (FN): Nodes wrongly detected as non-bridge.

Then, we consider the following indices.

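  1. the accuracy, defined as (TP + TN)/(TP + TN + FP + FN),
  2. the True Positive Rate (TPR): TP/(TP + FN),
  3. the False Positive Rate (FPR): FP/(FP + TN),
  4. the Area Under Curve (AUC): ,
  5. the Precision, defined as TP/(TP + FP),
  6. the F1 score: 2 · Precision · TPR/(Precision + TPR).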
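The indices above can be computed from the true and the detected sets of bridge nodes as in the following Python sketch. Accuracy, TPR, FPR, precision and F1 follow their standard definitions. The AUC formula used here, (1 + TPR − FPR)/2, is our assumption for a classifier that yields a single operating point; it may differ from the formula used in the paper. All function names and the toy data are hypothetical.

```python
def bridge_nodes(cover):
    """Nodes that belong to more than one community of the cover."""
    seen, bridges = set(), set()
    for community in cover:
        bridges |= seen & community
        seen |= community
    return bridges


def bridge_metrics(true_bridges, detected_bridges, nodes):
    """Confusion-matrix indices for bridge detection."""
    true_bridges, detected = set(true_bridges), set(detected_bridges)
    tp = len(true_bridges & detected)
    fp = len(detected - true_bridges)
    fn = len(true_bridges - detected)
    tn = len(set(nodes)) - tp - fp - fn

    def ratio(num, den):
        return num / den if den else 0.0  # convention for empty denominators

    tpr = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "TPR": tpr,
        "FPR": fpr,
        "AUC": (1 + tpr - fpr) / 2,  # ASSUMPTION: single-point AUC
        "precision": precision,
        "F1": ratio(2 * precision * tpr, precision + tpr),
    }


# Hypothetical example: 10 nodes, true bridges {0, 1}, detected bridges {1, 2}.
metrics = bridge_metrics({0, 1}, {1, 2}, range(10))
```

Here `bridge_nodes` can be applied to both the simulated and the estimated cover to obtain the two input sets.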

Test 1: Detecting non-overlapping communities: As a first test, we apply the ILP models , and the Algorithm LSE to the case in which communities do not overlap, that is, p = 1, to see whether the approximate results of algorithm LSE are reliable with respect to what is found by the respective optimal ILP models. The ILP solution of and can be obtained in short computational times only for moderate size networks, so we consider N = 40, 60 to solve within the time limit of 100 or 200 seconds respectively. The LSE heuristic has been run with tmax = 5 starting solutions, guaranteeing that its computational times are a fraction of those of the exact method.

For fixed N and nc, we let μ = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, as in [16] to control for the effect of mixing parameter. For each parameter set, either 50 or 100 random networks are generated and indices are calculated as averages on all instances. Results are reported in Table 2. The first two rows of this table give the ILP formulation ( or ) used in the corresponding method: exact (ILP) or (LSE) heuristic to provide an initial solution. The third row describes the parameters of the instances (N, nc, μ) and the index reported below (NMI or Omega). By columns, the layout of this table is organized in three blocks. The first one with three columns describes the instances. The next two blocks, each one with four columns, report the average values of the NMI and Omega indices for each combination of solution method. Results in bold report the best behaviour among similar index for the corresponding solution methods. One can easily observe that using formulation in the ILP or in the LSE heuristic provides better solutions than .

Table 2. Computational results about networks with non-overlapping communities.

https://doi.org/10.1371/journal.pone.0283857.t002

For each combination of parameters N and nc, the NMI and OI of each solution method are also shown as a function of μ in Figs 12–14 to compare the formulations and . The exact formulation obtains, in general, better NMI results and also better OI results in more cases than ; except for N = 40 and nc = 4. In this case, the behaviour of OI is similar in both formulations. However, also for N = 40, the exact solution of model is superior to the other two approaches, namely the heuristic LSE and the exact model , as the curves of the NMI and Omega statistics are above the others for most values of N, nc and μ.

Fig 12. Test results on non-overlapping communities, parameters N = 40, nc = 6.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g012

Fig 13. Test results on non-overlapping communities, parameters N = 40, nc = 4.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g013

Fig 14. Test results on non-overlapping communities, parameters N = 60, nc = 6.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g014

When μ is above the threshold 0.3, the solution quality of the method deteriorates for the joint effect of two factors: 1) communities are less well-separated, 2) exact solution has not been obtained within the considered time limit. However, this is not an actual drawback since for those parameter values, communities are essentially meaningless.

Test 2: Detecting overlapping communities on small networks: Networks with overlapping communities have been simulated with the same parameters used before, but now communities can overlap. We control the overlap with parameters p = {2, 3} and μo = {0.5, 0.7}. The choice of these parameters is justified since for p = 2 the smallest possible μo value is 0.5 and for p = 3 the smallest possible μo value is approximately 0.7. Moreover, the number of bridge nodes No is approximately 10% of all the nodes, and we change this value to assess how it affects the computational results. Problems with overlapping communities are harder to solve, therefore we limit the graph size to N = 40 and increase the time limit to 200 seconds. Table 3 reports the computational results with a layout similar to Table 2. It can be seen that the best values of both the Omega and NMI indices are obtained with the LSE heuristic, applied to the formulation. The LSE heuristic applied to provides the second best results (with a few exceptions in which it becomes the best one) and the third one is the ILP formulation. The poor performance of the ILP methods is due to the fact that they were not able to terminate the computation within the imposed time limit, so the solution that they provide is far from optimal. Results of Table 3 are graphically reported in Figs 15–19, where it can be seen that the purple and green curves, representing the LSE heuristics, are very close to each other and well above the results of the truncated ILP. It is also noteworthy that weights from improve the results of the ILP method.

Table 3. Computational results about networks with overlapping communities.

https://doi.org/10.1371/journal.pone.0283857.t003

Fig 15. Test results on overlapping communities, parameters p = 2, μo = 0.5, No = 1.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g015

Fig 16. Test results on overlapping communities, parameters p = 2, μo = 0.5, No = 3.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g016

Fig 17. Test results on overlapping communities, parameters p = 2, μo = 0.5, No = 5.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g017

Fig 18. Test results on overlapping communities, parameters p = 2, μo = 0.7, No = 3.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g018

Fig 19. Test results on overlapping communities, parameters p = 3, μo = 0.7, No = 3.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g019

Test 3: Detecting overlapping communities on large-scale networks: In the last experiment, we have applied the LSE heuristics, using both the and models, to the largest networks composed of 500 or 1000 nodes. As before, we control the overlap between communities with parameters p = {2, 3} and μo = {0.6, 0.7}, and the number of bridge nodes is No = {20, 50}.

In Table 4, we report the NMI and OI statistics calculated by the two methods. It can be seen that they have lower values than those obtained on the smallest networks, due to the fact that communities are harder to find. In most of the cases, model , in which weights are exact, obtains better indices than the approximated weights of . Results of Table 4 are reported in Figs 20–23. There, it can be seen that the green line is above the purple one in almost all cases.

Table 4. Computational results about large-scale networks with overlapping communities.

https://doi.org/10.1371/journal.pone.0283857.t004

Fig 20. Test results on overlapping communities, parameters N = 500, nc = 25, p = 2, μo = 0.6, No = 20.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g020

Fig 21. Test results on overlapping communities, parameters N = 500, nc = 25, p = 2, μo = 0.6, No = 50.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g021

Fig 22. Test results on overlapping communities, parameters N = 500, nc = 25, p = 3, μo = 0.7, No = 20.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g022

Fig 23. Test results on overlapping communities, parameters N = 1000, nc = 50, p = 2, μo = 0.6, No = 50.

(a) Average NMI for each solution method, (b) Average Omega for each solution method.

https://doi.org/10.1371/journal.pone.0283857.g023

We can compare models and in terms of detecting the network bridge nodes. We considered several statistics: accuracy, TPR, FPR, AUC, precision and the F1 score. They are collected in Table 5, which reports the average values of these metrics obtained by the two LSE heuristics. In all the simulations, the fraction of bridge nodes over all the nodes is less than 0.1. It implies that it is much easier to detect non-bridge nodes than bridge ones. Therefore, a method that selects the fewest bridge nodes has a numeric advantage in terms of accuracy, even though it may fail to classify the bridge nodes successfully. Looking at Table 5, one can observe that the greatest difference between -LSE and -LSE is on metrics FPR and TPR. Model obtains the best rate of true positives, model -LSE obtains the best rate of false positives. This means that model selects more bridge nodes, but some of them are not actually bridges. Conversely, can successfully detect most of the non-bridge nodes, resulting in higher accuracy just because the majority of nodes are actually non-bridge. However, this is a consequence of a method that takes less risk in detecting a node as a bridge. As far as the AUC is concerned, the results are very similar due to the existing balance between FPR and TPR of both methods.
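A short worked example shows why accuracy alone rewards a cautious method. Take one of the Test 3 configurations, N = 500 and No = 20, and consider a hypothetical classifier that labels every node as non-bridge (this extreme classifier is our illustration, not one of the tested models). Using the standard definitions of accuracy and TPR:

```latex
\begin{align*}
\mathrm{TP} &= 0, \quad \mathrm{FP} = 0, \quad \mathrm{FN} = N_o = 20, \quad \mathrm{TN} = N - N_o = 480,\\
\text{Accuracy} &= \frac{\mathrm{TP} + \mathrm{TN}}{N} = \frac{480}{500} = 0.96,\\
\mathrm{TPR} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} = \frac{0}{20} = 0 .
\end{align*}
```

The classifier reaches 96% accuracy without detecting a single bridge, which is why TPR, precision and F1 are needed alongside accuracy.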

Table 5. Computational results about large-scale networks with overlapping communities.

https://doi.org/10.1371/journal.pone.0283857.t005

These values confirm that the bridge nodes detected by model are more reliable than the ones detected by , due to the better precision values. Moreover, since the F1-score is equal to the harmonic mean between TPR and precision, also gets better results for this metric.

For the highest values of μ, it is more difficult to distinguish the non-bridge from the bridge nodes, which increases the number of false positives. So, the FPR increases, while the AUC, precision and F1 decrease. As in the previous experiments, both models -LSE and -LSE obtain the best results when μ is near 0 and when a bridge node belongs to many communities, as it is then easier to detect. In conclusion, detects more bridge nodes, so it obtains the highest TPR, but at the cost of incurring a higher number of false positives too, which leads to the worst accuracy.

Conclusion

In this paper, we proposed an Integer Linear Programming model to detect overlapping communities in a network. Our contribution identifies communities as stable coalitions and then selects the best of them with an optimization model. Peculiar to this approach is the definition of a weighted graph connection game and its characteristic function. Moreover, we introduced a null hypothesis in the spirit of the modularity function [1]: we compared the community node similarity of the actual graph with the node similarity of a random graph with no embedded communities, and in this way we could define a new similarity measure. Then, these similarities are used to define the non-convex cooperative game and the objective function of a maximization problem. Node similarities are obtained through the application of Theorem 1, or by a simplified formula, see (21), useful to reduce the computational complexity. Computational tests show that they find similar communities.

Future research can be devoted to defining stability with cooperative games other than graph connection games, which could depend on the actual social or economic activity that is taking place on the network. We could imagine matching or voting games, to name a few, that could readily be defined and applied to particular networks. Moreover, the implementation of the LSE heuristic, Algorithm 1, has been necessary to find solutions in a reasonable computation time, and we found that the stability property increased the problem complexity. As stable community structures are poorly analyzed in the literature, we expect that there is large room to improve our basic heuristic subroutines.

Finally, our extension of the procedure proposed in [16] to generate controlled overlapping communities can be used to validate any other method or algorithm. Testing algorithms is a big challenge and the generation of heterogeneous networks makes the comparison between algorithms easier. However, the wide range of parameter combinations complicates the issue, which calls for a general methodology to select the most appropriate scenarios.

Supporting information

S1 Appendix. Appendix: Random networks generation.

Random network generator based on [16] benchmark.

https://doi.org/10.1371/journal.pone.0283857.s001

(PDF)

References

  1. 1. Girvan M, Newman MEJ. Finding and evaluating community structure in networks. Phys Rev E. 2004;(69 (2), 026113). pmid:14995526
  2. 2. Fortunato S, Hric D. Community detection in networks: A user guide. Physics Reports. 2016;659:1–44.
  3. 3. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature. 2005;(435 (7043)). pmid:15944704
  4. 4. Xie J, Kelley S, Szymanski BK. Overlapping Community Detection in Networks: The State-of-the-Art and Comparative Study. Comput Surv. 2013;45(43):1–35.
  5. 5. Agarwal G, Kempe D. Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B. 2008;66(3):409–418.
  6. 6. Li Z, Zhang XS, Wang RS, Liu H, Zhang S. Discovering Link Communities in Complex Networks by an Integer Programming Model and a Genetic Algorithm. PLoS ONE. 2013;(8 (12), e83739). pmid:24386268
  7. 7. Bennett L, Kittas A, Liu S, Papageorgiou LG, Tsoka S. Community Structure Detection for Overlapping Modules through Mathematical Programming in Protein Interaction Networks. PLoS ONE. 2014;(9(11): e112821). pmid:25412367
  8. 8. Costa A, Ng TS, Foo LX. Complete mixed integer linear programming formulations for modularity density based clustering. Discrete Optimization. 2017;(25):141–158.
  9. 9. Zhang S, Wang RS, Zhang X. Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A. 2007;(374):483–490.
  10. 10. Nepusz T, Petroczi A, Negyessy L, Bazso F. Fuzzy Communities and the Concept of Bridgeness in Complex Networks. Physical Review E. 2008;77:16–107. pmid:18351915
  11. 11. Nicosia V, Mangioni G, Carchiolo V, Malgeri M. Extending the definition of modularity to directed graphs with overlapping communities. J Stat Mech Theory Exp. 2009;((03) (2009) P03024).
  12. 12. Chen D, Shang M, Fu Y. Detecting overlapping communities of weighted networks via a local algorithm. Physica A: Statistical Mechanics and its Applications. 2010;(389):4177–4187.
  13. 13. Chitra Devi J, Poovammal E. An Analysis of Overlapping Community Detection Algorithms in Social Networks. Procedia Computer Science. 2016;(89):349–358.
  14. 14. Benati S, Puerto J, Rodríguez-Chía AM, Temprano F. A mathematical programming approach to overlapping community detection. Physica A: Statistical Mechanics and its Applications. 2022;602:127628.
  15. 15. Jonnalagadda A, Kuppusamy L. A cooperative game framework for detecting overlapping communities in social networks. Physica A. 2018;491:498–515.
  16. 16. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008;78:046110. pmid:18999496
  17. 17. Demange G. Intermediate preferences and stable coalition structures. Journal of Mathematical Economics. 1994;23(1):45–58.
  18. 18. Carraro C, Marchiori C. Stable Coalitions. CEPR Press Discussion Paper No. 3258. 2002.
  19. 19. D’Aspremont C, Jacquemin A, Gabszewicz JJ, Weymark JA. On the Stability of Collusive Price Leadership. The Canadian Journal of Economics / Revue canadienne d’Economique. 1983;16(1):17–25.
  20. 20. Caparros A, Giraud-Héraud E, Hammoudi A, Tazdaït T. Coalition Stability with Heterogeneous Agents. Economics Bulletin. 2011;31(1):286–296.
  21. 21. Newman MEJ. Networks: an introduction. Oxford University Press; 2010.
  22. 22. Newman MEJ. Analysis of weighted networks. Phys Rev E. 2004;70(5):056131. pmid:15600716
  23. 23. Callan D. A combinatorial survey of identities for the double factorial. 2009.
  24. 24. Zachary WW. An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research. 1977;33(4):452–473.
  25. 25. Sundaresan SR, Fischhoff IR, Dushoff J, Rubenstein DI. Network metrics reveal differences in social organization between two fission-fusion species, Grevy’s zebra and onager. Oecologia. 2007;151:140–149. pmid:16964497
  26. 26. Read KE. Cultures of the central highlands, New Guinea. Southwestern Journal of Anthropology. 1954:1–43.
  27. 27. Freeman LC, Freeman SC, Michaelson AG. On human social intelligence. Journal of Social and Biological Structures. 1988;11:415–425.
  28. 28. Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826. pmid:12060727
  29. 29. Gleiser P, Danon L. Community Structure in Jazz. Advances in Complex Systems. 2003;6:565–573.
  30. 30. Jeong H, Tombor B, Albert R, Oltvai Z, Barabasi AL. The Large-Scale Organization of Metabolic Networks. Nature. 2000;407(6804):651–654. pmid:11034217
  31. 31. Tandon A, Albeshri A, Thayananthan V, Alhalabi W, Radicchi F, Fortunato S. Community detection in networks using graph embeddings. Phys Rev E. 2021;103:022316. pmid:33736102
  32. 32. Lancichinetti A, Fortunato S, Kertész J. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics. 2009;11(3).
  33. 33. Collins LM, Dent CW. Omega: A General Formulation of the Rand Index of Cluster Recovery Suitable for Non-disjoint Solutions. Multivariate Behavioral Research. 1988;23:231–242.