## Correction

2 Jan 2024: Benati S, Puerto J, Rodríguez-Chía AM, Temprano F (2024) Correction: Overlapping communities detection through weighted graph community games. PLOS ONE 19(1): e0296580. https://doi.org/10.1371/journal.pone.0296580

## Abstract

We propose a new model to detect the overlapping communities of a network that is based on cooperative games and mathematical programming. More specifically, communities are defined as stable coalitions of a weighted graph community game and they are revealed as the optimal solution of a mixed-integer linear programming problem. Exact optimal solutions are obtained for small and medium sized instances and it is shown that they provide useful information about the network structure, improving on previous contributions. Next, a heuristic algorithm is developed to solve the largest instances and used to compare two variations of the objective function.

**Citation: **Benati S, Puerto J, Rodríguez-Chía AM, Temprano F (2023) Overlapping communities detection through weighted graph community games. PLoS ONE 18(4): e0283857. https://doi.org/10.1371/journal.pone.0283857

**Editor: **José F. Vicent, University of Alicante: Universitat d’Alacant, SPAIN

**Received: **May 24, 2022; **Accepted: **March 19, 2023; **Published: ** April 4, 2023

**Copyright: ** © 2023 Benati et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All relevant data are within the manuscript and its Supporting information files.

**Funding: **The authors of this research Stefano Benati, Antonio Manuel Rodríguez-Chía, Justo Puerto and Francisco Temprano acknowledge financial support by the Spanish Ministerio de Ciencia y Tecnología, Agencia Estatal de Investigación and Fondos Europeos de Desarrollo Regional (FEDER) with grant number: PID2020-114594GB-C21, and Junta de Andalucía with grant number: P18-FR-1422. The authors Stefano Benati, Antonio Manuel Rodríguez-Chía and Justo Puerto also acknowledge partial support from: NetmeetData: Ayudas Fundación BBVA a equipos de investigación científica 2019 with reference "COMPLEX NETWORK". The author Antonio Manuel Rodríguez-Chía also acknowledges the European Regional Development Fund via projects with grant numbers: FEDER-UCA18-106895 and TED2021-130875B-I00; and the Spanish Ministerio de Ciencia y Tecnología, Agencia Estatal de Investigación with grant number: PID2020-114594GB-C22. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

The community detection problem consists in partitioning the node set of a network, or graph, in such a way that the node subsets can be meaningfully interpreted as communities. The methods proposed in the literature so far differ in two main aspects: the first is how the notion of community is translated into mathematical terms, the second is how an algorithm is implemented to output communities. For example, the classic contribution of [1] defines a community as a group of nodes with an arc density greater than what is expected under random pairing of nodes, and then proposes a method to find communities based on spectral decomposition. It is beyond our scope to mention all the contributions and developments that followed that seminal paper (see [2] for a comprehensive survey), so we focus on the two most important lines of research that motivate our contribution. The first innovation recognizes that in some cases it is too restrictive to impose a strict partition of the nodes, as some nodes may realistically belong to more than one community. So, communities can overlap and the solution structure is a node *assignment* to communities rather than a strict *partition*. A seminal contribution about overlapping communities can be found in [3] and a summary of early findings can be found in [4]. The second innovation is to formulate community detection as an optimization problem, with a clearly stated objective function and well-defined constraints. For example, in [5], the modularity model is developed into a quadratic integer program, corresponding to the well-known maximum clique partitioning problem. Other contributions can be found in [6–8].

The objective function is a simple statistic that evaluates partitions or node assignments. As such, it can be used to compare alternative community structures and to decide which is the most meaningful. One of the most popular statistics is modularity, see [1]. Modularity is an index that, for a given partition, compares the arc density of a subset with the one that is obtained under the assumption of random pairing of nodes. The higher the modularity, the more connected are the nodes within a community, allowing a clear substantive definition of what a community is. The extension of modularity to the case of overlapping communities has been proposed in [9], using fuzzy membership functions that are optimized using the fuzzy-*c*-means algorithm. This method has been elaborated further in [10–13], where the standard modularity function is modified by node or arc weights representing node affinity, fuzzy memberships, or other features. An alternative version of the objective function proposed in [9] is presented in [14], fixing some biases of the original one. In [7], it is proposed to maximize the modularity function, but with additional constraints that allow some nodes to belong to more than one community. These nodes are referred to as *bridges*.

In [15], communities are defined as stable coalitions of a cooperative game. In a cooperative game, a coalition is stable if no member gains by leaving the coalition to obtain a better payoff elsewhere, so a community is based on the concept of a common interest. There is ample room to define this common interest through any game characteristic function, such as market, voting, or matching games. To consider only the topological properties of the network, such as the arc density and the common neighbors of nodes, a weighted graph community game is proposed in [15], with arc weights defined through specific topological indicators. Next, an objective function is proposed to discriminate between alternative community structures and a constructive heuristic is implemented to find them.

In our contribution, we formulate the problem of finding communities as stable coalitions, as proposed in [15], as a mixed-integer linear programming problem. In this way, taking advantage of existing software, we can calculate the optimal communities of that model without resorting to any heuristic, and therefore evaluate its optimal solutions without the biases that a heuristic introduces. Indeed, we found that the communities proposed in [15] are far from the optimal ones and, unfortunately, that the optimal ones are inconsistent too, in the sense that they do not correspond to what one empirically expects to find. As will be discussed, we argue that the reason for the inconsistency lies in how the costs of the weighted graph community game are defined, and therefore we propose a correction to them. Our correction follows the spirit of the modularity function [1], in which the actual value of a statistic is compared to its expected value in the absence of any community structure. Our computational tests show that the correction is reliable and effective, as our method recognizes the hidden community structure of the networks. As a by-product of our contribution, we note that our cost definition relies on the calculation of the expected value of some network statistics under the assumption that no community is embedded in the network. To obtain an accurate cost estimate, we derived a new theorem to calculate the exact value of these statistics. It is worth noting that this theorem may be of independent interest for other applications in which exact probabilities can be applied, such as the seminal paper [1] itself.

To summarize, the contributions of our paper are the following:

- We provide a mathematical formulation of the method proposed by [15] to detect the overlapping communities of a network.
- We show that the communities obtained with this methodology are not the real communities embedded in the network, and we propose an amendment to the game cost function that corrects the bias.
- We propose a heuristic algorithm that can calculate the optimal communities when the exact method fails because of the network size.
- We apply our new mathematical model to real and artificial test problems and we show its effectiveness and reliability.

The paper is organized in 4 sections. In the Introduction, we motivate the purpose of the paper and summarize its contributions. In the Material and methods section, we formally introduce the overlapping community detection problem and the method proposed in [15]. There, we design the exact optimization model and observe that it finds inconsistent communities. In the subsection Detecting overlapping communities as stable coalitions of a cooperative game, we propose an alternative definition of the costs of the weighted graph community game that leads to a different objective function of the optimization model. In the Local Stability Exploration subsection, we present a heuristic algorithm for solving our model when the network is too large to compute the exact solution in a reasonable amount of time. In the Results and discussion section, we compare the exact and heuristic algorithms, and then we report computational results of a controlled experiment on graphs generated according to the method proposed in [16], showing that our method correctly recovers the community structure. The paper ends with some concluding remarks and directions for future research in the final section, Conclusion.

## Material and methods

### Detecting overlapping communities as stable coalitions of a cooperative game

In [15], a cooperative game on a weighted graph is defined to characterize overlapping communities. The nodes of a graph are considered as the players of a network game, and then the Shapley value is used to characterize stable coalitions, i.e. subsets of nodes in which no player has any incentive to leave. Specifically, the cooperative game (*V*, *φ*) is defined on the weighted graph *G* = (*V*, *E*), with *V* = {1, …, *n*}, i.e. players are nodes labeled from 1 to *n*, and weights *W*_{ij} (≥ 0) are defined for any edge (*i*, *j*) ∈ *E*. The game characteristic function is then:
$$\varphi(S) = \sum_{\substack{(i,j) \in E \\ i, j \in S,\ i < j}} W_{ij}, \qquad S \subseteq V. \tag{1}$$

That is, the value of coalition *S* is the weights sum of the edges of the subgraph induced by *S*. The model has been called Weighted Graph Community (WGC) Game in the aforementioned paper.

When a coalition *S* ⊆ *V* is about to form, its members *i* ∈ *S* can calculate the gain that they can get from it, i.e. their share of the payoff *φ*(*S*). A standard result of cooperative games is that this share is the Shapley value of the game restricted to *S*. For player *i* and coalition *S*, *i* ∈ *S*, the Shapley value is:
$$\mathrm{Sh}_i(S) = \frac{1}{2} \sum_{j \in S,\ j \neq i} W_{ij}.$$

Hence, the profit of player *i* from coalition *S* depends on the total weight of its connection with the other members of *S*.
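The statement that a player's share is half of the total weight of its connections inside the coalition can be checked numerically. The following Python sketch assumes a symmetric weight matrix `W` with zero diagonal (the matrix and the function names are illustrative, not taken from [15]) and compares the closed form with the Shapley value computed from its definition as an average of marginal contributions over all orderings of the coalition.

```python
from itertools import permutations
from math import factorial


def coalition_value(W, S):
    """phi(S): sum of the weights of the edges induced by S, each edge once."""
    S = sorted(S)
    return sum(W[i][j] for a, i in enumerate(S) for j in S[a + 1:])


def shapley_closed_form(W, S, i):
    """Half of the total weight linking player i to the other members of S."""
    return 0.5 * sum(W[i][j] for j in S if j != i)


def shapley_brute_force(W, S, i):
    """Average marginal contribution of i over all orderings of S."""
    total = 0.0
    for order in permutations(S):
        before = order[:order.index(i)]
        total += coalition_value(W, before + (i,)) - coalition_value(W, before)
    return total / factorial(len(S))


# Illustrative symmetric weights on four nodes (assumed values).
W_EXAMPLE = [
    [0, 2, 1, 0],
    [2, 0, 3, 0],
    [1, 3, 0, 4],
    [0, 0, 4, 0],
]
S_EXAMPLE = (0, 1, 2)

for player in S_EXAMPLE:
    closed = shapley_closed_form(W_EXAMPLE, S_EXAMPLE, player)
    brute = shapley_brute_force(W_EXAMPLE, S_EXAMPLE, player)
    assert abs(closed - brute) < 1e-12
```

The brute-force version is exponential in the coalition size and is only meant as a check on small examples; the closed form is what makes the model tractable.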

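In [15], a coalition *S* is defined as *stable* if no member of *S* gains by moving from coalition *S* to coalition *V* \ *S*. In mathematical terms, this occurs if and only if:
$$\frac{1}{2} \sum_{j \in S \setminus \{i\}} W_{ij} \ \ge\ \frac{1}{2} \sum_{j \in V \setminus S} W_{ij}, \qquad \forall\, i \in S. \tag{2}$$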

Actually, different definitions of stable coalitions can be found in the literature: *stable coalition structures* are defined in [17, 18], while in [19, 20] condition (2) is called the *internal stability property*. Moreover, the latter notion of stability imposes an additional property: a coalition *S* is stable only if no member of *S* gains by moving from *S* to any other subset *S*′ contained in *V* \ *S*. This can be formalized as:
$$\frac{1}{2} \sum_{j \in S \setminus \{i\}} W_{ij} \ \ge\ \frac{1}{2} \sum_{j \in S'} W_{ij}, \qquad \forall\, i \in S,\ \forall\, S' \subseteq V \setminus S. \tag{3}$$

However, we are not developing this issue further and we will remain with definition (2).
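As an illustration of the stability test retained here, the sketch below checks, for every member of a candidate coalition, that its Shapley value inside the coalition is at least its Shapley value in the complementary coalition. It assumes the Shapley value of a player equals half the weight of its links within a coalition; the example graph (two triangles joined by one edge, unit weights) and all names are hypothetical.

```python
def shapley(W, S, i):
    """Shapley value of player i in coalition S for a weighted graph game."""
    return 0.5 * sum(W[i][j] for j in S if j != i)


def is_stable(W, S):
    """True if no member i of S gains by moving to (V minus S) plus {i}."""
    V = set(range(len(W)))
    S = set(S)
    outside = V - S
    return all(shapley(W, S, i) >= shapley(W, outside | {i}, i) for i in S)


def unit_weight_matrix(n, edges):
    """Symmetric 0/1 weight matrix of an undirected graph (illustrative)."""
    W = [[0] * n for _ in range(n)]
    for a, b in edges:
        W[a][b] = W[b][a] = 1
    return W


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
TWO_TRIANGLES = unit_weight_matrix(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
)

print(is_stable(TWO_TRIANGLES, {0, 1, 2}))     # each triangle is stable
print(is_stable(TWO_TRIANGLES, {0, 1, 2, 3}))  # node 3 prefers the other side
```

In the second call node 3 has one link inside the coalition and two links outside, so the condition fails for that node.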

Formulating a WGC game allows a formal definition of the feasible overlapping communities of a network: as a node can belong to more than one stable coalition, communities can overlap. However, a crucial feature of the model is the way in which the weights *W*_{ij} are defined. In [15], the following formula is proposed. Let *k*_{i} be the *adjacency degree* of node *i* (i.e. the number of nodes to which *i* is connected through an arc), let *P*_{ij} be the *partition ratio*, and let *CN*_{ij} = (|common neighbors of *i* and *j*| + 1)*P*_{ij} be the *neighbourhood ratio* of *i*, *j* ∈ *V*. Then the weight of the arc (*i*, *j*), *i* ≠ *j*, is
(4)

The formula was proposed in [15] so that node similarity depends on both the direct and the indirect links between *i* and *j*. It is straightforward to observe that *W*_{ij} ≥ 0, and this property has important consequences for the structure of the stable coalitions, as will be discussed later. For the moment, we focus on the methodology to find all the stable coalitions of a network. While in [15] a constructive method is proposed, that is, a heuristic technique with some ad-hoc adjustments to find stable coalitions, here we propose a mathematical programming approach in which all the considerations about stability discussed in [15] are translated into an objective function and mathematical constraints. We will show that stable coalitions can be represented by linear constraints involving binary variables and then, using an appropriate objective function, stable coalitions can be determined by integer linear programming.

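Let *n*_{c} be the maximum number of communities to which a node can belong (this is not a binding constraint of the model, since *n*_{c} can be large enough to include all the feasible stable communities). For *i* = 1, …, *n* and *k* = 1, …, *n*_{c}, the model variables are:
$$x_{ik} = \begin{cases} 1, & \text{if node } i \text{ belongs to community } k, \\ 0, & \text{otherwise.} \end{cases}$$

For any *i*, *j* = 1, …, *n* such that *i* < *j* and *k* = 1, …, *n*_{c}:
$$z_{ijk} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ both belong to community } k, \\ 0, & \text{otherwise.} \end{cases}$$

The relationship between *x*- and *z*-variables is given by the logical/quadratic constraints *z*_{ijk} = *x*_{ik}*x*_{jk} for all *i*, *j* ∈ *V*, *i* < *j* and all *k* = 1, …, *n*_{c}. Then, the quadratic constraint can be replaced by the linear constraints:
$$z_{ijk} \le x_{ik}, \tag{5}$$
$$z_{ijk} \le x_{jk}, \tag{6}$$
$$z_{ijk} \ge x_{ik} + x_{jk} - 1, \tag{7}$$
for all *i*, *j* ∈ *V*, *i* < *j*, and all *k* = 1, …, *n*_{c}.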
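The replacement of a product of two binary variables by linear constraints can be verified by enumeration. The sketch below assumes the usual linearization (z at most each factor, and at least their sum minus one); the function names are illustrative. It lists, for every binary pair, the values of z that satisfy the three inequalities and confirms that the only admissible value is the product.

```python
from itertools import product


def satisfies_linearization(x_i, x_j, z):
    """Standard linear constraints modelling z = x_i * x_j for binaries."""
    return z <= x_i and z <= x_j and z >= x_i + x_j - 1


def admissible_z():
    """Map each binary pair (x_i, x_j) to the binary z values it allows."""
    return {
        (x_i, x_j): [z for z in (0, 1) if satisfies_linearization(x_i, x_j, z)]
        for x_i, x_j in product((0, 1), repeat=2)
    }


for (x_i, x_j), values in admissible_z().items():
    # Exactly one value of z is feasible, and it equals the product.
    assert values == [x_i * x_j]
```

The same inequalities also pin down a continuous z in [0, 1] once the x-variables are binary, which is why such auxiliary variables can usually be relaxed.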

Next, using binary *x*-variables, the stability condition (2) can be characterized by linear constraints too. First, for fixed *i* and *k*, consider the quadratic inequality:
$$x_{ik} \left( \frac{1}{2} \sum_{j \neq i} W_{ij}\, x_{jk} - \frac{1}{2} \sum_{j \neq i} W_{ij}\, (1 - x_{jk}) \right) \ge 0.$$

If *x*_{ik} = 1, then *i* belongs to coalition *S*_{k}, so that *S*_{k} must be stable. For stability, player *i*’s Shapley value from coalition *S*_{k} must be at least its Shapley value from the opposite coalition (*V* \ *S*_{k}) ∪ {*i*}. The term $\frac{1}{2} \sum_{j \neq i} W_{ij}\, x_{jk}$ is the Shapley value of player *i* in coalition *S*_{k}, as the *j*’s such that *x*_{jk} = 1 are the other players of coalition *S*_{k}. Conversely, the *j*’s such that (1 − *x*_{jk}) = 1 are the players excluded from *S*_{k}. Consequently, $\frac{1}{2} \sum_{j \neq i} W_{ij}\, (1 - x_{jk})$ is the Shapley value of player *i* in the opposite coalition, (*V* \ *S*_{k}) ∪ {*i*}. Finally, their difference must be greater than or equal to 0 for *S*_{k} to be stable. Next, the above quadratic inequality can be simplified to the following linear one:
(8)

Next, it must be imposed that overlapping coalitions/communities have non-empty difference, i.e. the same coalition is not selected more than once and a coalition is not contained in a different one. To prevent inclusion, additional variables *h* are introduced for *i* = 1, …, *n* and pairs *k*, *r* such that 1 ≤ *k* < *r* ≤ *n*_{c}:
$$h_{ikr} = \begin{cases} 1, & \text{if node } i \text{ belongs to community } r \text{ but not to community } k, \\ 0, & \text{otherwise.} \end{cases}$$

The relation between *x*- and *h*-variables is given by the quadratic constraint *h*_{ikr} = *x*_{ir}(1 − *x*_{ik}), which can be replaced by three linear constraints, as done for the *z*-variables in expressions (5)–(7).

To prevent the inclusion of *S*_{r} in *S*_{k}, it must be that:
$$\sum_{j=1}^{n} h_{jkr} \ \ge\ x_{ir}, \qquad i = 1, \dots, n,\quad 1 \le k < r \le n_c. \tag{9}$$

The constraint is binding when *x*_{ir} = 1. In that case, coalition *S*_{r} must contain at least one element *j* that is contained in *S*_{r} but not in *S*_{k}, guaranteeing that *S*_{r} ⊄ *S*_{k}.

To conclude, we also introduce inequalities to avoid symmetric solutions. Symmetric solutions decrease the efficiency of the Integer Linear Programming solver, as the same structural solution can be obtained by multiple assignments to the variables *x*, *z*, *h*, simply by giving different labels to the coalitions. Note that constraints (9) prevent the same coalition from being replicated, so it is sufficient to rank the communities from the largest to the smallest and to assign the labels *k* according to that ranking. The following constraints do the task:
(10)

Every stable coalition corresponds to a point of the polytope described by the equations and inequalities introduced so far. To determine the most meaningful overlapping communities, the Shapley value of the nodes is used in the objective function. If a coalition *S*_{k} is established, then player *i*’s Shapley value from coalition *S*_{k} is $\frac{1}{2} \sum_{j \neq i} W_{ij}\, x_{ik} x_{jk}$. Therefore, for a set of overlapping communities *S*_{k}, *k* = 1, …, *n*_{c}, the total Shapley value of a player *i* is the sum of the values it gets from every coalition, that is:
$$\sum_{k=1}^{n_c} \frac{1}{2} \sum_{j \neq i} W_{ij}\, x_{ik} x_{jk}. \tag{11}$$

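In [15], the most important overlapping coalitions are determined by maximizing the sum of the Shapley values of all nodes. Therefore, this index will be used as the objective function of the following integer programming formulation, denoted *F*_{Sh−JK}:
$$\max \ \sum_{k=1}^{n_c} \sum_{i < j} W_{ij}\, z_{ijk} \tag{12}$$
*s.t*.: (5)–(10),
$$\sum_{k=1}^{n_c} x_{ik} \ge 1, \qquad i = 1, \dots, n, \tag{13}$$
$$h_{ikr} \le x_{ir}, \qquad i = 1, \dots, n,\ 1 \le k < r \le n_c, \tag{14}$$
$$h_{ikr} \le 1 - x_{ik}, \qquad i = 1, \dots, n,\ 1 \le k < r \le n_c, \tag{15}$$
$$h_{ikr} \ge x_{ir} - x_{ik}, \qquad i = 1, \dots, n,\ 1 \le k < r \le n_c, \tag{16}$$
$$x_{ik} \in \{0, 1\}, \qquad i = 1, \dots, n,\ k = 1, \dots, n_c, \tag{17}$$
$$z_{ijk} \in \{0, 1\}, \qquad 1 \le i < j \le n,\ k = 1, \dots, n_c, \tag{18}$$
$$h_{ikr} \in \{0, 1\}, \qquad i = 1, \dots, n,\ 1 \le k < r \le n_c. \tag{19}$$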

The objective function (12) represents the sum of the Shapley values for all nodes and communities. Constraints (13) guarantee that every node belongs to at least one community. Constraints (14)–(16) are the linear representations of the *h*-variables. Finally, constraints (17) define binary variables. Note that in (18) and (19), we can relax the *z*− and *h*−variables to be continuous, since the constraints on the *x*-variables force both to be binary.
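To make the roles of the objective and of the constraints concrete, the following sketch evaluates a candidate family of communities by direct inspection instead of solving the integer program. It is only an illustration under stated assumptions: unit arc weights on a small hypothetical graph rather than the weights of [15], and function names of our own choosing. It checks coverage of all nodes, stability of each coalition, and absence of inclusion, and then returns the sum of the coalition values, which equals the sum of the Shapley values over nodes and communities.

```python
def coalition_value(W, S):
    """Sum of the weights of the edges induced by S, each edge once."""
    S = sorted(S)
    return sum(W[i][j] for a, i in enumerate(S) for j in S[a + 1:])


def is_stable(W, S):
    """Every member has at least as much weight inside S as outside S."""
    n = len(W)
    for i in S:
        inside = sum(W[i][j] for j in S if j != i)
        outside = sum(W[i][j] for j in range(n) if j not in S)
        if inside < outside:
            return False
    return True


def objective(W, communities):
    """Objective value of a list of node sets, or None if infeasible."""
    n = len(W)
    if set().union(*communities) != set(range(n)):
        return None  # some node belongs to no community
    for a, S in enumerate(communities):
        if not is_stable(W, S):
            return None
        if any(a != b and S <= T for b, T in enumerate(communities)):
            return None  # S is contained in (or equal to) another community
    return sum(coalition_value(W, S) for S in communities)


def unit_weight_matrix(n, edges):
    W = [[0] * n for _ in range(n)]
    for a, b in edges:
        W[a][b] = W[b][a] = 1
    return W


# Hypothetical test graph: two triangles joined by the edge (2, 3).
G = unit_weight_matrix(6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

intuitive = objective(G, [{0, 1, 2}, {3, 4, 5}])
near_grand = objective(G, [{1, 2, 3, 4, 5}, {0, 1, 2, 3, 4}])
print(intuitive, near_grand)
```

On this toy graph the two coalitions obtained by dropping one node from the whole node set score higher than the two triangles, even though the triangles are the communities one would expect.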

*F*_{Sh−JK} is the exact Integer Programming formulation of the model proposed in [15]. However, in that seminal paper the overlapping communities were computed through a heuristic constructive procedure, in which the search for optimal solutions is combined with various ad-hoc adjustments to induce sufficient diversification of coalitions. The advantage of Integer Programming is that the output coalitions of *F*_{Sh−JK} are exactly the optimal ones, without any bias due to constructive rule-of-thumb procedures. As we will see, this allows us to point out a drawback of the game definition and to suggest a method to adjust it.

We apply formulation *F*_{Sh−JK} to Zachary’s karate club network, fixing *n*_{c} = 3. The optimal overlapping communities are shown in Fig 1. As can be seen, each selected community is the grand coalition (all the nodes belong to the same coalition) except one node. That is, communities are subsets *S* such that |*S*| = *n* − 1, in which the discarded node is among those with the fewest connections. It is hard to believe that those sets are of any interest to researchers, as they are far from the communities that have often been identified in Zachary’s network. The same occurs with all the other problems we tested: the overlapping communities are the grand coalition except one node. The reason for this disappointing result is not the solution method, i.e. exact vs heuristic, or the community definition, i.e. using cooperative games and the Shapley value. Rather, the reason is the way in which the weights *W* are formulated in (4). As recognized in [15], if *W*_{ij} ≥ 0 for all *i*, *j*, then the cooperative game (*V*, *φ*) is convex, that is, for two coalitions *S*, *T* such that *S* ⊂ *T* and *i* ∉ *T*, it always occurs that:
$$\varphi(T \cup \{i\}) - \varphi(T) \ \ge\ \varphi(S \cup \{i\}) - \varphi(S).$$

Fig 1. (a) Community 1, (b) Community 2, (c) Community 3.

This property establishes that the marginal gain player *i* gets from joining a coalition is never smaller when the coalition is larger. Therefore the Shapley values are always greatest for the largest coalitions, and that is why the proposed method is *always* doomed to mistake the largest subsets for communities. As we have pointed out, the weakness lies not in using cooperative games to define stable coalitions, but in using *convex* cooperative games. In this section, we provide a simple and effective way to fix this weakness. Our proposal is based on determining stability using a *non-convex* cooperative game.
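The link between the sign of the weights and convexity can be checked exhaustively on a small example. The sketch below (hypothetical weights, illustrative function names) tests the convexity inequality for every pair of nested coalitions and every outside player: with non-negative weights the test passes, and making a single weight negative is enough for it to fail.

```python
from itertools import combinations


def phi(W, S):
    """Coalition value: sum of the weights of the edges induced by S."""
    S = sorted(S)
    return sum(W[i][j] for a, i in enumerate(S) for j in S[a + 1:])


def is_convex(W, tol=1e-12):
    """Check phi(T+i) - phi(T) >= phi(S+i) - phi(S) for all S in T, i not in T."""
    n = len(W)
    nodes = range(n)
    subsets = [frozenset(c) for r in range(n + 1) for c in combinations(nodes, r)]
    for T in subsets:
        for S in subsets:
            if not S < T:  # S must be a proper subset of T
                continue
            for i in nodes:
                if i in T:
                    continue
                gain_T = phi(W, T | {i}) - phi(W, T)
                gain_S = phi(W, S | {i}) - phi(W, S)
                if gain_T < gain_S - tol:
                    return False
    return True


# Assumed non-negative symmetric weights on four nodes.
W_NONNEG = [
    [0, 2, 1, 0],
    [2, 0, 3, 0],
    [1, 3, 0, 4],
    [0, 0, 4, 0],
]

# Same weights with one negative entry.
W_MIXED = [row[:] for row in W_NONNEG]
W_MIXED[0][1] = W_MIXED[1][0] = -1

print(is_convex(W_NONNEG), is_convex(W_MIXED))
```

The enumeration grows exponentially with the number of nodes, so it is meant only as a small sanity check of the property.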

#### The computation of the expected weight on an arc.

As we discussed in the previous section, weighted graph community games in which arc weights satisfy *W*_{ij} ≥ 0 are convex games, so they imply increasing Shapley values and a tendency to detect only large communities. A straightforward way of avoiding convexity is to consider an alternative set of weights, not necessarily non-negative, so that optimal stable coalitions of small size may emerge as well. Here, we propose to combine the weights defined by (4) with modularity, so that weights are normalized by their expected values and may take both negative and positive values. As a consequence, the resulting game is non-convex.

The modularity function, see [1], is a well-known index to detect communities in networks. The index compares the edge density of the empirical graph *G* = (*V*, *E*) (unweighted and undirected), |*E*| = *m*, with the expected edge density of a theoretical graph *G*′ = (*V*, *E*′) in which there are no communities by assumption. The expected edge density of *G*′ is calculated using a null hypothesis, i.e. an assumption about the edge distribution, that is called the *configuration model*, [21]. If the graph does not contain communities, then for any given two nodes *i* and *j* with degrees *k*_{i} and *k*_{j}, the expected number of edges between *i* and *j* is approximated by $\frac{k_i k_j}{2m}$. Let *A*_{ij} = 1 if (*i*, *j*) ∈ *E*, *A*_{ij} = 0 otherwise (so that *A* = [*A*_{ij}] is the adjacency matrix of *G*). Moreover, let Π be a partition of *V* and let *δ*(*i*, *j*) be the Kronecker delta: *δ*(*i*, *j*) = 1 if *i*, *j* ∈ *V* belong to the same community, *δ*(*i*, *j*) = 0 otherwise. Then the modularity function of a partition Π is:
$$Q(\Pi) = \frac{1}{2m} \sum_{i, j \in V} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(i, j). \tag{20}$$
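As a small worked example of the modularity index, the sketch below computes it for an unweighted graph and a given partition, using the standard form in which every within-community pair contributes its adjacency entry minus the product of the two degrees divided by twice the number of edges. The graph (two triangles joined by one edge) and the function names are illustrative assumptions.

```python
def modularity(A, communities):
    """Modularity of a partition, given a symmetric 0/1 adjacency matrix."""
    n = len(A)
    k = [sum(row) for row in A]      # node degrees
    m = sum(k) / 2                   # number of edges
    label = {}
    for c, nodes in enumerate(communities):
        for v in nodes:
            label[v] = c
    q = 0.0
    for i in range(n):
        for j in range(n):
            if label[i] == label[j]:  # Kronecker delta
                q += A[i][j] - k[i] * k[j] / (2 * m)
    return q / (2 * m)


def adjacency_matrix(n, edges):
    A = [[0] * n for _ in range(n)]
    for a, b in edges:
        A[a][b] = A[b][a] = 1
    return A


A_EXAMPLE = adjacency_matrix(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
)

q_two = modularity(A_EXAMPLE, [{0, 1, 2}, {3, 4, 5}])   # equals 5/14
q_one = modularity(A_EXAMPLE, [{0, 1, 2, 3, 4, 5}])     # equals 0
print(q_two, q_one)
```

Splitting the graph into its two triangles gives a positive modularity, while putting all nodes in one community gives zero, which is the behaviour that makes the index useful for comparing partitions.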

In the case under study, weights are defined through expression (4), in which the adjacency between nodes *i* and *j* is weighted by the common neighbors. However, modularity can be defined for weighted graphs as well. In the summation terms $A_{ij} - \frac{k_i k_j}{2m}$, entries *A*_{ij} are replaced by weights *W*_{ij}, *k*_{i} is replaced by the weight sum *W*_{i} = ∑_{j}*W*_{ij}, and *m* is replaced by *W* = ∑_{(i,j)∈E} *W*_{ij}, as described in [22]. In this way, modularity is still a function that compares the actual indices of an empirical graph with the expected indices of a random graph. Using modularity, we can define a modularity game (*V*, *φ*) as a weighted graph community game in which the characteristic function *φ* is defined as in (1), but with the following weights:
$$W_{ij} - \frac{W_i W_j}{2W}. \tag{21}$$

In this case, the weights (21) can take both positive and negative values, so that the game resulting from the characteristic function (1) is non-convex.
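The sign pattern of the modularity-adjusted weights can be seen on a small example. The sketch below applies the replacement rule described above (each weight minus the product of the two node weight sums divided by twice the total weight) to a hypothetical graph with unit weights; the function name and the data are illustrative only.

```python
def modularity_weights(W):
    """Weights of the modularity game: W_ij - W_i * W_j / (2 * W_total)."""
    n = len(W)
    strength = [sum(row) for row in W]   # W_i, the weight sum of node i
    total = sum(strength) / 2            # W, the sum of the weights over edges
    return [
        [
            0.0 if i == j else W[i][j] - strength[i] * strength[j] / (2 * total)
            for j in range(n)
        ]
        for i in range(n)
    ]


def unit_weight_matrix(n, edges):
    W = [[0] * n for _ in range(n)]
    for a, b in edges:
        W[a][b] = W[b][a] = 1
    return W


# Two triangles joined by the edge (2, 3), unit weights (assumed data).
W_UNIT = unit_weight_matrix(
    6, [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
)
W_MOD = modularity_weights(W_UNIT)

print(W_MOD[0][1])  # existing arc: positive
print(W_MOD[0][3])  # missing arc: negative
```

Pairs joined by an arc keep a positive weight, while pairs without an arc receive a negative one, so adding a poorly connected node to a coalition can now lower the coalition value.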

We elaborate this model further by noting that the modular term (21) should represent the difference between the empirical value *W*_{ij} and its expected value under the assumption that the graph does not contain any communities. Unfortunately, the term $\frac{W_i W_j}{2W}$ is only an approximation of the true expectation and this can cause unexpected biases. For example, when the weights *W*_{ij} correspond to the adjacency matrix *A*_{ij} ∈ {0, 1}, the term $\frac{k_i k_j}{2m}$ is an estimate of the probability of an arc between *i* and *j*, but, if the graph is unbalanced, the term can be greater than 1, which is a nonsensical estimate of a probability. In our application, expression (4) contains specific terms about the graph structure, such as the arcs and the common neighbours between two nodes, and the bias between the true expectation and its approximation can potentially be large. For this reason, we made a special effort to calculate the exact expression of the expected value of (4) under the assumption that there are no communities in the graph.

In [21], the random occurrence of a graph with no communities is calculated through the configuration model. The configuration model can be interpreted as the process of building a random graph with no communities through the following operations. Every arc *e* = (*i*, *j*) of the empirical graph *G* = (*V*, *E*) is cut into two parts, say *l*_{1} and *l*_{2}, called *stubs*, with *l*_{1} incident to *i* and *l*_{2} incident to *j*. Next, two different stubs are selected randomly and paired. We say that, if *l*_{1} and *l*_{2} are such stubs, then (*l*_{1}, *l*_{2}) is a match, i.e. an arc of the random graph *G*′ = (*V*, *E*′). The way in which *G*′ is built implies that the adjacency degree *k*_{i} remains unchanged for all *i*, but any existing communities are broken by the random pairing of stubs. Note that, by construction, we can interpret any occurrence of *G*′ as a matching of 2*m* stubs. The process is exemplified in Fig 2.
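The stub-matching process can be sketched in a few lines of Python. The version below is a minimal illustration, not the authors' implementation: it creates one stub per unit of degree, shuffles the stubs, and pairs consecutive ones, which yields a uniformly random matching of the stubs. As in the configuration model, the result may contain self-loops and repeated arcs; the function name and the degree sequence are assumptions made for the example.

```python
import random


def configuration_model(degrees, rng):
    """Return the arcs of a random multigraph with the given degree sequence."""
    # One stub per unit of degree, labelled by the node it is attached to.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    if len(stubs) % 2 != 0:
        raise ValueError("the sum of the degrees must be even")
    # A random shuffle followed by pairing consecutive stubs is a uniformly
    # random perfect matching of the stubs.
    rng.shuffle(stubs)
    return [(stubs[t], stubs[t + 1]) for t in range(0, len(stubs), 2)]


# Degree sequence of two triangles joined by one arc (m = 7 arcs).
DEGREES = [2, 2, 3, 3, 2, 2]
random_arcs = configuration_model(DEGREES, random.Random(0))

# Every node keeps its degree (a self-loop uses two stubs of the same node).
realized = [0] * len(DEGREES)
for a, b in random_arcs:
    realized[a] += 1
    realized[b] += 1
assert realized == DEGREES
```

Passing an explicit `random.Random` instance keeps the sketch reproducible without touching the global random state.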

Here we show how to compute exactly the expected values of expression (4) using the configuration model. Expected weights depend on the partition ratio *P*_{ij} and the neighbourhood ratio *CN*_{ij} of the random graphs obtained from the configuration model. By construction, the partition ratio *P*_{ij} of the random graph is the same as that of the empirical graph, but the neighbourhood ratio *CN*_{ij} is different.

To calculate *CN*_{ij}, we introduce some notation. Recall that *k*_{i} is the adjacency degree of node *i* and assume that the graph has *m* edges. Let *P*_{adjacency}(*k*_{i}, *k*_{j}, *m*) be the probability that nodes *i* and *j* are connected by an arc, let *P*_{common neighbour}(*k*_{i}, *k*_{j}, *k*_{r}, *m*) be the probability that *i* and *j* are both connected by an arc with *r*, so that *r* is a common neighbor, and let *P*_{triangle}(*k*_{i}, *k*_{j}, *k*_{r}, *m*) be the probability that *i* and *j* are connected by an arc and are also both connected with *r*, so that the three arcs form a triangle. The notation emphasizes that the probabilities depend on the adjacency degrees *k*_{i}, *k*_{j}, *k*_{r} and on the total number of edges *m*. In the following proposition, we derive closed form expressions for the above probabilities.

**Proposition 1**. *Let i, j, r be three nodes with adjacency degrees k_{i}, k_{j}, k_{r}, respectively. Then, in the random graph configuration model*
(22) (23) (24)

*Proof*. Applying the configuration model to *G* = (*V*, *E*), we obtain two stubs *l*_{1} and *l*_{2}, adjacent to *i* and *j*, respectively, for every arc *e*(*i*, *j*) ∈ *E*. Then, we select two stubs at random and pair them until a random graph *G*′ is obtained. Note that, from construction, we can interpret any occurrence of *G*′ as a matching of 2*m* stubs.

Given *i*, *j* ∈ *V*, let *S*_{i} = {*l*_{i(1)}, …, *l*_{i(ki)}} be the set of stubs adjacent to *i* and *S*_{j} = {*l*_{j(1)}, …, *l*_{j(kj)}} be the set of stubs adjacent to *j*. A set of 2*m* elements admits $\frac{(2m)!}{2^m\, m!}$ different matchings, see [23]. Therefore, if two stubs *l*_{1} ∈ *S*_{i} and *l*_{2} ∈ *S*_{j} are matched, there are $\frac{(2m-2)!}{2^{m-1}\,(m-1)!}$ different matchings of the remaining stubs, because there are still 2*m* − 2 stubs to pair. Hence, the probability that two stubs *l*_{1} and *l*_{2} are joined, connecting nodes *i* and *j*, is:
$$P(l_1 \text{ and } l_2 \text{ are matched}) = \frac{(2m-2)!}{2^{m-1}\,(m-1)!} \cdot \frac{2^m\, m!}{(2m)!} = \frac{1}{2m-1}. \tag{25}$$

Next, we introduce the random variables:
$$X_{l_1 l_2} = \begin{cases} 1, & \text{if stubs } l_1 \text{ and } l_2 \text{ are matched}, \\ 0, & \text{otherwise.} \end{cases}$$

Obviously, the probability of $X_{l_1 l_2} = 1$ is $\frac{1}{2m-1}$, as stated in (25). We can express the number of edges between two nodes *i* and *j* as the sum:
$$\sum_{l_1 \in S_i} \sum_{l_2 \in S_j} X_{l_1 l_2}.$$

The above expression sums the variables whose indices are one stub adjacent to *i* and another stub adjacent to *j*. Thus, the expected number of edges between *i* and *j* is:
$$E\Big[\sum_{l_1 \in S_i} \sum_{l_2 \in S_j} X_{l_1 l_2}\Big] = \sum_{l_1 \in S_i} \sum_{l_2 \in S_j} P(X_{l_1 l_2} = 1) = \frac{k_i k_j}{2m-1}.$$

Note that in the modularity function (20), this value is approximated by $\frac{k_i k_j}{2m}$.

As explained before, the expected number of edges is different from the probability of adjacency. The adjacency between two nodes *i* and *j* is the condition that there is at least one arc between *i* and *j*, and it can be expressed as the union of the events $\{X_{l_1 l_2} = 1\}$ with *l*_{1} ∈ *S*_{i} and *l*_{2} ∈ *S*_{j}. For the sake of simplicity, we refer to these events as $M_{l_1 l_2}$. So, the adjacency probability of two nodes *i* and *j* is:
$$P_{adjacency}(k_i, k_j, m) = P\Big( \bigcup_{(l_1, l_2) \in S_i \times S_j} M_{l_1 l_2} \Big). \tag{26}$$

Let $\mathcal{C}_t$ be the set of all the different subsets of *S*_{i} × *S*_{j} with size *t*. Applying the inclusion-exclusion law for the probability of a union of events to expression (26), it follows that:
$$P_{adjacency}(k_i, k_j, m) = \sum_{t=1}^{k_i k_j} (-1)^{t+1} \sum_{C \in \mathcal{C}_t} P\Big( \bigcap_{(l_1, l_2) \in C} M_{l_1 l_2} \Big). \tag{27}$$

By construction of the random graph *G*′, observe that the intersection of *t* different events $M_{l_1 l_2}$, each representing the match between stubs *l*_{1} and *l*_{2}, is empty if the same stub, *l*_{1} or *l*_{2}, is repeated in different matches. Therefore, for each *t*, the non-empty intersections that appear in (27) are matchings with *t* matches. As a consequence, the summation on *t* is bounded by min{*k*_{i}, *k*_{j}}, because the intersection of more than min{*k*_{i}, *k*_{j}} different events must repeat some stub and is therefore empty. Moreover, applying the same argument used to calculate the probability of joining two stubs (25), the probability of joining *t* stubs from *S*_{i} with *t* other stubs from *S*_{j} is:
$$\frac{(2m-2t)!}{2^{m-t}\,(m-t)!} \cdot \frac{2^m\, m!}{(2m)!} = \prod_{s=1}^{t} \frac{1}{2m - 2s + 1}.$$

Finally, to derive the expected values, we need to calculate the number of different subsets of *S*_{i} × *S*_{j} with size *t* that do not repeat any stub. We have to consider *t* stubs from *S*_{i} and *t* from *S*_{j}, and then all the possible matchings between stubs of the two sets. There are $\binom{k_i}{t}$ different subsets of *t* stubs from *S*_{i} and $\binom{k_j}{t}$ different subsets of *t* stubs from *S*_{j}. We can match the *t* stubs of one set with the *t* stubs of the other set in *t*! different ways, obtaining the following expression for the probability that nodes *i* and *j* are connected:
$$P_{adjacency}(k_i, k_j, m) = \sum_{t=1}^{\min\{k_i, k_j\}} (-1)^{t+1} \binom{k_i}{t} \binom{k_j}{t}\, t! \prod_{s=1}^{t} \frac{1}{2m - 2s + 1}. \tag{28}$$

This is the expression in (22) for *P*_{adjacency}(*k*_{i}, *k*_{j}, *m*).
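The inclusion-exclusion argument can be validated on small degree sequences by listing every matching of the stubs. The sketch below is an illustration under our own naming: `p_adjacency` evaluates the alternating sum obtained from the argument above (choose *t* stubs on each side, match them in *t*! ways, and multiply by the probability that those *t* matches occur, which is the product of 1/(2*m* − 2*s* + 1) for *s* = 1, …, *t*), and `p_adjacency_enumerated` counts the matchings in which the two nodes are joined by at least one arc.

```python
from math import comb, factorial


def p_adjacency(k_i, k_j, m):
    """Probability that two nodes with degrees k_i, k_j are adjacent."""
    total = 0.0
    for t in range(1, min(k_i, k_j) + 1):
        denom = 1
        for s in range(1, t + 1):
            denom *= 2 * m - 2 * s + 1
        total += (-1) ** (t + 1) * comb(k_i, t) * comb(k_j, t) * factorial(t) / denom
    return total


def matchings(stubs):
    """Yield every perfect matching of a list of stubs (positions are distinct)."""
    if not stubs:
        yield []
        return
    first, rest = stubs[0], stubs[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in matchings(remaining):
            yield [(first, partner)] + tail


def p_adjacency_enumerated(degrees, i, j):
    """Fraction of stub matchings with at least one arc between i and j."""
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    hits = count = 0
    for match in matchings(stubs):
        count += 1
        if any({a, b} == {i, j} for a, b in match):
            hits += 1
    return hits / count


# Degrees (2, 2, 1, 1): m = 3, so there are 15 matchings of the 6 stubs.
closed = p_adjacency(2, 2, 3)
exact = p_adjacency_enumerated([2, 2, 1, 1], 0, 1)
print(closed, exact, 2 * 2 / (2 * 3))  # the last value is the k_i k_j / 2m estimate
```

For this degree sequence both computations give 2/3, which happens to coincide with the *k*_{i}*k*_{j}/(2*m*) estimate; for degrees (3, 2, 2, 1) the exact value is 24/35 while the estimate is 3/4, so the two differ in general. The enumeration is feasible only for a handful of stubs, since the number of matchings grows as (2*m* − 1)!!.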

Now, we use (28) and the previous arguments to obtain the probability that *i* and *j* are both connected with a different node *r*, namely *P*_{common neighbour}(*k*_{i}, *k*_{j}, *k*_{r}, *m*), i.e., we compute the probability of the intersection of the event that nodes *i* and *r* are connected with the event that nodes *j* and *r* are connected:
(29)

Finally, proceeding as before, the probability that three nodes *i*, *j* and *r* are connected to each other, namely *P*_{triangle}(*k*_{i}, *k*_{j}, *k*_{r}, *m*), is:
(30)

The above probabilities are necessary to determine the exact value of the expected weight *E*[*W*_{ij}], when weights are defined as in formula (4) and the graph is obtained by the configuration model.

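Define the following random variables:
$$Y_{ij} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ are adjacent in the random graph } G', \\ 0, & \text{otherwise.} \end{cases}$$

**Theorem 1**. *Assume that weights between nodes i and j are defined as in* (4), *then the expected weight E[W_{ij}] between nodes i and j of the random graph configuration model is given by the following expressions*: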

*Proof*. We can express the weights (4) depending on the cases as follows.

Observe that if the term ∑_{r∈V\{i,j}} *Y*_{ir}*Y*_{jr} = 0 then since the adjacency degree of *i* or *j* is one, *i* and *j* must be connected and therefore *Y*_{ij} = 1. Thus, the expression above results in *Y*_{ij}*P*_{ij}. Otherwise, if ∑_{r∈V\{i,j}} *Y*_{ir}*Y*_{jr} ≠ 0 again since the adjacency degree of *i* or *j* is one, *Y*_{ij} = 0 and the expression above simplifies to . Hence, we obtain that

Next, we compute the expected values of the previous expression: (33) and the result follows because the expression above coincides with (31).

Then, the expected value of the expression above is: (34)

Finally, substituting the probabilities that appear in (33) and (34) with the expressions in (22), (23) and (24), one obtains the result.

#### New models for detecting communities using weighted graph modularity games.

In the previous section, we showed that the optimal solution of the analyzed instances provided by formulation *F*_{Sh−JK} was the grand coalition except for one node. Since this type of solution is meaningless for detecting overlapping communities, in this section we provide an alternative model that takes advantage of Theorem 1. Specifically, we propose to define another modularity game (*N*, *φ*), in which the characteristic function *φ* is as in (1), but weights are defined as:
(35)
where . Observe that the game is non-convex, as the new weights can take both positive and negative values.

To calculate the overlapping communities through the coalition stability of a modularity game, the objective function of formulation *F*_{Sh−JK} must be modified according to Eq (35). Moreover, to avoid double counting (induced by pairs of nodes that belong to more than one common community in the new objective function), for any 1 ≤ *i* < *j* ≤ *n* the following binary variables are introduced:

Observe that if we had used *y*-variables in model *F*_{Sh−JK}, the same solution would have been obtained: because all the weights are positive, the grand coalition would again have been the optimal solution.

The final formulation of this model is: (36) s.t.: (5)–(7), (10), (13), (17), (18) (37) (38) (39) (40) (41)

The objective function (36) sums the weights between nodes of the same community only once. In this way, it cannot be the case that a community is a proper subset of another, because its profit would be null. Hence, constraints (9), (14), (15), (16) and (19), which were discussed previously, are not necessary. Constraints (37) guarantee that communities are stable for the new weights *W**. If *x*_{ik} = 1, then (37) is equivalent to (8). Constraints (38) impose that each node cannot belong to more than *p* different communities, with *p* a fixed parameter established by the user. Constraints (39) and (40) impose that *y*_{ij} = 1 if and only if there is a community *k* to which both *i* and *j* belong. Finally, constraints (41) define our variables as binary but, from the arithmetic of the model, we can relax them to continuous variables (*y*_{ij} ∈ [0, 1]) because in any case they can only take the values 0 and 1. The notation stands for the fact that the condition of stability is determined by the Shapley value of a modularity game with weights . In some experimental cases, it is interesting to compare the contribution of Theorem 1 over the approximations , see (21), and therefore, we will refer as to the model in which are replaced by .
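The role of the *y*-variables can be illustrated outside the solver. The sketch below is a plain-Python illustration, not the Gurobi model used in the experiments: it derives *y*_{ij} from a given assignment matrix *x* according to the logical meaning of constraints (39) and (40), evaluates an objective that counts each pair once, and checks the membership bound of (38). All function names are hypothetical, and the weights are passed as a symmetric matrix.

```python
from itertools import combinations


def pair_indicators(x):
    """x[i][k] = 1 if node i is in community k; returns {(i, j): y_ij} for i < j."""
    n = len(x)
    return {
        (i, j): int(any(a and b for a, b in zip(x[i], x[j])))
        for i, j in combinations(range(n), 2)
    }


def covered_pair_objective(x, weights):
    """Sum weights[i][j] once for every pair sharing at least one community."""
    y = pair_indicators(x)
    return sum(weights[i][j] * y_ij for (i, j), y_ij in y.items())


def respects_membership_bound(x, p):
    """True if no node belongs to more than p communities."""
    return all(sum(row) <= p for row in x)
```

Note that two nodes sharing several communities still contribute their weight only once, which is the double counting that the *y*-variables are meant to prevent.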

The following experiments will highlight differences between models and , and differences between overlapping and non-overlapping communities models. The experiments are run in the Python environment and using the Gurobi solver.

In the first two examples we will show that models , i.e. the exact model, and models , i.e. the approximation, compute different communities, even though they are run with the same parameters and the network size is small. From the tests, we can argue that the contribution of Theorem 1 is substantial.

We apply models and to Zachary’s karate club network [24], and compare the results with those obtained in [15]. That paper finds three overlapping communities, so we fix *n*_{c} = 3 and *p* = 2. In Fig 3, each community is represented by the color grey, black or blue, and the intersection nodes are shown in red.

Community structures obtained by (a) [15], (b) with parameters *n*_{c} = 3, *p* = 2, (c) with parameters *n*_{c} = 3, *p* = 2.

Fig 3a and 3c are similar. The only difference is that model detects node 12 as an intersection. This is reasonable, because node 12 is only connected to the other intersection node and shares neighbours with both communities, black and blue. The structure obtained by model is also similar, but detects more intersection nodes, which have connections with different communities and share neighbours with them. The results highlight that there can be differences between the exact and the approximate models, even when they are applied to small graphs.

Next, we analyze models and with other parameters. First, we fix *p* = 1, so that communities cannot overlap, and we obtain the results in Fig 4.

Community structures obtained by (a) with parameters *n*_{c} = *n*, *p* = 1, (b) with parameters *n*_{c} = *n*, *p* = 1.

As can be seen, in both cases nodes that belong to the same community have high edge density between them and many common neighbours, even though the two communities in Fig 4a can be further split, as seen in Fig 4b. There, communities have higher edge density, but fewer common neighbours. This highlights the fact that equation (4) combines two criteria, namely density of common neighbours and number of connections, and the researcher must consider a trade-off between them. Letting communities overlap partially avoids this trade-off: with parameters *n*_{c} = 4 and *p* = 2, we obtain the results in Fig 5.

Community structures obtained by (a) with parameters *n*_{c} = 4, *p* = 2, (b) with parameters *n*_{c} = 4, *p* = 2.

Figs 3b and 5a are similar. The intersection nodes found previously (Fig 3b) are also intersection nodes in Fig 5a with the new parameters. Nevertheless, some other intersection nodes appear, brought about by the new fourth community of the clustering. Note that the communities in Fig 5a are quite different from those of Fig 5b, especially as regards intersection nodes. As remarked before, this implies that the differences between the exact and the approximate model are substantial.

Next, we apply models and to the zebra communication network, see [25]. First, model is run with *p* = 1 and results are in Fig 6a. Results of model are the same. Results of models and with parameters *p* = 2 and *n*_{c} = 3 are in Fig 6b and 6c respectively. The former model does not detect any overlapping community, suggesting that they are well separated, while the latter model identifies node 20 as belonging to two communities. Since this model is actually an approximation of the real data, it is likely that the role of node 20 has been mistaken, since the communities seem to be separated.

Community structures obtained by (a) with parameters *n*_{c} = *n*, *p* = 1, (b) with parameters *n*_{c} = 3, *p* = 2, (c) with parameters *n*_{c} = 3, *p* = 2.

The following two examples compare the communities found by model when communities i) cannot overlap (*p* = 1); ii) can overlap (*p* > 1). It will be seen that allowing overlapping communities reveals nodes that are structurally different from others, forming the bulk of a core/periphery separation.

First, we apply the model to the Highland tribes network, see [26]. Model is run with *p* = 1 and results are in Fig 7a. There, it can be seen that, if no overlapping communities are allowed, then the model detects one community composed of all the nodes. Conversely, when model is run with parameters *n*_{c} = 3 and *p* = 2, the results are those reported in Fig 7b, where the different roles of the nodes emerge. There, three communities of different size have been detected, with some nodes (the red ones) belonging to more than one community and forming the core of the system of alliances.

Community structures obtained by (a) with parameters *n*_{c} = *n*, *p* = 1, (b) with parameters *n*_{c} = 3, *p* = 2.

Next, we apply model to the Windsurfers network, see [27]. Run with parameter *p* = 1, the model detected the two communities reported in Fig 8a. Run with parameters *n*_{c} = 2 and *p* = 2, the model detected the communities reported in Fig 8b. As can be seen, the results with overlapping communities are a refinement of the disjoint communities. Nodes that are on the border between the two groups are highlighted as members of both, forming the bulk of a core/periphery network segmentation.

Community structures obtained by (a) with parameters *n*_{c} = *n*, *p* = 1, (b) with parameters *n*_{c} = 2, *p* = 2.

To summarize our findings, the test of models on four typical benchmark networks revealed:

- Results between and are different. As the latter is an approximation of the former, it reveals that the contribution of Theorem 1 to model development is substantial.
- Results between non-overlapping and overlapping community models are different. The latter can reveal not only group membership, but also nodes that could act as potential bridges between communities.

### Local Stability Exploration: A heuristic algorithm to detect overlapping communities

Problems and are Integer Linear Programming (ILP) models whose computational times can be impractical when the instances to solve are large. This is to be expected when dealing with an NP-hard problem such as community detection. Nevertheless, for large instances the ILP formulation can be used to devise heuristic algorithms that approximate the optimal solution in short computing times. Here we propose a method based on local search, which we call Local Stability Exploration (LSE). Suppose that a set Π of feasible communities is given; we call such Π an incumbent solution. Π is feasible if it satisfies the ILP model constraints, that is: i) every node belongs to at least one community; ii) no community is included in another, i.e., there are no distinct *k* and *r* such that *S*_{k} ⊆ *S*_{r}; iii) the maximum number of communities to which a node can belong is not exceeded by any node, i.e. ∀*i* ∈ *V* the inequality |{*k* = 1, …, *n*_{c}: *i* ∈ *S*_{k}}| ≤ *p* is fulfilled; and iv) all communities are stable.

Next, we try to modify Π to obtain a new feasible solution Π′ with an improved objective function. We consider three possible modifications of Π, obtained by moves that are called Add, Remove, and Swap. Add is the move that joins a node to a community, allowing in this way multiple community assignments. Remove is the move that takes a node away from a community. Swap is the move that switches two nodes between two communities. These moves are applied if and only if the resulting Π′ is feasible. That is, after a move it must not occur that 1) a node does not belong to any community; 2) a node belongs to more than *p* communities, the maximum allowed; 3) one community is included in another; or 4) a modified community is not stable.

For a feasible starting solution, the procedure is summarized in Algorithm 1. There, the triplet (*i*, *k*, 1) is the move of adding node *i* to community *k*, the triplet (*i*, *k*, 2) is the move of removing node *i* from community *k*, the 5-tuple (*i*, *k*, *i*′, *k*′, 3) is swapping nodes *i* and *i*′ between communities *k* and *k*′. It can be seen that from Line 9 to Line 22 all feasible moves are considered. In Lines 12, 15 and 20 the increases of the objective function are calculated using the following notation: Let *C*_{i} = {*k* ∈ {1, …, *n*_{c}}:*i* ∈ *S*_{k}}, that is, *C*_{i} is the index set of the communities to which *i* belongs, then the objective function can be written as:

Note that *C*_{i} ∩ *C*_{j} ≠ ∅ is the condition that there is at least one community to which both *i* and *j* belong. However, for computational efficiency it is better to calculate just the increase of the objective function, as is done in Lines 12, 15 and 20. The new solution Π′ is the one that obtains the maximum increase. The algorithm stops when the condition of Line 42 applies, as there are no improvements and a local optimum has been reached.
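To make the incremental evaluation concrete, the following sketch assumes that the objective is the sum of the weights over all pairs *i* < *j* with *C*_{i} ∩ *C*_{j} ≠ ∅, as the condition above suggests. Under that assumption, adding node *i* to community *k* only affects the pairs (*i*, *j*) with *j* ∈ *S*_{k} that did not share a community before. The function names and the data layout (a symmetric weight matrix `W`, dictionaries `C` and `S`) are hypothetical, and the gain formula is derived here rather than copied from Line 12.

```python
from itertools import combinations


def objective_from_memberships(C, W):
    """C[i] is the set of community indices of node i; W is a symmetric matrix."""
    return sum(W[i][j] for i, j in combinations(sorted(C), 2) if C[i] & C[j])


def add_move_gain(C, S, W, i, k):
    """Objective increase when node i (not yet in S[k]) joins community k.

    Only pairs (i, j) with j in S[k] that share no community yet are newly counted.
    """
    return sum(W[i][j] for j in S[k] if j != i and not (C[i] & C[j]))
```

Evaluating the gain this way touches only the members of one community, whereas recomputing the whole objective touches every pair of nodes.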

**Algorithm 1** Local stability exploration algorithm

1: **procedure** Local stability exploration

2: ⊳ Π is obtained by peculiar subroutines

3: **for** *i* in *V* **do**

4: *C*_{i} = {*k* ∈ {1, …, *n*_{c}}: *i* ∈ *S*_{k}}

5: **end for**

6: ⊳ Objective function

7: *local*_*opt* = *FALSE* ⊳ Condition for a local optimum

8: **while** *local*_*opt* = *FALSE* **do**

9: Δ ← *Feasible*_*Moves*(Π) ⊳ Δ: list of admissible moves for Π.

10: **for** (*i*, *k*, *d*) in Δ **do**

11: **if** d = 1 **then**

12:

13: **end if**

14: **if** d = 2 **then**

15:

16: **end if**

17: **end for**

18: **for** (*i*, *k*, *i*′, *k*′, 3) ∈ Δ **do**

19: **if** d = 3 **then**

20:

21: **end if**

22: **end for**

23: (*i**, *k**, *d**) ∈ argmax{*δ*_{ikd}|(*i*, *k*, *d*)∈Δ} ⊳ Select the move that increases the most

24: (*i**, *k**, *i*′*, *k*′*, *d**)∈argmax{*δ*_{iki′k′d}|(*i*, *k*, *i*′, *k*′, *d*)∈Δ} ⊳ Select the move that increases the most

25: **if** *δ*_{i*k*i′*k′*d*} > max{0, *δ*_{i*k*d*}} **then**

26: *f* ← *f*+ *δ*_{i*k*i′*k′*d*} ⊳ Update *f*

27: *S*_{k*} ← *S*_{k*} ∪ {*i*′*}\{*i**}

28: *S*_{k′*} ← *S*_{k′*} ∪ {*i**}\{*i*′*} ⊳ Update Π

29: *C*_{i*} ← *C*_{i*} ∪ {*k*′*}\{*k**}

30: *C*_{i′*} ← *C*_{i′*} ∪ {*k**}\{*k*′*}

31: **else**

32: **if** *δ*_{i*k*d*} > 0 **then**

33: *f* ← *f* + *δ*_{i*k*d*} ⊳ Update *f*

34: **if** *d** = 1 **then**

35: *S*_{k*} ← *S*_{k*} ∪ {*i**} ⊳ Update Π

36: *C*_{i*} ← *C*_{i*} ∪ {*k**}

37: **else**

38: *S*_{k*} ← *S*_{k*}\{*i**} ⊳ Update Π

39: *C*_{i*} ← *C*_{i*}\{*k**}

40: **end if**

41: **else**

42: *local*_*opt* = *TRUE*

43: **end if**

44: **end if**

45: **end while**

46: **return** Π ⊳ Return the local optimum

47: **end procedure**
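The overall loop of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the objective is recomputed from scratch for every candidate instead of using the incremental gains of Lines 12, 15 and 20; the stability test is a hypothetical callable `is_stable` that accepts every community by default; and Swap is restricted to nodes that belong to exactly one of the two communities involved.

```python
from itertools import combinations, permutations


def objective(S, W):
    """Sum W[i][j] once over every pair i < j sharing at least one community."""
    n = len(W)
    member = [{k for k, com in enumerate(S) if i in com} for i in range(n)]
    return sum(W[i][j] for i, j in combinations(range(n), 2)
               if member[i] & member[j])


def is_feasible(S, n, p, is_stable):
    """Coverage, at most p memberships, no inclusion, and stability."""
    counts = [sum(i in com for com in S) for i in range(n)]
    if any(c < 1 or c > p for c in counts):
        return False
    if any(a <= b for a, b in permutations(S, 2)):  # one community inside another
        return False
    return all(is_stable(com) for com in S)


def neighbours(S, n):
    """Yield every community structure one Add, Remove or Swap move away."""
    for k, com in enumerate(S):
        for i in range(n):
            changed = com - {i} if i in com else com | {i}  # Remove or Add
            yield S[:k] + [changed] + S[k + 1:]
    for k, kp in combinations(range(len(S)), 2):
        for i in S[k] - S[kp]:
            for ip in S[kp] - S[k]:
                new = list(S)
                new[k] = (S[k] - {i}) | {ip}    # Swap i and ip
                new[kp] = (S[kp] - {ip}) | {i}
                yield new


def local_stability_exploration(S, W, p, is_stable=lambda com: True):
    """Best-improvement local search; stops at the first local optimum."""
    n = len(W)
    S = [frozenset(com) for com in S]
    f = objective(S, W)
    while True:
        best, best_f = None, f
        for cand in neighbours(S, n):
            if is_feasible(cand, n, p, is_stable):
                cand_f = objective(cand, W)
                if cand_f > best_f:
                    best, best_f = cand, cand_f
        if best is None:          # no improving feasible move: local optimum
            return S, f
        S, f = best, best_f
```

On a toy instance with four nodes, positive weights inside the groups {0, 1} and {2, 3}, negative weights across them, *p* = 2 and starting communities {0, 1, 2} and {3}, the sketch first adds node 2 to the second community and then removes it from the first, ending with the two intended groups.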

It remains to explain how feasible starting solutions can be obtained in Line 2 of Algorithm LSE. Depending on the problem, we tested various procedures. The first possibility is to start with a solution Π that is infeasible because it contains unstable communities. Then Algorithm LSE is run without imposing that new solutions Π′ should be stable but, once a feasible solution has been found, all subsequent solutions must remain feasible too. The initial infeasible Π can be a random assignment of nodes to communities; another possibility is solving the problem for *p* = 1, that is, when overlapping is not allowed, as this case is usually solved faster than the cases in which *p* > 1. Another possibility, which has been used for the largest problems, is solving by branch-and-bound, stopping the search as soon as the first feasible solution has been found and then using it as the starting solution in Line 2. All methods can be combined with any multi-start strategy, that is, repeating Algorithm 1 many times with different starting solutions to obtain sufficient diversification and exploration of the solution space. Finally, Algorithm LSE has been explained to solve model , but it can be applied to with straightforward modifications.

A preliminary test of the quality of the LSE algorithm has been run on the previous networks. We run a multi-start version allowing *t*_{max} = 10 starting solutions each run. Results about computational times and solution quality for different parameters configurations are reported in Table 1. It can be seen that the LSE heuristic algorithm reduces the computing time significantly with respect to the ILP solution for both models and , while the optimal solution has been achieved in all the cases but one.

Moreover, we applied the LSE algorithm to some large-scale real data sets that are impractical for any ILP model, in order to test the scalability of our heuristic. The solved data sets are the American college football network with 115 nodes, see [28], the Jazz musician network with 198 nodes, see [29], and the C. elegans metabolic network with 453 nodes, see [30]. These real data sets are commonly used in the literature. The exact expected weights cannot be computed for graphs with a large number of edges, so we used the approximated expected weights . We report the results of our methods in Figs 9–11.

Community structure obtained by LSE heuristic with weights and parameters *n*_{c} = 7, *p* = 2.

Community structure obtained by LSE heuristic with weights and parameters *n*_{c} = 6, *p* = 2.

Community structure obtained by LSE heuristic with weights and parameters *n*_{c} = 10, *p* = 2.

## Results and discussion

We are going to analyze the main features of the ILP models , and the heuristic Algorithm 1 when they are applied to medium and large size networks; more precisely, whether they can detect the true overlapping communities of randomly generated networks, as is done in [31]. Random networks are generated using the procedure proposed in [16], but with some variations to allow for communities that overlap. In particular, in our simulation we must distinguish between bridge and non-bridge nodes, the former being the nodes that belong to more than one community. The main parameters characterizing the simulated networks are:

- *N*: the number of nodes.
- *n*_{c}: the number of communities.
- *p*: the maximum number of communities to which a node can belong.
- *N*_{o}: the number of nodes that belong to more than one community, that is, the bridge nodes.

Next, communities are defined by the probability by which community nodes can establish a link between themselves. Those probabilities are controlled by parameters:

- 1 − *μ*: fraction of links between non-bridge nodes belonging to the same community.
- 1 − *μ*_{o}: fraction of links between bridge nodes and other nodes of the communities to which the bridge node belongs.
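The mixing parameter can also be measured on a given network once the communities are known. The sketch below is one possible reading of the definition above, stated as an assumption: for every link endpoint that is a non-bridge node, the link counts as internal if the other endpoint shares a community with it and as external otherwise, and the empirical *μ* is the external fraction. The function name and data layout are hypothetical and do not come from the generator described in S1 Appendix.

```python
def empirical_mixing(edges, membership):
    """Estimate mu from an edge list and a dict node -> set of community labels.

    A node is treated as non-bridge when it belongs to exactly one community.
    """
    internal = external = 0
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if len(membership[a]) != 1:
                continue  # bridge endpoints are governed by mu_o, not mu
            if membership[a] & membership[b]:
                internal += 1
            else:
                external += 1
    total = internal + external
    return external / total if total else 0.0
```

On a path 0-1-2-3 with communities {0, 1} and {2, 3}, the middle link is external for both of its endpoints, so the estimate is 2/6.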

There are other parameters characterizing the simulated networks, such as the number of arcs, the node degrees, the community sizes and so on, whose purpose is to simulate networks with the same characteristics as the empirical ones. We report all these features in S1 Appendix, with the pseudo-code describing our implementation of the Lancichinetti *et al*. algorithm.

The solution quality of our models is measured by comparing their results with the true community structures (known by simulation). True and estimated structures may differ in:

- The community composition;
- The identification of the bridge nodes.

The statistics to compare the community composition are:

- the *Normalized Mutual Information (NMI) index* for overlapping partitions, presented in [32];
- the *Omega index (OI)*, presented in [33].

Both statistics range between 0 and 1, with values closer to 1 indicating strong correspondence between true and estimated communities.
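As an illustration of how such a comparison works, the sketch below computes the Omega index under its usual definition as a chance-corrected pair agreement: two covers agree on a pair of nodes when the pair is placed together in the same number of communities in both. This is a minimal sketch assuming that definition; the function names are hypothetical, and every community is assumed to contain only nodes listed in `nodes`.

```python
from collections import Counter
from itertools import combinations


def shared_counts(cover, nodes):
    """For each pair of nodes, count the communities that contain both."""
    counts = {pair: 0 for pair in combinations(sorted(nodes), 2)}
    for community in cover:
        for pair in combinations(sorted(community), 2):
            counts[pair] += 1
    return counts


def omega_index(cover_a, cover_b, nodes):
    """Chance-corrected fraction of pairs with the same shared-community count."""
    a = shared_counts(cover_a, nodes)
    b = shared_counts(cover_b, nodes)
    m = len(a)  # number of node pairs
    observed = sum(a[pair] == b[pair] for pair in a) / m
    hist_a, hist_b = Counter(a.values()), Counter(b.values())
    expected = sum(hist_a[j] * hist_b[j] for j in hist_a) / m**2
    if expected == 1:
        return 1.0  # degenerate case: both covers treat every pair identically
    return (observed - expected) / (1 - expected)
```

Two identical covers give an index of 1, while comparing two disjoint groups against a single all-inclusive community on four nodes gives 0, that is, no agreement beyond chance.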

The statistics to compare the identification of bridge nodes are based on a set of indices that depend on the values of the confusion matrix associated with the identification of bridge nodes. Each element of the confusion matrix is defined as follows:

- True Positive (TP): Nodes successfully detected as bridge.
- True Negative (TN): Nodes successfully detected as non-bridge.
- False Positive (FP): Nodes wrongly detected as bridge.
- False Negative (FN): Nodes wrongly detected as non-bridge.

Then, we consider the following indices.

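- the *accuracy*, defined as (TP + TN) / (TP + TN + FP + FN),
- the *True Positive Rate (TPR)*: TP / (TP + FN),
- the *False Positive Rate (FPR)*: FP / (FP + TN),
- the *Area Under Curve (AUC)*: (1 + TPR − FPR) / 2,
- the *Precision*, defined as TP / (TP + FP),
- the *F1 score*: 2 · Precision · TPR / (Precision + TPR).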
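These indices can be computed directly from the sets of true and detected bridge nodes. The sketch below uses the standard confusion-matrix definitions; the AUC line is an assumption, namely the single-operating-point form (1 + TPR − FPR) / 2, and the function name and the convention of returning 0 for an empty denominator are hypothetical choices.

```python
def bridge_detection_metrics(true_bridges, detected_bridges, nodes):
    """Confusion-matrix indices for bridge-node identification."""
    nodes = set(nodes)
    true_bridges = set(true_bridges) & nodes
    detected_bridges = set(detected_bridges) & nodes
    tp = len(true_bridges & detected_bridges)
    fp = len(detected_bridges - true_bridges)
    fn = len(true_bridges - detected_bridges)
    tn = len(nodes) - tp - fp - fn

    def ratio(num, den):
        return num / den if den else 0.0  # convention for empty denominators

    tpr = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(nodes)),
        "TPR": tpr,
        "FPR": fpr,
        "AUC": (1 + tpr - fpr) / 2,  # assumed single-point AUC
        "precision": precision,
        "F1": ratio(2 * precision * tpr, precision + tpr),
    }
```

With ten nodes, true bridges {0, 1} and detected bridges {1, 2}, the accuracy is 0.8 even though only half of the bridges are found, which shows how the scarcity of bridge nodes inflates this index.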

*Test 1: Detecting non overlapping communities:* As a first test, we apply the ILP models , and the Algorithms LSE to the case in which communities do not overlap, that is, *p* = 1, to see whether the approximate results of algorithm LSE are reliable with respect to those found by the respective optimal ILP models. The ILP solution of and can be obtained in short computational times only for moderate size networks, so we consider *N* = 40, 60 to solve within the time limit of 100 or 200 seconds respectively. The LSE heuristic has been run with *t*_{max} = 5 multiple starting solutions, guaranteeing that its computational times are a fraction of those of the exact method.

For fixed *N* and *n*_{c}, we let *μ* = 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, as in [16] to control for the effect of mixing parameter. For each parameter set, either 50 or 100 random networks are generated and indices are calculated as averages on all instances. Results are reported in Table 2. The first two rows of this table give the ILP formulation ( or ) used in the corresponding method: exact (ILP) or (LSE) heuristic to provide an initial solution. The third row describes the parameters of the instances (*N*, *n*_{c}, *μ*) and the index reported below (NMI or Omega). By columns, the layout of this table is organized in three blocks. The first one with three columns describes the instances. The next two blocks, each one with four columns, report the average values of the NMI and Omega indices for each combination of solution method. Results in bold report the best behaviour among similar index for the corresponding solution methods. One can easily observe that using formulation in the ILP or in the LSE heuristic provides better solutions than .

For each combination of parameters *N* and *n*_{c}, the NMI and OI of each solution method are also shown as a function of *μ* in Figs 12–14 to compare the formulations and . The exact formulation obtains, in general, better NMI results and also better *OI* results in more cases than , except for *N* = 40 and *n*_{c} = 4. In this case, the behaviour of *OI* is similar in both formulations. However, also for *N* = 40, the exact solution of model is superior to the other two approaches, namely the heuristic LSE and the exact model , as the curves of the NMI and Omega statistics are above the others for most values of *N*, *n*_{c} and *μ*.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

When *μ* is above the threshold 0.3, the solution quality of the method deteriorates for the joint effect of two factors: 1) communities are less well-separated, 2) exact solution has not been obtained within the considered time limit. However, this is not an actual drawback since for those parameter values, communities are essentially meaningless.

*Test 2: Detecting overlapping communities on small networks:* Networks with overlapping communities have been simulated with the same parameters used before, but now communities can overlap. We control the overlap with parameters *p* = {2, 3} and *μ*_{o} = {0.5, 0.7}. The choice of these parameters is justified since for *p* = 2 the smallest possible *μ*_{o} value is 0.5 and for *p* = 3 the smallest possible *μ*_{o} value is approximately 0.7. Moreover, the number of bridge nodes *N*_{o} is approximately 10% of all the nodes, and we change this value to assess how it affects the computational results. Problems with overlapping communities are harder to solve, therefore we limit the graph size to *N* = 40 and increase the time limit to 200 seconds. Table 3 reports the computational results with a layout similar to Table 2. It can be seen that the best values of both the Omega and NMI indices are obtained with the LSE heuristic, applied to the formulation. The LSE heuristic applied to provides the second best results (with a few exceptions in which it becomes the best one) and the third one is the ILP formulation. The poor performance of the ILP methods is due to the fact that they were not able to terminate the computation within the imposed time limit, so the solutions that they provide are far from optimal. Results of Table 3 are graphically reported in Figs 15–19, where it can be seen that the purple and green curves, representing the LSE heuristics, are very close to each other and well above the results of the truncated ILP. It is also noteworthy that weights from improve the results of the ILP method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

*Test 3: Detecting overlapping communities on large-scale networks:* In the last experiment, we have applied the LSE heuristics, using both the and models, to the largest networks composed of 500 or 1000 nodes. As before, we control the overlap between communities with parameters *p* = {2, 3} and *μ*_{o} = {0.6, 0.7}, the number of bridge nodes are *N*_{o} = {20, 50}.

In Table 4, we report the *NMI* and *OI* statistics calculated by the two methods. It can be seen that they have lower values than those obtained on the smaller networks, due to the fact that communities are harder to find. In most of the cases, model , in which weights are exact, obtains better indices than the approximated weights of . Results of Table 4 are reported in Figs 20–23. There, it can be seen that the green line is above the purple one in almost all cases.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

(a) Average *NMI* for each solution method, (b) Average *Omega* for each solution method.

We can compare models and in terms of detecting the network bridge nodes. We considered several statistics: *accuracy*, *TPR*, *FPR*, *AUC*, *precision* and the *F1 score*. They are collected in Table 5, which reports the average values of these metrics obtained by the two LSE heuristics. In all the simulations, the fraction of bridge nodes over all the nodes is less than 0.1. It implies that it is much easier to detect non-bridge nodes than bridge ones. Therefore, a method that selects the fewest bridge nodes has a numeric advantage in terms of *accuracy*, even though it may fail to classify bridge nodes successfully. Looking at Table 5, one can observe that the greatest difference between -LSE and -LSE is on metrics *FPR* and *TPR*. Model obtains the best rate of *true positives*, model -LSE obtains the best rate of *false positives*. This means that model selects more bridge nodes, but some of them are not actually bridges. Conversely, can successfully detect most of the non-bridge nodes, resulting in higher *accuracy* just because the majority of nodes are actually non-bridge. However, this is a consequence of a method that takes less risk in detecting a node as a bridge. As far as the *AUC* is concerned, the results are very similar due to the balance between *FPR* and *TPR* of both methods.

These values confirm that the bridge nodes detected by model are more reliable than the ones detected by , due to the better *precision* values. Moreover, since the *F*1-score is equal to the harmonic mean between *TPR* and *precision*, also gets better results for this metric.

For the highest values of *μ*, it is more difficult to distinguish the non-bridge from the bridge nodes, which increases the number of *false positives*. So, statistics *FPR*, *AUC*, *precision* and *F1* deteriorate. As in the previous experiments, both models -LSE and -LSE obtain the best results when *μ* is near 0 and when a bridge node belongs to many communities, as it is then easier to detect. In conclusion, detects more bridge nodes, so it obtains the highest *TPR*, but at the cost of incurring a higher number of *false positives* too, which leads to the worst *accuracy*.

## Conclusion

In this paper, we proposed an Integer Linear Programming model to detect overlapping communities in a network. Our contribution identifies communities as stable coalitions and then we select the best of them with an optimization model. Peculiar to this approach is the definition of a weighted graph connection game and its characteristic function. Moreover, we introduced a null hypothesis in the spirit of the modularity function, [1]: We have compared the community node similarity of the actual graph with the node similarity of a random graph with no embedded communities, and in this way we could define a new similarity measure. Then, these similarities are used to define the non-convex cooperative game and the objective function of a maximization problem. Nodes similarities are obtained through the application of Theorem 1, or by a simplified formula, see (21), useful to reduce the computational complexity. Computational tests show that they find similar communities.

Future research can be devoted to defining stability with cooperative games other than graph connection games, which could depend on the actual social or economic activity that is taking place on the network. We could imagine matching or voting games, to name a few, that could readily be defined and applied to specific networks. Moreover, the implementation of the LSE heuristic, Algorithm 1, has been necessary to find solutions in a reasonable computation time, and we found that the stability property increased the problem complexity. As stable community structures are poorly analyzed in the literature, we expect that there is large room to improve our basic heuristic subroutines.

Finally, our extension of the procedure proposed in [16] to generate controlled overlapping communities can be used to validate any other method or algorithm. Testing algorithms is a big challenge and the generation of heterogeneous networks makes the comparison between algorithms easier. However, the wide range of parameter combinations complicates the issue, advancing the need for a general methodology to select the most appropriate scenarios.

## Supporting information

### S1 Appendix. Appendix: Random networks generation.

Random network generator based on [16] benchmark.

https://doi.org/10.1371/journal.pone.0283857.s001

(PDF)

## References

- 1. Girvan M, Newman MEJ. Finding and evaluating community structure in networks. Phys Rev E. 2004;(69 (2), 026113). pmid:14995526
- 2. Fortunato S, Hric D. Community detection in networks: A user guide. Physics Reports. 2016;659:1–44.
- 3. Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature. 2005;(435 (7043)). pmid:15944704
- 4. Xie J, Kelley S, Szymanski BK. Overlapping Community Detection in Networks: The State-of-the-Art and Comparative Study. Comput Surv. 2013;45(43):1–35.
- 5. Agarwal G, Kempe D. Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B. 2008;66(3):409–418.
- 6. Li Z, Zhang XS, Wang RS, Liu H, Zhang S. Discovering Link Communities in Complex Networks by an Integer Programming Model and a Genetic Algorithm. PLoS ONE. 2013;(8 (12), e83739). pmid:24386268
- 7. Bennett L, Kittas A, Liu S, Papageorgiou LG, Tsoka S. Community Structure Detection for Overlapping Modules through Mathematical Programming in Protein Interaction Networks. PLoS ONE. 2014;(9(11): e112821). pmid:25412367
- 8. Costa A, Ng TS, Foo LX. Complete mixed integer linear programming formulations for modularity density based clustering. Discrete Optimization. 2017;(25):141–158.
- 9. Zhang S, Wang RS, Zhang X. Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A. 2007;(374):483–490.
- 10. Nepusz T, Petroczi A, Negyessy L, Bazso F. Fuzzy Communities and the Concept of Bridgeness in Complex Networks. Physical Review E. 2008;77:16–107. pmid:18351915
- 11. Nicosia V, Mangioni G, Carchiolo V, Malgeri M. Extending the definition of modularity to directed graphs with overlapping communities. J Stat Mech Theory Exp. 2009;((03) (2009) P03024).
- 12. Chen D, Shang M, Fu Y. Detecting overlapping communities of weighted networks via a local algorithm. Physica A: Statistical Mechanics and its Applications. 2010;(389):4177–4187.
- 13. Chitra Devi J, Poovammal E. An Analysis of Overlapping Community Detection Algorithms in Social Networks. Procedia Computer Science. 2016;(89):349–358.
- 14. Benati S, Puerto J, Rodríguez-Chía AM, Temprano F. A mathematical programming approach to overlapping community detection. Physica A: Statistical Mechanics and its Applications. 2022;602:127628.
- 15. Jonnalagadda A, Kuppusamy L. A cooperative game framework for detecting overlapping communities in social networks. Physica A. 2018;491:498–515.
- 16. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008;78:046110. pmid:18999496
- 17. Demange G. Intermediate preferences and stable coalition structures. Journal of Mathematical Economics. 1994;23(1):45–58.
- 18. Carraro C, Marchiori C. Stable Coalitions. CEPR Press Discussion Paper DP3258. 2002.
- 19. D’Aspremont C, Jacquemin A, Gabszewicz JJ, Weymark JA. On the Stability of Collusive Price Leadership. The Canadian Journal of Economics / Revue canadienne d’Economique. 1983;16(1):17–25.
- 20. Caparros A, Giraud-Héraud E, Hammoudi A, Tazdaït T. Coalition Stability with Heterogeneous Agents. Economics Bulletin. 2011;31(1):286–296.
- 21. Newman MEJ. Networks: an introduction. Oxford University Press; 2010.
- 22. Newman MEJ. Analysis of weighted networks. Phys Rev E. 2004;70(5):056131. pmid:15600716
- 23. Callan D. A combinatorial survey of identities for the double factorial. 2009.
- 24. Zachary WW. An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research. 1977;33(4):452–473.
- 25. Sundaresan SR, Fischhoff IR, Dushoff J, Rubenstein DI. Network metrics reveal differences in social organization between two fission-fusion species, Grevy’s zebra and onager. Oecologia. 2007;151:140–149. pmid:16964497
- 26. Read KE. Cultures of the central highlands, New Guinea. Southwestern Journal of Anthropology. 1954; p. 1–43.
- 27. Freeman LC, Freeman SC, Michaelson AG. On human social intelligence. Journal of Social Biological Structure. 1988;11:415–425.
- 28. Girvan M, Newman MEJ. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 2002;99(12):7821–7826. pmid:12060727
- 29. Gleiser P, Danon L. Community Structure in Jazz. Advances in Complex Systems (ACS). 2003;06:565–573.
- 30. Jeong H, Tombor B, Albert R, Oltvai Z, Barabasi AL. The Large-Scale Organization of Metabolic Networks. Nature. 2000;407(6804):651–654. pmid:11034217
- 31. Tandon A, Albeshri A, Thayananthan V, Alhalabi W, Radicchi F, Fortunato S. Community detection in networks using graph embeddings. Phys Rev E. 2021;103:022316. pmid:33736102
- 32. Lancichinetti A, Fortunato S, Kertész J. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics. 2009;11(3).
- 33. Collins LM, Dent CW. Omega: A General Formulation of the Rand Index of Cluster Recovery Suitable for Non-disjoint Solutions. Multivariate Behavioral Research. 2005;23:231–242.