
A memetic algorithm for finding multiple subgraphs that optimally cover an input network

Abstract

Finding dense subgraphs is a central problem in graph mining, with a variety of real-world application domains including biological analysis, financial market evaluation, and sociological surveys. While a series of studies have been devoted to finding subgraphs with maximum density, the problem of finding multiple subgraphs that best cover an input network has not been systematically explored. The present study discusses a variant of the densest subgraph problem and presents a mathematical model for optimizing the total coverage of an input network by extracting multiple subgraphs. A memetic algorithm that maximizes coverage is proposed and shown to be both effective and efficient. The method is applied to real-world networks. The empirical meaning of the optimal sampling method is discussed.

1. Introduction

Over the past several decades there has been substantial interest in studying social networks beyond the traditional social sciences while maintaining a focus on social structures. Specifically, instead of focusing on demographic attributes of a certain population, an increasing number of studies have focused on the structure of relationships that connect individual behaviors with collective dynamics [1]. One focus of the analysis of network structure has concerned cohesive subgraphs [2]. Notable examples of this work are sociometric cliques [3] and variants such as n-cliques, n-clans, k-plexes, or k-cores [4]. Related work has focused on detecting core/periphery structures [5], rich clubs [6] or communities [7]. Generally, the aim of these studies has been to find one or more subgraphs that maximize some notion of density.

One popular notion of density that has been widely explored in the literature is the average degree (measured by the edge-to-vertex ratio), and the problem of finding a subgraph that maximizes the average degree is called the densest subgraph problem (DSP) [8]. Analysis of the DSP has been applied to DNA analysis [9, 10], financial market evaluation [11], social surveys [12, 13], and theoretical computer science [14, 15]. In the Web domain, Gibson et al. identified link spam by extracting dense subgraphs in large graphs [16]; link spam is one of the greatest challenges in evaluating search engine rankings [17]. In the social context, DSP has been applied to expert team formation [15, 18] as well as party organization [19, 20]. Angel et al. detected real-time stories by searching for dense subgraphs in the entity co-occurrence graph constructed from micro-blogging streams [21]. DSP has also been employed to find teams with higher collaborative compatibility [22].

DSP aims at extracting a single subgraph, but many real-world cases seek a collection of dense subgraphs, such as communities or social stories [23]. There are relatively few studies in this direction; one of them, by Balalau et al., focused on finding a set of m subgraphs that maximizes the total density over the subgraphs (denoted the "multiple-m densest subgraphs problem", MmDSP) [23]. Variants of this model have been proposed subsequently [24, 25]. These studies have solved the problem of how to extract multiple dense subgraphs, but the process of covering the input network by extracting multiple subgraphs has not been addressed. Maximizing the subgraph density and maximizing the covering have different social meanings. In many real-world cases, the density of subgraphs does not have to be large. For example, a collection of network surveys may not focus on how dense each investigated network is, but on how best to cover the whole population, which is a boundary specification problem [26, 27]. In a network survey, self-report of social relationships is commonly used to collect network data. Specifically, given a list of participants, the data are obtained from answers to single-item questions that ask participants to enumerate individuals to whom they are connected by a direct relationship of a specified kind [1, 28]. The main purpose of such a network survey is to best cover the interactional relationships. Besides network surveys, the covering problem can also be applied to influence maximization [29], network tomography [30], or pinning control [31].

The present study addresses the problem of how to find multiple subgraphs that best cover the input network. We call the problem the "multiple-m covering k-subgraphs problem" (MmCkSP), i.e., maximizing the covering of the network edges given m subgraphs of limited size k. Unlike the classic graph partitioning problem and the densest subgraph problem, the present study aims to determine how to extract multiple subgraphs so as to achieve the best coverage of the network relationships. Two illustrations of the difference between MmCkSP and the densest subgraph problem are shown in Fig 1. Given the input network in Fig 1A, standard community detection finds the partitions {1,2,3,4,5} and {6,7,8,9,10}. If we set the number of subgraphs to 3 and the subgraph size to 5, MmDSP may present the best solution as the subgraphs {1,2,3,4,5}, {6,7,8,9,10} and {3,6,8,9,10} in order to maximize the density of each extracted subgraph, while MmCkSP may extract the subgraphs {1,2,3,4,5}, {6,7,8,9,10} and {1,3,5,7,8}, which may cover all the network edges even though the subgraph {1,3,5,7,8} is not dense. MmDSP and MmCkSP also extract different subgraphs in Fig 1B. Here the edges a, b in (a) and c in (b) are ignored by MmDSP, but these omitted edges connecting different communities may sometimes have a useful social interpretation. Covering these edges can help provide a better understanding of the input network structure.

Fig 1. Two illustrations of the difference between maximizing the density and the coverage.

The upper two networks are the input networks. The lower part contains the solution of extracting multiple subgraphs. In the solution part, grey nodes represent the included nodes and white nodes represent the omitted nodes. The dashed lines represent the unobserved ties, while the bold solid lines represent the extracted ties.

https://doi.org/10.1371/journal.pone.0280506.g001

In real-world cases, the subgraph size and the number of subgraphs should be constrained because they are always associated with costs. Taking network surveys as an example, a longer list of nodes places a greater burden on respondents, in which case ties are more likely to be missed because respondents may not be able to recall enough to fully capture the network structure [32]. Here, we formalize MmCkSP as a new optimization problem, which goes beyond the conventional strategy of optimizing network density. An illustration of the optimization is shown in Fig 2. Given an input network consisting of six nodes and nine edges, if we constrain the subgraph size to 4, we can extract subgraphs {1, 2, 3, 4}, {1, 4, 5, 6} and {2, 3, 5, 6}, which cover all ties in the entire population. Solution 2 extracts {1, 2, 3, 6} and {3, 4, 5, 6}, which also include all the edges. Clearly, solution 2 in Fig 2 is more cost-effective than solution 1. Here, we design an algorithm that can find the most cost-effective solution.

Fig 2. An illustration of the optimization problem of finding multiple subgraphs that best cover the input networks.

The left part is the input network which consists of six nodes and nine edges. The right part contains the solutions by extracting multiple subgraphs. Solution 1 requires three subgraphs and solution 2 requires two subgraphs. In the solution part, grey nodes represent the included nodes and white nodes represent the omitted nodes. Dashed lines represent unobserved ties, while bold solid lines represent the extracted ties.

https://doi.org/10.1371/journal.pone.0280506.g002

The present study is organized as follows: section 2 reviews the related background, including the densest subgraph problem, strategies for its multiple-subgraph variants, and the corresponding optimization models. In section 3, we propose a memetic algorithm that optimizes the covering problem for each subgraph. Experiments with the proposed algorithm on computer-generated and real-world networks are described in section 4. Section 5 presents the conclusion and discussion.

2. Background

2.1. The densest subgraph problem and the solution approach

The densest subgraph problem (DSP) refers to how to obtain a list of members with the highest density. Given a graph G(V, E), where V = {vi} denotes the set of nodes and E = {eij} denotes the set of relationships, DSP aims to find a subgraph G′(V′, E′) whose average density, computed as |E′|/|V′|, is largest [8]. The optimization of DSP can then be formulated as (1), below. The DSP has been shown to be solvable in polynomial time [8, 33–35].

(1)  maximize |E′|/|V′|  subject to  G′(V′, E′) ⊆ G(V, E)

The average density of the extracted subgraph in DSP is associated with the subgraph size, and there is a tradeoff between density and size [36]. DSP may extract smaller subgraphs from sparser networks but larger subgraphs from denser networks. However, in real applications there is always an upper bound on the subgraph size, and one may constrain the size of dense subgraphs [36, 37]. If all subgraphs have the same (bounded) size, the problem becomes NP-hard [14, 33]; it has been investigated under various names, including the "k-cluster problem" [38–40], the "k-cardinality subgraph problem" [41], and the "densest k-subgraph problem" (DkSP) [42, 43]. This problem is formulated as (2), below. Some variants of DkSP have been proposed. If the extracted subgraph is required to be connected, the problem is referred to as the densest connected k-subgraph problem (DCkSP) [44]. In weighted networks, finding the subgraph with k nodes that has the highest sum of edge weights is called the "heaviest k-subgraph problem" (HkSP) [45]:

(2)  maximize |E′|  subject to  G′(V′, E′) ⊆ G(V, E),  |V′| = k
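For concreteness, the density measure above can be computed directly. The sketch below evaluates the average-degree density |E′|/|V′| of an induced subgraph on a small hypothetical network (not taken from the paper's figures):

```python
def density(edges, nodes):
    """Average-degree density |E'| / |V'| of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return len(induced) / len(nodes)

# Hypothetical 6-node example network (illustrative only).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(density(edges, [1, 2, 3]))    # triangle: 3 edges / 3 nodes = 1.0
print(density(edges, range(1, 7)))  # whole graph: 7 edges / 6 nodes
```

DkSP then asks for the k-node set maximizing the number of induced edges; DSP leaves |V′| free.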

DkSP actually has important interpretations in social science. A social problem related to DkSP is called the “boundary specification problem” (BSP), which aims to find a list of samples that best represents the population [26]. When nodes are excluded from the system, the observed network structure differs from the actual one. Simulations have examined features of missing actors and have shown the detrimental impact of incomplete sampling [27, 28, 46, 47]. The similarity between the sample and the complete network declines as more nodes are excluded, and missing nodes substantially affect measures related to the complete network [28, 46, 47].

Solving DkSP can help to solve BSP, as illustrated in Fig 3. The input network consists of eight nodes, and we set the subgraph size k at 7. If we exclude node 5, the network has a ring structure, which is quite different from the original structure. If node 6 is omitted by accident or for convenience, the whole network becomes disconnected. This example illustrates how a minor change in network structure can have a dramatic effect on inference about network properties as a whole [48]. Only in special cases does the sampled network have a structure similar to the complete network [49]; the solution that excludes node 8 is such a case and is also the best solution of DkSP. If we exclude node 8, most of the edges are preserved because of the principle of largest density.

Fig 3. DkSP in solving the boundary specification problem.

Grey nodes represent the included nodes and white nodes represent the omitted nodes. The dashed lines represent unobserved ties. The left network is the input network, while the three networks on the right represent the solution made up of extracted subgraphs. The upper right network excludes node 5 and becomes a ring; the middle right network excludes node 6 and becomes unconnected; the lower right network excludes node 7, and is the best solution of DkSP.

https://doi.org/10.1371/journal.pone.0280506.g003

To solve DkSP, a number of studies have focused on the use of semidefinite programming; that is, the problem is transformed into a semidefinite programming problem at each node of a branch-and-bound tree [39, 40]. Some semidefinite programming relaxations have also been used to approximate DkSP [50, 51]. Other studies formulated DkSP as a problem of rank-constrained cardinality minimization and relaxed it using the nuclear norm [52, 53]. A series of heuristic algorithms has also been employed to solve the problem. Kincaid proposed a simulated annealing algorithm and a tabu search algorithm to solve the NP-hard DkSP [54]. Macambira employed a tabu search algorithm that was shown to outperform greedy search [55]. A variable neighborhood search heuristic proposed by Brimberg et al. was shown to be effective in solving the DkSP [56].

From a sociological view, given that inappropriate boundary specification can have a detrimental effect on estimating the structure of a real population, a number of sampling methods for network surveys have also been proposed. For example, randomly selecting individuals is a common sampling method in social science investigations [27, 45, 57]. Top-down sampling (choosing the top nodes ordered by size) has also been widely used and yields estimates of network properties that are highly consistent with those obtained from whole-network analysis [58, 59].

2.2. Covering problem with multiple graphs

Finding multiple densest subgraphs has recently been discussed [23–25, 60]. Balalau et al. focused on finding a set of m subgraphs that maximizes the total density, subject to an upper bound on the pairwise Jaccard coefficient between the subgraphs' node sets (denoted the "multiple-m densest subgraphs problem", MmDSP) [23]. Nasir et al. proposed a dynamic variant of this problem, in which a collection of m disjoint subgraphs is found in a sliding window [25]. An approach similar to MmDSP was proposed by Galbrun et al., whose objective function takes both the total density and the distance between the subgraphs into account [24]. Dondi et al. addressed the approximability and computational complexity of this problem [60]. An application of MmDSP to dual networks has also been studied [61]. In this paper, we study the multiple-m densest subgraphs problem (MmDSP) proposed by Balalau et al. [23]. MmDSP aims to find a collection of m subgraphs {Gi(Vi, Ei)} for which the sum of the average densities of the subgraphs is maximized [23]. Optimization of MmDSP can be formulated as problem (3) below, where a is the upper bound on the pairwise Jaccard coefficient.

(3)  maximize Σ(i=1..m) |Ei|/|Vi|  subject to  Gi(Vi, Ei) ⊆ G(V, E);  |Vi ∩ Vj|/|Vi ∪ Vj| ≤ a for all i ≠ j

MmDSP focuses on improving the density of each subgraph but ignores the covering of the input network by the extracted subgraphs. Although techniques such as the pairwise Jaccard coefficient or the distance between subgraphs have been invoked to avoid too much overlap between the extracted subgraphs [23, 24], the literature still lacks a focus on the network covering problem. The present study aims to find an optimal method for finding multiple subgraphs that best cover the input network, denoted MmCkSP. Three key elements are associated with the sampling process: the covering of the input network (C), the bound on the subgraph size (k), and the number of subgraphs (m). Given the size of each subgraph |Vi|, practitioners need to assemble the collected ties into a network that best covers the input network (C). The objective function of MmCkSP is then formulated as (4), below. When the number of subgraphs is 1, problem (4) reduces to problem (2).

(4)  maximize C(E1, E2, …, Em)  subject to  Gi(Vi, Ei) ⊆ G(V, E),  |Vi| ≤ k,  i = 1, …, m

Here we use the fraction of extracted edges to measure C, i.e., C(E1, E2, …, Em) = |E1 ∪ E2 ∪ … ∪ Em|/|E|, the share of the input network's edges that appear in at least one extracted subgraph. An illustration of the measurement of C is shown in Fig 4. Compared with problem (3), C subsumes the role of a (the parameter in problem (3) that prevents subgraphs from being too similar). Since the objective is to maximize the total covering of the input network, the extracted subgraphs are driven to differ, and thus a is not needed in problem (4).

Fig 4. Computing the objective function.

The left part is the input network which consists of five nodes and seven edges. The right part is a solution containing three subgraphs, where grey nodes represent the included nodes and white nodes represent the omitted nodes. Dashed lines represent unobserved ties, while bold solid lines represent the extracted ties. The sum of the average density of this solution is 8/3, while the covering of the input network is 6/7, because only six edges (excluding edges between nodes 3 and 4) have been extracted.

https://doi.org/10.1371/journal.pone.0280506.g004
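The coverage objective C can be sketched in code. The example below uses a hypothetical 5-node, 7-edge network (not the exact network of Fig 4) and computes the fraction of input edges appearing in at least one extracted subgraph:

```python
def coverage(edges, subgraph_node_sets):
    """C = |E1 ∪ E2 ∪ ... ∪ Em| / |E|: the fraction of the input network's
    edges induced by at least one of the extracted node sets."""
    full = {tuple(sorted(e)) for e in edges}
    covered = set()
    for nodes in subgraph_node_sets:
        s = set(nodes)
        covered |= {e for e in full if e[0] in s and e[1] in s}
    return len(covered) / len(full)

# Hypothetical 5-node, 7-edge network (illustrative only).
E = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5)]
print(coverage(E, [{1, 2, 3}, {2, 3, 4}, {1, 2, 5}]))  # 5 of the 7 edges covered
```

Note that overlapping subgraphs contribute each covered edge only once, which is what makes an explicit overlap penalty unnecessary.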

The functional relationship between the three elements listed above is non-linear, as can be seen from the simulation of random sampling shown in Fig 5. We find that increasing the subgraph size is more helpful in promoting representativeness than increasing the number of subgraphs, because the gradient of C with respect to k is greater than the gradient of C with respect to m.

Fig 5. Relationship between the three key elements in the sampling process.

X-axis, Y-axis and Z-axis are, respectively, the subgraph size, the number of subgraphs and the covering of the input data. The simulation was performed on an ER random network with 100 nodes and 1,000 edges. The mean value of the results across 10 runs is presented in the figure.

https://doi.org/10.1371/journal.pone.0280506.g005
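A simulation in the spirit of Fig 5 can be sketched as follows; the graph model, random seed, and parameter grid here are illustrative choices rather than the paper's exact setup:

```python
import random
from itertools import combinations

def er_graph(n, m_edges, rng):
    """G(n, M)-style random graph: exactly m_edges distinct edges."""
    return rng.sample(list(combinations(range(n), 2)), m_edges)

def random_sampling_coverage(edges, n, m, k, rng):
    """Coverage achieved by m uniformly random node subsets of size k."""
    covered = set()
    for _ in range(m):
        s = set(rng.sample(range(n), k))
        covered |= {e for e in edges if e[0] in s and e[1] in s}
    return len(covered) / len(edges)

rng = random.Random(0)
E = er_graph(100, 1000, rng)  # 100 nodes, 1,000 edges, as in Fig 5
for k in (10, 20, 40):
    mean_c = sum(random_sampling_coverage(E, 100, 10, k, rng)
                 for _ in range(10)) / 10
    print(f"k = {k}: mean coverage over 10 runs = {mean_c:.3f}")
```

Each subset covers on the order of C(k, 2)/C(100, 2) of the edges in expectation, so coverage grows much faster in k than in m, consistent with the gradient comparison above.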

3. Algorithm

Given a fixed number of subgraphs (m), subgraph size (k) and an entire population of N nodes, the number of possible solutions is C(N, k)^m, i.e., an independent choice of k of the N nodes for each of the m subgraphs.

Traversing all these solutions cannot be done in polynomial time, and thus MmCkSP constitutes an NP-hard problem. Compared with the NP-hard MmDSP proposed by Balalau et al. [23], MmCkSP is more complicated because of the higher time cost of computing the covering in place of the average density, as well as the bound k on the subgraph size. In this section, we introduce a memetic algorithm, MA-MmCkSP, which combines a genetic algorithm with a heuristic local search to find multiple subgraphs that cover the input network. The memetic operation includes both long-distance and short-distance search and has proved effective in solving NP-hard problems [62, 63].
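As a rough sketch of why exhaustive search is hopeless, the following counts candidate solutions, assuming the count is C(N, k)^m, i.e., one independent choice of k distinct nodes per subgraph as in the chromosome encoding of section 3.2 (the original formula is unavailable in this copy):

```python
import math

def num_candidate_solutions(n, m, k):
    """C(N, k) ** m: an independent choice of k of the N nodes for each
    of the m subgraphs, matching the chromosome encoding of section 3.2."""
    return math.comb(n, k) ** m

# Even a modest instance is far beyond exhaustive enumeration.
print(num_candidate_solutions(100, 10, 10))
```

For N = 100, m = 10, k = 10 this exceeds 10^130, which motivates a heuristic search.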

3.1. Framework

The framework of MA-MmCkSP is shown in Algorithm 1. We first input the necessary parameters and the adjacency matrix of the input network. An initial population P is generated that consists of a list of solutions (coded as chromosomes), and then the process is repeated until the maximum number of iterations is reached or the coverage of the input network remains unchanged over 50 iterations. At each iteration, tournament selection is used to select a parent population Pparent with the highest representativeness. Next, a genetic operation on Pparent forms an offspring population Poffspring. The local-search function is then applied to find the local maximum solution for the offspring population, and an updating function constructs a new population P with better solutions. Once the loop terminates, the fittest solution is decoded and output.

Algorithm 1. Framework of MA-MmCkSP.

  1. Input: Population size (Sp), Tournament size (Stour), Mating pool size (Spool), Crossover probability (Pc), Mutation probability (Pm = 1 − Pc), Maximum number of iterations (Mi), Number of nodes (N), Number of subgraphs (m), Subgraph size (k), Adjacency matrix of the network (A).
  2. P ← Initialization (Sp, N, m, k);
  3. Repeat
  4. Pparent ← Selection (P, Stour, Spool);
  5. Poffspring ← Genetic Operation (Pparent, Pc, Pm, N, m, k);
  6. Poffspring ← Local Search (Poffspring, N, m, k);
  7. P ← Update (P, Pparent, Poffspring);
  8. Until Termination (Mi)
  9. Decode (P)
  10. Output: the best solution of the finding multiple subgraphs and its covering.

3.2. Representation and initialization

Each solution is encoded as a chromosome that consists of m substrings X = [X1, X2, …, Xm], where m is the number of subgraphs. Each substring represents the node set of a subgraph and is denoted by a list of genes x ∈ {1, 2, …, n} that specifies which nodes should be included. Fig 6 illustrates the representation for a subgraph size of 5 with the number of subgraphs set to 4, so the chromosome is formed as four substrings of five genes each. If we change the 5th gene of the first substring from 5 to 10, the new solution substitutes node 10 for node 5 in the first subgraph.

Fig 6. Illustration of the representation.

The upper two chains denote two chromosomes consisting of genes. The lower left part is an input network, which consists of ten nodes. The lower right part is two solutions of extracted multiple subgraphs corresponding to the two chromosomes, where grey nodes represent the included nodes and white nodes represent the omitted nodes. The dashed line represents unobserved ties, while the bold solid lines represent the extracted ties.

https://doi.org/10.1371/journal.pone.0280506.g006

For the initialization, we generate a population and randomly select the nodes for each substring in every chromosome.
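The representation and random initialization can be sketched as follows (function and parameter names are illustrative, not the paper's MATLAB code):

```python
import random

def init_population(pop_size, n_nodes, m, k, rng):
    """Each chromosome is m substrings, each a list of k distinct
    node ids drawn uniformly from {1, ..., n_nodes}."""
    return [[rng.sample(range(1, n_nodes + 1), k) for _ in range(m)]
            for _ in range(pop_size)]

rng = random.Random(42)
population = init_population(pop_size=4, n_nodes=10, m=3, k=5, rng=rng)
print(population[0])  # one chromosome: 3 substrings of 5 distinct node ids
```

Sampling without replacement within each substring keeps the k genes of a subgraph distinct, while different substrings may overlap, matching the encoding described above.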

3.3. Genetic operation

The genetic operation includes both crossover and mutation, the primary operations in the genetic algorithm. The algorithm performs the crossover procedure with probability Pc and the mutation procedure with probability Pm = 1 − Pc. To some extent, crossover represents long-term search, while mutation represents short-term search. An appropriate setting of Pc (and hence Pm) therefore balances long-term and short-term search, which helps to increase the efficiency of the genetic algorithm [64, 65].

In the crossover operation, two parental chromosomes are chosen using tournament selection. We first disorganize the order of the substrings for each chromosome to maintain diversity, and then find the genes that differ between the chromosomes in each substring. Given each pair of different genes, we generate a random number γ; if γ< 0.5, the gene remains unchanged; and if γ ≥ 0.5, the corresponding genes are swapped between the two chromosomes. Finally, we add the common genes and form the two offspring chromosomes. The crossover operation is illustrated in Fig 7. After changing the substring disorder, substring 3 in parent 1 and substring 2 in parent 2 are reassigned to the first substring. The genes that differ between parent 1 and parent 2 are grey. Since the generated random numbers are 0.3, 0.6, 0.9, and 0.4, respectively, for the first substring, we swap the second and third different genes between the two parental chromosomes because the corresponding γ ≥ 0.5.

Fig 7. The crossover operation.

Grey elements represent genes that differ between the two parental chromosomes. The substrings of two parent chromosomes are first disorganized, and we check all elements that differ between the two parental chromosomes. If the random number γ < 0.5, the element remains unchanged in the offspring chromosomes, while if γ ≥ 0.5, the corresponding elements are swapped.

https://doi.org/10.1371/journal.pone.0280506.g007
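A sketch of the crossover operation described above, under the assumption that differing genes are paired in the order they appear within each substring (the text does not specify the pairing):

```python
import random

def crossover(p1, p2, rng):
    """Shuffle substring order, keep genes common to both aligned substrings,
    and swap each pair of differing genes with probability 0.5."""
    c1 = [list(s) for s in rng.sample(p1, len(p1))]
    c2 = [list(s) for s in rng.sample(p2, len(p2))]
    for s1, s2 in zip(c1, c2):
        common = set(s1) & set(s2)
        d1 = [g for g in s1 if g not in common]
        d2 = [g for g in s2 if g not in common]
        for i in range(min(len(d1), len(d2))):
            if rng.random() >= 0.5:       # swap this pair of differing genes
                d1[i], d2[i] = d2[i], d1[i]
        s1[:] = sorted(common) + d1       # common genes are always inherited
        s2[:] = sorted(common) + d2
    return c1, c2

rng = random.Random(7)
parent1 = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]
parent2 = [[1, 3, 5, 7, 9], [6, 7, 8, 9, 10]]
child1, child2 = crossover(parent1, parent2, rng)
print(child1, child2)
```

Because a gene that differs in one substring cannot appear anywhere in the other substring of the pair, swapping never creates duplicate genes within a substring.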

In the mutation operation, we randomly select an element xi in each substring and replace it with a randomly chosen node number that does not already appear in that substring.
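A minimal sketch of this mutation step (names illustrative):

```python
import random

def mutate(chrom, n_nodes, rng):
    """In each substring, replace one randomly chosen gene with a node id
    not already present in that substring."""
    out = []
    for sub in chrom:
        sub = list(sub)
        i = rng.randrange(len(sub))
        candidates = [v for v in range(1, n_nodes + 1) if v not in sub]
        if candidates:                    # a substring may already use all nodes
            sub[i] = rng.choice(candidates)
        out.append(sub)
    return out

rng = random.Random(3)
print(mutate([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], n_nodes=10, rng=rng))
```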

3.4. Local search

Local search is effective in reducing inefficient exploration and not only improves the accuracy but also speeds up the convergence [64–66]. Here we employ a hill-climbing technique, presented as Algorithm 2. We check each element in a chromosome and replace the original gene with a node number that increases the objective function (the coverage of the input network) on substitution. The chromosome can then reach a local optimum.

Algorithm 2. Local Search in MA-MmCkSP.

  1. Input: The best offspring chromosome (Coffspring), number of nodes (N), number of subgraphs (m) and subgraph size (k);
  2. For i = 1; i ≤ k × m; i++
  3.  Cnew = Coffspring
  4.  For j = 1; j ≤ N; j++
  5.   Cnew(i) = j;
  6.   If Obj(Cnew) > Obj(Coffspring)
  7.    Coffspring(i) = j;
  8.   End If
  9.  End For
  10. End For
  11. Output: Coffspring
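A runnable sketch of this hill climbing, using edge coverage as the objective Obj; here genes are additionally kept distinct within a substring, an assumption not stated explicitly in the pseudocode:

```python
def edge_cover_count(edges, chrom):
    """Objective: number of input edges covered by the chromosome's subgraphs."""
    covered = set()
    for sub in chrom:
        s = set(sub)
        covered |= {e for e in edges if e[0] in s and e[1] in s}
    return len(covered)

def local_search(chrom, n_nodes, edges):
    """Hill climbing in the spirit of Algorithm 2: for each gene position,
    keep the replacement node id that most improves the coverage objective."""
    chrom = [list(s) for s in chrom]
    for sub in chrom:
        for i in range(len(sub)):
            original = sub[i]
            best, best_obj = original, edge_cover_count(edges, chrom)
            for j in range(1, n_nodes + 1):
                if j in sub:              # keep genes distinct within a substring
                    continue
                sub[i] = j
                obj = edge_cover_count(edges, chrom)
                if obj > best_obj:
                    best, best_obj = j, obj
                sub[i] = original
            sub[i] = best
    return chrom

# Hypothetical toy network: a triangle plus a pendant node.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(local_search([[2, 4]], n_nodes=4, edges=edges))
```

Starting from the non-covering pair {2, 4}, the search climbs to a pair that induces an edge; the objective is non-decreasing at every accepted move.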

3.5. Complexity analysis

Given a network with N nodes, number of subgraphs m and subgraph size k, the time complexity of MA-MmCkSP is analyzed as follows. At each iteration, we need to execute the crossover operation Spool/2 times (where Spool is the size of the mating pool) and the mutation operation at most Spool times. Since computing the covering costs O(mk), the time complexity of the genetic operation is O(mkSpool). In the local search procedure, finding the best neighbor for each gene needs O(Nmk), and thus finding the locally optimal chromosome costs O(Nm²k²). Since O(mkSpool) < O(Nm²k²), the total time complexity of the proposed algorithm is O(Nm²k²).

4. Results

In this section, we show the effectiveness and efficiency of MA-MmCkSP on a computer-generated random network. We also run the procedure on various real-world networks and interpret the optimal method in the social context. The experiments were carried out on a computer with a 2.11 GHz CPU and 16 GB of memory running Windows 10, with the procedures implemented in MATLAB. Table 1 shows the parameter values used in the experiments, which gave the best performance for the proposed algorithms.

4.1. Results for computer-generated networks

To assess the effectiveness of MA-MmCkSP, we compare it with random extraction (RE), a big-degree sampling method in which high-degree nodes have a higher probability of being extracted (BD-MmCkSP), a greedy algorithm based on the local-search operation (GR-MmCkSP), and the genetic algorithm without local search (GA-MmCkSP). The five methods were run on an ER random network consisting of 100 nodes and 1,000 edges. Fig 8A and 8B present the maximum and mean values over 10 runs, comparing the covering of the input network for different settings of subgraph size and number of subgraphs. The figures show that MA-MmCkSP performs best, GR-MmCkSP second best, and GA-MmCkSP third, while BD-MmCkSP and RE perform worst. From Fig 8A we see that all of the network's ties can be collected using only 10 subgraphs once the subgraph size reaches 37, which is much smaller than the theoretical maximum. We conclude that MA-MmCkSP is effective in solving the optimization problem of sampling in multiple social surveys. The slopes of the curves in Fig 8A are much steeper than those in Fig 8B, which suggests that the subgraph size matters more than the number of subgraphs.

Fig 8. Covering the input random network with RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP.

(a) shows the result for different subgraph sizes given the number of subgraphs m = 10; (b) shows the result for different numbers of subgraphs given the subgraph size k = 10. The solid line with stars represents the maximum value of the ten-runs experiment, while the dashed-dotted line with the circles represents the mean value.

https://doi.org/10.1371/journal.pone.0280506.g008

We computed the average density of the extracted solutions corresponding to Fig 8. Fig 9A and 9B present the mean values of the average density over 10 runs for different settings of subgraph size and number of subgraphs. The figures show that the subgraphs extracted by MA-MmCkSP are densest, those from GR-MmCkSP second densest, and those from GA-MmCkSP third densest, while the subgraphs from BD-MmCkSP and RE are sparsest. The results suggest that the optimal extracted subgraphs tend to be denser, so the proposed algorithm can provide a new alternative for solving the multiple-m densest subgraphs problem (MmDSP). In addition, we find that the density of the subgraphs increases as the subgraph size increases but decreases as the number of subgraphs increases. There is a tradeoff among the subgraph density, the subgraph size and the number of subgraphs.

Fig 9. Average density of the extracted subgraphs for RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP.

(a) shows the result for different subgraph sizes given the number of subgraphs m = 10; (b) shows the result for different numbers of subgraphs given the subgraph size k = 10.

https://doi.org/10.1371/journal.pone.0280506.g009

We also compared the results obtained using MA-MmCkSP with those of GA-MmCkSP at each iteration, and we see that the memetic operation is more efficient. Fig 10 shows the results for the two methods with different settings of subgraph size and number. MA-MmCkSP performs much better and converges faster than GA-MmCkSP. The difference is especially apparent in Fig 10D, where MA-MmCkSP reaches a covering of 100% at the first iteration, while GA-MmCkSP converges only after the 80th iteration and even then does not reach 100%.

Fig 10. Covering of the input random network with GA-MmCkSP and MA-MmCkSP for each iteration.

Figures (a)-(d) are, respectively, the results for the settings k = 10 and m = 10, k = 10 and m = 30, k = 30 and m = 10, k = 30 and m = 30. The solid line with the stars represents the maximum value of the ten-runs experiment, while the dashed-dotted line with the circles represents the mean value. Grey and black, respectively, represent the results of GA-MmCkSP and MA-MmCkSP.

https://doi.org/10.1371/journal.pone.0280506.g010

To characterize the extracted nodes, we compute the correlation between the number of times each node is selected by MA-MmCkSP and the node's network centrality, as shown in Fig 11. Comparing Figs 8A and 11A, when the network cannot be completely collected (i.e., the subgraph size is smaller than 37), the probability of a node being selected is highly correlated with its centrality. The correlation decreases dramatically once the boundary size surpasses this critical value. Fig 11B shows a similar result, in that central nodes are more likely to be included repeatedly. The results suggest that including central nodes helps to achieve network covering.

Fig 11. Correlation between the number of times each node is selected and the network centrality, corresponding to Fig 8.

(a) shows the result for different subgraph sizes given the number of subgraphs m = 10; (b) shows the result for different number of subgraphs given the subgraph size k = 10. The different curves represent the correlations with degree centrality, betweenness centrality and closeness centrality.

https://doi.org/10.1371/journal.pone.0280506.g011

We conducted a sensitivity analysis of the proposed algorithm on a network with 1,000 nodes and 10,000 edges. The results show that MA-MmCkSP still performs best in maximizing the coverage of this larger network, as shown in Fig 12. We also test the performance of MA-MmCkSP on networks with different average densities and find that, for a fixed number of subgraphs and subgraph size, the extracted subgraphs cover less of the network as the density of the input network increases, as shown in Fig 13. A larger subgraph size is required if we aim to investigate a denser social network.

Fig 12. Covering of the 1000-node input random network acquired by RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP for different subgraph sizes given the number of subgraphs m = 10.

https://doi.org/10.1371/journal.pone.0280506.g012

Fig 13. Covering of the input random 100-node network with different average densities given the number of subgraphs m = 10 and the subgraph size k = 10, 20 and 30 acquired by MA-MmCkSP.

https://doi.org/10.1371/journal.pone.0280506.g013

4.2. Results for real-world networks

In this section, we test RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP on six real-world networks: Zachary’s Karate Club network, the Bottlenose Dolphins network, the American College Football network, and three migrant workers’ networks from the ADS, YDSC and WH companies in Shenzhen, China.

Zachary’s Karate Club network consists of 34 karate-club members and 78 social ties observed by Zachary over two years [67]. The Bottlenose Dolphins network was constructed by Lusseau [68], who observed 62 bottlenose dolphins and their 159 connections over seven years. The American College Football network was constructed from the schedule of Division I games during the year 2000 football season; it consists of 115 nodes that represent teams and 616 edges that represent the regular-season games between the teams they connect [7]. The last three examples are networks of migrant workers in the ADS, YDSC, and WH companies investigated by the New Urbanization and Sustainable Development Group of Xi’an Jiaotong University [65]. These networks were constructed from a single-item question that asked each participant to enumerate the individuals with whom they are often in contact at work. The ADS network consists of 165 nodes and 1196 edges, the YDSC network of 70 nodes and 272 edges, and the WH network of 193 nodes and 887 edges. The survey involved both network-level and individual-level investigations.

For each network, the number of subgraphs is m = 5, 10, or 30, and the subgraph size is chosen from k = 0.1N, 0.2N, 0.3N, 0.4N, or 0.5N, where N is the network size. Table 2 shows the mean and maximum values of the covering over 10 runs produced by RE, BD-MmCkSP, GR-MmCkSP, GA-MmCkSP and MA-MmCkSP with different values of m and k. We find that MA-MmCkSP performs much better than the other algorithms. Moreover, the subgraph size k plays a much more important role in the multiple extractions: a small increase in k can produce a large improvement in covering. Even random extraction is able to cover all the edges when k reaches 0.4N.
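For reference, the covering C reported throughout is the fraction of the input network's edges whose endpoints both fall inside at least one extracted subgraph. A minimal sketch of this measure, assuming networkx (the two size-10 subgraphs are illustrative, not optimized):

```python
import networkx as nx

def covering(G, subgraphs):
    """Fraction of G's edges captured by at least one subgraph.

    An edge (u, v) counts as covered when some subgraph contains
    both u and v, i.e. the edge appears in that induced subgraph.
    """
    covered = set()
    for S in subgraphs:
        S = set(S)
        covered.update((u, v) for u, v in G.edges if u in S and v in S)
    return len(covered) / G.number_of_edges()

G = nx.karate_club_graph()
# Two illustrative (not optimized) subgraphs of size 10 each.
C = covering(G, [range(0, 10), range(24, 34)])
print(f"C = {C:.3f}")
```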

Table 2. Mean and maximum values of the covering for the real-world networks.

https://doi.org/10.1371/journal.pone.0280506.t002

By decoding the best chromosomes generated by the proposed algorithm, we can extract the specific sampling solution for each subgraph. Fig 14 presents one of the best extraction methods for Zachary’s Karate Club network with k = 0.3N ≈ 10 and m = 5. This solution collects all the edges, i.e., the covering C = 100%. In Zachary’s Karate Club network, nodes 1, 2, 3, 33 and 34 are the key individuals with the highest centrality, and we find that at least two central nodes are needed to include as many edges as possible. However, no solution includes all five central nodes within the same subgraph. This is because a pair of central nodes may be disconnected, in which case including both of them contributes no edge between them. For example, investigating only nodes 1, 3 and 34 cannot collect any edges, although these nodes occupy important positions in the network. This suggests that including central nodes is important, but extracting only the central nodes may not yield the best coverage.
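Which edges a candidate subgraph actually collects can be checked by inducing it on the network. A small sketch with networkx, whose built-in karate_club_graph labels the members 0 to 33 rather than 1 to 34:

```python
import networkx as nx

G = nx.karate_club_graph()  # networkx labels members 0..33 (paper uses 1..34)

def collected_edges(G, nodes):
    """Edges of G whose endpoints are both in the chosen node set."""
    return list(G.subgraph(nodes).edges)

# A subgraph built only from the high-centrality nodes still misses
# most of the network's 78 edges.
central = [0, 1, 2, 32, 33]  # members 1, 2, 3, 33, 34 in the paper
print(collected_edges(G, central))
```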

Fig 14. The sampling solution of Zachary’s Karate Club network with k = 0.3N and m = 5.

The upper part is the topology of Zachary’s Karate Club network. The lower part is the solution of extracting multiple subgraphs. In the solution part, grey nodes represent included nodes and white nodes represent omitted nodes. The dashed lines represent the unobserved ties, while the bold solid lines represent the extracted ties.

https://doi.org/10.1371/journal.pone.0280506.g014

We find that the optimal method is associated with the community structure. Zachary’s Karate Club network is a typical network with characteristic community structure [67]: it can be naturally divided into two communities, where edges are denser within the same community and sparser between different communities. Fig 15 presents the optimized solution of Zachary’s Karate Club network with k = 0.2N ≈ 7 and m = 5. This solution cannot collect all the edges (C = 0.769) because of the limited subgraph size. Most of the extracted nodes from the same community are placed in a single subgraph, which suggests that collecting nodes within the same community in each subgraph helps to collect as many edges as possible; we call this the “community collecting method” (CCM). However, edges between different communities can be hard to detect using CCM. Therefore, CCM is appropriate when the subgraph size or number is so limited that the optimized solution cannot collect all the edges (in other words, C < 1). Another limitation of CCM is that it may not work effectively on networks without community structure (modularity Q < 0.3) [7].
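Whether an input network is modular enough for CCM (Q ≥ 0.3) can be checked before extraction. The following is a sketch using networkx's community utilities; the greedy modularity partition is an assumption here, and any community detection method could be substituted:

```python
import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    modularity,
)

# Detect communities, then score the partition's modularity Q.
G = nx.karate_club_graph()
parts = greedy_modularity_communities(G)
Q = modularity(G, parts)
print(f"Q = {Q:.3f}, communities = {len(parts)}")

# Heuristic from the text: CCM is expected to help when Q >= 0.3.
ccm_applicable = Q >= 0.3
```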

Fig 15. The sampling solution of Zachary’s Karate Club network with k = 0.2N and m = 5.

The club members are divided into two communities. Grey nodes represent included nodes, and white nodes represent omitted nodes. The dashed lines represent the unobserved ties, while the bold solid lines represent the extracted ties.

https://doi.org/10.1371/journal.pone.0280506.g015

In order to test the performance of CCM, we ran MA-MmCkSP on the benchmark networks proposed by Lancichinetti et al. [69]. Each network consists of 128 nodes with an average degree of 16, and the nodes are evenly assigned one of four clustering attributes {1, 2, 3, 4}. We introduce a mixing parameter that denotes the fraction of a node’s edges linking to nodes with different clustering attributes; a higher mixing parameter corresponds to a smaller modularity of the input network. We generated nine networks with mixing parameter values ranging from 0 to 0.5. Fig 16 shows the covering results for different mixing parameters given the number of subgraphs m = 10 and the subgraph sizes k = 10, 20 and 30. We find that the extracted subgraphs cover less of the network as the mixing parameter increases. This is because the optimal solution is based on CCM, and CCM performs less efficiently as the community structure of the input network weakens.
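Networks of this kind can be approximated with a planted-partition generator, deriving the within- and between-group edge probabilities from the mixing parameter so that the expected degree stays at 16. This is a sketch assuming networkx's planted_partition_graph, not the generator used in the original experiments:

```python
import networkx as nx

def benchmark(mu, n=128, groups=4, avg_deg=16, seed=0):
    """Planted-partition stand-in for the 128-node benchmark.

    mu is the mixing parameter: the expected fraction of a node's
    edges that leave its own group. Edge probabilities are derived
    so that the expected degree stays at avg_deg.
    """
    size = n // groups           # 32 nodes per group
    k_out = mu * avg_deg         # expected external degree
    k_in = avg_deg - k_out       # expected internal degree
    p_in = k_in / (size - 1)
    p_out = k_out / (n - size)
    return nx.planted_partition_graph(groups, size, p_in, p_out, seed=seed)

G = benchmark(mu=0.2)
print(G.number_of_nodes(), 2 * G.number_of_edges() / G.number_of_nodes())
```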

Fig 16. Covering of the benchmark networks for different mixing parameters given the number of subgraphs m = 10 and the subgraph size k = 10, 20 and 30 acquired by MA-MmCkSP.

https://doi.org/10.1371/journal.pone.0280506.g016

5. Conclusion and discussion

The present study provides a new perspective on the multiple densest subgraph problem. We advance research on this topic by formulating the covering of an input network as an optimization problem and proposing a model that maximizes the covering of the observed network through the extraction of multiple subgraphs. A memetic algorithm that combines a genetic algorithm with local search optimizes the extraction of each independent subgraph.

The proposed algorithm solves the optimization problem effectively. Compared to increasing the number of extractions, increasing the subgraph size is more helpful in improving the coverage of the network. Including nodes with higher centrality is necessary, but investigating only those nodes cannot fully reproduce the input network structure, because the many edges connecting ordinary members (nodes with lower centrality) are easily ignored. When the subgraph size or number is constrained, the community collecting method, which includes nodes within the same community in each subgraph, can effectively enhance the covering. A practical suggestion is to identify the potential community structure of the research objects before conducting the extractions.

From a sociological perspective, previous research has highlighted the effectiveness of random sampling [27, 45, 57], but this method is not effective when surveys are conducted repeatedly, because random sampling across multiple surveys leads to redundancy: an edge may be detected many times. The top-down sampling method (choosing the top nodes ordered by size) is also of limited value in repeated surveys, because edges connecting nodes of different rank sizes cannot be collected. Including central nodes helps to enhance the covering, but including only the representative nodes may not lead to a representative result. Moreover, node size is difficult to estimate precisely in social networks: before acquiring the whole structure of a network, it is difficult to judge whether an individual is a central or marginal member. An illustration is presented in Fig 17, which shows how different methods differ in recognizing core nodes. The network in Fig 17 is the ADS migrant workers’ network. By asking “How many friends or acquaintances do you have in Shenzhen (the city where ADS is located)?” in the individual-level questionnaire, we can divide the company members into “big-size” individuals, who have 30 or more friends or acquaintances, and “small-size” individuals, who have fewer than 30 (see Fig 17A). By applying the core-periphery model [5] to the whole network, we can also identify big-size and small-size individuals, as shown in Fig 17B. This classification is derived using Eq (5), where αij is the relationship between nodes i and j, ci is node i’s attribute (core or periphery), and “●” indicates a missing value; treating the off-diagonal regions of αij as missing data helps maximize density in the core and minimize density in the periphery. The inconsistency between (a) and (b) suggests that top-down sampling may choose some fake big-size nodes, which undermines the accuracy of network estimation.

αij = { 1, if ci = CORE and cj = CORE; ●, if ci ≠ cj; 0, if ci = cj = PERIPHERY }  (5)
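A minimal sketch of the discrete core/periphery fit behind Eq (5), assuming networkx: a candidate partition is scored against the ideal pattern, with core-periphery pairs (the "●" entries) excluded as missing data. The hub list below is illustrative, not a fitted partition:

```python
import networkx as nx

def cp_fit(G, core):
    """Score a core/periphery assignment against the ideal pattern:
    core-core pairs should be tied (1) and periphery-periphery pairs
    untied (0); core-periphery pairs are treated as missing (the
    '●' entries) and do not enter the score."""
    core = set(core)
    hits = total = 0
    nodes = list(G.nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if u in core and v in core:
                total += 1
                hits += G.has_edge(u, v)       # want an edge here
            elif u not in core and v not in core:
                total += 1
                hits += not G.has_edge(u, v)   # want no edge here
    return hits / total

G = nx.karate_club_graph()
hubs = [0, 1, 2, 32, 33]  # high-degree members as a trial core
print(f"fit = {cp_fit(G, hubs):.3f}")
```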
Fig 17. Identification of central and marginal members of the ADS migrant workers’ network using different methods.

(a) uses the number of friends or acquaintances in the individual-level questionnaire; (b) uses the core-periphery model. The dark nodes are central members, while the white nodes are marginal members.

https://doi.org/10.1371/journal.pone.0280506.g017

In most natural settings, practitioners have no prior knowledge of the structure of the actual network. In order to collect all the potential edges, practitioners should assume that the actual network is completely connected, i.e., that every pair of nodes is linked; the proposed algorithm can also be applied to covering a completely connected network. Despite the merits of these new proposals, the present study has some limitations. The algorithm may sometimes be trapped in a local maximum, and we plan to design a more intelligent algorithm in the future. The objective function in this paper (the coverage of the input network) is the number of detected edges divided by the total number of edges, but other indices, such as centrality, might also be employed in the optimization model. A meaningful analysis of social networks requires both individual-level and network-level investigations, so an index for measuring the covering of multiple subgraphs that considers both individual and relational attributes remains to be designed.

References

  1. Marsden PV. Network data and measurement. Annu. Rev. Sociol. 1990; 16(1): 435–463.
  2. Kumar R, Raghavan P, Rajagopalan S, Tomkins A. Trawling the web for emerging cyber-communities. Comput. Netw. 1999; 31(11–16): 1481–1493.
  3. Alba RD. A graph-theoretic definition of a sociometric clique. J. Math. Sociol. 1973; 3: 113–126.
  4. Wasserman S, Faust K. Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press; 1994.
  5. Borgatti SP, Everett MG. Models of core/periphery structures. Soc. Networks 2000; 21(4): 375–395.
  6. Barabási AL, Albert R. Emergence of scaling in random networks. Science 1999; 286(5439): 509–512. pmid:10521342
  7. Girvan M, Newman ME. Community structure in social and biological networks. P. Natl. Acad. Sci. USA 2002; 99(12): 7821–7826. pmid:12060727
  8. Goldberg AV. Finding a maximum density subgraph. Berkeley, University of California; 1984.
  9. Langston MA, Lin L, Peng X, Baldwin NE, Symons CT, Zhang B, et al. A combinatorial approach to the analysis of differential gene expression data. Methods of Microarray Data Analysis, Springer; 2005.
  10. Fratkin E, Naughton BT, Brutlag DL, Batzoglou S. MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinform. 2006; 22(14): 150–157. pmid:16873465
  11. Du X, Jin R, Ding L, Lee VE, Thornton JH. Migration motif: a spatial-temporal pattern mining approach for financial markets. P. 15th ACM SIGKDD Int. Conf. Data. Min. Knowl. Disc. 2009; 1135–1144.
  12. Tang L, Liu H. Graph mining applications to social network analysis. Managing and Mining Graph Data, Springer; 2010.
  13. Lee VE, Ruan N, Jin R, Aggarwal C. A survey of algorithms for dense subgraph discovery. Managing and Mining Graph Data, Springer; 2010.
  14. Feige U, Peleg D, Kortsarz G. The dense k-subgraph problem. Algorithmica 2001; 29(3): 410–421.
  15. Tsourakakis C, Bonchi F, Gionis A, Gullo F, Tsiarli M. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. P. 19th ACM SIGKDD Int. Conf. Knowl. Disc. Data. Min. 2013; 104–112.
  16. Gibson D, Kumar R, Tomkins A. Discovering large dense subgraphs in massive graphs. P. 31st Int. Conf. VLDB 2005; 721–732.
  17. Henzinger MR, Motwani R, Silverstein C. Challenges in web search engines. ACM SIGIR Forum 2002; 36(2): 11–22.
  18. Bonchi F, Gullo F, Kaltenbrunner A, Volkovich Y. Core decomposition of uncertain graphs. P. 20th ACM SIGKDD Int. Conf. Knowl. Disc. Data. Min. 2014; 1316–1325.
  19. Sozio M, Gionis A. The community-search problem and how to plan a successful cocktail party. P. 16th ACM SIGKDD Int. Conf. Knowl. Disc. Data. Min. 2010; 939–948.
  20. Tsourakakis C. The k-clique densest subgraph problem. P. 24th Int. Conf. World Wide Web 2015; 1122–1132.
  21. Angel A, Koudas N, Sarkas N, Srivastava D, Svendsen M, Tirthapura S. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. VLDB J. 2014; 23(2): 175–199.
  22. Gajewar A, Das Sarma A. Multi-skill collaborative teams based on densest subgraph. P. SIAM Int. Conf. Data Min. 2012; 165–176.
  23. Balalau OD, Bonchi F, Chan TH, Gullo F, Sozio M. Finding subgraphs with maximum total density and limited overlap. P. 8th ACM Int. Conf. Web Search Data Min. 2015; 379–388.
  24. Galbrun E, Gionis A, Tatti N. Top-k overlapping densest subgraphs. Data. Min. Knowl. Disc. 2016; 30(5): 1134–1165.
  25. Nasir MAU, Gionis A, Morales GDF, Girdzijauskas S. Fully dynamic algorithm for top-k densest subgraphs. P. ACM Conf. Inform. Knowl. Manage. 2017; 1817–1826.
  26. Laumann EO, Marsden PV, Prensky D. The boundary specification problem in network analysis. Res. Methods Soc. Netw. Anal. 1989; (61).
  27. Borgatti SP, Carley KM, Krackhardt D. On the robustness of centrality measures under conditions of imperfect data. Soc. Networks 2006; 28: 124–136.
  28. Kossinets G. Effects of missing data in social networks. Soc. Networks 2006; 28(3): 247–268.
  29. Kempe D, Kleinberg J, Tardos É. Maximizing the spread of influence through a social network. P. 9th Int. Conf. ACM SIGKDD 2003; 137–146.
  30. Lawrence E, Michailidis G, Nair VN, Xi B. Network tomography: A review and recent developments. Front. Stat. 2006; 345–366.
  31. Cheng Z, Xin Y, Cao J, Yu X, Lu G. Selecting pinning nodes to control complex networked systems. Sci. China Technol. Sci. 2018; 61(10): 1537–1545.
  32. McCarty C, Killworth PD, Rennell J. Impact of methods for reducing respondent burden on personal network structural measures. Soc. Networks 2007; 29(2): 300–315.
  33. Asahiro Y, Hassin R, Iwama K. Complexity of finding dense subgraphs. Discrete Appl. Math. 2002; 121(1–3): 15–26.
  34. Charikar M. Greedy approximation algorithms for finding dense components in a graph. Int. Workshop Approx. Algorithms Comb. Optim. 2000; 84–95.
  35. Kawase Y, Miyauchi A. The densest subgraph problem with a convex/concave size function. Algorithmica 2018; 80(12): 3461–3480.
  36. Wang Z, Chu L, Pei J, Al-Barakati A, Chen E. Tradeoffs between density and size in extracting dense subgraphs: A unified framework. IEEE/ACM Int. Conf. Adv. Soc. Networks Anal. Min. 2016; 41–48.
  37. Dunbar RI. Neocortex size as a constraint on group size in primates. J. Hum. Evol. 1992; 22(6): 469–493.
  38. Corneil DG, Perl Y. Clustering and domination in perfect graphs. Discrete Appl. Math. 1984; 9(1): 27–39.
  39. Malick J, Roupin F. Solving k-cluster problems to optimality with semidefinite programming. Math. Program. 2012; 136(2): 279–300.
  40. Krislock N, Malick J, Roupin F. Computational results of a semidefinite branch-and-bound algorithm for k-cluster. Comput. Oper. Res. 2016; 66: 153–159.
  41. Bruglieri M, Ehrgott M, Hamacher HW, Maffiolia F. An annotated bibliography of combinatorial optimization problems with fixed cardinality constraints. Discrete Appl. Math. 2006; 154(9): 1344–1357.
  42. Dondi R, Hermelin D. Computing the k densest subgraphs of a graph. arXiv preprint arXiv:2002.07695. 2020.
  43. Sotirov R. On solving the densest k-subgraph problem on large graphs. Optim. Method. Softw. 2020; 35(6): 1160–1178.
  44. Chen X, Hu X, Wang C. Finding connected k-subgraphs with high density. Inform. Comput. 2017; 256: 160–173.
  45. Letsios M, Balalau OD, Danisch M, Orsini E, Sozio M. Finding heaviest k-subgraphs and events in social media. IEEE 16th ICDMW 2016; 113–120.
  46. Costenbader E, Valente TW. The stability of centrality measures when networks are sampled. Soc. Networks 2003; 25(4): 283–307.
  47. Smith JA, Moody J. Structural effects of network sampling coverage I: Nodes missing at random. Soc. Networks 2013; 35(4): 652–668.
  48. Krebs VE. Mapping networks of terrorist cells. Connect. 2002; 24(3): 43–52.
  49. Stumpf MP, Wiuf C, May RM. Subnets of scale-free networks are not scale-free: sampling properties of networks. P. Natl. Acad. Sci. USA 2005; 102(12): 4221–4224.
  50. Ye Y, Zhang J. Approximation of dense-n/2-subgraph and the complement of min-bisection. J. Global Optim. 2003; 25(1): 55–73.
  51. Rendl F. Semidefinite relaxations for partitioning, assignment and ordering problems. Ann. Oper. Res. 2016; 240(1): 119–140.
  52. Ames BP. Guaranteed recovery of planted cliques and dense subgraphs by convex relaxation. J. Optimiz. Theory App. 2015; 167(2): 653–675.
  53. Li X, Chen Y, Xu J. Convex relaxation methods for community detection. Stat. Sci. 2021; 36(1): 2–15.
  54. Kincaid RK. Good solutions to discrete noxious location problems via metaheuristics. Ann. Oper. Res. 1992; 40(1): 265–281.
  55. Macambira EM. An application of tabu search heuristic for the maximum edge-weighted subgraph problem. Ann. Oper. Res. 2002; 117(1): 175–190.
  56. Brimberg J, Mladenović N, Urošević D, Ngai E. Variable neighborhood search for the heaviest k-subgraph. Comput. Oper. Res. 2009; 36(11): 2885–2891.
  57. Galaskiewicz J. Estimating point centrality using different network sampling techniques. Soc. Networks 1991; 13(4): 347–386.
  58. Alderson AS, Beckfield J, Sprague-Jones J. Intercity relations and globalisation: the evolution of the global urban hierarchy, 1981–2007. Urban Stud. 2010; 47(9): 1899–1923.
  59. Pažitka V, Wójcik D. The network boundary specification problem in the global and world city research: investigation of the reliability of empirical results from sampled networks. J. Geogr. Syst. 2021; 23(1): 97–114.
  60. Dondi R, Hosseinzadeh MM, Mauri G, Zoppis I. Top-k overlapping densest subgraphs: approximation algorithms and computational complexity. J. Comb. Optim. 2021; 41(1): 80–104.
  61. Dondi R, Guzzi PH, Hosseinzadeh MM. Top-k connected overlapping densest subgraphs in dual networks. Int. Conf. Complex Netw. Appl. 2020; 585–596.
  62. Ong YS, Lim MH, Chen X. Memetic Computation: Past, Present & Future Research Frontier. IEEE Comput. Intell. M. 2010; 5(2): 24.
  63. Neri F, Cotta C. Memetic algorithms and memetic computing optimization: A literature review. Swarm. Evol. Comput. 2012; 2: 1–14.
  64. Du H, He X, Wang J, Feldman MW. Reversing structural balance in signed networks. Physica A 2018; 503: 780–792.
  65. He X, Du H, Xu X, Du W. An energy function for computing structural balance in fully signed network. IEEE T. Computat. Soc. Syst. 2020; 7(3): 696–708.
  66. Wang S, Gong M, Du H, Ma L, Miao Q, Du W. Optimizing dynamical changes of structural balance in signed network based on memetic algorithm. Soc. Networks 2016; 44: 64–73.
  67. Zachary WW. An information flow model for conflict and fission in small groups. J. Anthropol. Res. 1977; 33(4): 452–473.
  68. Lusseau D. The emergent properties of a dolphin social network. P. R. Soc. Lond. B-Biol. Sci. 2003; 270(2): 186–188. pmid:14667378
  69. Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys. Rev. E 2008; 78(4): 046110. pmid:18999496