A propagation-based seed-centric local community detection for multilayer environment: The case study of colon adenocarcinoma

Regardless of all efforts on community discovery algorithms, it is still an open and challenging subject in network science. Recognizing communities in a multilayer network, where there are several layers (types) of connections, is even more complicated. Here, we concentrated on a specific type of communities called seed-centric local communities in the multilayer environment and developed a novel method based on the information cascade concept, called PLCDM. Our simulations on three datasets (real and artificial) signify that the suggested method outstrips two known earlier seed-centric local methods. Additionally, we compared it with other global multilayer and single-layer methods. Eventually, we applied our method on a biological two-layer network of Colon Adenocarcinoma (COAD), reconstructed from transcriptomic and post-transcriptomic datasets, and assessed the output modules. The functional enrichment consequences infer that the modules of interest hold biomolecules involved in the pathways associated with the carcinogenesis.


Introduction
In the network analysis fields, such as social networks and network biology, community detection is an essential topic in discovering strongly interconnected objects with similar identity or behavior [1,2]. The problem of unfolding community structures is known to be computationally difficult to solve, while its approximate solutions have to cope with accuracy and efficiency issues that become more severe when the network gets larger [3]. It was proven that these communities (modules) play vital roles in the performance of the whole system and their identification yields remarkable empirical insights about complex systems [4][5][6][7][8]. For example, in a metabolic network, communities correspond to the biological functions of the cell [9,10]. In the same way, in a web graph, communities exhibit topics of interest [11,12]. Presently, there are varieties of methods to explore communities in networks with dissimilar topological PLCDM implementation is available on our GitHub page (https://github.com/LBBSoft/ PLCDM).
In the roadmap of this study, as depicted in Fig 1, two stages of implementation were defined: first, evaluation of the PLCDM, and second, a case study of the application of PLCDM on the colon cancer data to identify related biomolecules involved in carcinogenesis. Besides, in the first stage, to prepare gold-standard data for the verification of the suggested methodology, we designed two additional algorithms: 1-modular graph generation, and 2multilayer network simulation from a single-layer graph, which are both reported in the Supporting information.

Community detection
Herein, we presented a multilayer seed-based local community detection approach based on a biased random walk with a random restart process, simulating information diffusion from a predefined seed node. From this viewpoint, in a multilayer graph G (Eq 1) with n = |V| nodes and k = |L| layers, L = {l 1 , l 2 ,. . .,l k }, indicating different types of interactions (undirected networks), a local community around the seed node s is defined as a set of nodes having a considerable extent of information streamed from s and shared by layers.
To achieve such a community, utilizing the information diffusion idea, we established a graph random walker that moves randomly to neighbor nodes and scores them in each step. Then the community is exploited based on the scores that each node gained. To implement this kind of random walk, we have considered three criteria that are formulated in Eqs 2 to 5. First, we predisposed the walker to move to neighbors that have a large number of neighbors in common with the primary node. It originates from social relationships that the greater the number of mutual friends between two individuals, the stronger friendship (higher diffusion probability) between them. If we were to describe this phenomenon in a biological subject as protein-protein interaction (PPI) network, the two receptors that bind to a large number of common ligands are similar in 3-D structure and conformation and have analogous binding sites. Therefore, they may have the same function and activity and are likely to be in the same biological pathway (module). Second, the walker should not go in the direction of nodes of other communities that have common neighbors with the host node. Without this constraint, in a scale-free topology, the walker tends to move toward other hub nodes (which have more common neighbors and their degree is high). In another biological simulation, this constraint prevents housekeeping genes from being part of a particular pathway (community). Third, selecting common connections is another critical point that leads to a correct calculation. For example, if there are different relationships between two people (Facebook, Twitter, email, etc.), there will be a high probability that they are both in the same community. Alternatively, in a biological example, if there is a co-expressional pattern between the two genes and their products form protein complexes, this increases the likelihood of their joint involvement in biological processes. This tendency should be reflected in the behavior of random walker. To this end, we let the walker move through edges that exist in most layers with higher probability. Applying common-edges influence in the random walk probability will lead to the extraction of a shared structure between the layers. We multiply this value to the normalized weight to force the walker to move towards the common structure between layers.
Mathematically, we formulated these three concerns in Eq 2, where P l (i,j) is the intra-layer probability of moving from a node i to a node j in layer l. In this equation, each term considers one of the constraints mentioned above. It is worth mentioning that Γ l (i) denotes the neighbor set of node i in layer l (Eq 3), and ω(i,j) is the weight of edge (i,j) (defined in Eq 4) in the multilayer network; However, other edge weighting methods could be utilized (i.e. Entropy-based weighting proposed by Hmimida et al. [41]). It is a value in the range [0, 1] and does not depend on the layer; it is used to smooth probabilities. Also, A l stands for the adjacency matrix of layer l. Here, to verify the performance of the proposed algorithm, three datasets were used: 1-a multilayer network constructed from a generated artificial modular singlenetwork (created through a modular single-network generation algorithm, see S1 Appendix), 2-a multilayer network simulated from a synthetic modular single-network provided by the Graph Challenge data repository, and 3-a real social multilayer network published by Rossi et al. [57], based on five social interconnections (Facebook, lunch, coauthor, leisure, and work). The multilayer simulation algorithm in cases 1 and 2 was described in S1 Appendix. In all these three multilayer datasets, gold-standard communities were specified previously. For every dataset, the original multilayer and the aggregated single-layer network is used by methods PLCDM and ML-LCD (multilayer-specific), and Seed-Set-Expansion (single-layer-specific) to explore communities. Stage B), Application of the PLCDM on a reconstructed two-layer network of the colon adenocarcinoma (COAD). https://doi.org/10.1371/journal.pone.0255718.g001

PLOS ONE
In addition, the random walker must be capable to change layers and score nodes according to the topology of layers. Here in Eq 5, R i (l s , l t ) denotes the probability of moving from a typical node i in layer l s , to its counterpart node i in layer l t . Ideally, as much as the two counterpart nodes are similar in their neighbors, the probability of moving between them should be larger. To capture this kind of similarity, we used the Jaccard similarity measure. Here also, it is possible that in a layer, a node to be isolated without any connection to other nodes. Therefore, to prevent division by zero we added 1 to the numerator and denominator of the equation. A typical information diffusion process (simulated by a random walk with random restart from a seed node) in the multilayer network is illustrated in Fig 2B.
Starting from the seed node and proceeding random walks toward the most probable neighbors, nodes in the vicinity of the seed node will have a score indicating their share of information. After a few steps (which is specified randomly), the walker jumps to the seed node to start a new propagation. This is equal to multiple parallel information cascade from the host node s.
It is worth mentioning that PLCDM is applicable to single-layer graphs (e.g., aggregation of multilayer networks). In this condition, the walker will always move on the existing layer without changing layers. Algorithm 1. PLCDM 1. Input: multilayer, seed, iteration_count, random_jump_prob, layer_change_prob 2. Output: community 3. V = set of nodes in the multilayer 4. n = number of nodes 5. layers = set of layers in the multilayer 6. k = number of layers 7. scores = vector of nodes scores initialized to 0 8. R = inter-layer transition probability matrices initialized to 1/k 9. P = intra-layer transition probability matrices initialized to 1/n 10. for layer l in layers:

11.
for nodes i, j in V:

12.
calculate P l (i,j) 13. for node i in V: 14. for layers s, t in layers: 15. calculate R i (l s ,l t ) 16. current_node = seed 17 If change_layer: 29. current_layer = select a layer randomly from layers considering inter-layer probabilities R 30. scores = normalize (scores) 31. community = select nodes with score > 0 32. return community As the scores are calculated, they are transformed by the simple z-score method [58]. With this transformation, not only scores fall in a smaller range, but also, scores larger than the average will be positive, and scores smaller than the average will be negative. Finally, the output community is specified from nodes that their normalized score is positive (their scores are larger than the average, see Fig 2A). At every step, probabilities of moving to the neighbors of the current node are rescaled such that their summation is one. The whole procedure is described in Algorithm 1.
The proposed method is also applicable to single-layer graphs. In this case, there is not any layer change (the change_layer_or_stay function in the algorithm always will return false), and the walker keeps moving based on intra-layer probabilities.
The algorithmic complexity of PLCDM is composed of the complexity of inter-layer and intra-layer probabilities calculation, and the random walk iterations. By considering n as the number of nodes in the multilayer network, and k as the number of layers such that n�k, the total complexity is [O(kn 2 )+O(nk 2 )+O(t)]ffiO(kn 2 ) where t is the number movements (a constant value). Here, O(kn 2 ) is the complexity of intra-layer probabilities calculations and O(nk 2 ) denotes the complexity of inter-layer probabilities calculations. It demonstrates that the timeconsuming part is the process of calculating intra-layer probabilities of layers, and then the output community is extracted in a linear time. Moreover, the memory space complexity is [O (kn 2 )+O(nk 2 )+O(n)]ffiO(kn 2 ). For example, in a typical implementation of the algorithm for a 5-layer network and 10 4 nodes in each layer, it uses about 3.73 GB of memory and runs in 23 minutes to calculate the probability matrices. Then, the calculation of the output community will terminate in a few seconds. However, the time consumed depends on the processing power of the system.

Network preparation
For the two stages accomplished in this study, stages A and B in Fig 1, different datasets were utilized. To confirm the performance of the presented method (stage A), we need gold-standard multilayer networks with prespecified communities. Here, three datasets with predefined ground-truth communities were employed; one real multilayer network (based on five social interconnections: Facebook, lunch, coauthor, leisure, and work), labeled Multilayer1, published by Rossi et al. [57], and two simulated multilayer networks (S1 Appendix) from two distinct single-layer graphs, prepared by (1) Graph Challenge data repository (https:// graphchallenge.mit.edu/data-sets), labeled Multilayer2, and (2) our modular network generation algorithm available in S1 Appendix, labeled Multilayer3. In the case of multilayer simulation from a single-layer network (see S1 Appendix), since the generation is achieved using a probabilistic model; it is obvious that the obtained network from the layer aggregation of the simulated multilayer network, differs from the original single-layer network. In all these three multilayer networks, edges are undirected and unweighted and nodes may be different in every layer.
At stage B, we apply the PLCDM method on a reconstructed two-layer network of colon cancer to test if it can uncover functional modules having a role in the carcinogenesis biological pathways. In this multilayer, the first layer is a gene co-expression, and the second one is a protein-protein interaction graph. In the co-expression layer, nodes are genes, and edges show similar expressional patterns between genes. However, in the PPI layer, nodes are proteins, and their interrelations are physical binding interactions. As proteins are gene products, and to align layer nodes, we used the corresponding gene symbols as node labels in both layers. This structure is used in multiple earlier works [36,59,60]. In the proposed structure, we only considered intra-layer and coupling inter-layer edges, regardless of possible (non-coupling) gene-protein binding interactions. Herein, two types of biological data were utilized to model cancer states. For the co-expression layer, FPKM-UQ (Fragments Per Kilobase of transcript per Million mapped reads upper quartile) normalized gene expression profile of COAD patients were downloaded from the GDC data portal (https://portal.gdc.cancer.gov/). After preprocessing the expression data, genes with the following conditions were excluded: (1) genes with missing values in any sample [61], (2) genes with the expression count value of zero in more than 80% of all samples [61][62][63], (3) genes that possessed the expression rates with zero standard deviation across all samples, and (4) genes with average CPM (count per million) lower than 1 [61]. Afterward, a gene co-expression network was reconstructed using the WGCNA package [64] in R. For the co-expression layer, we defined a correlation threshold to filter significant edges. Conversely, in the second layer, curated Protein-Protein interactions (PPIs) for the same genes were pulled out from interactome data published by Menche et al. [65]. Here there is an assumption: in the reconstruction of this multilayer network, if a gene encodes several proteins or vice versa, to simplify the problem, only one-to-one mappings are considered. However, implementing one-to-many mappings is possible by letting the walker to select randomly from correspondents in layer change process.

Method evaluation
To assess the method truthfulness in finding correct modules, we compared PLCDM with two other prominent local community detection methods, ML-LCD [25] and Seed-Set-Expansion [21]. ML-LCD is a multilayer seed-centric local community detection and, on the other hand, Seed-Set-Expansion is a single-layer local community detection method. First, we compared PLCDM with ML-LCD, by applying both methods on every multilayer network of stage A, declared in the Network Preparation section. Second, we are interested in investigating PLCDM performance on the three multilayer networks, in comparison with the Seed-Set-Expansion performance on the aggregated networks of those multilayers. Our objective from performing this two-step evaluation is not only assessing PLCDM accuracy against other methods, but also its advantages over layer aggregation. In every multilayer network (one real and two synthetic networks), ground-truth communities were specified previously. Each node inside these predefined communities is used as a seed item, and the resulting community is calculated using PLCDM, ML-LCD, and Seed-Set-Expansion algorithms. Afterward, these three methods are compared to see the similarity of their result community to the groundtruth community.
For example, suppose that for a multilayer network M, the set of predefined ground-truth communities is C = {C 1 , C 2 , C 3 } in which every community contains some nodes (e.g.
In the evaluation process, every node such as v 1 is considered as the seed, and using the three methods (PLCDM, ML-LCD and Seed-Set-Expansion), their output communities are predicted (e.g.  Table 1. This practice is performed for entire nodes of ground-truth communities. Now, for every seed item of ground-truth communities, we have a set of evaluation measures that could be used to compare methods.

Networks properties
In this subsection, we briefly describe the properties of networks employed in the two stages of the workflow. Topological properties of datasets, were reported in Table 2. As mentioned before, the real social multilayer was obtained and the two synthetic multilayers were simulated from single-layer modular graphs. In the simulation of a multilayer network from its

PLOS ONE
single-layer modular graph, to be fair in generation of layers (S1 Appendix) edge-selectionprobability parameter was set to 0.5. Manually, the layer count value was set to 5, but it was possible to set any other value (generate any number of layers). It should be noted, all layers are undirected and unweighted graphs. At stage B, The obtained dataset from the TCGA includes total RNAseq expression data for 60483 coding/non-coding RNAs in 49 samples that consisted of samples collected from patients with colon cancer. Then, we carried out gene filtering as described in the "Materials and Methods" section. The cleaned data include the expression count value for 14515 genes in 49 samples. The co-expression layer type was set to 'unsigned' and it was excerpted for the most correlated genes (correlation threshold was set to 0.8 to consider only 20% of high correlations [66]). Nevertheless, in the physical binding layer, all protein interactions for the same nodes of the first layer were selected. The generated co-expression layer comprises 5993 nodes and 75121 edges. On the other side, the physical interaction layer was directly generated from source datasets. The physical interaction layer (PPI) has 12751 nodes and 135712 edges.

PLCDM comparison
In order to verify the PLCDM, in an iterating process every node in the ground-truth communities of the three evaluation datasets, was considered as the seed node and the output community of that seed were computed (predicted) through PLCDM, ML-LCD and Seed-Set-Expansion methods. Then, the three predicted communities are compared with the original ground-truth community based on different evaluation metrics.
In the PLCDM execution, the Jump probability was calculated using trial and error procedure. We set this value to 0.5 so that the walker could reach to most nodes (in some cases, some nodes are isolated and therefore unreachable) and score nodes around the seed with higher probability.
The results of this evaluation is illustrated in Fig 3. On the Multilayer1, in all extents of evaluation measures except Recall, PLCDM outperforms the others (Table 3). Nevertheless, in the second dataset (Multilayer2), ML-LCD's performance is superior in five measures and PLCDM gains excellence in terms of FDR, Specificity, and Precision (Table 4). For the Multi-layer3, as revealed in Table 5, PLCDM performs much better than ML-LCD and Seed-Set-Expansion methods. Multilayer3 dataset (real data for social connections), in the predefined ground-truth communities, every node was selected as the seed node, and using three local community detection methods (PLCDM, ML-LCD, and Seed-Set-Expansion) result community was predicted for that selected seed. Then, the predicted community of each method was compared to the original ground-truth community and evaluated using eight metrics. As demonstrated, except Recall, in all measures PLCDM has better results. Notably, the false discovery rate of PLCDM is smaller than the other two methods. Every boxplot demonstrates the values gained by a method in this iterative process. Therefore their interquartile range and their average are comparable. https://doi.org/10.1371/journal.pone.0255718.g003

Local vs. global community detection
The PLCDM is a local community detection scheme and its objective is to the discover a single local community that the predefined seed node belongs to it. However, the global community detection methods are used to acquire any community structure in the network. Here, we make a comparison between local and global methods, which is illustrated in Fig 4. For this purpose, a typical synthetic multilayer network was employed that was created by the "modular graph generation" and "multilayer simulation from single-layer graph" algorithms, having 25 nodes and two modules of size five in three layers. The original single-layer modular graph is illustrated in Fig 4A. First of all, communities of the aggregated single-layer graph, unfolded by Louvain [28] algorithm, was displayed in Fig 4B. Secondly, the three methods PLCDM (local, Fig 4C), generalized Louvain (gLouvain) [67] (global, Fig 4D), and multilayer Infomap [44] (global, Fig 4E) were applied on the multilayer network. For the PLCDM, nodes 1 and 7 were randomly specified as initial seed nodes, in two separate executions. However, in the other two methods, communities other than the two communities in question are also extracted. It is noticeable that the PLCDM detects both communities in a precise way.
As pointed previously, we set the walker to move from edges present in most layers to focus on strong edges. Here, in a separate step, to show the effect of the edges that are present in most layers and to observe the difference between a biased random walk on the multilayer network and its aggregated network, Table 6 is presented.

Application of PLCDM on COAD-specific multilayer network
Once the assessment process was achieved, as a case study, we applied the PLCDM on the colon cancer network as we clarified formerly. We ran the method by manually seeding it with HMMR as a known gene involved in cancerous signaling pathways. The PLCDM-explored community around the HMMR comprises 97 genes that computationally are similar in

PLOS ONE
function and action. To confirm this finding, we benefited ToppFun enrichment portal of ToppGene [68] and FunRich [69] utilities and studied the role of discovered biomolecules in tumor-associated pathways and biological processes. The pathway enrichment results of the output module are shown in Fig 5. However, the detailed information of gene ontology (biological processes, molecular function, and cellular component), pathways, protein domain, and expression site are accessible in S2 Appendix. The outcomes suggest that the objects exploited in the yielded module are related to pathways involved in cell growth and proliferation, and carcinogenesis of the colon, such as Cell Cycle, DNA Replication, APC/C-mediated degradation of cell cycle proteins, and Mitotic M-M/G1 phases. As reported by the ToppFun, this module is correlated with the Colorectal Neoplasms significantly (p-value = 3.433E-5).

PLOS ONE
The APC produces a protein that acts in the Wnt signaling pathway with a tumor-suppressing function. It has roles in other biological processes, such as cell migration and adhesion, transcriptional activation, and apoptosis. Defects in the APC lead to familial adenomatous polyposis (FAP) that commonly moves toward malignancy. Mutations in this gene have been shown to occur in most colorectal cancers [70]. Additionally, it is closely associated with cell division, which is an important aspect of cancer cells.
The extracted module is involved in the FOXM1 transcription factor network. The FOXM1 gene encodes a transcriptional activator protein that is involved in cell proliferation. This protein is phosphorylated in the M phase and controls numerous genes in the cell cycle (e.g., cyclin B1 and cyclin D1). For the FOXM1, several transcript variants coding isoforms have been found (https://www.genecards.org/).
Next, in another parallel execution, the method was applied with the seed ECT2 that is a colorectal cancer-associated gene. This time, the extracted module comprises 101 genes. The functional enrichment results demonstrate that the gene-set is associated with apoptosis, cell proliferation, cell cycle, DNA repair, and DNA replication significantly (Fig 6). All the stated pathways are essential for carcinogenesis and the rapid development of tumor cells. Cyclin D plays role in controlling cell cycle development. The cyclin D synthesis is originated during G1 and excites the G1/S phase transition. Li et al. [71] previously mentioned on Prognostic Significance of Cyclin D1 Expression in Colorectal Cancer. Here, the genes in the output module are correlated with the Cyclin D-associated events in G1. For more information about these results, see S3 Appendix.

Conclusion
Communities are the core and functional parts of networks that contain objects similar in form or behavior. In this way, numerous community detection algorithms were proposed to reveal modules in different network types, such as multilayer networks. In real-world applications, sometimes, finding communities in a network is not as important as extracting the superior community of a solitary preexposed node. As a biological case, in cancer applications, when an oncogene is known experimentally, finding its related component is more important than the discovery of every module in the whole system (that may be non-correlated to cancer). Herein, we propose a new seed-centric local community detection method for the multilayer environment that is based on information cascade theory. In this method, by propagating information from the predefined seed, nodes in the neighborhood of the seed are selected based on the information proportion that they obtained. To confirm the algorithm, various multilayer datasets with predefined ground-truth communities were employed. The results designate that PLCDM does well against earlier approaches with regard to evaluation measures. Finally, we applied the method on reconstructed colon adenocarcinoma multilayer networks and examined the outcome modules in terms of functional enrichment and GO annotations. Our assessments disclose that the local modules are correlated with pathways and biological processes involved in colon cancer.