Disease-related gene module detection based on a multi-label propagation clustering algorithm

Detecting disease-related gene modules by analyzing gene expression data is of great significance. It is helpful for exploratory analysis of the interaction mechanisms of genes under complex disease phenotypes. The multi-label propagation algorithm (MLPA) has been widely used in module detection for its fast and easy implementation. The accuracy of MLPA greatly depends on the connections between nodes, and most existing research focuses on measuring the similarity between nodes. However, MLPA does not perform well with loose connections between disease-related genes. Moreover, the biological significance of modules obtained by MLPA has not been demonstrated. To solve these problems, we designed a double label propagation clustering algorithm (DLPCA) based on MLPA to study Huntington’s disease. In DLPCA, in addition to category labels, we introduced pathogenic labels to supervise the process of multi-label propagation clustering. The pathogenic labels contain pathogenic information about disease genes and the hierarchical structure of gene expression data. Experimental results demonstrated the superior performance of DLPCA compared with other conventional gene-clustering algorithms.


Introduction
High throughput biotechnologies have been routinely used in biological and biomedical research. As a result, tremendous amounts of large-scale omics data have been generated, providing not only great opportunities but also challenges for understanding the molecular mechanisms of complex diseases [1]. Detecting disease-related gene modules by analyzing gene expression data represents one of these opportunities and challenges. Genes with similar expression patterns, as well as those with similar functions, are more likely to be regulated via the same mechanisms [2]. Therefore, we can extract disease-related molecular mechanisms through gene co-expression analysis if the genes involved in the mechanism form a significant co-expression gene module that contains known disease genes [3,4]. The essence of such coexpression analysis is a clustering problem. PLOS  topological structure of a gene co-expression network. Experimental results demonstrated the feasibility and effectiveness of DLPCA as well as the superior performance of DLPCA compared with other conventional gene clustering algorithms. The rest of this study is organized as follows: Materials used in our study and methods proposed in this paper are presented in Section 2. Experiments that analyze the performance of DLPCA and the overall discussion of DLPCA are reported in Section 3. Conclusions, along with some suggestions for future research, are presented in Section 4.

Materials and methods
In this section, first, the gene expression data used in our study are described. Next, the construction of the gene co-expression network is briefly introduced. Then, we present the seed selection method based on local topological information. Finally, we describe the DLPCA.

Gene expression data
The gene expression data used in our study were RNA-seq data downloaded from http://www. hdinhd.org. The data were obtained from the striatum tissue of 6-month-old Huntington's disease (HD) mice. The gene expression data contain 4 genotypes, including polyQ 92, polyQ 111, polyQ 140, and polyQ 175. Each genotype has 8 replications. Thus, the gene expression data comprise 32 samples in total. The gene expression data contain 23,351 genes. After removal of genes with insignificant expression changes, 9578 genes remain for further consideration. The data on modifier genes were from Langfelder [29], which contain 520 genes in the training set, including 89 disease genes and 431 non-disease genes.
HD is a type of neurodegenerative diseases that is reported to be caused by a triplet repeat elongation in the Huntington gene (IT15), which leads to neuronal malfunction and degeneration through numerous interactions between genes and a number of different molecular pathways. The course of the disease is a constant progression of symptoms lasting 15 to 20 years after diagnosis and eventually leading to death. Several molecular mechanisms are involved in HD that lead to neuronal dysfunction. Genes with similar expression patterns are usually regulated via the same mechanism, forming modules in the gene co-expression network. Accordingly, if a module contains a relatively large number of disease genes, the biological function of the module may be highly relevant to the disease. This explains why we seek to extract diseaserelated modules from the gene co-expression network of HD.

Construction of the gene co-expression network
To conduct the multi-label propagation algorithm, we must construct a gene co-expression network using gene expression data. The gene co-expression network used in our study was constructed using the WGCNA software package [30,31]. As a scale-free network largely corresponds with biological networks, we used the WGCNA software package in our study to ensure that the gene co-expression network is scale-free [32]. Let x i denote the expression profile of gene i and x j denote the expression profile of gene j. The weight of the connection between gene i and gene j is w ij , where w ij = |cor(x i , x j )| β . The parameter β is a soft threshold, which is set as the minimal positive integer that ensures the scale-free topology fit of the gene co-expression network is no more than 0.8. It should be noted that the stronger the Pearson correlation, the larger the weight [30,31]. In the co-expression network G = (V, E), V is the set of nodes, where one node corresponds to a gene. E is the set of edges, showing the mutual interactions between genes. W = [w ij ] is the weight matrix of the gene co-expression network. The adjacency matrix is A = [a ij ], where a ij represents the interactions between node i and j.
The calculation of a ij is given by The transition probability matrix is P = [p ij ], where p ij denotes the probability of transition from node i to node j. In fact, P is a normalized matrix of W along the row vector. The calculation of p ij is given by where N i is the set of neighboring nodes of node i in the gene co-expression network.

Selection of seed nodes
Gene co-expression networks have been shown to exhibit a modular structure. Good seed nodes are helpful for module detection [33]. According to the local topological structure of the gene co-expression network, we selected seed nodes to accelerate the convergence speed and improve the cluster robustness of MLPA [34]. Since nodes with large clustering coefficients and large degrees can spread information quickly and easily, we selected seed nodes based on degree and clustering coefficient. The details for seed nodes selection are shown below.
Step 1. Compute the clustering coefficient of node i, where d i represents the degree of node i. Then, rank all the nodes in descending order according to the clustering coefficient c i . R c i represents the ranking of node i in the ranked list.
Step 2. Compute the degree of node i, d i = ∑ j2N i a ij . Then, rank all the nodes in descending order according to their degrees. R d i represents the ranking of node i in the rank list.
Step 3. The rank-product strategy [35] yields the comprehensive ranking of node i, Step 4. Rank R i , i 2 V, in ascending order and select the first m nodes as seeds. We denote the seed set as S, while the category label of seed node i is f i , i 2 S.
It should be clarified that the category labels of seeds are used to extract modules from the gene co-expression network. In the MLPA results nodes with the same category label are considered a module.

Double label propagation clustering algorithm
To make full use of some genes with pathogenic information and improve the biological meaning of the clusters, we take the pathogenic information of genes into consideration during category label propagation. The initial pathogenic label of a gene is given by Eq (3).
We conduct semi-supervised pathogenic label propagation using the known pathogenic information of some genes to supervise the multi-label propagation clustering, thus obtaining the most likely disease-related modules.
Definition 1 Category label update rule. When multi-label propagation is used to detect functional modules, the following update rule for the category labels is used during the label propagation.
The category label of node i is where N f n i represents the neighboring nodes of node i with the category label f n , n 2 S. λ 1 , λ 2 , λ 3 are parameters. λ 1 controls the effects caused by weighted connectivity. λ 2 controls the effects caused by the number of neighboring nodes. λ 3 controls the effects caused by the pathogenic information of the neighboring nodes. We assumed that the weighted connectivity, the degree and the pathogenic information have equal influence on the category label of a gene.
Definition 2 Pathogenic label update rule. Based on the topological structure of the gene co-expression network, update the pathogenic label of other nodes in the network by using the small amounts of genes with known pathogenic information.
The pathogenic label of node i is where l t i is the pathogenic label of node i at the tth iteration, N f ðiÞ i represents the neighboring nodes of node i whose category label is the same as node i. N À f ðiÞ i represents the neighboring nodes of node i whose category label is different from node i. The symbols β 1 , β 2 , β 3 are parameters. The parameter β 1 regulates the pathogenic effects caused by the nodes in N f ðiÞ i . The parameter β 2 regulates the pathogenic effects caused by the nodes in N À f ðiÞ i . The b 3 l tÀ 1 i ensures that the pathogenic label of node i is stable during the pathogenic label updating process. In addition, β 1 + β 2 + β 3 = 1 ensures that the pathogenic label is ultimately convergent [20]. Definition 3 Conditions for termination of iteration. The conditions for termination of DLPCA are that the category labels of the nodes in the network stop changing or that the change of the pathogenic information for any node is less than the threshold. In this study, the threshold is 0.1.
The DLPCA procedure is summarized in Algorithm 1. Algorithm 1: DLPCA Input: gene expression data, parameters: λ 1 , λ 2 , λ 3 , β 1 , β 2 , β 3 Input: pathogenic labels of some genes Output: gene category label 1: Construct the gene co-expression network. Compute the weight matrix, the adjacency matrix, and the transition probability matrix 2: Select the seed nodes 3: repeat 4: Update the gene category label according to Eq (4) 5: Update the gene pathogenic label according to Eq (5) 6: until conditions for termination are satisfied 7: return gene category label

Results and discussion
In this section, the selection of parameters is first described. Next, we compare the DLPCA with other conventional methods to demonstrate the superior performance of DLPCA. Second, we analyze the time complexity of DLPCA and MLPA. Third, we conduct an enrichment analysis of the modules obtained using DLPCA. Finally, we present an overall discussion to clearly illustrate the purpose of this study and demonstrate the key point of the algorithm.

Parameter selection
Topological information of the co-expression network is shown in Table 1.   Disease-related gene modules detection based on MLPA According to Table 1 (the average degree and the average weighted connectivity of the co-expression network) and Fig 2 (the near-linear correlation between the degree and the weighted connectivity), we know that the correlation coefficient is roughly equal to the ratio of the average degree to the average weighted connectivity. Considering that the weighted connectivity, degree, and pathogenic information have equal influence on the category label of a node in the present study, we obtained λ 1 : λ 2 = 65.23: 18.97. Following the semi-supervised pathogenic label propagation, the average pathogenic information of all nodes is 0.236. We then obtained λ 2 : λ 3 % 1: 0.236 × 0.236. Therefore, we set λ 1 = 3.44, λ 2 = 1.0, λ 3 = 20.0 for computational convenience. It should be noted that traditional MLPA only considers weighted connectivity in category label propagation.
In addition, to analyze the impact of parameter m on the clustering results, we selected 350, 500, and 750 seed nodes to conduct category label propagation for the traditional MLPA method and the DLPCA method, respectively.

Performance comparison between DLPCA and MLPA
To evaluate the performance of different clustering algorithms, the following criteria were used: the coverage, the significance of the disease-related module, and the significance of scatters. The coverage is defined as the ratio of genes in modules to all genes in the network. The significance of the disease-related module is defined as the ratio of disease genes to genes in the training set included in the module. The significance of scatters is defined as the ratio of disease genes to genes in the training set included in the scatters. The clustering results are improved along with increased significance of disease-related modules and decreased significance of scatters. The clustering results of MLPA and DLPCA are shown in Table 2.
Figs 3 and 4 present the clustering results of these methods. As illustrated in Fig 3, with the same seed numbers, the average significance of disease-related modules obtained using  DLPCA with β 1 = β 2 (the average significances of disease modules is 0.837) is higher than that of the other experiments. As shown in Fig 4, with the same number of seeds, the significance of scatters obtained using DLPCA with β 1 = β 2 (the average significances of scatters is 0.111) is lower than that of the other experiments. It is also clear that the average significance of disease-related modules with different numbers of seeds obtained using DLPCA are similar ( Fig  3). The significance of scatters obtained using different numbers of seeds in DLPCA are also similar (Fig 4). These results suggest that the clustering results of DLPCA are insensitive to seed number. Figs 3 and 4 also show that the clustering results of DLPCA with β 1 = β 2 are much better than that of DLPCA with β 1 > β 2 , indicating that different parameter combinations significantly impact the clustering results. When the coefficients of the two category labels are equal, i.e., β 1 = β 2 in DLPCA, the average significance of the disease-related modules (the average significance of the disease modules is 0.837) and the significance of the scatters (the significances of the scatters is 0.111) are the best. It demonstrates that DLPCA with β 1 = β 2 can separate disease genes from non-disease genes very well during the clustering process. When the coefficients of the two category labels are not equal, generally, the neighboring nodes whose category labels are the same as that of node i have a greater impact on the pathogenic label of node i, i.e., β 1 > β 2 in DLPCA. Affected by the interaction of the category label and the pathogenic label, DLPCA with β 1 > β 2 may easily fall into local optimization. This situation could be prevented by setting β 1 = β 2 , ensuring that category label updating is immune to pathogenic label updating. Furthermore, clusters obtained by MLPA have often been shown to contain few genes, which is also confirmed in this study (the average module size of MLPA is shown in Table 2). The experimental results also suggest that MLPA fails to effectively separate disease genes from non-disease genes. However, DLPCA contains a semi-supervised pathogenic label propagation step, which is very helpful for separating disease genes from non-disease genes. DLPCA greatly improves the average significance of disease-related modules compared with MLPA. In the DLPCA results, the sizes of disease-related modules are between 20 and 300 except for two large modules whose sizes are larger than 1000. In summary, DLPCA can effectively improve the performance of clustering results by selecting the appropriate parameters as suggested in our study.

Performance comparison between DLPCA and DCOTCA
To compare the performance of DLPCA with other algorithms, we conducted experiments using the dynamic cut-off tree clustering algorithm (DCOTCA). The clustering results are illustrated in Table 3.

Time complexity analysis of DLPCA and MLPA
When the number of nodes in the gene co-expression network is n and the number of seed nodes is m, the time complexity of MLPA in each iteration is O(m Á n 2 ). Approximately 10 iterations in each MLPA experiment are needed to reach convergence (Table 4). Given the interaction of the category label and pathogenic label in DLPCA with β 1 > β 2 , fewer iteration times are needed relative to DLPCA with β 1 = β 2 to reach convergence. During the category label propagation process, traditional MLPA only considers the impact of weighted connectivity on category label according to a static network; thus, the iteration process of MLPA is the fastest.
The time of per iteration varies. Generally, increasing the seed number increases the time. The average time of each iteration is displayed in Table 5. Note that we used a server with the Linux operating system, 100 GB memory, and an Intel (R) Xeon (R) E5-2603 v3 @1.60GHZ CPU for the data analysis. The algorithm was run on Java 1.7.0_17.

Enrichment analysis
In addition, we conducted enrichment analysis using the DAVID [36] to determine the biological function of the modules obtained using the DLPCA. The clustering results of DLPCA with Disease-related gene modules detection based on MLPA β 1 = β 2 and 350 seed genes were used in the enrichment analysis. We listed annotation clusters with high enrichment scores (ES) for the 9 disease-related modules. We also investigated the enrichment annotations for another two modules that are not associated with the disease to analyze the factors that are not effected by or do not contribute to the disease. The detailed annotations of these modules are shown in Table 6. For module 1 (disease-related module with 91 genes, including 19 disease genes and 1 non-disease genes), identified annotation clusters include metal-binding (cluster 1 with enrichment score 2.69), sequence repeat (cluster 2 with enrichment score 2.54), calcium icon binding (cluster 3 with enrichment score 2.42) and ribosome (cluster 4 with enrichment score 2.14). The significance of the module is 0.95. The annotations for the module suggest that HD maybe associated with these functional annotations above. In fact, HD is caused by the excessive repetition of CAG in the fourth chromosome, which corresponds to the functional annotation, i.e., sequence repeat, of the disease-related module. On the other hand, for module 10 (the non-disease-related module with 623 genes, including 0 disease genes and 137 non-disease genes), annotation clusters such as lysosome (cluster 1 with enrichment score 8.04), cilium (cluster 2 with enrichment score 7.36), Glycoprotein (cluster 3 with enrichment score 5.95) and extracellular matrix (cluster 4 with enrichment score 5.48) were identified. Since the module contains no disease genes, the above functions are most likely not affected by the disease.
The pathology of Huntington disease is very complex and many factors are involved in the disease progression, including inflammation, impaired metabolic pathways, protein mis- folding [37][38][39], etc. The enrichment analysis results demonstrate that a disease-related module often contains many functional annotations that could reflect complicated pathologies and also verifies the effectiveness and reasonability of the DLPCA.

An overall discussion
Although tremendous amounts of omics data are being collected along with the rapid development of high-throughput technology, only a small amount of data contain clearly biological annotations, e.g., pathogenic information on genes for specific complex diseases. The challenge is how to fully utilize the small amounts of labeled data to discover effective knowledge from the genome-wide data.
The DLPCA designed in this study aims to mine the most likely disease-related modules from gene expression data by making full use of the pathogenic information of a small number of genes. In addition, DLPCA also makes full use of the hierarchical structures in the network, including the structures represented by the category labels and those represented by the pathogenic labels. This computational method can improve the efficiency and effectiveness of downstream biological experimental analysis. To clarify the main idea of this study and the key point of the DLPCA, we have drawn Fig 8 to clearly demonstrate the properties of DLPCA compared with MLPA. DLPCA is helpful for classifying genes with similar biological properties into one module. Compared with MLPA, DLPCA effectively improves the biological significance of the gene clusters.

Conclusions
In this study, we designed a double label propagation clustering algorithm for detecting disease-related modules. This algorithm takes the pathogenic information of genes as a property of nodes in the gene co-expression network. During the clustering process of MLPA, DLPCA not only considers the topological structures of the network but also the biological properties of the nodes in the network. In addition, to accelerate convergence and improve cluster robustness, we also proposed a seed selection strategy according to the local topological structure of the gene co-expression network. Compared with the aforementioned conventional methods, DLPCA effectively improves the accuracy of disease-related module identification. Disease-related gene modules detection based on MLPA  However, it should be stated that DLPCA could be applied equally well to other biological networks and genomic data. Recently, new module detection methods integrating different network structures have been proposed [40]. Generally, the accuracy of disease module detection may be further improved by integrating other biological data as well as gene expression data, especially for gene expression data characterized by large amounts of noise. Therefore, our future efforts will focus on integrating multi-source biological data to further improve the accuracy of diseaserelated modules.