Network Propagation with Dual Flow for Gene Prioritization

Shunyao Wu; Fengjing Shao; Jun Ji; Rencheng Sun; Rizhuang Dong; Yuanke Zhou; Shaojie Xu; Yi Sui; Jianlong Hu

doi:10.1371/journal.pone.0116505

Abstract

Based on the hypothesis that the neighbors of disease genes tend to cause similar diseases, network-based methods for disease prediction have received increasing attention. Because they take full advantage of network structure, global distance measurements generally outperform local distance measurements. However, global distance measurements have problems of their own. For example, they may mistake non-disease hub proteins that have dense interactions with known disease proteins for potential disease proteins. To find a method that avoids this problem, we analyzed the differences between disease proteins and other proteins by using essential proteins (proteins encoded by essential genes) as references. We found that disease proteins are not well connected with essential proteins in the protein interaction networks. Based on this new finding, we proposed a novel strategy for gene prioritization based on protein interaction networks. We allocated positive flow to disease genes and negative flow to essential genes, and adopted network propagation for gene prioritization. Experimental results on 110 diseases verified the effectiveness and potential of the proposed method.

Citation: Wu S, Shao F, Ji J, Sun R, Dong R, Zhou Y, et al. (2015) Network Propagation with Dual Flow for Gene Prioritization. PLoS ONE 10(2): e0116505. https://doi.org/10.1371/journal.pone.0116505

Academic Editor: Francisco J. Esteban, University of Jaén, SPAIN

Received: May 16, 2014; Accepted: November 24, 2014; Published: February 17, 2015

Copyright: © 2015 Wu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The disease gene list is available from the OMIM database (http://www.ncbi.nlm.nih.gov/omim/) under the DOI 10.1093/nar/gki033. Housekeeping genes are third party data and are available from the research of Chang et al. (http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0022859.s008). 110 hereditary diseases and corresponding disease genes are third party data and are available from the work of Kohler et al. (http://download.cell.com/AJHG/mmcs/journals/0002-9297/PIIS0002929708001729.mmc1.zip). The human protein interactions are third party data and are available from the i2d database (http://ophid.utoronto.ca/ophidv2.204/) and the STRING database (http://string-db.org/).

Funding: 1. The State Key Program of National Natural Science of China (No. 91130035), National Natural Science Foundation of China (http://www.nsfc.gov.cn/), FS; 2. The National Science Foundation of Shandong Province (No. ZR2012FZ003), Shandong Provincial Natural Science Foundation, China (http://www.sdnsf.gov.cn/portal/), FS; 3. The National Science Foundation of Shandong Province (No. ZR2012FQ017), Shandong Provincial Natural Science Foundation, China (http://www.sdnsf.gov.cn/portal/), RS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Disease gene prediction is an important task in bioinformatics. It aims to discover potential disease genes based on known disease genes and omics data, such as metabolic pathways and protein-protein interactions, by utilizing machine learning and complex network theory. It is very important to understand the pathogenesis of hereditary diseases and improve the quality of diagnosis [1].

As a meaningful strategy of disease gene prediction, gene classification aims to construct a binary classification model to automatically determine whether an unknown gene is a disease gene. To effectively distinguish disease genes from non-disease genes, some researchers have utilized sequence-based characteristics to construct classifiers [2]. At the same time, the hypothesis that the neighbors of disease genes are likely to cause diseases prompted scholars to exploit the topological features in protein-protein interaction networks for detecting disease genes [3]. Many studies have explored the integration of various types of features [4–6]. Although gene classification has brought some success, two major problems still exist. First, gene classification selects negative samples (non-disease genes) from unknown genes. However, there are also unrecognized disease genes (false negative samples) that may seriously affect the construction of an accurate classifier [5]. Second, generally, gene classification cannot predict associations between genes and diseases [3, 4, 6]. Only a few disease genes have been verified for each hereditary disease, which is insufficient to train an excellent classifier.

Unlike gene classification, gene prioritization can overcome the two problems mentioned above. The main idea of gene prioritization can be described as follows. Given a disease and its known disease genes, gene prioritization estimates the similarities between unknown genes and known disease genes according to omics data; then, the similarities are sorted in descending order and the top ranked genes are classified as potential disease genes. This provides a convenient method for biomedical experts to select top ranked genes on which to perform experimental verification. The omics data discussed in this paper is protein-protein interaction data. In recent years, gene prioritization based on protein-protein interaction networks has become a hot research topic in bioinformatics [1, 7]. The basic idea is to discover potential disease genes that are closer to or have more interactions with known disease genes.

Gene prioritization can be divided into two types: local distance measurements and global distance measurements. Local distance measurements detect disease proteins according to the local interaction network structure, such as counting the number of known disease proteins in the direct neighbors (Direct Neighbors [8, 9]), or computing the average shortest path to known disease proteins (Shortest Path [10, 11]). Local distance measurements are simple and have low computational complexity, but their performance has been shown to be unsatisfactory. Thus, global distance measurements that can take full advantage of global topological structure have received increasing attention. Random walk with restart [7, 12], kernel diffusion [7] and network propagation [13] are classical global distance measurements. They can effectively detect potential disease genes that have a high number of interactions with known disease genes. A detailed introduction to gene prioritization has been previously published [14, 15].

One limitation is that global distance measurements may mistake hub proteins with high betweenness for potential disease genes, while hub proteins are probably essential proteins. Thus, it is necessary to identify a method to further determine if the hub proteins are essential proteins, disease proteins or other proteins.

The existing research on protein interaction network analysis is mainly focused on differences in topological importance between essential proteins, disease proteins and other proteins (unknown proteins) [16, 17]. So far, few studies other than our recent research have exploited essential proteins to distinguish disease proteins from other proteins. Our recent study showed that, compared with other proteins, disease proteins are topologically more important [18]. Moreover, disease proteins are closer to the center of the protein interaction network, but are not well connected with essential proteins. We propose that if there are too many essential proteins as neighbors of a candidate protein, the protein is unlikely to cause diseases. However, our recent study only analyzed the proportions of essential proteins among 1-direct neighbors (nearest neighbors) and 2-indirect neighbors (1-direct neighbors’ nearest neighbors [3]) of disease proteins [18]. Thus, more evidence is required to support this new hypothesis.

This paper systematically analyzed the topology associations between disease proteins and essential proteins within protein interaction networks. Empirical results demonstrated that disease genes are not well connected with essential genes. Furthermore, we improved the network propagation method according to the new hypothesis. The main idea is similar to two competing pathogens spreading on a network [19]. We assume that known disease proteins carry positive flow, while essential proteins carry negative flow, and we treat network propagation as a competition between disease proteins and essential proteins. Proteins with more positive flow tend to cause diseases, while proteins with more negative flow are probably non-disease proteins. Thus, by network propagation we can find potential disease proteins that have more interactions with known disease proteins (indicating that they probably have similar functions), but fewer interactions with essential proteins (reflecting the finding that disease proteins are not well connected with essential proteins). Experimental results on 110 hereditary diseases verified the effectiveness and potential of the proposed method.

Materials and Methods

Human gene list, hereditary disease list and human protein-protein interaction data

The disease gene list was downloaded from the Online Mendelian Inheritance in Man database (OMIM) [20]. We selected 2931 disease genes with tag “3” from 6285 entries. Genes with tag “3” have been verified by the presence of a mutation. Then, we obtained housekeeping genes from the research of Chang et al. [21]. Housekeeping genes are universally expressed in normal tissues or cells and are vital to maintaining fundamental life activities. Thus, housekeeping genes can be deemed as essential genes [16].

We obtained 110 hereditary diseases and corresponding disease genes from Kohler et al. (http://download.cell.com/AJHG/mmcs/journals/0002-9297/PIIS0002929708001729.mmc1.zip). Kohler et al. [7] collected the associations between genetic diseases and disease genes from OMIM, domain knowledge and the medical literature. Here, 110 diseases are accounted for by 794 disease genes; there were 681 unique genes listed (one gene may cause more than one disease).

The human protein interactions were downloaded from the i2d (http://ophid.utoronto.ca/ophidv2.204/) and STRING (http://string-db.org/) databases. Table 1 lists the statistics of networks constructed based on the protein interactions. The i2d database uses proteins as interactors. Thus, we mapped genes to proteins according to the UniProt database (http://www.UniProt.org). Unlike the i2d database, the STRING database uses genes as interactors, and provides a score to evaluate the reliability of the interaction between two interactors. Similar to Kohler et al. [7], we set a threshold score of 0.4 to extract unweighted interactions. We integrated all the data from the two databases to construct a larger network (referred to in this paper as the “integrated protein interaction network”) for disease gene prediction.
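
The thresholding and integration steps described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the tuple layout of the scored interactions, and the choice to keep scores greater than or equal to the threshold are all assumptions, and the sketch presumes that both sources have already been mapped to a common identifier space.

```python
def extract_unweighted_edges(scored_interactions, threshold=0.4):
    """Return undirected, unweighted edges whose score reaches the threshold.

    `scored_interactions` is a hypothetical iterable of (a, b, score) tuples
    with scores on a 0-1 scale; using ">=" for the cut-off is an assumption.
    """
    edges = set()
    for a, b, score in scored_interactions:
        if a != b and score >= threshold:
            edges.add(frozenset((a, b)))  # frozenset treats (a, b) and (b, a) alike
    return edges


def integrate_networks(*edge_sets):
    """Union of several edge sets, assuming identifiers are already harmonized."""
    merged = set()
    for edges in edge_sets:
        merged |= edges
    return merged


# Toy data standing in for STRING-like scored pairs and i2d-like binary pairs.
string_like = [("G1", "G2", 0.90), ("G2", "G1", 0.55), ("G1", "G3", 0.39), ("G3", "G4", 0.40)]
i2d_like = {frozenset(("G1", "G2")), frozenset(("G4", "G5"))}
integrated = integrate_networks(extract_unweighted_edges(string_like), i2d_like)
print(sorted(tuple(sorted(edge)) for edge in integrated))
```

Storing each edge as a `frozenset` removes duplicate and reversed pairs, which is one simple way to obtain an undirected network from two sources.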

Table 1. Networks used in this work.

https://doi.org/10.1371/journal.pone.0116505.t001

In this paper, we annotated essential proteins/genes and disease proteins/genes as E and D respectively, and the remaining proteins/genes (O = ¬(E ⋃ D)) were treated as other proteins/genes. Table 2 and Table 3 list the statistics of different types of interactors in the protein interaction networks constructed based on the i2d and STRING databases. For the sake of brevity, ¬D ⋂ E is denoted by E⁻ and ¬E ⋂ D is denoted by D⁻.

Table 2. Statistics of the proteins in the protein interaction network constructed based on the i2d database.

https://doi.org/10.1371/journal.pone.0116505.t002

Table 3. Statistics of the genes in the protein interaction network constructed based on the STRING database.

https://doi.org/10.1371/journal.pone.0116505.t003

Analysis of the topology associations between disease proteins and essential proteins

Essential genes were initially considered to be stable genes unaffected by other factors. However, recent studies have indicated that the expression of essential genes can be influenced by other factors, such as diseases [22–24]. Our recent study analyzed the associations between disease genes and essential genes in the protein interaction network. Empirical results demonstrated that even though non-essential disease proteins are closer to essential proteins, the proportions of non-disease essential proteins among 1-direct neighbors of non-essential disease proteins are similar to those of other proteins, and the proportions of non-disease essential proteins among 2-indirect neighbors of non-essential disease proteins are statistically smaller than those of other proteins. This finding illustrates that disease proteins are not well connected with essential proteins. In this paper, we systematically study the topology associations between disease proteins and essential proteins.

The n neighbors of node i are defined as the node set $Q_{i}^{n}$ , in which the shortest-path distance from each element to node i is n. Here, n is a positive integer. For instance, $Q_{i}^{1}$ is the set of direct neighbors of node i. We intend to compare the proportions of non-disease essential proteins among the n neighbors of non-essential disease proteins with those among the n neighbors of other proteins. For the sake of brevity, the intersection of set $Q_{i}^{n}$ and set E⁻ is denoted by $Q_{E_{i}^{-}}^{n}$ , $Q_{E_{i}^{-}}^{n} = Q_{i}^{n} \cap E^{-}$ ; the size of set $Q_{i}^{n}$ is denoted by $q_{i}^{n}$ , $q_{i}^{n} = | Q_{i}^{n} |$ ; the size of set $Q_{E_{i}^{-}}^{n}$ is denoted by $q_{E_{i}^{-}}^{n}$ , $q_{E_{i}^{-}}^{n} = | Q_{E_{i}^{-}}^{n} |$ . In this paper, the proportion of non-disease essential proteins among n neighbors of node i is defined as follows: $p_{E_{i}^{-}}^{n} = \frac{q_{E_{i}^{-}}^{n}}{q_{i}^{n}}$ (1)

In this paper, $\{ p_{E_{i}^{-}}^{n} \mid i \in X \}$ is denoted by $P_{E^{-} X}^{n}$ and the median of $P_{E^{-} X}^{n}$ is denoted by $M d (P_{E^{-} X}^{n})$ .
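
The definitions above translate directly into a breadth-first search. The following sketch is illustrative only: the function names and the toy graph are hypothetical, and returning `None` when a node has no neighbors at distance n (so that the node is skipped when the median is taken) is an assumption, since the text does not say how empty neighbor sets are handled.

```python
from collections import deque
from statistics import median


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def proportion_in_n_neighbors(adj, i, n, e_minus):
    """Share of nodes at distance exactly n from i that belong to e_minus."""
    dist = shortest_path_lengths(adj, i)
    q_n = {v for v, d in dist.items() if d == n}
    if not q_n:
        return None  # assumption: undefined when no node lies at distance n
    return len(q_n & e_minus) / len(q_n)


def median_proportion(adj, nodes, n, e_minus):
    """Median of the defined proportions over a node set X."""
    values = [proportion_in_n_neighbors(adj, i, n, e_minus) for i in nodes]
    values = [p for p in values if p is not None]
    return median(values) if values else None


# Toy graph: a - b, b - c, b - e, c - d; c and e play the role of E-minus.
adj = {"a": {"b"}, "b": {"a", "c", "e"}, "c": {"b", "d"}, "d": {"c"}, "e": {"b"}}
e_minus = {"c", "e"}
print(proportion_in_n_neighbors(adj, "a", 2, e_minus))
```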

Gene prioritization

In this work, the network propagation method was adopted to detect disease genes.

Network propagation on a network can be understood as simulating a process, in which nodes iteratively pump flow to their neighbors [13]. A node would pump equal flow to each of its direct neighbors for each timestamp. We denote the network as G = (V, L). Here, V is the node set of the network and L is the edge set of the network. Given one positive unit flow to node x, the flow pumped from node x to node y is W(x, y) = A(x, y)/k(x). Here, k(x) is the degree of node x, A is the adjacency matrix, and W denotes the normalized adjacency matrix. A(x, y) = 1 if, and only if, (x, y) ∈ L; otherwise, A(x, y) = 0. In this way, we can evaluate the similarities between other nodes and node x based on the network structure.
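
As a small illustration of the normalization W(x, y) = A(x, y)/k(x), the sketch below builds W as a dictionary of dictionaries from an undirected edge list. The function name and the toy edges are hypothetical, and self-loops are assumed to be absent.

```python
def normalized_adjacency(edges):
    """Build W(x, y) = A(x, y) / k(x) from undirected edges (no self-loops)."""
    neighbors = {}
    for x, y in edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    # Each row x spreads one unit of flow equally over the k(x) neighbors of x.
    return {x: {y: 1.0 / len(nbrs) for y in nbrs} for x, nbrs in neighbors.items()}


W = normalized_adjacency([("a", "b"), ("b", "c"), ("b", "e"), ("c", "d")])
print(W["b"])
```

Every row of W sums to one, which reflects the statement that a node pumps equal flow to each of its direct neighbors.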

Furthermore, in order to combine prior knowledge (nodes that are allocated prior information should have more flow) and network structures (adjacent nodes are assigned similar flow), network propagation can be defined as follows: $F^{t+1} = (1 - \alpha) W F^{t} + \alpha Y$ (2) Here, $F^{t}$ is a vector in which the i-th element holds the flow allocated to node i at timestamp t, α is a parameter controlling the prevalence of prior information Y (a ∣V∣ × 1 vector), and $F^{1} = Y$. Given $F^{t+1} = F^{t}$, we can obtain the steady-state solution $F^{\infty}$ to equation (2): $F^{\infty} = \alpha (I - (1 - \alpha) W)^{-1} Y$ (3)
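
The iteration can be sketched in a few lines. The sketch assumes that the update has the form $F^{t+1} = (1 - \alpha) W F^{t} + \alpha Y$, which is the form whose fixed point is α(I − (1 − α)W)⁻¹Y; with W applied this way, each node takes the degree-normalized average of its neighbors' flow. The value α = 0.5, the tolerance and the function name are illustrative assumptions and are not taken from the paper.

```python
def propagate(adj, prior, alpha=0.5, tol=1e-12, max_iter=10000):
    """Iterate F <- (1 - alpha) * W F + alpha * Y until the flow stops changing.

    `adj` maps each node to its set of neighbors; `prior` maps nodes to Y
    (nodes missing from `prior` get 0). alpha=0.5 is an illustrative value.
    """
    flow = {x: prior.get(x, 0.0) for x in adj}  # F^1 = Y
    for _ in range(max_iter):
        updated = {}
        for x, nbrs in adj.items():
            # (W F)(x) with W(x, y) = A(x, y) / k(x)
            spread = sum(flow[y] for y in nbrs) / len(nbrs) if nbrs else 0.0
            updated[x] = (1 - alpha) * spread + alpha * prior.get(x, 0.0)
        change = max(abs(updated[x] - flow[x]) for x in adj)
        flow = updated
        if change < tol:
            break
    return flow


# Two connected nodes with one unit of prior flow on "a".
steady = propagate({"a": {"b"}, "b": {"a"}}, {"a": 1.0})
print(steady)
```

For this two-node example the fixed point can be checked by hand: F(a) = 0.5 F(b) + 0.5 and F(b) = 0.5 F(a), so F(a) = 2/3 and F(b) = 1/3. Iterating avoids an explicit matrix inverse, which matters for networks with many thousands of nodes.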

Denote α(I − (1 − α)W)⁻¹ as S; the element S(x, y) stands for the similarity between nodes x and y. Given a hereditary disease h and its known disease genes T_h, the similarity of candidate gene x with disease genes can be computed as follows: $F^{\infty}(x) = \sum_{y \in T_{h}} S(x, y)$ (4)

The above equation is a particular solution of equation (2) when each disease gene of disease h is assigned +1 unit flow for the prior information Y. According to the above equation, we can rank the candidate disease genes. This is a global distance measurement for disease gene prediction, called “NP_D”. NP_D is mainly based on the well-known hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases. Because NP_D can effectively exploit global topological structures, such as dense indirect interactions between disease proteins, the performance is obviously better than local distance measurements.

We intend to exploit a new hypothesis that, if too many non-disease essential proteins exist as neighbors of a candidate protein, the protein is unlikely to cause diseases. According to this hypothesis, we can assign −1 unit flow to each non-disease essential protein for the prior information Y. The dissimilarity of candidate gene x with non-disease essential genes can be computed as follows: $F^{\infty}(x) = - \sum_{y \in E^{-}} S(x, y)$ (5)

In this paper, this is termed “NP_E”.

This paper integrates the above two hypotheses. We allocate positive flow to the disease proteins and negative flow to the non-disease essential proteins to set the prior information Y. Additionally, we ensured that the amount of positive flow is equal to that of negative flow. In the experiment, +1 unit flow was assigned to all disease proteins, while −1 unit flow was allocated to all non-disease essential proteins. The rank of candidate gene x was assigned with its score defined as (6)

This paper named the new strategy “NP_D&E”.
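
The dual-flow idea can be sketched on a toy network. This is an illustration under stated assumptions, not the authors' implementation: the update rule is taken to be F ← (1 − α) W F + α Y with an illustrative α = 0.5, each training disease protein receives +1, and the negative flow is rescaled so that its total magnitude equals the total positive flow, which is only one possible reading of the equal-flow constraint. All names and the toy graph are hypothetical.

```python
def propagate(adj, prior, alpha=0.5, tol=1e-12, max_iter=10000):
    """Steady-state flow for F <- (1 - alpha) * W F + alpha * Y (assumed form)."""
    flow = {x: prior.get(x, 0.0) for x in adj}
    for _ in range(max_iter):
        updated = {}
        for x, nbrs in adj.items():
            spread = sum(flow[y] for y in nbrs) / len(nbrs) if nbrs else 0.0
            updated[x] = (1 - alpha) * spread + alpha * prior.get(x, 0.0)
        change = max(abs(updated[x] - flow[x]) for x in adj)
        flow = updated
        if change < tol:
            break
    return flow


def dual_flow_prior(disease, essential_non_disease):
    """+1 per disease protein; negative flow scaled to the same total (assumption)."""
    prior = {g: 1.0 for g in disease}
    if essential_non_disease:
        negative = -len(disease) / len(essential_non_disease)
        for g in essential_non_disease:
            prior[g] = negative
    return prior


# Candidate "c" touches only the two disease proteins; hub "e" touches both
# disease proteins and three non-disease essential proteins.
adj = {
    "d1": {"c", "e"}, "d2": {"c", "e"}, "c": {"d1", "d2"},
    "e": {"d1", "d2", "h1", "h2", "h3"},
    "h1": {"e"}, "h2": {"e"}, "h3": {"e"},
}
disease, essential = ["d1", "d2"], ["h1", "h2", "h3"]
np_d = propagate(adj, {g: 1.0 for g in disease})            # positive flow only
np_de = propagate(adj, dual_flow_prior(disease, essential))  # dual flow
print(np_d["e"], np_de["e"], np_de["c"])
```

In this toy example the negative flow lowers the score of the hub next to essential proteins while leaving the other candidate ranked above it. Multiplying every prior value by the same positive constant does not change the ranking, so only the ratio between positive and negative flow matters.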

To validate the new strategy, we utilized Leave-One-Out Cross-Validation [7] in the experiments. Given a hereditary disease and the corresponding disease genes (suppose the total number of disease genes is m), we selected each disease gene as a test set in turn, while leaving the remaining m − 1 disease genes as the training set. Therefore, we performed trials m times, and adopted the mean value of the results as the performance of the method. In this paper, we used enrichment-analysis [7] and AUC-analysis [25] to evaluate the performance for detecting disease genes.
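
The leave-one-out procedure described above can be sketched as a small driver function. The names are hypothetical, and `evaluate` stands in for whatever combination of prioritization and scoring is being validated.

```python
def leave_one_out(disease_genes, evaluate):
    """Average evaluate(training_genes, test_gene) over all m hold-outs."""
    if not disease_genes:
        raise ValueError("at least one disease gene is required")
    results = []
    for i, test_gene in enumerate(disease_genes):
        # The remaining m - 1 genes form the training set for this trial.
        training_genes = disease_genes[:i] + disease_genes[i + 1:]
        results.append(evaluate(training_genes, test_gene))
    return sum(results) / len(results)


# Toy evaluation: pretend the score is simply the size of the training set.
genes = ["g1", "g2", "g3", "g4"]
mean_score = leave_one_out(genes, lambda training, test: float(len(training)))
print(mean_score)
```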

Enrichment Score is a typical evaluation index for gene prioritization. For each disease gene used as a test gene, we selected the 100 closest genes to that gene on the same chromosome to construct a candidate gene list (including the test gene). If the final flow allocated to the test gene is ranked $r$-th, the Enrichment Score is $\frac{50}{r}$ . If the test gene has the same flow as other candidate genes, it is ranked last among them. Additionally, if the protein encoded by the test gene is not in the protein-protein interaction network, we consider the rank to be 100 (Enrichment Score is 0.5). In the experiments, we obtained two results for Enrichment Scores. One is termed “Enrichment score 1” and includes disease genes not in the protein-protein interaction network. The other is termed “Enrichment score 2” and eliminates disease genes not in the protein-protein interaction network.
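
The ranking rules above, including the pessimistic handling of ties and the fallback rank of 100, can be sketched as follows. The function and parameter names are hypothetical, and ranking candidates that are absent from the network below every scored gene is an assumption that the text does not state.

```python
def enrichment_score(flows, test_gene, candidates, missing_rank=100):
    """Return 50 / r, where r is the rank of the test gene among candidates.

    `flows` maps genes present in the network to their final flow. Candidates
    absent from `flows` are assumed to rank below every scored gene.
    """
    if test_gene not in flows:
        return 50.0 / missing_rank  # test gene is not in the network
    test_flow = flows[test_gene]
    # ">=" places the test gene last among candidates with the same flow.
    rank = 1 + sum(
        1 for g in candidates
        if g != test_gene and g in flows and flows[g] >= test_flow
    )
    return 50.0 / rank


flows = {"t": 0.3, "a": 0.5, "b": 0.3, "c": -0.1}
candidates = ["t", "a", "b", "c", "z"]  # "z" is not in the network
print(enrichment_score(flows, "t", candidates))
```

Here the test gene "t" is outranked by "a" and tied with "b", so it is ranked third.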

AUC (Area Under the ROC Curve) evaluates the performance of gene prioritization according to the ROC (Receiver Operating Characteristic) curve. ROC analysis can effectively estimate the performance of binary classifiers, and gene prioritization can be deemed as binary classification by setting a rank threshold [25]. Candidate genes above the threshold are considered as positive samples (disease genes), while genes below the threshold are negative samples (non-disease genes). Given a certain threshold, we can evaluate the sensitivity and specificity of the method. Sensitivity is the proportion of the true disease genes above the threshold among the total prioritizations. Since there were 794 disease genes for the 110 hereditary diseases investigated, the number of prioritizations in the experiments was 794. Specificity is the proportion of genes below the threshold among all of the candidate genes. The ROC curve can be drawn by plotting Sensitivity versus (1 − Specificity) as the threshold separating the prediction classes is varied. A detailed introduction to the ROC curve can be found in references [7] and [25].
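
A rank-based ROC curve of this kind can be sketched as follows. The sketch follows the definitions in the preceding paragraph literally: sensitivity is the share of prioritizations whose test gene is ranked at or above the threshold, and 1 − specificity is the share of each candidate list at or above the threshold. The function names, the trapezoidal integration and the small list size in the example are illustrative assumptions.

```python
def roc_points(ranks, n_candidates=100):
    """ROC points from the ranks of test genes, one rank per prioritization."""
    points = [(0.0, 0.0)]
    for k in range(1, n_candidates + 1):
        sensitivity = sum(1 for r in ranks if r <= k) / len(ranks)
        one_minus_specificity = k / n_candidates
        points.append((one_minus_specificity, sensitivity))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example with candidate lists of 4 genes and three prioritizations.
print(auc(roc_points([1, 1, 2], n_candidates=4)))
```

Under these definitions a method that always ranks the test gene first does not reach an area of exactly 1, because the test gene itself counts among the candidates above the threshold.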

Results

Disease genes are not well connected with essential genes

In this paper, we systematically study the topology associations between disease proteins and essential proteins.

We analyzed the proportions of non-disease essential proteins among n neighbors of disease proteins and other proteins, respectively. Fig. 1 and Fig. 2 show $M d (P_{E^{-} D^{-}}^{n})$ and $M d (P_{E^{-} O}^{n})$ in the protein interaction networks constructed based on the i2d and STRING databases. As the diameter of the protein interaction network constructed based on the i2d database is 12, n ∈ {1, 2, …, 12} in Fig. 1. Similarly, n ∈ {1, 2, …, 11} in Fig. 2. The difference between the curves of non-essential disease proteins and other proteins in Fig. 1 and Fig. 2 seems small. However, on the whole, $M d (P_{E^{-} D^{-}}^{n})$ is statistically smaller than $M d (P_{E^{-} O}^{n})$ , as shown in Table 4 and Table 5. Table 4 and Table 5 provide the statistics of $M d (P_{E^{-} D^{-}}^{n})$ and $M d (P_{E^{-} O}^{n})$ in the protein interaction networks constructed based on the i2d and STRING databases. The median values of $P_{E^{-} D^{-}}^{n}$ and $P_{E^{-} O}^{n}$ (n ∈ {7, 8, 9, 10, 11, 12}) in the protein interaction network constructed based on the i2d database are both 0.00%, and there are no obvious differences. Thus, $P_{E^{-} D^{-}}^{n}$ and $P_{E^{-} O}^{n}$ (n ∈ {7, 8, 9, 10, 11, 12}) were omitted from Table 4. Similarly, $P_{E^{-} D^{-}}^{n}$ and $P_{E^{-} O}^{n}$ (n ∈ {8, 9, 10, 11}) were omitted from Table 5. The significance of the differences between the two protein populations in Table 4 and Table 5 was calculated by the rank sum test. As shown in Table 4, $M d (P_{E^{-} D^{-}}^{n})$ (n ∈ {2, 3, 4, 5, 6}) were significantly smaller than $M d (P_{E^{-} O}^{n})$ in the protein interaction network constructed based on the i2d database. As shown in Table 5, $M d (P_{E^{-} D^{-}}^{n})$ (n ∈ {1, 2, 3, 4}) were significantly smaller than $M d (P_{E^{-} O}^{n})$ in the protein interaction network constructed based on the STRING database. Thus, disease genes are not well connected with essential genes in the protein interaction networks.

Fig 1. Median values of the proportions of non-disease essential proteins among n (n ∈ {1, 2, …, 12}) neighbors in the protein interaction network constructed based on the i2d database.

https://doi.org/10.1371/journal.pone.0116505.g001

Fig 2. Median values of the proportions of non-disease essential proteins among n (n ∈ {1, 2, …, 11}) neighbors in the protein interaction network constructed based on the STRING database.

https://doi.org/10.1371/journal.pone.0116505.g002

Table 4. Median values of the proportions of non-disease essential proteins among n (n ∈ {1, 2, 3, 4, 5, 6}) neighbors of nonessential disease proteins (D⁻) and other proteins (O) in the protein interaction network constructed based on the i2d database.

https://doi.org/10.1371/journal.pone.0116505.t004

Table 5. Median values of the proportions of non-disease essential proteins among n (n ∈ {1, 2, 3, 4, 5, 6, 7}) neighbors of nonessential disease proteins (D⁻) and other proteins (O) in the protein interaction network constructed based on the STRING database.

https://doi.org/10.1371/journal.pone.0116505.t005

Goh et al. explained their finding about the topological importance of disease genes by using an evolutionary argument [26]. Similarly, our new finding can also be explained using an evolutionary argument. If disease genes have many interactions with essential genes, mutations of disease genes are likely to seriously affect essential genes. This would probably lead to serious disease or even death. Thus, people whose disease genes have more interactions with essential genes were eliminated over the course of evolution. The existing protein-protein interaction network structure can protect the primary normal functions for life.

Disease gene prediction for 110 diseases

Based on the hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases, local distance measurements, such as Direct Neighbors [8, 9] or Shortest Path [10, 11], have been widely used to detect disease genes. However, local distance measurements have many limitations. One major problem is that they cannot effectively detect disease proteins that are far away from other disease proteins but have many interactions with them. Thus, Kohler et al. [7] adopted global distance measurements, such as Random Walk with Restart and Kernel Diffusion, to detect disease genes. Global distance measurements can take full advantage of the topological structure of the protein-protein interaction networks, and estimate the similarity between any two proteins based on all of the paths between them. Thus, they can detect candidate disease proteins that have dense interactions with known disease proteins. Fig. 3(a) shows an example. Local distance measurements will mistake the protein d for a disease protein, while global distance measurements can correctly identify the disease protein c.

Fig 3. An example of gene prioritization based on network.

(a) The disease proteins a and b are selected as the training set, while c as the test disease protein. (b) Global distance measurements may mistake the non-disease hub protein e for a disease protein.

https://doi.org/10.1371/journal.pone.0116505.g003

Even though the performance of global distance measurements is superior to local distance measurements, hub proteins with high betweenness (essential proteins or other proteins) may be mistaken for candidate disease proteins in some cases. As shown in Fig. 3(b), the non-disease protein e has the largest number of interactions with disease proteins and is therefore mistaken for the disease protein. Thus, a novel method is required to select the true disease protein c. The empirical analyses in the previous section indicate that disease proteins are not well connected with essential proteins. Additionally, hub proteins with high betweenness that are mistaken for disease genes are probably essential proteins that have numerous interactions with essential proteins. Therefore, we can attempt to avoid mistakes such as those shown in Fig. 3(b) by investigating the proportions of essential proteins among neighbors of candidate proteins. As shown in Fig. 3(b), many essential proteins (green nodes in Fig. 3(b)) exist among neighbors of e. This can decrease the probability of mistaking e for a disease protein, and enables the correct identification of the disease protein c. In the following section, we will demonstrate the advantages of our approach for 110 hereditary diseases.

First, we compared the enrichment score of NP_D&E, NP_D and NP_E for 110 hereditary diseases with the integrated protein interaction network. As shown in S1 Table, NP_D&E can rank all of the disease genes of 18 diseases first (Enrichment score 2 is 50), such as Alzheimer Disease (4 disease genes), multiple epiphyseal dysplasia AD (5 disease genes) and so on. Specifically, the performance of NP_D&E was much better than that of NP_D (the improvement of Enrichment score 2 was greater than 5) for 41 diseases, and slightly better (the improvement of Enrichment score 2 was less than 5) for 33 diseases; the performance of NP_D&E was the same as NP_D for 20 diseases, and worse than NP_D for 16 diseases.

As shown in Table 6, we performed further statistical analysis on NP_D&E, NP_D and NP_E for 110 diseases (S1 Table). Compared with NP_D, the average of Enrichment score 1 and the average of Enrichment score 2 of NP_D&E improved by 3.340 and 3.915, respectively. Table 7 presents the probability associated with a one-tailed Student’s t-test and demonstrates that the improvement in NP_D&E is statistically significant. Moreover, we compared the performance of NP_D and NP_D&E on monogenic diseases, complex diseases and cancer, following the categories defined by Kohler et al. [7]. As shown in Table 6 and Table 7, the improvement in NP_D&E for monogenic diseases was the most obvious, and there was a slight improvement in complex diseases. However, the performance of NP_D&E in cancer was similar to that of NP_D (p-value > 0.99). The reason for this may be that disease genes associated with cancer are usually essential genes, and essential proteins have many interactions with other essential proteins, which probably affects the performance of NP_D&E. Additionally, ROC analysis was adopted to compare the performance of NP_D&E and NP_D. The disease genes that did not have corresponding proteins in the protein interaction network were excluded from the ROC analysis. Fig. 4 indicates that the performance of NP_D&E was superior to that of NP_D, with a t-test p-value of 3.3307e-016 for NP_D&E versus NP_D.

Table 6. Statistics of the performance (the average values of enrichment score) with disease as a unit.

https://doi.org/10.1371/journal.pone.0116505.t006

Table 7. One tailed t-Tests for Table 6: NP_D&E versus Competing Approaches.

https://doi.org/10.1371/journal.pone.0116505.t007

Fig 4. ROC curves.

https://doi.org/10.1371/journal.pone.0116505.g004

Next, to compare the ability of NP_D&E and NP_D to detect new disease genes, we used the disease genes verified before 2008 as the training set and the disease genes verified after 2008 as the test set. The test set consists of 447 new disease genes of 83 diseases verified after 2008 from the OMIM database. Table 8 shows the statistical analyses of the ability of the two strategies to detect disease genes verified after 2008. NP_D&E was able to identify new disease genes more effectively than NP_D. According to the statistical analyses, the average rank of disease genes according to the Enrichment score 2 of NP_D&E was $\frac{50}{14.548} \approx 3$ . This result implies that NP_D&E can help biomedical experts to efficiently discover new disease genes with a small number of experiments.

Table 8. Statistics of the performance (the average values of enrichment score) with disease as a unit to detect disease genes of 83 diseases verified after 2008.

Significances (p-value) between the results of NP_D and NP_D&E were calculated by the one tailed student’s t-test.

https://doi.org/10.1371/journal.pone.0116505.t008

Finally, we provide a real example of effectively detecting disease genes by NP_D&E. Fig. 5 shows the disease proteins of Leukoencephalopathy with vanishing white matter and their interactions in the protein interaction network constructed based on the i2d database. NP_D&E was able to correctly identify each disease protein, while NP_D failed to identify the disease protein Q5QP88. In Fig. 5, white nodes stand for other proteins, blue nodes denote non-disease essential proteins, red nodes indicate disease proteins that were correctly identified by NP_D, the purple node signifies a disease protein that was not correctly identified (Q5QP88 ranked 14th) by NP_D, and the yellow node is a non-disease protein that was mistaken for a disease protein by NP_D. Because disease proteins Q13144, Q14232, Q9UI10 and P49770 are close to each other and have many interactions between them, they can be correctly identified by NP_D. However, Q5QP88 is located at a distance from other disease proteins and there are fewer interactions between them. Thus, in the prioritization of NP_D, the final flow allocated to Q5QP88 was 1.15e-04 while that for Q06830 was 3.49e-04, and Q06830 was mistaken for the disease protein. The proportion of essential proteins among the neighbors of Q06830 was very high, indicating that Q06830 was not a disease protein according to our hypothesis. In contrast to NP_D, in the prioritization of NP_D&E, the flow allocated to Q5QP88 was 9.668e-05 (Q5QP88 ranked 1st) while that for Q06830 was −8.997e-05 (Q06830 ranked last).

Fig 5. Leukoencephalopathy with Vanishing White Matter Protein-Protein Interaction Network.

https://doi.org/10.1371/journal.pone.0116505.g005

Discussion

Molecular networks describe interactions among molecules that can reflect functional linkages. Thus, network-based methods have been widely studied as a way to discover potential disease genes with functions similar to those of known disease genes. By taking full advantage of the global topological structure, global distance measurements can outperform local distance measurements. However, global distance measurements have some problems. For example, Yang et al. [27] indicated that network-based methods are limited to detecting potential disease genes only in the small regions around known disease genes. As shown in Fig. 5, global distance measurements may mistake non-disease hub proteins for potential disease proteins. One main cause of these problems is that existing network-based methods are designed around the typical hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases. Thus, these methods can only detect potential disease genes that have high topological similarity to known disease genes.

To address these problems, this paper attempted to discover new properties of disease genes by analyzing the topological associations between disease proteins and essential proteins in the protein interaction network. Empirical results demonstrate that disease genes are not well connected with essential genes in protein interaction networks. This new finding can be used to explain the conclusion that disease proteins are topologically more important than other proteins [18].

One major hypothesis of molecular network analysis is that “there is a tight relation between network structure and biological function” [28]. Thus, many studies have analyzed the properties of disease genes with protein interaction networks [3, 17, 18, 26], and demonstrated that disease proteins are topologically important [3, 17]. However, Goh et al. [26] indicated that a small number of essential genes exist among the disease genes, and this may affect the correctness of the analyses. Goh et al. selected mouse lethal orthologs of human genes as human essential genes and demonstrated that the majority of disease proteins are topologically neutral. Nevertheless, for 60% of disease genes, a knockout of the mouse ortholog has not been reported [29]. We analyzed the topological importance of disease proteins by using housekeeping genes as essential genes [18]. Empirical results demonstrated that disease proteins are topologically more important than other proteins. However, this raised a new question: because disease proteins are topologically important, would disease genes seriously affect human survival? Our new finding can answer the question to some extent. Because disease genes are not well correlated with essential genes, disease genes would not seriously affect normal activities. Additionally, our finding provides new insights into the pathogenesis of diseases.

Based on the new finding, we proposed a new hypothesis: if too many non-disease essential proteins are neighbors of a candidate protein, then the protein is unlikely to cause diseases. We proposed a network propagation method based on both the typical hypothesis and the new hypothesis. The method not only considers the topological similarities of candidate proteins to known disease proteins but also exploits their topological dissimilarities to essential proteins. To some extent, the method can avoid mistaking non-disease hub proteins for potential disease proteins. Our strategy may inspire new ideas for disease gene prediction and may be helpful for predicting genotype-phenotype associations with the phenome-interactome network [27].
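The dual-flow idea can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: it assumes an iterative propagation of the form F = αW'F + (1 − α)Y with a degree-normalized adjacency matrix W', in the style of network propagation methods such as [13], and it assumes prior flows of +1 for disease seeds and −1 for essential seeds with α = 0.5. The toy graph and all node names are hypothetical.

```python
import math


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def propagate(adjacency, prior, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate F = alpha * W' F + (1 - alpha) * Y until the flow stabilizes.

    W'[u][v] = 1 / sqrt(deg(u) * deg(v)) for each edge (assumed normalization).
    """
    degree = {u: len(vs) for u, vs in adjacency.items()}
    flow = {u: prior.get(u, 0.0) for u in adjacency}
    for _ in range(max_iter):
        new_flow = {}
        for u, neighbors in adjacency.items():
            incoming = sum(
                flow[v] / math.sqrt(degree[u] * degree[v]) for v in neighbors
            )
            new_flow[u] = alpha * incoming + (1 - alpha) * prior.get(u, 0.0)
        delta = max(abs(new_flow[u] - flow[u]) for u in adjacency)
        flow = new_flow
        if delta < tol:
            break
    return flow


# Hypothetical toy network: "hub" touches three disease seeds and three
# essential proteins; "far" touches one disease seed and one neutral node.
edges = [
    ("hub", "d1"), ("hub", "d2"), ("hub", "d3"),
    ("hub", "e1"), ("hub", "e2"), ("hub", "e3"),
    ("far", "d1"), ("far", "n1"),
]
adjacency = build_adjacency(edges)
disease = {"d1", "d2", "d3"}
essential = {"e1", "e2", "e3"}

prior_d = {g: 1.0 for g in disease}             # positive flow only
prior_de = dict(prior_d)
prior_de.update({g: -1.0 for g in essential})   # add negative flow (assumed -1)

np_d = propagate(adjacency, prior_d)
np_de = propagate(adjacency, prior_de)

candidates = ["hub", "far"]
print(sorted(candidates, key=np_d.get, reverse=True))   # ranking, disease flow only
print(sorted(candidates, key=np_de.get, reverse=True))  # ranking, dual flow
```

In this toy setting the hub candidate outranks the peripheral candidate when only positive flow is propagated, and the order reverses once negative flow from the essential proteins is added, which is the qualitative behavior the new hypothesis is meant to produce.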

In future work, we will further study the integration of the dual flows for detecting disease genes based on game theory. Additionally, we intend to apply our strategy to assist molecular diagnosis, in order to speed up the identification of disease genes in next-generation sequencing data [30]. Itan et al. utilized a local distance measurement that adopts the shortest path to the core gene for monogenic disorders [30]. It could be beneficial to use our new global measurement to improve the quality of molecular diagnosis.

Supporting Information

S1 Table. Enrichment results with the integrated protein interaction network.

https://doi.org/10.1371/journal.pone.0116505.s001

(DOC)

Author Contributions

Conceived and designed the experiments: SW FS RS YS. Performed the experiments: SW YZ JH. Analyzed the data: SW YZ JH. Contributed reagents/materials/analysis tools: SW YZ. Wrote the paper: SW JJ RD SX.

References

1. Bromberg Y (2013) Disease gene prioritization. PLoS computational biology 9: e1002902. pmid:23633938
2. Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS (2005) Speeding disease gene discovery by sequence based candidate prioritization. BMC bioinformatics 6: 55. pmid:15766383
3. Xu J, Li Y (2006) Discovering disease-genes by topological features in human protein–protein interaction network. Bioinformatics 22: 2800–2805. pmid:16954137
4. Smalter A, Lei SF, Chen Xw (2007) Human disease-gene classification with integrative sequence-based and topological features of protein–protein interaction networks. In: Bioinformatics and Biomedicine, 2007. BIBM 2007. IEEE International Conference on. IEEE, pp. 209–216.
5. Yang P, Li XL, Mei JP, Kwoh CK, Ng SK (2012) Positive-unlabeled learning for disease gene identification. Bioinformatics 28: 2640–2647. pmid:22923290
6. Nguyen TP, Ho TB (2012) Detecting disease genes based on semi-supervised learning and protein–protein interaction networks. Artificial intelligence in medicine 54: 63–71. pmid:22000346
7. Köhler S, Bauer S, Horn D, Robinson PN (2008) Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics 82: 949–958.
8. Oti M, Snel B, Huynen MA, Brunner HG (2006) Predicting disease genes using protein–protein interactions. Journal of medical genetics 43: 691–698. pmid:16611749
9. Linghu B, Snitkin ES, Hu Z, Xia Y, DeLisi C, et al. (2009) Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol 10: R91. pmid:19728866
10. Radivojac P, Peng K, Clark WT, Peters BJ, Mohan A, et al. (2008) An integrated approach to inferring gene–disease associations in humans. Proteins: Structure, Function, and Bioinformatics 72: 1030–1037.
11. Franke L, Bakel Hv, Fokkens L, De Jong ED, Egmont-Petersen M, et al. (2006) Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. The American Journal of Human Genetics 78: 1011–1025.
12. Li Y, Patra JC (2010) Genome-wide inferring gene–phenotype relationship by walking on the heterogeneous network. Bioinformatics 26: 1219–1224. pmid:20215462
13. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R (2010) Associating genes and protein complexes with disease via network propagation. PLoS computational biology 6: e1000641. pmid:20090828
14. Wang X, Gulbahce N, Yu H (2011) Network-based methods for human disease gene prediction. Briefings in functional genomics 10: 280–293. pmid:21764832
15. Barabási AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12: 56–68. pmid:21164525
16. Tu Z, Wang L, Xu M, Zhou X, Chen T, et al. (2006) Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC genomics 7: 31. pmid:16504025
17. Jin W, Qin P, Lou H, Jin L, Xu S (2012) A systematic characterization of genes underlying both complex and mendelian diseases. Human molecular genetics 21: 1611–1624. pmid:22186022
18. Wu Sy, Shao Fj, Sun Rc, Sui Y, Wang Y, et al. (2014) Analysis of human genes with protein–protein interaction network for detecting disease genes. Physica A: Statistical Mechanics and its Applications 398: 217–228.
19. Newman ME (2005) Threshold effects for two pathogens spreading on a network. Physical review letters 95: 108701. pmid:16196976
20. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA (2005) Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic acids research 33: D514–D517. pmid:15608251
21. Chang CW, Cheng WC, Chen CR, Shu WY, Tsai ML, et al. (2011) Identification of human housekeeping genes and tissue-selective genes by microarray meta-analysis. PloS one 6: e22859. pmid:21818400
22. Congiu M, Slavin JL, Desmond PV (2011) Expression of common housekeeping genes is affected by disease in human hepatitis c virus-infected liver. Liver International 31: 386–390. pmid:21073651
23. Waxman S, Wurmbach E (2007) De-regulation of common housekeeping genes in hepatocellular carcinoma. BMC genomics 8: 243. pmid:17640361
24. Guibinga GH, Hsu S, Friedmann T (2010) Deficiency of the housekeeping gene hypoxanthine–guanine phosphoribosyltransferase (hprt) dysregulates neurogenesis. Molecular Therapy 18: 54–62. pmid:19672249
25. Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. (2006) Gene prioritization through genomic data fusion. Nature biotechnology 24: 537–544. pmid:16680138
26. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. Proceedings of the National Academy of Sciences 104: 8685–8690.
27. Yang P, Li X, Wu M, Kwoh CK, Ng SK (2011) Inferring gene-phenotype associations via global protein complex network propagation. PloS one 6: e21502. pmid:21799737
28. Furlong LI (2013) Human diseases through the lens of network biology. Trends in Genetics 29: 150–159. pmid:23219555
29. Dickerson JE, Zhu A, Robertson DL, Hentges KE (2011) Defining the role of essential genes in human disease. PloS one 6: e27368. pmid:22096564
30. Itan Y, Zhang SY, Vogt G, Abhyankar A, Herman M, et al. (2013) The human gene connectome as a map of short cuts for morbid allele discovery. Proceedings of the National Academy of Sciences 110: 5558–5563.
