Network Propagation with Dual Flow for Gene Prioritization

Based on the hypothesis that the neighbors of disease genes trend to cause similar diseases, network-based methods for disease prediction have received increasing attention. Taking full advantage of network structure, the performance of global distance measurements is generally superior to local distance measurements. However, some problems exist in the global distance measurements. For example, global distance measurements may mistake non-disease hub proteins that have dense interactions with known disease proteins for potential disease proteins. To find a new method to avoid the aforementioned problem, we analyzed the differences between disease proteins and other proteins by using essential proteins (proteins encoded by essential genes) as references. We find that disease proteins are not well connected with essential proteins in the protein interaction networks. Based on this new finding, we proposed a novel strategy for gene prioritization based on protein interaction networks. We allocated positive flow to disease genes and negative flow to essential genes, and adopted network propagation for gene prioritization. Experimental results on 110 diseases verified the effectiveness and potential of the proposed method.


Introduction
Disease gene prediction is an important task in bioinformatics. It aims to discover potential disease genes based on known disease genes and omics data, such as metabolic pathways and protein-protein interactions, by utilizing machine learning and complex network theory. It is very important to understand the pathogenesis of hereditary diseases and improve the quality of diagnosis [1].
As a meaningful strategy of disease gene prediction, gene classification aims to construct a binary classification model to automatically determine whether an unknown gene is a disease gene. To effectively distinguish disease genes from non-disease genes, some researchers have utilized sequence-based characteristics to construct classifiers [2]. At the same time, the hypothesis that the neighbors of disease genes are likely to cause diseases prompted scholars to positive flow, while essential proteins carry negative flow. And network propagation is considered as the competition between disease proteins and essential proteins. Proteins with more positive flow trend to cause diseases, while proteins with more negative flow are probably nondisease proteins. Thus, by network propagation we can find potential disease proteins that have more interactions with known disease proteins (indicating that they probably have similar functions), but fewer interactions with essential proteins (suggesting that the disease proteins are not well connected with essential proteins). Experimental results on 110 hereditary diseases verified the effectiveness and potential of the proposed method.

Materials and Methods
Human gene list, hereditary disease list and human protein-protein interaction data The disease gene list was downloaded from the Online Mendelian Inheritance in Man database (OMIM) [20]. We selected 2931 disease genes with tag "3" from 6285 entries. Genes with tag "3" have been verified by the presence of a mutation. Then, we obtained housekeeping genes from the research of Chang et al. [21]. Housekeeping genes are universally expressed in normal tissues or cells and are vital to maintaining fundamental life activities. Thus, housekeeping genes can be deemed as essential genes [16].
We obatined 110 hereditary diseases and corresponding disease genes from Kohler et al. (http://download.cell.com/AJHG/mmcs/journals/0002-9297/PIIS0002929708001729.mmc1. zip). Kohler et al. [7] collected the associations between genetic diseases and disease genes from OMIM, domain knowledge and medicinal literatures. Here, 110 diseases are accounted for by 794 disease genes; there were 681 unique genes listed (one gene may cause more than one disease).
The human protein interactions were downloaded from the i2d (http://ophid.utoronto.ca/ ophidv2.204/) and STRING (http://string-db.org/) databases. Table 1 lists the statistics of networks constructed based on the protein interactions. The i2d database uses proteins as interactors. Thus, we mapped genes to proteins according to the UniProt database (http://www. UniProt.org). Unlike the i2d database, the STRING database uses genes as interactors, and provides a score to evaluate the reliability between two interactors. Similar to Kohler et al. [7], we set a threshold score of 0.4 to extract unweighted interactions. We integrated all the data from the two databases to construct a larger network (this paper refers it to "integrated protein interaction network") for disease gene prediction.
In this paper, we annotated essential proteins/genes and disease proteins/genes as E and D respectively, and the remaining proteins/genes (O = ¬(E S D)) were treated as other proteins/ genes. Table 2 and Table 3 list the statistics of different types of interactors in the protein interaction networks constructed based on the i2d and STRING databases. For the sake of brevity, ¬D T E is denoted by E − and ¬E T D is denoted by D − . Analysis of the topology associations between disease proteins and essential proteins Essential genes were initially considered to be stable genes unaffected by other factors. However, recent studies have indicated that the expression of essential genes can be influenced by other factors, such as diseases [22][23][24]. Our recent study analyzed the associations between disease genes and essential genes in the protein interaction network. Empirical results demonstrated that even though non-essential disease proteins are closer to essential proteins, the proportions of non-disease essential proteins among 1-direct neighbors of non-essential disease proteins are similar to those of other proteins, and the proportions of non-disease essential proteins among 2-indirect neighbors of non-essential disease proteins are statistically smaller than those of other proteins. This finding illustrates that disease proteins are not well connected with essential proteins. In this paper, we systematically study the topology associations between disease proteins and essential proteins. n neighbors of node i are defined as node set Q n i , in which the shortest path of each element to node i is n. Here, n is a positive integer. For instance, Q 1 i is the set of direct neighbors of node i. We intend to compare the differences of the proportions of non-disease essential proteins among n neighbors of non-essential disease proteins and other proteins. For the sake of brevity, the intersection of set Q n i and set E − is denoted by Q n In this paper, the proportion of non-disease essential proteins among n neighbors of node i is defined as follows. In this paper, fp n E À i j i 2 Xg is denoted by P n E À X and the median of P n E À X is denoted by MdðP n E À X Þ.

Gene prioritization
In this work, the network propagation method was adopted to detect disease genes. Network propagation on a network can be understood as simulating a process, in which nodes iteratively pump flow to their neighbors [13]. A node would pump equal flow to each of its direct neighbors for each timestamp. We denote the network as G = (V, L). Here, V is the node set of the network and L is the edge set of the network. Given one positive unit flow to node x, the flow pumped from node x to node y is W(x, y) = A(x, y)/k(x). Here, k(x) is the degree of node x, A is the adjacency matrix, and W denotes the normalized adjacency matrix. A (x, y) = 1 if, and only if, (x, y) 2 L; otherwise, A(x, y) = 0. In this way, we can evaluate the similarities between other nodes and node x based on the network structure.
Furthermore, in order to combine prior knowledge (nodes that are allocated prior information should have more flow) and network structures (adjacent nodes are assigned with similar flow), network propagation can be defined as follows: Here, F t is a vector in which i-th element holds the flow allocated to node i at timestamp t, α is a parameter controlling the prevalence of prior information Y (a jVj Ã 1 vector), and F 1 = Y. Given F t+1 = F t , we can obtain the steady-state solution F 1 to equation (2): Denote α(I − (1 − α)W) −1 as S, and the element S(x, y) stands for the similarity between node x and y. Given a hereditary disease h and its known disease genes T h , the similarity of candidate gene x with disease genes can be computed as follows.
The above equation is a particular solution of equation (2) when each disease gene of disease h is assigned +1 unit flow for the prior information Y. According to the above equation, we can rank the candidate disease genes. This is a global distance measurement for disease gene prediction, called "NP D ". NP D is mainly based on the well-known hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases. Because NP D can effectively exploit global topological structures, such as dense indirect interactions between disease proteins, the performance is obviously better than local distance measurements.
We intend to exploit a new hypothesis that, if too many non-disease essential proteins exist as neighbors of a candidate protein, the protein is unlikely to cause diseases. According to this hypothesis, we can assign −1 unit flow to each non-disease essential protein for the prior information Y. The dissimilarity of candidate gene x with non-disease essential genes can be computed as follows.
In this paper, this is termed "NP E ".
This paper integrates the above two hypotheses. We allocate positive flow to the disease proteins and negative flow to the non-disease essential proteins to set the prior information Y. Additionally, we ensured that the amount of positive flow is equal to that of negative flow. In the experiment, +1 unit flow was assigned to all disease proteins, while −1 unit flow was allocated to all non-disease essential proteins. The rank of candidate gene x was assigned with its score defined as This paper named the new strategy "NP D&E ".
To validate the new strategy, we utilized Leave-One-Out Cross-Validation [7] in the experiments. Given a hereditary disease and the corresponding disease genes (suppose the total number of disease genes is m), we selected each disease gene as a test set in turn, while leaving the remaining m − 1 disease genes as the training set. Therefore, we performed trials m times, and adopted the mean value of the results as the performance of the method. In this paper, we used enrichment-analysis [7] and AUC-analysis [25] to evaluate the performance for detecting disease genes.
Enrichment Score is a typical evaluation index for gene prioritization. For each disease gene used as a test gene, we selected 100 closest genes to the gene on the same chromosome to construct a candidate gene list (including the test gene). If the final flow allocated to the test gene is ranked r th , the Enrichment Score is 50 r . If the test gene has the same flow as other candidate genes, it is ranked last among them. Additionally, if the protein encoded by the test gene is not in the protein-protein interaction network, we consider the rank to be 100 (Enrichment Score is 0.5). In the experiments, we obtained two results for Enrichment Scores. One is termed "Enrichment score 1" and includes disease genes not in the protein-protein interaction network. The other is termed "Enrichment score 2" and eliminates disease genes not in the protein-protein interaction network.
AUC (Area Under ROC Curve) evaluates the performance of gene prioritization according to ROC (Receiver-Operating Characteristic). AUC is the area under the ROC curve. ROC analysis can effectively estimate the performance of binary classifiers, and gene prioritization can be deemed as binary classification by setting a rank threshold [25]. Candidate genes above the threshold are considered as positive samples (disease genes), while genes below the threshold are negative samples (non-disease genes). Given a certain threshold, we can evaluate the sensitivity and specificity of the method. Specificity is the proportion of the true disease genes above the threshold among the total prioritizations. Since there were 794 disease genes for the 110 hereditary diseases investigated, the number of prioritizations in the experiments was 794. Specificity is the proportion of genes below the threshold among all of the candidate genes. ROC curve can be drawn by plotting the Specificity versus (1-Specificity) subject to the threshold separating the prediction class. A detailed introduction about the ROC curve can be found in references [7] and [25].

Results
Disease genes are not well connected with essential genes In this paper, we systematically study the topology associations between disease proteins and essential proteins.
We analyzed the proportions of non-disease essential proteins among n neighbors of disease proteins and other proteins, respectively. Fig. 1 and Fig. 2 demonstrate MdðP n E À D À Þ and MdðP n E À O Þ in the protein interaction networks constructed based on the i2d database and STRING databases. As the diameter of the protein interaction network constructed based on the i2d database is 12, n 2 {1, 2, . . ., 12} in Fig. 1. Similarily, n 2 {1, 2, . . ., 11} in Fig. 2. The difference between the curves of non-essential disease proteins and other proteins in Fig. 1 and Fig. 2 seems small. However, on the whole, MdðP n E À D À Þ are statically smaller than MdðP n E À O Þ as shown in Table 4 and Table 5. Table 4 and Table 5 provide the statistics of MdðP n E À D À Þ and MdðP n E À O Þ in the protein interaction networks constructed based on the i2d database and STRING databases. The median values of P n E À D À and P n E À O (n 2 {7, 8, 9, 10, 11, 12}) in the protein interaction network constructed based on the i2d database are both 0.00%, and there are no obvious differences. Thus, P n E À D À and P n E À O (n 2 {7, 8, 9, 10, 11, 12}) was ignored in Table 4. Similarily, P n E À D À and P n E À O (n 2 {8, 9, 10, 11}) was ignored in Table 5. Significances between the two protein populations in Table 4 and Table 5 were calculated by the Rank sum test. As shown in Table 4, MdðP n E À D À Þ (n 2 {2, 3, 4, 5, 6}) were significantly smaller than MdðP n E À O Þ in the protein interaction network constructed based on the i2d database. As shown in Table 5, MdðP n E À D À Þ (n 2 {1, 2, 3, 4}) were significantly smaller than MdðP n E À O Þ in the protein interaction network constructed based on the STRING database. Thus, disease genes are not well connected with essential genes in the protein interaction networks.
Goh et al. explained their finding about topology importance of disease genes by using an evolutionary argument [26]. Similarily, our new finding can also be explained using an evolutionary argument. If disease genes have many interactions with essential genes, mutations of disease genes are likely to seriously affect essential genes. This would probably lead to serious disease or even death. Thus, people whose disease genes have more interactions with essential Table 4. Median values of the proportions of non-disease essential proteins among n (n 2 {1, 2, 3, 4, 5, 6}) neighbors of nonessential disease proteins (D − ) and other proteins (O) in the protein interaction network constructed based on the i2d database.  Table 5. Median values of the proportions of non-disease essential proteins among n (n 2 {1, 2, 3, 4, 5, 6, 7}) neighbors of nonessential disease proteins (D − ) and other proteins (O) in the protein interaction network constructed based on the STRING database. genes were eliminated over the course of evolution. The existing protein-protein interaction network structure can protect the primary normal functions for life.

Disease genes prediction for 110 diseases
Based on the hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases, local distance measurements, such as Direct Neighbors [8,9] or Shortest Path [10,11] have been widely used to detect disease genes. However, local distance measurements have many limitations. One major problem is that they cannot effectively detect disease proteins, which are far away from other disease proteins, but have many interactions with them. Thus, Kohler et al. [7] adopted global distance measurements, such as Random Walk with Restart and Kernel Diffusion, to detect disease genes. Global distance measurements can take full advantage of the topological structure of the protein-protein interaction networks, and estimate the similarity between any two proteins based on all of the paths between them. Thus, they can detect candidate disease proteins that have dense interactions with known disease proteins. Fig. 3(a) shows an example. Local distance measurements will mistake the protein d for a disease protein, while global distance measurements can correctly identify the disease protein c.
Even though the performance of global distance measurements is superior to local distance measurements, hub proteins with high betweenness (essential proteins or other proteins) may be mistaken for candidate disease proteins in some cases. As shown in Fig. 3(b), the non-disease protein e has the largest number of interactions with disease proteins and is therefore mistaken for the disease protein. Thus, a novel method is required to select the true disease protein c. The empirical analyses in the previous section indicate that disease proteins are not well connected with essential proteins. Additionally, hub proteins with high betweenness that are mistaken for disease genes are probably essential proteins that have numerous interactions with essential proteins. Therefore, we can attempt to avoid mistakes such as those shown in Fig. 3(b) by investigating the proportions of essential proteins among neighbors of candidate proteins. As shown in Fig. 3(b), many essential proteins (green nodes in Fig. 3(b)) exist among neighbors of e. This can decrease the probability of mistaking e for a disease protein, and enables the correct identification of the disease protein c. In the following section, we will demonstrate the advantages of our approach for 110 hereditary diseases.
First, we compared the enrichment score of NP D&E , NP D and NP E for 110 hereditary diseases with the integrated protein interaction network. As shown in S1 Table, NP D&E can rank all of the disease genes of 18 diseases first (Enrichment score 2 is 50), such as Alzheimer Disease (4 disease genes), multiple epiphyseal dysplasia AD (5 disease genes) and so on. Specifically, the performance of NP D&E was much better than that of NP D (the improvement of Enrichment score 2 was greater than 5) for 41 diseases, and slightly better (the improvement of Enrichment score 2 was less than 5) for 33 diseases; the performance of NP D&E was the same as NP D for 20 diseases, and worse than NP D for 16 diseases.
As shown in Table 6, we performed further statistical analysis on NP D&E , NP D and NP E for 110 diseases (S1 Table). Compared with NP D , the average of Enrichment score 1 and the average of Enrichment score 2 of NP D&E improved by 3.340 and 3.915, respectively. Table 7 presents the probability associated with a one-tailed student's t-test and demonstrates that the improvement in NP D&E is statistically significant. Moreover, we compared the performance of NP D and NP D&E on monogenic disease, complex disease and cancer, which were divided by Kohler et al. [7]. As shown in Table 6 and Table 7, the improvement in NP D&E for monogenic diseases was the most obvious, and there was a slight improvement in complex diseases. However, the performance of NP D&E in cancer was similar with NP D (p − value > 0.99). The reason for this may be that disease genes associated with cancer are usually essential genes, and essential proteins have lots of interactions with other essential proteins, which probably affects the performance of NP D&E . Additionally, ROC analysis was adopted to compare the performance of NP D&E and NP D . The disease genes that did not have corresponding proteins in the protein interaction network were excluded in ROC analysis. Fig. 4 indicates that the performance of NP D&E was superior to NP D with a t-test p-value of 3.3307e-016 for NP D&E versus NP D .
Next, to compare the ability of NP D&E and NP D to detect new disease genes, we used the disease genes verified before 2008 as the training set and the disease genes verified after 2008 were  Table 8 shows the statistical analyses of the performance of the ability of the two strategies to detect disease genes verified after 2008. NP D&E was able to identify new disease genes more effectively than NP D . According to the statistical analyses, the average rank of disease genes according to the Enrichment score 2 of NP D&E was 50 14:548 % 3. This result implies that NP D&E can assist biomedicine experts to efficiently discover new disease gene with a small amount of medical experiments.
Finally, we provided a true example of effectively detecting disease genes by NP D&E .  was able to correctly identify each disease protein, while NP D failed to identify the disease protein Q5QP88. In Fig. 5, white nodes stand for other proteins, blue nodes denote non-disease essential proteins, red nodes indicate disease proteins that were correctly identified by NP D , the purple node signifies a disease protein that was not correctly identified (Q5QP88 ranked 14th) by NP D , and the yellow node is a non-disease protein that was mistaken for a disease protein by NP D . Because disease proteins Q13144, Q14232, Q9UI10 and P49770 are closer to each other and have many interactions between them, they can be correctly identified by NP D . However, Q5QP88 is located at a distance from other disease proteins and there are fewer interactions between them. Thus, in the prioritization of NP D , the final flow allocated to Q5QP88 was 1.15e-04 while that for Q06830 was 3.49e-04, and Q06830 was mistaken for the disease protein.
The proportion of essential proteins among the neighbors of Q06830 was very high indicating that Q06830 was not a disease protein according to our hypothesis. In contrast to NP D , in the prioritization of NP D&E , the flow allocated to Q5QP88 was 9.668e-05 (Q5QP88 ranked 1st) while Q06830 was −8.997e-05 (Q5QP88 ranked last).

Discussion
Molecular networks describe interactions among molecules that can reflect functional linkages. Thus, network-based methods have been widely researched to discover potential disease genes with similar functions to known disease genes. By taking full advantage of global topology structure, global distance measurements can achieve superior performance compared to local distance measurements. However, some problems exist in the global distance measurements. For example, Yang et al. [27] indicated that network-based methods are limited by detecting potential disease genes only in the small regions of known disease genes. As shown in Fig. 5, global distance measurements may mistake non-disease hub proteins for potential disease proteins. One main cause of the above problems is that the existing network-based methods are designed based on the typical hypothesis that the neighbors of disease genes are likely to cause the same or similar diseases. Thus, the methods can only detect potential disease genes that have high topological similarities with known disease genes. To solve the above problems, this paper attempted to discover new properties of disease genes by analyzing the topology associations between disease proteins and essential proteins in the protein interaction network. Empirical results demonstrate that disease genes are not well connected with essential genes in the protein interaction networks. The new finding can be utilized to explain the conclusion that disease proteins are topologically more important than other proteins [18].
One major hypothesis of molecular network analysis is that "there is a tight relation between network structure and biological function" [28]. Thus, many studies analyzed the properties of disease genes with protein interaction networks [3,17,18,26], and demonstrated that disease proteins are topologically important [3,17]. However, Goh et al. [26] indicated that a small amount of essential genes exist in the disease genes, and this may affect the correctness of analyses. Goh et al. selected mouse lethal orthologs of human genes as human essential genes and demonstrated the majority of disease proteins are topologically neutral. Nevertheless, a knockout for their mouse orthologs has not been reported for 60% of disease genes [29]. We analyzed the topology importance of disease proteins by utilizing housekeeping genes as essential genes [18]. Empirical results demonstrated that disease proteins are topologically more important than other proteins. However, a new question was raised: because disease proteins are topologically important, would disease genes seriously affect human survival? Our new finding can answer the question to some extent. Because disease genes are not well correlated with essential genes, disease genes would not seriously affect normal activities. Additionally, our finding provides new insights into understanding of the pathogenesis of diseases.
Based on the new finding, we proposed a new hypothesis that if too many non-disease essential proteins exist as neighbors of a candidate protein, then the protein is unlikely to cause diseases. We proposed a network propagation method based on the typical hypothesis and the new hypothesis. The method not only considers the topological similarities of candidate proteins with known disease proteins but also exploits the topological dissimilarities of candidate proteins with essential proteins. To some extent the method can avoid mistaking non-disease hub proteins as potential disease proteins. Our strategy will be beneficial creating new ideas and new visions for disease gene prediction and will be insightful and helpful for predicting genotype-phenotype associations with the phenome-interactome network [27].
Our future works will be the further studies of the dual flows integration for detecting disease genes based on game theory. Additionally, we intend to apply our strategy to assist molecular diagnosis, in order to speed up the identification of disease genes in next-generation sequencing data [30]. Itan et al. utilized a local distance measurement that adopts shortest path to the core gene for monogenic disorders [30]. It could be beneficial to utilize our new global measurement for improving the quality of molecular diagnosis.
Supporting Information S1 Table. Enrichment results with the integrated protein interaction network.