Ensemble Positive Unlabeled Learning for Disease Gene Identification

An increasing number of genes have been experimentally confirmed in recent years as causative genes of various human diseases. This newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes. However, using only a single source of data for prediction is susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method, EPU, is then used to combine multiple PU learning classifiers for accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers. By integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we minimize the potential bias and errors of individual data sources and machine learning algorithms, achieving more accurate and robust disease gene predictions. Going forward, our EPU method provides an effective framework for integrating additional biological and computational resources for better disease gene prediction.


Introduction
While high-throughput genomic studies have led to the discovery of hundreds and thousands of candidate disease genes, the identification of genes involved in specific human diseases has remained a fundamental challenge, requiring time-consuming and expensive experimentation. Computational approaches that can reliably predict novel disease genes from the vast number of unknown genes will provide a useful alternative to speed up the long and arduous searches for the genetic causes of various human disorders.
Given that an increasing number of genes have been experimentally confirmed over the years as causative genes of various human diseases, it is useful to develop machine learning methods that identify novel disease genes using the confirmed disease genes as positive training examples, based on the observation that genes associated with similar disease phenotypes are likely to share similar biological characteristics. For example, proteins involved in hereditary diseases tend to be long, to have more homologs in distant species, but fewer paralogs within the human genome [1]. They are also likely to assemble into functional modules such as protein complexes [2]. In fact, various studies have shown that genes associated with similar disorders tend to demonstrate similar gene expression profiles [3], high functional similarities [4] [5] and physical interactions between their gene products [6] [7].
In addition, with disease phenotype similarity data, genes associated with the same or similar disease phenotypes are likely to share similar biological functions. Given a phenotype ph_i, we can infer its potential disease genes from the disease genes associated with phenotypes ph_j that are highly similar to ph_i [8].
A number of methods have thus been proposed to prioritize candidate genes based on different kinds of biological data, such as gene sequence data, gene expression profiles, evolutionary features, functional annotation data and PPI datasets. Adie et al. [9] employed a decision tree algorithm based on a variety of genomic sequence and evolutionary features, such as coding sequence length, evolutionary conservation, and the presence and closeness of paralogs in the human genome. Topological information from the PPI network has also been demonstrated to be useful for disease gene prediction. Smalter et al. [10] applied a support vector machine (SVM) classifier using PPI topological features in addition to sequence-derived and evolutionary features, while Radivojac et al. [11] built three individual SVM classifiers using three types of features (PPI network, protein sequence and protein functional information) and then built a final classifier to combine the predictions from the three individual classifiers for candidate gene prediction.
The research work mentioned above employed classical machine learning methods to build a binary classifier in which the confirmed disease genes are used as the positive training set P and unknown genes as the negative training set N. However, these machine learning techniques hardly perform as well as they could, because the negative set N that they used contained unconfirmed disease genes (false negatives). In light of this limitation, positive unlabeled learning (PU learning) methods have recently been proposed for the task; they build a classification model in which the unknown genes are appropriately treated as an unlabeled set U (instead of a negative set N). For example, Mordelet et al. proposed a bagging method, ProDiGe, for disease gene prediction. It iteratively chose random subsets (RS) from U and then trained multiple classifiers using biased SVM to discriminate P from each subset RS. The multiple classifiers were subsequently aggregated to generate the final classifier [12]. Given that the RS's were likely to contain less noise (unknown disease genes) than the original set U, ProDiGe was able to perform better than classical binary classification models that inappropriately used U as negative training data. More recently, Yang et al. designed a novel multi-level PU learning algorithm, PUDI, which builds a better-performing classifier for disease gene identification by partitioning the unlabeled set U into multiple positive and negative sets with confidence scores [13].
The prior works have clearly shown that the integration of various biological data sources is not only desirable but also essential for robust disease gene prediction, since using only a single source of data is susceptible to incompleteness and noise in the genomic data. It is also advantageous to employ an ensemble approach for prediction, since a single machine learning predictor similarly risks potential bias caused by the inherent limitations of individual prediction models. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. In our proposed framework, we first extract multiple positive and negative samples from the unlabeled set U by performing a random walk with restart on different biological networks. We use three biological networks in this paper: a protein interaction network, a gene expression similarity network, and a GO similarity network. Then, we build multiple independent PU learning models that use the extracted positive and negative samples, with their different confidence scores, as training data. Finally, we design a novel ensemble strategy called EPU (Ensemble Positive Unlabeled learning) that assigns optimized weights to the base PU learning models to minimize the overall error rate for accurate disease gene prediction.
We compare EPU with multiple state-of-the-art techniques, namely multi-level example-based learning [14], Smalter's method [10], Xu's method [15] and the ProDiGe method [12]. The experimental results show that EPU significantly outperforms the existing methods in identifying disease genes across six disease groups. In addition, our proposed EPU algorithm also achieves better results than its three base PU learning classifiers, demonstrating that the proposed ensemble-based approach can effectively combine individual classifiers for better performance. Finally, we conduct a case study to show how our proposed EPU algorithm can discover novel disease genes for endocrine and cancer diseases.

Materials and Methods
In this section, we begin with the description of the experimental data used and briefly introduce how the protein interaction network, gene expression similarity network, and GO similarity network [16] [17] [18] are constructed. Then we will present the schema of our proposed EPU algorithm.

Experimental data and gene network modeling
In this paper, we have exploited the following biological data: human protein interaction data, gene expression data, gene ontology annotations, and phenotype-gene association data.
Human protein interaction data (PPI) is downloaded from the Human Protein Reference Database (HPRD) [19] and the Online Predicted Human Interaction Database (OPHID) [20]. The combined PPI dataset contains 143,939 PPIs among a total of 13,035 human proteins. We build a protein interaction network G_PPI = (V_PPI, E_PPI), where V_PPI represents the set of vertices (proteins) and E_PPI denotes all edges (detected pairwise interactions between proteins). G_PPI can be represented in its matrix format W_PPI = [w_ij], where w_ij = 1 if (g_i, g_j) ∈ E_PPI and w_ij = 0 otherwise.
Gene expression data is obtained from the RNA-Seq data made publicly available in the EBI ArrayExpress by the Illumina Human BodyMap 2.0 project (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30611). The dataset comprises Fastq reads from the paired-end sequencing of cells from 16 human tissue types, including colon, heart, kidney and white blood cells, using the Illumina HiSeq next-generation sequencing platform. It provides the expression values of 17,652 human genes across the 16 human tissue types. Suppose genes g_i and g_j are represented by their gene expression profile vectors (x_i1, x_i2, ..., x_in) and (x_j1, x_j2, ..., x_jn) respectively, where x_ik (k = 1, 2, ..., n) denotes the expression value of gene i in the k-th tissue. The Pearson correlation coefficient is employed to measure the similarity between g_i and g_j:

sim_GE(g_i, g_j) = Σ_{k=1..n} (x_ik − x̄_i)(x_jk − x̄_j) / sqrt( Σ_{k=1..n} (x_ik − x̄_i)² · Σ_{k=1..n} (x_jk − x̄_j)² )   (1)

where x̄_i = (1/n) Σ_{k=1..n} x_ik and x̄_j = (1/n) Σ_{k=1..n} x_jk. We build a gene expression similarity network G_GE = (V_GE, E_GE), where V_GE is the set of genes occurring in the gene expression data and E_GE is a set of edges between the genes in V_GE. For each gene g_i, we rank all genes g_j in V_GE (i ≠ j) in decreasing order of sim_GE(g_i, g_j), and add an edge (g_i, g_j) to E_GE if g_j is in the top 5 of the list. This helps to filter out low-similarity pairs and potential noise in the gene expression data. We then transform the gene expression network G_GE into its matrix format, where the weight of an edge between two genes is their gene expression similarity from equation (1).
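The construction of the expression similarity network can be sketched as follows. This is a minimal illustration with our own function and variable names; it uses NumPy's built-in Pearson correlation, which is equivalent to equation (1), and assumes non-constant expression profiles:

```python
import numpy as np

def build_expression_network(X, top_k=5):
    """Build a gene-expression similarity network from an (n_genes x n_tissues)
    expression matrix X: Pearson-correlate every pair of gene profiles, then
    keep only the top_k most similar neighbours of each gene, as in the
    paper's top-5 filtering step."""
    n = X.shape[0]
    sim = np.corrcoef(X)                  # pairwise Pearson correlations
    np.fill_diagonal(sim, -np.inf)        # exclude self-similarity
    W = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(sim[i])[::-1][:top_k]  # top_k most similar genes
        W[i, neighbours] = sim[i, neighbours]          # edge weight = similarity
    return W
```

The resulting sparse matrix plays the role of W_GE in the propagation step described later.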
Gene Ontology (GO, http://www.geneontology.org/) is a set of controlled vocabularies used to annotate genes and gene products [21]. Gene Ontology provides three sub-ontologies, namely biological process (BP), molecular function (MF) and cellular component (CC) [21]. For each gene, we build a feature vector using its annotations from the three sub-ontologies, i.e. {MF_1, ..., MF_|SMF|, BP_1, ..., BP_|SBP|, CC_1, ..., CC_|SCC|}. For example, a gene g_i is represented as the vector g_i = (mf_i1, ..., mf_i|SMF|, bp_i1, ..., bp_i|SBP|, cc_i1, ..., cc_i|SCC|), where mf_ij (similarly for bp_ij and cc_ij) is the GO term similarity between g_i and the feature MF_j. Since the GO terms of BP, MF and CC are organized in a DAG structure, we use the computational method in [22] to measure the similarity of two GO terms. |SMF| is the number of selected MF term features (similarly for |SBP| and |SCC|).
We choose the GO features that can help distinguish disease genes from non-disease genes using the strategy in [13], and the top 1000 scored features were selected for each of the three feature groups (BP, MF and CC). We then build the GO similarity network G_GO = (V_GO, E_GO), where V_GO is the set of genes annotated in the GO dataset and E_GO is a set of edges between the genes in V_GO. Similarly to the gene expression similarity network, for each gene we keep only the top 5 edges with the highest similarities and discard the other edges. G_GO can be represented in its matrix format W_GO = [w_ij]. Given a gene g_i, if g_j is in the top 5 list of g_i, w_ij is normalized based on Dis(g_i, g_j), the Euclidean distance between g_i and g_j, such that 0 ≤ w_ij ≤ 1; otherwise, w_ij = 0.

Phenotype-gene association data. 4,260 phenotype-gene associations, spanning 2,659 known disease genes and 3,200 disease phenotypes, are obtained from the latest version of OMIM (http://omim.org/) [23]. Goh et al. [6] have categorized the 3,200 disease phenotypes in the OMIM database into 22 disease groups/classes (Cancer, Metabolic, Neurological, Endocrine, etc.) based on the physiological system affected. For example, the Endocrine disease group comprises 62 OMIM phenotypes, including OMIM 241850 (Bamforth-Lazarus syndrome) and OMIM 304800 (Diabetes insipidus, nephrogenic).
Phenotype similarity network. The disease phenotype similarity network [24] is defined as G_PH = (V_PH, E_PH), where V_PH denotes the set of disease phenotypes and E_PH denotes relevant phenotype pairs. Disease phenotypes in V_PH are represented as feature vectors whose feature terms are from the Medical Subject Headings (MeSH) controlled vocabulary, and the phenotype similarities in E_PH are evaluated based on the concept relevance and frequency of MeSH terms appearing in the text descriptions of OMIM documents. According to Vanunu et al. [8], phenotype pairs with high similarities are regarded as informative and reliable. Therefore, we apply a logistic function to filter out low phenotypic similarities in E_PH, following [2] [8].

The proposed technique EPU
The schema of our EPU algorithm is presented in Figure 1. EPU first selects candidate positives from the positive genes and reliable negatives from the unlabeled genes. It then builds three gene similarity networks using PPI data, gene expression data and Gene Ontology data, and applies random walk on the three networks to propagate weights to the unlabeled genes that reflect their likelihood of belonging to the positive/negative class. We then exploit the weighted genes to build three diverse classification models that predict ''soft'' labels for test genes. Finally, an ensemble learning algorithm combines the prediction results from the classifiers to make the final classification of an unknown test gene.
Suppose all disease genes from OMIM are stored in a disease gene set DIS. All other genes that are not members of DIS are treated as unknown/unlabeled genes and stored in a set UG (containing 16,570 genes) [25]. Each gene in DIS and UG is represented as a feature vector g⃗ = {f_1, ..., f_m}, where m is the total number of features from GO terms, protein domains and PPI topological features, following our previous work [13].
In the next section, we describe how to predict novel disease genes given the confirmed disease genes for a particular disease or disorder. The confirmed disease genes for the given disorder group are treated as the positive set P (P ⊆ DIS), while randomly selected unknown genes from UG are treated as the unlabeled set U (U ⊆ UG, |U| = |P|), following the settings in [9] [10] [15].

Weighting unlabeled genes by integrating multiple biological evidences
Given a particular disease class and its known associated disease genes, we first build the training data sets for machine learning by prioritizing the candidate positives and reliable negatives based on their similarity to the query disease class. We build three gene similarity networks using PPI, gene expression and GO data as described above, and perform a random walk with restart on these three networks to estimate the likelihood of each unlabeled gene belonging to the disease or non-disease class. The details are as follows.
Extracting candidate positives and reliable negatives. As a typical positive set P is relatively small, we want to find a set of candidate positive genes CP to complement P. Given that recent studies have shown that similar phenotypes are often caused by functionally related disease genes [4] [6], we can populate CP with genes associated with similar/relevant phenotypes, based on the principle of guilt-by-association. In other words, given a query disease group/class, we can use its associated phenotypes to uncover similar disease phenotypes, as shown in Figure S1.

Having identified the candidate positive genes CP, we now describe how to extract the reliable negative gene set RN. We consider reliable negatives to be those unlabeled genes that are very different from the positive set P. To identify such genes, we first build a ''positive representative vector'' pr by summing up the gene vectors in P and normalizing it. Then, we compute the Euclidean distance [26] of every unlabeled gene g_i in U from pr, and regard g_i as a member of RN if its distance from pr is larger than the average distance of all the genes in U from pr:

RN = {g_i ∈ U | dis(pr, g_i) > D̄}

where dis(pr, g_i) is the Euclidean distance between gene g_i and the positive representative vector pr, and D̄ is the average distance of all the unlabeled genes in U from pr, i.e. D̄ = (1/|U|) Σ_{i=1..|U|} dis(pr, g_i).

Ensemble weighting of unlabeled genes via label propagation on multiple networks. We now have the given positive set P, a candidate positive set CP, a reliable negative set RN and a remaining unlabeled set U′ = U − RN for machine learning. To build a robust classification model, we will extract those genes with reliable labels that are near the decision boundary between the positive and negative classes.
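The reliable-negative extraction step can be sketched as follows. This is a minimal illustration with our own names; the exact normalization of pr in the paper is not spelled out, so we assume unit-norm normalization here:

```python
import numpy as np

def extract_reliable_negatives(P, U):
    """Build a 'positive representative vector' pr by summing the positive
    gene vectors in P and normalizing (assumed unit norm), then flag as
    reliable negatives the unlabeled genes in U whose Euclidean distance to
    pr exceeds the average distance over U.  P and U are 2-D arrays whose
    rows are gene feature vectors."""
    pr = P.sum(axis=0)
    pr = pr / np.linalg.norm(pr)               # normalize the representative vector
    dists = np.linalg.norm(U - pr, axis=1)     # dis(pr, g_i) for each g_i in U
    mask = dists > dists.mean()                # farther than average distance D
    return U[mask], mask
```

Genes not selected here remain in the unlabeled set U′ = U − RN for the propagation step.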
We adapt the Random Walk with Restart algorithm [27] to perform flow propagation which spreads the label information from P, CP and RN to the unlabeled genes in U 0 on the biological networks that we have constructed, namely the PPI network G PPI , the GO similarity network G GO and the gene expression similarity network G GE as described earlier.
Formally, let R_0 be an initialization vector in which primitive scores are assigned to all genes in the three networks to indicate the genes' potential classification labels. Let p_0, p′_0 and n_0 denote the initial values for genes in P, CP and RN respectively, as follows. The genes g_i ∈ P are all given a score p_0(g_i) = +1, indicating their disease gene status. Each candidate positive gene g_i ∈ CP is assigned a score equal to its maximal phenotypic similarity to the known disease genes in P:

p′_0(g_i) = max_{ph_i ∈ PH(g_i), ph_j ∈ PH(P)} sim(ph_i, ph_j)

where PH(g_i) denotes the disease phenotypes caused by gene g_i, and PH(P) denotes the disease phenotypes caused by the genes in the disease set P. For the genes in the reliable negative set RN, to balance the total amount of flow between positive and negative genes, each negative gene g_i is assigned the initial score

n_0(g_i) = −S_P / |RN|

where S_P is the total initial score of the positive gene set (the genes in P and CP). The remaining unlabeled genes in U′ are assigned an initial score of 0.
For each of the three biological networks G_PPI, G_GE and G_GO, the prior influence from the seed nodes in P, CP and RN is propagated to their direct neighbors, and then continues to spread iteratively to other adjacent nodes across the network. Given the initial score vector R_0 (step 0), the score vector R_t at step t is calculated as:

R_t = (1 − α) · (D^(-1) W) · R_{t−1} + α · R_0

where R_1 = R_0, D^(-1)W is the normalized format of matrix W, W ∈ {W_PPI, W_GO, W_GE}, and D is the diagonal matrix with D_ii = Σ_k W_ik. α represents the percentage of flow that returns to the original seed nodes in P, CP and RN during each iteration; the default value of 0.7 is used for α, following [16]. Eventually, the information flow converges to a steady state [27]. In our case, the random walk with restart algorithm stops its iterative process when the difference between two steps R_t and R_{t−1}, measured by the L1 norm, is less than 10^(−6) [16]. The scores for the unlabeled genes from the three gene networks are then combined into one integrated score:

Int_score(g) = (1/3) · [R_t(g, W_PPI) + R_t(g, W_GO) + R_t(g, W_GE)]

where R_t(g, W_PPI), R_t(g, W_GO) and R_t(g, W_GE) are the scores for gene g in the PPI, GO similarity and gene expression similarity networks respectively.
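The propagation on a single network can be sketched as follows; the function name and the guard for isolated nodes are our own, but the update rule, restart probability and L1 stopping criterion follow the description above:

```python
import numpy as np

def random_walk_with_restart(W, r0, alpha=0.7, tol=1e-6):
    """Label propagation by random walk with restart on one similarity
    network: W is the (symmetric) adjacency/similarity matrix, r0 the
    initial score vector (+1 for P, phenotype similarity for CP, negative
    for RN, 0 elsewhere), alpha the restart probability (0.7, as in the
    paper).  Iterates R_t = (1-alpha) * W_norm^T R_{t-1} + alpha * R_0
    until the L1 change drops below tol."""
    d = W.sum(axis=1).astype(float)
    d[d == 0] = 1.0                       # guard against isolated nodes
    W_norm = W / d[:, None]               # row-normalize: D^{-1} W
    r = r0.copy()
    while True:
        r_next = (1 - alpha) * W_norm.T @ r + alpha * r0
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
```

Running this on W_PPI, W_GO and W_GE and combining the three resulting vectors yields the integrated score used by all downstream classifiers.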

Ensemble positive unlabeled learning EPU
Next, we describe how we build three separate PU learning classification models (a support vector machine, a K-nearest neighbor classifier, and a naïve Bayes classifier) to classify genes into two classes C = {+, −}, where '+' denotes the positive/disease class and '−' denotes the negative/non-disease class.
PU learning model 1: Weighted K-Nearest Neighbor (WKNN). KNN is an instance-based learning method that classifies an unknown test gene based on the class labels of its top K nearest training example genes, i.e. based on the majority class vote of its K nearest neighbors. The distance between the test gene and the training examples can be computed using common distance metrics such as Euclidean distance. Given a test gene g_i and its k-nearest-neighbor set D_i, we divide D_i into positive and negative training subsets, namely D_i+ = {g | Int_score(g) ≥ 0, g ∈ D_i} and D_i− = {g | Int_score(g) < 0, g ∈ D_i}, based on the neighbors' integrated scores. The conditional probability of the test gene g_i with respect to the disease (+) and non-disease (−) classes is measured as

P(Y = + | g_i) = Σ_{g ∈ D_i+} |Int_score(g)| / Σ_{g ∈ D_i} |Int_score(g)|,  P(Y = − | g_i) = Σ_{g ∈ D_i−} |Int_score(g)| / Σ_{g ∈ D_i} |Int_score(g)|

Note that weighted KNN accumulates both the positive and negative integrated scores in D_i and estimates the probability of g_i belonging to the positive (or negative) class based on the accumulated scores in that class.
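The weighted vote can be sketched as follows; this is our reconstruction of the score-accumulation rule (the tie-breaking behaviour when all scores are zero is our own choice):

```python
import numpy as np

def wknn_probability(neighbour_scores):
    """Weighted KNN vote: given the integrated scores of a test gene's k
    nearest training neighbours, accumulate positive and negative score
    mass and convert it into the two class probabilities P(+|g), P(-|g)."""
    s = np.asarray(neighbour_scores, dtype=float)
    pos = s[s >= 0].sum()                 # accumulated positive score mass
    neg = -s[s < 0].sum()                 # accumulated (absolute) negative mass
    total = pos + neg
    if total == 0:
        return 0.5, 0.5                   # no evidence either way (our choice)
    return pos / total, neg / total
```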
PU learning model 2: Weighted Naïve Bayes (WNB). Given a test gene g_i, the probability that g_i belongs to a class c_j (c_j ∈ C = {+, −}) can be computed using Bayes' theorem:

P(Y = c_j | g_i) = P(g_i | Y = c_j) · P(Y = c_j) / P(g_i)

where P(g_i) is a constant across the positive and negative classes. Here, we define the prior probabilities of the positive and negative classes as 0.5, i.e. P(Y = +) = P(Y = −) = 0.5. Given a gene vector g⃗ = {g_f1, ..., g_fm}, the conditional probability of feature f_k associated with class c_j, denoted P(f_k | Y = c_j), is estimated from the int_score-weighted feature values g(f_k) of the genes in D_cj, where g(f_k) is the value of feature f_k in gene vector g⃗, and D_cj is defined as either D_+ = {g ∈ D | int_score(g) > 0} or D_− = {g ∈ D | int_score(g) < 0}, depending on whether c_j is the positive class (+) or the negative class (−).
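A hedged sketch of the weighted naïve Bayes scorer follows. The exact likelihood estimator used in the paper was not fully recoverable, so this version estimates each feature likelihood as a score-weighted feature average over the class (our assumption), with equal priors as stated above:

```python
import numpy as np

def wnb_log_posterior(g, D_pos, D_neg, w_pos, w_neg, eps=1e-9):
    """Weighted naive Bayes (our reconstruction): D_pos/D_neg are training
    feature matrices for the positive/negative sets, w_pos/w_neg their
    |int_score| weights, g the test gene's feature vector.  Returns the
    (unnormalized) log-posteriors for the + and - classes with equal
    priors P(+) = P(-) = 0.5."""
    def likelihood(D, w):
        # score-weighted mean value of each feature within the class
        return (w[:, None] * D).sum(axis=0) / (w.sum() + eps)
    lp, ln = likelihood(D_pos, w_pos), likelihood(D_neg, w_neg)
    log_pos = np.log(0.5) + (g * np.log(lp + eps)).sum()
    log_neg = np.log(0.5) + (g * np.log(ln + eps)).sum()
    return log_pos, log_neg
```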
By assuming that the features are independent given the class Y = c_j, we obtain the naïve Bayes classifier: P(Y = c_j | g_i) ∝ P(Y = c_j) · Π_{k=1..m} P(f_k | Y = c_j).

PU learning model 3: Multi-level Support Vector Machine (MSVM). Based on the integrated score Int_score(g), we further partition the unlabeled genes g ∈ (U − RN) into three parts by thresholding their scores: a likely positive set LP (genes with higher positive integrated scores), a likely negative set LN (genes with lower negative integrated scores) and a weak negative set WN (the remaining genes). We then build a multi-level classifier based on the positive training set P, the reliable negative set RN, and the three newly generated sets LP, LN and WN, via the weighted support vector machine technique [28] [29], to take into account the inherently different levels of trustworthiness of the labels in the five gene sets.
The objective function of the weighted SVM is that of a standard soft-margin SVM in which each training set carries its own misclassification cost [14]; the costs for the five gene sets can be decided using cross-validation techniques. Finally, we apply our MSVM model to compute the probability P(Y = c_j | g_i, h = MSVM) of test gene g_i belonging to class c_j (c_j ∈ C = {+, −}) for its classification.
Note that while the candidate positive set CP plays a role in assigning the genes in U − RN to one of the three subsets LP, LN and WN, it does not overlap with the training set P ∪ U and hence is not used in the construction of the MSVM model.
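The cost-weighted training idea can be sketched with a minimal linear SVM trained by subgradient descent on the weighted hinge loss. This is our simple stand-in for the weighted SVM of refs [28] [29], not the paper's exact solver; each example's cost would reflect the trustworthiness of its set (P and RN high; LP, LN, WN lower):

```python
import numpy as np

def weighted_linear_svm(X, y, costs, lr=0.01, lam=0.01, epochs=200):
    """Train a linear SVM on labels y in {+1,-1} where each example has a
    misclassification cost: minimize lam/2*||w||^2 plus the cost-weighted
    hinge loss, by batch subgradient descent."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                               # margin violators
        grad_w = lam * w - (costs[mask, None] * (y[mask, None] * X[mask])).sum(axis=0) / n
        grad_b = -(costs[mask] * y[mask]).sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

In practice, raising the costs of the P and RN examples relative to LP/LN/WN makes the decision boundary respect the trusted labels more strongly.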

Ensemble-based algorithm for integration of individual classifiers. In order to perform more robust classification, we design a novel ensemble learning model to integrate the three classification models constructed above. Suppose x_ij = P(Y = c | g_i, h_j) denotes the probability of gene g_i belonging to class c as predicted by the j-th classifier. We can then organize the genes in D into the |D| × k matrix X = [x_ij], where k is the number of individual classifiers (here, k = 3) and |D| is the size of the training set D.
Our ensemble model o(x⃗_i) integrates the outputs of the multiple classification models as follows:

o(x⃗_i) = sign(w⃗ · x⃗_i)

where w⃗ is a weight vector that indicates the importance of the individual models. The final output value '1' denotes the disease/positive class and '−1' denotes the non-disease/negative class. The classifier weights w⃗ can be learned from the training set D as follows. We define E(w⃗) as the training error of the hypothesis of our ensemble model:

E(w⃗) = (1/2) Σ_{i=1..|D|} (t_i − w⃗ · x⃗_i)²

where t_i is the true label of gene g_i. The following training rule guarantees that w⃗ is adjusted in the direction of steepest descent along the error surface: w⃗ ← w⃗ + Δw⃗, where Δw⃗ = −η∇E(w⃗). η is a small positive constant, called the learning rate, which determines the step size in the gradient descent exploration. We set η = 0.001, following previous work [30]. The negative gradient −∇E(w⃗) gives the direction of steepest decrease. From the equations above, the gradient descent update for each weight is:

Δw_j = η Σ_{i=1..|D|} (t_i − w⃗ · x⃗_i) x_ij   (17)

The overall ensemble learning method is summarized in Figure 2. First, we assign a random initial weight vector w⃗. The ensemble model is then applied to all training genes, and each weight is updated by adding Δw_j computed according to equation (17) above. This process is repeated until w⃗ converges. Note that if η is large, the search might overstep the minimum point on the error surface rather than settling into it. Therefore, the value of η should be gradually reduced as the number of iterations grows.
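The weight-learning loop can be sketched as follows. It implements the standard batch LMS rule, which matches the update direction −∇E(w⃗) above; the fixed epoch count and random seed are our simplifications of "repeat until w⃗ converges":

```python
import numpy as np

def learn_ensemble_weights(X, t, eta=0.001, epochs=10000):
    """Learn ensemble weights for EPU: X is the |D| x k matrix of
    per-classifier probabilities x_ij, t the target labels in {+1,-1}.
    Minimizes E(w) = 1/2 * sum_i (t_i - w.x_i)^2 by gradient descent
    with learning rate eta (0.001 as in the paper)."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])   # random initial weights
    for _ in range(epochs):
        err = t - X @ w                           # residuals t_i - w.x_i
        w += eta * X.T @ err                      # w <- w - eta * grad E(w)
    return w

def ensemble_predict(X, w):
    """Final ensemble output: sign of the weighted classifier outputs."""
    return np.where(X @ w >= 0, 1, -1)
```

With an informative base classifier and two noisy ones, the learned weights concentrate on the informative column, which is exactly the behaviour the ensemble is designed to exploit.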

Experimental Results
For evaluation, we benchmark our proposed EPU algorithm against four state-of-the-art techniques for disease gene prediction: PUDI method, Smalter's method, Xu's method and ProDiGe. In addition, we also compare the performance of EPU with its base learning models, namely MSVM, WKNN and WNB. Finally, we demonstrate novel disease gene prediction using the EPU algorithm.

Experimental settings
We use the disease classes with at least 50 confirmed disease genes from the 21 specific disease classes in [6] for evaluating our classification algorithm. There are six such disease classes: cardiovascular, endocrine, cancer, metabolic, neurological, and ophthalmological (see Table S1 for the exact numbers of disease genes in each class). Given a particular disease class, the positive set P consists of all its confirmed disease genes, while the unlabeled set U is formed by randomly selecting from the genes that are not known to be associated with any disease such that |P| = |U|, following the setting in [9] [10] [15]. To avoid sampling bias, we randomly select 10 groups of unlabeled sets U. All experimental evaluations of the classification models are done on identical groups of training and test data, and we report the average performance over the 10 groups of (P+U) sets. To evaluate the performance of our algorithm, 3-fold cross-validation is applied, in which two folds of P+U are used as the training set to build the classifier while the remaining fold is used as the test set. The positive training genes in P are used as seed nodes on the multiple gene networks to weight the unlabeled training genes in U via flow propagation. Then, to obtain the 'soft' class labels of the training genes from the component learning models, leave-one-out cross-validation (LOOCV) is applied to the two training folds, singling out each training sample in turn to evaluate its 'soft' label; the learned ensemble model (Figure 2) is then used to predict the held-out test fold. The average results over the 3×10 groups of (P+U) sets are reported in the experimental section.

Evaluation metrics
We use precision, recall, and F-measure to measure the performance of our classification models on each of the six disease classes. The F-measure is the harmonic mean of precision (denoted p) and recall (denoted r), defined as F = 2pr/(p + r). The value of the F-measure is large only when both p and r are high, and small when either of them is poor. This is appropriate for our objective of accurately predicting disease genes, as deficiencies in either precision or recall will be reflected by a low F-measure.

Comparison of EPU with state-of-the-art methods. We compare EPU with PUDI [13], Smalter's method [10], Xu's method [15] and ProDiGe [12]. Table 1 shows that our proposed EPU is, on average, 6.5%, 15.1%, 16.2% and 16.4% better than PUDI, ProDiGe, Smalter's method and Xu's method in terms of F-measure, respectively. In particular, EPU achieves much better precision and consistently better recall when compared against the recently proposed PUDI. This shows that EPU can effectively extract hidden positive and negative data from the unlabeled data to boost classification performance.
Comparison of EPU with base classifiers. Next, we compared the performance of our proposed EPU against its base classifiers MSVM, WNB, WKNN. As shown in Table 2, on average, MSVM achieved the highest F-measure (81.3%), much higher than WNB (69.5%) and WKNN (68.7%). This is not surprising as MSVM can handle multiple weighted positive and negative sets when building its classification model. Furthermore, SVM is known to perform significantly better than NB and KNN in many real-world applications.
Our proposed ensemble learning method EPU is able to achieve 84.8% in terms of F-measure, which is 3.5%, 15.3% and 16.1% better than MSVM, WNB and WKNN respectively. Moreover, EPU consistently outperformed all 3 component classifiers for every disease class. This strongly demonstrates that EPU can effectively integrate multiple classification models and minimize the overall error rate through dynamically assigning different weights to different classification models.
Sensitivity study on the parameter η in EPU and the parameter k in the gene similarity networks. We perform a sensitivity study on the learning rate η in EPU and on the coverage of the gene similarity networks. We ran EPU on the six disease groups with η ranging from 0.001 to 0.03. The results indicate that a step size of 0.001 is small enough to move toward the optimal point in the hypothesis space, and that EPU is robust and stable when η is small (see Table S3 for the detailed results). In addition, we study the effect of the parameter k that determines the number of neighbors of each gene in the biological networks. The results in Table S4 show that EPU consistently achieved the best performance with k in [1, 9].
Comparing EPU with existing ensemble learning approaches. We also compared our proposed EPU with two existing ensemble approaches, majority vote and weighted majority vote [31], in terms of F-measure across the six disease classes. EPU was shown to outperform both existing ensemble methods (see Table S2 for the detailed results), indicating that our proposed EPU is a superior ensemble strategy for integrating multiple classification models for disease gene prediction.

Predicting novel disease genes for disease groups. To demonstrate novel disease gene prediction using the EPU algorithm, we selected two important disease groups, namely metabolic diseases and cancer, as detailed case studies. For each target disease class, we obtained a set of confirmed disease genes from OMIM and GENECARD as the positive training set, and applied our proposed EPU algorithm to prioritize novel disease genes from the unlabeled gene set.
We first applied our EPU algorithm to discover novel disease genes for metabolic diseases. Twelve unlabeled genes were predicted to be associated with the target disease class. For verification, we searched the literature for evidence supporting the association of these predicted genes with metabolic diseases. We found that two predicted genes, RHEB and DOK5, have indeed been reported to be associated with metabolic diseases. Rheb, a GTP-binding protein, was reported to be inactivated to protect cardiomyocytes during energy deprivation via activation of autophagy. This implies that RHEB is a key regulator of autophagy during myocardial ischemia, which has implications for patients with obesity and metabolic syndrome [32]. As for DOK5, Tabassum et al. identified it as a novel candidate disease gene associated with type 2 diabetes, a metabolic disorder linked to obesity [33].
Our EPU model also predicted 32 unlabeled genes as candidate cancer genes. Seven of them, SIGLEC7, PRDX4, PRDX5, HNRNPL, SRPK1, ABCB10 and PHF10, have been reported to be associated with cancer. Table 3 lists these candidate disease genes and the supporting literature evidence that we found.
For the other candidate cancer genes without literature support, seven of them, PMM1, SRCIN1, ISY1, KDM4A, CIR1, PPP2R5A and NOL3, have been shown to be associated with cancer disease genes in the GO similarity network, the GE similarity network and the PPI network. In the GO similarity network, PMM1 is one of the top five nearest neighbours of the cancer disease gene PPM1D, and SRCIN1 is a neighbour of the disease gene CTNNB1. In the GE similarity network, ISY1 is linked to the disease gene P2RX7, while KDM4A and CIR1 interact with the disease genes CTNNB1 and MSH2 respectively, indicating that these three candidate genes are highly correlated with cancer disease genes in terms of gene expression. In the PPI network, PPP2R5A directly interacts with two disease genes, BCL2 and TP53, and NOL3 is linked to two disease genes, BAX and CASP8.
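The kind of network-neighbourhood check described above can be sketched as follows; the edge lists contain only the handful of relationships named in the text (not the full networks), and the function name is illustrative:

```python
# Partial edge lists mirroring the evidence described in the text: a
# candidate gene is flagged when it neighbours a known disease gene in
# at least one biological network.
networks = {
    "GO":  {("PMM1", "PPM1D"), ("SRCIN1", "CTNNB1")},
    "GE":  {("ISY1", "P2RX7"), ("KDM4A", "CTNNB1"), ("CIR1", "MSH2")},
    "PPI": {("PPP2R5A", "BCL2"), ("PPP2R5A", "TP53"),
            ("NOL3", "BAX"), ("NOL3", "CASP8")},
}
disease_genes = {"PPM1D", "CTNNB1", "P2RX7", "MSH2",
                 "BCL2", "TP53", "BAX", "CASP8"}

def network_evidence(candidate):
    """Return the networks in which the candidate neighbours a disease gene."""
    return sorted(net for net, edges in networks.items()
                  if any(candidate in edge
                         and (set(edge) - {candidate}) & disease_genes
                         for edge in edges))

evidence = {g: network_evidence(g)
            for g in ["PMM1", "SRCIN1", "ISY1", "KDM4A",
                      "CIR1", "PPP2R5A", "NOL3"]}
```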
Table 3. Novel cancer-related genes predicted by EPU. (Columns: Gene ID; Supporting literature.)

Conclusions and Discussion

Despite the considerable progress in disease gene discovery, there are still many unknown disease genes that are yet to be characterized. Machine learning methods can be used to predict novel disease genes from the confirmed disease genes, based on the observation that genes associated with similar disease phenotypes are likely to share similar biological characteristics. However, there are two challenging issues for disease gene prediction. Firstly, how to leverage various biological data sources during model building, which could effectively alleviate the bias arising from incompleteness and noise in the data. Secondly, how to integrate multiple computational models to minimize potential bias and errors, as individual learning methods have their own inherent limitations and may predict accurately for some disease genes but fail badly for others. In this work, we have designed a novel ensemble learning method EPU for predicting disease genes via a network-based random walk with restart approach on multiple biological networks, and an ensemble classification approach over multiple machine-learned prediction models. By using multiple biological data sources, EPU is less susceptible to potential bias, incompleteness and noise in any individual data source.
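A minimal sketch of random walk with restart on a gene network is given below; the restart probability, toy adjacency matrix, and seed choice are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def rwr(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart: iterate p = (1 - r) * W p + r * p0,
    where W is the column-normalised adjacency matrix and p0 is uniform
    over the seed (confirmed disease) genes. The converged p ranks all
    genes by proximity to the seeds."""
    W = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:     # L1 convergence check
            break
        p = p_next
    return p

# toy network: genes 0-2 tightly connected, gene 3 peripheral
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
scores = rwr(adj, seeds=[0])   # seed on one confirmed disease gene
```

Genes closer to the seed in the network receive higher steady-state probability, which is what allows unlabeled genes to be ranked as disease candidates.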
In this paper, we chose Nearest Neighbor, Naïve Bayes and SVM as the three base learning models of EPU for three reasons: 1) they are state-of-the-art learning techniques that have been widely used in the disease gene identification field [10] [11] [15] [17] [34]; 2) we are combining PU learning models instead of traditional classification models, and these three classification models can be easily adapted to build PU learning models; 3) they have quite diverse learning criteria, so their complementary nature may contribute to a more accurate and robust combined result. By employing an ensemble approach for prediction, EPU also minimizes the inherent limitations of individual prediction models. Finally, by employing PU learning techniques for building its ensemble of classification models, EPU is able to treat the unknown genes appropriately as an unlabeled set U (instead of a negative set N) for training, thereby resulting in more robust predictions. Experimental evaluations have confirmed the effectiveness of our proposed approach, with our EPU method consistently performing much better than the existing state-of-the-art techniques for disease gene prediction on six disease classes. As more biological data sources and machine learning classifiers become available in the future, our EPU method can serve as an effective framework to integrate these additional biological and computational resources for better disease gene predictions. For further work, we will explore the inclusion of other biological data sources for disease gene prediction using our framework. Given that many machine learning problems in biomedical research involve positive unlabeled data, we can also adapt our EPU framework to other applications, such as drug-target interaction prediction [35] [36].

Author Contributions