PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction

In the recent few years, an increasing number of studies have shown that microRNAs (miRNAs) play critical roles in many fundamental and important biological processes. As one of pathogenetic factors, the molecular mechanisms underlying human complex diseases still have not been completely understood from the perspective of miRNA. Predicting potential miRNA-disease associations makes important contributions to understanding the pathogenesis of diseases, developing new drugs, and formulating individualized diagnosis and treatment for diverse human complex diseases. Instead of only depending on expensive and time-consuming biological experiments, computational prediction models are effective by predicting potential miRNA-disease associations, prioritizing candidate miRNAs for the investigated diseases, and selecting those miRNAs with higher association probabilities for further experimental validation. In this study, Path-Based MiRNA-Disease Association (PBMDA) prediction model was proposed by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. This model constructed a heterogeneous graph consisting of three interlinked sub-graphs and further adopted depth-first search algorithm to infer potential miRNA-disease associations. As a result, PBMDA achieved reliable performance in the frameworks of both local and global LOOCV (AUCs of 0.8341 and 0.9169, respectively) and 5-fold cross validation (average AUC of 0.9172). In the cases studies of three important human diseases, 88% (Esophageal Neoplasms), 88% (Kidney Neoplasms) and 90% (Colon Neoplasms) of top-50 predicted miRNAs have been manually confirmed by previous experimental reports from literatures. Through the comparison performance between PBMDA and other previous models in case studies, the reliable performance also demonstrates that PBMDA could serve as a powerful computational tool to accelerate the identification of disease-miRNA associations.

In the recent few years, an increasing number of studies have shown that microRNAs (miR-NAs) play critical roles in many fundamental and important biological processes. As one of pathogenetic factors, the molecular mechanisms underlying human complex diseases still have not been completely understood from the perspective of miRNA. Predicting potential miRNA-disease associations makes important contributions to understanding the pathogenesis of diseases, developing new drugs, and formulating individualized diagnosis and treatment for diverse human complex diseases. Instead of only depending on expensive and time-consuming biological experiments, computational prediction models are effective by predicting potential miRNA-disease associations, prioritizing candidate miRNAs for the investigated diseases, and selecting those miRNAs with higher association probabilities for further experimental validation. In this study, Path-Based MiRNA-Disease Association (PBMDA) prediction model was proposed by integrating known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases. This model constructed a heterogeneous graph consisting of three interlinked sub-graphs and further adopted depth-first search algorithm to infer potential miRNA-disease associations. As a result, PBMDA achieved reliable performance in the frameworks of both local and global LOOCV (AUCs of 0.8341 and 0.9169, respectively) and 5-fold cross validation (average AUC of 0.9172). In the cases studies of three important human diseases, 88% (Esophageal Neoplasms), 88% (Kidney Neoplasms) and 90% (Colon Neoplasms) of top-50 predicted miRNAs have been manually confirmed by previous experimental reports from literatures. Through the comparison performance between PBMDA and other previous models in case studies, the reliable performance also demonstrates that PBMDA could serve as a powerful computational tool to accelerate the identification of disease-miRNA associations. PLOS  Introduction MicroRNAs (miRNAs) are an abundant class of small (20~25 nucleotides) endogenous noncoding RNAs, which were normally deemed as negative gene regulators by suppressing the expression of messenger RNAs (mRNAs) in a sequence-specific manner and repressing the protein translation of their target genes [1][2][3][4]. Since the two members of the miRNA family (i.e., lin-4 and let-7) were firstly discovered [5][6][7], mounting biological observations and studies have indicated that miRNAs play important roles in many important biological processes. To date, 2588 miRNAs have been discovered in the human genome [8]. With the advances in molecular biology and biotechnology, miRNAs have been proven to influence many important physiological processes such as cell growth [9], immune reaction [10], cell differentiation [11], cell development [9], cell cycle regulation [12], inflammation [13], cell apoptosis [14], stress response [9,15], and tumor invasion [16]. In addition, emerging evidences imply the strong links between miRNAs and diseases. For examples, miR-129, miR-142-5p, and miR-25 were found to be differentially expressed in all pediatric brain tumor types [17]. MiR-145 was observed to target the insulin receptor substrate-1 and restrain the growth of colon cancer cells [18]. MiR-23/27/24 clusters were involved in angiogenesis and endothelial apoptosis, which has potential therapeutic applications in both of vascular disorders and ischemic heart disease [19]. Therefore, the accumulating miRNA-disease associations could be utilized for the pathological classification, individualized diagnosis, and disease treatment [20]. Some databases (e.g. HMDD [21] and miR2Disease [22]) have been constructed for providing accumulating insights into the relationship between miRNAs and diseases. To date, HMDD has already collected 10368 entries including 572 miRNAs and 378 diseases, which shed some light on the molecular mechanisms of diseases. The existing knowledge of miRNAdisease associations mainly comes from previous biological experiments. However, only depending on the traditional biological experiments is time-consuming and costly, therefore unpractical to detect miRNA-disease associations on a large scale in a short term. For addressing these challenges, increasing attentions have been devoted to developing computational models to predict potential miRNA-disease associations by integrating various experimentally confirmed and heterogeneous datasets [23][24][25][26][27][28][29].
For predicting or prioritizing disease-related miRNAs, there have been some computational methods proposed, which are mostly based on the assumption that functionally related miR-NAs tend to be involved in phenotypically similar diseases and vice versa. For examples, Jiang et al. [30] constructed functionally related miRNA network and human phenome-micro-RNAome network and prioritized the potential miRNA-disease associations according to the cumulative hypergeometric distribution. However, this method excessively depended on the predicted miRNA-target associations which include a high rate of false-positive and high falsenegative results. Xuan et al. [31] presented the prediction model of HDMP based on the most weighted similar neighbors to predict potential disease-associated miRNAs. This method integrated the information content of disease terms and phenotype similarity between diseases to infer the functional similarity of miRNA pairs. However, it becomes invalid when no known associated miRNA is available for diseases which are less investigated. Mørk et al. [32] developed a protein-driven method called miRPD for inferring miRNA-disease associations by calculating the product of the two scoring functions, which were calculated based on the miRNA-protein and protein-disease associations. These three aforementioned methods only considered miRNA neighbor information in their ranking system, which can be summarized as traditional local network similarity measure-based computational models. Some methods were mainly proposed based on the more effective global network similarity measure rather than traditional local network similarity measures. For example, Chen et al. [33] proposed the computational model of Random Walk with Restart for MiRNA-Disease Association (RWRMDA) to identify novel disease-related miRNAs by implementing random walks on the miRNA-miRNA functional similarity network. However, this model cannot work for new diseases without any known related miRNAs. Similarly, Xuan et.al. [34] developed a computational model of MIRNAs associated with Diseases Prediction (MIDP) and its extension version named MIDPE for the diseases with known related miRNAs and without any known related miRNAs, respectively. They established the transition matrices between the labeled and unlabeled nodes for exploring the prior information of nodes and the different ranges of topologies. Shi et.al. [35] also performed random walk analysis to explore the relationships between miRNAs and diseases by calculating the functional associations between miRNA targets and disease genes in protein-protein interaction networks. Later, Chen et al. [36] developed the method of Within and Between Score for MiRNA-Disease Association prediction (WBSMDA) to uncover the potential miRNA-disease associations by integrating several heterogeneous biological datasets, which improves the prediction accuracy of previous classical computational models. By calculating and combining Within-Score and Between-Score, WBSMDA can work for new diseases without any known related miRNAs and new miRNAs without associated diseases. Some proposed machine learning-based models demonstrated their power in this field. Based on the assumption that aberrant regulations of target mRNAs occur because their miR-NAs are implicated in a specific disease, Xu et al. [37] prioritized novel disease-related miR-NAs by constructing the MiRNA Target-Dysregulated Network (MTDN). In addition, they built a support vector machine (SVM) based supervised classifier to distinguish prostate cancer and non-prostate cancer related miRNAs by combining four topological features extracted from MTDN. However, the prediction performance suffered from unavailable verified negative miRNA-disease association samples. A semi-supervised method called Regularized Least Squares for MiRNA-Disease Association (RLSMDA) was developed by Chen et al. [38]. Considering obtaining verified negative samples (i.e. those disease-miRNA pairs without any known association evidences) is difficult and even impossible, cost functions were defined and minimized in the framework of Regularized Least Squares (RLS). Then the optimal classifiers in the disease and miRNA spaces were yielded and combined to obtain the final predictive results. Chen et al. [39] also developed the model of Restricted Boltzmann Machine (RBM) for multiple types of miRNA-disease association prediction (RBMMMDA) for predicting various types of miRNA-disease associations. Based on the known miRNA-disease association network, RBMs were constructed and trained by Contrastive Divergence (CD) algorithm. Finally, this model implemented prediction by calculating conditional probabilities.
The knowledge on miRNAs provides valuable information for the prevention, diagnosis and treatment of human diseases. There is an urgent need to accelerate the identification of disease-miRNA associations for further studies on pathology as well as drug development. We here proposed a novel Path-Based MiRNA-Disease Association (PBMDA) prediction method by constructing a heterogeneous graph consisting of three interlinked sub-graphs (i.e., miRNA-miRNA similarity network, disease-disease similarity network and known miRNAdisease association network). Integrating different types of heterogeneous biological datasets allows that PBMDA could be applied to the new diseases with no known associated miRNAs and the new miRNAs with no known associated diseases. In addition, the proposed method can simultaneously prioritize all unknown miRNAs for all the investigated diseases.
In this work, three evaluation frameworks, including global leave-one-out cross validation (global LOOCV), local leave-one-out cross validation (local LOOCV) and 5-fold cross validation (5-fold CV), were implemented to evaluate the prediction performance of PBMDA. When performing on the HMDD database, PBMDA obtained the best performance in the frameworks of both local and global LOOCV (AUCs of 0.8341 and 0.9169, respectively) and 5-fold cross validation (average AUC of 0.9172) compared with several state-of-the-art computational models [31,33,36]. To further evaluate the performance of PBMDA, we implemented case studies of three important human diseases. As a result, most of top-50 predicted disease-related miRNAs (44/50 for Esophageal Neoplasms; 44/50 for Kidney Neoplasms; 45/50 for Colon Neoplasms) were verified by previously published literatures, respectively. Besides, 9 out of top-10 predicted obesity-related were manually validated based on the published literatures. Our model also represents an improvement to the prediction accuracy through the comparison performance between PBMDA and other previous representative models in case studies. A simulation experiment also proves the applicability of our model to a new disease (no known associated miRNAs).

Materials and methods
Human miRNA-disease associations HMDD database (http://www.cuilab.cn/hmdd) has collected 5430 experimentally verified human miRNA-disease associations (see S1 Table), involving 495 miRNAs and 383 diseases (see S2 and S3 Tables). The adjacency matrix Y is constructed to describe the confirmed associations between miRNA and disease. Namely, if miRNA m(i) is recorded to be associated with disease d(j), the entity Y(i,j) is equal to 1, otherwise 0. For further description in detail, the investigated numbers of miRNAs and diseases in our study are represented by variables nm and nd, respectively. To evaluate the prediction lists for case studies, another two independent databases (i.e. dbDEMC [40] and miR2Disease [22]) are utilized for validation.

MiRNA functional similarity
According to previous literature [41], it could be concluded that miRNAs with similar functions are more likely associated with similar diseases. Under this assumption, miRNA functional similarity score was calculated (http://www.cuilab.cn/files/images/cuilab/misim.zip). We therefore utilized these data to construct miRNA functional similarity symmetric matrix FS (see S4 Table), in which the entity FS(m(i),m(j)) indicates how is miRNA m(i) functionally similar to another miRNA m(j).

Disease semantic similarity
Mesh database (http://www.ncbi.nlm.nih.gov/), a strict system for disease classification, is available for effectively researching the relationship between different diseases. Disease could be transformed into corresponding Directed Acyclic Graph (DAG), such as DAG(D) = (T(D), E(D)), where T(D) indicates the node set including node D and its ancestor nodes, and E(D) is the edge set of corresponding direct links from a parent node to a child node, which represents the relationship between different diseases [41]. Based on disease DAG, the contribution of disease term d to the semantic value of disease D and the semantic value of disease D itself can be formulated by the following two equations, respectively.
Where 4 is a the semantic contribution decay factor, which shows that as the distances between disease D and its ancestor diseases increases, their contribution to the semantic value of disease D progressively decreases. Accordingly, disease D locates in the 0th layer, the contribution to the semantic value of disease D itself was defined as 1. The contribution of its ancestor disease should be multiplied by the semantic contribution decay factor. Therefore, 4 should be assigned a value between 0 and 1, and that this value was set as 0.5 here according to some previous important literatures [42,43]. Based on this way to measure disease semantic similarity, it should be considered that two diseases sharing more common parts of their DAGs should obtain higher semantic similarity. Under this assumption, the semantic similarity between two diseases d(i) and d(j) can be calculated as: where the entity SS(d(i),d(j)) in row i column j represents the disease semantic similarity between d(i) and d(j).

Gaussian interaction profile kernel similarity for diseases
According to the basic assumption that two miRNAs with more functional similarity tend to be more associated with similar diseases, the topologic information of the known miRNA-disease association network could be used to measure disease similarity. We therefore introduce Gaussian interaction profile kernel for calculating the network topologic similarity between diseases. A binary vector IP(d(i)), i.e. the ith column of matrix Y, is recorded as the interaction profiles of disease d(i) for representing associations between d(i) itself and each miRNA. We then utilized Eq (4) to compute Gaussian kernel similarity between disease d(i) and disease d (j) based on their interaction profiles.
where parameter γ d is a regulation parameter of the kernel bandwidth. As a symmetric matrix, KD represents the Gaussian interaction profile kernel similarity for all investigated diseases. Parameter γ d is needed to be updated by using a new bandwidth parameter γ ' d divided by the average value of associations with miRNAs for all diseases. Based on the previous successful research about lncRNA-disease association prediction [44], γ ' d is set to 1 for controlling the kernel bandwidth. So γ d can be formulated as: Gaussian interaction profile kernel similarity for miRNAs Similarly, we also calculated the Gaussian interaction profile kernel similarity for miRNAs, which can be calculated by Eqs (6) and (7): where γ ' m = 1 and KM is a symmetric matrix, whose entity KM(m(i),m(j)) denotes the Gaussian interaction profile kernel similarity between miRNA m(i) and miRNA m(j).

Integrated similarity for miRNA and disease
MiRNA functional similarity and disease semantic similarity are the primary data to construct the disease and miRNA similarity matrix. However, these matrices have the problem of sparsity, so we calculated the Gaussian interaction profile kernel similarity based on the known miRNA-disease associations for calculating the similarity of those disease-disease or miRNA-miRNA pairs without corresponding disease semantic similarity or miRNA functional similarity. For constructing two integrated similarity matrix (i.e., miRNA similarity matrix S m and disease similarity matrix S d ), we integrated miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity for miRNAs and diseases by judging whether miRNA m(i)/disease d(i) has functional/semantic similarity with another miRNA m (j)/disease d(j) or not.  1). For eliminating the node sets with the weak interaction, we set the threshold variable T to 0.5 based on the previous literature research [45], which means we did not take such links into consideration if the similarity between these nodes was less than 0.5. So that three weighted matrixes can be represented as: In this way, we constructed a heterogeneous graph with lots of paths, which consisted of these three weighted matrixes. A path was defined as a connection between a miRNA and a disease. Furthermore, all paths between a miRNA and a disease must be acyclic to avoid the visited nodes from being traversed repeatedly. A specific depth-first search algorithm was adopted to traverse all paths in the graph, which is easy to be implemented as a recursive algorithm. For saving time, we set a parameter L to limit the maximum length of paths. According to previous literature research [45] and comparison experiments, we found that it was suitable to set L = 3 after trying different values from 2 to 4 increasingly, i.e. one path was not allowed to include more than three edges. Based on the assumption that if more paths are found to connect a miRNA and a disease, they are more likely to have associations, the accumulative contributions from all paths between a miRNA-disease pair could be integrated as a final score. Accordingly, the scoring formula can be defined as Eq (11) with the exponential decay function F decay (p), which is depended on the path between a specific miRNA m i and a specific disease d j : where p = {p 1 ,p 2 ,. . .,p n } is a set of paths linking up a miRNA m i and a disease d j , and ∏p w represents the product of the weight of the all the edges in path p w obtained from the Eq (10). Generally, longer paths between a miRNA and a disease should have less confidence to directly demonstrate their relationship, i.e. the contributions from the longer path should be cut down more sharply. So the decay function F decay (p) can be calculated as follows: where parameter α is a decay factor, which was set 2.26, according to previous literature research [45], and len(p) is the length of path p. After traversing all paths in the graph, each miRNA-disease pair could obtain a final score representing the association confidence between this miRNA and disease, i.e. the higher score they obtain, the more closely related they should be. As an example in Fig 1, the score value of miRNA1 (m 1 ) and disease1 (d 1 ) is calculated as: The edge between d 4 and d 1 is not taken into consideration, because its weight is less than threshold T. The code and data of PBMDA is freely available at http://www.escience.cn/system/file?fileId=84394.

LOOCV and 5-fold CV
To evaluate the predictive performance of PBMDA, we implemented LOOCV and 5-fold CV based on known miRNA-disease associations downloaded from HMDD database [21]. LOOCV could be divided into two evaluation frameworks based on the ranking scope (i.e., global LOOCV considers all investigated diseases while local LOOCV only includes a given disease). They both followed the common framework of LOOCV, i.e. each known miRNA-disease association was left in turns served as a test sample and other known miRNA-disease associations were regarded as training samples. Test sample was ranked among the candidate miRNA-disease associations without any known association evidences. The test samples with higher ranks than the specific threshold would be considered as successful predictions. In the framework of 5-fold CV, all known verified miRNA-disease associations were randomly divided into five uncrossed groups, of which one was regarded as testing samples and the other four were used for training in turns. In this paper, we randomly implemented 100 divisions of all known verified miRNA-disease associations to reduce bias brought by sample divisions. The receiver operating characteristic (ROC) curves were drawn for performance evaluation by calculating true positive rate (TPR, sensitivity) and false positive rate (FPR, 1-specificity) based on the varying threshold. Sensitivity indicates the percentage of the positive test samples which are ranked higher than the given threshold; specificity indicates the percentage of candidate samples which are ranked lower than the given threshold. In this way, the ROC curves were plotted based on TPR versus FPR. The areas under ROC curves (AUCs) were also calculated for a numerical evaluation of model performance. AUC = 0.5 denotes a purely random prediction while AUC = 1 denotes a perfect prediction. As a result, the reliable AUCs of 0.9169 and 0.8341 in the frameworks of global and local LOOCV were obtained by PBMDA. Furthermore, the average and the standard deviation of AUC in the framework of 5-fold CV are 0.9172 and 0.0007, respectively. It is anticipated that PBMDA could serve as an effective and robust computational prediction model.

Performance comparison with other methods
We further compared the prediction performance of PBMDA model with four state-of-theart computational prediction models (i.e., WBSMDA [36], RLSMDA [38], HDMP [31] and RWRMDA [33]). RWRMDA and HDMP are the representational methods in this domain. They were often chosen as benchmarking methods to validate the later developed methods, such as: MIDP [35] and Shi's method [36]. .8342+/-0.0010, respectively, which was observed that PBMDA obtained the best performance based on 5-fold CV. In conclusion, PBMDA significantly improves prediction performance of previous computational models by demonstrating its reliable and robust performance from these evaluation frameworks.

Effects of parameters
For simplicity, the maximum path length L and weight threshold T were respectively selected 3 and 0.5 for the prediction. It is possible to obtain the better prediction performance by adjusting these parameters values. 5-fold CV was implemented over 100 times for further evaluation. As a result, given T = 0.5, the average AUC respectively equals to 0.7726+/-0.0016 (when L = 2) and 0.9172+/-0.0007 (when L = 3). It seemed that, the performance of our model could be affected by being restrained with the insufficient neighbor length in each path. Because it took too long to run the program, we did not complete the experiment when L = 4. It seems that as L increases, the computational complexity represents an exponential growth. However, we also implemented 5-fold CV 100 times based on our model and a smaller dataset (HMDD v1.0, 1395 known miRNA-disease associations involved in 271 miRNAs and 137 diseases) for testing performance with increasing L (L = 2~4) and fixed T (T = 0.5). As we can see in Table 1, the prediction accuracy of proposed model is deceased with the increasing L based on a smaller dataset. It is assumed that the increasing L tends to cause overfitting problems for a small network dataset. Therefore, it is not necessary to obtain the improvement of prediction accuracy by increasing the path length L. We should take the size of a network dataset into consideration when selecting parameter L. It seems like that L should be generally increased, We also implemented a series of 5-fold CV experiments with the increasing T values to obtain the optimal setting (see Table 2). It is showed that, the selection of T value cannot greatly affect the accuracy of PBMDA, which demonstrates a strong robustness on this parameter. So we made a better trade-off to set T = 0.5.

Case studies
Many miRNAs in the top rank were predicted to have associations with digestive system and urinary system. It seems that miRNA functional expression is closely related to the dysfunction of both digestive system and urinary system. We attempted to explore their potential relationships and what a role of miRNA plays in disease mechanisms of digestive and urinary system. Esophagus and colon belong to digestive system, while kidney belongs to urinary system. Therefore, to further evaluate the prediction performance of PBMDA, Esophageal Neoplasms, Kidney Neoplasms and Colon Neoplasms were investigated to infer their underlying associated miRNAs. Two independent databases (i.e., dbDEMC [40] and miR2Disease [22]) were used as benchmark datasets to verify the predictive results. The quantitative statistics demonstrates the reasonability of this benchmarking method (see Table 3). Besides, miR2Disease and dbDEMC are commonly utilized to be benchmark datasets in this domain, such as HDMP and MIDP models. It is worthwhile to note that, the predicted miRNA-disease associations were not included in HMDD database, including those highly-ranked disease-related miRNAs listed in Tables 4-7. For the sake of space, this article only mentions these three cancers. Actually, our model can make a successful prediction for almost all of given important diseases. We also present the verification of top-50 prediction list for those important diseases investigated by other previous computational models, such as: prostate neoplasms, breast neoplasms and lung neoplasms in HDMP (see Table 8 and S5 Table). Although HMDD database included plentiful miRNA-disease association prediction of known miRNA-disease associations, there were still many other miRNA-disease associations existing in other two independent benchmark databases, which were not overlapped in HMDD database. So that, we utilized the known miRNA-disease associations (included in HMDD) to prioritize the novel miRNA-disease associations (not included in HMDD), and evaluated the prediction performance of PBMDA by observing how many these novel associations were matched by other two independent benchmark databases. Esophageal Neoplasms is one of the most common digestive carcinomas with poor prognosis. With the growth of the tumor, patients could cause corresponding symptoms, such as difficult or painful swallowing, weight loss and coughing up blood. Cisplatin-based chemotherapy is the main approach for the treatment of Esophageal Neoplasms but the chemotherapy response is difficultly detected. Some studies suggested that miRNAs could be considered as effective prognostic biomarkers for Esophageal Neoplasms. For examples, hsa-let-7 can be considered as a prognostic biomarker for measuring the response to chemotherapy. In addition, when a recurrence of disease happens, patients have relatively higher expression of mature hsa-miR-143 and mature hsa-miR-145 than normal people. A case study of Esophageal Neoplasms was implemented on PBMDA for yielding the most probable related miRNAs (see Table 4  miRNA-disease association prediction growth in Esophageal Neoplasms [46]. Previous research showed that hsa-miR-125b (2nd in the prediction list) can promote cell proliferation in Esophageal Neoplasms by influencing the target transcripts: CYP24, ERBB2 and ERBB3 [47]. Moreover, hsa-miR-221 (3rd in the prediction list) can be regarded as a useful diagnostic marker for measuring the sensitivity to the treatment of Esophageal Neoplasms [48]. Kidney Neoplasms is the most rapidly increasing tumor type in incidence rate, especially among black persons. And more than 80 percent of patients are found to have renal-cell carcinoma (RCC). Recent studies found that patients with RCC usually have overexpression of miR-34a, which plays a critical role in slowing the growth of RCC. Besides, MiRs-141/200c were considered as the most down-regulated miRNAs in RCC by targeting ZEB2, which is a type of transcriptional repressor [49,50]. In order to identify potential disease-miRNA associations, we implemented the case study of Kidney Neoplasms. The existing experimental literatures have demonstrated 9 of top-10 and 44 of top-50 potential miRNA candidates were correctly associated with this important human disease (see Table 5). For example, miR-155 (1st in the prediction list), miR-126 (4th in the prediction list) and miR-20a (5th in the prediction list) were identified to be upregulated in clear-cell type human renal cell carcinoma (ccRCC), relative to normal kidney samples [51,52]. Furthermore, it was found that miR-145 (2nd in the prediction list) and miR-146a (3rd in the prediction list) with over expression suppress their target mRNA and protein expression of the STAT-1 pathway in kidney tissues [53]. Colon Neoplasms maintains the second leading cause of cancer-related death in the United States [54]. Although chemotherapy has important therapeutic value, surgery is still the only curative way for the treatment of Colon Neoplasms. There is an urgent need to find potential biomarkers, which have a strong response to the clinical observations. By using in situ hybridization technique, researchers have confirmed that miR-21 has high expression levels in colonic carcinoma cells [55]. What's more, let-7 functions as a potential growth suppressor in human colon cancer tumors and cell lines [56]. Identifying more miRNAs associated with Colon Neoplasms helps accurately evaluate the clinical outcomes. Therefore, we implemented the case study of Colon Neoplasms based on PBMDA. In the prediction list, 9 of top-10 and 45 of top-50 predicted miRNAs obtained confirmation of their associations with Colon Neoplasms based on recent experimental literatures (see Table 6). For examples, preclinical research showed that the expression of miR-21 (1st in the prediction list) is related to clinicopathologic features of colorectal cancer [57]. Experimental studies also found that miR-20a (2nd in the prediction list) shows significantly higher expression in colon cancer tissues than normal tissues [58]. MiR-18a (3rd in the prediction list) is considered as a colon tumor suppressor by targeting on K-Ras (mRNA) to influence cell proliferation and anchorage-independent growth [59]. What's more, miR-34a (6th in the prediction list) has important potential to be used as potential diagnostic and prognostic biomarker by using its expression at different stages of Colon Neoplasms [60]. Considering that most cancers are characterized by some extent of genetic and genomic modifications, so the fact that cancers are associated with miRNA dysregulation is perhaps an obvious form of validation, which inspired us to know whether PBMDA can achieve the similar effectiveness for another disease type, such as obesity. Because there is only one entry corresponding to the miRNA-obesity association in both HMDD and the two independent benchmark databases (i.e., dbDEMC [40] and miR2Disease [22]), we decided to manually validate the top-10 predicted obesity-related miRNAs based on the published literatures. As we can see the validation results from Table 7, 9 out of top-10 predicted miRNAs have been demonstrated to be associated with obesity, which demonstrated that PBMDA also work effectively for other diseases.
Because only WBSMDA and PBMDA chose the latest version of HMDD for the prediction, we decided to compare the performance between PBMDA and WBSMDA by observing how many top-50 predicted miRNA-disease associations have been confirmed by dbdemc and miR2Disease databases for these three important diseases. Their validation results were compared in Table 9. As we can see from this table, PBMDA perform better than WBSMDA in general.
Besides, we have also implemented PBMDA on the older version of HMDD (v1.0) for further comparison between PBMDA and another three compared computational models, i.e. HDMP, RWRMDA and RLSMDA. For a fair comparison, we only took those top-50 predicted associations verified by three benchmark databases (i.e. HMDD v2.0, dbDEMC and miR2Disease) into consideration. Namely, those predicted associations additionally verified by extra literatures were not included. Besides, we could only compare performance between PBMDA and other compared models for those given diseases, whose verification was published in their articles, e.g. prostatic, breast and lung neoplasms in HDMP. Based on these comparison results (see Table 10), our model generally improves prediction performance for various selected diseases, relative to other compared computational models. To prove the applicability of our model to a new disease (no known associated miRNAs), we selected Glioblastoma for further verification. We removed all records about Glioblastoma from the known miRNA-disease association network derived from HMDD and that Glioblastoma could be regarded as a new disease. We implemented our model for prediction and then the prediction result about Glioblastoma was yielded. Similarly, we verified top-50 predicted miRNA-disease associations by HMDD, dbDEMC and miR2Disease (see Table 11). Fifteen out of top-20 and 37 out of top-50 predicted miRNAs have been verified to be associated with Glioblastoma by HMDD, dbDEMC and miR2Disease. Based on these prediction results, we can safely conclude that PBMDA can still achieve the reliable prediction performance for a new disease. Most importantly, it also demonstrates that our model is indeed applicable to a new disease.
As a global computational model, PBMDA was also implemented to simultaneously prioritize the potential miRNAs for all investigated diseases. Owing to the limited prior knowledge, some promising disease-related miRNAs have not been validated yet. We therefore listed the top-100 potential associations in S6 Table.

Discussions
With the great amount of researches, it was found that miRNAs play increasingly significant roles in many physiological processes including complex human diseases. Researchers attempt to identify disease-related miRNAs as valuable biomarkers for clinical measure, diagnosis, prognosis and treatment. The biological experiment-based verification is not only time-consuming but also expensive, which boosts the development of computational predictive models. A novel Path-Based MiRNA-Disease Association (PBMDA) computational prediction model was proposed here by integrating heterogeneous biological networks. PBMDA could construct a heterogeneous graph by padding internal connections, including miRNA-miRNA similarity, disease-disease similarity and known miRNA-disease associations. MiRNA-miRNA similarity and disease-disease similarity are inferred from Gaussian interaction profile kernel similarity for miRNA and disease, miRNA functional similarity network, and disease semantic similarity By manually validating the predicted obesity-related miRNAs based on the published literatures, 9 out of top-10 predicted miRNAs have been demonstrated to be associated with obesity. Furthermore, through the comparison performance between PBMDA and other previous models in case studies, it is anticipated that PBMDA would significantly accelerate the identification of miRNA-disease associations. We are planning to provide a standalone tool or webserver for users in the future. This study is aimed to firstly propose the computational model for the next schedule. There are several major factors contributing to the high prediction performance of PBMDA. First, reliable biological datasets were utilized to establish an integrated similarity network, which represents three relationships (i.e., miRNA-miRNA similarity, disease-disease similarity, and miRNA-disease associations). Second, as a path-based model, PBMDA can effectively take advantage of topological information implied in the integrated heterogeneous network. Third, Table 11. We removed all records about Glioblastoma from the known miRNA-disease association network derived from HMDD and implemented our model for prediction. Fifteen out of top-20 and 37 out of top-50 predicted miRNAs have been verified to be associated with Glioblastoma by HMDD, miR2Disease and dbDEMC databases.
PBMDA can be applied for new disease (no known associated miRNAs) and new miRNAs (no known associated diseases), which greatly improves the practicability and reliability of the PBMDA. Depending on the disease semantic similarity and miRNA functional similarity, we can construct the disease-disease and miRNA-miRNA similarity network. Depth-first search algorithm can be used to assign the scores to the paths like: disease new $disease$miRNA and miRNA new $miRNA$disease. In this way, the unverified disease-miRNA associations including new diseases and/or new miRNAs also can be prioritized based on their aggregated scores. Fourth, the model of PBMDA can be easily introduced together with other biological information (e.g. various miRNA-related interactions and disease phenotypic similarity [61,62]) for further improving the quality of the integrated heterogeneous network. Last but not least, PBMDA could simultaneously prioritize candidate miRNAs for all investigated diseases.
There is still a vast potential to boost the prediction performance of PBMDA, which still have some limitations. For examples, the miRNA-disease associations obtained from HMDD database are far from enough, which greatly influences the performance of our approach. The disease semantic similarity and miRNA functional similarity have problem of sparsity, which was remedied by integrating the Gaussian interaction profile kernel similarity inferred from the known miRNA-disease associations. It inevitably did bring the predicted error to the constructed heterogeneous graph. Finally, the distance-decay function in our approach is relatively simple, and it could be reconstructed based on the machine learning methods.
Supporting information S1  Table. PBMDA's top-50 prediction list verified by dbDEMC and miR2Disease databases. (XLSX) S6 Table. As a global measure model, PBMDA can simultaneously prioritize all unknown potential related miRNAs for all investigated diseases. We here publicly released the top-100 potential miRNA-disease associations predicted by PBMDA.