MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction

Recently, a growing number of biological research and scientific experiments have demonstrated that microRNA (miRNA) affects the development of human complex diseases. Discovering miRNA-disease associations plays an increasingly vital role in devising diagnostic and therapeutic tools for diseases. However, since uncovering associations via experimental methods is expensive and time-consuming, novel and effective computational methods for association prediction are in demand. In this study, we developed a computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI) to discover new miRNA-disease associations by integrating the predicted association probability obtained from matrix decomposition through sparse learning method, the miRNA functional similarity, the disease semantic similarity, and the Gaussian interaction profile kernel similarity for diseases and miRNAs into a heterogeneous network. Compared with previous computational models based on heterogeneous networks, our model took full advantage of matrix decomposition before the construction of heterogeneous network, thereby improving the prediction accuracy. MDHGI obtained AUCs of 0.8945 and 0.8240 in the global and the local leave-one-out cross validation, respectively. Moreover, the AUC of 0.8794+/-0.0021 in 5-fold cross validation confirmed its stability of predictive performance. In addition, to further evaluate the model's accuracy, we applied MDHGI to four important human cancers in three different kinds of case studies. In the first type, 98% (Esophageal Neoplasms) and 98% (Lymphoma) of top 50 predicted miRNAs have been confirmed by at least one of the two databases (dbDEMC and miR2Disease) or at least one experimental literature in PubMed. In the second type of case study, what made a difference was that we removed all known associations between the miRNAs and Lung Neoplasms before implementing MDHGI on Lung Neoplasms. As a result, 100% (Lung Neoplasms) of top 50 related miRNAs have been indexed by at least one of the three databases (dbDEMC, miR2Disease and HMDD V2.0) or at least one experimental literature in PubMed. Furthermore, we also tested our prediction method on the HMDD V1.0 database to prove the applicability of MDHGI to different datasets. The results showed that 50 out of top 50 miRNAs related with the breast neoplasms were validated by at least one of the three databases (HMDD V2.0, dbDEMC, and miR2Disease) or at least one experimental literature.

Introduction MicroRNA (miRNA) are one class of important short noncoding RNA (~22nt) molecules that mostly inhibit gene expression at the post-transcriptional level [1][2][3][4]. In 1993, lin-4 was the first miRNA detected as a result of research on the timing of C. elegans larval development [5]. Unlike conventional protein coding genes, lin-4 coded for a 22 nucleotide regulatory RNA rather than a protein [5,6]. Since then, thousands of miRNAs have been discovered in many living organisms, and currently 2588 miRNAs in the human genome have been annotated [7]. Because each miRNA is probably able to control the expression of hundreds of target genes, the whole miRNA pathway is a critical mechanism for gene expression control [2,[8][9][10][11][12][13].
Recently, more and more studies have shown that miRNAs play critical roles in diverse important biological processes. Therefore, it is no surprise that miRNA could be associated with cancers [14,15] and other kinds of diseases [16]. As indicated by increasing evidences, miRNAs are emerging as novel potential biomarkers or diagnostic/therapeutic tools for diseases [17][18][19][20][21][22]. For example, miR-708 affects the progress of bladder carcinoma through direct inhibition of Caspase-2 [23]. MiR-29c down-regulation results in derepression of its target DNA methyltransferase 3a, which promotes the development of ischemic brain damage [24]. Another example is that let-7b expression has a positive correlation with patient age (R = 0.472; p<0.001) [25]. Higher Nuclear opalescence or cataract scores for Nuclear color (N), Cortical (C) and Posterior (P) was discovered positively associated with higher expression of let-7b in patients with age-related cataracts (p<0.001) [25]. Identifying potential miRNAdisease associations enhances the understanding towards molecular mechanisms and pathogenesis of diseases. As a biomarker, miRNA can be used for disease diagnosis; and as drug targets, miRNAs can be applied to disease treatment. Since carrying out experiments is an expensive and time-consuming process, only a small number of miRNA-disease associations have been confirmed by traditional experimental approaches. Proposing computational models to predict disease-related miRNAs is a worthful supplement to experiments. Researchers should spare no effort to excogitate a more accurate prediction method so that reasonable candidates can be provided for future biological experiments [26].
In recent years, several computational methods have been developed to predict potential miRNA-disease associations and some of them performed well [27][28][29][30][31][32][33]. Based on the assumption that functionally related miRNAs are more likely to be associated with diseases which have similar phenotypes, Jiang el al. [34] proposed a network-based approach by combining miRNA similarity network, disease similarity network with miRNA-disease association network. After that, based on the hypergeometric distribution, a scoring system was constructed to acquire the scores of potential miRNA-disease associations. Focusing on the functional link between miRNA targets and disease genes, Shi et al. [35] devised a computational model by performing random walk on a protein-protein interaction (PPI) network and focusing on the functional links between miRNAs targets and diseases genes in PPI network. Mørk et al. [36] developed miRNA-Protein-Disease (miRPD) association prediction model by linking miRNAs to diseases via the underlying proteins involved. However, these methods did not exhibit a commendable predictive performance because their performance depended largely on miRNA-target interactions which have a high ratio of false-positive and false-negative samples.
Some other computational models without relying on miRNA-target interactions have been proposed in the past few years. Xuan et al. [37] developed the method called Human Disease-related MiRNA Prediction (HDMP) to predict miRNAs associated with diseases. The association score between a miRNA and a disease was computed by summing up the subscores for the miRNA's k neighbors, and the sub-score for a neighbor was calculated via multiplying the weight of the neighbor with the functional similarity between the neighbor and the miRNA. However, HDMP could not be applied to new diseases without any known associated miRNAs because predictions were made mainly from the information of miRNAs' neighbors. Based on the global similarity measures, Chen et al. [38] developed a method named Random Walk with Restart for MiRNA-Disease Association prediction (RWRMDA) by implementing random walk on the miRNA functional similarity network to prioritize candidate miRNAs for disease of interest. Nonetheless, this method was unable to predict miRNAs associated with the diseases without any known related miRNAs. Another computational model named MIR-NAs associated with Diseases Prediction (MIDP) was developed by Xuan et al.
[39] based on random walk on a miRNA network derived from miRNA-associated diseases and semantic similarity of their associated diseases. The model assigned higher transition weights to labeled nodes than unlabeled nodes, which efficiently exploited the prior information of nodes and the various ranges of topologies. Besides, since they extended the walking on a miRNA-disease bilayer network, MIDP could also be used to prioritize candidate miRNAs for diseases without any known associated miRNAs. Later, a method named Matrix Completion for MiRNA-Disease Association prediction (MCMDA) [40] was proposed to predict potential associations by utilizing the matrix completion algorithm to update the adjacency matrix. However, the algorithm also suffered from a limitation of not being applicable to new diseases and new miRNAs.
Recently, Chen et al. [41] proposed another model called Within and Between Score for MiRNA-Disease Association prediction (WBSMDA). After integrating similarity for miRNAs and diseases, within-score and between-score were calculated and combined to obtain the final score for potential miRNA-disease association inference. Later, Chen et al. [42] presented a model of Heterogeneous Graph Inference for MiRNA-Disease Association prediction (HGIMDA) by combining the integrated miRNA similarity network, the integrated disease similarities network and the known miRNA-disease associations network into a heterogeneous graph. After that, they constructed an iterative equation by summarizing all paths with the length equal to three from which they can infer potential association between a disease and a miRNA.
In addition, several other computational models were based on machine learning algorithms. For example, based on the features which were extracted from MiRNA Target-Dysregulated Network (MTDN) model by assessing topological properties of miRNAs and changes in miRNA expression, Xu et al. [43] implemented a Support Vector Machine (SVM) classifier to distinguish positive miRNA-disease associations from negative ones. However, even till today it is still difficult to obtain negative samples, and this fact seriously decreased the prediction accuracy of MTDN. Chen et al. [44] further presented Regularized Least Squares for MiRNA-Disease Association prediction (RLSMDA) method based on semi-supervised learning in the miRNA space and the disease space. What is worth mentioning is that RLSMDA could identify related miRNAs for diseases without any known associated miRNAs. Chen et al.
[45] developed another computational model called restricted Boltzmann machine for multiple types of miRNA-disease association prediction (RBMMMDA), the core of which was restricted Boltzmann machine (RBM). The model built a two-layer undirected graphical model containing layers of visible and hidden units. Compared to previous models, RBMMMDA could obtain not only new miRNA-disease associations but also the corresponding association types. The method named Ranking-based KNN for miRNA-Disease Association prediction (RKNNMDA) [46] was first implemented to search for k-nearest neighbors both for miRNAs and diseases by using the K-Nearest Neighbors (KNN) algorithm. Then these k-nearest neighbors were reranked according to the SVM ranking model. Finally, weighted voting was carried out on the ranking results to obtain the final ranking of all potential miRNA-disease associations. The drawback of RKNNMDA was that bias might be caused to miRNAs with more known associated diseases.
Identifying miRNAs associated with diseases is beneficial for the development of diagnostic/treatment tools for diseases. Using traditional experimental methods for association detection is demanding and so computational models for miRNA-disease association prediction are needed to complement to experiments. Because previously developed computational methods have some aforementioned limitations, it is essential to develop a new method that exploits more useful information and make more reliable predictions. However, there are also some difficulties of predicting potential disease-related miRNAs, such as the rare known miRNAdisease associations, the unavailable negative miRNA-disease associations, the relatively limited biological datasets about miRNAs, and the universality to new diseases without any known associated miRNAs as well as new miRNAs. What's more, considering that some of the existing computational models are only based on one of the matrix decomposition algorithm and network algorithm, it is of great significance to fully take advantage of these two methods to develop a new calculation model for miRNA-disease association prediction. In this study, we developed an effective computational model of Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction (MDHGI). We first rebuilt a new adjacency matrix by using Sparse Learning Method (SLM) to decompose the original adjacency matrix obtained from known miRNA-disease associations. Then we combined the miRNA functional similarities network, the disease semantic similarities network, the Gaussian interaction profile kernel similarities network, and the new adjacency matrix into a heterogeneous graph. Finally, we implemented normalization on integrated similarity for miRNAs and diseases and iteration algorithm on the graph to obtain a predicted association score scores for all miRNA-disease pairs. To evaluate the effectiveness of MDHGI, global and local Leave-One-Out Cross Validation (LOOCV) as well as 5-fold cross validation were carried out. The AUCs of global and local LOOCV were respectively 0.8945 and 0.8240, and the model obtained an average AUC of 0.8794+/-0.0021 in 5-fold cross validation. In the case studies of four important human cancers, 49, 49, 50, and 50 out top 50 predicted miRNAs for Esophageal Neoplasms, Lymphoma, Lung Neoplasms, and Breast Neoplasms were respectively confirmed by different databases or experimental literatures in PubMed. These results proved that MDHGI was effective in predicting potential miRNA-disease associations and it had significant advantages over previous methods. Our main contribution in this article is to perfect the HGIMDA model and further improve its accuracy by taking full advantage of the two methods (matrix decomposition and network algorithm). Besides, the idea presented in this article may have new inspiration for other researchers and the model we proposed is also a supplement to methodological research.

Human miRNA-disease associations
Actually, since the data in the databases is derived from the collected experimental literatures, it is a common practice for researchers to utilize known miRNA-disease associations data in HMDD as the training set. As previous studies have done [27,39,[47][48][49], in this paper, the known miRNA-disease associations dataset was extracted from the HMDD V2.0 database. The dataset contained 5430 validated associations between 495 miRNAs and 383 diseases. To facilitate subsequent calculations, we constructed an adjacency matrix A 2 R m×n to store the known miRNA-disease associations and other miRNA-disease pairs. In the adjacency matrix A, m and n are respectively defined as the number of miRNAs and diseases. Besides, element A (r i ,d j ) is set to be 1 if miRNA r i is associated with disease d j , otherwise 0 [42].

Disease semantic similarity model 1
In previous studies [50-54], many researchers made use of the DAG to describe a disease in their calculation models. According to the National Library of Medicine (http://www.nlm.nih. gov/), we can obtain the relationship of various diseases based on the disease Directed Acyclic Graph (DAG) constructed from the MeSH descriptor of Category C. For example, for the DAG of lung neoplasms (See Fig 1), 'respiratory tract diseases' points to 'lung diseases'. All nodes in the DAG are connected by a direct edge from a more general term, we call it parent, to a more specific term, and we call it child [55]. Here, a disease D was described by DAG = (D, T(D),E(D)), in which we defined all ancestor nodes of D and D itself as T(D) and the edge set including the direct edges from parent nodes to child nodes as E(D). In DAG(D), the contribution of disease d to the semantic value of disease D was defined as: where Δ is the semantic contribution decay factor. Here, based on the previous literature [37], we denoted the value of Δ to 0.5. Moreover, the semantic value of disease D was defined as: Since the larger part of DAG was shared by two diseases, the higher semantic similarity value they would get, the semantic similarity score between disease d i and d j were defined as follows:

Disease semantic similarity model 2
It is obvious that a disease which appears in less DAGs contributes to the semantic similarity of disease at a high level. Considering the inexact approach in disease semantic similarity model 1 that the contribution of diseases in the same layer of DAG (D) to the semantic value of D were treated as the same. We defined the contribution of disease d in DAG(D) to the semantic value of disease D as follows:

D2 D ðdÞ ¼ À log½the number of DAGs including t=the number of diseases ð5Þ
We then defined the semantic similarity between disease d i and d j in the similar way as the disease semantic similarity model 1.
MiRNA functional similarity Wang et al. [56] developed the MISIM method to calculate the miRNA functional similarity between a miRNA pair (r i and r j ). The whole process of MISIM can be divided into four steps.
In the first step, we need to identify the diseases set D(r i ) (diseases associated with r i ) and D(r j ) (diseases associated with r j ) for miRNA r i and r j , respectively. Next, the semantic value of all diseases in these two sets are computed according to the corresponding DAG. Third, the semantic similarity for each disease pairs between D(r i ) and D(r j ) can be calculated based on their semantic value. Finally, the functional similarity of r i and r j is calculated based on the semantic similarity obtained in step three. From the website http://www.cuilab.cn/files/ images/cuilab/misim.zip, we downloaded the miRNA functional similarity data. Then the miRNA functional similarity matrix MS was constructed, in which the element MS(r i ,r j ) indicated the similarity value between the miRNA r i and the miRNA r j .

Gaussian interaction profile kernel similarity
Based on the notion that functionally similar miRNAs are usually associated with similar diseases, the Gaussian interaction profile kernel similarity can be constructed as another algorithm for similarity measurement between two miRNAs/diseases [57,58]. It is obvious that the i th row and j th column of adjacent matrix A respectively represents the information whether the miRNA or the disease are associated with each of the diseases or the miRNAs. For convenience, we denoted vector IV(r i ) and IV(d j ) to represent the i th row vector and j th column vector, respectively. Therefore, the Gaussian interaction profile kernel similarity of diseases and miRNAs could be computed as follows: where the adjustment coefficients β d and β r could be defined as follows: where β 0 d and β 0 r are the original bandwidths and both of them were defined as 1 based on the previous study [59].

Integrated similarity for miRNAs and diseases
The integrated disease similarity could be obtained through combining the disease semantic similarity and the disease Gaussian interaction profile kernel similarity. What makes a difference to the integrated miRNA similarity is that if disease d i and d j have their own DAG (i.e. these two diseases have semantic similarity), then the final integrated similarity is the average between SS and GD. Otherwise the integrated disease similarity equals to the value of Gaussian interaction profile kernel similarity.
Furthermore, the integrated miRNA similarity could be obtained by combining the miRNA functional similarity with miRNA Gaussian interaction profile kernel similarity: r i and r j has functional similarity

MDHGI
In this study, the proposed method, MDHGI, fully extends the advantages of matrix factorization and network algorithm to make prediction for miRNA-disease associations. The flow chart of the algorithm is shown in Fig 2. Actually, the data we used to train our model are normally far from perfect. Considering that, a portion of the miRNA-disease associations in the real data would be redundant, and also some other miRNA-disease associations would be missing from the real data. Hence, the adjacency matrix for miRNA-disease associations can be decomposed into two parts. The first part is a linear combination of the original adjacency matrix and a low-rank matrix and the second part is a sparse matrix with most entries being zeros and can be considered as the noise or the outliers. The method is used to look for the lowest-rank matrix which is further utilized to reconstruct a new adjacency matrix that will be used in the next calculation.
Firstly, we decomposed A as follows: Obviously, there were infinite many solutions for Eq (15). However, since we wished X to be of low rank, where rank of a matrix was the maximum number of linearly independent column (or row) vectors in the matrix, and E to be sparse, we could enforce the nuclear norm or trace norm on X and sparse norm on E. Mathematically, Eq (15) could be thus relaxed as where α is a positive free parameter which was used to balance the weights of low-rank matrix and sparse matrix. Here, according to the existing method [60], the value of α was defined as 0.1.
Minimizing the trace norm of a matrix contributed to the lower-rank matrix, meanwhile the sparse norm was capable of identifying noises and outliers.
If the matrix A in AX in the right side of Eq (16) is set as identity matrix, then the model is degenerated to the robust PCA. Therefore, Eq (16) could also be regarded as a generalization of the robust PCA [61,62]. Eq (16) could be rewritten into an equivalent problem as The Eq (19) above, which is a constraint and convex optimization problem, can be solved by off-the-shelf interior point solvers after being reformulated as a semidefinite program [63]. However, the interior point solvers are not suitable for large matrices since they rely much on second-order information of the objective function. Thus, we should take advantage of both the first-order information and the special properties of this class of convex optimization problems to overcome the scalability issue. The iterative thresholding (IT) algorithm, accelerated proximal gradient (APG) algorithm, exact augmented Lagrange multipliers (EALM) algorithm and inexact augmented Lagrange multipliers (IALM) algorithm are several methods to solve the problem of Eq (19). However, for IT algorithm, as shown in the original literature [55], the iteration process converges extremely slowly (about 10 4 iterations to converge). As for APG algorithm, although the APG's computing speed has improved when compared to IT algorithm, it is still not as fast as IALM. Especially, the solution to Eq (19) obtained from the IALM is much more accurate than that by APG. Moreover, even though the convergence rate of EALM is as fast as IALM, the latter requires less number of partial SVDs. In general, the IALM algorithm is a relatively more efficient algorithm to solve the problem of Eq (19). Thus, in this paper, we utilized IALM [64] method by first converting Eq (19) to an unconstraint problem and then minimizing this problem based on augmented Lagrange function such that where μ ! 0 is a penalty parameter. The problem above could be solved by minimizing with respect to J, X, and E, respectively. Besides, after fixing the other variables and then updating the Lagrange multipliers Y 1 , Y 2 , Eq (20)would be settled. The detailed steps of how to solve Eq (20) is shown in Fig 3. We defined the solution of Eq (20) as X Ã and E Ã . If A ij represents the association between miRNA r i and disease d j , then X Ã 2 R n×n could be considered as a similarity matrix that described the similarity between diseases. While if A ij represents the associations between disease d i and miRNA r j (as the transposition of the adjacency matrix in Eq (15)), then X Ã 2 R m×m describes the similarity between miRNAs. After obtaining X Ã , the solution of Eq (20), we could compute the new associations between each pair of miRNAs and diseases by projecting the adjacency matrix onto the lower-dimensional space as From the matrix A Ã , we reacquired the miRNA-disease associations information which were further combined with the integrated similarity for miRNAs and diseases into a heterogeneous graph. By analyzing the heterogeneous graph, for disease d and miRNA m, we could further define their potential association probability as follows if they had no known associations.
According to the equation above, the potential association probability between miRNA m and disease d could be calculated by summarizing all paths with the length equal to three (See Fig 4). Moreover, considering the iteration of above process, we obtained iterative equation through representing the equation as matrix multiplications.
Here, the decay factor a was denoted to 0.4 based on the previous study [65]. For the iteration, we can treat this process like that every node with prior information disseminates the information obtained in the previous iteration to its neighbors. Due to the relation between the endpoints and the probability of looking into an edge among the same end-points in a random network with the same node degrees, the weight of an edge was normalized according to the degrees of its end-points [66]. Based on the previous literature [65], miRNA-disease association probability matrix P would converge when SR and SD were properly normalized utilizing Eqs (24) and (25), respectively. Moreover, we have given the specific proof process as Theorem 1 in the S1 Text Here, we set the cutoff as 10 −6 . The iteration above would become stable when the change between P i and P i+1 measured by L1 norm was less than the given cutoff.
In addition, for convenience, we have made a web server at http://chengroup.cumt.edu.cn/ tool/mdhgi/. After opening the website and selecting the disease name of interest in the box, researchers will get the prediction results of potential disease-related miRNAs. For more details, please see the website's 'Guide'.

Performance evaluation
Here, we used two types of cross validation to evaluate the performance of MDHGI, namely, LOOCV and 5-fold cross validation. LOOCV could be further divided into global and local LOOCV, in which each known association was in turn considered to be the test sample and the others were treated as the training samples. In global LOOCV, each of the known miRNAdisease associations was in turn considered as the test sample and all unknown miRNA-disease pairs were treated as candidate samples, while in local LOOCV, candidate samples only contained those miRNAs without any known associations with the investigated disease in the test sample. In 5-fold cross validation, we randomly divided all known miRNA-disease associations into five subsets with equal size. Then each subset was in turn considered as the test sample and the rest four subsets were treated as training samples. In the same way as LOOCV, all unknown miRNA-disease pairs were regarded as candidate samples. Subsequently, we obtained a predicted association score matrix by MDHGI, and ranked the score of each test sample against the scores of the candidate samples. This partition-prediction-ranking procedure was repeated 100 times to obtain a sound estimate of the mean and variance of MDHGI's prediction accuracy.
In each cross validation scheme, the model would be considered to successfully predict an association if the ranking of a test sample was above a given threshold. Moreover, we drew a receiver operating characteristics (ROC) curve through plotting the true positive rate (TPR, sensitivity) versus the false positive rate (FPR, 1-specificity) at different thresholds. Sensitivity referred to as the percentage of the test samples whose ranks surpassed the given threshold, while specificity denoted the percentage of negative miRNA-disease associations whose ranks were below the threshold. Then, we calculated the area under the ROC curve (AUC) to evaluate the predictive performance of MDHGI. AUC = 1 would indicate that all test samples were perfectly predicted, while AUC = 0.5 would mean the model only had random prediction performance. As shown in Fig 5, MDHGI obtained an AUC of 0.8945 in global LOOCV and an AUC of 0.8240 in local LOOCV. These results proved that MDHGI exhibited a sound performance in predicting potential miRNA-disease associations. However, the AUCs for MaxFlow [67] [69] were 0.7891, 0.8196 and 0.6299, respectively. Both RWRMDA and MIDP were not applicable to global LOOCV, because, based on random walk, they could not uncover missing associations for all the diseases simultaneously. Moreover, MiRAI was also not included in global LOOCV. In MiRAI, for a disease/miRNA associated with more miR-NAs/diseases, the association scores between the disease/miRNA and its candidate miRNAs/ diseases tended to be higher. Therefore, the association scores obtained for different diseases were not comparable. MiRAI had a low AUC because our training dataset was sparse. Since the dataset only contained 5430 validated associations between 495 miRNAs and 383 diseases, the majority miRNAs/diseases were associated with only a few diseases/miRNAs. While in the original literature [69], the dataset contained only 83 diseases with at least 20 known associated miRNAs. As for 5-fold cross validation, in comparison with MaxFlow, RKNNMDA, RLSMDA, HDMP, WBSMDA and MCMDA whose average AUCs were 0.8579+/-0.001, 0.6723+/-0.0027, 0.8569+/-0.0020, 0.8342+/-0.0010, 0.8185+/-0.0009 and 0.8767+/-0.0011, respectively, the average AUC for MDHGI was 0.8794+/-0.0021. This further confirmed the superior prediction accuracy and the performance stability of our model.
In addition, we have supplemented seven experiments by assigning different weight parameters to miRNA-miRNA edges and disease-disease edges, while the weight for miRNA-disease edges remains unchanged (See Table 1). The reason why we carried out these experiments is that we can make quantitative analysis for the reliability of the data (the miRNA functional similarity and the disease semantic similarity). As shown in Table 1, with the diminution of the weight for miRNA-miRNA edge and disease-disease edge, the values of the AUC for Global LOOCV, Local LOOCV and 5-fold cross validation decreased. What is worth mentioning is that all of the three types of AUCs descended in a very slow way, which proved our model's stability in a certain degree. From the results, we can conclude that the data of miRNA functional similarity and disease semantic similarity we used are reliable.

Case studies
In order to further demonstrate the prediction accuracy of MDHGI, we carried out case studies on two important human complex diseases by prioritizing candidate miRNAs for the diseases using our model with the training dataset from HMDD V2.0 [70]. Just like the validation databases in some existing methods [27,39,[47][48][49], we verified the top 50 predictions with two other prominent miRNA-disease association databases, namely, dbDEMC [71] and miR2Disease [72]. The first type of case study was implemented on Esophageal Neoplasms and Lymphoma. In our model, we utilized the known miRNA-disease associations in HMDD V2.0 as the training set. After ranking all candidate miRNAs for each investigated disease based on their predicted scores, the top 50 predicted miRNAs were picked out and verified in other two prominent miRNA-disease association databases (i.e., dbDEMC and miR2Disease). Besides, the results showed that 232 of the 5430 known miRNA-disease associations in HMDD V2.0 also existed in miR2Disease and 546 associations also existed in dbDEMC after comparing the HMDD V2.0 with miR2Disease/dbDEMC. Nonetheless, since only candidate miRNAs (miR-NAs unassociated with the investigated disease based on HMDD V2.0) for an investigated disease were ranked and verified, there was no overlap between the training samples and the prediction lists. Hence, none of the top 50 predicted miRNAs existed in HMDD V2.0 and the verification of miRNAs in the prediction lists was completely independent of HMDD V2.0. Esophageal cancer is a commonly-diagnosed cancer arising from the esophagus-the food pipe that runs between the throat and the stomach. Based on the estimates of the esophageal cancer burden in the United States in 2017, the new cases and deaths from esophageal cancer will be 16940 and 15690, respectively [73]. Recent research showed that the first miRNA we predicted (hsa-mir-200b) suppresses invasiveness and modulates the cytoskeletal and adhesive machinery in esophageal squamous cell carcinoma cells via targeting Kindlin-2 [74]. Moreover, the data provided by Chen et al. [75] offered the convincing evidence that combined expression of hsa-mir-133a and hsa-mir-133b (2nd in the prediction list) might predict chemosensitivity of patients with esophageal squamous cell carcinoma (ESCC) undergoing paclitaxel-based chemotherapy which implied its importance in applying 'personalized cancer medicine' in the clinical treatment of ESCC. Another example is that aberrant expression level of hsa-mir-16 (3rd in the prediction list) could suppress cell apoptosis while promote growth by regulating the reversion-inducing cysteine-rich protein with Kazal motifs (RECK) and the ex-determining region Y-related high-mobility-group box transcription factor 6 (SOX6) which play important roles in the pathogenesis of ESCC [76]. MDHGI was implemented to identify potentially related miRNAs for Esophageal Neoplasms and ranked the miRNAs in terms of their association scores. As a result, 10 out of the top 10, 18 out of the top 20, and 43 out of the top 50 predictions were manually confirmed in database dbDEMC and miR2disease (See Table 2). Besides, to further confirmed our prediction results, we also manually verified the top 50 predicted miRNAs in PubMed. The result showed that 49, 49 and 46 out of the top 50 predictions were respectively confirmed by at least one, two, and three experimental literatures in PubMed (See S1 Table).
Lymphoma is a group of blood cell tumors that develop from lymphocytes (a type of white blood cell). Hodgkin lymphoma (HL) and non-Hodgkin lymphoma (NHL) are the two main types of lymphoma [77]. Recent experimental studies showed that hsa-mir-223 (1st in the prediction list) regulates cell growth and targets proto-oncogenes in mycosis fungoides/cutaneous T-cell lymphoma [78]. Besides, it also has been verified that plasma hsa-mir-155, hsa-mir-203, and hsa-mir-205 (2nd in the prediction list) are biomarkers for monitoring of primary cutaneous T-cell lymphomas (TCTL) [79]. Moreover, the study of Yang et al. suggested that hsa-mir-10b (3rd in the prediction list) contributes to osteoblast differentiation through targeting B cell lymphoma 6 (Bcl6) which provides a novel insight into understanding the molecular mechanism underlying osteoblast differentiation and suggests a potential target for inhibiting bone loss [80]. Taking lymphoma as the investigated disease and implementing MDHGI for potential miRNA-lymphoma association prediction, nine out of the top 10, 15 out of the top 20 and 44 out of the top 50 potential lymphoma-associated miRNAs were manually verified in database dbDEMC and miR2disease (See Table 3). Furthermore, in the same way as the validation of esophageal cancer, 49, 48 and 46 out of the top 50 predictions were respectively confirmed by at least one, two and three experimental literatures in PubMed (See S2 Table).
To facilitate further validation and research, we have provided the complete prediction list of potential miRNAs associated with all the 383 human diseases in HMDD V2.0, together with the association scores predicted by MDHGI (See S3 Table). In addition, to illustrate the applicability of MDHGI to new diseases, namely, diseases that have no known associated miRNAs, we carried out another case study on Lung Neoplasms. Known associations for this disease were removed from the training dataset, so that predictions would only be made from the information of other diseases' related miRNAs and the similarity measures. After implementing MDHGI, we obtained the ranking of Lung Neoplasms' candidate miRNAs in terms of their association scores (See Table 4). The data provided by Babu et al. suggested that increased expression of hsa-mir-20a (1st in the prediction list) in lung cancer may decrease iron export which will lead to intracellular iron retention and cell proliferation [81]. Besides, recent research showed that hsa-mir-17 (2nd in the prediction list) and hsa-mir-92 families play important roles in cisplatin resistance and can be used as potential biomarkers for better predicting the clinical response to platinum-based chemotherapy in non-small cell lung cancer (NSCLC) [82]. Shen et al. provided the evidence that downregulation of hsa-mir-18a (3rd in the prediction list) sensitizes NSCLC to radiation treatment and it may help to develop a new approach to sensitizing radioresistant lung cancer cells by targeting hsa-mir-18a [83]. Respectively, 10, 20 and 50 out of the top 10, 20 and 50 predictions were manually confirmed in HMDD V2.0, dbDEMC and miR2Disease. Similarly, we also manually verified the top 50 predicted miRNAs in PubMed. The result showed that 50 out of the top 50 predictions were confirmed by at least three experimental literatures in PubMed (See S4 Table). Finally, we trained our model with the dataset from the HMDD V1.0 to demonstrate that MDHGI would perform equally well on different datasets. Breast Neoplasms was used as the investigated disease. As a result, there were respectively 10, 20, and 48 out of the top 10, 20 and 50 predictions manually confirmed in the three databases mentioned above (See Table 5). Besides, 50 out of the top 50 predictions were confirmed by at least three experimental literatures in PubMed (See S5 Table). Taking first-ranked hsa-let-7e as an example, research confirmed that umonji/Arid1 B (JARID1B) promoted breast tumor cell cycle progression through epigenetic repression of hsa-let-7e [84]. Recent experimental studies showed that breast cancer patients with low hsa-let-7b (2nd in the prediction list) expression had poor prognoses which indicated that hsa-let-7b might act as cancer suppressor gene in breast cancer development and progression by inhibiting the expression of BSG [85]. Moreover, the results of Sun et al. suggested that hsa-mir-223 (3rd in the prediction list) increases the sensitivity of triple-negative breast cancer stem cells (TNBCSCs) to TRAIL (tumor necrosis factor-related apoptosisinducing ligand)-induced apoptosis by targeting HCLS1 (hematopoietic cell-specific Lyn substrate 1)-associated protein X-1 (HAX-1) [86].
According to the results presented, MDHGI consistently achieved an excellent predictive performance in each of the four case studies. With the continuous experimental research on miRNA-disease associations, we expect that more and more miRNAs in the prediction lists generated by our model would be verified in the future. Table 4. Prediction of the top 50 predicted miRNAs associated with lung neoplasms based on known associations in HMDD V2.0 database. All known associations between the miRNAs and Lung Neoplasms were removed before the prediction process. The prediction result was examined in dbDEMC, miR2Disease and HMDD V2.0. The first column records top 1-25 related miRNAs. The third column records the top 26-50 related miRNAs.

Discussion
This paper introduced the computational method called MDHGI in which we combined the sparse learning method with the heterogeneous graph inference method to calculate potential miRNA-disease association scores. In the process of low-rank matrix decomposition, the sparse norm could effectively handle training datasets with a high level of noises and a low quality, which were commonly faced by biological researchers. However, some elements with the value of 1 in the adjacency matrix might turn into 0 after using the sparse learning method, which means the corresponding known miRNA-disease associations information might be removed. To overcome the disadvantage, the heterogeneous graph inference method was used by integrating the Gaussian interaction profile kernel similarity, the disease semantic similarity, the miRNA functional similarity, and miRNA-disease associations which were reacquired from the recalculated adjacency matrix into a heterogeneous graph. The excellent performance of MDHGI was demonstrated by experimental results from both cross validation and case studies on Esophageal Neoplasms, Lymphoma, Lung Neoplasms and Breast Neoplasms. It could be concluded that MDHGI should serve as an effective tool for predicting potential miRNA-disease associations, and would be helpful in human disease prevention, treatment, diagnosis, and prognosis. The reliable performance of MDHGI came from the following factors. Firstly, by decomposing the original data into a clean (a linear combination of low-rank matrix and the adjacency matrix) and noise (sparse matrix) parts, we could obtain a clean data about the associations between miRNAs and diseases. Secondly, more and more disease-miRNA association data had been discovered and confirmed. Due to the data-dependent property of sparse learning method, the increasing number of known associations improved the prediction accuracy. Thirdly, MDHGI could be used to make predictions for new diseases which have no known related miRNAs and miRNAs without any known associated diseases. Lastly, MDHGI could effectively uncover missing miRNA-disease associations for all diseases simultaneously. Therefore, MDHGI is a superior model over previous ones.
Limitations also exist in this method. Firstly, though current studies benefited from the increased known data, it is never a finished work to expand data. Secondly, it is obvious that assigning different penalization parameters for the three different types of edges (miRNA-miRNA edges, disease-disease edges and miRNA-disease edges) would be more accurate for the prediction performance. However, there are some difficulties that make us unable to do this work. Firstly, for the moment, we don't know how to properly give different weights to vertexes and edges in the network. Secondly, since all the known miRNA-disease associations we utilized in our model were based on databases (i.e., different experimental literatures), it is very difficult for us to quantify the reliability of different edges. Hence, taking full account of your suggestions, we will conduct our research in this area in the next step. In addition, the parameter set in the algorithm is difficult to optimize, and deserves further research. Finally, MDHGI might cause bias to miRNAs which have more associated disease records and vice versa. Therefore, we would develop optimization strategies to improve the accuracy of this prediction method in the future.
Supporting information S1 Text. The specific proof process of Theorem 1.