SCMFMDA: Predicting microRNA-disease associations based on similarity constrained matrix factorization

MicroRNAs (miRNAs) are small non-coding RNAs involved in many complex biological processes, and numerous studies have suggested that they are closely associated with human diseases. In this study, we propose a computational model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association prediction (SCMFMDA). To combine different disease and miRNA similarity data effectively, we applied the similarity network fusion algorithm to obtain an integrated disease similarity (composed of disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity) and an integrated miRNA similarity (composed of miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity). In addition, L2 regularization terms and similarity constraint terms were added to the traditional Nonnegative Matrix Factorization algorithm to predict disease-related miRNAs. SCMFMDA achieved AUCs of 0.9675 and 0.9447 under global leave-one-out cross validation and five-fold cross validation, respectively. Furthermore, case studies on two common human diseases were carried out to demonstrate the prediction accuracy of SCMFMDA. Most of the top 50 predicted miRNAs were confirmed by experimental reports, indicating that SCMFMDA is effective for predicting relationships between miRNAs and diseases.

Introduction
MicroRNAs (miRNAs) are non-coding RNAs of about 17-24 nt that play a pivotal part in controlling gene expression through RNA cleavage or translation repression [1][2][3]. Lin-4 was the first miRNA identified experimentally, by Lee et al. [4] in 1993. Since then, a large number of miRNAs have been discovered in experiments [4,5], in species ranging from viruses to animals and plants [6]. Because miRNAs regulate the expression of a great many target genes, the miRNA pathway as a whole plays a key role in gene expression control [7][8][9]. miRNAs are involved in several crucial biological processes, such as cell development, cell differentiation and cell proliferation [10]. Dysregulation of miRNAs can result in developmental defects and is also associated with the progression of diseases [11]. Meanwhile, many studies have indicated that miRNAs are connected with a series of human neoplasms, including lung neoplasms [12] and prostate neoplasms [13]. Hence, identifying disease-associated miRNAs can deepen our understanding of the genetic causes of complex diseases. Many connections between miRNAs and diseases have been found by traditional experiments in the past few years [14,15]. Traditional experimental approaches can infer connections between miRNAs and diseases, but they are time-consuming and laborious and have a high failure rate. Therefore, effective and stable computational methods are needed to reveal potential relationships between miRNAs and diseases, as they can provide increasingly reliable miRNA-disease connections [16].
In recent years, many computational algorithms and methods have been applied to predict potential miRNA-disease relationships [17,18]. For example, Jiang et al. [19] proposed a model that used the human phenome-microRNAome network to predict potential interactions between functionally similar miRNAs and phenotypically similar diseases. However, its predictive performance was lower than expected because of the high false positive and false negative rates in the associations between miRNAs and targets. Later, the model WBSMDA [20] introduced Gaussian interaction profile similarity to enrich the similarity information of miRNAs and diseases. WBSMDA could also predict potential relationships for new miRNAs and new diseases without any verified association information. Collaborative matrix factorization was applied to predict miRNA-disease relationships in CMFMDA [21], which could also use plentiful biological information to uncover unknown interactions. The model EGBMMDA [22] used decision tree learning to discover novel miRNA-disease interactions by integrating verified miRNA-disease connections, miRNA functional similarity and disease semantic similarity; an informative feature vector was constructed from multiple measures to train a regression tree under the gradient boosting framework. Zhao et al. [23] applied adaptive boosting to predict unverified miRNA-disease associations in the ABMDA model, using k-means clustering on negative samples to perform random sampling and thereby balance positive and negative samples. The BHCMDA [24] model used the biased heat conduction (BHC) algorithm to predict unknown connections between miRNAs and diseases by combining the miRNA similarity matrix, the disease similarity matrix and the miRNA-disease association matrix.
The probabilistic matrix factorization (PMF) algorithm was used in the IMIPMF [25] model to infer potential miRNA-disease interactions. PMF is widely used in recommender systems, so it can make effective use of all available information to recommend miRNAs that are strongly associated with a disease.
More recently, methods based on random walk have been proposed and have produced more accurate predictions. Chen et al. [26] used the random walk with restart algorithm to construct the RWRMDA model. Because prediction based on global network similarity performs better than prediction based on local network similarity [27,28], RWRMDA employed global network similarity to identify likely interactions between miRNAs and diseases. Unfortunately, RWRMDA is not applicable to diseases without known associated miRNAs. Shi et al. [29] used the functional links between human disease genes and miRNA targets to devise a novel model, in which a random walk algorithm and a global network distance measurement were applied to search for likely relationships between miRNAs and diseases. Liu et al. [30] also used the random walk with restart algorithm to improve prediction results, applying it to a heterogeneous graph built from disease similarity and miRNA similarity. Luo et al. [31] employed an imbalanced bi-random walk method on a heterogeneous network containing miRNA and disease information to identify likely miRNA-disease interactions. Niu et al. [32] applied the random walk with restart algorithm to extract miRNA features from an integrated miRNA similarity network in the RWBRMDA model; these miRNA features were then used by a binary logistic regression algorithm to predict potential miRNA-disease associations.
To obtain reliable and accurate predictive performance, machine learning-based methods have increasingly been used to predict unknown miRNA-disease associations. For instance, the model RBMMMDA [33] used a restricted Boltzmann machine to predict multiple types of miRNA-disease associations. RBMMMDA could identify not only novel associations between miRNAs and diseases, but also the corresponding association types. The model PBMDA [34] constructed a heterogeneous graph consisting of different interlinked sub-graphs and adopted a depth-first search algorithm to find potential miRNA-disease associations. PBMDA could serve as a useful computational tool to accelerate the prediction of miRNA-disease interactions. The model DNRLMF-MDA [35] integrated dynamic neighborhood regularization and logistic matrix factorization to predict potential miRNA-disease relationships. DNRLMF-MDA applied the logistic matrix factorization algorithm to model the association probability between miRNAs and diseases, and then applied dynamic neighborhood regularization to improve predictive performance. Peng et al. [36] proposed the model MDA-CNN for miRNA-disease association identification. The miRNA-disease interaction features were first captured by a three-layer network. Then an auto-encoder was employed to identify obvious miRNA-disease feature combinations. After these feature representations were reduced, a convolutional neural network used them to predict the final results. The machine learning-based model MLMDA [37] was proposed by Zheng et al. to predict unknown miRNA-disease relationships. A k-mer sparse matrix was used to extract miRNA sequence information, which was then integrated with miRNA and disease similarity information to construct feature vectors. A deep auto-encoder neural network (AE) and a random forest classifier used these feature vectors to calculate the prediction probability.
The NCMCMDA [38] model integrated a neighborhood constraint with a matrix completion algorithm to turn the recovery task into an optimization problem. This model applied the fast iterative shrinkage-thresholding algorithm to recover missing interactions between miRNAs and diseases. Zhang et al. [39] proposed the computational model MSFSP to predict miRNA-disease interactions more accurately. MSFSP first integrated various kinds of miRNA and disease similarity information to construct the miRNA and disease similarities. The miRNA and disease similarity matrices and the verified miRNA-disease association matrix were then used to build a weighted network of miRNA-disease connections. The final prediction labels were calculated by weighting the miRNA and disease space projection scores. Ji et al. [40] proposed the SVAEMDA model to infer more disease-related miRNAs, using miRNA similarity and disease similarity to obtain representations of miRNAs and diseases. In addition, a variational autoencoder-based predictor was trained to predict unknown miRNA-disease interactions; it combined verified miRNA-disease interactions with the miRNA and disease representations to generate the feature vectors of miRNAs and diseases.
Because previous models have several limitations, we present a novel model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association prediction (SCMFMDA). To obtain rich disease similarity data, we applied the similarity network fusion algorithm to integrate three disease similarities: disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity. Similarly, miRNA similarity data were obtained by applying similarity network fusion to integrate miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity. In addition, we added L2 regularization terms and similarity constraint terms to the standard Nonnegative Matrix Factorization (NMF) method to predict more unknown miRNA-disease associations. To evaluate the effectiveness of SCMFMDA, global leave-one-out cross validation and five-fold cross validation were carried out on the verified miRNA-disease association data downloaded from HMDD v2.0 [41]. SCMFMDA achieved AUC values of 0.9675 and 0.9447, respectively. Furthermore, we performed case studies on colon neoplasms and lung neoplasms. The miR2Disease [42] and dbDEMC v2.0 [43] databases were used to validate the results of the case studies, which achieved high confirmation ratios. Experimental results showed that SCMFMDA is effective for inferring possible relationships between miRNAs and diseases.

Human miRNA-disease associations
In this study, we downloaded verified human miRNA-disease association information from the HMDD v2.0 database, which included 5430 known associations between 383 diseases and 495 miRNAs. For computational convenience, we built an adjacency matrix A ∈ R^{nd×nm} to represent the verified miRNA-disease associations, where nd and nm are the numbers of diseases and miRNAs, respectively. We use a_ij to denote the (i,j)th element of matrix A. Specifically, the element a_ij is set to 1 if disease d_i is related to miRNA m_j; otherwise, it is set to 0.
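As a minimal sketch of this encoding, the snippet below builds a binary adjacency matrix from a list of (disease, miRNA) pairs. The function name `build_adjacency` and the toy labels are illustrative assumptions; they are not part of the HMDD v2.0 release or of the original implementation.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return A (nd x nm) with A[i, j] = 1 if disease i is linked to miRNA j."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for d, m in pairs:
        A[d_index[d], m_index[m]] = 1
    return A

# Toy data (hypothetical labels, for illustration only)
diseases = ["d1", "d2", "d3"]
mirnas = ["m1", "m2"]
pairs = [("d1", "m1"), ("d3", "m2"), ("d1", "m2")]
A = build_adjacency(pairs, diseases, mirnas)
```

With the real data, A would have 383 rows, 495 columns and 5430 non-zero entries.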

Disease functional similarity
Phenotypically similar diseases tend to be associated with similar genes, so disease functional similarity can be calculated from functional information about genes. The log-likelihood score (LLS) represents the probability of a functional linkage between two genes; it can be downloaded from the HumanNet database [44] and is normalized using the maximum and minimum LLS in HumanNet. Here LLS(g_a, g_b) denotes the LLS between gene g_a and gene g_b, LLS_max and LLS_min are the maximum and minimum LLS in the HumanNet database, and LLS_n(g_a, g_b) represents the normalized LLS. The gene functional similarity score is then calculated from the normalized LLS, where S_HumanNet represents the set of all links between genes in the HumanNet database and e(a,b) indicates the link between gene g_a and gene g_b. Furthermore, a functional similarity score between a gene g and a gene set G is defined. Disease-gene association data can be obtained from SIDD [45] and are used to calculate the disease functional similarity SD_1.
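Written out from the definitions above, the quantities in this subsection might take the following form. This is a reconstruction under stated assumptions rather than a verbatim copy of the original equations: min-max scaling is assumed for the normalization, and the symbols FS, F_G, G_i and G_j (the gene sets associated with diseases d_i and d_j) are names introduced here for illustration.

```latex
\begin{align}
LLS_n(g_a,g_b) &= \frac{LLS(g_a,g_b)-LLS_{\min}}{LLS_{\max}-LLS_{\min}} \\
FS(g_a,g_b) &=
  \begin{cases}
    1, & a=b \\
    LLS_n(g_a,g_b), & a\neq b,\ e(a,b)\in S_{HumanNet} \\
    0, & \text{otherwise}
  \end{cases} \\
F_G(g) &= \max_{g_i\in G} FS(g,g_i) \\
SD_1(d_i,d_j) &= \frac{\sum_{g\in G_i} F_{G_j}(g)+\sum_{g\in G_j} F_{G_i}(g)}{|G_i|+|G_j|}
\end{align}
```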

Disease semantic similarity
Following a previous study [46], Medical Subject Headings (MeSH) descriptors can be used to calculate disease semantic similarity. A Directed Acyclic Graph (DAG) is adopted to describe the relationships among diseases. Concretely, DAG(D) = (D, T(D), E(D)) represents the DAG of disease D, in which T(D) denotes the node set containing D itself and its ancestor nodes, and E(D) denotes the edge set of direct links from parent nodes to their child nodes. The semantic value of disease D is calculated from the semantic contributions of the diseases d in T(D) to D, where Δ is the semantic contribution factor, set to 0.5 based on previous literature [47].
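The semantic value can be illustrated on a toy DAG. The sketch below assumes the commonly used scheme in which D contributes 1 to itself, every ancestor receives Δ times the largest contribution among its children, and the semantic value is the sum of all contributions in T(D). The `parents` dictionary is a hypothetical stand-in for the MeSH hierarchy.

```python
DELTA = 0.5  # semantic contribution factor given in the text

def semantic_contributions(disease, parents):
    """Contribution of every node in T(disease) to `disease`.

    `parents` maps a term to its direct parent terms (toy stand-in for MeSH).
    """
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, []):
            value = DELTA * contrib[child]
            # keep the largest contribution over all children of `parent`
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_value(disease, parents):
    """Sum of the contributions of all nodes in T(disease)."""
    return sum(semantic_contributions(disease, parents).values())

# Toy DAG: D has parents P1 and P2, which share the parent R
parents = {"D": ["P1", "P2"], "P1": ["R"], "P2": ["R"]}
contrib = semantic_contributions("D", parents)
value = semantic_value("D", parents)
```

In this toy DAG the contributions are 1 for D, 0.5 for P1 and P2, and 0.25 for R.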
This similarity rests on the assumption that two diseases are more similar if they share a larger part of their DAGs. The semantic similarity DS_1(d_i, d_j) between disease d_i and disease d_j is defined accordingly. Based on a previous study [48], diseases that appear in fewer DAGs may be more specific, so they ought to receive a higher semantic contribution. Consequently, different diseases located in the same layer of one DAG may receive different contribution values. The semantic contribution of disease d to D can therefore be calculated in a second way, and the corresponding semantic value of disease D and semantic similarity DS_2(d_i, d_j) between disease d_i and disease d_j are obtained in the same manner as for DS_1. Finally, we integrated DS_1 and DS_2 to calculate the final disease semantic similarity SD_2(d_i, d_j) between disease d_i and disease d_j.

miRNA functional similarity
The calculation of miRNA functional similarity [49,50] assumes that functionally similar miRNAs tend to be linked with phenotypically similar diseases and vice versa. We downloaded miRNA functional similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip. We constructed the matrix SM_1 with nm rows and nm columns to store this information. The element SM_1(m_i, m_j) represents the functional similarity score between miRNA m_i and miRNA m_j.

miRNA sequence similarity
We used the Needleman-Wunsch algorithm to calculate miRNA sequence similarity; the corresponding miRNA sequence information can be obtained from the miRBase database [51]. As with miRNA functional similarity, we constructed a matrix SM_2 ∈ R^{nm×nm} to store sequence similarity information, where SM_2(m_i, m_j) is the sequence similarity score between miRNA m_i and miRNA m_j.
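For readers unfamiliar with Needleman-Wunsch, the following is a minimal sketch of the global alignment score computed by dynamic programming. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration; the text does not state the scoring scheme or how raw scores were scaled into SM_2.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (scoring values are assumptions)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # aligning a prefix against the empty sequence costs one gap per symbol
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

# Toy sequences (hypothetical, not taken from miRBase)
identical = needleman_wunsch("UGAGG", "UGAGG")
one_gap = needleman_wunsch("ACG", "AG")
```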

Gaussian interaction profile kernel similarity for diseases and miRNAs
Following previous studies [49,50], because miRNAs with similar functions are likely to be linked with diseases with similar phenotypes, the Gaussian interaction profile (GIP) kernel similarity can be calculated and used as a measure of miRNA similarity and disease similarity. Concretely, the binary vector K(d_i) represents the interaction profile of disease d_i, recording whether d_i has a known association with each miRNA.
The GIP kernel similarity SD_3(d_i, d_j) between disease d_i and disease d_j is calculated from the interaction profiles K(d_i) and K(d_j) with a Gaussian kernel. In the same way, the GIP kernel similarity SM_3(m_i, m_j) between miRNA m_i and miRNA m_j is calculated from K(m_i) and K(m_j), where the binary vector K(m_i) represents the interaction profile of miRNA m_i, recording whether m_i has a known association with each disease, and the parameter ρ_m controls the kernel bandwidth.
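A short sketch of the GIP kernel follows. It assumes the usual form exp(-ρ‖K(x_i) - K(x_j)‖²), with the bandwidth ρ obtained by dividing a base value `rho_prime` (assumed to be 1 here) by the mean squared norm of the interaction profiles. The function name and the toy matrix are illustrative.

```python
import numpy as np

def gip_kernel(profiles, rho_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # bandwidth normalised by the mean squared norm of the profiles (assumed);
    # this requires at least one non-zero profile
    rho = rho_prime / sq_norms.mean()
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-rho * sq_dist)

# Toy association matrix: rows are diseases, columns are miRNAs
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
SD3 = gip_kernel(A)      # disease profiles are the rows of A
SM3 = gip_kernel(A.T)    # miRNA profiles are the columns of A
```

Diseases with identical interaction profiles, such as the first two rows here, receive a similarity of 1.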

Overview
SCMFMDA consists of two major parts. First, similarity network fusion is applied to obtain the integrated disease similarity and the integrated miRNA similarity. Second, the known miRNA-disease associations and the integrated similarities are used in similarity constrained matrix factorization to infer unknown miRNA-disease associations. The flow chart of SCMFMDA is shown in Fig 1.

Integrating similarity for diseases and miRNAs
The similarity between two diseases can be represented by disease functional similarity, disease semantic similarity and disease GIP kernel similarity. Similarly, miRNA functional similarity, miRNA sequence similarity and miRNA GIP kernel similarity can be used to indicate the similarity between miRNAs. Here, the similarity network fusion (SNF) [52] method is applied to integrate the various similarities for diseases and miRNAs. According to the previous study, the SNF process can be expressed as an iterative update of similarity matrices. The main steps for integrating the disease similarities SD_n, n = 1,2,3, with SNF are as follows.
In the first step, we calculated the normalized weight matrix P_n of each similarity network SD_n. In the second step, we used the k nearest neighbor (KNN) algorithm to measure the local relationships within each similarity network, which gives a local relationship matrix K_n in which only the entries between a disease d_i and its neighbors N_i are retained.
In the third step, we applied SNF to iteratively update each normalized weight matrix P_n using its local relationship matrix K_n and the weight matrices of the other networks. Because we had three disease similarity networks (disease functional similarity, disease semantic similarity and disease GIP kernel similarity), the number of networks m was equal to 3. After the iterative update, the final disease similarity matrix S_d was obtained from the updated matrices P_n. Similarly, we applied the SNF algorithm to obtain the final miRNA similarity matrix S_m.
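The three steps can be sketched as follows. This is a simplified illustration of SNF as commonly described, not the authors' code. It assumes that each point counts among its own neighbors, that the fused matrix is the average of the updated P_n, and that the result is symmetrized at the end. The per-iteration renormalization used in full SNF implementations is omitted for brevity, and all function names and toy inputs are hypothetical.

```python
import numpy as np

def full_kernel(W):
    """Step 1: normalized weight matrix P (1/2 on the diagonal, rows sum to 1)."""
    W = np.asarray(W, dtype=float)
    off = W - np.diag(np.diag(W))
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k):
    """Step 2: keep each row's k most similar entries and row-normalize them."""
    W = np.asarray(W, dtype=float)
    K = np.zeros_like(W)
    for i in range(W.shape[0]):
        nbrs = np.argsort(W[i])[::-1][:k]
        K[i, nbrs] = W[i, nbrs] / W[i, nbrs].sum()
    return K

def snf(similarities, k=3, n_iter=10):
    """Step 3: diffuse each network through the average of the other networks."""
    P = [full_kernel(W) for W in similarities]
    K = [local_kernel(W, k) for W in similarities]
    m = len(similarities)
    for _ in range(n_iter):
        P_new = []
        for n in range(m):
            others = sum(P[j] for j in range(m) if j != n) / (m - 1)
            P_new.append(K[n] @ others @ K[n].T)
        P = P_new
    fused = sum(P) / m
    return (fused + fused.T) / 2.0  # symmetrization is an assumption

# Three random symmetric toy similarity matrices standing in for SD_1..SD_3
rng = np.random.default_rng(0)

def toy_similarity(n):
    X = rng.random((n, n))
    S = (X + X.T) / 2.0
    np.fill_diagonal(S, 1.0)
    return S

SD = [toy_similarity(5) for _ in range(3)]
S_d = snf(SD, k=3, n_iter=10)
```

The same call with the three miRNA similarity matrices would give S_m.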

Similarity constrained matrix factorization
After obtaining the integrated disease similarity and miRNA similarity, the similarity constrained matrix factorization method is adopted to infer unknown miRNA-disease interactions; Fig 2 shows the details. SCMFMDA factorizes the matrix A ∈ R^{nd×nm} into U ∈ R^{nd×γ} and V ∈ R^{nm×γ}, where γ denotes the dimension of the disease and miRNA features in the low-rank space. Specifically, the association between a disease and a miRNA is approximated by the inner product of the disease feature vector and the miRNA feature vector: a_ij ≈ u_i v_j^T, where u_i and v_j represent the ith row of U and the jth row of V, respectively. The corresponding objective function, Eq (20), measures this approximation error. L2 regularization terms on u_i and v_j are then added to Eq (20) to reduce overfitting, giving Eq (21), where σ is the regularization parameter for u_i and v_j.
According to previous studies [53,54], the geometric properties of data points can be preserved when they are mapped from a high-dimensional space into a low-dimensional space. The disease similarity S_d and the miRNA similarity S_m describe the geometric structure of the data points, so we introduce similarity constraint terms S_U and S_V, in which S^d_ij represents the similarity between disease d_i and disease d_j and S^m_ij denotes the similarity between miRNA m_i and miRNA m_j. Because the similarity between two data points reflects the distance between them, S_U incurs a heavy penalty if two similar diseases d_i and d_j are mapped far apart in the disease feature space. Minimizing S_U therefore preserves the geometric structure of the disease data points, so that similar diseases d_i and d_j are mapped close together in the low-dimensional space. The same holds for miRNAs and S_V. Hence, the objective function of SCMFMDA is obtained by adding S_U and S_V to Eq (21), where ε is a hyperparameter that controls the smoothness of similarity consistency.
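Under the definitions above, the objective can be written out roughly as follows. This is a reconstruction under stated assumptions: the squared-error form of the fitting term, the graph-regularization form of S_U and S_V, and all constant factors of 1/2 are assumed rather than taken from the original equations.

```latex
\begin{align}
&\min_{U,V}\ \frac{1}{2}\sum_{i=1}^{nd}\sum_{j=1}^{nm}\left(a_{ij}-u_i v_j^{T}\right)^2
  && \text{(cf. Eq 20)} \\
&\min_{U,V}\ \frac{1}{2}\sum_{i,j}\left(a_{ij}-u_i v_j^{T}\right)^2
  +\frac{\sigma}{2}\sum_{i}\|u_i\|^2+\frac{\sigma}{2}\sum_{j}\|v_j\|^2
  && \text{(cf. Eq 21)} \\
&S_U=\frac{1}{2}\sum_{i,j} S^{d}_{ij}\,\|u_i-u_j\|^2,\qquad
  S_V=\frac{1}{2}\sum_{i,j} S^{m}_{ij}\,\|v_i-v_j\|^2 \\
&L=\frac{1}{2}\sum_{i,j}\left(a_{ij}-u_i v_j^{T}\right)^2
  +\frac{\sigma}{2}\Big(\sum_{i}\|u_i\|^2+\sum_{j}\|v_j\|^2\Big)
  +\frac{\varepsilon}{2}\left(S_U+S_V\right)
\end{align}
```

In this form a large S^d_ij multiplies the squared distance between u_i and u_j, so the penalty grows when similar diseases are placed far apart.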

Optimization algorithm
In this section, we describe an efficient optimization algorithm for minimizing the objective function L of SCMFMDA. First, the partial derivatives of L with respect to u_i and v_j are calculated; they involve A(i,:), the ith row of matrix A, and A(:,j), the jth column of matrix A, respectively. Then, the second derivatives of L with respect to u_i and v_j are calculated. According to Newton's method, u_i and v_j are updated iteratively using these first and second derivatives. The updates of u_i and v_j stop when the convergence condition is met, and the prediction matrix A^P is obtained from the updated u_i and v_j.
The value of A^P_ij denotes the association probability between disease d_i and miRNA m_j: the higher the score, the more likely the association.
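The alternating Newton updates can be sketched as below. The code assumes an objective made of a squared-error term, L2 terms weighted by σ/2, and similarity constraints weighted by ε/2 and written with graph Laplacians; the constant factors are assumptions. Because such an objective is quadratic in each u_i when everything else is fixed, a single Newton step lands on the exact row minimizer. The similarity matrices are assumed symmetric, no nonnegativity constraint is enforced, and the toy inputs and function names are hypothetical.

```python
import numpy as np

def objective(A, U, V, Sd, Sm, sigma, eps):
    """Assumed objective: squared error + L2 terms + similarity constraints."""
    fit = 0.5 * np.sum((A - U @ V.T) ** 2)
    reg = 0.5 * sigma * (np.sum(U ** 2) + np.sum(V ** 2))
    Ld = np.diag(Sd.sum(axis=1)) - Sd   # graph Laplacian of disease similarity
    Lm = np.diag(Sm.sum(axis=1)) - Sm   # graph Laplacian of miRNA similarity
    smooth = 0.5 * eps * (np.trace(U.T @ Ld @ U) + np.trace(V.T @ Lm @ V))
    return fit + reg + smooth

def update_rows(A, U, V, S, sigma, eps):
    """One Newton sweep over the rows of U with V fixed (updates U in place)."""
    gamma = U.shape[1]
    VtV = V.T @ V
    for i in range(U.shape[0]):
        w = S[i].copy()
        w[i] = 0.0  # a point is not its own neighbour
        hessian = VtV + (sigma + eps * w.sum()) * np.eye(gamma)
        rhs = A[i] @ V + eps * (w @ U)
        # solving hessian * u_i = rhs is the Newton step for a quadratic
        U[i] = np.linalg.solve(hessian, rhs)
    return U

def scmf(A, Sd, Sm, gamma=2, sigma=4.0, eps=1.0, n_iter=50, seed=0):
    """Alternate row updates of U and V; return the score matrix and the loss trace."""
    rng = np.random.default_rng(seed)
    U = rng.random((A.shape[0], gamma))
    V = rng.random((A.shape[1], gamma))
    history = []
    for _ in range(n_iter):
        update_rows(A, U, V, Sd, sigma, eps)
        update_rows(A.T, V, U, Sm, sigma, eps)
        history.append(objective(A, U, V, Sd, Sm, sigma, eps))
    return U @ V.T, history

# Toy inputs: 4 diseases, 3 miRNAs, hand-made symmetric similarities
A = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]], dtype=float)
Sd = np.array([[1.0, 0.8, 0.1, 0.2],
               [0.8, 1.0, 0.1, 0.1],
               [0.1, 0.1, 1.0, 0.7],
               [0.2, 0.1, 0.7, 1.0]])
Sm = np.array([[1.0, 0.2, 0.5],
               [0.2, 1.0, 0.4],
               [0.5, 0.4, 1.0]])
scores, history = scmf(A, Sd, Sm)
```

The defaults σ = 4 and ε = 1 mirror the values 2^2 and 2^0 reported in the parameter analysis; γ = 2 is chosen only to suit the toy matrix.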

Parameters optimization
In this section, the parameters γ, σ and ε are analyzed quantitatively to study their effect on prediction performance. γ represents the dimension of diseases and miRNAs in the low-rank space; since γ < min(nd, nm), it can be expressed as a percentage of min(nd, nm). The parameters σ and ε are the regularization parameters. The AUC value of 5-CV is used to evaluate the influence of the parameter choice on model performance. After extensive test experiments, we concluded that the effect of γ on performance is independent of the other two parameters. For this reason, we fixed σ and ε at a suitable combination to find the most suitable value of γ ∈ {0, 10%, ..., 1} in SCMFMDA. To verify the result, the test was repeated with σ and ε fixed at different combinations. Fig 3A shows that SCMFMDA obtained the best performance when γ = 50%. γ = 50% was then fixed so that the effect of the regularization parameters σ and ε could be evaluated clearly. We used all combinations of σ ∈ {2^-3, 2^-2, ..., 2^3} and ε ∈ {2^-3, 2^-2, ..., 2^3} to construct SCMFMDA. Fig 3B shows that SCMFMDA achieved the best AUC value of 0.9447 when σ = 2^2 and ε = 2^0. In summary, γ, σ and ε are set to 50%, 2^2 and 2^0 in our model, respectively.

Model comparison
To evaluate the prediction ability of SCMFMDA, we compared it with several previous computational methods that were proposed to predict unknown miRNA-disease associations. We used the same dataset (HMDD v2.0) to train these methods so that the comparison would be fair.
Based on the HMDD v2.0 database, which includes 5430 verified associations and 184155 unverified associations between 383 diseases and 495 miRNAs, global leave-one-out cross validation (global LOOCV) and five-fold cross validation (5-CV) were implemented to evaluate the prediction performance of these methods. In global LOOCV, each verified miRNA-disease association was used in turn as the test sample, while the other verified associations formed the training set. All unknown miRNA-disease associations were considered candidate samples. Similarly, in 5-CV, all verified miRNA-disease associations were randomly divided into five parts; each part was used in turn as the test set, and the other four parts formed the training set. Again, all unknown miRNA-disease associations were considered candidate samples. Under either global LOOCV or 5-CV, we applied SCMFMDA to obtain all predicted association scores so that the ranking of the test samples relative to the candidate samples could be calculated. A test sample ranked above a given threshold was regarded as a successful prediction. We then used the receiver operating characteristic (ROC) curve, obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at different thresholds, to evaluate the performance of SCMFMDA. The area under the ROC curve (AUC) of SCMFMDA, whose value lies between 0 and 1, was then calculated. Similarly, we obtained the AUCs of the other computational methods using the HMDD v2.0 data.
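Once the test samples and candidate samples have been scored, the ROC curve and AUC follow from sweeping a threshold over the ranked scores. The sketch below is a generic illustration with hypothetical labels and scores (1 for a held-out verified association, 0 for a candidate sample). It assumes there are no tied scores and is not the authors' evaluation script.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC points and AUC obtained by sweeping a threshold over the scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    labels = labels[order]
    tps = np.cumsum(labels == 1)  # true positives above each threshold
    fps = np.cumsum(labels == 0)  # false positives above each threshold
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    # trapezoidal area under the (FPR, TPR) curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Toy example: two test samples ranked among three candidate samples
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
fpr, tpr, auc = roc_auc(labels, scores)
```

Here five of the six (test, candidate) pairs are ranked correctly, which corresponds to an AUC of 5/6.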
Under global LOOCV, SCMFMDA, MSCHLMDA, ICFMDA and SACMDA achieved AUC values of 0.9675, 0.9287, 0.9072 and 0.8777, respectively (Fig 4). To reduce potential bias resulting from random sample division, we repeated the division of the verified miRNA-disease associations 100 times in 5-CV; the average AUC values of SCMFMDA, MSCHLMDA, ICFMDA and SACMDA were 0.9447, 0.9263, 0.9046 and 0.8773, respectively (Fig 5). The prediction performance of SCMFMDA was therefore better than that of the other methods.
To further assess the performance of SCMFMDA, we also compared it with other state-of-the-art matrix factorization-based methods, namely GRNMF, GRL2,1-NMF, NPCMF and KBMFMDA. The 5-CV results of all models are shown in Table 1, where SCMFMDA has the best AUC. The advantages of SCMFMDA over the other matrix factorization-based models are as follows: first, SCMFMDA uses more biological similarity data than the other models; second, SCMFMDA uses SNF instead of the traditional linear combination method to integrate the various similarity data, which helps preserve the completeness and effectiveness of the experimental data; third, L2 regularization and similarity constraint terms are added to the NMF objective function, which helps to correctly discover more unknown miRNA-disease connections.

Case studies
To demonstrate the effectiveness and accuracy of SCMFMDA, we carried out an evaluation experiment in this section. We chose two human diseases, colon neoplasms and lung neoplasms, to validate the performance of our method. Both diseases do great harm to human health. Colon neoplasms are malignancies that have been confirmed to be associated with several miRNAs [62,63]. Lung neoplasms are among the most dangerous malignancies, with the fastest increase in morbidity and mortality [12]. A growing body of evidence indicates that lung neoplasms are closely related to a number of miRNAs. For a specific disease, all verified associations in the HMDD v2.0 database are used as training samples, and the unverified associations involving that disease are treated as candidate samples. After training the model, we ranked the predicted association scores of the candidate samples and selected the top 50 candidate associations for the disease. We then used two databases, miR2Disease and dbDEMC v2.0, to check the ranked miRNAs. Tables 2 and 3 list the prediction results obtained via SCMFMDA for the two diseases, respectively. Of the top 50 miRNAs inferred by our model, 94% and 92% were confirmed to be associated with colon neoplasms and lung neoplasms, respectively, according to the miR2Disease and dbDEMC v2.0 databases. Only 3 and 4 of the top 50 predicted miRNAs for colon neoplasms and lung neoplasms, respectively, could not be confirmed in the databases.

Discussion and conclusion
In this paper, we introduced a new model named SCMFMDA, which uses a similarity constrained matrix factorization algorithm to predict possible miRNA-disease associations. To obtain rich disease similarity data and miRNA similarity data, the similarity network fusion algorithm is used to integrate various kinds of disease and miRNA biological information, respectively. In addition, L2 regularization terms and similarity constraint terms are added to the standard NMF to predict more unobserved miRNA-disease associations. Under global LOOCV and 5-CV, the AUCs of SCMFMDA reached 0.9675 and 0.9447, respectively, which indicates that our model improves on previous models. Furthermore, most of the predicted miRNAs related to colon neoplasms and lung neoplasms were confirmed by the experimental literature, which supports the reliability of the predictions. The following factors may contribute to the reliable performance of SCMFMDA. First, the similarity network fusion algorithm was applied to integrate different disease and miRNA similarities, which ensures the richness of the biological data used in the experiment. Second, the L2 regularization terms help to avoid overfitting. Third, the similarity constraint terms consist of disease feature-based similarity and miRNA feature-based similarity, which can make the model more robust.
However, several limitations may affect the performance of SCMFMDA. First, the model applies only to diseases and miRNAs that appear in the selected dataset and cannot make predictions for other diseases and miRNAs. In addition, we have no principled way to select the most suitable values for some important parameters in SCMFMDA other than testing all combinations. Therefore, we will continue to optimize our model to improve its performance in future work.

Supporting information S1