Prediction of disease-related metabolites using bi-random walks

Metabolites play a significant role in various complex human disease. The exploration of the relationship between metabolites and diseases can help us to better understand the underlying pathogenesis. Several network-based methods have been used to predict the association between metabolite and disease. However, some methods ignored hierarchical differences in disease network and failed to work in the absence of known metabolite-disease associations. This paper presents a bi-random walks based method for disease-related metabolites prediction, called MDBIRW. First of all, we reconstruct the disease similarity network and metabolite functional similarity network by integrating Gaussian Interaction Profile (GIP) kernel similarity of diseases and GIP kernel similarity of metabolites, respectively. Then, the bi-random walks algorithm is executed on the reconstructed disease similarity network and metabolite functional similarity network to predict potential disease-metabolite associations. At last, MDBIRW achieves reliable performance in leave-one-out cross validation (AUC of 0.910) and 5-fold cross validation (AUC of 0.924). The experimental results show that our method outperforms other existing methods for predicting disease-related metabolites.


Introduction
Metabolites play an important role in the maintenance, growth and reproduction of organisms, and are greatly helpful to illustrate the underlying molecular disease-causing mechanisms [1]. There is abundant evidence that diseases are always accompanied with changes in metabolite [2]. Hence, it is significant to identify abnormal metabolites for diagnosis and treatment of diseases [3].
As the development of molecular technology, many researchers have revealed the association between disease and other molecular products like gene, microRNA, circRNA, protein, etc [4][5][6]. Luo et al. used BIRW to predict the potential association between drug and disease [7]. Yan et al. developed the method DNRLMF-MDA by integrating disease similarity and miRNA similarity to predict disease-related miRNA based on dynamic neighbourhood regularized logistic matrix factorization [8]. In recent years, more and more researchers have been attracted to metabolite. Czech et al. used the method of gas and liquid chromatography-tandem mass spectrometry (GC-MS and LC-MS/MS) to analyze CSF samples in Alzheimer's a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 patients [9]. An integrated mass spectrometry approach was developed to research the new cerebrospinal fluid biomarkers of multiple sclerosis [10]. The contents of metabolites in the patients of Alzheimer's brain were studied in [11]. In 2010, Erika et al. developed a method to discover phenylbutyrate metabolites in patients with Huntington's disease [12]. Susan et al. integrated metabolomics and transcriptomes data to identify biomarkers for type 2 diabetes [13]. Baumgartner et al. proposed a novel network-based approach to identifying dynamic metabolic biomarkers in cardiovascular disease [14]. Previous research has shown that metabolites with similar functions are highly likely to be associated with the same or similar diseases [3]. Shang et al. proposed a method named PROFANCY to predict metabolites associated with disease based on metabolite functional similarity in metabolic pathways [15]. Hu et al. constructed a weighted metabolite association network for all the similarities of metabolite pairs, the random walk was utilized to predict metabolic markers of diseases [16]. Although some achievements has been made, there are still a lot of researches to do in the field of diseaserelated metabolites prediction. Considering that traditional RWR cannot fully combine the information of the metabolite network, disease network and disease-metabolite association network, and cannot predict the disease-related metabolites without known relationships. We apply bi-random walks algorithm to predict metabolite-disease associations by walking in disease network and metabolite network.
In this study, we utilize bi-random walks to identify disease-related metabolites. First, we compute disease semantic similarity and metabolite functional similarity, as well as create the Gaussian Interaction Profile kernel similarity for diseases and metabolites base on known metabolite-disease associations. Then, we integrate disease semantic similarity and disease GIP kernel similarity to construct disease similarity network. Similarly, metabolite functional similarity and metabolite GIP kernel similarity are integrated to construct metabolite similarity network. After that, Bi-random walks is used in two subnetworks to predict metabolite-disease associations. Finally, leave-one out cross validation, five-fold cross validation and case studies are used to assess the performance of our method. The experimental results illustrate that our method MDBIRW can effectively predict disease-related metabolites and show the superior performance compared to other competing methods.

Human metabolite-disease association
We downloaded the metabolites data and diseases data from Human Metabolome Database (HMDB) [17] and Human Disease Ontology (DO) [18], respectively (S1 File). 2262 metabolites, 216 diseases and 4537 metabolite-disease associations can be obtained after removing redundant associations. The set of metabolites are denoted by where m is the number of metabolites. Similarly, the set of diseases is denoted by D ¼ fd j g n j¼1 , where n is the number of diseases. And the adjacent matrix A indicates the metabolite-disease associations network. If there is a known association between disease d(i) and metabolite m(j), A (i, j) is equal to 1, otherwise 0.

Disease semantic similarity
In the MeSH database(https://meshb.nlm.nih.gov/) [19], every disease can be regarded as a node in Directed Acyclic Graph (DAG). Each MeSH descriptor displays a hierarchical DAG structure. For disease d which can be represented as DAG(d) = (d, T(d), E(d)), where T(d) is an ancestral set of disease d i and E(d) indicates the corresponding edges. The semantic score of disease d can be calculated as follows: where the disease t2T(d), Δ is semantic contribution decay factor and we set Δ = 0.5. The semantic value DV(d) of disease d is defined as follows: Then, the semantic similarity between d i and d j can be calculated as follows: where DV (d i ) and DV (d j ) indicate the value of the disease t associated with disease d i and d j .
Finally, we obtain the disease semantic similarity among all diseases, and symmetric matrix S d n�n indicates the disease semantic similarity network.

Metabolite functional similarity
Wang et al. proposed a method called MISIM [20]. In previous work, MISIM was used to calculate the similarity of micro-RNAs based on the similarity of related diseases. We apply the MISIM to compute the similarity of metabolites by using the related diseases semantic similarity. Here, we define d as a specific disease and D = {d 1 ,d 2 ,� � �,d k } represent a disease group. The similarity of disease d to group of diseases D can be calculated as follows: where k represents the number of D, S (d, D) represents the maximum similarity between one disease and a group of diseases. Afterwards, we can obtain the similarity of metabolites by the following formula: where D i and D j are two sets of diseases related to metabolite m i and m j , num i and num j represent the number of D i and D j , respectively. The symmetric matrix S m m�m indicates the metabolic functional similarity network.

Gaussian interaction profile kernel similarity for diseases and metabolites
Considering the assumption that the more common metabolites(diseases) of a disease(metabolite) pair has, the more similar they are. We utilize Gaussian Interaction Profile kernel similarity to calculate metabolite similarity and disease similarity based on the topologic information of known disease-metabolite associations.
In the disease-metabolite association network A, IP (m i ) represents the interaction profile for metabolite m i , which is a binary vector with size of n. If a disease is related to m i , the corresponding value of IP (m i ) is 1, otherwise 0. According to the interaction profiles, the Gaussian interaction profile kernel similarity matrix for metabolite GS m can be calculated as follows: where m is the number of metabolites. λ d indicates the normalized kernel bandwidth, and can be updated by a new normalized bandwidth l 0 d . According to previous relevant research, we set l 0 d ¼ 1 [21]. Similarly, we can compute the Gaussian interaction profile kernel similarity matrix for diseases GS d as follows: where l 0 m is also set as 1, n is the number of diseases.

Reconstruction of disease similarity network and metabolite similarity network
In this section, we reconstruct disease similarity and metabolite similarity. A disease similarity network can be reconstructed based on the disease semantic similarities and gaussian interaction profile kernel similarity of disease. We define the disease similarity network DS on the basis of matrix S d and GS d as follows: where DS(i,j) is the final disease similarity value of disease i and disease j. When the disease semantic similarity S d (i,j) = 0, we replace S d (i,j) with GS d (i,j). Otherwise, we hypothesize that the disease semantic similarity is as important as the Gaussian Interaction Profile Kernel Similarity of disease. Similarly, metabolite similarity network MS can be reconstructed by S m and GS m , the final metabolite similarity network can be calculated as follows: where MS(i,j) represents the similarity value between metabolite i and metabolite j.

Bi-Random walks on heterogeneous network
In this study, we propose a novel method to predict metabolite-disease associations. With disease similarity network, metabolite similarity network and known disease-metabolite network, we create a Heterogeneous network, including two types of nodes and three types of edges among them. Fig 1 is an example of the heterogeneous network. The upper sub-network is a metabolite similarity network, and the lower sub-network is a disease similarity network. The middle sub-network is a bipartite graph of metabolite-disease relationship. Supposing m 1 and m 4 have a high similarity value, and m 1 has a known association with d 2 . In order to predict the association between m 4 and d 2 , we can take m 4 as starting node for random walker, which jump from m 4 to m 1 and then to d 2 through the edge that connect to m 1 and d 2 . we also can take d 2 as starting node for random walker, which jump from d 2 to d 4 and the to m 4 . These two ways both can obtain the associated probability between m 4 and d 2 , the former firstly finds out the most similar intermediate metabolites based on the similarity of metabolites, and then calculates the associated probability between intermediate metabolites and corresponding diseases based on intermediate metabolites. Using bi-random walks algorithm [22,23] can achieve forecast by walking in metabolite subnetwork and disease subnetwork. The associated probability of arbitrarily metabolite-disease can be calculate by bi-random walk. Fig 2 shows the workflow of MDBIRW for predicting disease-related metabolite. Bi-random walks is utilized to evaluate potential metabolite-disease association, the association probability of metabolite-disease pair without known association record would be computed based on the steady state of the random walk process. During the random walk, metabolite subnetwork and disease subnetwork have different walking steps. Different walking steps can better obtain information of direct or undirect nodes in different networks. Hence, we define l, r as the numbers of maximal iterations in the metabolite subnetwork and disease subnetwork. The process of bi-random walks is described as follows: Here, α represents the decay factor with ranges from 0 to 1. RM t (i,j) and RD t (i,j) denote the probability of walking on metabolite similarity network and disease similarity network, respectively. MDBIRW can eliminate bias caused by topological and structural characteristics of the different networks by adjusting the number of walking steps of metabolite subnetwork and disease subnetwork. The pseudocode of MDBIRW algorithm is shown in Algorithm 1.

Parameter analysis
Three parameters l, r, and α are probed in MDBIRW. The parameter α is decay factor, the range of α is {0.3,0.5,0.7,0.9}. l and r control the iteration steps of two subnetwork, and choose the two parameters from {1,2,3,4,5}. If l > r means the random walker prefer to walk in metabolite network, vice versa. The analysis results of parameters are shown as Table 1   We explore the influences of l and r by using grid search method. From Table 1, we can conclusion that the maximum iteration steps l and r should not exceed 4. The AUC values on the diagonal is almost always higher than the rest values of its row and column. In other words, the optimal AUC value will be obtained when the maximum iteration steps of metabolite network and disease network are equal. Therefore, in our study, the optimal parameters are set that α = 0.3,l = 3,r = 3.

Performance of MDBIRW
Leave one out cross-validation (LOOCV) only take one sample as test set and the remains are used as training data. In our study, there are 2262 metabolites, coupled with 216 diseases and 4537 metabolite-disease associations. Therefore, we need to execute LOOCV program 4537 times. At each round, one corresponding known metabolite-disease association should be converted to unknown as test sample and the rest of known metabolite-disease association be used to as training samples. After execute bi-random walks with LOOCV, predicted results will be obtained.
Five-fold cross-validation (FFCV) is also utilized to evaluate the performance of our method. In FFCV, 4537 metabolite-disease associations were randomly divided into 5 groups. For each execution, one group is used as test set while 4 groups are used as training sets [24].
Receiver Operating Characteristic (ROC) curve is also called sensitivity curve, which using false positive rate and true positive rate as horizontal axis and vertical axis, respectively. The area of under the ROC curve is AUC value. The higher AUC value is, the better performance will be. In our study, the number of negative samples is more than the number of positive samples. Hence, we randomly select as many negative samples as positive samples. We arrange the final predicted values in descending order and calculate false positive rate and true positive rate by setting thresholds. Finally, the true positive rate (TPR) and false positive rate (FPR) for each threshold can be computed as follows: where TP and TN represent the number of positive samples and negative samples that can be correctly identified, and FP and FN are the number of the positive samples and negative samples that cannot be correctly identified, respectively. Precision-Recall (PR) curve utilizes recall and precision as horizontal axis and vertical axis of PR curve. The area under precision-recall curve (AUPR) is to evaluate the performance of our method by considering the precision and recall. Different precision-recall pairs will be obtained by setting different thresholds. Precision and recall can be calculated as follows: Recall where TP indicates the number of real identified positive samples. FP and FN respectively represent the number of negative samples that are incorrectly labelled as positive samples and the number of positive samples that are incorrectly labelled as negative samples. According to our predicted result, the result of LOOCV is 0.903 and FFCV is 0.924, which confirms the superior performance of our method. Fig 4 shows the comparison result of LOOCV. MERWMDA [25] applied the maximum entropy theory to the random walk and revealed potential disease-miRNA associations on the heterogeneous network. RWR [26], the traditional random walk with restart algorithm, starting from any node and it have two choices in each step: randomly moving to neighbor nodes with (1−α) or returning to start node with probability α. MERWMDA and RWR are utilized as comparison methods to verify the performance of our method. Fig 5 shows the comparison result of MDBIRW, MERWMDA and RWR in FFCV. ROC and PR curves are plotted to evaluate the performance of our method. We use same number of positive and negative samples, the trend of these two curves is similar.

Case studies
We chose obesity, colorectal cancer and Alzheimer's disease as case studies. For each disease, we removed all known metabolite-disease associations and performed MDBIRW to obtain predicted scores. According the experimental prediction score (from high to low), we obtain the top 10 metabolites related to disease. Next, we mined biomedical literature from the National Center for Biotechnology information (NCBI, https://www.ncbi.nlm.nih.gov/) database and manually checked these metabolites. As a result, 8 out of 10, 9 out of 10 and 10 out of 10 predicted obesity, colorectal cancer and Alzheimer's disease be validated, respectively.
Obesity has become a major public health problem around the world, the prevalence rate of which is rising in almost all countries. There is growing evidence that obesity is linked to metabolite. As shown in Table 2, by executing bi-random walks to identify underlying metabolites with obesity, nine of top 10 identified metabolites have been validated. The change of L-Phenylalanine in obese men suggested the early changes in obesity in young men [27]. The levels of Cholesterol is relatively high in obese young men has been already verified [27]. Zhao et al. has discovered that the levels of glycine have significant weight at baseline during five years [28]. L-Tryptophan and L-Tyrosine are abnormally expressed in obese children [28]. The changes of L-Arginine and L-Histidine were positively with obese parameter [29]. Central adiposity is associated with creatine changes, which has been found by Kaur et al [30]. 5-Hydroxyindole acetic acid and L-Alanine have not been confirmed link to obesity in human.
The incidence of colorectal cancer (CRC) is second only to gastric cancer, esophageal cancer and primary liver cancer [31]. In recent years, the incidence of colorectal cancer in adolescents and young adults is higher. Endogenous metabolites have verified it have great potential in the early diagnosis and personalized treatment of CRC [32]. Using our method to predict metabolites with CRC, and sorting the score of results in descending order. 9 out of 10 metabolites are confirmed, which is described in Table 3. Prediction of disease-related metabolites using bi-random walks Alzheimer's disease (AD), the most common cause of dementia, is a health problem that attracts increasing global attention and has a huge impact on human health [33]. Researchers developed various methods to identify AD related metabolites, for instance, using capillary electrophoresis-mass spectrometry to identify 9 metabolites are disease progression biomarkers [34]. Abnormal phospholipid metabolism is likely to lead to AD, and abnormal levels of metabolism is utilized to study AD by combining metabolomic-profiling approach [35]. 10 out of 10 predicted AD related metabolite were confirmed, as shown in Table 4.

Conclusions
There is increasing evidence that metabolites play an important role in the prediction, diagnosis and treatment of many complex diseases. In this paper, MDBIRW be used to predict the latent associations between metabolite and disease. The experimental results and case studies illustrate that the performance of MDBIRW is superior to that of other methods. The effective performance of MDBIRW mainly due to following factors. Firstly, the semantic disease similarity, metabolite functional similarity and Gaussian interaction profile kernel similarity were  Table 2. Top 10 metabolites of obesity identified by bi-random walks method.

Metabolite Names HMDB ID Evidences
integrated. Secondly, by controlling the number of iterative steps in metabolite network and disease network, MDBIRW can make better use of the hierarchical information of the nodes in two subnetworks to achieve a higher prediction accuracy. This method still has some limitations needing to be improved in future research. First, gaussian interaction profile kernel similarity of diseases and metabolites Overreliance on known metabolite-disease association, resulting in biased similarity calculations. For this problem, different data can be used to compute the similarities of diseases and metabolites such as GO data. In addition, we only used single data source, disease and metabolic data. Complex diseases are commonly caused by the interaction of multi-omics, thus, we will combine other omics data to improve prediction performance in the following study.
Supporting information S1 File. The data file of metabolite-disease associations. (XLS)