Abstract
Machine learning techniques and computer-aided methods are now widely used in the pre-discovery stage of drug discovery, improving the efficiency of drug development and reducing workload and cost. In this study, we used multi-source heterogeneous network information to build a network model, learned the network topology through multiple network diffusion algorithms, and obtained compressed low-dimensional feature vectors for predicting drug–target interactions (DTIs). We applied the Metropolis–Hastings random walk (MHRW) algorithm to improve the performance of the random walk with restart (RWR) algorithm. Building on this, we applied the improved Metropolis–Hastings random walk (IMRWR) algorithm, which removes the self-loop probability of the current node, improving the propagation efficiency of the MHRW and facilitating deep sampling of the network. Finally, we proposed the ISLRWR algorithm, which corrects the transfer probabilities of the entire network after increasing the self-loop rate of isolated nodes. Notably, in DTI prediction, the ISLRWR algorithm improved the area under the receiver operating characteristic curve (AUROC) by 7.53% and 5.72%, and the area under the precision-recall curve (AUPRC) by 5.95% and 4.19%, compared to the RWR and MHRW algorithms, respectively. Moreover, after excluding the interference of homologous proteins (popular drugs or targets may lead to inflated prediction results), the ISLRWR algorithm still showed a significant performance improvement.
Citation: Sun L, Yin Z, Lu L (2025) ISLRWR: A network diffusion algorithm for drug–target interactions prediction. PLoS ONE 20(1): e0302281. https://doi.org/10.1371/journal.pone.0302281
Editor: Tao Huang, Chinese Academy of Sciences, CHINA
Received: December 15, 2023; Accepted: April 1, 2024; Published: January 30, 2025
Copyright: © 2025 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The results in this article were computed on Ubuntu 20.04. Sample data and code are publicly available on the figshare platform and can be downloaded and used free of charge. URL: https://doi.org/10.6084/m9.figshare.25194281.v1. DOI: 10.6084/m9.figshare.25194281.
Funding: This work was supported by National Natural Science Foundation of China (No:62072296). Z.Y. was supported by National Natural Science Foundation of China (No:62072296). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Drug–target interactions (DTIs) prediction is essential for discovering new drugs and potential targets. Traditional methods for developing new drugs require several chemical and clinical trials [1], with a long R&D cycle and massive capital consumption. The process from lead identification to clinical trials takes at least 12 years, with an R&D cost of up to 1–1.8 billion USD [2, 3]. Therefore, many scholars have proposed using machine learning and computer-aided tools to replace the traditional pre-discovery work of drug R&D to improve efficiency and reduce costs [4–6].
Computer technologies such as machine learning are widely used in various aspects of drug screening [7, 8]. They remove the need to perform many repetitive tasks [9] and provide direction and scope for otherwise undirected drug experiments. The DTIs prediction task is a critical and limiting step for discovering potential targets and new drugs [10, 11]. In addition, drug–drug interactions (DDIs) [12], protein–protein interactions (PPIs) [13], miRNA–disease correlation prediction [14, 15], and drug–target binding affinity (DTA) prediction [16] are typical computer-aided drug screening tasks. DDIs are concerned with drug side effects and toxicity, where patients may experience life-threatening pharmacological toxicity or side effects with severe clinical consequences when taking multiple drugs [17, 18]. PPIs focus on the physiological functions of proteins and assist with understanding the interactions between cellular molecules and viral microorganisms [19]. The study of miRNA–disease interrelationships is essential to understanding human disease pathogenesis because abnormal expression and regulation of some miRNAs can lead to the development of certain diseases [15, 20]. DTA prediction focuses on continuous values of the strength of the interaction of a drug with a protein or a gene locus [21].
This study focuses on DTIs prediction because it is a more direct means of learning potential drug–target pairs [10, 11]. DDIs focus on drug characteristics, such as drug toxicity and side effects, which are directly significant in drug use and clinical application [12, 17, 18]. DDIs predictions are usually based on semantic relationships, such as knowledge graphs, in which the side effects, dosages, and indications of a drug are stored in a network node. The contextual node information is used to predict whether toxicity will occur between drugs. This task usually concerns drugs that are already in use and does not involve target information, so it provides no direct information about undiscovered drug–target pairs. However, the goal of this study was to predict and discover known and unknown drug–target pairs, and the DDIs relationship can be used only as one aspect of calculating the similarity between drugs. PPIs focus on protein characteristics, such as the physiological functions of interactions between proteins, and can play an auxiliary role in discovering potential drugs [22]. PPIs are necessary for understanding most biological processes and are involved in most cellular activities. This task mainly uses sequence and structural information of proteins, without any drug information, which is insufficient to support drug discovery on its own. Therefore, PPIs are suitable only for providing one aspect of the protein similarity calculation necessary for drug discovery. In addition, DTA predicts continuous values of the binding strength between drugs and targets, representing a more in-depth and detailed study of the nature of drug action [16, 21]. However, its contribution to the discovery of potential drug–target pairs is limited.
DTA prediction is a regression task whose output is a continuous measure of drug–target binding. It is typically trained on known drug–target pairs and focuses on how strongly pairs that are known to interact bind. Drug–target pairs for which it is not yet known whether an interaction occurs are generally excluded from the training data for DTA prediction. However, this paper aims precisely to discover potential unknown drug–target pairs. Therefore, in this study, we use the DTIs prediction task, which provides direct information about potential drug–target pairs: a model is trained on known drug–target pairs and then used to predict potential unknown ones. DTA provides detailed information on known drug–target pairs, whereas DTIs prediction uses both known drug–target pairs and negative samples, i.e., drug–target pairs of potential screening value for which it is not yet certain whether an interaction exists. Thus, the DTIs prediction task directly supports the ultimate goal of this paper.
Current machine learning methods for solving DTIs prediction can be categorized into the following six groups [23]: similarity matrix-based methods, deep learning methods, feature-based methods, matrix decomposition methods, network-based methods, and hybrid methods. Similarity matrix-based methods calculate proximity using different similarity scores or distance calculation formulas [24, 25]. The pharmacological similarity of drugs and genomic similarity of proteins can also be used to design similarity scoring schemes. In recent years, deep learning methods have been frequently applied to predict DTIs [26–28]. Such methods have been shown to perform well on noisy data, such as high-dimensional data in drug repurposing. Deep learning methods use biological, physical, chemical, and network topology information of drugs and targets to generate feature vectors and train models using deep learning frameworks. However, deep learning models mostly lack stable and reliable negative samples during training, which significantly reduces prediction performance. Feature-based methods include tree-based methods, kernel-based methods, and support vector machines, which represent drug and protein sequences as fixed-length feature vectors by constructing a specific feature space. These methods then build various machine-learning models based on the feature representation to predict DTIs [29, 30]. Matrix decomposition methods decompose the original observation matrix into two lower-order matrices [31, 32]. They do not depend on the chemical or pharmacological similarity of the drug because they use collaborative filtering algorithms that minimize the error of point-by-point linear reconstruction of the dataset using low-rank embedded matrices.
Network models are typically built by constructing heterogeneous networks [33–35] that integrate drug information, protein target information, and known DTIs, assuming that similar drugs act on similar targets. Hybrid methods effectively combine the above five methods to improve the algorithmic capability and robustness while mitigating the flaws and drawbacks of a single method.
Notably, most recent advances in the drug–target interactions field are network-related. Chatterjee et al. [36] proposed the AI-Bind model for DTIs prediction, which is based on a network sampling strategy that incorporates information on chemical molecular formulae of drugs and amino acid sequences of proteins and uses unsupervised learning to train the model. AI-Bind uses direct structural information, which is advantageous for generalizing to unseen drug–target pairs, but its predictive performance is insufficient. Zeng et al. [37] proposed the deepDTnet model for DTIs prediction. deepDTnet is a deep learning method that embeds multiple types of heterogeneous information from chemical, genomic, phenotypic, and cellular networks. Shang et al. [38] proposed a multilayer network representation called MEDTI for learning DTIs. The method demonstrated excellent performance in integrating multiple types of data and managing network noise. Wan et al. [39] proposed a nonlinear end-to-end learning model called NeoDTI, which can integrate a wide range of heterogeneous information and is robust to hyperparameter selection. Li et al. [40] proposed an end-to-end collaborative contrast model called SGCL-DTI, which can generate contrast losses to guide model optimization in a supervised manner. Most of these models rely on large datasets; if the dataset is not large enough and does not cover enough information, the prediction performance is significantly reduced. Therefore, the superiority of these models is partly due to the superiority of the dataset. We applied these models to the same datasets as baselines for the algorithm proposed in this paper. In addition, we chose two knowledge graph link prediction models [41, 42] as baseline models.
The above models were chosen as baselines because they are classical models built on heterogeneous networks, knowledge graphs, and related representations, and are widely used by other researchers for discussion and comparison. Using them as baselines therefore gives an objective comparison.
This study uses a network method that mixes similarity computation and feature representation. The feature extraction and compression method is derived from an article published in Nature Communications by Luo et al. [43] in 2017, which provides a framework for extracting compressed low-dimensional feature vectors from heterogeneous network information. In this work, we improve the network diffusion algorithm used in the feature extraction and network learning process of that framework, thereby improving the overall predictive efficacy of the model. In this paper, the random walk with restart (RWR) [44] was used to learn the structure of the network, and the Metropolis–Hastings random walk (MHRW) [45] algorithm was applied to improve the network diffusion performance.
The MHRW algorithm draws on the Metropolis–Hastings process, which changes the traditional random walk algorithm’s strategy of moving to each neighboring node with equal probability. The wandering particles generate different transfer probabilities for different network structures, which increases the comprehensiveness and accuracy of the algorithm in learning network structural features. We also applied the improved Metropolis–Hastings random walk (IMRWR) [46] algorithm to improve the performance of the MHRW. The algorithm removes the self-loop rate of the node so that the random walk particles are bound to walk to the next node in each iteration and will not stay in the same place. This improves the efficiency of network diffusion and promotes deep sampling. In this study, we propose a new network diffusion strategy that adds the self-loop probability of isolated nodes, so that the wandering particles are more likely to perceive the isolated nodes rather than ignore them and are not trapped in dead ends. This method is named ISLRWR and is found to have excellent predictive performance in AUROC and AUPRC.
The features and advantages of this research can be summarized as follows.
- (1) A combination of drug–drug, drug–disease, drug–side effect, protein–protein, protein–disease, and other heterogeneous network information is integrated as the original information for DTIs prediction.
- (2) The MHRW and IMRWR algorithms were applied for DTIs network diffusion, and the performance was significantly improved compared to that of the RWR algorithm.
- (3) The ISLRWR algorithm is proposed, and its performance is significantly improved based on the original model.
- (4) With deepDTnet, MEDTI, NeoDTI, SGCL-DTI, DistMult, and ComplEx as baseline models, we used the same datasets to compare the DTIs prediction performance of ISLRWR and the baseline models. We found that ISLRWR performed well on a variety of evaluation indicators.
Materials and methods
Data presentation
The data used in this study consists of two main parts: drugs and proteins. The drug portion includes drug structural similarity, drug-related diseases, drug side effects, and drug–drug toxicity. The protein portion includes protein genomics similarity, protein-related diseases, and protein–protein interactions. In order to increase the objectivity of model performance comparison, we integrated two datasets for model training and evaluation, dataset A and dataset B, respectively. Drug data for dataset A were obtained from the Drugbank(3.0) database [47], the protein data were obtained from the HPRD database [48], the disease data were obtained from the CTD(2013) [49], and the side effect data were obtained from the SIDER (2.0) database [50]. In addition, drug data for dataset B were obtained from Drugbank(5.0) database [51], the protein data were obtained from the UniProt database [52], the disease data were obtained from the CTD(2021) database [53], and the side effect data were obtained from the SIDER (4.0) database [54]. In summary, we used four types of nodes (drug, disease, side-effect, and protein) and six types of edges (drug–drug, drug–disease, drug–side effect, drug–protein, protein–disease, and protein–protein) to construct the heterogeneous network. The node and edge statistics of dataset A and dataset B are presented in Tables 1 and 2.
Overview of the model
The network model constructed for DTIs prediction can be decomposed into three parts: (1) constructing the heterogeneous network and importing the information; (2) calculating the drug similarity and protein similarity; and (3) extracting the compressed feature vectors through the network model, which uses random walk with restart as the diffusion algorithm to learn the topology of the network.
Fig 1 illustrates the framework of this paper. For the data part, we used two datasets, each consisting of four node types and six edge types. The source databases of the node and edge data are labeled in Fig 1. The heterogeneous networks comprise five networks: three drug-related (drug–drug, drug–side effect, and drug–disease) and two protein-related (protein–protein and protein–disease). Similarity matrices were calculated separately for drugs and proteins. In addition to the similarity information extracted from the heterogeneous networks, we also supplemented the drug structural similarity and the protein genomic similarity. The feature extraction uses network diffusion algorithms to learn the network topology and the method proposed by Luo et al. [43] to obtain low-dimensional compressed feature vectors. We used the original random walk with restart algorithm, RWR [44]; the MHRW algorithm, based on the Metropolis–Hastings process [45]; the IMRWR algorithm, with node self-loop rates removed [46]; the ISLRWR algorithm, which recalculates the transfer probabilities after adding the self-loop rates of the isolated nodes; and the ISLHRWR algorithm, which additionally recomputes the similarity matrices. We then compared the five diffusion algorithms experimentally. Note that the network diffusion algorithm is a specific component of the feature extraction stage, so in the framework diagram “Network diffusion algorithm” points to “Feature extraction”. The green shading in Fig 1 indicates the flow and use of drug data, and the pink shading indicates the flow and use of protein data.
The green circles represent drug molecules, pink squares represent protein molecules, orange triangles represent diseases, and red pentagons represent side effects. The green and gray tables represent the drug similarity matrix and the purple and blue tables represent the protein similarity matrix. Nd indicates the number of drug nodes, Np indicates the number of protein nodes, fd indicates the length of the drug feature, and fp indicates the length of the protein feature.
Similarity calculations
In this paper, drug similarity is calculated using three heterogeneous networks related to drugs and supplemented with a matrix based on the structural similarity of drugs. Drug–drug interactions are represented by an adjacency matrix in which each element indicates whether a pair of drugs undergoes a toxic effect or other chemical reaction; an element equal to 1 indicates an interaction. The drug similarity based on drug–drug interactions is calculated using the Jaccard similarity coefficient, which focuses on the number and proportion of (1,1) tuples in the two 0-1 vectors. Similarly, we used the drug–disease interaction network to obtain a drug similarity matrix with disease as the benchmark, where drugs acting on the same disease are considered to have some degree of similarity. Based on the drug–side effect interaction network, we obtained a similarity matrix that compares how far the side effects of two drugs overlap. An additional drug structural similarity matrix was added, giving four dimensions of drug-based similarity. We calculated the drug structural similarity from each drug’s SMILES (simplified molecular-input line-entry system) sequence, using Tanimoto scores [55].
Similarly, a similarity matrix was obtained based on protein–protein interactions, another similarity matrix was obtained using disease as a comparison principle through the protein–disease interaction network, and a protein genome similarity matrix was supplemented, such that there were three matrices for protein similarity. We calculated the protein genome similarity from the amino acid sequence of proteins. We calculated the normalized Smith-Waterman score [56] as the genomic similarity of the proteins.
The Jaccard similarity coefficients were calculated using Eq (1).
$$J(i,j) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} \qquad (1)$$
J(i,j) denotes the Jaccard similarity of drug i and drug j. M11 denotes the number of (1,1) tuples, i.e., positions at which the 0-1 vectors of drug i and drug j are both 1. M01 denotes the number of (0,1) tuples for drug i and drug j. M10 denotes the number of (1,0) tuples for drugs i and j. For example, in the drug–disease adjacency matrix, the row vector of drug1 is [1, 1, 1, 1, 1, 0, …, 0, 0, 0] of size 1 × 5603, and the row vector of drug2 is [1, 1, 0, 0, 1, 1, …, 0, 0, 0] of size 1 × 5603. The number of (1,1) tuples is 75, the number of (1,0) tuples is 272, and the number of (0,1) tuples is 47. The similarity between drug1 and drug2 is then calculated as J(1,2) = 75/(47 + 272 + 75) = 75/394 ≈ 0.19. The computational process is shown in Fig 2.
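The Jaccard calculation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the function name `jaccard_similarity` and the two synthetic row vectors are assumptions, constructed only so that the tuple counts match the worked example (75, 272, and 47 over 5603 columns).

```python
import numpy as np

def jaccard_similarity(adj: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard similarity between the rows of a 0-1 matrix."""
    a = (np.asarray(adj) > 0).astype(np.int64)
    m11 = a @ a.T                                  # (1,1) tuple counts for every row pair
    ones = a.sum(axis=1)
    union = ones[:, None] + ones[None, :] - m11    # equals M01 + M10 + M11
    sim = np.zeros(m11.shape, dtype=float)
    np.divide(m11, union, out=sim, where=union > 0)  # rows with no 1s keep similarity 0
    return sim

# Synthetic vectors (hypothetical data) that reproduce the counts in the example.
drug1 = np.zeros(5603, dtype=int)
drug2 = np.zeros(5603, dtype=int)
drug1[:347] = 1          # 75 shared + 272 only in drug1
drug2[:75] = 1           # the 75 shared positions
drug2[347:394] = 1       # 47 positions only in drug2

sim = jaccard_similarity(np.vstack([drug1, drug2]))
jaccard_12 = sim[0, 1]   # 75 / 394
```

Computing all pairs through one matrix product avoids an explicit double loop over drugs, which matters when the same routine is reused for the drug–disease, drug–side effect, and protein–disease networks.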
Low-dimensional feature extraction
This network feature extraction method [43] can handle highly noisy, incomplete, and large-scale high-dimensional biological data to obtain a low-dimensional and informative vector representation. The method learns contextual information in a single network and topological properties in multiple networks. Based on the obtained low-dimensional vector representation, the best projection from drug space to target space can be found, and new DTIs are predicted based on the geometric proximity of the projected vectors.
As shown in Fig 1, the feature matrices X and Y of the drugs and proteins are learned, where Nd denotes the number of drug nodes, Np denotes the number of protein nodes, fd denotes the length of the drug features, and fp denotes the length of the protein features. The optimal projection matrix Z is then found by minimizing the difference between the interaction matrix and $XZY^{T}$. The resulting low-dimensional feature vectors encode relational attributes (e.g., similarity), association information, and the topological context of the drug and protein nodes.
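To make the projection step concrete, the sketch below fits Z by plain unregularized least squares on the Frobenius norm of the difference between the interaction matrix and X Z Yᵀ. This is an assumption made for illustration: the solver used by Luo et al. [43] may differ (for example, by adding regularization), and the names `fit_projection` and `score_pairs` as well as the random matrices are hypothetical.

```python
import numpy as np

def fit_projection(X: np.ndarray, Y: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Minimum-norm least-squares Z for min ||A - X Z Y^T||_F.

    X: (Nd, fd) drug features, Y: (Np, fp) protein features,
    A: (Nd, Np) 0-1 interaction matrix. Returns Z with shape (fd, fp).
    """
    return np.linalg.pinv(X) @ A @ np.linalg.pinv(Y.T)

def score_pairs(X: np.ndarray, Y: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Predicted interaction score for every drug-target pair."""
    return X @ Z @ Y.T

# Sanity check on synthetic data: an exactly low-rank A is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # Nd = 6, fd = 3
Y = rng.normal(size=(5, 2))        # Np = 5, fp = 2
Z_true = rng.normal(size=(3, 2))
A = X @ Z_true @ Y.T
Z_hat = fit_projection(X, Y, A)
scores = score_pairs(X, Y, Z_hat)
```

Higher entries of `scores` correspond to drug–target pairs whose projected drug vector lies closer to the target vector, which is the geometric-proximity criterion described above.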
Network diffusion algorithm
The diffusion state of the network is learned using the random walk with restart algorithm, which obtains a low-dimensional feature vector representation of each node by minimizing the difference between the diffusion state and a parameterized multinomial logistic model. The transfer probability of the diffusion state of the network is first computed using the original RWR [44]. In Eq (2), $K(i)$ is the degree of node $i$, $P^{RWR}$ is the transfer probability matrix calculated by the RWR algorithm, $\Gamma(i)$ denotes the set of neighboring nodes to which node $i$ is connected, and $v_j$ denotes the next node.
$$P^{RWR}_{ij} = \begin{cases} \dfrac{1}{K(i)}, & v_j \in \Gamma(i) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
Then, assuming that the probability of a particle randomly going to the next node is $c$, and the probability of returning to the initial node is $1 - c$, the probability vector of a particle reaching each node in the network is given by Eq (3).
$$\pi_i(t+1) = c\,\bigl(P^{RWR}\bigr)^{T}\pi_i(t) + (1 - c)\,e_i \qquad (3)$$
In Eq (3), $e_i$ is the start vector, whose entry is 1 for the start node $i$ and 0 otherwise. $\pi_i$ is a column vector: $\pi_i(t)$ holds the probabilities of the particle started at node $i$ being at each node at moment $t$, and $\pi_i(t+1)$ holds the same probabilities at moment $t + 1$. The steady-state solution of Eq (3) is Eq (4), where $I$ is the identity matrix.
$$\pi_i = (1 - c)\,\bigl(I - c\,\bigl(P^{RWR}\bigr)^{T}\bigr)^{-1} e_i \qquad (4)$$
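The RWR update and its steady state can be checked numerically. The sketch below assumes the standard formulation with a row-stochastic transfer matrix P (entries 1/K(i) for neighbors), a column probability vector, move probability c, and restart probability 1 − c. The function names and the small path graph are hypothetical, and the default c = 0.5 is an arbitrary illustrative value, not one taken from the paper.

```python
import numpy as np

def rwr_transition(adj) -> np.ndarray:
    """Row-stochastic matrix with P[i, j] = 1/K(i) when j is a neighbor of i."""
    adj = (np.asarray(adj) > 0).astype(float)
    deg = adj.sum(axis=1, keepdims=True)
    P = np.zeros_like(adj)
    np.divide(adj, deg, out=P, where=deg > 0)   # isolated nodes keep an all-zero row
    return P

def rwr_iterate(P, start, c=0.5, tol=1e-10, max_iter=1000) -> np.ndarray:
    """Iterate pi(t+1) = c * P^T pi(t) + (1 - c) * e until the L1 change is below tol."""
    e = np.zeros(P.shape[0])
    e[start] = 1.0
    pi = e.copy()
    for _ in range(max_iter):
        nxt = c * (P.T @ pi) + (1 - c) * e
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

def rwr_closed_form(P, start, c=0.5) -> np.ndarray:
    """Steady state pi = (1 - c) * (I - c * P^T)^{-1} e, solved as a linear system."""
    n = P.shape[0]
    e = np.zeros(n)
    e[start] = 1.0
    return (1 - c) * np.linalg.solve(np.eye(n) - c * P.T, e)

# Path graph 0 - 1 - 2 - 3 (hypothetical toy network).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
P = rwr_transition(adj)
pi_iter = rwr_iterate(P, start=0)
pi_exact = rwr_closed_form(P, start=0)
```

Stacking the steady-state vectors for every start node gives the diffusion-state matrix from which the low-dimensional features are subsequently extracted.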
In addition, we used the MHRW algorithm [45], which borrows from the Metropolis–Hastings process; the transfer probability is calculated as in Eq (5).
$$P^{MHRW}_{ij} = \begin{cases} \min\left(\dfrac{1}{K(i)}, \dfrac{1}{K(j)}\right), & v_j \in \Gamma(i) \\ 1 - \sum\limits_{v_l \in \Gamma(i)} p_{il}, & j = i \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$
In Eq (5), $K(i)$ is the degree of node $i$, and $K(j)$ is the degree of node $j$. $P^{MHRW}$ is the transfer probability matrix calculated by the MHRW algorithm. $\Gamma(i)$ denotes the set of neighboring nodes to which node $i$ is connected. $v_j$ and $v_l$ denote nodes in $\Gamma(i)$. $p_{il}$ denotes the transfer probability from node $i$ to node $l$. Accordingly, we also applied the IMRWR [46] algorithm, whose transfer probability matrix is given by Eq (6).
(6)
Finally, we proposed the ISLRWR algorithm for recalculating the transfer probability of particles after adding the self-loop probability of isolated nodes. Eq (7) gives the transfer probability matrix calculated by the ISLRWR algorithm, where Γ(i) denotes the set of neighboring nodes to which node i is connected, vj denotes the next node, ni denotes the number of isolated nodes, and J(i, j) denotes the similarity between node i and node j.
(7)
The evolution of the above random walk diffusion algorithm is shown in Fig 3, which visualizes the optimization of the transfer probability and the change in the sampling strategy of the MHRW algorithm compared to the RWR algorithm. The IMRWR removes the self-loop probability of the current node compared to the MHRW algorithm, which ensures that the wandering particles transfer at each step and do not stay in the same place. The ISLRWR adds the self-loop probability of isolated nodes and then recalculates the current node’s transfer probability, ensuring that the row sum of the matrix of transfer probabilities is 1.
The red arrow in the figure indicates that the particle goes to the next step, the green arrow indicates that the particle returns to the previous step, and the pink arrow indicates that the particle stays at the current node.
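The difference between the MHRW sampling strategy and its self-loop-free variant can be illustrated on a small graph. The sketch below uses the standard Metropolis–Hastings random walk rule (move to neighbor j with probability min(1/K(i), 1/K(j)) and stay put with the leftover probability). The function `drop_self_loops` is only an assumed stand-in for the IMRWR step: it zeroes the diagonal and renormalizes each row, which matches the verbal description of removing the self-loop while keeping row sums at 1, but it is not confirmed to be the exact formula of [46]. The ISLRWR correction is not reproduced here.

```python
import numpy as np

def mhrw_transition(adj) -> np.ndarray:
    """Standard MHRW matrix for a symmetric 0-1 adjacency matrix without self-loops."""
    adj = np.asarray(adj) > 0
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in np.flatnonzero(adj[i]):
            if j != i:
                P[i, j] = min(1.0 / deg[i], 1.0 / deg[j])
        P[i, i] = 1.0 - P[i].sum()   # leftover probability: the particle stays at node i
    return P

def drop_self_loops(P: np.ndarray) -> np.ndarray:
    """Assumed IMRWR-style step: remove self-loops, then renormalize each row."""
    Q = P.copy()
    np.fill_diagonal(Q, 0.0)
    rows = Q.sum(axis=1, keepdims=True)
    out = np.zeros_like(Q)
    np.divide(Q, rows, out=out, where=rows > 0)
    return out

# Star graph: node 0 is connected to leaves 1, 2, and 3 (hypothetical toy network).
adj = np.array([[0, 1, 1, 1],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0]])
P_mhrw = mhrw_transition(adj)
P_noloop = drop_self_loops(P_mhrw)
```

On this star graph each leaf stays in place with probability 2/3 under MHRW, whereas after the self-loop is removed the particle always moves back to the hub, which illustrates why removing self-loops forces a transfer at every step.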
Baseline models and evaluation metrics
We chose MEDTI, deepDTnet, SGCL-DTI, DistMult, ComplEx, and NeoDTI as the baseline models because these models have attracted much attention in recent years in the research of related fields and not only have gained extensive discussion and high evaluation but also their reliability has been fully verified. They have demonstrated remarkable innovativeness at the technical level, providing new perspectives and solutions for research in the field of DTIs prediction by utilizing advanced algorithms and unique architectures. Therefore, we select these models as a baseline for comparison, aiming to assess the performance of the models proposed in this paper more comprehensively through in-depth comparison and analysis so as to reveal their strengths and weaknesses and provide valuable references and insights for future research work.
We used six evaluation metrics to comprehensively assess the performance of the model: precision score, recall score, F1 score, Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). In binary classification problems, we can obtain counts of true negatives (TN), false negatives (FN), true positives (TP), and false positives (FP). The precision score is the ratio TP/(TP + FP), which reflects the ability of the classifier to avoid incorrectly labeling negative samples as positive. The recall score is the ratio TP/(TP + FN), which measures the classifier’s ability to find all positive samples. The F1 score is the harmonic mean of precision and recall, calculated as F1 = 2 * (precision * recall)/(precision + recall). In addition, MCC measures the quality of a binary classification, ranges from -1 to +1, and is usually considered a balanced measure. Except for MCC, the values of the evaluation metrics lie between 0 and 1.
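The count-based metrics above can be computed directly from the four confusion-matrix entries. The sketch below follows the formulas given in the text for precision, recall, and F1, and uses the standard textbook definition of MCC, which the text does not spell out. The function name and the example counts are hypothetical.

```python
import math

def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Standard MCC definition; set to 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts for a balanced test fold.
metrics = binary_metrics(tp=80, fp=20, tn=70, fn=30)
```

AUROC and AUPRC are threshold-free and need the ranked prediction scores rather than a single confusion matrix, so they are not covered by this helper.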
Results
Evaluation of the drug-target interaction predicting capabilities
In this study, ten-fold cross-validation was used to train and evaluate the model by randomly dividing all known DTIs (known positive samples) into ten equal portions, selecting one of them at a time, and randomly sampling the same number of non-interacting drug–target pairs (the same number of negative samples generated) as the test set. The remaining 90% of the positive samples with known interactions were used as the training set, along with the same number of negative samples generated.
In the DTIs prediction task, drug–target pairs with known interactions were available from databases. We obtained samples with unknown interactions by randomly pairing entries from the drug list and the protein list. We assumed that drug–target pairs obtained by random sampling do not interact unless the interaction had been reported in the literature or stored in a database. The approach taken in this study was to assign label 0 to drug–target pairs obtained by random matching and label 1 to drug–target pairs with known interactions obtained from the database. Note that the number of samples with label 0 was the same as the number of samples with label 1. If a sampled negative (label 0) was found to be already present in the positive list, we discarded it and resampled until the numbers of negative and positive (label 1) samples were equal.
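The sampling procedure described above can be sketched as a rejection-sampling loop followed by a ten-way split. This is an illustrative sketch under stated assumptions: pairs are represented as (drug index, target index) tuples, negatives are additionally kept distinct from one another (the text only requires that they not appear in the positive list), and the function names, seed, and toy positives are hypothetical.

```python
import random

def sample_negatives(positives, n_drugs, n_targets, seed=0):
    """Draw as many random drug-target pairs as there are positives, rejecting known positives."""
    pos = set(positives)
    if 2 * len(pos) > n_drugs * n_targets:
        raise ValueError("not enough unlabeled pairs to match the number of positives")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(pos):
        pair = (rng.randrange(n_drugs), rng.randrange(n_targets))
        if pair not in pos:          # discard and resample if the pair is a known positive
            negatives.add(pair)
    return sorted(negatives)

def split_folds(pairs, k=10, seed=0):
    """Shuffle the pairs and deal them into k roughly equal folds."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Toy positives (hypothetical): 20 known interactions among 10 drugs and 8 targets.
positives = [(d, (3 * d + s) % 8) for d in range(10) for s in (0, 1)]
negatives = sample_negatives(positives, n_drugs=10, n_targets=8)
pos_folds = split_folds(positives)
neg_folds = split_folds(negatives, seed=1)
```

In each round, one positive fold and one negative fold would form the test set, and the remaining nine folds of each would form the training set.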
We calculated the mean and standard deviation of the AUROC and AUPRC over the ten cross-validation folds, as shown in Table 3. To exclude the interference of homologous proteins, drug–target pairs with protein sequence identity scores of 0.4 or more, as well as drug similarity of 0.6 or more, were removed. Because popular and commonly used drugs and proteins are more likely to be predicted, which may lead to inflated and potentially redundant predictions, we deleted these data and reassessed the above methods. The improvements in AUROC and AUPRC for both cases are visualized in Fig 4 as a bar chart with standard-deviation error bars.
The graphs indicate the high and low values of the evaluation indicators. The darker the color the higher the value. The top short line indicates the standard deviation of the ten-fold cross-validation.
The MHRW algorithm improves the performance of learning the network structure and predicting the DTIs over the original RWR algorithm; the AUROC improves by 1.81%, and the AUPRC improves by 1.76%. In contrast, the IMRWR algorithm does not show a significant performance improvement. The ISLRWR algorithm improved AUROC by 5.72% and AUPRC by 4.19% compared with the MHRW algorithm. The ISLRWR algorithm improved AUROC by 7.53% and AUPRC by 5.95% compared with the RWR algorithm. This result showed that the prediction performance of the ISLRWR algorithm improved significantly. In addition, the ISLHRWR algorithm uses the Hamming similarity calculation method to recalculate the similarity matrix and then uses ISLRWR to learn the network structure to obtain better prediction performance. The ISLHRWR algorithm improves the AUROC and AUPRC by 2.11% and 1.93%, respectively, compared to the ISLRWR algorithm.
In addition, the performance improvement of the above algorithms persists after removing homologous proteins. The MHRW algorithm outperforms the RWR algorithm with a 2.48% improvement in AUROC and a 1.91% improvement in AUPRC. The IMRWR algorithm improves the AUROC and AUPRC by 0.54% and 0.58%, respectively, compared to the RWR algorithm. The ISLRWR algorithm improves the AUROC and AUPRC by 4.90% and 3.69%, respectively, compared to the MHRW algorithm, and by 7.38% and 5.60%, respectively, compared to the RWR algorithm. The ISLHRWR algorithm improves the AUROC and AUPRC by 2.31% and 2.66%, respectively, compared to the ISLRWR algorithm. The removal of homologous proteins alleviates the problem of inflated prediction results to some extent.
To comprehensively demonstrate the performance of ISLRWR in DTIs prediction, we selected MEDTI, deepDTnet, SGCL-DTI, DistMult, ComplEx, and NeoDTI as baseline models for comparison with ISLRWR. We chose AUROC, AUPRC, precision score, recall score, F1 score, and MCC as evaluation indicators. We compared ISLRWR-DTI with the baseline models on dataset A and found that ISLRWR performed the best on AUROC, AUPRC, precision score, F1 score, and MCC. Regarding recall score, however, NeoDTI was better than ISLRWR. We also compared ISLRWR-DTI with the baseline models on dataset B. We found that deep learning models such as deepDTnet and NeoDTI perform better on large-scale datasets but perform weakly when the amount of data is not large enough. The results based on dataset A are presented in Table 4 and Fig 5. The results based on dataset B are presented in Table 5 and Fig 6. In summary, the prediction performance of ISLRWR-DTI is superior on small and medium-sized datasets and slightly inferior but acceptable on large datasets.
The graphs indicate the high and low values of the evaluation indicators. The darker the color, the higher the value. The top short line indicates the standard deviation of the ten-fold cross-validation.
Noteworthy drug-protein pair analysis
In summary, the ISLRWR algorithm has the best performance in learning network structure and predicting DTIs compared to the RWR, MHRW, and IMRWR algorithms. Therefore, to obtain more auxiliary information for drug screening from the prediction results, we discuss the shared networks in the DTIs network, focusing on the six drug–target shared networks that are most densely connected with drugs. We determined the density of a shared network by degree: the higher the number of drugs shared between two targets, the denser the network. We selected the six dense networks with the highest number of shared connections for interpretation.
According to Fig 7, fludrocortisone and fluoxymesterone are in the shared network of target NR3C1 and target AR. The efficacy of the drugs in the shared network is summarized in Table 6.
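The selection of dense shared networks can be sketched as follows, assuming the predictions are available as a list of (drug, target) pairs. The function names are hypothetical, and the toy pairs reuse only a few drug and target names mentioned in this section; they are not the full prediction set.

```python
from collections import defaultdict
from itertools import combinations

def shared_drugs(pairs):
    """Map each unordered target pair to the set of drugs linked to both."""
    drugs_by_target = defaultdict(set)
    for drug, target in pairs:
        drugs_by_target[target].add(drug)
    shared = {}
    for t1, t2 in combinations(sorted(drugs_by_target), 2):
        common = drugs_by_target[t1] & drugs_by_target[t2]
        if common:
            shared[(t1, t2)] = common
    return shared

def densest_shared_networks(pairs, k=6):
    """Return the k target pairs sharing the most drugs, densest first."""
    ranked = sorted(shared_drugs(pairs).items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return ranked[:k]

# Toy drug-target pairs (illustrative subset only).
pairs = [
    ("fludrocortisone", "NR3C1"), ("fludrocortisone", "AR"),
    ("fluoxymesterone", "NR3C1"), ("fluoxymesterone", "AR"),
    ("ibuprofen", "PTGS1"), ("ibuprofen", "PTGS2"),
    ("naproxen", "PTGS1"), ("naproxen", "PTGS2"),
    ("diclofenac", "PTGS1"), ("diclofenac", "PTGS2"),
]

top = densest_shared_networks(pairs, k=2)
for (t1, t2), drugs in top:
    print(t1, t2, len(drugs), sorted(drugs))
```

Ranking by the size of the intersection is only one possible reading of "density by degree"; other measures, such as normalizing by the total number of drugs linked to either target, would give a different ordering.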
The green nodes indicate drugs and the pink nodes indicate targets. Mixed-color arc connections represent node–node interactions. The yellow shadows are used to highlight the network communities that require attention.
The network of the two homologous targets PTGS1 and PTGS2 contains 22 associated drugs, the main effects of which are antipyretic and anti-inflammatory. Antipyretic and analgesic drugs include acetaminophen, indomethacin, nabumetone, ketorolac, tolmetin, piroxicam, fenoprofen, diclofenac, mefenamic acid, naproxen, meloxicam, diflunisal, suprofen, bromfenac, balsalazide, and ibuprofen. In addition, the following drugs are commonly used to treat rheumatoid arthritis: sulindac, flurbiprofen, etodolac, sulfasalazine, oxaprozin, and ketoprofen.
Three drugs are co-associated with the targets SCN5A and KCNH2. The efficacy of the drugs in the shared network is summarized in Table 7.
Sixteen drugs are associated with the targets OPRD1 (gene for the δ-opioid receptor), OPRK1, and OPRM1 (gene for the μ-opioid receptor). The efficacy of the drugs in the shared network is summarized in Table 8.
Twenty-three drugs are associated with ADRA1A and ADRA2A. Ziprasidone is an atypical antipsychotic. Amitriptyline is an antidepressant that is used to treat all types of depression or to relieve chronic pain. Olanzapine is an atypical neuroleptic. In addition, seven drugs in this co-association network (ziprasidone, amitriptyline, olanzapine, clozapine, doxepin, quetiapine, and aripiprazole) are also associated with HTR2A and CHRM1.
GABRA1 is in a shared network with GABR-i and other targets. Alprazolam is a benzodiazepine hypnotic, sedative, and anxiolytic agent. Chlordiazepoxide has sedative, anxiolytic, muscle relaxant, and anticonvulsant effects. Midazolam is used clinically for treating insomnia and can also be used to induce sleep during surgical procedures. In addition, flurazepam, diazepam, oxazepam, triazolam, clonazepam, estazolam, bromazepam, and nitrazepam can be used to induce sedative and tranquilizing effects.
Furthermore, we enumerate some of the correctly predicted drug–target pairs to provide supporting information for drug discovery. As examples, we selected three drugs (vitamin A, eletriptan, and olanzapine) together with the targets that interact with them. The predicted results are given in Table 9.
In addition, by carefully screening the prediction results, we obtained newly discovered potential drug–target pairs. We defined a new drug–target pair as one whose drug ID and target ID are both searchable in the original training data, whose interaction is not known to exist in the training data, and which is predicted to be true in the subsequent prediction results. We obtained such drug–target pairs from sampling matches between the pre-drug ID list and the target ID list. We then searched currently published popular drug databases to verify whether such findings were true and reliable. Some of the new drug–target pairs verified as true by the database searches are displayed in Table 10.
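The screening criterion for new drug–target pairs can be expressed as a simple filter. This is a minimal sketch under stated assumptions: the IDs (`D1`, `T1`, and so on), the score dictionary, the function name, and the 0.5 cut-off for "predicted to be true" are all hypothetical and are not taken from the study's data or code.

```python
def candidate_new_pairs(train_drugs, train_targets, known_pairs,
                        predictions, threshold=0.5):
    """Return predicted-positive pairs that are absent from the known set.

    predictions maps (drug_id, target_id) to a predicted score in [0, 1].
    A pair qualifies only if both IDs occur in the training data, the pair
    itself is not a known interaction, and its score reaches the threshold.
    """
    known = set(known_pairs)
    hits = [
        (drug, target, score)
        for (drug, target), score in predictions.items()
        if drug in train_drugs
        and target in train_targets
        and (drug, target) not in known
        and score >= threshold
    ]
    # Highest-scoring candidates first; IDs break ties deterministically.
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))

# Hypothetical IDs and scores for illustration.
train_drugs = {"D1", "D2"}
train_targets = {"T1", "T2"}
known_pairs = [("D1", "T1")]
predictions = {
    ("D1", "T1"): 0.95,  # already known, so not new
    ("D1", "T2"): 0.80,  # new candidate
    ("D2", "T1"): 0.30,  # below the threshold
    ("D2", "T2"): 0.65,  # new candidate
    ("D3", "T2"): 0.90,  # drug ID absent from the training data
}

candidates = candidate_new_pairs(train_drugs, train_targets,
                                 known_pairs, predictions)
```

Candidates produced this way are only hypotheses; as described above, each one still has to be checked against external drug databases before it is reported.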
Discussion
Traditional drug discovery methods are both time-consuming and costly; therefore, it is valuable to use computer-aided techniques such as machine learning methods to improve the efficiency of drug discovery. Many machine learning methods have been applied to various aspects of drug discovery, among which DTIs prediction is an important facilitating task for discovering new drugs and potential targets.
However, many DTIs models rely on large datasets; if the dataset is not large enough and does not cover enough information, the prediction performance is greatly reduced. In other words, the superiority of these models is partly due to the superiority of the dataset. In this study, we focused on the model and the algorithm itself to improve the prediction performance. Few scholars have paid attention to network diffusion algorithms for network feature extraction. We applied the MHRW and IMRWR algorithms and proposed the ISLRWR algorithm. We found that improving the network diffusion strategy improves the prediction performance of DTIs to some extent.
In this study, we integrated four kinds of nodes (drug, target, disease, and side effect) and six kinds of edges (drug–drug, drug–side effect, drug–disease, drug–protein, protein–protein, and protein–disease) into a heterogeneous network. Then, we learned the network topology using multiple network diffusion algorithms to obtain compressed low-dimensional feature vectors. We applied the MHRW algorithm to improve the diffusion performance of the RWR algorithm. Further, we applied the IMRWR algorithm to remove the self-loop rate of the current node to improve the efficiency and sampling depth of network learning. Finally, we proposed the ISLRWR algorithm, which increases the self-loop rate of isolated nodes and then re-corrects the transfer probability matrix, significantly improving the AUROC and AUPRC of DTIs prediction. Data inflation is a potential problem: an excessive number of homologous proteins makes popular drugs and targets easier to predict and biases the predictions toward them, resulting in non-objective prediction results. Therefore, homologous proteins were removed and the above algorithms were applied again, revealing that ISLRWR still had excellent performance. To comprehensively demonstrate the performance of ISLRWR in DTIs prediction, we chose deepDTnet, MEDTI, NeoDTI, SGCL-DTI, DistMult, and ComplEx as baseline models for comparison. We found that ISLRWR performs excellently on a variety of indicators.
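The diffusion ideas summarized above can be illustrated with a minimal sketch on a toy graph. It uses the standard Metropolis–Hastings transition rule for an unweighted, undirected graph without self-edges, gives isolated nodes a self-loop so that every row of the transition matrix sums to one, and then runs the usual RWR iteration. It is not the authors' exact ISLRWR update: the IMRWR self-loop removal and the specific correction of the transfer probabilities are not reproduced here, and the restart probability of 0.5, the tolerance, and the function names are assumptions.

```python
def mh_transition(adj):
    """Metropolis-Hastings transition matrix for an undirected 0/1 graph.

    For neighbors i and j, P[i][j] = min(1/deg(i), 1/deg(j)); the leftover
    probability mass of row i becomes a self-loop. An isolated node keeps
    all of its mass, i.e. it receives a self-loop of probability 1.
    """
    n = len(adj)
    deg = [sum(1 for w in row if w) for row in adj]
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                trans[i][j] = min(1.0 / deg[i], 1.0 / deg[j])
        # Clamp to guard against tiny negative values from rounding.
        trans[i][i] = max(0.0, 1.0 - sum(trans[i]))
    return trans

def rwr(trans, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart: p <- (1 - r) * P^T p + r * p0."""
    n = len(trans)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1.0 - restart) * sum(p[i] * trans[i][j] for i in range(n))
            + restart * p0[j]
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy graph: a path 0-1-2 plus an isolated node 3 (illustrative only).
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

trans = mh_transition(adj)
profile_from_0 = rwr(trans, seed=0)  # diffusion profile of node 0
profile_from_3 = rwr(trans, seed=3)  # the isolated node keeps all its mass
```

In this sketch the self-loop on the isolated node is what keeps the transition matrix row-stochastic, so the diffusion profile remains a probability distribution. Stacking the profiles of all nodes gives the kind of matrix from which low-dimensional feature vectors could then be extracted.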
To obtain direct auxiliary information on DTIs, we plotted the predictions of ISLRWR as a network connectivity map and selected the most densely connected shared target networks for interpretation. In addition, we provide some correctly predicted drug–target pairs and new drug–target pairs verified by database searches as auxiliary information for drug discovery.
However, the present study has several limitations. Our method uses only direct association information from heterogeneous networks and does not use higher-order dependencies. For example, drugs and targets can be associated with each other through multistep transit paths, a type of information that some scholars refer to as metapaths. In future research, we will consider exploiting higher-order relations in the network topology. In addition, although good performance was observed on the DTIs prediction task, generalizability could be lacking, and the findings of this study require further validation in other tasks or domains. In the future, we would like to apply our DTIs prediction method to other association prediction tasks, such as lncRNA–miRNA interaction prediction [58, 59], metabolite–disease association prediction [60, 61], and numerous other tasks in biomedical forecasting [62–66]. These tasks have commonalities with the DTIs prediction task, and using graph network models for association prediction is currently a popular approach.
Conclusion
This paper aims to provide better-performing and more usable network models for DTIs prediction than those currently available. The model as a whole is enhanced by improving its individual parts. In this paper, the MHRW algorithm, the IMRWR algorithm, and the proposed ISLRWR algorithm (as improvements on the original RWR algorithm) are applied to improve the overall performance of the model. This approach proves helpful for improving the model; however, the degree of innovation is limited. Therefore, in the future, researchers can develop completely new models or broaden the range of applications based on the current overall improvement of the model.
References
- 1. Mohs RC, Greig NH. Drug discovery and development: Role of basic biological research. Alzheimer’s & Dementia: Translational Research & Clinical Interventions. 2017;3(4):651–657. pmid:29255791
- 2. Deore AB, Dhumane JR, Wagh R, Sonawane R. The Stages of Drug Discovery and Development Process. Asian Journal of Pharmaceutical Research and Development. 2019;7(6):62–67.
- 3. Shaker B, Ahmad S, Lee J, Jung C, Na D. In silico methods and tools for drug discovery. Computers in biology and medicine. 2021;137:104851. pmid:34520990
- 4. Lin X, Li X, Lin X. A Review on Applications of Computational Methods in Drug Screening and Design. Molecules. 2020;25(6):1375. pmid:32197324
- 5. Malviya R, Sharma A. Applications of Computational Methods and Modeling in Drug Delivery. CRC Press; 2021. 163–190. Available from: http://dx.doi.org/10.1201/9781003185246-9.
- 6. Sabe VT, Ntombela T, Jhamba LA, Maguire GEM, Govender T, Naicker T, et al. Current trends in computer aided drug design and a highlight of drugs discovered via computational techniques: A review. European Journal of Medicinal Chemistry. 2021;224:113705. pmid:34303871
- 7. Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, et al. Applications of machine learning in drug discovery and development. Nature reviews Drug discovery. 2019;18(6):463–477. pmid:30976107
- 8. Rodrigues T, Bernardes GJL. Machine learning for target discovery in drug development. Current opinion in chemical biology. 2020;56:16–22. pmid:31734566
- 9. Patel V, Shah M. Artificial intelligence and machine learning in drug discovery and development. Intelligent Medicine. 2022;2(3):134–140.
- 10. Zhang Z, Chen L, Zhong F, Wang D, Jiang J, Zhang S, et al. Graph neural network approaches for drug-target interactions. Current Opinion in Structural Biology. 2022;73:102327. pmid:35074533
- 11. Peng Y, Zhao S, Zeng Z, Hu X, Yin Z. LGBMDF: A cascade forest framework with LightGBM for predicting drug-target interactions. Frontiers in Microbiology. 2023;13:1092467. pmid:36687573
- 12. Vo TH, Nguyen NTK, Kha QH, Le NQK. On the road to explainable AI in drug-drug interactions prediction: A systematic review. Computational and Structural Biotechnology Journal. 2022;20:2112–2123. pmid:35832629
- 13. Soleymani F, Paquet E, Viktor H, Michalowski W, Spinello D. Protein–protein interaction prediction with deep learning: A comprehensive review. Computational and Structural Biotechnology Journal. 2022;20:5316–5341. pmid:36212542
- 14. Yu L, Zheng Y, Ju B, Ao C, Gao L. Research progress of miRNA–disease association prediction and comparison of related algorithms. Briefings in Bioinformatics. 2022;23(3):1–18. pmid:35246678
- 15. Hu X, Yin Z, Zeng Z, Peng Y. Prediction of miRNA–disease associations by cascade forest model based on stacked autoencoder. Molecules. 2023;28(13):5013. pmid:37446675
- 16. Zhang Y, Hu Y, Han N, Yang A, Liu X, Cai H. A survey of drug-target interaction and affinity prediction methods via graph neural networks. Computers in Biology and Medicine. 2023;163:107136. pmid:37329615
- 17. Hong Y, Luo P, Jin S, Liu X. LaGAT: link-aware graph attention network for drug–drug interaction prediction. Bioinformatics. 2022;38(24):5406–5412. pmid:36271850
- 18. Pang S, Zhang Y, Song T, Zhang X, Wang X, Rodriguez-Patón A. AMDE: a novel attention-mechanism-based multidimensional feature encoder for drug–drug interaction prediction. Briefings in Bioinformatics. 2022;23(1):1–12. pmid:34965586
- 19. Koca MB, Nourani E, Abbasoğlu F, Karadeniz I, Sevilgen FE. Graph convolutional network based virus-human protein-protein interaction prediction for novel viruses. Computational Biology and Chemistry. 2022;101:107755. pmid:36037723
- 20. Li Z, Zhong T, Huang D, You ZH, Nie R. Hierarchical graph attention network for miRNA-disease association prediction. Molecular Therapy. 2022;30(4):1775–1786. pmid:35121109
- 21. Nguyen T, Le H, Quinn TP, Nguyen T, Le TD, Venkatesh S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2020;37(8):1140–1147.
- 22. Tsuchiya K, Kurohara T, Fukuhara K, Misawa T, Demizu Y. Helical Foldamers and Stapled Peptides as New Modalities in Drug Discovery: Modulators of Protein-Protein Interactions. Processes. 2022;10(5):924.
- 23. Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Briefings in bioinformatics. 2020;22(1):247–269.
- 24. Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for predicting drug–target interactions: a brief review. Briefings in bioinformatics. 2013;15(5):734–747. pmid:23933754
- 25. Shi Z, Li J. Drug-Target Interaction Prediction with Weighted Bayesian Ranking. In: Proceedings of the 2nd International Conference on Biomedical Engineering and Bioinformatics. ICBEB 2018. ACM; 2018. 19–24. Available from: http://dx.doi.org/10.1145/3278198.3278210.
- 26. Abbasi K, Razzaghi P, Poso A, Ghanbari-Ara S, Masoudi-Nejad A. Deep Learning in Drug Target Interaction Prediction: Current and Future Perspectives. Current Medicinal Chemistry. 2021;28(11):2100–2113. pmid:32895036
- 27. Wang YB, You ZH, Yang S, Yi HC, Chen ZH, Zheng K. A deep learning-based method for drug-target interaction prediction based on long short-term memory neural network. BMC medical informatics and decision making. 2020;20(2):1–9. pmid:32183788
- 28. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics. 2020;36(22–23):5545–5547.
- 29. Sachdev K, Gupta MK. A comprehensive review of feature based methods for drug target interaction prediction. Journal of biomedical informatics. 2019;93:103159. pmid:30926470
- 30. Mahmud SMH, Chen W, Liu Y, Awal MA, Ahmed K, Rahman MH, et al. PreDTIs: prediction of drug–target interactions based on multiple feature information using gradient boosting framework with data balancing and feature selection techniques. Briefings in bioinformatics. 2021;22(5):1–20. pmid:33709119
- 31. Ding Y, Tang J, Guo F, Zou Q. Identification of drug–target interactions via multiple kernel-based triple collaborative matrix factorization. Briefings in bioinformatics. 2022;23(2):1–12. pmid:35134117
- 32. Sajadi SZ, Zare Chahooki MA, Tavakol M, Gharaghani S. Matrix factorization with denoising autoencoders for prediction of drug–target interactions. Molecular Diversity. 2022;27(3):1333–1343. pmid:35871213
- 33. Zeng X, Zhu S, Hou Y, Zhang P, Li L, Li J, et al. Network-based prediction of drug–target interactions using an arbitrary-order proximity embedded deep forest. Bioinformatics. 2020;36(9):2805–2812. pmid:31971579
- 34. An Q, Yu L. A heterogeneous network embedding framework for predicting similarity-based drug-target interactions. Briefings in bioinformatics. 2021;22(6):1–10. pmid:34373895
- 35. Yue Y, He S. DTI-HeNE: a novel method for drug-target interaction prediction based on heterogeneous network embedding. BMC bioinformatics. 2021;22(1):1–20. pmid:34479477
- 36. Chatterjee A, Walters R, Shafi Z, Ahmed OS, Sebek M, Gysi D, et al. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nature Communications. 2023;14(1):1989. pmid:37031187
- 37. Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chemical Science. 2020;11(7):1775–1797. pmid:34123272
- 38. Shang Y, Gao L, Zou Q, Yu L. Prediction of drug-target interactions based on multi-layer network representation learning. Neurocomputing. 2021;434:80–89.
- 39. Wan F, Hong L, Xiao A, Jiang T, Zeng J. NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions. Bioinformatics. 2019;35(1):104–111. pmid:30561548
- 40. Li Y, Qiao G, Gao X, Wang G. Supervised graph co-contrastive learning for drug–target interaction prediction. Bioinformatics. 2022;38(10):2847–2854. pmid:35561181
- 41. Yang B, Yih Wt, He X, Gao J, Deng L. Embedding entities and relations for learning and inference in knowledge bases. arXiv. 2014; 10.48550/arXiv.1412.6575
- 42. Trouillon T, Welbl J, Riedel S, Gaussier E, Bouchard G. Complex embeddings for simple link prediction. International conference on machine learning. PMLR; 2016. 2071–2080. Available from: https://proceedings.mlr.press/v48/trouillon16.html.
- 43. Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature communications. 2017;8(1):573. pmid:28924171
- 44. Tong H, Faloutsos C, Pan Jy. Fast random walk with restart and its applications. In: Sixth International Conference on Data Mining (ICDM’06). IEEE; 2006. 613–622. Available from: http://dx.doi.org/10.1109/icdm.2006.70.
- 45. Gjoka M, Kurant M, Butts CT, Markopoulou A. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. 2010 Proceedings IEEE INFOCOM. IEEE; 2010. 1–9. Available from: http://dx.doi.org/10.1109/infcom.2010.5462078.
- 46. Lv L, He M, Yi C. An improved MH link prediction algorithm combining with Random Walk with Restart. Journal of Yunnan University (Natural Science Edition). 2021;43(2):245–253.
- 47. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic acids research. 2010;39(1):1035–1041. pmid:21059682
- 48. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human Protein Reference Database–2009 update. Nucleic acids research. 2009;37(1):767–772. pmid:18988627
- 49. Davis AP, Murphy CG, Johnson R, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, et al. The Comparative Toxicogenomics Database: update 2013. Nucleic acids research. 2013;41(1):1104–1114. pmid:23093600
- 50. Kuhn M, Campillos M, Letunic I, Jensen LJ, Bork P. A side effect resource to capture phenotypic effects of drugs. Molecular systems biology. 2010;6(1):343. pmid:20087340
- 51. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic acids research. 2018;46(1):1074–1082.
- 52. Consortium U. UniProt: a worldwide hub of protein knowledge. Nucleic acids research. 2019;47(1):506–515.
- 53. Davis AP, Grondin CJ, Johnson RJ, Sciaky D, Wiegers J, Wiegers TC, et al. Comparative toxicogenomics database (CTD): update 2021. Nucleic acids research. 2021;49(1):1138–1143. pmid:33068428
- 54. Kuhn M, Letunic I, Jensen LJ, Bork P. The SIDER database of drugs and side effects. Nucleic acids research. 2016;44(1):1075–1079. pmid:26481350
- 55. Tanimoto TT. Elementary mathematical theory of classification and prediction. 1958;45.
- 56. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008;24(13):232–240. pmid:18586719
- 57. Avram S, Bologa CG, Holmes J, Bocci G, Wilson TB, Nguyen DT, et al. DrugCentral 2021 supports drug discovery and repositioning. Nucleic acids research. 2021;49(1):1160–1169. pmid:33151287
- 58. Wang W, Zhang L, Sun J, Zhao Q, Shuai J. Predicting the potential human lncRNA-miRNA interactions based on graph convolution network with conditional random field. Briefings in Bioinformatics. 2022;23(6):1–9. pmid:36305458
- 59. Zhang L, Yang P, Feng H, Zhao Q, Liu H. Using network distance analysis to predict lncRNA-miRNA interactions. Interdisciplinary Sciences: Computational Life Sciences. 2021;13:535–545. pmid:34232474
- 60. Sun F, Sun J, Zhao Q. A deep learning method for predicting metabolite-disease associations via graph neural network. Briefings in Bioinformatics. 2022;23(4):1–11. pmid:35817399
- 61. Gao H, Sun J, Wang Y, Lu Y, Liu L, Zhao Q, et al. Predicting metabolite-disease associations based on auto-encoder and non-negative matrix factorization. Briefings in Bioinformatics. 2023;24(5):1–13. pmid:37466194
- 62. Hu H, Feng Z, Lin H, Zhao J, Zhang Y, Xu F, et al. Modeling and analyzing single-cell multimodal data with deep parametric inference. Briefings in Bioinformatics. 2023;24(1):1–13. pmid:36642414
- 63. Hu H, Feng Z, Lin H, Cheng J, Lyu J, Zhang Y, et al. Gene function and cell surface protein association analysis based on single-cell multiomics data. Computers in Biology and Medicine. 2023;157:106733. pmid:36924730
- 64. Wang T, Sun J, Zhao Q. Investigating cardiotoxicity related with hERG channel blockers using molecular fingerprints and graph attention mechanism. Computers in biology and medicine. 2023;153:106464. pmid:36584603
- 65. Chen Z, Zhang L, Sun J, Meng R, Yin S, Zhao Q. DCAMCP: A deep learning model based on capsule network and attention mechanism for molecular carcinogenicity prediction. Journal of cellular and molecular medicine. 2023;27(20):3117–3126. pmid:37525507
- 66. Meng R, Yin S, Sun J, Hu H, Zhao Q. scAAGA: Single cell data analysis framework using asymmetric autoencoder with gene attention. Computers in biology and medicine. 2023;165:107414. pmid:37660567