Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile

In silico discovery of interactions between drug compounds and target proteins is of core importance for improving the efficiency of the laborious and costly experimental determination of drug-target interaction. Drug-target interaction data are available for many classes of pharmaceutically useful target proteins including enzymes, ion channels, GPCRs and nuclear receptors. However, current drug-target interaction databases contain only a small number of drug-target pairs with experimentally validated interactions. In particular, for some drug compounds (or targets) no interaction is available. This motivates the need for methods that predict interacting pairs with high accuracy also for these 'new' drug compounds (or targets). We show that a simple weighted nearest neighbor procedure is highly effective for this task. We integrate this procedure into a recent machine learning method for drug-target interaction that we developed in previous work. Results of experiments indicate that the resulting method predicts true interactions with high accuracy also for new drug compounds and achieves results comparable to or better than those of recent state-of-the-art algorithms. Software is publicly available at http://cs.ru.nl/~tvanlaarhoven/drugtarget2013/.


Introduction
A core problem in pharmacology is the determination of interactions between drug compounds and target proteins in order to understand and study their effects. The in silico prediction of such interactions is of crucial importance for improving the efficiency of the laborious and costly experimental determination of drug-target interaction (see e.g. [1][2][3][4]).
Drug-target interaction data are available for various classes of pharmaceutically useful target proteins including enzymes, ion channels, GPCRs and nuclear receptors [5]. Publicly available databases have been built and maintained, such as KEGG BRITE [6], DrugBank [7], GLIDA [8], SuperTarget and Matador [9], BRENDA [10], and ChEMBL [11], containing drug-target interaction and other related sources of information, like chemical and genomic data.
The availability of these data has boosted the development of machine learning methods for the in silico prediction of drug-target interactions, including the seminal paper by Yamanishi et al. [12]. In that paper the authors distinguish between prediction for 'known' drug compounds or targets, for which at least one interaction is present in the training set; and prediction for 'new' drug compounds or targets, for which no interaction in the training set is available. This results in four possible settings for predicting drug-target interaction, depending on whether the drug compounds and/or targets are known or new.
In this paper we generalize the applicability of the method introduced in [16] to so-called new drug compounds, that is, drug compounds for which no interactions are known. The method, hereafter called GIP, uses known interactions of a drug for predicting novel ones by means of a regularized least squares algorithm incorporating a product of kernels constructed from drug compound and target interaction profiles. We propose a simple weighted nearest neighbor algorithm, called WNN, for constructing an interaction score profile for a new drug compound using chemical and interaction information about known compounds in the dataset. The WNN method can be used as a standalone algorithm for predicting interactions for new drug compounds. It can also be directly incorporated into the GIP method for handling new drug compounds. We call the resulting combination WNN-GIP. The methods can be directly adapted to also handle unknown targets, or both unknown drug compounds and targets.
We test the predictive performance of WNN and WNN-GIP on four drug-target interaction networks in humans involving enzymes, ion channels, GPCRs and nuclear receptors. Results as measured by the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR) [20] show that the weighted nearest neighbor profile algorithm and its incorporation into the GIP method are capable of predicting true interactions for new drug compounds with satisfactory accuracy. The algorithms achieve results competitive with or better than those of the recent state-of-the-art algorithms KBMF2K [15] and BLM-NII [17]. KBMF2K is based on a fully probabilistic approach to model drug-target interaction, which can be applied to discover target (respectively drug compound) interactions for new drug compounds (respectively target proteins). Results in [15] indicate improved accuracy over the method introduced in [19]. BLM-NII is an extension of the BLM method [13] to deal with new drug compounds (or targets). In BLM-NII a drug-target interaction for a new drug compound is inferred by constructing an estimated interaction profile from the drug compounds in the training data. The resulting profile is then used as label information to learn an interaction model for that drug compound with the BLM method.

The Problem
We consider the problem of predicting interactions using a drug-target interaction network, chemical similarity between drug compounds and genomic similarity between target proteins. Formally we are given a set $X_d = \{d_1, d_2, \ldots, d_{n_d}\}$ of drug compounds and a set $X_t = \{t_1, t_2, \ldots, t_{n_t}\}$ of target proteins. A set of interactions between drug compounds and targets is known. A bipartite network (between drug compounds and targets) can be constructed whose edges are these known interactions. Its corresponding adjacency matrix is an $n_d \times n_t$ matrix $Y$ such that $y_{ij} = 1$ if drug compound $d_i$ interacts with target $t_j$, and $y_{ij} = 0$ otherwise. Furthermore, information about the chemical similarity between drug compounds and the genomic similarity between targets is given in the form of the similarity matrices $S_d$ and $S_g$, respectively.
The goal is to assign scores to drug-target pairs $(d_i, t_j)$ such that pairs with higher scores are more likely to interact.

The GIP Method
Machine learning methods for tackling this problem are mainly based on the assumption that drug compounds exhibiting a similar pattern of interaction and non-interaction with the targets in a drug-target interaction network are likely to show similar interaction behavior with respect to new targets. A similar assumption is made for targets. Here we use the method introduced in [16]. It is based on the so-called (target) interaction profile $y_{d_i}$ of a drug compound $d_i$, defined to be row $i$ of the adjacency matrix $Y$, and the (drug compound) interaction profile $y_{t_j}^{T}$ of a target protein $t_j$, defined to be column $j$ of $Y$. The interaction profiles generated from a drug-target interaction network are used as feature vectors for a classifier. A kernel from the interaction profiles is constructed using the topology of the drug-target network, defined for drug compounds $d_i$ and $d_j$ as follows:
$$K_{GIP,d}(d_i, d_j) = \exp\left(-\gamma_d \lVert y_{d_i} - y_{d_j} \rVert^2\right), \qquad \gamma_d = \tilde{\gamma}_d \Big/ \Big(\frac{1}{n_d} \sum_{k=1}^{n_d} \lVert y_{d_k} \rVert^2\Big),$$
where the bandwidth $\gamma_d$ is the parameter $\tilde{\gamma}_d$ normalized by the average squared norm of the interaction profiles [16]. A kernel $K_{GIP,t}$ for the similarities between target proteins is defined analogously. Moreover, the kernels $K_{chemical,d}$ and $K_{genomic,t}$ are considered, containing information about the chemical and genomic space. They are constructed from the chemical and genomic similarity matrices $S_d$ and $S_g$ between drug compounds and between targets, by applying a simple transformation to make them symmetric and positive definite. The interaction profile kernel can be easily combined with these kernels using a weighted average.
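The construction of the interaction profile kernel can be illustrated with a short sketch. It assumes the Gaussian form $\exp(-\gamma \lVert y_{d_i} - y_{d_j} \rVert^2)$ used in [16], with the bandwidth normalized by the average squared profile norm; the function name `gip_kernel`, the parameter `gamma_tilde` and the toy matrix are illustrative choices and are not taken from the released software.

```python
import math


def gip_kernel(profiles, gamma_tilde=1.0):
    """Gaussian kernel between interaction profiles (rows of `profiles`).

    Assumes at least one profile is non-zero, otherwise the
    normalization below would divide by zero.
    """
    n = len(profiles)
    # Average squared norm of the profiles, used to normalize the bandwidth.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_tilde / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy adjacency matrix: 3 drug compounds (rows) x 3 targets (columns).
Y = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
K_gip_d = gip_kernel(Y)
# The target kernel is obtained analogously from the columns of Y.
K_gip_t = gip_kernel([list(col) for col in zip(*Y)])
```

In this sketch a drug compound without any known interaction has an all-zero profile, so its kernel values carry no information about its targets; this is the situation that the weighted nearest neighbor procedure is meant to address.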
The kernel for drug compounds and the kernel for target proteins can be combined using the Kronecker product $K_d \otimes K_t$, such that for drug-target pairs $(d_i, t_i)$ and $(d_j, t_j)$
$$K\big((d_i, t_i), (d_j, t_j)\big) = K_d(d_i, d_j)\, K_t(t_i, t_j).$$
For each drug compound with at least one known interaction in the training data, a score interaction profile $\hat{y}$ is computed from its interaction profile $y$ and the kernel matrix $K$, using the Regularized Least Squares (RLS) classifier. This is achieved by means of the simple closed form solution
$$\hat{y} = K (K + \sigma I)^{-1} y,$$
where $\sigma$ is a regularization parameter.
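As an illustration of the RLS step, the following sketch evaluates the standard closed form $\hat{y} = K (K + \sigma I)^{-1} y$ on a tiny example with an explicitly built Kronecker product kernel. It is a minimal sketch under stated assumptions: the helper names (`kron`, `solve`, `rls_scores`) and the toy kernels are hypothetical, and forming the full Kronecker product is only feasible for very small networks; [16] describes how this is avoided in practice.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def rls_scores(K_d, K_t, Y, sigma=1.0):
    """Score matrix obtained from y_hat = K (K + sigma I)^-1 y."""
    n_d, n_t = len(Y), len(Y[0])
    K = kron(K_d, K_t)
    # Row-major vectorization: index i * n_t + j is the pair (d_i, t_j).
    y = [Y[i][j] for i in range(n_d) for j in range(n_t)]
    m = len(y)
    A = [[K[r][c] + (sigma if r == c else 0.0) for c in range(m)]
         for r in range(m)]
    alpha = solve(A, y)
    y_hat = [sum(K[r][c] * alpha[c] for c in range(m)) for r in range(m)]
    return [y_hat[i * n_t:(i + 1) * n_t] for i in range(n_d)]


# Toy example: 2 drug compounds, 2 targets, hand-picked kernel values.
K_d = [[1.0, 0.5], [0.5, 1.0]]
K_t = [[1.0, 0.2], [0.2, 1.0]]
Y = [[1, 0], [0, 1]]
scores = rls_scores(K_d, K_t, Y, sigma=1.0)
```

With identity kernels the formula reduces to $\hat{y} = y / (1 + \sigma)$, which gives a quick sanity check of the implementation.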
We refer the reader to [16] for a more detailed description and analysis of this method.
For simplicity in the sequel we call GIP the RLS algorithm that uses the kernel defined as the Kronecker product of the weighted averages of the interaction kernels and chemical and genomic kernels.

Weighted Nearest Neighbor for New Drug Compounds
We want to extend GIP to new drug compounds, that is, compounds for which no interaction is known. To this aim, we propose a simple weighted nearest neighbor procedure. For a new drug compound, its chemical similarity with other known drug compounds and their corresponding profiles are used in order to infer a score interaction profile for that drug compound.
Specifically, the score interaction profile $y_d^{WNN}$ of a new drug compound $d$ is the weighted sum of the profiles of the drug compounds in the training data, where a higher weight is assigned to the profiles of those drug compounds that are more similar to $d$. Let $y_1, \ldots, y_{n_d}$ be the interaction profiles of the other compounds in the dataset (that is, the rows of $Y$), listed in decreasing order with respect to their chemical similarity to $d$. Then
$$y_d^{WNN} = \sum_{i=1}^{n_d} w_i\, y_i,$$
where the weights $w_i$ are computed using a given decay value $T \le 1$ as $w_i = T^{i-1}$. For computational reasons we only sum over drug compounds with weight at least $10^{-4}$. In our experiments we choose the decay rate $T$ by 5-fold cross-validation to maximize AUC. We call the resulting procedure WNN.
An extension of GIP to handle new drug compounds using WNN, hereafter called WNN-GIP, can be directly formulated: for each new drug compound $d$, add $y_d^{WNN}$ as a new row to the matrix $Y$ and apply GIP to predict the score interaction profile $\hat{y}$ of $d$.
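The WNN procedure is short enough to be sketched directly. The sketch below assumes that the chemical similarities between the new compound and the known compounds are given as a list; the function name `wnn_profile` and the toy values are illustrative and do not refer to the released software.

```python
def wnn_profile(similarities, profiles, decay=0.5, min_weight=1e-4):
    """Weighted nearest neighbor score profile for a new drug compound.

    similarities[i] is the chemical similarity between the new compound
    and known compound i; profiles[i] is the interaction profile of
    known compound i (row i of Y). `decay` plays the role of T.
    """
    # Rank the known compounds by decreasing similarity to the new one.
    order = sorted(range(len(profiles)),
                   key=lambda i: similarities[i], reverse=True)
    scores = [0.0] * len(profiles[0])
    weight = 1.0  # w_1 = T^0
    for i in order:
        if weight < min_weight:
            break  # remaining neighbors contribute negligibly
        for j, y in enumerate(profiles[i]):
            scores[j] += weight * y
        weight *= decay  # w_{k+1} = T * w_k
    return scores


# Toy example: three known compounds, two targets.
Y = [[1, 0],
     [0, 1],
     [1, 1]]
sims_to_new = [0.2, 0.9, 0.5]
y_wnn = wnn_profile(sims_to_new, Y, decay=0.5)
# Ranking is compound 1, 2, 0 with weights 1, 0.5, 0.25,
# so y_wnn == [0.75, 1.5].
```

For WNN-GIP, the resulting profile would be appended to $Y$ as the row of the new compound before the GIP kernels are computed and RLS is applied.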

A Method to Show the Bias of a LOOCV Procedure
In a recent paper [17] the BLM-NII algorithm is introduced and assessed using the following leave-one-out cross validation (LOOCV) procedure. Each compound with only one interaction in Y is treated as a 'new candidate' in the cross validation and the BLM-NII procedure is applied to it. We observe that in this way a strong prior is implicitly used in the cross validation, namely the fact that the considered compound had at least one interaction.
To illustrate how this prior introduces a bias on the results, we consider the following simple procedure, called Const. Const constructs an all '1's profile for the drug compounds or target proteins with only one interaction.
We can incorporate Const into GIP in the same way as WNN, giving the Const-GIP method. With this method all possible interactions for drugs/targets with only one interaction will be ranked before interactions with drugs/targets that also have other interactions. Essentially, for such interactions the method only has to do half the work, since the fact that the drug/target is correct can be known with certainty. In real-world situations there are also drug compounds that interact with none of the targets under consideration, and vice versa, which would invalidate the Const-GIP method.

Experiments
We perform a comparative experimental analysis of the proposed algorithms and two recently published methods [15,17].

Datasets
To this end we use the four drug-target interaction networks in humans involving enzymes, ion channels, G-protein-coupled receptors (GPCRs) and nuclear receptors from [12]. Table 1 lists some properties of the datasets.
Drug-target interaction information was retrieved from the KEGG BRITE [6], BRENDA [10], SuperTarget [9] and DrugBank [7] databases. Chemical structures of the compounds were derived from the DRUG and COMPOUND sections in the KEGG LIGAND database [6]. The chemical structure similarity between compounds was computed using SIMCOMP [21], which tries to find a graph matching between two compound structures. This resulted in a similarity matrix denoted by $S_d$, which represents the chemical space. Amino acid sequences of the target (human) proteins were obtained from the KEGG GENES database [6]. Sequence similarity between proteins was computed using a normalized version of the Smith-Waterman score [22], resulting in a similarity matrix denoted by $S_g$, which represents the genomic space.
These datasets are publicly available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/ and http://cbio.ensmp.fr/~yyamanishi/bipartitelocal/. They are the current standard benchmark data for comparing the performance of machine learning algorithms for drug-target interaction. We use these datasets as they are, without adding new interactions from the source databases.

Results
We follow the experimental procedure adopted in [15,19]. Specifically, for each dataset, drug compounds are split into five subsets of roughly equal size. Each subset is then used in turn as the test set and training is performed on the data consisting of the remaining four subsets. This procedure is repeated five times.
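The splitting scheme described above can be sketched as follows. This is an illustrative reading of the protocol, assuming that drug compounds are shuffled anew in each repetition; the function name `drug_folds` and the fixed `seed` are hypothetical choices.

```python
import random


def drug_folds(n_drugs, n_folds=5, n_repeats=5, seed=0):
    """Yield (train, test) drug index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_drugs))
        rng.shuffle(indices)
        for k in range(n_folds):
            # Every n_folds-th shuffled index goes into fold k, so the
            # folds differ in size by at most one compound.
            test = sorted(indices[k::n_folds])
            held_out = set(test)
            train = [i for i in range(n_drugs) if i not in held_out]
            yield train, test


splits = list(drug_folds(n_drugs=12))
```

Each test compound is treated as new: its interactions are hidden during training, and the predicted scores for its row of $Y$ are compared with the hidden entries.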
Results are assessed using the AUC and AUPR quality measures, which are commonly used in this type of study. Specifically, the ROC curve of the true positive rate as a function of the false positive rate is computed, and the area under the ROC curve (AUC) is used as a quality measure (see for instance [23]). Furthermore, the precision-recall curve is computed, that is, the plot of the precision (the ratio of true positives among all positive predictions) for each given recall rate. The area under this curve (AUPR) provides a quantitative assessment of how well, on average, predicted scores of true interactions are separated from predicted scores of true non-interactions. Since there are few true drug-target interactions, the AUPR is a more informative quality measure than the AUC, as it penalizes false positives among the top-ranked prediction scores much more heavily [20].
Average AUC and AUPR results and standard deviations are reported in Table 2. They indicate that WNN-GIP has slightly better (average) AUC on all datasets except Enzyme. However, WNN has slightly better AUPR than WNN-GIP. By itself the GIP method does not work well in this setting, which is to be expected, since it was not designed to handle new drugs.
To estimate the statistical significance of the AUC results we used the method described in [24]. To determine the significance of the AUPR results we used bootstrapping. The last column of Table 2 lists the average value of the decay rate $T$ over the folds and repetitions. In general, the larger datasets have a higher (slower) decay rate, which means that more neighbors are taken into account.
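Both quality measures can be computed from a list of binary labels and predicted scores. The sketch below uses two common estimators, which are assumptions rather than the exact procedures used in the experiments: the AUC as the fraction of correctly ordered positive-negative pairs, and the AUPR approximated by average precision. The function names are illustrative.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise approximation of the area under the precision-recall curve.

    Tied scores are ranked in input order, which is an arbitrary choice.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall level
    return total / n_pos


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores))                # 0.75
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2
```

In the toy example the single false positive ranked second lowers the average precision only moderately because half of the pairs are true interactions; when true interactions are rare, such early false positives reduce the precision far more, which is why the AUPR reacts more strongly than the AUC.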

Comparison with other Methods
We consider the two following recent methods: KBMF2K [15] and BLM-NII [17].
KBMF2K is based on a Bayesian formulation that combines dimensionality reduction, matrix factorization and binary classification for predicting drug-target interaction networks using only chemical similarity between drug compounds and genomic similarity between target proteins.
In BLM-NII a drug-target interaction for a new drug compound d is inferred by constructing an estimated interaction profile for d as follows. For each target, an entry of the profile for d is defined as the sum of the similarity values of d and each of the drug compounds interacting with that target. The resulting profile is then used as label information to learn an interaction model for d by means of the BLM method.
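The profile construction step of BLM-NII, as described above, can be sketched in a few lines. The sketch covers only the summation of similarities stated in the text; any further processing applied in [17] and the subsequent BLM training are not shown, and the function name `nii_profile` is illustrative.

```python
def nii_profile(similarities, Y):
    """Estimated interaction profile of a new compound d.

    similarities[i] is the chemical similarity between d and known
    compound i; Y[i][j] is 1 if compound i interacts with target j.
    Entry j sums the similarities of d to all compounds that interact
    with target j.
    """
    n_targets = len(Y[0])
    return [sum(s * row[j] for s, row in zip(similarities, Y))
            for j in range(n_targets)]


# Toy example: three known compounds, two targets.
Y = [[1, 0],
     [0, 1],
     [1, 1]]
profile = nii_profile([0.75, 0.5, 0.25], Y)
# Target 0: compounds 0 and 2 interact, 0.75 + 0.25 = 1.0
# Target 1: compounds 1 and 2 interact, 0.5 + 0.25 = 0.75
```

In contrast to WNN, every known compound contributes in proportion to its similarity value rather than its similarity rank.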
Comparison with KBMF2K. To compare results of WNN and WNN-GIP with those reported in [15], we follow the experimental procedure therein used (described in the previous section). Table 2 also includes the AUC and AUPR for the KBMF2K method. They indicate similar performance of KBMF2K and the simpler WNN algorithm, and slightly better overall results achieved by WNN-GIP, except on the Ion Channel dataset.
We can also test the prediction capability of the proposed methods on unknown drug-target interactions of the given network using the procedure adopted in [15]. Therein, the complete interaction network for each dataset is used as training data, and the predictions on non-interacting pairs in the training set are ranked with respect to their interaction scores. However, since each drug compound or target in the training set has at least one interaction, we do not need to use WNN and the results are those of GIP. We report the top five predicted interactions for each dataset in Table 3. The full lists of all predicted interactions ranked by interaction score can be found at http://cs.ru.nl/~tvanlaarhoven/drugtarget2013/.
Comparison with BLM-NII. Table 4 shows the results of the LOOCV experiments. As expected, both Const-GIP and BLM-NII achieve very good results, with comparable AUC, and slightly better AUPR performance achieved by Const-GIP. To assess the statistical significance of these differences we used an upper bound on the variance of the AUC and AUPR for BLM-NII, because the actual variance is unknown. With this bound the differences in AUC scores are not statistically significant.
In general, these results indicate that cross validation should be applied and interpreted with care. Note that the cross validation procedure used in the comparison with KBMF2K is also positively biased, since we know that each 'new' drug compound has at least one interaction, but there the bias is much smaller.

Discussion
In this work, we proposed a simple yet effective procedure to predict interaction profiles for unknown drug compounds and showed how it can be directly integrated into a recent machine learning algorithm for the in silico prediction of drug-target interactions. The novelty of our approach lies in the use of a weighted nearest neighbor procedure for inferring a profile for a drug compound from the interaction profiles of the compounds in the training data, where each profile is weighted using information about chemical similarity between drug compounds combined with a simple decay scheme. The method can be directly modified to predict interaction scores of unknown targets (or of both unknown targets and drug compounds).
We performed a comparative assessment of the proposed methods on four different drug-target interaction networks from humans involving enzymes, ion channels, GPCRs and nuclear receptors. Results indicated that, in predicting interactions for unknown drug compounds, WNN is competitive with more involved, recently proposed machine learning methods, notably a fully probabilistic method based on a Bayesian formulation that combines kernel-based nonlinear dimensionality reduction, matrix factorization and binary classification. Furthermore, we showed that the direct integration of WNN into a recent kernel-based machine learning method provides a general and powerful tool for finding drug-target interactions.
The computational complexity of WNN is $O(n_d^2 + n_t^2)$, while the computational complexity of WNN-GIP is dominated by the RLS prediction using the Kronecker product kernel, which is $O(n_d^3 + n_t^3)$ as implemented in [16]. This can be further improved to a quadratic computational complexity by applying recent techniques for large-scale kernel methods to compute the two kernel decompositions, e.g. [25]. Therefore WNN-GIP is more efficient than KBMF2K, since the total time complexity of each iteration in the variational approximation method used in KBMF2K is $O(R n_d^3 + R n_t^3 + R^3)$, where $R$ is the subspace dimensionality used in the method.
A limitation of our approach is that it does not distinguish between an inactive target and a target that has not been measured for a compound.
Compounds with a higher mutual chemical similarity also have a higher chance of having the same bioactivity. WNN could exploit this information by determining the weights directly from the similarity values, instead of using the proposed ranking-based decay mechanism. In this way all compounds with high similarity would receive a high weight, and all compounds with low similarity would make only a minor contribution to the final predicted profile. By the same reasoning, there is a similarity threshold below which the chance that two compounds have the same profile is so low that it would be better not to make a prediction at all. In particular, for new screening data from very large screening libraries, chances are high that none of the reference compounds is really similar to the screening hits; making predictions for all such compounds would most likely have a detrimental effect on the overall prediction performance. Many published target prediction algorithms apply such "applicability domain" or confidence estimates to their predictions. WNN could be modified to address this issue, for instance by including a binary annotation based on a similarity threshold, or a more advanced procedure based on the similarities of all compounds considered for the generation of the profile.
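One possible reading of this suggestion is sketched below. It is a hypothetical variant that was not evaluated in this work: weights are taken directly from the similarity values, compounds below a threshold are ignored, and no prediction is returned when no sufficiently similar compound exists. The function name, the threshold value and the normalization by the total weight are all assumptions made for illustration.

```python
def similarity_weighted_profile(similarities, profiles, threshold=0.3):
    """Hypothetical WNN variant with similarity-based weights.

    Returns None when the new compound falls outside the applicability
    domain, that is, when no known compound reaches `threshold`.
    """
    kept = [(s, p) for s, p in zip(similarities, profiles) if s >= threshold]
    total = sum(s for s, _ in kept)
    if not kept or total == 0:
        return None  # abstain instead of predicting from dissimilar compounds
    n_targets = len(profiles[0])
    # Similarity-weighted average of the retained profiles.
    return [sum(s * p[j] for s, p in kept) / total for j in range(n_targets)]


Y = [[1, 0],
     [0, 1],
     [1, 1]]
in_domain = similarity_weighted_profile([0.75, 0.25, 0.1], Y, threshold=0.2)
out_of_domain = similarity_weighted_profile([0.1, 0.05, 0.1], Y, threshold=0.2)
```

The returned `None` corresponds to the binary annotation mentioned above; a graded confidence value based on the retained similarities would be a natural refinement.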