Modified linear regression predicts drug-target interactions accurately

State-of-the-art approaches for the prediction of drug–target interactions (DTI) are based on various techniques, such as matrix factorisation, restricted Boltzmann machines, network-based inference and bipartite local models (BLM). In this paper, we propose the framework of Asymmetric Loss Models (ALM) which is more consistent with the underlying chemical reality compared with conventional regression techniques. Furthermore, we propose to use an asymmetric loss model with BLM to predict drug–target interactions accurately. We evaluate our approach on publicly available real-world drug–target interaction datasets. The results show that our approach outperforms state-of-the-art DTI techniques, including recent versions of BLM.


Introduction
When developing new drugs and identifying their side effects [1], pharmaceutical science relies on findings from related branches of science, including statistics and computer science. An essential step in this process is the identification of interactions between drugs and pharmacological targets. Although the existence of interactions can be reliably confirmed by in vitro binding assays, see e.g., [2][3][4][5], such methods are expensive and time consuming [6]. In order to address this bottleneck, computational approaches have been designed and implemented for the estimation of the probability of interactions. Therefore, most promising candidates for in vitro experiments may be selected based on in silico approaches.
The importance of drug-target interaction prediction is further emphasised by the costs of drug development. While estimates vary, they agree that it costs hundreds of millions of dollars to bring a new drug to the market, see e.g. [7] for an overview. Furthermore, the process may take more than 10 years in total.
Drug-target interaction prediction (DTI) techniques promise to reduce the aforementioned costs and time, and to support drug repositioning [8], i.e., the use of an existing medicine to treat a disease that has not been treated with that drug yet. Drug repositioning is especially relevant for the treatment of rare diseases, including neurological disorders. While each of the rare diseases affect only few people, due to the large number of rare diseases, in total 6-8% of the entire population is affected by one of those diseases. This results in a paradox situation: although a significant fraction of the population is suffering from one of the rare diseases, it is economically irrational to develop new drugs for many of them. However, drug repositioning may potentially lead to breakthroughs in such cases.
In silico approaches for DTI include techniques based on docking simulations [9], ligand chemistry [10], text mining [11,12] and machine learning. Text mining is inherently limited to the identification of entities and interactions that have already been documented, although the output of approaches based on text mining, i.e., the identified interactions, may serve as input data for other approaches, such as the ones based on machine learning. A serious limitation of docking simulations is that information about the three-dimensional structure of candidate drugs and targets is required. In many cases, e.g. for G-protein coupled receptors (GPCR) and ion channels, such information may not be available. Moreover, the performance of ligand-based approaches is known to decrease if only few ligands are known.
Despite all the aforementioned efforts, accurate prediction of drug-target interactions still remained a challenge. In this paper, we propose a new regression technique for accurate DTI predictions. We use a novel loss function that reflects the needs of drug-target interaction better than wide-spread loss functions, such as mean squared error or logistic loss. Our generic framework of asymmetric loss models (ALM) works with various regressors. For simplicity, we instantiate ALM with linear regression which leads to asymmetric loss linear regression (ALLR). We propose to use this new regressor in BLM for drug-target interaction prediction. Note that ALM is substantially different from hubness-aware regressors that we used with BLM in our previous work [19]. As ALLR is a modified version of linear regression, we call our approach Drug-Target Interaction Prediction with Modified Linear Regression, or MOLIERE for short. We evaluate MOLIERE on publicly available real-world datasets and show that our approach outperforms state-of-the-art DTI techniques, including recent versions of BLM and the cases when conventional loss functions are used. Furthermore, we show that MOLIERE is able to predict medically relevant drug-target interactions that are not contained in the original datasets. determined by the Smith-Waterman algorithm, see [17,32] for details. Chemical structure similarities between drugs were computed using the SIMCOMP algorithm [33].
Each entry m i,j of the interaction matrix M indicates whether the interaction between the ith drug (denoted as d i ) and j-th target (denoted as t j ) is known: Note that in case of these datasets, only the information about the presence of interactions is explicit, there is no explicit information about the absence of interactions. In particular, the semantics of m i,j = −1 is that the corresponding drug d i and target t j may or may not interact. In fact, some of the drug-target pairs denoted as −1 actually interact, however, the interaction was unknown when these datasets were created, roughly 10 years ago. In order to allow for a fair comparison with other works in the literature, in our experiments reported in Tables 2-4, we used the original datasets without these "new" interactions.
In order to illustrate that our approach is indeed able to predict unknown interactions, we show that using the original data, we could predict many of those interactions that have been discovered meanwhile (Tables 5-7).

Problem formulation
We define the Drug-Target Interaction Prediction problem as follows. We are given a set D ¼ fd 1 ; . . . ; d n g of n drugs, a set T ¼ ft 1 ; . . . ; t m g of m pharmaceutical targets, an n × n drug similarity matrix S D , an m × m target similarity matrix S T and an n × m interaction matrix M. For  some of the drug-target pairs the presence or absence of interaction is unknown (or simulated to be unknown in order to evaluate our approach). The task is to predict the likelihood of interaction for these unknown pairs. At the first glance, the above DTI problem seems to be similar to the problems considered in the recommender systems community. Note, however, that most recommender techniques consider only the interactions ("ratings") because even a few ratings are thought to be more informative than metadata, such as users' similarity based on their demographic information [34]. In contrast, drug-drug and target-target similarities play an essential role in DTI.

Bipartite local models
BLM considers DTI as a link prediction problem in bipartite graphs [28]. The vertices in one of the vertex classes correspond to drugs, whereas the vertices in the other vertex class correspond to targets. There is an edge e i,j between drug d i and target t j if and only if m i,j = 1.
The likelihood of unknown interactions is predicted as follows: we consider an unknown pair u i,j = (d i , t j ) and calculate the likelihood of interaction as the aggregate of two independent predictions.
The first prediction, called drug-centric prediction (Fig 1, left panel), is based on the relations between d i and the targets. Each target t k (except t j ) is labelled as "+ 1" or "−1" depending on m i, k . Then a model is trained to distinguish "+ 1"-labelled and "−1"-labelled targets. Subsequently, this model is applied to predict the likelihood of interaction for the unknown pair u i,j . This first prediction is denoted byŷ 0 i;j . (When describing BLM, in accordance with our data, we assumed that only the information about the presence of an interaction is explicit, and therefore we train the model to distinguish known interacting pairs from pairs with unknown status. In contrast, if both known interacting and known non-interacting drug-target pairs are given, one may train the model using only the known interacting and known non-interacting pairs). The second prediction, called target-centric prediction,ŷ 00 i;j , is obtained similarly, but instead of considering the interactions of drug d i and labelling the targets, the interactions of target t j are considered and drugs are labelled (Fig 1, right panel). The models that make the first and second predictions are called drug-centric and target-centric local models.
In order to obtain the final prediction of BLM, we average the predictions of the aforementioned local models:ŷ i;j ¼ŷ 0 i;j þŷ 00 Note that instead of averaging, other aggregation functions, such as minimum or maximum are possible as well. According to our observations, our approach achieves most accurate results when the two predictions are averaged. However, the effect of the aggregation function can be considered as minor: when we repeated our experiments reported in Table 4 with min and max aggregation functions, we observed that our approach consistently outperformed its competitors for all the three aggregation functions. For example, on the GPCR dataset, our approach achieved an AUPR of 0.737 and 0.730 using min and max respectively, whereas we obtained an AUPR of 0.753 in case of averaging the two predictions.
BLM is a generic framework in which various regressors or classifiers can be used as local models. For example, Bleakley and Yamanishi [28] used support vector machines with a domain-specific kernel, whereas Buza and Peška used a hubness-aware regressor [19]. In our current work, we use BLM with asymmetric loss linear regression which will be described in the next section.

Asymmetric loss models
Local models are the heart of BLM. Next, we propose a new regression technique that we use as a local model. Given a regression model f θ where θ is the vector of parameters, f θ estimates the value of the target y for an instance x asŷ ¼ f y ðxÞ. In order to determine the appropriate parameter values θ � , usually, a loss function L D (θ) is minimised: Note that the actual value of L D (θ) depends both on the dataset D and parameters θ. However, once the dataset is fixed, in particular, while the model is being trained using a given training dataset D, the loss can be seen as a function of the parameter vector θ. Therefore, we aim at finding parameters θ � that minimise the loss. A wide-spread loss function is mean squared errors: where |D| is the number of instances in D.
While the sum of squared errors is popular, we argue that in case of DTI, it is not fully consistent with the underlying chemical reality. In particular, binding energy may be different for various interactions. Consequently, in case of the presence of an interaction (y = + 1), we should not penalise a model that predicts a score that is higher than + 1. Similarly, in case of an unknown interaction (y = −1), we do not want to penalise a model that predicts a score that is lower than −1. Therefore, we propose an asymmetric loss function. First, we define the error of the model f θ for a single prediction f θ (x), for instance x with label y as We define mean asymmetric loss (MAL) as the mean of the above errors for all instances of the dataset D: The above loss can be minimised with various optimisation techniques ranging from gradient-based methods to more advanced approaches, see e.g. [35]. For simplicity, we decided to use gradient descent. The partial derivative @MAL D ðyÞ @y of MAL D (θ) is: where @errðf y ; x; yÞ @y ¼ 0 if f y ðxÞ > þ1 and y ¼ þ1 0 if f y ðxÞ < À 1 and y ¼ À 1 2 f y ðxÞ À y ð Þ @f y ðxÞ @y otherwise: In case of linear regression where x = (x 1 , . . ., x k ), θ = {w 0 , w 1 , . . .w k }, and the model is while the partial derivative according to w 0 is if f y ðxÞ < À 1 and y ¼ À 1 2ðf y ðxÞ À yÞ otherwise: We propose to use stochastic gradient descent to optimise MAL D . The pseudocode of the resulting asymmetric loss linear regression (ALLR) is shown in Fig 2.

Weighted profile
One of the shortcomings of the BLM approach is that it does not handle the case of new drugs/ targets. With new drug (or new target, respectively), we mean a drug d (target t) that does not have any known interaction in the (training) data. In such cases, BLM labels all targets (drugs) as "−1", consequently, no reasonable local model can be learned. In order to alleviate this problem, we use the weighted profiles [17] of the most similar drugs/targets to obtain predictions for new drugs/targets. Given a new drug d i , and a target t j , we predict the likelihood of the interaction between d i and t j as follows:ŷ where N D ðd i Þ denotes the set of indices of the k d most similar drugs to d i (not including d i itself) based on the drug-drug similarities S D . The intuition behind Eq (11) is that similar drugs are likely to behave similarly in terms of their interaction with a given target. Therefore, drugs are weighed according to their similarity to the new drug d i and we calculate the weighted average of the known interactions of other drugs with the same target.
The case of new targets is analogous. Given a new target t j and a drug d i , the weighted profile approach can be used to calculate the prediction for the likelihood of the interaction between d i and t j as follows:ŷ where N T ðt j Þ denotes the set of indices of the k t most similar targets to t j (not including t j itself) based on the target-target similarities S T . Although the weighted profile approach is more general than BLM, in the sense that it can be used for new drugs/targets as well, the predictions of the weighted profile approach are usually less accurate than the predictions of BLM. Therefore, we use the weighted profile approach instead of BLM only in case of new drugs and targets.

Our approach
We summarise our approach as follows. We use BLM for drug-target interaction prediction with the proposed asymmetric loss linear regression as local model in cases when the corresponding drug (target) has at least one known interaction and therefore the local model has at least one positive training instance. When initialising the parameters of ALLR, we use σ = 10 −8 . We train each ALLR model with a learning rate η = 10 −3 for e = 100 epochs. According to our observations, ALLR is robust in the sense that the aforementioned settings allowed

PLOS ONE
ALLR to converge to a model that outperformed other DTI techniques on all the examined datasets (see Table 4).
While predicting the interaction score between drug d and target t with ALLR, we represent each drug (target) as a vector of its similarities to all the drugs (targets) and its interactions, except the interactions with t (or d respectively), because the interactions with d (t) serve as labels for the local models, see Fig 3 for an illustration.
In case of new drugs (targets), we predict the likelihood of interactions using the weighted profile approach with k d = k t = 5.

Comparison with baselines
As our approach, MOLIERE, is based on BLM, and uses weighted profile (WP) in case of new interactions, first, we compared the performance of MOLIERE to that of the original BLM and WP according to the widely used leave-one-interaction-out cross-validation protocol, see e.g. [28,30,31].
The predictions were evaluated both in terms of Area Under ROC Curve (AUC) and Area Under Precision-Recall Curve (AUPR). Table 2 shows that MOLIERE clearly outperforms both BLM and WP, both in terms of AUC and AUPR. As the proposed asymmetric loss linear regression is a key component of MOLIERE, we examined the performance in case of alternative regression techniques, in particular we examined the cases when we use (a) standard linear regression and (b) logistic regression instead of ALLR. As expected, the proposed asymmetric loss linear regression indeed improves the quality of predictions, see Table 3.

Comparison with recent DTI techniques
We compared MOLIERE with state-of-the-art drug-target interaction prediction techniques: two recent versions of BLM and three further prominent DTI approaches. The former include BLM with neighbour-based interaction-profile inferring (BLM-NII) [31] and hubness-aware regressors as local models (HLM) [19], while the later refer to net Laplacian regularized least squares (NetLapRLS) [29], a combination of weighted nearest neighbour and Gaussian interaction profile kernels (WNN-GIP) [36], and Bayesian Ranking for Drug-Target Interaction Prediction (BRDTI) [20].
[37] pointed out that leave-one-out cross-validation may lead to overoptimistic results. Therefore, in this section, we used the interaction-based 5 × 5-fold cross-validation protocol, i.e., 5-fold cross-validation is repeated 5-times with different initial data splits. In each of the 5 × 5 rounds of the cross-validation, one fifth of the drug-target pairs were in the test data and AUC and AUPR values were calculated. The reported results are averaged values. In order to judge if the observed differences are statistically significant, we used paired ttest with significance threshold of p = 0.01.
Essential hyperparameters of BLM-NII, HLM, WNN-GIP, NetLapRLS and BRDTI were learned via grid-search in internal 5-fold cross-validation on the training data. For other hyperparameters that are not expected to have major impact on the results, we used default values according to the publication in which the approach was published.
In particular, for BLM-NII, as proposed by Mei et al.
[31], the max function was used to generate final predictions and the weight α for the combination of structural and collaborative similarities was learned from {0.0, 0.1, . . ., 1.0}.
In case of HLM, according to [19], we performed experiments with N = 25 base prediction models, while the number of nearest neighbours for the local model ECkNN and the number of selected features, were learned from {3, 5, 7} and {10, 20, 50} respectively.
The content regularisation λ c of BRDTI was learned from {0.1, 0.5, 0.9, 1.5}. The number of latent factors f, number of iterations, global regularisation λ g and initial learning rate η were set to 100, 50, 0.01 and 0.1 respectively.
The results are shown in Table 4 and Fig 4 which show the precision-recall curves for MOLIERE and its competitors. Our approach, MOLIERE, outperforms all the examined competitors in case of Enzyme, Ion Channel, GPCR and NR datasets, both in terms of AUC and AUPR. In the vast majority of the cases, the difference is statistically significant.
The results indicate that our approach, MOLIERE, is the overall best-performing approach out of the examined DTI techniques.

Prediction of new interactions
In order to demonstrate that our approach is able to predict new interactions, we followed the same protocol as in [19], i.e., we trained our approach, MOLIERE, as well as its competitors, BLM-NII, HLM, NetLapRLS and WNN-GIP using all the interactions of the original datasets. As mentioned before, these datasets have been extracted roughly a decade ago and several additional interactions have been validated meanwhile. Our experiment aims to check whether these recently validated interactions can be predicted based on the original interactions.
In particular, we considered those drug-target pairs that have unknown interaction status in the original datasets. We ranked these pairs according to their predicted interaction scores. For simplicity, we use the term predicted interaction for the top-ranked 20 drug-target pairs. The results are shown in Tables 5-7 for Enzyme, GPCR and IC datasets (for drug and target identifiers, see: http://www.kegg.jp/). As one can see, many of the predicted interactions are validated. We point out that some of the validated interactions were only predicted by our approach, especially in case of the GPCR dataset.

MOLIERE for drug repurposing
In order to illustrate how our predictions may contribute to drug repurposing, we discuss some of the predicted interactions in more details.
First, we consider Diazoxide (KEGG ID: D00294) and "adenosine triphosphate binding cassette, sub-family C member 9" (ABCC9), also known as "sulfonylurea receptor 2" (SUR2) (KEGG ID: hsa:10060), i.e., the second predicted interaction listed in Table 7 Next, we consider the predicted interaction between Isradipine (KEGG ID: D00349) and "Calcium Voltage-Gated Channel Subunit Alpha1 A" (CACNA1A), also known as "spinocerebellar ataxia type 6" (SCA6) (KEGG ID: hsa:773), listed in the 11th line of Table 7. According to KEGG, Isradipine is an L type dihydropyridine calcium channel blocker that is used in hypertension. CACNA1A gene encodes the alpha-1A subunit of P/Q type voltage-dependent calcium channel. Mutations of this gene can cause spinocerebellar ataxia, early infantile epileptic encephalopathy, episodic ataxias, hemiplegic migraine and hemiconvulsion-hemiplegiaepilepsy syndrome. Some mutations of the gene increases the density of functional channels and their open probabilities in familial hemiplegic migraine [48]. Since our method takes into consideration the similarity of the two different calcium channels, it may be worth to try Isradipine inhibition of the P/Q type voltage-dependent calcium channel in experimental settings.
Finally, we consider the predicted interaction between Caffeine (KEGG ID: D00528) and Cystic fibrosis transmembrane conductance regulator (CFTR, KEGG ID: hsa:1080), listed in the 16th line of Table 7. Caffeine is a central nervous system stimulant, adenosine receptor antagonist and phosphodiesterase inhibitor (1). CFTR is a chloride channel that conducts chloride ions in lung, pancreas, liver, digestive tract and reproductive tract epithelial cell membranes. According to KEGG, gene mutations can cause cystic fibrosis (CF), hereditary pancreatitis and congenital bilateral absence of vas deferens. In rats caffeine intake increased CFTR chloride secretion in intestine [49]. Although, caffeine consumption is basically not recommended for CF patients, if some of the patients actually drink coffee, it would be interesting to compare their disease status with other CF patients not drinking coffee. Such a survey should be carefully designed in order to avoid biases. For example, the number of patients involved in the study should be large enough, one should take into account that people who have more sever disease may pay more attention to the health and lifestyle suggestions, while the type of mutations is also important.

Conclusion
In this paper, we focused on drug-target interaction prediction and proposed a new method, MOLIERE for this task. Despite the fact that MOLIERE is a relatively simple approach, experiments on real-world datasets show that MOLIERE outperforms state-of-the-art DTI methods. By discussing some of the predictions in detail, we showed how our approach may lead to medically relevant hypothesis and support drug repositioning.
As mentioned, the DTI problem shares inherent characteristics with recommender systems tasks, therefore, we expect that MOLIERE will be adapted for recommendation tasks in the future. Furthermore, we point out that the proposed framework of asymmetric loss models is not limited to drug-target interaction prediction, but it may be useful in other cases where the class label is originally continuous (due to the underlying physical, chemical, biological phenomena), but it has been transformed to a binary label.
As for the limitations of our study, we note that our approach is not designed to predict interactions in case of new drugs/targets, i.e., for drugs/targets for which not even one interaction is known. In our current work, we assumed that only few new drugs/targets are considered, and we used the simple weighted profile approach in this case. Therefore, further methodical enhancements are required, if predictions are desired for new drugs/targets.