Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction

In pharmaceutical sciences, a crucial step of the drug discovery process is the identification of drug-target interactions. However, only a small portion of the drug-target interactions have been experimentally validated, as the experimental validation is laborious and costly. To improve the drug discovery efficiency, there is a great need for the development of accurate computational approaches that can predict potential drug-target interactions to direct the experimental verification. In this paper, we propose a novel drug-target interaction prediction algorithm, namely neighborhood regularized logistic matrix factorization (NRLMF). Specifically, the proposed NRLMF method focuses on modeling the probability that a drug would interact with a target by logistic matrix factorization, where the properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively. Moreover, NRLMF assigns higher importance levels to positive observations (i.e., the observed interacting drug-target pairs) than negative observations (i.e., the unknown pairs). Because the positive observations are already experimentally verified, they are usually more trustworthy. Furthermore, the local structure of the drug-target interaction data has also been exploited via neighborhood regularization to achieve better prediction accuracy. We conducted extensive experiments over four benchmark datasets, and NRLMF demonstrated its effectiveness compared with five state-of-the-art approaches.


Introduction
Drug discovery is one of the primary objectives of the pharmaceutical sciences, an interdisciplinary research field that draws on fundamental sciences such as biology, chemistry, physics, and statistics. In the drug discovery process, the prediction of drug-target interactions (DTIs) is an important step that aims to identify potential new drugs or new targets for existing drugs. Therefore, it can help guide the experimental validation and reduce costs. In recent years, DTI prediction has attracted considerable research attention, and numerous algorithms have been proposed [1]. Existing methods predict DTIs based on a small number of experimentally validated interactions in existing databases, such as ChEMBL [2], DrugBank [3], KEGG DRUG [4], and SuperTarget [5]. Previous studies have shown that a fraction of new interactions between drugs and targets can be predicted based on the experimentally validated DTIs, and the computational methods for identifying DTIs can significantly improve the drug discovery efficiency.
In general, traditional computational methods proposed for DTI prediction can be categorized into two main groups: docking simulation approaches and ligand-based approaches [6][7][8]. The docking simulation approaches predict potential DTIs by considering the structural information of target proteins. However, docking simulation is very time-consuming, and the structural information may not be available for some protein families, for example the G-protein coupled receptors (GPCRs). In the ligand-based approaches, potential DTIs are predicted by comparing a candidate ligand with the known ligands of the target proteins. These approaches may not perform well for targets with a small number of known ligands.
Recently, the rapid development of machine learning techniques has provided effective and efficient ways to predict DTIs. An intuitive idea is to formulate DTI prediction as a binary classification problem, where the drug-target pairs are treated as instances, and the chemical structures of drugs and the amino acid subsequences of targets are treated as features. Then, classical classification methods can be used, e.g., support vector machines (SVM) [9] and regularized least squares (RLS) [10]. For example, in [11], an SVM model was utilized to classify a given drug-target pair into interaction and non-interaction, considering the amino acid sequences of proteins, chemical structures, and the mass spectrometry data. Bleakley and Yamanishi proposed a supervised approach for DTI prediction based on the bipartite local models (BLMs), where SVM was used to build the local models [12]. Xia et al. proposed a semi-supervised DTI prediction approach, namely Laplacian regularized least squares (LapRLS), and extended it to incorporate the kernel constructed from the known DTI network [13]. van Laarhoven et al. defined a Gaussian interaction profile (GIP) kernel to represent the interactions between drugs and targets, and they employed RLS with the GIP kernel for DTI prediction problems [14,15]. Cheng et al. developed three supervised inference methods for DTI prediction based on the complex network theory [16]. Mei et al. integrated the BLM method with a neighbor-based interaction-profile inferring (NII) procedure to form a DTI prediction approach called BLM-NII, where the RLS classifier with GIP kernel was used as the local model [17]. Moreover, Yamanishi et al. developed a web server called DINIES, which utilized supervised machine learning techniques, e.g., pairwise kernel learning and distance metric learning, to predict unknown DTIs from different sources of biological data [18]. Ding et al. used a uniform experimental setting to empirically review the advantages and limitations of existing similarity-based learning approaches for DTI prediction [19]. Furthermore, other auxiliary information has also been exploited for DTI prediction. For example, in [20], Li et al. developed a computational framework that integrated literature mining and the protein and drug connectivity information derived from protein interaction networks to build the disease-specific drug-protein connectivity maps. In [21], Chen et al. utilized the data from public datasets to build a semantic linked network connecting drugs and targets. A statistical model was also proposed to evaluate the association of drug-target pairs.
Essentially, the DTI prediction problem is a recommendation task that aims to suggest a list of potential DTIs. Thus, another line of research for DTI prediction is the application of recommendation technologies. In the literature, collaborative filtering (CF) based approaches are the most widely adopted recommendation methods, which can be categorized into two main groups, i.e., memory-based CF and model-based CF approaches [22,23]. As the most successful model-based CF approach, matrix factorization has been explored for DTI prediction in recent studies. For example, Gönen proposed a kernelized Bayesian matrix factorization (KBMF) method, which combined the kernel-based dimensionality reduction, matrix factorization, and binary classification for DTI prediction [24]. Cobanoglu et al. utilized probabilistic matrix factorization (PMF) [25] to predict unknown DTIs [26]. The accuracy of the PMF based approach was further improved by an active learning strategy. Moreover, Zheng et al. introduced the multiple similarities collaborative matrix factorization (MSCMF) model, which exploited multiple kinds of drug similarities and target similarities to improve the DTI prediction accuracy [27].
In this paper, we propose a novel matrix factorization approach, namely neighborhood regularized logistic matrix factorization (NRLMF), for DTI prediction. The proposed NRLMF method focuses on predicting the probability that a drug would interact with a target. Specifically, the properties of a drug and a target are represented by two latent vectors in the shared low dimensional latent space, respectively. For each drug-target pair, the interaction probability is modeled by a logistic function of the drug-specific and target-specific latent vectors. This is different from the KBMF method [24] that predicts the interaction probability using a standard normal cumulative distribution function of the drug-specific and target-specific latent vectors [28]. In NRLMF, an observed interacting drug-target pair (i.e., positive observation) is treated as $c$ ($c \geq 1$) positive examples, while an unknown pair (i.e., negative observation) is treated as a single negative example. As such, NRLMF assigns higher importance levels to positive observations than to negative ones, because the positive observations are biologically validated and thus usually more trustworthy, whereas the negative observations could contain potential DTIs and are thus unreliable. This differs from previous matrix factorization based DTI prediction methods [24,26,27] that treat the interaction and unknown pairs equally.
Additionally, NRLMF also studies the local structure of the interaction data to further improve the DTI prediction accuracy, by exploiting the neighborhood influences from the most similar drugs and the most similar targets. In particular, NRLMF imposes individual regularization constraints between the latent representations of a drug and its nearest neighbors, i.e., the drugs that are most similar to the given drug. Similar neighborhood regularization constraints have also been added on the latent representations of targets. Note that this neighborhood regularization method is different from previous approaches that exploit the drug similarities and target similarities using kernels [13,14,17,29] or by factorizing the similarity matrices [27]. Moreover, the proposed approach considers only the nearest neighbors instead of all similar neighbors as used in previous approaches, which avoids noisy information and thus achieves more accurate results.
The performances of NRLMF were empirically evaluated on four benchmark datasets, compared with five state-of-the-art DTI prediction methods. Experimental results showed that NRLMF usually outperformed other competing methods on all datasets under different experimental settings, in terms of the widely adopted measures, i.e., the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR). In addition, the practical prediction ability of NRLMF was also confirmed by mapping with the latest version of online biological databases, including ChEMBL [2], DrugBank [30], KEGG [4], and Matador [5].

Materials
The performances of DTI prediction algorithms were evaluated on four benchmark datasets, including Nuclear Receptors, G-Protein Coupled Receptors (GPCR), Ion Channels, and Enzymes. These datasets were originally provided by [31] and are publicly available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Table 1 summarizes the statistics of all four datasets. Each dataset contains three types of information: 1) the observed DTIs, 2) the drug similarities, and 3) the target similarities. Particularly, the observed DTIs were retrieved from public databases KEGG BRITE [32], BRENDA [33], SuperTarget [5], and DrugBank [3]. The drug similarities were computed based on the chemical structures of the compounds derived from the DRUG and COMPOUND sections in the KEGG LIGAND database [32]. For a pair of compounds, the similarity between their chemical structures was measured by the SIM-COMP algorithm [34]. The target similarities, on the other hand, were calculated based on the amino acid sequences of target proteins, which were retrieved from the KEGG GENES database [32]. The normalized Smith-Waterman score was used to compute the sequence similarity between two proteins.

Problem Formalization
In this paper, the set of drugs is denoted by $D = \{d_i\}_{i=1}^{m}$, and the set of targets is denoted by $T = \{t_j\}_{j=1}^{n}$, where $m$ and $n$ are the number of drugs and the number of targets, respectively. The interactions between drugs and targets are represented by a binary matrix $Y \in \mathbb{R}^{m \times n}$, where each element $y_{ij} \in \{0, 1\}$. If a drug $d_i$ has been experimentally verified to interact with a target $t_j$, $y_{ij}$ is set to 1; otherwise, $y_{ij}$ is set to 0. The non-zero elements in $Y$ are called "interaction pairs" and regarded as positive observations. The zero elements in $Y$ are called "unknown pairs" and regarded as negative observations. We define the sets of positive drugs and positive targets as $D^{+} = \{d_i \mid \sum_{j=1}^{n} y_{ij} > 0, \forall\, 1 \le i \le m\}$ and $T^{+} = \{t_j \mid \sum_{i=1}^{m} y_{ij} > 0, \forall\, 1 \le j \le n\}$, respectively. Then, the set of negative drugs (i.e., new drugs without any known interaction targets) and the set of negative targets (i.e., new targets without any known interaction drugs) are defined as $D^{-} = D \setminus D^{+}$ and $T^{-} = T \setminus T^{+}$, respectively. In addition, the drug similarities are represented by $S^{d} \in \mathbb{R}^{m \times m}$, where the $(i, \mu)$ element $s^{d}_{i\mu}$ is the similarity between $d_i$ and $d_\mu$. The target similarities are described using $S^{t} \in \mathbb{R}^{n \times n}$, where the $(j, \nu)$ element $s^{t}_{j\nu}$ is the similarity between $t_j$ and $t_\nu$. The objective of this study is to first predict the interaction probability of a drug-target pair and subsequently rank the candidate drug-target pairs according to the predicted probabilities in descending order, such that the top-ranked pairs are the most likely to interact.
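To make the notation concrete, the following minimal sketch derives the positive and negative drug and target sets from a small interaction matrix. The function name `split_positive_negative` and the toy matrix are assumptions made for illustration; drugs and targets are referred to by their row and column indices.

```python
def split_positive_negative(Y):
    """Return (D+, D-, T+, T-) as sets of row/column indices of Y."""
    m, n = len(Y), len(Y[0])
    # A drug is positive if its row contains at least one known interaction.
    pos_drugs = {i for i in range(m) if sum(Y[i]) > 0}
    # A target is positive if its column contains at least one known interaction.
    pos_targets = {j for j in range(n) if sum(Y[i][j] for i in range(m)) > 0}
    neg_drugs = set(range(m)) - pos_drugs
    neg_targets = set(range(n)) - pos_targets
    return pos_drugs, neg_drugs, pos_targets, neg_targets


# Toy example: 3 drugs (rows) and 3 targets (columns).
Y = [
    [1, 0, 0],
    [0, 0, 0],  # a new drug without any known target
    [0, 1, 0],
]
pos_d, neg_d, pos_t, neg_t = split_positive_negative(Y)
```

In this toy matrix the second drug and the third target have no known interactions, so they fall into the negative sets.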

Logistic Matrix Factorization
The matrix factorization technique has been successfully applied for DTI prediction in previous studies. In this work, we develop the DTI prediction model based on logistic matrix factorization (LMF) [35], which has been demonstrated to be effective for personalized recommendations. The primary idea of applying LMF for DTI prediction is to model the probability that a drug would interact with a target. In particular, both drugs and targets are mapped into a shared latent space with a low dimensionality $r$, where $r \ll \min(m, n)$. The properties of a drug $d_i$ and a target $t_j$ are described by two latent vectors $u_i \in \mathbb{R}^{1 \times r}$ and $v_j \in \mathbb{R}^{1 \times r}$, respectively. Then, the interaction probability $p_{ij}$ of a drug-target pair $(d_i, t_j)$ is modeled by the following logistic function:
$$p_{ij} = \frac{\exp(u_i v_j^{\top})}{1 + \exp(u_i v_j^{\top})} \tag{1}$$
For simplicity, we further denote the latent vectors of all drugs and all targets by $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ respectively, where $u_i$ is the $i$-th row of $U$ and $v_j$ is the $j$-th row of $V$.
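As a small illustration of the model, the sketch below computes the interaction probability as the logistic function of the inner product of a drug-specific and a target-specific latent vector. The function name and the example vectors are hypothetical; only the logistic form comes from the text.

```python
import math


def interaction_probability(u, v):
    """Logistic function of the inner product of two latent vectors."""
    score = sum(a * b for a, b in zip(u, v))
    # Branch on the sign of the score so that exp() never overflows.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


# Hypothetical latent vectors with r = 3.
u_i = [0.5, -0.2, 0.1]
v_j = [0.4, 0.3, -0.6]
p_ij = interaction_probability(u_i, v_j)
```

A zero inner product gives a probability of 0.5, and the probability grows monotonically with the inner product.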
In DTI prediction tasks, the observed interacting drug-target pairs have been experimentally verified, thus they are more trustworthy and important than the unknown pairs. Towards a more accurate modeling for DTI prediction, we propose to assign higher importance levels to the interaction pairs than to the unknown pairs. In particular, each interaction pair is treated as $c$ ($c \geq 1$) positive training examples, and each unknown pair is treated as a single negative training example. Here, $c$ is a constant used to control the importance levels of observed interactions and is empirically set to 5 in the experiments. This importance weighting strategy has been demonstrated to be effective for personalized recommendations [35][36][37]. However, to the best of our knowledge, it has not been explored for DTI prediction in previous studies.
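By assuming that all the training examples are independent, the probability of the observations is as follows:
$$p(Y \mid U, V) = \Bigg( \prod_{\substack{1 \le i \le m,\ 1 \le j \le n \\ y_{ij} = 1}} \Big[ p_{ij}^{y_{ij}} (1 - p_{ij})^{(1 - y_{ij})} \Big]^{c} \Bigg) \times \Bigg( \prod_{\substack{1 \le i \le m,\ 1 \le j \le n \\ y_{ij} = 0}} p_{ij}^{y_{ij}} (1 - p_{ij})^{(1 - y_{ij})} \Bigg) \tag{2}$$
Note that when $y_{ij} = 1$, $c(1 - y_{ij}) = 1 - y_{ij}$, and when $y_{ij} = 0$, $c y_{ij} = y_{ij}$. Hence, we can rewrite Eq (2) as follows:
$$p(Y \mid U, V) = \prod_{i=1}^{m} \prod_{j=1}^{n} p_{ij}^{c y_{ij}} (1 - p_{ij})^{(1 - y_{ij})} \tag{3}$$
In addition, we also place zero-mean spherical Gaussian priors on the latent vectors of drugs and targets as:
$$p(U \mid \sigma_d^2) = \prod_{i=1}^{m} \mathcal{N}(u_i \mid 0, \sigma_d^2 I), \qquad p(V \mid \sigma_t^2) = \prod_{j=1}^{n} \mathcal{N}(v_j \mid 0, \sigma_t^2 I) \tag{4}$$
where $\sigma_d^2$ and $\sigma_t^2$ are parameters controlling the variances of the Gaussian distributions, and $I$ denotes the identity matrix. Hence, through Bayesian inference, we have
$$p(U, V \mid Y, \sigma_d^2, \sigma_t^2) \propto p(Y \mid U, V)\, p(U \mid \sigma_d^2)\, p(V \mid \sigma_t^2) \tag{5}$$
The log of the posterior distribution is thus derived as follows:
$$\log p(U, V \mid Y, \sigma_d^2, \sigma_t^2) = \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ c y_{ij} u_i v_j^{\top} - (1 + c y_{ij} - y_{ij}) \log\big(1 + \exp(u_i v_j^{\top})\big) \Big] - \frac{1}{2\sigma_d^2} \sum_{i=1}^{m} \|u_i\|_2^2 - \frac{1}{2\sigma_t^2} \sum_{j=1}^{n} \|v_j\|_2^2 + C \tag{6}$$
where $C$ is a constant term independent of the model parameters (i.e., $U$ and $V$). The model parameters can then be learned by maximizing the posterior distribution, which is equivalent to minimizing the following objective function:
$$\min_{U, V} \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ (1 + c y_{ij} - y_{ij}) \log\big(1 + \exp(u_i v_j^{\top})\big) - c y_{ij} u_i v_j^{\top} \Big] + \frac{\lambda_d}{2} \|U\|_F^2 + \frac{\lambda_t}{2} \|V\|_F^2 \tag{7}$$
where $\lambda_d = 1/\sigma_d^2$, $\lambda_t = 1/\sigma_t^2$, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The problem in Eq (7) can be solved using an alternating gradient descent method [35].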
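For illustration, the objective can be evaluated directly from its description: each interaction pair contributes $c$ times to the logistic loss, each unknown pair contributes once, and the Gaussian priors become squared-norm penalties. The sketch below is a plain-Python rendering under these assumptions; the names `softplus` and `lmf_objective` and the default values of `lam_d` and `lam_t` are chosen for this example only.

```python
import math


def softplus(x):
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lmf_objective(Y, U, V, c=5.0, lam_d=0.5, lam_t=0.5):
    """Weighted logistic loss plus Frobenius-norm regularization."""
    total = 0.0
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            s = sum(a * b for a, b in zip(u, v))  # inner product u_i v_j^T
            y = Y[i][j]
            # Weight is c for an interaction pair and 1 for an unknown pair.
            total += (1 + c * y - y) * softplus(s) - c * y * s
    total += 0.5 * lam_d * sum(x * x for u in U for x in u)
    total += 0.5 * lam_t * sum(x * x for v in V for x in v)
    return total


# With all latent vectors at zero, every pair contributes weight * log(2).
Y = [[1, 0], [0, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
value_at_zero = lmf_objective(Y, zeros, zeros, c=5.0)
```

With one interaction pair weighted by $c = 5$ and three unknown pairs weighted by 1, the value at zero is $8 \log 2$, which is a convenient sanity check for an implementation.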

Regularized by Neighborhood
By mapping both drugs and targets into a shared latent space, the LMF model can effectively estimate the global structure of the DTI data. However, LMF ignores the strong neighborhood associations among a small set of closely related drugs or targets. Thus, we propose to exploit the nearest neighborhood of a drug and that of a target to further improve the DTI prediction accuracy. For a drug $d_i$, we denote the set of its nearest neighbors by $N(d_i) \subseteq D \setminus \{d_i\}$, where $N(d_i)$ is constructed by choosing the $K_1$ drugs most similar to $d_i$. Similarly, we construct the set $N(t_j) \subseteq T \setminus \{t_j\}$, which consists of the $K_1$ targets most similar to $t_j$. In the experiments, we empirically set $K_1$ to 5.
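In this paper, the drug neighborhood information is represented using an adjacency matrix $A$, where the $(i, \mu)$ element $a_{i\mu}$ is defined as follows:
$$a_{i\mu} = \begin{cases} s^{d}_{i\mu} & \text{if } d_\mu \in N(d_i) \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
Similarly, the adjacency matrix used to describe the target neighborhood information is denoted by $B$, where its $(j, \nu)$ element $b_{j\nu}$ is defined as follows:
$$b_{j\nu} = \begin{cases} s^{t}_{j\nu} & \text{if } t_\nu \in N(t_j) \\ 0 & \text{otherwise} \end{cases} \tag{9}$$
Note that the adjacency matrices $A$ and $B$ are not symmetric. The primary idea of exploiting the drug neighborhood information for DTI prediction is to minimize the distances between $d_i$ and its nearest neighbors $N(d_i)$ in the latent space. This objective can be achieved by minimizing the following objective function:
$$\frac{\alpha}{2} \sum_{i=1}^{m} \sum_{\mu=1}^{m} a_{i\mu} \|u_i - u_\mu\|_F^2 = \frac{\alpha}{2} \operatorname{tr}(U^{\top} L^{d} U) \tag{10}$$
where $\operatorname{tr}(\cdot)$ is the trace of a matrix, $L^{d} = (D^{d} + \tilde{D}^{d}) - (A + A^{\top})$, and $D^{d}$ and $\tilde{D}^{d}$ are two diagonal matrices, in which the diagonal elements are $D^{d}_{ii} = \sum_{\mu=1}^{m} a_{i\mu}$ and $\tilde{D}^{d}_{\mu\mu} = \sum_{i=1}^{m} a_{i\mu}$ respectively. Moreover, we also exploit the neighborhood information of targets for DTI prediction by minimizing the following objective function:
$$\frac{\beta}{2} \sum_{j=1}^{n} \sum_{\nu=1}^{n} b_{j\nu} \|v_j - v_\nu\|_F^2 = \frac{\beta}{2} \operatorname{tr}(V^{\top} L^{t} V) \tag{11}$$
where $L^{t} = (D^{t} + \tilde{D}^{t}) - (B + B^{\top})$, and $D^{t}$ and $\tilde{D}^{t}$ are diagonal matrices with $D^{t}_{jj} = \sum_{\nu=1}^{n} b_{j\nu}$ and $\tilde{D}^{t}_{\nu\nu} = \sum_{j=1}^{n} b_{j\nu}$. Note that the proposed neighborhood regularization only considers influences from the $K_1$ nearest neighbors of each drug and each target. It is different from the graph Laplacian constraints used in previous studies [38,39], which consider influences from all similar drugs and targets. Given a drug-target pair, we leverage only their nearest neighbors, instead of all the neighbors that could potentially introduce noisy information, to enhance the prediction accuracy.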
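The construction above can be sketched in a few lines of plain Python. The sketch assumes that an adjacency entry equals the similarity value when the second entity is among the $K_1$ nearest neighbors of the first and is zero otherwise, and that the regularization matrix is the sum of the row-sum and column-sum diagonal matrices minus the adjacency matrix and its transpose. The function names and the toy similarity matrix are hypothetical.

```python
def knn_adjacency(S, k):
    """A[i][mu] = S[i][mu] if mu is one of the k nearest neighbors of i, else 0."""
    n = len(S)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [mu for mu in range(n) if mu != i]  # exclude the entity itself
        others.sort(key=lambda mu: S[i][mu], reverse=True)
        for mu in others[:k]:
            A[i][mu] = S[i][mu]
    return A


def neighborhood_laplacian(A):
    """L = (D + D~) - (A + A^T), with D the row sums and D~ the column sums of A."""
    n = len(A)
    row = [sum(A[i]) for i in range(n)]
    col = [sum(A[i][mu] for i in range(n)) for mu in range(n)]
    L = [[-(A[i][mu] + A[mu][i]) for mu in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] += row[i] + col[i]
    return L


# Toy drug similarity matrix with one nearest neighbor per drug (k = 1).
S_d = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
A = knn_adjacency(S_d, k=1)
L_d = neighborhood_laplacian(A)
```

In this example drug 2 selects drug 1 as its nearest neighbor while drug 1 selects drug 0, so the adjacency matrix is not symmetric, whereas the resulting regularization matrix is.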

NRLMF
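The final DTI prediction model can be formulated by considering the drug-target interactions as well as the neighborhood of drugs and targets. By plugging Eqs (10) and (11) into Eq (7), the proposed NRLMF model is formulated as follows:
$$\min_{U, V} \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ (1 + c y_{ij} - y_{ij}) \log\big(1 + \exp(u_i v_j^{\top})\big) - c y_{ij} u_i v_j^{\top} \Big] + \frac{1}{2} \operatorname{tr}\big[U^{\top} (\lambda_d I + \alpha L^{d}) U\big] + \frac{1}{2} \operatorname{tr}\big[V^{\top} (\lambda_t I + \beta L^{t}) V\big] \tag{12}$$
The optimization problem in Eq (12) can be solved by an alternating gradient descent procedure. Denoting the objective function in Eq (12) by $L$, the partial gradients with respect to $U$ and $V$ are as follows:
$$\frac{\partial L}{\partial U} = P V + (c - 1)(Y \odot P) V - c Y V + (\lambda_d I + \alpha L^{d}) U, \qquad \frac{\partial L}{\partial V} = P^{\top} U + (c - 1)(Y^{\top} \odot P^{\top}) U - c Y^{\top} U + (\lambda_t I + \beta L^{t}) V \tag{13}$$
where $P \in \mathbb{R}^{m \times n}$ is the matrix whose $(i, j)$ element is $p_{ij}$ (see Eq (1)), and $\odot$ denotes the Hadamard product of two matrices. To accelerate the convergence of the gradient descent optimization, we use the AdaGrad algorithm [40] to adaptively choose the gradient step size. The details of the optimization algorithm for the proposed NRLMF model are described in Algorithm 1, where $U$ and $V$ are randomly initialized using a Gaussian distribution with mean 0 and standard deviation $1/\sqrt{r}$.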
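Algorithm 1
Output: $U$, $V$
Initialize $U$ and $V$ randomly, and set $\phi_{ik} = 0$, $\varphi_{jk} = 0$, $\forall\, 1 \le i \le m$, $1 \le j \le n$, and $1 \le k \le r$;
Construct the adjacency matrices $A$ and $B$ according to Eq (8) and Eq (9) respectively;
Compute the neighborhood regularization matrices $L^{d}$ and $L^{t}$ according to Eq (10) and Eq (11) respectively;
for $t = 1, \ldots,$ max_iter do
    $G^{d} \leftarrow \partial L / \partial U$; // fix $V$ and compute the gradient with respect to $U$
    for $i = 1, \ldots, m$ do
        for $k = 1, \ldots, r$ do // $g^{d}_{ik}$ and $u_{ik}$ are the $(i, k)$ elements of $G^{d}$ and $U$ respectively
            $\phi_{ik} \leftarrow \phi_{ik} + (g^{d}_{ik})^2$;
            $u_{ik} \leftarrow u_{ik} - \gamma\, g^{d}_{ik} / \sqrt{\phi_{ik}}$;
    $G^{t} \leftarrow \partial L / \partial V$; // fix $U$ and compute the gradient with respect to $V$
    for $j = 1, \ldots, n$ do
        for $k = 1, \ldots, r$ do // $g^{t}_{jk}$ and $v_{jk}$ are the $(j, k)$ elements of $G^{t}$ and $V$ respectively
            $\varphi_{jk} \leftarrow \varphi_{jk} + (g^{t}_{jk})^2$;
            $v_{jk} \leftarrow v_{jk} - \gamma\, g^{t}_{jk} / \sqrt{\varphi_{jk}}$;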
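The alternating AdaGrad procedure can be sketched in plain Python as follows. This is an illustrative sketch rather than the reference implementation: the function names, the default hyper-parameter values, and the fixed random seed are assumptions, and the neighborhood regularization matrices are assumed to be symmetric and passed in precomputed.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gradient(Y, F, G, Lap, c, lam, reg):
    """Gradient of the objective with respect to F while G is held fixed.

    Y[i][j] pairs row i of F with row j of G, and Lap is the (symmetric)
    neighborhood regularization matrix of the entities described by F.
    """
    r = len(F[0])
    grad = []
    for i, f in enumerate(F):
        g = [lam * f[k] for k in range(r)]       # Gaussian prior term
        for mu, f_mu in enumerate(F):            # reg * (Lap F)_i
            for k in range(r):
                g[k] += reg * Lap[i][mu] * f_mu[k]
        for j, h in enumerate(G):                # weighted logistic loss term
            p = sigmoid(dot(f, h))
            coeff = (1 + (c - 1) * Y[i][j]) * p - c * Y[i][j]
            for k in range(r):
                g[k] += coeff * h[k]
        grad.append(g)
    return grad


def adagrad_step(F, grad, acc, lr):
    """Accumulate squared gradients and take one AdaGrad step in place."""
    for i in range(len(F)):
        for k in range(len(F[i])):
            acc[i][k] += grad[i][k] ** 2
            if acc[i][k] > 0.0:                  # skip coordinates with zero gradient
                F[i][k] -= lr * grad[i][k] / math.sqrt(acc[i][k])


def train_nrlmf(Y, Ld, Lt, r=2, c=5.0, lam=0.125, alpha=0.25, beta=0.125,
                lr=0.5, max_iter=100, seed=0):
    """Alternate AdaGrad updates of U (drugs) and V (targets)."""
    rng = random.Random(seed)
    m, n = len(Y), len(Y[0])
    std = 1.0 / math.sqrt(r)
    U = [[rng.gauss(0.0, std) for _ in range(r)] for _ in range(m)]
    V = [[rng.gauss(0.0, std) for _ in range(r)] for _ in range(n)]
    acc_u = [[0.0] * r for _ in range(m)]
    acc_v = [[0.0] * r for _ in range(n)]
    Yt = [list(col) for col in zip(*Y)]          # transposed view: targets x drugs
    for _ in range(max_iter):
        adagrad_step(U, gradient(Y, U, V, Ld, c, lam, alpha), acc_u, lr)
        adagrad_step(V, gradient(Yt, V, U, Lt, c, lam, beta), acc_v, lr)
    return U, V
```

The same `gradient` function serves both updates because the two partial gradients have the same form once the interaction matrix is transposed. A vectorized implementation would compute the same quantities with matrix products.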
Once the latent vectors $U$ and $V$ have been learned, the probability associated with any unknown drug-target pair $(d_i, t_j)$ can be predicted by Eq (1). However, in the training procedure, the latent vectors of drugs belonging to the negative drug set $D^{-}$ and those of the targets belonging to the negative target set $T^{-}$ are learned solely based on negative observations (i.e., unknown pairs). Some negative observations may in fact be potential positive DTIs. Due to such uncertainty over negative observations, the learned latent vectors of the negative drugs and targets may not be accurate enough to describe their properties. One remedy for this problem is to replace the latent vector of a negative drug/target with a linear combination of the latent vectors of its nearest neighbors in the positive set. For a drug $d_i \in D^{-}$, we denote the set of its $K_2$ nearest neighbors in $D^{+}$ by $N^{+}(d_i)$. Similarly, for a target $t_j \in T^{-}$, the set of its $K_2$ nearest neighbors in $T^{+}$ is denoted by $N^{+}(t_j)$. Note that $N^{+}(d_i)$ and $N^{+}(t_j)$ are built using the same criteria as those used to construct the neighborhood in the training procedure. Then, the prediction of the interaction probability of a drug-target pair $(d_i, t_j)$ is modified as
$$\hat{p}_{ij} = \frac{\exp(\tilde{u}_i \tilde{v}_j^{\top})}{1 + \exp(\tilde{u}_i \tilde{v}_j^{\top})} \tag{14}$$
where
$$\tilde{u}_i = \begin{cases} u_i & \text{if } d_i \in D^{+} \\ \dfrac{1}{\sum_{d_\mu \in N^{+}(d_i)} s^{d}_{i\mu}} \sum_{d_\mu \in N^{+}(d_i)} s^{d}_{i\mu} u_\mu & \text{if } d_i \in D^{-} \end{cases}, \qquad \tilde{v}_j = \begin{cases} v_j & \text{if } t_j \in T^{+} \\ \dfrac{1}{\sum_{t_\nu \in N^{+}(t_j)} s^{t}_{j\nu}} \sum_{t_\nu \in N^{+}(t_j)} s^{t}_{j\nu} v_\nu & \text{if } t_j \in T^{-} \end{cases} \tag{15}$$
Note that Eq (15) shows a general case for smoothing the learned drug-specific and target-specific latent vectors. In the experiments, $K_2$ is empirically set to 5 to simplify the model.
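The smoothing step can be illustrated with the following sketch, which replaces the latent vector of a negative drug (or target) by a combination of the vectors of its nearest positive neighbors. The similarity-normalized weights, the fallback when all similarities are zero, and the function name are assumptions made for this example.

```python
def smooth_latent_vector(idx, S, latent, positive, k):
    """Return the latent vector to use for entity idx at prediction time.

    Positive entities keep their learned vector. A negative entity receives
    the similarity-weighted average of its k most similar positive neighbors.
    """
    if idx in positive:
        return latent[idx]
    # Sort the positive entities by similarity to idx, most similar first.
    neighbors = sorted(sorted(positive), key=lambda mu: S[idx][mu], reverse=True)[:k]
    total = sum(S[idx][mu] for mu in neighbors)
    if total == 0:
        # Assumption: without any informative neighbor, keep the learned vector.
        return latent[idx]
    r = len(latent[idx])
    return [sum(S[idx][mu] * latent[mu][d] for mu in neighbors) / total for d in range(r)]


# Toy example: drug 2 is a new drug, drugs 0 and 1 have known interactions.
S_d = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]
U = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
u_new = smooth_latent_vector(2, S_d, U, positive={0, 1}, k=2)
u_known = smooth_latent_vector(0, S_d, U, positive={0, 1}, k=2)
```

Here the new drug receives weights 0.3 and 0.2 from its two neighbors, so its smoothed vector is approximately (0.6, 0.4), while the vector of a positive drug is returned unchanged.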

Results
We have performed extensive experiments to evaluate the performance of the proposed NRLMF method.

Experimental Settings
Following previous studies [13-15, 19, 24, 27], the performance of the DTI prediction methods was evaluated under five trials of 10-fold cross-validation (CV), and both AUC and AUPR were used as the evaluation metrics. In particular, for each method, we performed 10-fold CV five times, each time with a different random seed. Then, we calculated an AUC score in each repetition of CV and reported a final AUC score that was the average over the five repetitions. The AUPR score was calculated in the same manner. The drug-target interaction matrix $Y \in \mathbb{R}^{m \times n}$ had $m$ rows for drugs and $n$ columns for targets. We conducted CV under three different settings as follows [19,27,41].
• CVS1: CV on drug-target pairs. Random entries in Y (i.e., drug-target pairs) were selected for testing.
• CVS2: CV on drugs. Random rows in Y (i.e., drugs) were blinded for testing.
• CVS3: CV on targets. Random columns in Y (i.e., targets) were blinded for testing.
Under CVS1, in each round, we used 90% of elements in Y as training data and the remaining 10% of elements as test data. Under CVS2, in each round, we used 90% of rows in Y as training data and the remaining 10% of rows as test data. Under CVS3, in each round, we used 90% of columns in Y as training data and the remaining 10% of columns as test data. Note that these three settings CVS1, CVS2, and CVS3 refer to the DTI prediction for 1) new (unknown) pairs, 2) new drugs, and 3) new targets, respectively.
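The three settings differ only in the unit that is held out: a single entry, a whole row, or a whole column of the interaction matrix. A minimal sketch of the fold construction is given below; the function names and the fixed seed are assumptions, and the benchmark code may partition the data differently.

```python
import random


def cv_test_folds(m, n, setting, n_folds=10, seed=0):
    """Return n_folds lists of (drug, target) index pairs used for testing."""
    if setting == "CVS1":    # hold out individual drug-target pairs
        units = [[(i, j)] for i in range(m) for j in range(n)]
    elif setting == "CVS2":  # hold out whole rows (drugs)
        units = [[(i, j) for j in range(n)] for i in range(m)]
    elif setting == "CVS3":  # hold out whole columns (targets)
        units = [[(i, j) for i in range(m)] for j in range(n)]
    else:
        raise ValueError("setting must be 'CVS1', 'CVS2', or 'CVS3'")
    random.Random(seed).shuffle(units)
    # Deal the shuffled units into folds in round-robin fashion.
    return [[pair for unit in units[f::n_folds] for pair in unit]
            for f in range(n_folds)]


def training_matrix(Y, test_pairs):
    """Copy Y and hide the test entries by setting them to 0 (unknown)."""
    hidden = set(test_pairs)
    return [[0 if (i, j) in hidden else Y[i][j] for j in range(len(Y[0]))]
            for i in range(len(Y))]


folds_pairs = cv_test_folds(20, 10, "CVS1")
folds_drugs = cv_test_folds(20, 10, "CVS2")
folds_targets = cv_test_folds(20, 10, "CVS3")
```

Repeating the procedure with five different seeds and averaging the scores corresponds to the five trials of 10-fold CV described above.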
In this paper, we compared the proposed NRLMF method with the following state-of-the-art methods, namely, NetLapRLS [13], KBMF2K [24], BLM-NII [17], WNN-GIP [15], and CMF [27], by testing their prediction capabilities under the above three settings. The settings of the hyper-parameters of each method were as follows. For the matrix factorization based methods, the dimensionality of the latent space $r$ was selected from {50, 100} [27]. In NRLMF, we set $\lambda_d = \lambda_t$ and chose these two parameters from $\{2^{-5}, 2^{-4}, \ldots, 2^{1}\}$. The neighborhood regularization parameters $\alpha$ and $\beta$ of NRLMF were selected from $\{2^{-5}, 2^{-4}, \ldots, 2^{2}\}$ and $\{2^{-5}, 2^{-4}, \ldots, 2^{0}\}$, respectively, and the optimal learning rate $\gamma$ was selected from $\{2^{-3}, 2^{-2}, \ldots, 2^{0}\}$. In KBMF2K, the margin parameter $\nu$ was selected from {0, 1}. For CMF, the regularization coefficient $\lambda_l$ was chosen from $\{2^{-2}, \ldots, 2^{1}\}$, while $\lambda_d$ and $\lambda_t$ were chosen from $\{2^{-3}, 2^{-2}, \ldots, 2^{5}\}$. For NetLapRLS, we set $\gamma_{d2}/\gamma_{d1} = \gamma_{p2}/\gamma_{p1}$, $\beta_d = \beta_p$, and chose their values from $\{10^{-6}, 10^{-5}, \ldots, 10^{2}\}$. In BLM-NII, the linear combination weight $\alpha$ was chosen from $\{0.0, 0.1, \ldots, 1.0\}$, and the max function was used to integrate the interaction scores predicted independently from the drug side and the target side. For WNN-GIP, the decay value $T$ was chosen from $\{0.1, 0.2, \ldots, 0.9\}$. We set the weighting parameters $\alpha_d = \alpha_t$ and chose their values from $\{0.0, 0.1, \ldots, 1.0\}$. For machine learning methods, the most suitable hyper-parameters usually differ across datasets. Thus, we need to choose the optimal hyper-parameters for each method on each dataset. In the literature, the most widely used hyper-parameter optimization strategies are grid search and manual search [42]. In this work, we adopted grid search to choose the optimal hyper-parameters for each DTI prediction method on each dataset.
As part of future work, we would like to use the random search strategy proposed in [42] to improve the efficiency of hyper-parameter optimization for DTI prediction methods.

Comparisons with the State-of-the-Arts
Table 2 shows the AUC and AUPR values obtained by various methods under the setting CVS1. As shown in Table 2, NRLMF attains the best AUC values over all datasets. The final average AUC obtained by NRLMF is 0.974, which is 2.10% better than that of the second-best method, BLM-NII. Moreover, NRLMF achieves the highest AUPR over three datasets (i.e., Nuclear Receptor, GPCR, and Enzyme) and obtains the second-best AUPR on the Ion Channel dataset, where CMF outperforms NRLMF (0.923 for CMF vs. 0.906 for NRLMF). The average AUPR obtained by NRLMF is 0.819, which is 4.73% higher than that obtained by the second-best method, CMF. In summary, under the setting CVS1, NRLMF outperforms the other competing methods, and the differences are statistically significant (t-test at the significance level of 0.05) except in two comparisons with CMF.
The results obtained under the setting CVS2 for new drugs are shown in Table 3. In particular, NRLMF outperforms the other methods over the Nuclear Receptor, GPCR, and Ion Channel datasets, in terms of AUC. On the Enzyme dataset, WNN-GIP achieves a slightly better AUC than NRLMF (0.882 for WNN-GIP vs. 0.871 for NRLMF). Over all datasets, NRLMF obtains the best average AUC value of 0.870. For the AUPR metric, NRLMF achieves the best results on all datasets except the GPCR dataset, where KBMF2K and CMF are slightly better than NRLMF. Overall, NRLMF achieves the best average AUPR of 0.403, which is 13.84% higher than that of the second-best method KBMF2K and 17.84% higher than that of the third-best method CMF.
In addition, Table 4 summarizes the results obtained under the setting CVS3 for new targets. We observe that WNN-GIP outperforms the other methods on the Nuclear Receptor dataset, in terms of AUC and AUPR. On the other three datasets, the proposed NRLMF achieves the best AUC and AUPR values. Over all datasets, WNN-GIP achieves the highest average AUC value of 0.940, which is 1.29% better than that of the second-best method NRLMF. For the AUPR measure, NRLMF achieves the best average AUPR of 0.651, which is 11.09% better than that of the second-best method WNN-GIP.
The task under the setting CVS1 focuses on predicting the unknown pair $(d_i, t_j)$, where at least one DTI is known for $d_i$ and $t_j$ respectively in the training data. However, the tasks under CVS2 and CVS3 focus on the predictions for new drugs and new targets respectively, where no DTIs are observed for new drugs and new targets in the training data. Therefore, the task under CVS1 is easier than those under CVS2 and CVS3, and, as expected, the AUC and AUPR values obtained by DTI prediction methods under CVS1 are higher than those obtained under CVS2 and CVS3. Across all CV settings, the proposed NRLMF method achieves the best AUC values in 10 out of 12 scenarios (i.e., 3 CV settings on 4 datasets) by integrating LMF with neighborhood regularization. In the remaining 2 scenarios (i.e., CVS2 on the Enzyme dataset and CVS3 on the Nuclear Receptor dataset), WNN-GIP achieves a higher AUC. [...] (see Eq (13)). In particular, for the targets with only one interaction, the accuracies of the learned latent vectors may be drastically reduced. In NRLMF, the latent vectors of negative drugs and targets are smoothed using their nearest neighbors. However, there is no smoothing for the latent vectors of targets with only one interaction (see Eq (15)). [...] (i.e., under CVS3), the performance of NRLMF on the Nuclear Receptor dataset is the most likely to be affected. For the AUPR metric, NRLMF attains the best AUPR values in 9 out of 12 scenarios, which is to be expected, since the methods that optimize AUC are not guaranteed to optimize AUPR [43]. In addition, the target sequence similarity $S^{t}$ is more reliable and informative than the drug chemical similarity $S^{d}$ [14]. Hence, the information propagated from the neighbors to the new targets by the regularization term in Eq (11) will be more accurate than that propagated to new drugs by the term in Eq (10). This explains why the various methods usually achieve higher AUC and AUPR under CVS3 than under CVS2.

Neighborhood Benefits
The proposed NRLMF method incorporates neighborhood information for DTI prediction via the neighborhood regularization in training and the neighborhood smoothing in prediction. Next, we study how the neighborhood information benefits DTI prediction under the setting CVS1. For the results under CVS2 and CVS3, please refer to the supporting S1-S8 Figs. Fig 1 shows the AUC values obtained by NRLMF with respect to different settings of the neighborhood size $K_1$ used for the neighborhood regularization in the training procedure. As shown in Fig 1, the optimal values of $K_1$ are 3, 5, 5, and 5 for the four datasets, respectively. Under the setting CVS1, the average AUC of NRLMF is 0.958 when $K_1$ is set to 0 (i.e., without neighborhood regularization in training), while it increases to 0.974 when $K_1$ is set to 5. Fig 2 illustrates the AUPR values with respect to different settings of $K_1$. We find that NRLMF achieves the best AUPR by setting $K_1$ to 7, 7, 9, and 3, respectively. When $K_1 = 0$, the average AUPR achieved by NRLMF without neighborhood regularization is 0.772, while it increases to 0.818 by setting $K_1 = 5$. These results highlight that the neighborhood regularization is highly desirable for DTI prediction.
Table notes: "Avg." shows the average AUC/AUPR over the four datasets. The best result in each row is in bold and the second-best result is underlined. * indicates that NRLMF significantly outperforms the competitor with p < 0.05 using a t-test.
In addition, we also study the impact of the neighborhood size $K_2$ used for neighborhood smoothing in the prediction procedure. Figs 3 and 4 plot the AUC and AUPR values obtained by NRLMF with respect to different settings of $K_2$. As shown in Fig 3, NRLMF achieves the best AUC by setting $K_2$ to 5, 3, 5, and 5, respectively. For the AUPR measure, the best results are achieved by setting $K_2$ to 5, 3, 9, and 5, respectively. Over all datasets, when $K_2 = 0$ (i.e., without neighborhood smoothing in prediction), the average AUC and AUPR values obtained by NRLMF are 0.950 and 0.772, respectively, while these values are 0.974 and 0.819 when $K_2 = 5$. These observations demonstrate the effectiveness of nearest neighbors in predicting the interaction probability for a given drug-target pair. In addition, setting both $K_1$ and $K_2$ to 5 yields reasonably good results for both AUC and AUPR.

Parameter Sensitivity Analysis for c and r
In this section, we focus on the sensitivity analysis for two other parameters, i.e., the importance level of observed DTIs $c$ and the dimensionality of the latent space $r$, under the setting CVS1. For the performance trends of NRLMF with respect to different settings of $c$ and $r$ under CVS2 and CVS3, please refer to the supporting S9-S16 Figs.
As shown in Fig 5, when the importance level $c$ is set to 1 (i.e., without importance weighting), NRLMF outperforms the other competitors on the Nuclear Receptor, GPCR, and Ion Channel datasets, and is comparable with the best competitor on the Enzyme dataset (0.971 for NRLMF vs. 0.978 for the best competitor), in terms of AUC. This again highlights the effectiveness of integrating logistic matrix factorization with neighborhood regularization for DTI prediction. By setting $c = 5$, NRLMF is able to achieve the optimal AUC values and outperforms all competing methods over all datasets. For the AUPR metric, Fig 6 shows that NRLMF with $c = 1$ outperforms the other competitors on the Nuclear Receptor dataset and performs worse than the best competitor on the remaining three datasets. This is expected, since the methods that optimize AUC are not guaranteed to optimize AUPR [43]. In addition, NRLMF achieves better AUPR under the setting $c > 1$ than under the setting $c = 1$ on the GPCR, Ion Channel, and Enzyme datasets. On the Nuclear Receptor dataset, NRLMF attains slightly better AUPR under the setting $c = 1$ than under the other settings. These observations demonstrate that assigning more importance to the observed interactions can boost the performance of NRLMF on most datasets. However, when $c$ is large enough, the performance of NRLMF tends to saturate, and further increasing $c$ brings very limited improvement.
The impact of the dimensionality of the latent space r on the performance of NRLMF, in terms of AUC and AUPR, is shown in Figs 7 and 8, respectively. We find that a larger r generally yields better results. The two exceptions are the AUPR measure on the Nuclear Receptor and Ion Channel datasets, where r = 30 leads to slightly better results than r = 50. Nevertheless, r = 100 achieves the best or the second best results, measured by both AUC and AUPR, on all datasets. Thus, we recommend setting the parameter r in the range [50, 100], which is consistent with previous studies [27].

Predicting Novel Interactions
In this section, we evaluate the practical ability of NRLMF to predict novel interactions, i.e., interactions that are predicted with high probabilities but do not occur in the benchmark datasets. Following similar settings in previous studies [12,14,15,19,24], four well-known biological databases, i.e., ChEMBL [2], DrugBank [3], KEGG [4], and Matador [5], are used as references to verify whether the predicted new DTIs are true or not.
To conduct this study, we collected the online profiles associated with the drugs and targets in each benchmark dataset from the online reference databases and parsed the approved drug-target interactions. Over all benchmark datasets, there are 791 drugs and 986 targets, and 1,999 novel interactions have been confirmed in one or more reference databases. The numbers of confirmed novel interactions in the Nuclear Receptor, GPCR, Ion Channel, and Enzyme datasets are 21, 512, 1,034, and 432, respectively. For each dataset, the entire dataset is used as the training set. The unknown interactions are ranked by the interaction probabilities predicted using the optimal parameters learned under CVS1, rather than those learned under the other two settings (i.e., CVS2 and CVS3). This is because our objective is to predict likely novel drug-target interactions, rather than to focus on a new drug or a new target. The predicted novel interactions are then the top-ranked unknown drug-target pairs. Table 5 shows the top 30 novel interactions predicted by NRLMF on the GPCR dataset. In this table, the DTIs shown in bold exist in one or more of the reference databases. The third column of Table 5 shows the predicted interaction probability of a drug-target pair. For each pair, the databases containing it are listed in the last column of the table, where C is short for ChEMBL, D for DrugBank, K for KEGG, and M for Matador. For example, the highest ranked DTI is (D00283, hsa1814) with predicted probability 0.9181, which exists in the databases ChEMBL, DrugBank, and Matador. We find that 67% of the predictions (20 out of 30) are currently confirmed in at least one of the reference databases. Since these databases are still being updated as new DTIs are found, the fraction of new DTIs correctly predicted by NRLMF may increase in the future.
This encouraging result shows that NRLMF can detect a substantial number of novel interactions that are absent from the GPCR dataset, which implies that NRLMF is effective in predicting true new DTIs from sparse matrices consisting of very few known DTIs. Finally, Table 6 summarizes the fractions of true DTIs among the top N (N = 10, 30, 50) predictions generated by various DTI methods, using the optimal parameters learned under CVS1. We observe that NRLMF achieves consistently accurate prediction results across all the datasets. For example, the fractions of true DTIs among the top 10 interactions predicted by NRLMF are 50%, 60%, 50%, and 90% on the Nuclear Receptor, GPCR, Ion Channel, and Enzyme datasets, respectively. Compared with the other methods, NRLMF achieves comparable or even better prediction results across all the datasets. These observations indicate that the proposed algorithm is effective for finding novel DTIs, and thus may help biologists or clinicians significantly reduce the cost of biological tests. For more details about the novel DTI prediction, please refer to the supporting S1-S4 Texts, where the top 1000 novel DTIs predicted by NRLMF are provided.
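The evaluation procedure described in this section (rank the unknown pairs by predicted probability, then count how many of the top N appear in a reference database) can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the representation of confirmed interactions as a set of index pairs are assumptions.

```python
import numpy as np

def top_n_unknown_pairs(P, Y, n=10):
    """Return the n unknown (drug, target) index pairs with the highest probability.

    P: (m, t) predicted interaction probabilities.
    Y: (m, t) binary matrix of known interactions used for training.
    """
    # Known interactions are excluded from the ranking.
    masked = np.where(Y == 1, -np.inf, P)
    order = np.argsort(masked, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(order, P.shape)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if masked[i, j] > -np.inf]

def fraction_confirmed(ranked_pairs, confirmed_pairs):
    """Fraction of ranked pairs found in a set of externally confirmed pairs."""
    if not ranked_pairs:
        return 0.0
    hits = sum(1 for pair in ranked_pairs if pair in confirmed_pairs)
    return hits / len(ranked_pairs)
```

In this sketch, `confirmed_pairs` would be built from the parsed reference databases, and calling `fraction_confirmed` for N = 10, 30, and 50 gives the kind of quantity summarized in Table 6.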

Discussion
This paper presents a novel drug-target interaction prediction method, namely neighborhood regularized logistic matrix factorization (NRLMF). The novelty of NRLMF comes from integrating logistic matrix factorization with neighborhood regularization to predict the interaction probability of a given drug-target pair. Specifically, both drugs and targets are mapped into a shared latent space, and the drug-target interactions are modeled through a logistic function of the inner products of the drug-specific and target-specific latent vectors. In addition, a higher importance level is assigned to the positive observations (i.e., interaction pairs) and a lower level to the negative observations (i.e., unknown pairs). Moreover, neighborhood regularization based on the drug similarities and target similarities is utilized to further improve the prediction ability of the model.
To evaluate the performance of NRLMF, an extensive set of experiments was performed on four benchmark datasets, and NRLMF was compared with five state-of-the-art DTI prediction methods. The promising results validate the empirical efficacy of the proposed algorithm. For example, on average, NRLMF attains the best AUC values under CVS1 and CVS2, and the second best AUC value under CVS3. In terms of AUPR, NRLMF achieves the best averaged AUPR values over all datasets, under all three CV settings. These results indicate that NRLMF outperforms existing state-of-the-art methods in predicting new pairs and new drugs, and is comparable to or even better than existing methods in predicting new targets. However, on a dataset with a large fraction of drugs that have only one interaction (e.g., 72.22% on the Nuclear Receptor dataset), WNN-GIP may outperform NRLMF in predicting new targets. On a dataset with a large fraction of targets that have only one interaction (e.g., 43.37% on the Enzyme dataset), WNN-GIP may achieve better results than NRLMF in predicting new drugs. In addition, the practical predictive ability of NRLMF has also been verified. For example, on the Enzyme dataset, 90% of the top 10 novel DTIs predicted by NRLMF have been confirmed by the latest versions of four well-known biological databases, namely ChEMBL, DrugBank, KEGG, and Matador.
The optimization problem of NRLMF is solved using an alternating gradient descent optimization algorithm, the time complexity of which is O(iter · r · m · n), where iter denotes the number of iterations. In contrast, the time complexities of the solutions to the other two matrix factorization based DTI prediction methods (i.e., KBMF2K and CMF) are O(iter · (r · m^3 + r · n^3 + r^3)) and O(iter · (r^2 · (m + n)^2 + r^3 · (m + n))), respectively. Therefore, NRLMF is more efficient than KBMF2K and CMF. In addition, NRLMF can be extended to incorporate multiple types of similarities from drugs and targets for DTI prediction. One direction for future work is to couple logistic matrix factorization with multiple kernel learning techniques [44]. Another potential direction is to exploit boosting techniques, e.g., the AdaBPR model in [45], to improve the prediction accuracy of the proposed NRLMF method.
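The gap between the three complexity bounds can be made tangible by evaluating their leading terms for concrete problem sizes. The sketch below does only that; asymptotic bounds hide constant factors, so the returned numbers are rough growth indicators rather than actual operation counts. The function names and the example sizes are hypothetical.

```python
def nrlmf_cost(iters, r, m, n):
    """Leading term of O(iter * r * m * n)."""
    return iters * r * m * n

def kbmf2k_cost(iters, r, m, n):
    """Leading term of O(iter * (r * m^3 + r * n^3 + r^3))."""
    return iters * (r * m**3 + r * n**3 + r**3)

def cmf_cost(iters, r, m, n):
    """Leading term of O(iter * (r^2 * (m + n)^2 + r^3 * (m + n)))."""
    return iters * (r**2 * (m + n)**2 + r**3 * (m + n))

if __name__ == "__main__":
    # Hypothetical sizes: 200 drugs, 300 targets, 50 latent dimensions, 100 iterations.
    sizes = dict(iters=100, r=50, m=200, n=300)
    for name, fn in [("NRLMF", nrlmf_cost), ("KBMF2K", kbmf2k_cost), ("CMF", cmf_cost)]:
        print(f"{name}: {fn(**sizes):.3e}")
```

Under these assumed sizes, the NRLMF term is smaller than the other two by roughly two orders of magnitude or more, in line with the efficiency claim above.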