Large-Scale Off-Target Identification Using Fast and Accurate Dual Regularized One-Class Collaborative Filtering and Its Application to Drug Repurposing

Target-based screening is one of the major approaches in drug discovery. Besides the intended target, unexpected drug off-target interactions often occur, and many of them have not been recognized and characterized. The off-target interactions can be responsible for either therapeutic or side effects. Thus, identifying the genome-wide off-targets of lead compounds or existing drugs will be critical for designing effective and safe drugs, and providing new opportunities for drug repurposing. Although many computational methods have been developed to predict drug-target interactions, they are either less accurate than the one that we are proposing here or computationally too intensive, thereby limiting their capability for large-scale off-target identification. In addition, the performances of most machine learning based algorithms have been mainly evaluated to predict off-target interactions in the same gene family for hundreds of chemicals. It is not clear how these algorithms perform in terms of detecting off-targets across gene families on a proteome scale. Here, we are presenting a fast and accurate off-target prediction method, REMAP, which is based on a dual regularized one-class collaborative filtering algorithm, to explore continuous chemical space, protein space, and their interactome on a large scale. When tested in a reliable, extensive, and cross-gene family benchmark, REMAP outperforms the state-of-the-art methods. Furthermore, REMAP is highly scalable. It can screen a dataset of 200 thousands chemicals against 20 thousands proteins within 2 hours. Using the reconstructed genome-wide target profile as the fingerprint of a chemical compound, we predicted that seven FDA-approved drugs can be repurposed as novel anti-cancer therapies. The anti-cancer activity of six of them is supported by experimental evidences. Thus, REMAP is a valuable addition to the existing in silico toolbox for drug target identification, drug repurposing, phenotypic screening, and side effect prediction. The software and benchmark are available at https://github.com/hansaimlim/REMAP.


Introduction
Conventional one-drug-one-gene drug discovery and drug development is a time-consuming and expensive process. It suffers from high attrition rate and possible unexpected post-market withdrawal [1]. It has been recognized that a drug rarely only binds to its intended target, and off-target interactions (i.e. interactions between the drug and unintended targets) are common [2]. The off-target interaction may lead to adverse drug reactions (ADRs) [3], as demonstrated by the deadly side effect of a Fatty Acid Amide Hydrolase (FAAH) inhibitor in a recent clinical trial [4]. On the other hand, the off-target interaction may be therapeutically useful, thus providing opportunities for drug repurposing and polypharmacology [2]. Therefore, identifying off-target interactions is an important step in drug discovery and development in order to reduce the drug attrition rate and to accelerate the drug discovery and development process, and ultimately to make safer and more affordable drugs.
Many efforts have been devoted to developing statistical machine learning methods for the prediction of unknown drug-target associations by screening large chemical and protein data sets [5]. One of the fundamental assumptions in applying statistical machine learning methods to drug-target interaction prediction is that similar chemicals bind to similar protein targets, and vice versa. Based on this similarity principle, both semi-supervisedand supervised machine learning techniques have been applied. The semi-supervised learning methods either build statistical models for the k nearest neighbors (k-NN) of the query compound with similar compounds in the database (e.g. Parzen-Rosenblatt Window (PRW) [6] and Set Ensemble Analysis (SEA) [7] are examples). Although a large number of 2D and 3D fingerprint representations of chemical structures have been developed, chemical structure similarity that is measured by Tanimoto coefficient (TC) or other similarity metrics of fingerprints is not continuously correlated with the binding activity. Activity cliff exists in the chemical space, where a small modification of a chemical structure can lead to a dramatic change in binding activity [8]. Thus, the chemical structural similarity alone is not sufficient to capture genome-wide target binding profile, as protein-chemical interaction is determined by both protein structures and chemical structures. New deep learning techniques that can learn non-linear, hierarchical relationships may provide new solutions for representing chemical space [9][10][11][12]. However, few work has been done to incorporate protein relationships into the deep learning framework. It remains to be seen whether the deep learning is applicable to genome-wide target prediction.
A number of techniques such as Gaussian Interaction Profile (GIP), Weighted Nearest Neighbor (WNN), Regularized Least Squares (RLS) classifier [13,14], and matrix factorization [15][16][17] have been developed to integrate chemical and genomic space. Among them, Neighborhood Regularized Logistic Matrix Factorization (NRLMF) [17] and Kernelized Bayesian Matrix Factorization (KBMF) [16] are two of the most successful methods. However, several drawbacks in these algorithms hinder their applications in genome-wide off-target predictions. First, several algorithms with high performance such as KBMF are extremely time and memory-consuming. Second, these algorithms depend on a supervised learning framework that requires negative cases. While publicly available biological and/or chemical databases (e.g. ZINC [18], ChEMBL [19], DrugBank [20], PubChem [21], and UniProt [22]) have enabled large-scale screening of drug-target associations, the known chemical-protein associations are sparse, and the number of reported negative cases (i.e. chemical-protein pairs not associated) is too small to optimally train a prediction algorithm [23]. Using randomly generated negative cases will adversely impact the performance of these algorithms, and algorithmically derived negative cases are often based on unrealistic assumptions [23]. Finally, these algorithms have been mainly evaluated for the prediction of off-targets within the same gene family (e.g. GPCR) using a small benchmark with hundreds of drugs and targets. Their performances in predicting off-target across gene families on a large scale are uncertain. Indeed, drug crossreactivity often occurs across fold spaces [2]. Thus, the development of in silico prediction methods that are fast as well as accurate enough to explore the available data is urgent.
Here, we make several contributions to address the aforementioned problems. First, we present an efficient method, REMAP, which formulates the off-target predictions as a dual-regularized One Class Collaborative Filtering (OCCF) problem. Thus, negative data are not needed for the training, but can be used if available. Secondly, REMAP is highly scalable with promising accuracy, thus can be applied to large-scale off-target predictions. Thirdly, we introduce a new benchmark set to evaluate the performance of drug-target interactions across gene families. Finally, we apply REMAP to repurposing existing drugs for new diseases. We identified seven drugs that have anti-cancer activity. Six of them are supported by experimental evidence.

Problem formulation
The problem we try to solve here is to predict how likely it is that a chemical interacts with a target protein, using a chemical-protein association network, chemical-chemical similarity, and protein-protein similarity information. We start by preparing a bipartite network for chemical-protein associations as a sparse n × m matrix R, where n is the number of chemicals and m is the number of proteins. R i,j = 1 if the i th chemical is associated with the j th protein, and R i,j = 0, otherwise. The chemical-chemical similarity scores are in an n × n square matrix C, with C i,j representing the chemical-chemical similarity score between the i th and j th chemicals (0 C i,j 1) for total n chemicals. The protein-protein similarity scores are in the same format for total m proteins (0 T i,j 1). We consider this problem an analog of user-item preferences such that users and items represent chemicals and proteins, respectively. Therefore, the problem is to provide an n × m matrix P in which P i,j is the prediction score for the interaction between the i th chemical and the j th protein.
Overview of off-target prediction method REMAP Our prediction method REMAP is based on a one-class collaborative filtering algorithm that recommends the users' preferences to the listed items [24]. It assumes that similar users will prefer similar items, unobserved associations are not necessarily negative, and user-item preferences can be analogous to drug-target associations. Assuming that a fairly low number of factors (i.e. smaller number of features than the number of total chemicals or protein targets) may capture the characteristics determining the chemical-protein associations, two low-rank matrices, U (chemical side) and V (protein side), were approximated such that X n i X m j fR À ðU Á V T Þg is minimized where R is the matrix for known chemical-protein associations and V T is the transposition of the protein side low-rank matrix V. The two low rank matrices, U n×r and V m×r are obtained by iteratively minimizing the objective function, All symbols used in the paper are summarized in Table 1, and the overall process of REMAP is in Fig 1. Here, p wt is the penalty weight on the observed and unobserved associations which indicate the reliability of the assigned probability of true association, p imp is the imputed The chemical-chemical similarity score for the chemicals c 1 and c 2 d Tani ( c 1 ;c 2 ) The Tanimoto dissimilarity coefficient for the chemicals c 1 and c 2 T (p1,p2) The target-target similarity score for the query protein p 1 and the target protein p 2 d bit(p1,p2) The bit score for the query protein p 1 and the target protein p 2 D C , D T The degree matrices of C and T, respectively U, V The chemical-side and the target-side low-rank approximation matrices The element of R at its i th row and j th column The j th column of R value (i.e. the probability of unobserved associations as real associations), p reg is the regularization parameter to prevent overfitting, p chem is the importance parameter for chemical-chemical similarity, p prot is the importance parameter for protein-protein similarity, and tr(A) is the trace of matrix A (Table 1). In this study, we use global weight and imputation. However, the weight and imputation values may be determined by a priori knowledge or from the prediction of other machine learning algorithms (i.e. p wt and p imp can be matrices with the same dimension as the matrix R). The raw predicted score for the i th chemical to bind the j th protein can be calculated by P ði;jÞ ¼ U UPði;:Þ Á V T UPðj;:Þ . The raw scores were adjusted based on the ratio of observed positive and negative cases when the negative data are available (explained in the prediction score adjustment section). Also, the matrix U n×r is referred to as a low-rank drug profile since its i th row represents the i th drug's behavior in the drug-target interaction network as well as drug-drug similarity spaces compressed to r number of features. The REMAP code was originally written in Matlab and modified for drug-target predictions. . Solid lines within the network represent connectivity (edges), and the arrows represent mathematical processes. Red squares represent single similarity values, and blue bars in U and V represent row and column vectors. Lower-case c and p represents chemicals and proteins, respectively. The letter symbols are annotated in Table 1.

Chemical-chemical similarity
Chemical-chemical similarity scores are one of the required inputs of REMAP. Although there are a number of metrics developed for chemical-chemical similarity, a recent study showed that Tanimoto coefficient-based similarity is highly efficient for fingerprint-based similarity measurement [25]. The fingerprint of choice in this study is the Extended Connectivity Fingerprint (ECFP), which has been successfully applied to chemical structure-based target prediction method, PRW [6]. Thus, it allows for a fair comparison of REMAP with PRW. It is interesting to compare the different fingerprints in the future study.
To calculate a similarity score between two chemicals, c 1 and c 2 , the Tanimoto dissimilarity coefficient d Tani ðc 1 ;c 2 Þ was obtained using JChem with the Tanimoto metric for the ECFP descriptor type using the command in the Unix environment, "ChemAxon/JChem/bin/ screenmd target_smi query_smi -k ECFP -g -c -M Tanimoto" [26]. The chemical-chemical similarity score, C ðc 1 ;c 2 Þ is defined as C ðc 1 ;c 2 Þ ¼ 1 À d Tani ðc 1 ;c 2 Þ . Briefly, two chemicals have a higher similarity score if they have more of the same chemical moieties (e.g. functional groups) at more similar relative positions. Chemical similarity scores below 0.5 were treated as noise and set to 0.

Protein-protein similarity
Protein-protein similarity scores are also one of the required inputs for REMAP. The similarity between two proteins was calculated based on their sequence similarity using NCBI BLAST [27] with an e-value threshold of 1 × 10 −5 and its default options (e.g. 11 for gap open penalty and 1 for its extension, BLOSUM62 for the scoring matrix, and so on). Based on our 10-fold cross validation (see below), e-value thresholds from 1 to 1 × 10 −20 did not significantly affect the performance (S1 Fig). Therefore, we decided to use a moderately stringent threshold (BLAST default is 1 × 10 −3 ). A similarity score for query protein p 1 to target protein p 2 was calculated by the ratio of a bit score for the pair compared to the bit score of a self-query. To be specific, for the query protein p 1 to the target protein p 2 , protein-protein the similarity score was defined such that T (p1,p2) = d bit(p1,p2) /d bit(p1,p1) .

Benchmark test and data preparation
For benchmark tests, ZINC data was filtered by IC 50 10 μM, which yielded 31,735 unique chemical-protein associations for 12,384 chemicals and 3,500 proteins (ZINC dataset [18]). Targets that are protein complexes or cell-based tests were excluded. Proteins whose primary sequence is unavailable were also excluded. Protein sequences were obtained from UniProt [22], and the whole protein sequences were used to calculate protein-protein similarity scores.
To assess the predictive power of our algorithm, we performed a 10-fold cross validation on the ZINC dataset described above. We set the parameters as follows: p wt = p imp = p reg = 0.1, r = 300, p chem = 0.75, p prot = 0.1, and p iter = 400. The optimized values determined by the 10-fold cross validation of benchmark are shown in S2 Fig. It is noted that the best performance is achieved when p chem = 0.25 and p prot = 0.25. To further evaluate REMAP, we compared its performance on the ZINC dataset with several methods: a chemical similarity-based method (PRW [6]), the best performed matrix factorization methods so far (NRLMF [17] and KBMF with twin kernels (KBMF2K) [16]), combination of WNN and GIP (WNNGIP [14]), and another type of matrix factorization method (Collaborative Matrix Factorization (CMF) [15]) for different types of chemicals and proteins.
To obtain a detailed view of the performance of the methods, we divided the ZINC dataset into 3 categories with 2 subcategories for each, based on the connectivity of known chemical-protein associations and the degree of uniqueness of the chemicals. First, all the chemicals in the dataset were classified into the chemicals having only one known target (NT1), two known targets (NT2), or three or more known targets (NT3). Then, for the chemicals in each category, they were further divided based on either the number of known chemicals (ligands) the target proteins are associated with (number of ligands in increments of 5) or the maximum chemicalchemical similarity score for the chemical in the dataset (the similarity score range increment is 0.1). The label used in this paper for the dataset are NTaLb, or NTaMaxTcd, where 'NT' stands for the Number of known Target, 'L' for the number of known Ligand, and 'Tc' for the maximum (Tanimoto coefficient-based) chemical-chemical similarity score for the given chemical in the dataset, with NT = a, b L b +4, and d − 0.1 < Tc d. For instance, NT2L1 is the data set label for chemicals having two known targets and proteins having 1 to 5 ligands in the dataset, and NT1Tc0.9 is for chemicals with the most similar chemicals between 0.8 and 0.9 of similarity scores and having one known target. Chemicals having more than three known targets are included in the NT3 class, and proteins having more than twenty-one known ligands were included in L21 (not limited to 25). The categories of the ZINC dataset were then used to evaluate the performance of off-target prediction, and their labels mean the number of known ligands (L) or the maximum structural similarity (Tc) with their corresponding ranges. For example, 'L21more' stands for the dataset for proteins having 21 or more known targets, and 'Tc0.9to1.0' stands for maximum structural similarity greater than 0.9 and up to 1.0 (Tc0.5to0.6 is inclusive of 0.5). Note that NT1 is equivalent to chemicals without any known target when they are tested for cross validation. Therefore, performances on NT1 datasets reflect the ability to address the cold start problem. In other words, when one known drugtarget association is intentionally hidden for the chemicals in the NT1 dataset, the tested chemicals will not have any known target in the training data, and they are less likely to be given a good recommendation of targets. This is analogous to the new user or new item problem reviewed by Su et al. [28].

Measuring prediction accuracy of REMAP by TPR vs. cutoff rank
A typical measure of prediction performance is the Receiver Operating Characteristic (ROC) curve by which one can assess the reliability of the positively predicted results. However, it is difficult to apply the ROC curve on our chemical-protein association datasets since the vast majority of the chemical-protein pairs have not been tested, and thus it is unclear whether the missing entries are actually unassociated or just not yet observed.
In order to assess how reliable the positively predicted results from REMAP are, we needed to define a performance measurement that is analogous to ROC curve but not dependent on the true negatives. Our primary measure of performance is the true positive rate ( P True Positives P Condition Positives ; Recall or Recovery) at the top 1% of predictions for each chemical. To be specific, the top 1% of predictions includes up to the 35 th -ranked predicted target protein for a chemical for our datasets (3,500 possible target proteins for each chemical). Thus, for instance, a TPR of 0.965 at the 35 th cutoff rank (top 1%) means that 96.5% of the total tested positive pairs were ranked 35 th or better for the tested chemicals.

Scalability of REMAP as a matrix factorization algorithm
In order to assess the speed of REMAP for practical uses, we measured its running time by varying the rank parameter or the size of dataset. On the ZINC dataset (12,384 chemicals and 3,500 proteins), up to r = 2,000 was tested, and at fixed r = 200, dataset sizes up to 200,000 chemicals and 20,000 proteins were tested. The number of iterations (p iter ) was fixed to 400. A single node of CPU with 2.88 GB of memory in the City University of New York High Performance Computing Center (CUNY HPCC) was used for REMAP running time tests. We also compared the running times of different matrix factorization methods with ours. Due to the large time complexity and memory requirement for other algorithms, a multi-core node with up to 700 GB of shared memory system in CUNY HPCC was used for them on the ZINC dataset.

Genome-wide chemical-protein associations
Chemical-protein associations were obtained from the ZINC [18], ChEMBL [19] and Drug-Bank [20] databases. To obtain reliable chemical-protein association pairs, binding assays records with IC 50 information were extracted from the databases, and the cutoff IC 50 value of 10 μM was used where applicable. Two chemicals were considered the same if their InChI Keys are identical, and two proteins were considered so if their UniProt Accessions are identical. For records with IC 50 in μg/L (found in ChEMBL), the full molecular weights of the compounds listed on ChEMBL were used to convert μg/L to μM. Chemical-protein pairs were considered associated if IC 50 10 μM (active pairs), unassociated if IC 50 >10 μM (inactive pairs), ambiguous if records exist in both ranges (ambiguous pairs), and unobserved otherwise (unknown pairs). A total of 198,712 unique chemicals and 3,549 unique target proteins were obtained from the combination of ChEMBL and ZINC with 228,725 unique chemical-protein active pairs, 76,643 inactive pairs, and 4,068 ambiguous pairs. Of the 198,712 chemicals, 722 were found to be FDA-approved drugs. Furthermore, drug-target relationships were extracted from the DrugBank and integrated into the ZINC_ChEMBL dataset above. A total of 199,338 unique chemicals and 6,277 unique proteins were obtained from the combination of ZINC, ChEMBL, and DrugBank with 233,378 unique chemical-protein active pairs.

Drug-target interaction profile analysis for drug repurposing
Since REMAP showed promising performances on predicting off-targets for chemicals with at least one known target, it is possible to use REMAP to suggest new purposes for some FDA approved drugs. As the matrix product of U UP (chemical-side low-rank matrix) and V UP (protein side low-rank matrix) is the predicted drug-target interaction matrix P, the i th row of U UP contains the target interaction profile for the i th drug. Therefore, we analyzed the drug-drug similarities based on the low-rank matrix U UP . We ran REMAP with the data combination of three databases explained above, with the parameters used in the benchmark evaluations. Then, we calculated drug-drug cosine similarities based on the matrix U UP . For each row of U UP for FDA approved drugs, the cosine similarity of drug c 1 and drug c 2 can be calculated by, To search for possibly undiscovered uses of the drugs, we focused on drugs that are found to have high cosine similarity but low Tanimoto similarity (< 0.5). Markov Cluster (MCL) Algorithm [29,30] was used to cluster drugs based on their cosine similarity of a low-rank target profile. Drug-disease associations were obtained from the Comparative Toxicogenomics Database (CTD) [31].

Prediction score adjustment
The raw prediction score (P ði;jÞ ¼ U UPði;:Þ Á V T UPðj;:Þ ) can be adjusted to better reflect the real data as well as to statistically discriminate the positive and negative predictions. We used the active, inactive and ambiguous pairs obtained from the ChEMBL database to adjust the score. REMAP prediction on the ZINC_ChEMBL dataset showed a clear division between the active and inactive pairs, suggesting that predictions scored around 1.0 are highly likely to be positive (Fig 2A). As mentioned above, however, there is a large difference between the number of active and inactive pairs, which is not likely to reflect the ratio of the actual positive and negative chemical-protein pairs. Greater accuracy is expected by adjusting the prediction scores to reflect such a positive/negative ratio. To estimate the ratio, we first normalized the counts in each bin in the histogram (Fig 2A) and calculated the weights that minimize the sum of error, where w 1 and w 2 are the weights on active and inactive pairs, respectively (w 1 + w 2 = 1.0), and A i , p i and N i are the normalized counts in i th bin of ambiguous, active and inactive pairs, respectively. The optimum adjustment weights were approximately w 1 = 0.16, w 2 = 0.84 (Fig 2B). This implies that approximately 16% of total observations are positive. Since the ratio of negative/positive is about 5.25 w 2 w 1 ¼ 5:25 , we It is noted that the prediction score adjustment was not used in the benchmark study, where no negative data were used.

Graphic analysis
Drug-drug clustered network was visualized using Cytoscape [32].

REMAP is highly effective in predicting off-targets even for novel chemicals
We evaluated the performances of algorithms for chemicals having one, two, or more than three known targets with varying maximum chemical-chemical similarity ranges or with proteins having a certain number of known ligands (dataset prepared as explained in the methods and materials section). In general, the performances of both algorithms improve as the number of known ligands per protein or the maximum chemical-chemical similarity value increases.
It was noticeable that REMAP performed significantly better than PRW when there was at least one known target for a chemical whose targets are predicted (Figs 3 and 4). REMAP showed greater than 90% recovery at the top 1% when the tested chemicals have at least one known target. All algorithms are sensitive to the number of ligands per target. The more ligands, the higher accuracy. While PRW also reached reasonably high recovery for some categories (e.g. more than 11 known ligands per proteins, or C ðc 1 ;c 2 Þ > 0:6 of the most similar trained chemicals), REMAP showed that it is reliable for testing chemicals without high similarity to the trained chemicals (Figs 3B and 4B). In other words, REMAP is applicable to chemicals that are structurally distant to the chemicals already in the dataset. Except where the target proteins have 1 to 5 known ligands, REMAP performed best among the three algorithms in all cases with at least one known target for the tested chemicals (Figs 3 and 4). In the most of cases, the differences in the performance between REMAP and other two algorithms are statistically significant. Therefore, in practice, REMAP can predict potential drug targets for chemicals with at least one known target as training data, even when the chemicals are structurally dissimilar to the training chemicals. With the optimized parameters (see below), ROC-like curves shows the general trend of performances of the three algorithms up to the top 10% of predictions (S3 and S4 Figs).
As shown in Figs 3 and 4, REMAP outperforms the state-of-the-art NRLFM algorithm in most of the tested cases. As NRLMF is sensitive to the rank parameter, we carried out optimizations to determine optimal rank and iterations for NRLMF (S5 Fig). The optimal rank and iterations used in the evaluation were 100 and 300, respectively. Moreover, in the current implementation, REMAP is approximately 10 times faster and uses 50% less memory than NRLMF. Consistent with the results by Liu et al. [17], the accuracies of NRLFM are significantly higher than KBMF2K, CMF, and WNNGIP in all of ZINC benchmarks. Overall, REMAP is one of the best-performing methods for the genome-wide off-target predictions. Chemical-chemical similarity based on Tanimoto coefficient significantly helps REMAP's performance, while protein-protein similarity information contains significant noise To test whether the chemical-chemical similarity matrix helps prediction, we performed 10-fold cross validation on the ZINC dataset with the contents of the chemical-chemical or the protein-protein similarity matrix controlled. In other words, about half of the non-zero chemical-chemical similarity scores were randomly chosen and removed (set to 0) for the "half-filled chemical similarity" matrix, and all entries are set to 0 for the "zero-filled chemical similarity" matrix. The predictive power of REMAP showed noticeable improvement when all available chemical-chemical similarity pairs were used, compared to the half-filled or the zero-filled similarity matrix (Fig 5A). Similarly, the contents of the protein-protein similarity matrix were controlled (e.g. half-filled protein similarity, and zero-filled protein similarity) while the full chemical similarity matrix was used. Unlike the chemical-chemical similarity, the protein-protein similarity information did not necessarily improve REMAP's predictive power. The performance was best when a half of the protein-protein similarity information was used together with the full chemical-chemical similarity matrix (Fig 5B). This suggests that there is significant noise in the protein-protein sequence similarity matrix although the information does help prediction. A careful examination of the BLAST-based protein-protein similarity matrix may give an insight into the design of a novel protein-protein similarity metric for drug-target binding activities (see discussion section).
We also performed optimization tests for p chem and p prot on ZINC dataset. Although the performance was slightly better when the chemical-chemical similarity importance was maximum (Fig 6A), the difference was too small to conclude that it is best to fix p chem = 1. Instead, the prediction may rely too much on the chemical-chemical similarity scores. Therefore, to allow flexibility on chemical-chemical similarity information, we set p chem = 0.75 at which the performance was almost as accurate as p chem = 1. On the other hand, the performance was best when the protein-protein sequence similarity importance, p prot , was 0.1 (Fig 6B), further supporting our claim that protein-protein sequence similarity is not an optimal choice for the prediction of a drug-target interaction. When jointly optimizing p chem and p prot , their optimal value is 0.25 and 0.25, respectively, in the 10-fold cross validation benchmark evaluation (S2B Fig). Our result supports a recent study [25] which showed that Tanimoto coefficient is efficient for the chemical similarity calculation. Chemical fingerprint-based chemical-protein association prediction has been studied by Koutsoukas et al [6]. By defining bins (target proteins) that can contain certain chemical features based on the chemical fingerprints, Koutsoukas et al. successfully demonstrated that their algorithm, PRW, can efficiently predict unknown chemicalprotein associations [6]. While the basic idea of dissecting chemical compounds into functional groups is the same, it should be noted that PRW does not consider the information obtained from proteins, as well as interactome.

REMAP is readily scalable for large chemical-protein data space
For all our tests, REMAP showed great speed without losing its accuracy. On our benchmark dataset (ZINC; 12,384 chemicals and 3,500 proteins), it took approximately 120 seconds to run x-axis of Tc0.6to0.7 means that for the tested chemicals, at least one trained chemical was found such that 0:6 < C ðc 1 ;c 2 Þ 0:7 and no trained chemical was found in greater similarity than 0.7. All TPR values are based on 10-fold cross validation. Error bars represents s.e.m. Asterisks represents statistical significance based one t-test of the 10 TPR values (* for p < 0.05, ** for p < 0.001).
doi:10.1371/journal.pcbi.1005135.g003 . The time complexity is linearly dependent on the rank (Fig 7A). The scalability of REMAP is superior when compared to KBMF2K, a state-of the art matrix factorization algorithm that is implemented in Matlab and has been extensively studied for predicting drug-target interactions [16]. KBMF2K took more than 10 days for the same size matrix using the same computer system in the ZINC benchmark. Moreover, REMAP was capable of higher rank factorization while KBMF2K was limited to rank 200 in our system due to the memory requirement (over 100 GB of memory). At a much higher rank (r = 2,000), less than one hour was required for REMAP on the same dataset (Fig 7A). Time complexity experiments on larger dataset showed that REMAP completed predictions on a dataset with 200,000 rows and 20,000 columns within 2 hours on a single core computing system with 2.88 GB of memory, demonstrating its ability to screen the whole human genome of approximately 20,000 proteins in two hours (Fig 7B).

Large scale prediction of drug-target interactions
Since REMAP is scalable and shows superior accuracy based on our benchmark tests, we performed large scale prediction of drug-target interactions on the ZCD dataset (explained in the Materials and Methods section). As explained in the prediction score adjustment section, prediction scores for the active pairs were mostly located between 0.75 and 1.0 (Fig 2A).

Low rank profile based drug-drug similarity analysis
As expected, the percentage of pairs of chemicals that share common targets decreases with the decrease of the chemical structural similarity measured by the Tc of ECFP fingerprints (C ðc 1 ;c 2 Þ ). The percentage of target-sharing chemical pairs drops below 50% and 0.5% when the Tc is between 0.5 and 0.6, and less than 0.5, respectively (S6 Fig). Thus, it is less likely that the chemical structural similarity alone can reliably detect novel binding relations between two chemicals when the Tc is less than 0.5. It is interesting to see how REMAP performs when the chemical structural similarity fails.
We analyzed the low-rank drug profile (matrix U UP ) to check whether it represented the target-binding behavior of the drugs. When filtered by low chemical structure similarity (C ðc 1 ;c 2 Þ < 0:5Þ), there are 899,871 drug-drug pairs. Among them, the profile similarity score (S cos;ðc 1 ;c 2 Þ ) of 91,888 pairs is higher than 0.3. With high profile similarity (0:99 S cos;ðc 1 ;c 2 Þ 1Þ), a total of 1,327 drug-drug pairs were found of which 1,033 pairs shared at least one common known target. S7 Fig shows the percentage of pairs that share the common target in different profile similarity bucket for FDA-approved drugs. This result suggests that REMAP is able to provide a chemical-protein binding profile that cannot be captured by chemical structure similarity alone.
When S cos;ðc 1 ;c 2 Þ 0:3, the percentage of two drugs that share a common target drops below 50% (S7 Fig). We constructed a drug-drug similarity network by filtering out drug pairs with S cos;ðc 1 ;c 2 Þ 0:3, then applied the MCL algorithm on the drug-drug network to find clusters of similar drugs. The largest cluster of drugs contained a total of 313 drugs, and their relationships to diseases were examined based on the known associations annotated in CTD [31]. As a result, we found that the drugs are mostly related to mental disorders, including hyperkinesis, x-axis of Tc0.5to0.6 means that for the tested chemicals, at least one trained chemical was found such that 0:5 C ðc 1 ;c 2 Þ 0:6 and no trained chemical was found in greater similarity than 0.6. All TPR values are based on 10-fold cross validation. Error bars represents s.e.m. Asterisks represents statistical significance based one t-test of the 10 TPR values (* for p < 0.05, ** for p < 0.001). doi:10.1371/journal.pcbi.1005135.g004

Fig 5. Performance of REMAP according to the amount of the chemical-chemical or the protein-protein similarity information used for its 10-fold cross validation on the ZINC dataset. (A)
True Positive Rate at the given cutoff rank. All available chemical and protein similarity information included (blue), a half of chemical-chemical similarity was ignored (orange), and the entire chemical-chemical similarity was ignored (green). (B) The blue line is the same as A. A half of protein-protein similarity matrix was ignored (gray), and the entire protein-protein similarity was ignored (red). dystonia, catalepsy, schizophrenia and basal ganglia diseases as the mostly related diseases. The most frequent known protein targets by the drugs were GPCRs (S1 Table). It is comparable that GPCRs were 1,924 times targeted while kinases were targeted only 55 times. While it is interesting to further examine the cluster, validating all of the possible drug-target pairs in the largest cluster may be inefficient. A smaller cluster of drugs contained a total of thirty-one FDA approved drugs twenty-six of which are known to target kinases or interact with microtubule (Table 2). Seven drugs in the cluster have not been used for cancer treatment and were found to be closely linked to the anticancer drugs (Fig 8 and Table 2). Interestingly, several of them have been tested for their anticancer activity. For example, colchicine (also known as colchine), an FDA approved drug for gout treatment, has been shown to have anti-proliferative effects on several human liver cancer cell lines at clinically acceptable concentrations [33]. Griseofulvin, an antifungal antibiotic drug, appears to be effective as an anti-cancer drug when used together with other anti-cancer drugs [34]. The three anthelmintic drugs, albendazole, mebendazole and niclosamide, have been studied and repurposed for their anti-cancer effects on different types of cancers. Albendazole has been shown to be effective in suppressing liver cancer cells both in vitro and in vivo [35], and recently has been repurposed for ovarian cancer treatment with a bovine serum albumin-based nanoparticle drug delivery system [36]. Mebendazole showed anti-cancer activities in human lung cancer cell lines [37] and human adrenocortical cell lines [38], and it has been repurposed for colon cancer treatment [39]. Both niclosamide and mebendazole showed beneficial effects in glioblastoma in different studies [40,41]. It has been proposed to use aprepitant in combination with other compounds to improve the efficiency of temozolomide, the current standard drug for glioblastoma treatment [42]. Anti-cancer activity of carbidopa hydrate have not yet been reported. It will be interesting to experimentally validate the prediction.

REMAP improves the predictive power of off-target prediction and drug repurposing
Our extensive benchmark studies show that REMAP outperforms existing algorithms in most of the cases for the off-target prediction. Compared with other state-of-the-art matrix factorization algorithms, the predictive power of REMAP comes from several improvements. First, we formulated the drug-target prediction as a one-class collaborative filtering problem; thus the negative data are not required for the training. Second, a priori knowledge including known negative data can be incorporated into the matrix factorization with imputation and weighting. Finally, using global imputation and weighting, the algorithm is computationally efficient without significantly sacrificing its performance.
The efficiency and effectiveness of REMAP allows us to predict proteome-wide target binding profiles of hundreds of thousands of chemicals. As the proteome-wide target binding profile is more correlated with phenotypic response than a single target binding, REMAP will facilitate linking molecular interactions in the test tube with in vivo drug activity. When using a multi-target binding profile predicted by REMAP as the signature of a chemical compound, seven drugs were found to be associated with anti-cancer therapeutics, although they do not have detectable chemical structural similarity. Among them, the anti-cancer activity of six drugs was supported by experimental evidences. Thus, REMAP could be a useful tool for drug repurposing.

Remaining issues and future directions
Although REMAP showed its high potential on genome-wide off-target predictions as discussed above, two issues remain: the cold start problem and suboptimal protein-protein similarity metrics. Similar to matrix factorization algorithms such as NRLMF, REMAP suffers from cold start problem, also known as new user or new item problem. In other words, it is difficult to recommend a product for a new user if the new user has never purchased or reviewed a product in the database [28]. For novel chemicals that do not have any known target in the dataset, REMAP did not show better performance than PRW. Moreover, if the target of the novel chemical has 5 or fewer known ligands, the recovery of REMAP is lower than 0.5 (S8A Fig).
When the novel chemical is similar to those chemicals in the database, the recovery of REMAP reached above 90% (S8B Fig). These results suggest that, in practice, existing matrix factorization-based methods, including REMAP, are not the optimal choice if the chemicals of interest do not have any known target. To resolve this issue, it is possible to design an algorithm that combines the benefits of PRW or other algorithms with REMAP. The use of confidence weights and a priori imputation makes it straightforward for REMAP to incorporate additional information. In addition, the time and memory efficiency of REMAP makes it possible to apply active learning to overcome the cold start problem [43][44][45][46]. The suboptimal performance of REMAP may arise from the lack of molecular-level biochemical details in deriving the protein-protein similarity metrics. When testing the ZINC dataset, we found that REMAP performs better as lower weight was assigned for protein-protein sequence similarity data (Fig 6B). In addition, the predictive power of REMAP improved when about half of the randomly selected protein-protein similarity scores were removed, further confirming that noise confounds relating global sequence similarity to ligand binding ( Fig  5B). It is not surprising that proteins with similar sequences do not necessarily bind to similar chemicals, as protein-ligand interaction is governed by the spatial organization of amino acid residues in the protein structure [47]. Amino acid mutations/post-translational modifications and conformational dynamics may alter the binding of the ligand through direct modification of the ligand binding site or allosteric interaction. A protein may also consist of multiple binding sites that accommodate different types of ligands. Thus, two proteins with high sequence similarity do not necessarily bind the same ligands because the two proteins may possess different 3D conformations, especially in their binding pockets [47]. In contrast, two proteins with low sequence similarity can bind to the same ligands if their binding pockets are similar [48,49]. The binding site similarity can be a more biologically sensitive measure of protein-protein similarity for the off-target prediction [50][51][52][53][54][55]. Such work is on-going.

Conclusion
In silico drug-target screening is an essential step to reduce costly experimental steps in drug development. In this study, we showed that dual-regularized one-class collaborative filtering algorithm, a class of computational methods frequently used in user-item preference recommendations, may be applied to drug-target association predictions. Our study presents REMAP, a collaborative filtering algorithm with capability of running whole human genomelevel predictions within two hours. Other studies on some types of cancer treatment support our algorithm's ability to capture drug-drug similarities based on both the drug-target interaction profile and the chemical structural similarity. Our study shows the limitation of REMAP  For instance, the x-axis of Tc0.5to0.6 means that for the tested chemicals, at least one trained chemical was found such that 0:5 C ðc 1 ;c 2 Þ 0:6 and no trained chemical was found in greater similarity than 0.6. All TPR values are based on 10-fold cross validation. Error bars represents s.e.m. Asterisks represents statistical significance based one t-test of the 10 TPR values ( Ã for p < 0.05, ÃÃ for p < 0.001).
(EPS) S1 Table. The drugs and target information for the largest cluster in Fig 7. (XLSX) S2 Table. The protein IDs and annotations for Table 2 and S1 Table. Obtained from Uni-Prot. (XLSX)