DTI-SNNFRA: Drug-Target interaction prediction by shared nearest neighbors and fuzzy-rough approximation

In-silico prediction of repurposable drugs is an effective drug discovery strategy which supplements de-nevo drug discovery from scratch. Lack of severe side effects of repurposable drugs in other applications and most importantly the reduced time frame in development suits the urgency of the situation in general. Most recent, and most advanced artificial intelligence (AI) approaches have boosted drug repurposing in terms of throughput and accuracy enormously. But, with the growing number of drugs, targets and their huge interactions produce imbalance data which may not be suitable to directly feed into the classification model. Here, we proposed DTI-SNNFRA, a framework for identifying drug-target interaction (DTI), based on shared nearest neighbor (SNN) and fuzzy-rough approximation (FRA). It uses a sampling technique to collectively reduce the huge search space covering the available drugs, large number of drugs targets and millions of interactions between them. DTI-SNNFRA operates on two stages: first it uses SNN followed by a partitional clustering for sampling the search space. Next, it computes degree of fuzzy-rough approximations and with proper degree threshold selection for undersampling of the negative samples from all possible interaction pairs between drugs and targets obtained in the first stage. Finally, classification is performed using the positive and selected negative samples. We evaluate the efficacy of DTI-SNNFRA using AUC (Area under ROC Curve), Geometric Mean, and F1 Score. The model performs extremely well with high prediction score (around 0.95). The predicted drug-target interactions are validated through a existing drug-target database (Connectivity Map (Cmap)).


INTRODUCTION
The drug development strategies aka drug repositioning, drug repurposing, drug reprofiling, predicts the interaction between drug and target from the existing drug-target database [19]. There are two types of drug-target interaction. One is called competitive inhibitors which adhere themselves to the active site of the target to suppress the reaction. Another type is allostoric inhibitors which prevents their target to refrain from reactions by attaching themselves to the allosteric site of the target which helps in M killing pathogens to recover from the diseases. There exist several synthesized compounds whose target profiles and effects are still unknown. The research and findings about compounds properties, its reactions/responses to drugs and their targets have generated large and complex databases which needs efficient computational methods to analyze and prediction of drug-target interaction. New drug design requires more than 13.5 years and the cost exceeds 1.8 billion dollars [5], [7]; still it suffers from the strong side effects on patients. Therefore the repositioning or repurposing of the existing drug is helpful for pharmaceutical companies due to its known side-effects and govt. approval. The drug repositioning usually reinvestigates those existing drugs which were denied approval due to new therapeutic indications. The drug repurposing processes [2] find the already authorized drugs and compounds for treating different diseases. The practical laboratory experiments for identifying the interactions between the drug and target are expensive as well as time-consuming and labor-intensive. Instead of the in vitro experimental search, initially virtual screening is accomplished and then possible candidates go through experimental verification. The in-silico based prediction methods comprises docking simulations that need 3D structure analysis of drug and target molecules to determine the potential binding sites. Despite the excellent accuracy of this process, unavailability of proper 3D structure of drugs and target, and long processing time hinders the docking simulation. Rather the computational approach such as chemogenomics, assorted as ligand based, target based and target-ligand is better in terms of expenses and time. In the chemogenomics the chemical space and genomic space are mined together to find the potential compounds such as imaging probes and drug leads [2]. Plenty of machine learning techniques based on the similarity computation, matrix factorization, network models, features driven and deep learning models for DTIs prediction are prevalent in the literature [19], [6]. The similarity based approaches find how a drug (or target) or new drug is similar to the known pairs based on the pharmacological similarity of drugs, genomic similarity of protein sequences, similarity measures with respect to chemical based, ligand based, expression based, side effect based and annotation based [24], [2]. The disadvantage of this approach is that only a small number of interactions are available for comparison and there are a large number of pairs for which interactions are not known. There is an extensive collection of matrix factorization algorithms, in which [2], given an interaction matrix X n×m , the main goal is to decompose it into two lower order matrices, Y n×k and Z m×k such that X = Y Z T with k < n, m which further will go through matrix completion technique to compute the missing data that help in DTI prediction task. In feature based [2] methods the drug and target vector are concatenated and append the binary label for positive and negative interaction. Examples of features based method include Bagging-based Ensemble method(BE-DTI) [21] that employs dimensionality reduction and active learning. In [8], first feature subspacing and then three different dimensionality reduction techniques are used to prepare training data for base learners, namely decision tree and kernel ridge regression.Network-based models such as TL-HGBI, DrugNet utilizes heterogeneous networks not only to predict the drug but also recommend the way of treatments [2], [5]. In [20], the matrix inverse computation is used to compute relevance grade between two nodes in a weighted network of drug-target interactions. Deep learning based DTIs prediction [2], [4] utilize the biological, topological and physico-chemical information of the drug and target to compute vectors/matrices and have the capability to capture the inherent interactions among the drugs and targets over the others state-of-the-art feature computation methods and classifiers. Deep learning techniques sometimes can not be applied due to the unavailability of sufficient information.
In this article, the feature-based method, called DTI-SNNFRA is proposed, under the purview of chemogenomic approaches. Here individual drug and target are represented by the feature vector. The drug-target interaction prediction task is a class imbalance problems as most of the interaction pairs are unannotated and hence belongs to the negative class. This pushes the classifiers to be more biased towards the major class, affecting the proper classification process. In light of the above context, we have proposed a prediction framework for drug-target interaction in two phases that reduces the search space considerably. In the first phase, for each drug-target pair, the shared nearestneighbors (SNN) of this drug are accumulated. These collected drugs are then clustered and the centroid from each cluster is taken as a representative. Representative targets are also determined in a similar way. These representative drugs and targets are used to form drug-target pairs that are fewer in number and are probable candidates for possible interactions. The pairs obtained in this way are treated as negative interaction pairs. The positive interaction pairs are taken from the given drug-target interaction matrix. Despite the reduction in search space, the obtained training set created in this way is highly imbalanced. In order to encounter this problem, we have utilized fuzzy-rough approximation scores and one threshold value to perform undersampling of the negative samples. If the number of negative interaction pairs are considerably less than the positive interaction pairs then oversampling is carried out by Adaptive Synthetic Sampling Method (ADASYN). This produces a reduced and balanced training set which can be used by any general classifiers. We have applied several state-of-the-art classifiers such as SVM, decision tree, random forest for this task to find the accuracy of predicted interactions.
In section 2 of this article, dataset utilized in this work along with method and algorithms are explained. In section 2.3, a brief description and definition of the fuzzy-rough set based lower and upper approximation is outlined. In section 3, results and discussions are presented and finally section 4 draws the conclusion.

MATERIALS AND METHODS
In this section, we describe the datasets used in this work, key ideas of our algorithms and some background of fuzzyrough set. The building block of proposed DTI-SNNFRA method is shown in Figure 1.

Dataset Preparation
In this article, the interaction data is taken from the Drug-Bank database [16] (version 4.3, released on 17 Nov. 2015) and from dataset mentioned in [23]. For dataset1, the number of drugs are 5877, targets are 3348 and the number of interactions between the drugs and targets are 12674. Every drug and target are represented by the feature vectors. The feature vector of each drug has been computed by Rcpi [3] package and the PROFEAT [17] web server which is represented by constitutional, topological and geometrical descriptors. The feature vector of each target has been computed using amino acid composition, pseudo-amino acid composition and CTD (composition, transition, distribution) descriptors, among others. The number of features for drug and target of dataset1 are 193 and 1290, respectively. In the dataset2, every drug is represented by a binary vector where each element of this vector denotes the presence and absence of one of 881 chemical substructures. Each target of dataset2 is also represented by a binary vector where each element denotes the presence and absence of 876 different protein as mentioned in Pfam database [10]. The high dimensional features vector of drug and target are concatenated to represent the drug-target pair features vector and can be represented for dataset1 as: These drug-target pairs features vectors are then normalized in the range [0, 1] by min-max method for avoiding bias towards any feature.

Workflow of the proposed framework
In this section the basic steps of our proposed method are described:

Step 1: Finding positive and negative drug-target pairs and spiliting
After the normalization, only the drugs and targets which have known interactions in the interaction matrix are used to form the positive samples for classifiers to predict new drug target interactions. But the number of negative samples derived from the interaction matrix is really a big number. Note that the high dimensionality and large number of samples may have diverse effect in the prediction task. The proposed method can tackle both the issues. Finding characteristically similar drugs and targets using the nearest-neighbor query facilitates new drug-target prediction. Determination of the nearest-neighbors using similarity distance measures are sensitive to the dimensionality and the distribution of the dataset. The popular similarity function L 1 and L 2 in Minkowski space infers the fact that for a certain data distribution, the relative difference of the distances of the closest and farthest data points of an independently selected point goes to 0, as the dimensionality increases. For this reason, the primary distances functions like L 1 , L 2 , and cosine etc. are not suitable for high dimensional data. In this context, Shared Nearest Neighbor (SNN) based ranking is strong in higher dimensions than primary distances [13]. Assume the dataset S consisting of n = |S| objects and k ∈ N + . For each individual drug (or target), let N N k (x) ⊆ S represents k-nearest-neighbors of x ∈ S as computed using L 2 similarity measure. The overlap between the computed shared nearest neighbors sets of the objects x and y is represented as: The Algorithm 1, provides the procedure to compute shared nearest neighbors and the Algorithm 2, outlines how the training dataset is prepared for classifiers. Suppose, the total number of drug and targets are M and N . Assume drug d i,i∈M interacts with target t j,j∈N . Now for this d i , the indices of all drugs in SN N k (d i , d r ), ∀r ∈ M and i = r are identified and assigned to snnD i . Similarly, for the target t j , the indices of all targets in SN N k (t j , t r ), ∀r ∈ N and j = r are identified and assigned to snnT j . Then all the drugs and targets in snnD i and snnT j are clustered using the k-medoids clustering and centroids are selected as a representatives of snnD i and snnT j . The Calinski-Harabasz criterion is used here to determine the correct number of clusters. Now these representatives drugs and targets from snnD i and snnT j are taken to make cartesian product pairs. Subsequently, corresponding drug vector and target vector are concatenated for each cartesian product pair which in turn are included in negative samples set. Selection of the negative samples obtained by the above shared nearest neighbors approaches reduces the inclusion of the irrelevant drug-target pairs. For example, in dataset1, the number of approved drug-target pairs are 12674 and number of all possible pairs from which interaction may be predicted are 19663522. So, handling these huge number of negative samples pairs against 12674 positive samples are very difficult. The number of shared nearest neighbors of one drug or target, where this one drug or target is taken from each positive drug-target pair, with all others remaining drugs and targets, is rather producing fewer number of drug-target pairs for negative samples. Therefore, SNN based initial selection can be used to reduce the large number of irrelevant negative drug-target pairs. For dataset1, the number of drug-target pairs from the pool of 19663522 pairs are 45933 which indicates 427 times samples removal. Positive and negative set of samples thus obtained are divided into m and n groups, respectively.

Step 2: Decision table preparation and average approximation degree computation
Each group from the negative set, say, n j is taken m times with m group from the positive set and m number of decision table are prepared. Now each decision table are used to compute the fuzzy rough upper approximation degree of each sample in the n j group. The m number of upper approximation degree of each sample in the n j group are then taken for average upper approximation degree computation.

Step 3: Under sampling based on approximation degree
One grade membership is computed for every negative sample with respect to all positive interactions as per the Algorithm 3. The fuzzy-rough upper approximation degree as mentioned in Section 2.3 is computed here which possibly indicates the possible interaction degree value between 0 to 1 scale. Now, one threshold value near 1 can be assumed to select many samples whose fuzzy-rough upper approximation degree are greater than or equal to that threshold in order to augment positive samples set. Another one threshold value near 0 can be assumed to select many samples whose fuzzy-rough upper approximation degree are less than or equal that threshold in order to undersample the negative samples.

Step 4: Oversampling, if required
If the number of negative samples as selected in Step 3 are less than the total positive samples then oversampling using Adaptive Synthetic Sampling Method (ADASYN) [12], prior to feed into the classification or regression model, can be used to balance the positive negative ratio.

2.2.5
Step 5: Interaction prediction Dataset as obtained in step 4 are then used for classification or regression for predicting the interaction of the pairs belong to the negative set.

Fuzzy rough set
Assume that the drug-target pairs as obtained by given interaction matrix and SNN based initial filtering constitute a decision table is called IT . In this table, every row is denoted by m numbers of features i.e. C = {f i : 1 ≤ i ≤ m} and one decision attribute D = {d}.
In this IT , how two objects are indiscernible are determined by calculating fuzzy indiscernibility relation (FIR). Subsequently, this indiscernibility relation is taken to determine fuzzy-rough lower and upper approximation. The fuzzy lower and upper approximations using fuzzy similarity relation (either fuzzy equivalence or tolerance relation), in pursuance of Radzikowska's model, to approximate a concept Y are defined as [15]: Here, in equations 2 and 3, I indicates a fuzzy implicator, T denotes a t-norm and R P is the fuzzy similarity relation computed by the features subset P ⊆ C. To calculate the fuzzy similarity relation R P , which is used in fuzzy lower and upper approximations as mentioned in the equation 2, 3, with respect to features subset P ⊆ C the following equation may be taken.
Here, µ R f (x, y) denotes the similarity degree between object x and y with respect to feature f . Some examples of fuzzy similarity relations are given below: where σ 2 stands for the variance of feature f .

Upper approximation degree computation:
In Figure 1, the fuzzy rough upper approximation degree is computed as follows: 1. Computing fuzzy indiscernibility relation of conditional attributes using the lukasiewicz t-tnorm and tolerance relation as mentioned in section 2.3.
2. Computing fuzzy indiscernibility relation of decision attribute using the its crisp relation.
3. Computing fuzzy upper approximation using the lukasiewicz t-tnorm as per the equation 3.
This fuzzy upper approximation degree can be used to select the samples from the negative samples set.

Algorithm 2: Dataset Preparation
Input: DT = drug-target interaction matrix D = feature matrix for the drug T = feature matrix for the target Output: labeled TrainingDataSet

Performance metrics
This section explains the experimental results by using three metrics referred to as ROC-AUC scores, F1 scores and Geometric Mean scores [9]. The ROC-AUC provides a single score used to compare the models. It ranges from 0 to 1 where 1 indicates the perfect model and 0.5 represents a model having no prediction skill and the values less than 0.5 indicate that the prediction skill is worse than no skill. This ROC-AUC performance evaluation is unconcerned for high imbalanced dataset. How well the positive class and the negative class are predicted by a model are represented by the sensitivity and specificity. The sensitivity and specificity together can be integrated into a single score called geometric mean is represented by sqrt(Sensitivity * Specificity) where the Sensitivity = TruePositive / (TruePositive + FalseNegative) and Specificity=TrueNegative / (FalsePositive + TrueNegative). The F1 score can be used to achieve the balance between Precision and Recall. It is also used where class imbalance due to large number of negative samples are present. All three scores are calculated here using 5-fold cross-validation and average AUC, F1 score and G-mean score are computed. Note that the datasets 1 and 2 as mentioned in section 2 are ( upperApproxDegree of N groupIndexOf n | f or each groupIndexOfn ∈ seq(1 : n) and ∀ groupIndexOfm))

Sampling:
tp and tq are the thresolds for Mp and Mq used for prediction.

Five sampling methods and EnsemDT and En-semKRR vs proposed method
The prediction scores of the DTI-SNNFRA method has been compared with the five sampling techniques known as SMOTE, ADASYN, RUS, SMOTEENN, SMOTETomek respectively in Fig. 2 using the SVM, Decision Tree (DT), and Random Forest(RF) Classifiers. Also the AUC score of the DTI-SNNFRA method is compared with EnsemDT and EnsemKRR [8] in Table 1. The list of parameters for RF are as follows: nEstimators = 100, gini criterion function is used to measure the quality of a split. The minimum number of samples required to split an internal node is minSsamplesSplit = 2. The maximum depth of the tree is A. Dataset 1 B. Dataset 2  For SVM classifier, the kernel type used is linear and regularization parameter C = 1.0. The Fig. 4 (A) and (B) demonstrates the variation of the AUC score of the decision tree with respect to two hyperparameters called tree depth and max feature. In Fig. 4 (C), one heatmap has been shown which is computed by the grid search based hyperparameters(n estimators, max depths) selection of the random forest model. To prepare negative drug-target pairs, the number of nearest neighbors are 11 which later used to compute the shared nearest neighbors. Total number of features for the drug-target pairs are 1483 for dataset1 ( or 1757 for dataset2) which is very high. We have transformed these high-dimensional data into lower dimensions by the PCA  for computing the approximation degree using fuzzy-rough set.

Feature selection and comparisons
In Fig. 3, the prediction scores in term of AUC values have been shown for both datasets considering feature selection and no feature selection. These prediction scores have been computed by RUSBoostClassifier as it mitigates the class imbalancing problem during learning by the random undersampling the samples at each iteration of boosting. The training and testing ratio for RUSBoostClassifier is taken here as 70:30. In RUSBoostClassifier, number of base learner and learning rate used as parameters are 50 and 0.1, respectively. For feature selection, the features importance scores have been computed using XGBoost and random forest. Random forest assumes the feature importance is the average of all decision tree feature importance. These two feature importance computation methods splits the datasets where positive and negative samples appear in approximately equal numbers. All the positive and negative groups pairs individually taken by the XGBoost and random forest classifiers for computing the feature importance. Finally, average feature importance scores are computed and top 100 features are taken for prediction. The average execution time, without feature selection, over 50 thresholds for the dataset1 and dataset2 are 617.66 sec. and 232.07 sec. respectively. When feature selection is considered, these average execution time for the dataset1 and dataset2 are 232.07 sec. and 77.61 sec. respectively.

Sensitivity vs number of base learners and classification errors
In Fig. 3, two plots represent the M vs Sensitivity graph for both datasets where M represents the number of base learner that ranging from 1 to 50. This experiment is carried out for few thresolds values. The classification errors means the proportion of samples that the classifier misclassified are also reported in Fig. 3.

Drug-target interaction of the proposed method
In Table 3, few interactions between the drugs and targets have been provided. To test the efficacy of the proposed method, we have omitted several known interactions from training data. Then training with the remaining data and verifying the prediction of the previouly omitted data has been carried out for many pairs. The seven drugs for Serine hydroxymethyltransferase, cytosolic are predicted correctly and five are listed in Table 3, also five new interactions have been shown. In similar way, in Table 3 , few correct predictions and few new intearctions are shown for the drug named Monoamine oxidase. Also in the Table 3, few correct prediction and new interaction have been mentioned for the the two targets called alpha-D-glucose 6-phosphate and Adenosine-5-Diphosphoribose. In Fig. 5, some drug-target interactions have been shown and few links are showing the interactions between the treatments area and drugs.

Drug-target interaction validation
To verify our prediction results, we have used the Connectivity Map (Cmap) [14] prediction results provided by the Broad Institute. The drug name and target name in the Drugbank dataset have the different representation in Cmap. To match our prediction results with the Cmap, some conversion between Drugbank and Cmap have been done using the webchem R package [22]. This R package retrieves the chemical information from the web using a suite of 14 web services. Our prediction results of drug-target pairs with respect to Drugbank dataset are taken by webchem which only fetch the information from Wikidata. Due to the lack of information in the suite of web services except Wikidata as provided by webchem R package, we have obtained few matching between our prediction and Cmap predictions. In Table 2, a list of 50 pairs of the drugs and targets are given which are the prediction of our method and also validate the Cmap prediction results. Some drug-target pairs in Table 2, as predicted by our proposed method, have the interactions as (DB01248, P07437), (DB04846, P07550), (DB00839, Q09428), (DB00450, P35462), (DB00776, Q9Y5Y9), (DB00776, P35498) are also reported in [18], [26], [1], [11], [25] and [11] respectively. Thirty-four boldface drug-target pairs in the Table 2 are the result of our prediction model are also validated by the Cmap.

CONCLUSION
In this article a novel computational approach for drugtarget interaction prediction is presented utilizing existing drug-target data. There are two major caveats in this domain: massive amount of drugs and targets creating a huge search space and highly imbalanced drug-target interactions dataset as there are very small number of drugtarget interactions unveiled so far. Thus, the size of the negative samples is much higher than size of the positive samples. Here, we have used shared nearest neighbours rather than taking a fixed number of nearest neighbours as it will be more effective in higher dimensional dataset. The reason behind this is that typically, the size of the overlapped items within the neighborhoods of a pair of drugs or targets inside the same group is substantially larger than the neighborhoods of a pair of drugs or targets belonging to different groups. Moreover, to reduce the the curse of imbalanced dataset, these shared nearest neighbors are further grouped by k-medoids and the representative centroids of k-medoids for the drug and target are considered as new possible drug-target interaction pairs for each known drug-targets pair. Additionally, in order to deal with imbalanced dataset further, we have computed the degree of fuzzy-rough upper approximation of all possible interaction pairs in the negative samples set to perform undersampling. Thereafter, selecting a threshold of the computed degrees, size of the negative and positive samples sets were balanced. This upper approximation degree based undersampling of the negative set causes improvement of the prediction scores. The computation cost of degree of the fuzzy-rough upper approximation is challenging as the dimension interaction pairs is extremely high. The execution time of this fuzzy-rough upper approximation degree is directly proportional to the number of features. Therefore, further investigation on fuzzy-rough set based feature selection followed by fuzzy-rough upper approximation computation may improve the prediction score. Instead of using a single threshold for undersampling, multiple threshold based undersampling may be investigated for tackling the curse of imbalanced datasets. Moreover, oversampling of the positive samples to balance the number negative samples and the number positive samples may be explored to improve the prediction score. The proposed method DTI-SNNFRA not only selects the possible candidates using shared nearest neighbors in the first stage but also has the ability to assign the grade of interaction through fuzzy-rough upper approximation degree computation. The threshold based undersampling and balancing the positive and negative interaction pairs leads to a robust classification model. We believe that DTI-SNNFRA may be a promising framework for drug-target interaction prediction.