A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction

Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are far fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.


Introduction
Protein-ligand interactions are ubiquitous in virtually all biological processes [1][2][3], and the prediction of protein-ligand interactions using automated computational methods has been an area of intense research in bioinformatics [4][5][6][7][8][9][10][11][12][13][14][15]. As important ligand types, nucleotides (e.g., ATP, ADP, AMP, GDP, and GTP) play critical roles in various metabolic processes, such as providing chemical energy, signaling, and replication and transcription of DNA [10][11][12][13][14][15]. The residues in a protein to which nucleotides bind are called protein-nucleotide binding residues. By interacting with the binding residues in a protein, nucleotides can carry out their specific biological functions. Furthermore, protein-nucleotide (e.g., protein-ATP) binding residues are considered valuable targets of therapeutic drugs [12]. Hence, accurate identification of nucleotide-binding residues in protein sequences is of significant importance for protein function analysis and drug design [16], especially in the post-genomic era, as large volumes of protein data have not been functionally annotated.
Much effort has been made to identify and characterize nucleotide-binding residues from protein sequences. In the early stages, motif-based methods [17][18][19][20][21] dominated this field. For most motif-based methods, conserved motifs in known nucleotide-binding protein sequences or structures are first identified; then, the identified motifs are further utilized to uncover potential binding residues in unannotated proteins. Although considerable progress has been achieved in motif-based methods, challenges remain. As Chen et al. [14] reported, motif-based methods often characterize the protein-nucleotide interaction motifs within a relatively narrow range, usually only for a selected interaction mode for a single nucleotide type; in addition, some motif-based methods require tertiary protein structure as the input, which substantially limits their utility, as it is very common in many realistic application scenarios for a given protein target to only have sequence information and no corresponding tertiary structure information [22,23].
The above-mentioned challenges have motivated researchers in this field to develop machine-learning-based methods for predicting protein-ligand binding residues solely from protein sequences [4][5][6][13][14][22][24][25][26]. In pioneering work, Chauhan et al. [13] designed a predictor, called ATPint, specifically for predicting protein-ATP binding residues. This group also designed a GTP-specific predictor for protein-GTP binding residue prediction [27], and their earlier studies demonstrated the feasibility of predicting protein-nucleotide binding residues solely from protein sequence information [13,27]. Later, researchers tended to design predictors that covered a wide range of nucleotide types. For example, Firoz et al. [15] implemented a method of performing binding residue predictions for six nucleotide types, i.e., AMP, GMP, ADP, GDP, ATP and GTP. Recently, Chen et al. [14] presented a predictor, called NsitePred, that could also be used to perform binding residue predictions for multiple nucleotides based on much larger training datasets. All in all, great success has been achieved in this field.
Machine-learning-based protein-nucleotide binding residue prediction is, in fact, a typical imbalanced learning problem because the number of negative samples (i.e., non-binding residues) is significantly larger than that of positive samples (i.e., binding residues). Previous studies in the machine-learning field have shown that direct application of traditional machine-learning algorithms tends to result in a bias toward the majority class [28]. Unfortunately, most of the existing machine-learning-based predictors, including ATPint [13], ATPsite [24], and NsitePred [14], have not carefully considered this serious class imbalance phenomenon.
Among the solutions proposed for imbalanced learning, the sample rescaling strategy (e.g., over-sampling [39] and under-sampling [40]) is the basic technique, and it attempts to balance the sizes of different classes by changing the number and distribution of samples within them; this strategy has been demonstrated to be effective for imbalanced learning problems [29,30]. For example, we recently investigated class imbalance in the protein-nucleotide binding prediction problem and found that prediction performance could be improved by balancing the number of samples in different classes via an under-sampling technique [22,25,26].
In this study, we seek to overcome the problem of class imbalance via an over-sampling technique. In contrast to the under-sampling technique, which reduces the size of the majority class, an over-sampling technique attempts to balance the sizes of different classes by generating additional samples for the minority class. To date, many over-sampling techniques have emerged, including random over-sampling (ROS), the synthetic minority over-sampling technique (SMOTE) [39], and adaptive synthetic sampling (ADASYN) [41]. Motivated by these existing oversampling techniques, in this study, we propose a new supervised over-sampling (SOS) algorithm that synthesizes new additional

Benchmark Datasets
Two benchmark datasets were chosen to evaluate the efficacy of the proposed SOS algorithm and of the implemented predictor, TargetSOS. The first dataset [13], ATP168, consists of 168 nonredundant, ATP-interacting protein sequences, of which the maximal pairwise sequence identity is less than 40%. In total, ATP168 includes 3104 and 59226 residues for ATP binding and ATP non-binding, respectively. The second dataset [14], NUC5, is a multiple nucleotide-interacting dataset that consists of five training sub-datasets, each for a specific type of nucleotide; more specifically, NUC5 consists of 227, 321, 140, 56, and 105 protein sequences that interact with five types of nucleotides, i.e., ATP, ADP, AMP, GTP, and GDP, respectively, and the maximal pairwise identity of the sequences of each of the five sub-datasets is less than 40%. In addition, for each nucleotide type, Chen et al. [14] constructed a corresponding, independent validation dataset to evaluate the generalization capability of a prediction model. For each independent validation dataset, the maximal pairwise sequence identity is culled to 40%. Furthermore, any sequence in the independent validation dataset shares less than 40% identity to sequences in the corresponding training sub-dataset. Table 1 summarizes the detailed compositions of the two benchmark datasets. All data listed in Table 1 can be found in Supporting Information S1. Further details regarding the construction of the datasets can be found in [13] and [14].

Feature Representation and Classifier
The main purpose of this study is to demonstrate the feasibility of the proposed SOS algorithm and its effectiveness in protein-nucleotide binding residue prediction. To this end, only the most commonly used feature representation methods and classifiers in the field of protein-nucleotide binding residue prediction are used. More specifically, the position-specific scoring matrix (PSSM) and predicted protein secondary structure (PSS), both of which have been demonstrated to be especially useful for protein-nucleotide binding residue prediction [13,14,25,26], are taken to extract discriminative feature vectors. Support vector machine (SVM) [42] is used as a classifier for constructing a prediction model.

A. Extract Feature Vector from the Position-Specific Scoring Matrix. Position-specific scoring matrix (PSSM) derived features have been widely used in bioinformatics, including intrinsic disorder prediction [43][44][45], protein secondary structure prediction [46], transmembrane helix prediction [47][48][49], protein 3D structure prediction [50], and protein-ligand binding prediction [14,51]. In this study, we obtain the PSSM of a query protein sequence by running PSI-BLAST [52] with the query sequence against the Swiss-Prot database for three iterations, with 0.001 as the E-value cutoff. To facilitate the subsequent computation, we further normalize each score, denoted as x, that is contained in the PSSM using the logistic function $f(x) = 1/(1 + e^{-x})$. Based on the normalized PSSM, the feature vector, denoted LogisticPSSM, for each residue in the protein sequence can be extracted by applying a sliding-window technique, as follows [25,26]: for a residue at position i along the query sequence, its LogisticPSSM feature vector consists of the normalized PSSM scores of the query sequence that correspond to a sequence segment of length W that is centered on i. It has been demonstrated that W = 17 is a good choice for several protein-ligand binding residue prediction studies [25,26]. Consequently, the dimensionality of the LogisticPSSM feature vector of a residue is 17 × 20 = 340.
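The logistic normalization and sliding-window extraction described above can be sketched as follows. This is a minimal illustration on a made-up score matrix, not the authors' implementation; in particular, the zero-padding used for window positions that fall outside the sequence is an assumption, since the text does not state how sequence termini are handled, and the function names are hypothetical.

```python
import math

def logistic(x):
    """Map a raw PSSM score to (0, 1) with f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sliding_window_features(matrix, window=17, pad_value=0.0):
    """Return one flattened feature vector per row (residue) of `matrix`.

    Window positions outside the sequence are filled with `pad_value`;
    this boundary handling is an assumption, not taken from the paper.
    """
    half = window // 2
    pad_row = [pad_value] * len(matrix[0])
    features = []
    for i in range(len(matrix)):
        vec = []
        for j in range(i - half, i + half + 1):
            row = matrix[j] if 0 <= j < len(matrix) else pad_row
            vec.extend(row)
        features.append(vec)
    return features

# Toy PSSM: 30 residues x 20 amino-acid columns of made-up integer scores.
toy_pssm = [[(r * 7 + c * 3) % 11 - 5 for c in range(20)] for r in range(30)]
normalized = [[logistic(s) for s in row] for row in toy_pssm]
logistic_pssm = sliding_window_features(normalized, window=17)
print(len(logistic_pssm), len(logistic_pssm[0]))  # 30 340
```

The same windowing function applies unchanged to a matrix with three columns, which would give 17 × 3 = 51 values per residue.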

B. Extract Feature Vector from the Predicted Protein Secondary Structure. PSIPRED [53], which has been widely used in bioinformatics [54,55], can predict the probabilities of each residue in a query protein sequence belonging to three secondary structure classes, i.e., coil, helix, and strand. We obtained the predicted protein secondary structure by running PSIPRED on the query sequence. The obtained predicted secondary structure is an L × 3 probability matrix, where L is the length of the protein sequence. Similar to the LogisticPSSM feature extraction, we can extract a 17 × 3 = 51-D feature vector, denoted as PSS, for each residue in the protein by applying a sliding window of size 17. The final discriminative feature vector of a residue is formed by serially combining its LogisticPSSM feature with the corresponding PSS feature, and the dimensionality of the obtained feature vector for the residue is 340 + 51 = 391.
C. Support Vector Machine. Support vector machine (SVM), which was proposed by Vapnik [42], has been widely used in a variety of bioinformatics fields, including the protein-nucleotide binding residue prediction [13,14] considered in this study. In view of this, we will also use SVM as the base-learning model to evaluate the efficacy of the proposed SOS algorithm.
Here, we will briefly introduce the basic idea of SVM. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the set of training samples, where $x_i \in R^d$ and $y_i \in \{+1, -1\}$ are the feature vector and the corresponding label of the i-th sample, respectively, and +1 and -1 are the labels of the positive class and negative class, respectively.
In linearly separable cases, SVM constructs a hyperplane that separates the samples of two classes with a maximum margin. The optimal separating hyperplane (OSH) is constructed by finding a vector, w, and a parameter, b, that minimize $\frac{1}{2}\|w\|^2$ and satisfy the following conditions:
$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N \tag{1}$$
where w is a vector normal to the hyperplane, and $\|w\|$ is the Euclidean norm of w.
The solution is a unique, globally optimized result with the following expansion:
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{2}$$
Support vectors are those $x_i$ whose corresponding $\alpha_i > 0$.
Once w and b are found, a query input x can be classified as follows:
$$f(x) = \mathrm{sgn}(w \cdot x + b) \tag{3}$$
To allow for mislabeled examples, Corinna Cortes and Vladimir N. Vapnik suggested a modified maximum margin idea, i.e., the "soft margin" technique [56].
For each training sample, a corresponding slack variable is introduced: $\xi_i \ge 0$, $i = 1, 2, \ldots, N$. Accordingly, the relaxed separation constraint is given as:
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \tag{4}$$
Then, the OSH can be solved by minimizing
$$\frac{1}{2}\|w\|^2 + c \sum_{i=1}^{N} \xi_i \tag{5}$$
where c is the regularization parameter. Furthermore, to address non-linearly separable cases, the "kernel substitution" technique is introduced as follows: first, the input vector $x_i \in R^d$ is mapped into a higher dimensional Hilbert space, H, by a non-linear kernel function, $K(x_i, x_j)$; then, the OSH in the mapped space, H, is solved using a procedure similar to that for a linear case, and the decision function is given by:
$$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\right) \tag{6}$$
To train an SVM on a given dataset, the kernel function and the regularization parameter c need to be specified in advance. In this study, LIBSVM [57] (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) is used, and the radial basis function (RBF) kernel, which is one of the most commonly used kernel functions, is chosen as the kernel function. The regularization parameter c and the kernel width parameter σ are optimized based on 10-fold cross-validation using a grid search strategy in the LIBSVM [57] software.
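To make the kernelized decision function concrete, the following sketch evaluates $\mathrm{sgn}(\sum_i \alpha_i y_i K(x, x_i) + b)$ for a hand-picked model. It assumes a Gaussian (RBF) kernel parameterized by a width σ, which is consistent with the kernel width parameter mentioned above but is not LIBSVM's own code or parameterization; the support vectors, coefficients, and bias are hypothetical values chosen for illustration, not a trained model.

```python
import math

def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, sigma):
    """Return sgn(sum_i alpha_i * y_i * K(x, x_i) + b) as +1 or -1."""
    score = b
    for alpha, y, sv in zip(alphas, labels, support_vectors):
        score += alpha * y * rbf_kernel(x, sv, sigma)
    return 1 if score >= 0 else -1

# Hypothetical model: one positive and one negative support vector.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [+1, -1]

print(svm_decision([0.1, 0.2], support_vectors, alphas, labels, b=0.0, sigma=1.0))  # 1
print(svm_decision([1.9, 2.1], support_vectors, alphas, labels, b=0.0, sigma=1.0))  # -1
```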

Dealing with Class Imbalance: A New Supervised Over-Sampling Method
As described in the introduction section, protein-nucleotide binding residue prediction is a typical imbalanced learning problem. Table 1 shows that a severe class imbalance exists in both the training datasets and the independent validation datasets: the ratio of the number of non-binding residues to that of binding residues is often larger than 20.
In this study, we propose a new SOS algorithm for relieving the severity of class imbalance to facilitate the subsequent statistical machine learning methods. To demonstrate the effectiveness of the proposed SOS, several popular over-sampling methods, including ROS, SMOTE [39], and ADASYN [41], are used to perform comparisons with the proposed SOS.
A. Random Over-sampling. In the ROS technique, the minority set S min is augmented by replicating randomly selected samples within the set.
Although ROS is simple and easy to perform, a potential problem is that a classifier trained on the resulting dataset tends to over-fit because ROS simply appends replicated samples to the original dataset; thus, multiple instances of certain samples become "tied" [58]. In view of this issue, several improved over-sampling techniques, e.g., SMOTE [39] and ADASYN [41], have been proposed and have shown promising results in various imbalanced applications. In this study, two improved over-sampling techniques, i.e., SMOTE [39] and ADASYN [41], were considered.
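A minimal sketch of random over-sampling on toy data is given below; the function name and the toy samples are illustrative assumptions. It shows why replicated samples become tied: the augmented set is larger but contains no new distinct points.

```python
import random

def random_over_sample(minority, n_extra, seed=0):
    """Return `minority` plus `n_extra` randomly chosen duplicates of its members."""
    rng = random.Random(seed)
    return list(minority) + [rng.choice(minority) for _ in range(n_extra)]

minority = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
augmented = random_over_sample(minority, n_extra=3)

print(len(augmented))                      # 6
print(len({tuple(x) for x in augmented}))  # 3: no new distinct points were created
```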
B. Synthetic Minority Over-sampling Technique. The SMOTE method [39] augments the minority class set S min by creating artificial samples based on the feature space similarities between existing minority samples. The SMOTE procedure is briefly described below.
For each sample $x_i$ in $S_{min}$, let $S_i^K$ be the set of the K-nearest neighbors of $x_i$ in $S_{min}$ under the Euclidean distance metric. To synthesize a new sample, an element in $S_i^K$, denoted as $\hat{x}_i$, is selected, and the feature vector difference between $\hat{x}_i$ and $x_i$ is multiplied by a random number in [0, 1]. Finally, this vector is added to $x_i$:
$$x_{new} = x_i + (\hat{x}_i - x_i) \times \delta \tag{7}$$
where $\delta \in [0, 1]$ is a random number. These synthesized samples help break the ties introduced by ROS and augment the original dataset in a manner that, in general, significantly improves subsequent learning [28].
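The SMOTE synthesis step can be sketched as follows: pick one of the K nearest minority neighbors of a sample and move a random fraction of the way toward it. This is an illustrative sketch on toy data, not the reference SMOTE implementation; the helper names are hypothetical, and the neighbor search is a brute-force sort.

```python
import math
import random

def k_nearest(x, pool, k):
    """Return the k samples in `pool` closest to `x` (Euclidean), excluding `x` itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]

def smote_sample(x, minority, k, rng):
    """Synthesize one sample: x + (neighbor - x) * delta, with delta drawn from [0, 1)."""
    neighbor = rng.choice(k_nearest(x, minority, k))
    delta = rng.random()
    return [xi + (ni - xi) * delta for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
new_sample = smote_sample(minority[0], minority, k=2, rng=rng)
print(new_sample)  # a point on the segment between (0, 0) and one of its 2 nearest neighbors
```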
C. Adaptive Synthetic Sampling. SMOTE creates the same number of synthetic samples for each original minority sample without considering the neighboring majority samples, which increases the occurrence of overlapping between classes [28]. In view of this limitation, various adaptive over-sampling methods, e.g., ADASYN [41], have been proposed.
ADASYN uses a systematic method to adaptively create different numbers of synthetic samples for different original minority samples according to their distributions. The ADASYN procedure is briefly described below.
The number of samples that must be synthesized for the entire minority class is computed first:
$$G = (|S_{maj}| - |S_{min}|) \times \beta \tag{8}$$
where $\beta \in [0, 1]$ is a parameter that determines the balance level after the ADASYN process. Then, for each original sample, $x_i \in S_{min}$, its K-nearest neighbors are found according to the Euclidean distance metric, and the distribution function, $\Gamma_i$, which is defined as:
$$\Gamma_i = \frac{\Delta_i / K}{Z}, \quad i = 1, 2, \ldots, |S_{min}| \tag{9}$$
is calculated, where $\Delta_i$ is the number of samples in the K-nearest neighbors of $x_i$ that belong to $S_{maj}$, and Z is a normalization constant so that $\Gamma_i$ is a distribution function, i.e., $\sum_i \Gamma_i = 1$. Next, the number of synthetic samples that must be generated for each $x_i \in S_{min}$ is computed:
$$g_i = \Gamma_i \times G \tag{10}$$
Finally, for each $x_i \in S_{min}$, $g_i$ synthetic samples are generated according to Eq. (7), as in SMOTE.
The key difference between ADASYN and SMOTE is that the former uses a density distribution, $\Gamma$, as a criterion to automatically decide the number of synthetic samples that must be generated for each minority sample by adaptively changing the weights of the different minority samples to compensate for the skewed distributions [28,41]. The latter generates the same number of synthetic samples for each original minority sample.
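The adaptive allocation in ADASYN can be illustrated with a short sketch that computes how many synthetic samples each minority sample receives. The toy data, the function name, the rounding of each allocation to an integer, and the handling of the case where no minority sample has majority neighbors are all assumptions made for illustration.

```python
import math

def adasyn_allocation(minority, majority, k, beta):
    """Return the number of synthetic samples to generate for each minority sample."""
    total = (len(majority) - len(minority)) * beta      # G
    labeled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    ratios = []
    for idx, x in enumerate(minority):
        others = [t for j, t in enumerate(labeled) if j != idx]
        nearest = sorted(others, key=lambda t: math.dist(x, t[0]))[:k]
        n_majority = sum(1 for _, label in nearest if label == "maj")  # Delta_i
        ratios.append(n_majority / k)
    z = sum(ratios)                                     # normalization constant Z
    if z == 0:
        return [0] * len(minority)  # assumption: nothing to allocate without majority neighbors
    # Rounding to whole samples is an assumption; the text leaves this detail open.
    return [round((r / z) * total) for r in ratios]

# Three minority samples sit in a pure minority region; one is surrounded by majority samples.
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
majority = [[5.1, 5.0], [5.0, 5.1], [4.9, 5.0], [9.0, 9.0],
            [9.0, 8.0], [8.0, 9.0], [8.0, 8.0], [9.0, 7.0]]
print(adasyn_allocation(minority, majority, k=2, beta=1.0))  # [0, 0, 0, 4]
```

In this toy case all four synthetic samples are assigned to the minority sample that lies among majority samples, whereas SMOTE would spread them evenly.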
D. Proposed Supervised Over-sampling. Let $S = S_{min} \cup S_{maj}$ be the training dataset, where $S_{min} = \{x_{min}^{(i)}\}_{i=1}^{N_{min}}$ is the minority class sample set, and $S_{maj} = \{x_{maj}^{(i)}\}_{i=1}^{N_{maj}}$ is the majority class sample set. The purpose of the proposed SOS algorithm is to obtain a relatively balanced dataset, denoted as $\hat{S}$, by synthesizing additional minority class samples under a supervised process.
Let $\beta > 1$ be the over-sampling coefficient, which is a scalar quantity that measures the ratio of the size of the minority class sample set after over-sampling to that of the original minority class sample set. In other words, $\beta$ controls how many additional minority samples will be generated: larger values of $\beta$ yield more synthesized minority samples.
The process of the proposed SOS is described as follows:
Step I: Training an initial classifier model, denoted as $C_{model}$, on the original training dataset $S_{min} \cup S_{maj}$. The trained classifier model will be used to judge whether a synthesized minority class sample is valid.
Step II: Synthesizing an additional minority sample. First, two samples, denoted as $x_{min}^{(i)}$ and $x_{min}^{(j)}$, are randomly selected from the minority class sample set $S_{min}$. From the two randomly selected minority class samples, an additional sample can be synthesized:
$$x_{min}^{(new)} = x_{min}^{(i)} + \lambda \left(x_{min}^{(j)} - x_{min}^{(i)}\right) \tag{11}$$
where $\lambda$ is a random value ranging from 0 to 1. Then, the confidence of the synthesized sample, $x_{min}^{(new)}$, being a minority class sample is predicted using the trained initial classifier model $C_{model}$:
$$P(x_{min}^{(new)}) = C_{model}(x_{min}^{(new)}) \tag{12}$$
The validity of the synthesized sample depends on its confidence. More specifically, the synthesized sample is a valid minority class sample if and only if $P(x_{min}^{(new)}) \in [T_{low}, T_{high}]$, i.e., its confidence lies within the prescribed confidence interval $[T_{low}, T_{high}]$.
Step II is repeated until $(\beta - 1) \cdot N_{min}$ valid minority class samples have been synthesized.
Algorithm 1 summarizes the proposed SOS. Note that the three parameters, i.e., $\beta$, $T_{low}$, and $T_{high}$, are problem-dependent. In this study, we set $\beta = 2$, $T_{low} = 0.6$, and $T_{high} = 0.9$.
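As a rough worked example of what the over-sampling coefficient implies, suppose SOS with a coefficient of 2 were applied to the whole ATP168 dataset (3104 binding and 59226 non-binding residues). This is a simplifying assumption: in cross-validation the algorithm operates on the training folds, so the actual counts are smaller. The number of synthesized samples and the imbalance ratio before and after over-sampling would then be approximately:

```latex
\begin{align*}
(\beta - 1) \cdot N_{min} &= (2 - 1) \times 3104 = 3104 \quad \text{synthesized samples} \\
\frac{N_{maj}}{N_{min}} &= \frac{59226}{3104} \approx 19.1 \quad \text{(before)} \\
\frac{N_{maj}}{\beta \cdot N_{min}} &= \frac{59226}{6208} \approx 9.5 \quad \text{(after)}
\end{align*}
```

Under this setting the imbalance is halved rather than eliminated, which matches the stated aim of obtaining a relatively balanced dataset.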
Note that in Step II, it is straightforward and reasonable that a synthesized sample will not be considered valid when its confidence is less than the prescribed lower confidence, $T_{low}$. However, a synthesized sample will also be considered invalid if its confidence is larger than the prescribed upper confidence, $T_{high}$. The underlying reason for this choice is that we believe that a synthesized sample with confidence that is too high tends to become "tied" with the true minority class samples, thus potentially leading to an over-fitting problem.
Algorithm 1. Supervised over-sampling (SOS).
INPUT: $S = S_{min} \cup S_{maj}$: the training dataset, where $S_{min} = \{x_{min}^{(i)}\}_{i=1}^{N_{min}}$ is the minority class sample set and $S_{maj} = \{x_{maj}^{(i)}\}_{i=1}^{N_{maj}}$ is the majority class sample set; $\beta$: the over-sampling coefficient, which is the size of the minority class after over-sampling, divided by that of the original minority class; $[T_{low}, T_{high}]$: the confidence interval, which is used to determine whether a synthetic sample belongs to the minority class.
OUTPUT: $\hat{S} = \hat{S}_{min} \cup S_{maj}$: the over-sampled training dataset, where $\hat{S}_{min}$ is the minority class sample set after over-sampling.
1. Train a classifier model, denoted as $C_{model}$, using the original training set $S_{min} \cup S_{maj}$
2. $\hat{S}_{min} \leftarrow \emptyset$
3. WHILE $|\hat{S}_{min}| < (\beta - 1) \cdot N_{min}$
4. Randomly select two samples, denoted as $x_{min}^{(i)}$ and $x_{min}^{(j)}$, from $S_{min}$
5. Synthesize a new sample: $x_{min}^{(new)} = x_{min}^{(i)} + \lambda (x_{min}^{(j)} - x_{min}^{(i)})$, where $\lambda$ is a random value ranging from 0 to 1
6. Predict the confidence of $x_{min}^{(new)}$ being a minority class sample: $P(x_{min}^{(new)}) = C_{model}(x_{min}^{(new)})$
7. IF $P(x_{min}^{(new)}) \in [T_{low}, T_{high}]$
8. $\hat{S}_{min} \leftarrow \hat{S}_{min} \cup \{x_{min}^{(new)}\}$
9. END IF
10. END WHILE
11. $\hat{S}_{min} \leftarrow \hat{S}_{min} \cup S_{min}$
12. $\hat{S} \leftarrow \hat{S}_{min} \cup S_{maj}$
13. RETURN $\hat{S}$
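The SOS loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the TargetSOS implementation: the trained classifier is replaced by a hypothetical `confidence` callable (here a toy logistic function of the first feature rather than an SVM), the synthesized sample is assumed to be a linear interpolation between the two selected samples, the two samples are assumed to be distinct, and `max_tries` is an added safety limit that the paper does not mention.

```python
import math
import random

def supervised_over_sample(minority, majority, confidence, beta=2.0,
                           t_low=0.6, t_high=0.9, seed=0, max_tries=100000):
    """Return (over-sampled minority set, majority set) following the SOS steps."""
    rng = random.Random(seed)
    target = int(round((beta - 1) * len(minority)))   # number of valid samples needed
    synthesized = []
    tries = 0
    while len(synthesized) < target and tries < max_tries:
        tries += 1
        x_i, x_j = rng.sample(minority, 2)            # two distinct minority samples (assumption)
        lam = rng.random()
        x_new = [a + lam * (b - a) for a, b in zip(x_i, x_j)]
        # Keep the sample only if its confidence lies inside [t_low, t_high].
        if t_low <= confidence(x_new) <= t_high:
            synthesized.append(x_new)
    return synthesized + list(minority), list(majority)

def toy_confidence(x):
    """Hypothetical stand-in for the trained classifier's minority-class confidence."""
    return 1.0 / (1.0 + math.exp(-x[0]))

minority = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
majority = [[-1.0, 0.0], [-2.0, 1.0], [-1.5, 0.5], [-3.0, 0.0], [-2.5, 1.0],
            [-1.2, 0.3], [-2.2, 0.7], [-0.8, 0.1]]
new_minority, new_majority = supervised_over_sample(minority, majority, toy_confidence)
print(len(new_minority), len(new_majority))  # 8 8
```

In a realistic setting, `confidence` would wrap a probability estimate from a classifier trained on the original data, as in Step I.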

Evaluation Indexes
Let TP, FP, TN, and FN be the abbreviations for true positive, false positive, true negative, and false negative, respectively. Then, Sensitivity (Sen), Specificity (Spe), Accuracy (Acc), and the Matthews correlation coefficient (MCC) can be defined as follows:
$$Sen = \frac{TP}{TP + FN} \tag{13}$$
$$Spe = \frac{TN}{TN + FP} \tag{14}$$
$$Acc = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{16}$$
However, these four evaluation indexes are threshold-dependent, i.e., the values of these indexes vary with the threshold that is used in the prediction model. Considering that the MCC measures the overall quality of the binary predictions, we reported these threshold-dependent evaluation indexes by choosing the threshold that maximizes the value of the MCC of the predictions (termed MaxMCC Evaluation in this study).
It has not escaped our notice that several predictors reported their performances by selecting the threshold that balances the values of Sen and Spe [13,25,26] (termed Balanced Evaluation in this study). For the purpose of a fair comparison, we also used Balanced Evaluation when comparing the proposed method with these predictors.
In addition, the Area Under the receiver operating characteristic (ROC) Curve (AUC), which is threshold-independent and for which higher values indicate better prediction performance, was used to evaluate the overall prediction qualities of the considered prediction models.
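The threshold-dependent indexes and the MaxMCC threshold selection can be sketched as follows. The scores and labels are made up for illustration, the function names are hypothetical, and trying every observed score as a candidate threshold is an assumption about how the maximizing threshold is searched.

```python
import math

def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Return Sen, Spe, Acc, and MCC; undefined ratios are reported as 0.0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sen": tp / (tp + fn) if tp + fn else 0.0,
        "Spe": tn / (tn + fp) if tn + fp else 0.0,
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def max_mcc_threshold(scores, labels):
    """Return the observed score that, used as threshold, maximizes the MCC."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: metrics(*confusion(scores, labels, t))["MCC"])

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
best = max_mcc_threshold(scores, labels)
print(best, metrics(*confusion(scores, labels, best)))
```

A Balanced Evaluation variant would instead pick the threshold at which Sen and Spe are closest to each other.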

Supervised Over-Sampling Helps to Enhance Prediction Performance
In this section, we empirically demonstrate that the performance of protein-nucleotide binding residue prediction can be further improved by applying the proposed SOS algorithm. Tables 2 and 3 summarize the performance comparisons between with-SOS and without-SOS for ATP168 and ATP227 over five-fold cross-validation under Balanced Evaluation and MaxMCC Evaluation, respectively. Figure 1 (a) and (b) illustrate the ROC curves of with-SOS and without-SOS for ATP168 and ATP227 over five-fold cross-validation. The results listed in Tables 2 and 3 show that the prediction performances are remarkably improved after SOS is applied. An improvement in the AUC of over 2% is observed for both the ATP168 and ATP227 datasets. In addition, the other four indexes, i.e., Sen, Spe, Acc, and MCC, of the with-SOS predictions are consistently higher than those of the without-SOS predictions. Taking MCC as an example, improvements of 5% and 4% are observed for ATP168 and ATP227, respectively, under Balanced Evaluation, whereas improvements of 12% and 8% are achieved for ATP168 and ATP227, respectively, under MaxMCC Evaluation.

Comparisons with Other Over-Sampling Methods
In this section, we compare the proposed SOS with several other popular over-sampling methods, including ROS, SMOTE [39], and ADASYN [41]. Table 4 shows comparisons of the performance of SOS, ROS, SMOTE, and ADASYN for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation. The results for the four other types of nucleotide ligands, i.e., ADP, AMP, GTP, and GDP, can be found in Supporting Information S2.
From Table 4, it is clear that the proposed SOS significantly outperforms ROS for both ATP168 and ATP227. Taking AUC and MCC, which are two overall measurements of prediction quality, as examples, average improvements of approximately 3% and 5% are observed. We also found that the proposed SOS achieves comparable performance to ADASYN and slightly outperforms SMOTE for ATP168 and ATP227. A similar phenomenon can also be observed for the four other types of nucleotide ligands (refer to Supporting Information S2).
The results listed in Table 4 and Supporting Information S2 show that the proposed SOS performs much better than ROS and can achieve comparable performances to ADASYN and SMOTE, which demonstrates the efficacy of the proposed SOS.

Comparisons with Existing Predictors
In this section, we compare the proposed predictor, called TargetSOS, to the existing popular protein-nucleotide binding residue predictors to demonstrate its efficacy. TargetSOS performs predictions using a SVM model, which is trained with the proposed SOS algorithm in the NUC5 dataset and uses the LogisticPSSM+PSS feature as the model input. The comparisons are performed for both the cross-validation test and the independent validation test. Note that when cross-validation comparisons are performed for ATP168, only the Balanced Evaluation results are reported because the results for most existing predictors that are constructed from ATP168 are reported under Balanced Evaluation. For the same reason, cross-validation comparisons for the NUC5 dataset are reported under MaxMCC Evaluation.
A. Cross-Validation Test. Table 5 lists the performance comparisons of the proposed TargetSOS, TargetATP [26], TargetATPsite [25], and ATPint [13] for ATP168 over five-fold cross-validation under Balanced Evaluation. From Table 5, we find that the proposed TargetSOS significantly outperforms ATPint and is the best performer among the four considered predictors that were specifically designed for protein-ATP binding residue prediction. An improvement of over 5% is observed for each of the five considered evaluation indexes, i.e., Sen, Spe, Acc, MCC, and AUC. In addition, TargetSOS performs better, although not significantly better, than the two most recently released predictors, i.e., TargetATP [26] and TargetATPsite [25].
Table 6 summarizes the performance comparisons between the proposed TargetSOS and several other popular protein-nucleotide binding residue predictors for the NUC5 dataset over five-fold cross-validation under MaxMCC Evaluation. It is found that the proposed TargetSOS almost always achieves the best performance, with only one exception for ATP concerning MCC and AUC, which are two evaluation indexes that measure the overall prediction quality of a predictor. Taking MCC as an example, TargetSOS achieves improvements of approximately 3%, 8%, 6%, 7%, and 3% for ATP, ADP, AMP, GDP, and GTP, respectively, compared with the second-best performer (i.e., TargetATPsite [25] for ATP and NsitePred [14] for ADP, AMP, GDP, and GTP). The underlying reason for the improvement in MCC is that TargetSOS can achieve much higher performance with respect to the true positive rate (i.e., Sen) while simultaneously achieving comparable or even slightly better performances for the true negative rate (i.e., Spe). We believe that this improvement may be a result of the SOS technique.
B. Independent Validation Test. It has been routine procedure to evaluate the generalization capability of a predictor using an independent validation test because evaluating a newly developed predictor by only comparing it to existing predictors and by using the same datasets may potentially lead to optimistically biased results, in the sense that the new predictor's characteristics over-fit the used datasets [59]. Considering this potential bias, we also performed independent validation tests for the proposed TargetSOS and compared its performance with those of several other popular sequence-based protein-nucleotide binding residue predictors, as shown in Table 7.
From Table 7, we find that the AUCs for ATP, ADP, AMP, GDP, and GTP when using TargetSOS on the corresponding independent validation datasets are 0.912, 0.912, 0.880, 0.866, and 0.900, respectively. By revisiting Table 6, it is found that the AUCs of TargetSOS for ATP, ADP, AMP, GDP, and GTP on the training datasets are 0.893, 0.914, 0.850, 0.923, and 0.850, respectively. In other words, TargetSOS achieves similar overall prediction performances (measured by AUCs) on the training dataset and the corresponding independent validation dataset for all five nucleotide ligands, indicating that the generalization capability of TargetSOS, which is derived from the knowledge buried in the training datasets, has not been under- or over-estimated.
In addition, we find that the proposed TargetSOS achieves comparable overall performance (AUC) to the state-of-the-art sequence-based predictors considered in this study. On the other hand, TargetSOS almost always achieves the best performances for MCC, with only one exception for GDP, and an average improvement of approximately 3% is observed compared with the second-best performer (i.e., TargetATP [26] for ATP and NsitePred [14] for ADP, AMP, GDP, and GTP).

Conclusion
In this study, a new SOS algorithm that balances the samples of different classes by synthesizing additional samples for the minority class with a supervised process is proposed to address imbalanced learning problems. We apply the proposed SOS algorithm to protein-nucleotide binding residue prediction, and a web-server, called TargetSOS, is implemented. Cross-validation tests and independent validation tests on two benchmark datasets demonstrate that the proposed SOS algorithm helps to improve the performance of protein-nucleotide binding residue prediction. The findings of this study enrich the understanding of class imbalance learning, and the proposed algorithm is sufficiently flexible to be applied to other bioinformatics problems in which class imbalance exists, such as protein functional residue prediction and disulfide bond prediction.

Supporting Information
Supporting Information S1 Datasets used in this study.

(DOC)
Supporting Information S2 Performance comparisons between different over-sampling techniques on the ADP, AMP, GTP, GDP sub-datasets in NUC5.