DisPredict: A Predictor of Disordered Protein Using Optimized RBF Kernel

Intrinsically disordered proteins or regions perform important biological functions through their dynamic conformations during binding. Accurate identification of these disordered regions therefore has significant implications for proper annotation of function, induced fold prediction and drug design to combat critical diseases. We introduce DisPredict, a disorder predictor that employs a single support vector machine with an RBF kernel and novel features for reliable characterization of protein structure. In addition to 10-fold cross validation, DisPredict was trained and tested with independent test datasets. The results were consistent, with low error on both training and test data. The use of multiple data sources makes the predictor generic. The datasets used in developing the model include disordered regions of various lengths, categorized as short and long, with different compositions and different types of disorder, ranging from fully to partially disordered regions, as well as completely ordered regions. Through comparison with other state-of-the-art approaches and case studies, DisPredict is found to be a useful tool with competitive performance. DisPredict is available at https://github.com/tamjidul/DisPredict_v1.0.


Introduction
Many protein regions and some entire proteins do not adopt well-defined, stable three-dimensional (3D) structures in an isolated state and under different non-native environments [1][2][3]. These proteins or partial regions of proteins are called intrinsically disordered proteins (IDPs) or disordered regions in proteins (IDRs), also known as natively unstructured, denatured or unfolded. The coordinates of their backbone atoms have no specific equilibrium states and can vary widely with physiological conditions, so these proteins adopt dynamic structural ensembles. Structurally, IDPs (or IDRs) encompass proteins or protein regions with extended disorder, collapsed disorder and semi-collapsed disorder. These reflect differences in the underlying biophysical characteristics, including low hydrophobicity and high net charge, a marginal level of residual secondary structure [4,5], dynamic side chains and secondary structures [6,7], and rapidly exchanging backbone side-chain hydrogen bonds that make a region unable to form a specific secondary structure [3]. Recognition of these disordered regions is important for appropriate protein structure prediction, identification of disease-causing proteins, proper annotation of function, and prediction of induced folding and binding regions.
Over the last two decades, many studies have presented evidence that many proteins do not follow the well-known paradigm of sequence to stable structure to function. Rather, these proteins adopt a disordered state for complex and essential biological functions [1,6,8,9] such as cell cycle control and cellular signal transduction, transcriptional and translational regulation, membrane fusion and control pathways [1,10,11]. They also participate in molecular recognition, molecular assembly and protein modification [12,13] via protein-protein, protein-nucleic acid and protein-ligand interactions. Disordered proteins are found to be highly associated with critical human diseases [14][15][16], such as cancer, amyloidoses, cardiovascular, neurodegenerative and genetic diseases. Thus, identifying them assists in effective drug development [17,18].
IDPs are abundant. Approximately 70% of the structures released by the Protein Data Bank (PDB) [19] contain some disordered residues [20,21]. DisProt [22], a curated database of disordered proteins, contains annotations for 694 protein sequences and 1539 disordered regions in its current version 6.02. The IDEAL [23,24] and MobiDB [25,26] databases also provide useful collections for annotation of intrinsic disorder. The PDB [19], which allows disordered regions to be found in solved secondary or tertiary structures, incorporates 105,097 protein entries. By comparison, the overall number of non-redundant protein sequences is 46,968,574 according to release 68, the most recent release of the RefSeq database [27]. However, due to the highly flexible characteristics of the residues of IDRs or IDPs [28], experimentally verified annotation of intrinsic disorder is growing slowly. Thus, to keep pace with this large-scale increase in protein databases, effective computational methods for correct identification of disordered residues in IDPs or IDRs are necessary.
In this article, we propose a new disorder predictor, named "DisPredict (Disorder Predictor)" [58]. DisPredict classifies ordered and disordered residues in a protein sequence with higher accuracy, specifically in terms of Matthews Correlation Coefficient (MCC) and Area Under the receiver operating characteristic Curve (AUC). DisPredict is based on a Support Vector Machine (SVM) using the Radial Basis Function (RBF) as kernel. We further strengthened the classification performance of DisPredict by selecting optimized SVM parameters, which significantly improved the performance. We utilized a comprehensive set of 56 features to characterize disorder in a protein sequence. We compared DisPredict's performance with the existing predictors SPINE-D [47] and MFDp [56], followed by an analysis of its performance with respect to different types of amino acid, lengths of disordered region and datasets.

Materials and Methods
In this section, we discuss the data sources, data processing, input feature generation, software design and platform, and performance evaluation.

Preliminary Disordered Data Sources
In prior studies, PDB [19] and DisProt [22] are considered the primary repositories of IDPs. Disordered regions are composed of residues with missing coordinates in structures solved by X-ray crystallography, whereas in ensembles solved by NMR the residues show highly variable coordinates. We selected two datasets that combine sequences from PDB having disordered residues without coordinates (recorded in REMARK 465) with sequences from DisProt having curated annotations of disordered regions. Together they cover short (≤ 30 residues) and long (> 30 residues) disordered regions, partially disordered chains, and fully ordered or fully disordered chains.

Datasets
We used two different datasets, MxD and SL, to train, test and cross-validate our proposed DisPredict. The MxD and SL datasets were used to train two disorder predictors, MFDp [56] and SPINE-D [47], respectively. We collected and utilized these datasets to be able to compare DisPredict consistently with these two predictors.
The Mixed Disorder (MxD) dataset is a combination of protein sequences with disordered residues from both PDB and DisProt. The originally developed MxD dataset [56] has 514 protein sequences, including 205 chains from PDB and 309 chains from DisProt. We further purified it by removing sequences with unknown amino acids (X-tag), since these have no specific physicochemical properties from which to derive the corresponding features in our methodology. This led to the MxD444 dataset, with 444 chains and 214,054 residues, comprising 49,090 (about 23%) disordered residues and 164,964 (about 77%) ordered residues. The SL477 dataset was prepared by the developers of the SPINE-D predictor from the benchmark SL (Short Long) dataset [59]. The SL dataset encompasses short and long disordered regions as well as ordered regions. It was built by re-annotating the sequences extracted from DisProt to include reliable order and disorder contents. Among the annotated regions in the SL dataset, 50% are of the short-disordered category. The short regions in the SL dataset are of length 20 residues or less [59]. It is important to incorporate this disorder annotation in a dataset, since these short disordered regions are found to be functionally important: they undergo induced folding in close proximity to appropriate partners. SL477 also includes very long disordered regions as well as completely disordered proteins, called intrinsically disordered proteins (IDPs). The SL dataset also comprises proteins with disordered regions annotated by NMR experiments. To obtain a combination of sequences with low sequence identity, the SL dataset's sequences were clustered and filtered using BLASTCLUST [60], which resulted in 477 chains with < 25% sequence identity between each pair. SL477 has a total of 215,343 residues, of which 56,887 (about 26%), 72,808 (about 34%) and 85,648 (about 40%) residues are annotated as disorder, order and unknown, respectively.
Unknown residues are those marked as unknown in the source datasets. We disregarded residues with unknown annotation both in training and in evaluating our proposed approach.
Moreover, to test our predictor on sequences with little overlap with the training dataset, we extracted two independent test datasets from the two training datasets using BLASTCLUST [60]. We filtered 171 protein chains from the SL477 dataset that have less than 10% similarity with any sequence from the MxD444 dataset. We call these 171 protein chains, with 42,572 residues, SL171. We used SL171 as the test dataset to independently test our predictor's performance when it is trained on the MxD444 dataset. Similarly, we extracted 134 sequences from the MxD444 dataset that are independent of the SL477 dataset at a 10% identity cut-off. We call these 134 protein sequences, with 38,823 residues, MxD134. We utilized MxD134 as the test dataset to independently test our predictor's performance when it is trained on the SL477 dataset.
Further, we prepared a new dataset that is completely independent of the training sets of DisPredict, SPINE-D [47] and MFDp [56]. We collected 48 new protein chains from DisProt [22] released after version 5.1 up to the current version 6.02. These protein sequences were combined with another 25 protein chains culled from PDB [19]. Protein chains were extracted from PDB X-ray structures with resolution ≤ 3.0 Å, length ≥ 50 and a sequence identity cut-off of 30%, choosing single-chain proteins only. We randomly selected 25 chains from the output of this procedure so that no sequence is more than 25% similar to the training sequences. To have a proper combination of ordered and disordered proteins, we ensured that none of these 25 proteins contains disordered residues except in terminal regions. Altogether, this gave us 73 protein sequences: 37 fully disordered chains, 23 fully ordered chains and 13 protein chains with both disordered and ordered regions. We call this disorder dataset DD73. The DD73 dataset allows us to perform a robust comparison among DisPredict, SPINE-D [47] and MFDp [56], as it is independent of both the SL and MxD datasets.

Input Features
Our input features were chosen to include sequence information, evolutionary information and structural information (listed in Table 1). Studies suggest that the information necessary for the correct folding of a protein, including its disorder content, is encoded in its amino acid sequence [31]. Moreover, disordered regions are abundant in low-complexity regions and in regions with a low content of hydrophobic amino acids [28,61]. The physicochemical properties [62] of amino acids are also found to have some degree of correlation with the length of disordered regions, as short disordered regions are mainly negatively charged while long disordered regions are nearly neutral [28,55]. These observations motivated us to use the amino acid type (AA), indicated by one numerical value out of twenty, and seven physicochemical properties (PP) as features to predict disordered residues in our proposed approach.
Disordered regions and their related functions are conserved within the sequence during evolution [63], so we used the position specific scoring matrix (PSSM) as input features to capture evolutionary information. A PSSM (sequence length × 20) was generated for each sequence by executing three iterations of PSI-BLAST [60] against NCBI's non-redundant database [27,64]. The PSSM values were further normalized using the numeric value nine [65], which we call the PSSM normalizing factor. We employed sequence-based predicted secondary structure (SS) probabilities for helix, sheet and coil residues [65], predicted solvent accessibility (ASA) [66] and predicted fluctuations of the backbone dihedral torsion angles (φ and ψ) [67] as features. We included these six features since disordered residues can be characterized by the lack of stable secondary structure [7,8,28], and unstructured regions are found to have a large solvent accessible area [68]. The literature suggests that the conserved evolutionary information given by the PSSM can be transformed from the primary structure (amino acid sequence) level to the three-dimensional structure level by computing monograms and bigrams from PSSM values [69]. The monogram-bigram probabilities characterize the subsequence of a protein sequence that can be conserved within a fold in terms of transition probabilities from one amino acid to another [70]. Thus, the monogram-bigram features are useful in identifying the evolutionarily folded (ordered) or unfolded (disordered) regions of proteins, which motivated us to utilize them as features in disorder prediction. We computed a monogram feature matrix (1 × 20) and a bigram feature matrix (20 × 20) for each sequence from its PSSM. The monogram feature matrix consists of one monogram value (MG) for each type of amino acid, and the bigram feature matrix consists of one bigram value (BG) for each pair of the 20 possible amino acids.
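The monogram and bigram computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DisPredict implementation: it assumes that normalizing "using nine" means dividing each PSSM score by 9, that a monogram is the column sum of the PSSM over all positions, and that a bigram is the sum over consecutive positions of the product of two PSSM entries. The function names are hypothetical, and how the sequence-level MG and BG values are assigned to individual residues is not covered here.

```python
from typing import List

Matrix = List[List[float]]

def normalize_pssm(pssm: Matrix, factor: float = 9.0) -> Matrix:
    """Divide every PSSM score by the normalizing factor (assumed: plain division)."""
    return [[v / factor for v in row] for row in pssm]

def monogram(pssm: Matrix) -> List[float]:
    """1 x C vector: sum of each PSSM column over all sequence positions."""
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) for j in range(n_cols)]

def bigram(pssm: Matrix) -> Matrix:
    """C x C matrix with B[m][n] = sum_i P[i][m] * P[i+1][n] (assumed definition)."""
    n_cols = len(pssm[0])
    out = [[0.0] * n_cols for _ in range(n_cols)]
    # Walk over consecutive pairs of positions (i, i+1).
    for cur, nxt in zip(pssm, pssm[1:]):
        for m in range(n_cols):
            for n in range(n_cols):
                out[m][n] += cur[m] * nxt[n]
    return out
```

For a real PSSM the column count C is 20, which gives the 1 × 20 monogram and 20 × 20 bigram shapes stated in the text.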
Further, our analysis based on multiple datasets collected from PDB and DisProt shows that both the monograms and bigrams follow a normal density distribution in their logarithmic space, with an approximately consistent median value of 6.0 within any dataset (Fig 1). Therefore, we used exp(6.0) to normalize these values and reduce the noise. To distinguish the terminal residues for their position-specific disorder-like behavior, we included a terminal indicator feature (T) by encoding five residues of the N-terminal as {−1.0, −0.8, −0.6, −0.4, −0.2} and five residues of the C-terminal as {+1.0, +0.8, +0.6, +0.4, +0.2}, whereas the rest of the residues were labeled 0.0. Note that our feature set includes the fundamental features used to characterize disorder in proteins, which are well studied and utilized in the literature [47]. We further enhanced the feature set with new features, namely MGs and BGs. We included the information of neighboring residues within the features of each residue by using a sliding window, keeping the target residue at the center of the window. The motivation was to incorporate the native interactions and contacts of neighboring residues, which are found to play essential roles in determining protein structures and protein folding dynamics [71,72]. We determined the 10-fold cross-validation performance of DisPredict for 13 different window sizes (1, 3, 5, ..., 23, 25) and found the optimal window size to be 21. Thus, 1176 features (window size × total feature count = 21 × 56 = 1176) were used for each residue. Finally, the features were scaled to the range [−1, +1] before use.
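The terminal indicator, sliding window and scaling steps above can be sketched as below. This is an illustration only: the function names are hypothetical, and it assumes that the outermost residue at each terminus receives the largest magnitude, that window positions beyond the sequence ends are zero-padded, and that scaling is a per-column min-max mapping. None of these details is specified in the text.

```python
from typing import List

def terminal_indicator(length: int) -> List[float]:
    """Feature T: -1.0..-0.2 for the first five residues, +0.2..+1.0 for the
    last five (outermost residue gets magnitude 1.0, an assumption), else 0.0."""
    levels = [1.0, 0.8, 0.6, 0.4, 0.2]
    t = [0.0] * length
    for k, v in enumerate(levels[:length]):
        t[k] = -v                  # N-terminal side
    for k, v in enumerate(levels[:length]):
        t[length - 1 - k] = v      # C-terminal side (overwrites on very short chains)
    return t

def window_features(per_residue: List[List[float]], window: int = 21) -> List[List[float]]:
    """Concatenate the feature vectors of all residues in a centered window.
    Positions outside the sequence are filled with zeros (an assumption)."""
    half = window // 2
    pad = [0.0] * len(per_residue[0])
    rows = []
    for i in range(len(per_residue)):
        row: List[float] = []
        for j in range(i - half, i + half + 1):
            row.extend(per_residue[j] if 0 <= j < len(per_residue) else pad)
        rows.append(row)
    return rows

def scale_columns(rows: List[List[float]], lo: float = -1.0, hi: float = 1.0) -> List[List[float]]:
    """Min-max scale each column to [lo, hi]; constant columns are set to 0.0."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else 0.0
         for v, mn, mx in zip(row, mins, maxs)]
        for row in rows
    ]
```

With 56 features per residue and a window of 21, each row returned by `window_features` has 21 × 56 = 1176 entries, matching the count given in the text.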

SVM Design and Parameterization
DisPredict is a two-layer disorder predictor that integrates an optimization-layer and a classification-layer. The classification-layer is developed using a single support vector machine (SVM), implemented with LIBSVM [73]. An SVM simultaneously minimizes the empirical classification error (training error) and the generalization error (test error) by maximizing the geometric margin of the separating hyperplane, so it can be regarded as an effective technique for hard classification problems, especially in bioinformatics and computational biology. We used the Gaussian, or radial basis function (RBF), kernel for the SVM to extend its capability to handle classes that are not linearly separable. The RBF kernel implicitly maps the input feature space into an infinite-dimensional (Hilbert) space, in which a linear separating hyperplane can be sought. In the optimization-layer of DisPredict, we selected two parameters, C and γ, where C is the cost of misclassification and γ is the parameter that controls the width of the RBF. The optimal values for C and γ are determined by grid search using 5-fold cross validation. However, in our case the grid search turned out to be computationally very intensive, so we used 5% of the training dataset to determine the optimal parameters. The DisPredict output for the two classes (disordered or ordered residue), expressed as a probability, is optimized by another round of 5-fold cross validation. Using the threshold value 0.5, the probabilities are converted into binary decision variables: a probability p with 0.5 ≤ p ≤ 1.0 is considered disordered and 0.0 ≤ p < 0.5 is considered ordered. We implemented our software in C++. The software is developed and tested on the Linux platform. It depends on two external packages, namely PSI-BLAST [60] and the NR database [27,64], which are publicly available. The DisPredict software is also available online at https://github.com/tamjidul/DisPredict_v1.0 with a user manual.
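To make the pieces of this design concrete, the sketch below shows the RBF kernel function, a candidate (C, γ) grid and the 0.5 thresholding rule. It does not train an SVM and is not the DisPredict code (which is written in C++ on top of LIBSVM). The function names are hypothetical, and the exponent ranges of the grid are illustrative assumptions rather than the ranges actually searched.

```python
import math
from typing import List, Sequence, Tuple

def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def parameter_grid(c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)) -> List[Tuple[float, float]]:
    """Candidate (C, gamma) pairs on a log2 grid. The exponent ranges are
    assumed for illustration; each pair would be scored by 5-fold cross validation."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exps for g in g_exps]

def to_binary(prob_disorder: Sequence[float], threshold: float = 0.5) -> List[int]:
    """+1 (disorder) if p >= threshold, otherwise -1 (order)."""
    return [1 if p >= threshold else -1 for p in prob_disorder]
```

A larger γ makes the kernel decay faster with distance, so each support vector influences a narrower neighborhood, while a larger C penalizes training misclassifications more heavily.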

Performance Evaluation and Statistical Test Criteria
The performance of DisPredict is evaluated using the criteria followed in the past Critical Assessment of protein Structure Prediction (CASP) competitions [74][75][76]. The measures and procedures used in CASP experiments are comprehensive. The predictions are made at two levels: 1. Binary value, defining whether a residue is disordered or not ("+1" for disorder and "−1" for order) and 2. Real value, quantifying the probability of a residue being disordered ("≥ 0.5" for disorder and "< 0.5" for order).
Binary prediction evaluation. In binary (two-class) prediction of disorder, TP (True Positive) = number of correctly predicted disordered residues, TN (True Negative) = number of correctly predicted ordered residues, FP (False Positive) = number of ordered residues incorrectly predicted as disordered and FN (False Negative) = number of disordered residues incorrectly predicted as ordered. The total number of correct predictions (both ordered and disordered) is $N_{correct} = TP + TN$. Sensitivity (SENS) and specificity (SPEC) are two complementary statistical measures giving the proportion of correctly predicted disordered (positive class) and ordered (negative class) residues, respectively: $\mathrm{SENS} = TP / N_d$ and $\mathrm{SPEC} = TN / N_o$.
Here, $N_d$ and $N_o$ are the total numbers of disordered and ordered residues, respectively. As an increase in one of these measures (SENS or SPEC) usually leads to a decrease in the other, neither of them is a suitable indicator of performance for an imbalanced dataset. In contrast, the balanced accuracy (ACC), weighted score ($S_w$) and Matthews correlation coefficient (MCC) take all four components of prediction quality (TP, TN, FP and FN) into account and can thus be regarded as more important indicators [75]. The $S_w$ measure includes weights to address the imbalance in the ratio of ordered and disordered residues and rewards correct disorder classification over correct classification of ordered residues. It was later found to have a linear relationship with ACC ($S_w = 2 \times \mathrm{ACC} - 1$) [77]. Since both of these measures (ACC and $S_w$) have been used in CASP assessment, we include both in our paper instead of just one. The MCC score, another measure that accounts for all four components of prediction quality, is the most reasonable and consistent measure for disorder prediction assessment because it does not favor over-prediction of either class (order/disorder). MCC and $S_w$ scores vary from −1 to 1, where −1 and 1 represent perfect misclassification and perfect classification, respectively, and a random classification scores 0. More recently, precision ($\mathrm{PPV} = \frac{TP}{TP + FP}$) has emerged as a good measure for binary disorder prediction, as it is insensitive to the prediction of the dominant class (here, the order state), and it is therefore also computed to evaluate DisPredict. As the prediction becomes better, the values of these metrics also get higher.

[Figure caption, start missing in the source] ... Table 1 and the arrows are labeled by the number of features involved. The classification-layer receives the final feature set from the feature aggregation step and the optimal parameters from the optimization-layer. Then, it generates the predictor model and outputs both binary annotation and real-valued class probabilities.
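The binary measures discussed above can be computed from the four confusion counts as in the following sketch. The function name is hypothetical. ACC is taken as the mean of SENS and SPEC (the usual definition of balanced accuracy), $S_w$ is computed through the linear relation $S_w = 2 \times \mathrm{ACC} - 1$ quoted in the text, and MCC uses its standard definition. The sketch assumes both classes are present.

```python
import math
from typing import Dict

def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Confusion-count based measures; disorder is the positive class."""
    n_d = tp + fn                      # total disordered residues
    n_o = tn + fp                      # total ordered residues
    sens = tp / n_d
    spec = tn / n_o
    acc = (sens + spec) / 2.0          # balanced accuracy
    s_w = 2.0 * acc - 1.0              # linear relation with ACC
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SENS": sens, "SPEC": spec, "ACC": acc,
            "Sw": s_w, "PPV": ppv, "MCC": mcc}
```

For example, `binary_metrics(30, 50, 10, 10)` describes a toy case with 40 disordered and 60 ordered residues; both $S_w$ and MCC stay well below ACC there, which illustrates why they are stricter summaries on imbalanced data.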
We calculated the Mean Absolute Error, $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| c_d^a(i) - c_d^p(i) \right|$, to quantify the error of disorder prediction at the content level. Here, $n$ is the total number of protein chains, and $c_d^a(i)$ and $c_d^p(i)$ are the actual and predicted disorder content (fraction of disordered residues) for the $i$-th protein chain, respectively. A lower value of MAE corresponds to a better prediction.
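As a small worked sketch of the content-level error, the code below computes the disorder content of each chain from its +1/−1 labels and averages the absolute differences over chains. The function names and the label encoding (+1 for disorder) are assumptions made for illustration.

```python
from typing import List, Sequence

def disorder_content(labels: Sequence[int]) -> float:
    """Fraction of residues labelled disordered (+1) in one chain."""
    return sum(1 for y in labels if y == 1) / len(labels)

def content_mae(actual_chains: List[Sequence[int]],
                predicted_chains: List[Sequence[int]]) -> float:
    """Mean absolute difference between actual and predicted disorder content."""
    n = len(actual_chains)
    total = sum(abs(disorder_content(a) - disorder_content(p))
                for a, p in zip(actual_chains, predicted_chains))
    return total / n
```

Note that this measure compares contents per chain, so a chain with many misplaced but equally many disordered residues still contributes zero error.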
Evaluation of predicted probability. The SVM model of DisPredict generates a predicted probability value for each residue, which signifies the disorder confidence of that residue. This probability value is then binarized using a threshold of 0.5 to generate the class annotation. If the probability is greater than or equal to 0.5, the predicted class is 'disorder', and if the probability is less than 0.5, the predicted class is 'order'. The predicted probability given by DisPredict is assessed with the receiver operating characteristic (ROC) curve, which depicts the relationship between the true positive rate (TPR, or SENS) and the false positive rate (FPR = 1 − SPEC) as the probability threshold varies. The area under the ROC curve (AUC) quantifies the predictive quality of a classifier, where an AUC value of 1 indicates a perfect prediction and 0.5 corresponds to a random prediction. Moreover, the 95% confidence interval (CI) for the AUC score is evaluated using DeLong's [78] variance estimated by bootstrapping. The AUC and CI are computed in R with the pROC [79] library.
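For intuition, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered residue, with ties counted as one half. The sketch below is a stand-in written for illustration (the paper uses the pROC library in R); the function name is hypothetical, the pairwise loop is quadratic in the number of residues, and it does not compute the confidence interval.

```python
from typing import Sequence

def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive (+1) and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Because the value depends only on the ordering of the scores, it is unaffected by the choice of binarization threshold.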

Performance of 10-Fold Cross Validation
We evaluated the 10-fold cross validation performance of DisPredict separately on the SL477 and MxD444 datasets. To select the optimum window size, we ran cross validation individually for 13 different windows, shown in Table 2, on both the SL477 and MxD444 datasets with default SVM parameters. The best result was found for window size 25, with ACC, MCC and AUC values of 0.82, 0.65 and 0.91, respectively, for the SL477 dataset and 0.77, 0.48 and 0.85, respectively, for the MxD444 dataset. The gradual increase in performance reaches a plateau as the window grows beyond size 23 (Fig 3). Table 2 also shows the inverse relationship between the SENS and SPEC scores with increasing window size for the MxD444 dataset. The best SENS (0.74) is achieved at window size 25, while the best SPEC (0.81) is achieved at window size 5. Overall, the consistent increase in balanced accuracy (ACC) and PPV indicates that our methodology is well balanced.

Table notes: Best values of each metric are marked in bold for each dataset separately. (1) $N_{correct}$ is reported with the total number of residues ($Residue_{total}$) to be predicted in parentheses. Both counts correspond to one subset (fold) of the full dataset generated for cross validation. (2) For AUC, the values within brackets indicate the 95% confidence interval with 2000 stratified bootstrap replicas.
As the window size continues to increase, the rate of increase in scores slows down. The increase in scores is 0.001 as the window size grows from 23 to 25 for the SL477 dataset and 0.004 for the MxD444 dataset. Note that this preliminary, extensive analysis of performance with multiple window sizes was done without selecting optimal parameters for the SVM. For a specific window size ($W_{size}$) and total number of residues ($Residue_{total}$) in a dataset, we have a feature matrix of dimension $Residue_{total} \times (W_{size} \times 56)$. Therefore, an increase in window size increases the dimension of the feature space, which in turn makes the time-expensive grid search for parameters slower. To trade off the performance gained from optimization against the time complexity of parameter selection and model generation, we determined the optimal parameter values with a 5% randomly selected subset of residues from the training dataset for 3 window sizes (15, 21 and 25). The optimal parameters (C and γ) found by grid search are reported in Table 3. Furthermore, we inserted repeated disordered-residue information during training only, to balance the dataset, as the support vectors for the less dominant class may otherwise not be sufficient to determine the optimal SVM margin. Specifically, duplicates (2 times for the SL477 dataset and 3 times for the MxD444 dataset) of disorder samples were provided during generation of the predictor model. During testing, no repeated information was inserted. Table 4 details the cross validation results with optimized parameters for the 3 window sizes. The improvement in performance with optimized parameters over non-optimized ones was significant. For the SL477 dataset (window size 21), the FP and FN values are reduced to 1,002 and 1,083 from 1,125 and 1,152, respectively, due to optimization. For the MxD444 dataset (window size 21), the FN value is increased by 133 residues.
However, the FP value is decreased by 1,812 residues, so the total number of correctly predicted residues still increases from 16,691 to 18,370. The improvement in prediction, both in terms of increased correct classification and decreased misclassification, is also visible from the sensitivity and specificity scores. For window size 21, the values of $S_w$, precision and MCC are improved by 4.5%, 2.5% and 4.5%, respectively, due to optimized training on the SL477 dataset. For the MxD444 dataset, these improvements are 15.7%, 33.3% and 26.8%, respectively. This significant improvement in MCC supports our method's capability to handle the imbalanced ratio of ordered and disordered residues. Further, the AUC score is increased by 4.4% and 0.4% as a result of optimization for the SL477 and MxD444 datasets, respectively. A comparative analysis of Table 2 and Table 4 also shows that the optimized DisPredict model with window size 21 outperforms all the other DisPredict models. Thus, we select 21 as the optimal window size for our proposed DisPredict. Furthermore, to understand the relevance of the new features (MGs and BGs) to protein disorder, we separately evaluated the optimized DisPredict's performance without monograms and bigrams. We performed 10-fold cross validation on the SL477 dataset with the optimal window size 21 and the corresponding optimal SVM parameters reported in Table 3. The ACC, MCC and $S_w$ scores of this experiment are 0.810, 0.651 and 0.621, respectively. Comparing these scores, obtained without MGs and BGs, with those obtained with MGs and BGs (reported in Table 4 for the SL477 dataset) shows that including MGs and BGs along with the PSSM further increases binary prediction accuracy: ACC improves by 3.2% (0.810 to 0.836), MCC by 3.4% (0.651 to 0.673) and $S_w$ by 8.2% (0.621 to 0.672).
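The class-balancing step used during training (repeating disorder samples, with no repetition at test time) can be sketched as follows. The function name is hypothetical, and `copies` is interpreted here as the total number of times each disorder sample appears in the training set; whether "2 times" and "3 times" in the text mean total occurrences or additional duplicates is not stated, so this is an assumption.

```python
from typing import List, Sequence, Tuple

def duplicate_disorder(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       copies: int) -> Tuple[List[Sequence[float]], List[int]]:
    """Return a training set in which every disorder sample (+1) appears
    `copies` times and every ordered sample (-1) appears once."""
    out_x: List[Sequence[float]] = []
    out_y: List[int] = []
    for x, y in zip(features, labels):
        reps = copies if y == 1 else 1
        out_x.extend([x] * reps)
        out_y.extend([y] * reps)
    return out_x, out_y
```

The function would be applied to the training folds only, so that the evaluation folds keep their original class ratio.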
To distribute the residues uniformly into ten subsets for cross validation, we applied a modular arithmetic operation to split the dataset at the residue level. Because each residue's features already include the information of its neighbors within the window, residues can be detached from their original sequence. However, this inclusion of neighboring information may cause an overlap of information between the training and test sets when the dataset is split at the residue level for cross validation. We analyzed the probability of this overlap. Let there be $N$ sequences in the dataset, with expected sequence length $L$. Then the probability of picking two residues, one for the training subset and one for the test subset of 10-fold cross validation, that belong to the same sequence is $\frac{100}{9N^2}$. Since the expected length of a sequence is $L$, the chance of training and test overlap for a specific window size ($W_{size}$) is $\frac{W_{size} - 1}{L}$. Altogether, the probability of a train and test residue overlap from the same sequence is $\frac{100}{9N^2} \times \frac{W_{size} - 1}{L} = \frac{100}{9} \cdot \frac{W_{size} - 1}{N^2 L}$. For the SL477 dataset with $N = 477$, $L \approx 400$ and $W_{size} = 21$, the probability of overlap is $2.44 \times 10^{-6}$, which is very low and can be safely ignored. Further, we reevaluated DisPredict's 10-fold cross validation performance on the SL477 dataset with sequence-level sampling by modular operation to generate the training and test subsets. Table 5 quantifies the difference in performance between residue-level splitting and the state-of-the-art practice of sequence-level splitting for cross validation, with window size 21 and default SVM parameters. It shows that DisPredict's performance remains consistent across all the metrics, without any significant over-prediction.
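The overlap estimate above is easy to check numerically. The sketch below evaluates the closed form $\frac{100}{9} \cdot \frac{W_{size} - 1}{N^2 L}$ and shows one possible modular fold assignment. The function names are hypothetical, and assigning fold `index % 10` to the running residue (or sequence) index is an assumed reading of "modular arithmetic operation".

```python
def overlap_probability(n_sequences: int, mean_length: float, window: int) -> float:
    """(100 / 9) * (W - 1) / (N^2 * L), the train/test overlap estimate."""
    return (100.0 / 9.0) * (window - 1) / (n_sequences ** 2 * mean_length)

def fold_of(index: int, n_folds: int = 10) -> int:
    """Fold assignment by modular arithmetic on a running index (assumed scheme)."""
    return index % n_folds

# SL477 setting from the text: N = 477, L of about 400, window size 21.
p = overlap_probability(477, 400, 21)
print(f"{p:.2e}")
```

The same `fold_of` helper applies to either splitting scheme: indexing residues gives the residue-level split, while indexing whole sequences gives the sequence-level split used for the comparison in Table 5.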

Evaluation of Independent Training and Testing
With optimized parameters and a balanced dataset, we carried out independent training on the SL477 and MxD444 datasets, followed by testing the resulting predictor models on the MxD134 and SL171 datasets, respectively. Note that these independent test datasets (MxD134 and SL171) were generated at low sequence identity (10%) with the corresponding training datasets (SL477 and MxD444). The consistency between the cross validation and independent test results supports the robustness of the technique and the effectiveness of the feature set in DisPredict, and indicates that training avoided over-fitting. Table 6 further illustrates the results of these tests, where we report the average of the scores computed for 10 equally divided subsets of the full dataset along with the corresponding standard deviation (STDEV). Table 6 reveals that training on the SL477 dataset gives consistent performance regardless of the test dataset and test procedure (cross validation or independent test), in terms of ACC (0.836, 0.833) and $S_w$ (0.672, 0.667). This consistency is also evident for training on the MxD444 dataset when tested on different datasets, with ACC of 0.805 and 0.789 and $S_w$ of 0.611 and 0.577. We also calculated the Mean Absolute Error (MAE), which is reported along with its STDEV. The score indicates that the error does not increase from cross validation to independent test.
To analyze the probability prediction, the ROC curves given by DisPredict are plotted in Fig 4 on a continuous scale between 0.0 and 1.0. In each panel, two ROC curves are plotted for the same training dataset with varying test datasets and evaluation procedures. Finally, we report the AUC values, which are consistent between cross validation and independent test, indicating our predictor's capability to avoid over-fitting.

Comparison with Existing Predictors
The performance of DisPredict is compared with the state-of-the-art disorder predictors, MFDp [56] and SPINE-D [47]. To remain fair while comparing DisPredict with each of the above two predictors, we train DisPredict separately with respective datasets and compare with each of them separately. Thus, DisPredict is compared with MFDp based on dataset MxD444, while dataset SL477 is used to compare DisPredict with SPINE-D (Table 7).
In particular, MFDp [56] is a meta predictor that combines the predictions from three disorder predictors (DISOPRED2 [32], IUPred [50] and DISOclust [53]). Further, MFDp combines the outputs from three SVMs with linear kernels and uses a threshold of 0.37 to output the binary prediction. In contrast, we utilized a single SVM with an RBF kernel and optimized parameters, combined with a comprehensive set of features, to develop a standalone predictor. Note that the performance of MFDp in Table 7 is from 5-fold cross validation, whereas DisPredict is evaluated by 10-fold cross validation, which makes its estimate more reliable and less likely to be over-fitted by chance. In terms of MCC, DisPredict improved significantly and is 36.36% better than MFDp. The improvement in $S_w$ score is 19.6%. DisPredict showed lower sensitivity (by 7%) than MFDp while improving specificity by 20%, which in turn improved the balanced accuracy by 6.67%. Moreover, DisPredict outperformed MFDp by 1.29% in AUC score, which is used to assess the probability-based prediction.

[Fig 4 caption, start missing in the source] ... Table 6. The x-axis and y-axis show the Specificity and Sensitivity, respectively. doi:10.1371/journal.pone.0141551.g004
The other state-of-the-art predictor, SPINE-D [47], uses an ANN that was first developed to output a three-state prediction and later reduced to a two-state predictor of ordered and disordered residues. SPINE-D employs a disorder probability threshold of 0.06 that was optimized to achieve the maximum S_w score. In contrast, DisPredict is an SVM-based two-state disorder predictor that uses a threshold of 0.5, a more meaningful value for two-class classification. DisPredict outperformed SPINE-D in sensitivity and specificity by 5.19% and 1.18%, respectively, which leads to a 3.7% improvement in overall accuracy. DisPredict also outperformed SPINE-D in terms of S_w, MCC and AUC by 8.06%, 6.34% and 10.34%, respectively.
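The role of the decision threshold and the form of the percentage comparisons can be sketched as follows. This is a hedged illustration: it assumes a residue is called disordered when its probability is at or above the threshold, and that the percentages quoted are relative differences between two scores. Neither convention is confirmed in this section, and the helper names and toy values are hypothetical.

```python
def to_binary(probabilities, threshold=0.5):
    """Label a residue disordered (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def relative_improvement(new, old):
    """Percentage change of `new` relative to `old` (assumed convention)."""
    return 100.0 * (new - old) / old

# Toy per-residue disorder probabilities, for illustration only.
probs = [0.04, 0.06, 0.37, 0.50, 0.90]
labels_at_050 = to_binary(probs, threshold=0.5)   # threshold quoted for DisPredict
labels_at_006 = to_binary(probs, threshold=0.06)  # threshold quoted for SPINE-D
```

The same probability profile yields far more disordered calls at a threshold of 0.06 than at 0.5, which is why thresholds must be stated when binary scores are compared.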
In addition to the comparison on the cross validation test, we evaluated DisPredict, SPINE-D [47] and MFDp [56] on the independent DD73 dataset. The comparison among these three methods is given in Table 8. It shows that DisPredict gives the best performance among the three predictors except in sensitivity. DisPredict yielded 2.63% lower sensitivity than SPINE-D [47] but 4.25% higher specificity. Table 8 also shows that DisPredict outperformed SPINE-D [47] and MFDp [56] in terms of MCC by 3.76% and 0.76%, respectively. At the same time, DisPredict gave 1.26% and 5.36% better precision (PPV) than MFDp [56] and SPINE-D [47], respectively. DisPredict also had slightly lower sensitivity than MFDp [56], while both SPINE-D [47] and MFDp [56] gave lower specificity than DisPredict. Figs 5 and 6 compare the ROC curves and precision-recall curves, respectively, given by DisPredict, SPINE-D [47] and MFDp [56]. Fig 5 shows that the ROC curves given by the three predictors are comparable. The precision-recall curves (Fig 6) show that DisPredict achieves consistently higher precision for sensitivity (recall) below 65%. In their respective publications, MFDp and SPINE-D were established as the best disorder predictors among 8 and 11 existing disorder predictors [47,56], respectively, covering different approaches. In this article, our predictor is shown to be comparable with both of these methods. Therefore, DisPredict can be considered one of the best-performing disorder predictors and can be utilized to produce more reliable annotation of disordered versus ordered residues.

Case Studies: Characteristic Region and Protein Function
Proteins with disordered regions are found to contain several regions of interest, such as self-stabilizing folded regions, DNA or nucleotide binding regions, short (up to 20 amino acids) conserved regions of biological significance (known as motifs), and regions that mediate protein interaction with different partners. These characteristic regions undergo various conformational changes, gain structure and affect many important biological functions. We selected three proteins as cases (UniProt IDs: P41212, P01116 and P04637) with experimentally verified regions of interest to analyze the per-residue disorder confidence score assigned by DisPredict, SPINE-D and MFDp. Fig 7 illustrates the disorder probability of each residue with respect to residue index. P41212 (Fig 7(A)) is the human ETV6 protein, a transcriptional repressor that is also involved in several kinds of leukemia and syndrome. For this protein, DisPredict and SPINE-D showed comparable performance in detecting the highly conserved PNT (pointed) domain [80] [residues 40-124] and the ETS (E26 transformation-specific) DNA binding region [81] [residues 339-420], while MFDp outperformed both of them with relatively less noise. P01116 (Fig 7(B)) is the human KRAS protein, which has intrinsic GTPase activity (binds GDP/GTP [82]) and is related to several diseases, such as gastric cancer (GASC), acute myelogenous leukemia (AML) and cardiofaciocutaneous syndrome 2 (CFC2). DisPredict could identify its GTP (guanosine triphosphate) binding region [residues 10-17] and effector region [residues 32-40] with probabilities close to the cut-off (0.5). Note that these two regions are experimentally verified unstructured regions, which both SPINE-D and MFDp strongly suggest to be structured. However, the C-terminal hypervariable region [residues 166-185] is consistently detected by all three predictors. P04637 corresponds to the human p53 protein, which acts as a tumor suppressor. DisPredict detected it correctly. The overall comparison shows that DisPredict's performance is more biologically relevant, with correct identification of these short regions. Therefore, it would be interesting to utilize DisPredict in a broader scope in the near future.
Fig 5. ROC curves given by DisPredict, SPINE-D and MFDp for the comparison reported in Table 8. The x-axis and y-axis show the Specificity and Sensitivity, respectively.
Fig 7. In (P41212, A), the yellow (residues 40-124) and pink (residues 339-420) bars represent the PNT domain [80] and ETS DNA binding region [81], respectively. In (P01116, B), the orange (residues 10-17), cyan (residues 32-40) and purple (residues 166-185) bars correspond to the GTP binding region [82], effector region and hypervariable region, respectively. In (P04637, C), the dark green (residues 17-25), red (residues 325-356) and gray (residues 370-372) bars highlight the TADI motif [83], oligomer region and [KR]-[STA]-K binding motif, respectively.

Discussion
In this article, we proposed DisPredict, a canonical support vector machine that uses an RBF kernel and includes useful and advanced features for predicting disordered residues. DisPredict not only generates the binary class annotation for ordered and disordered residues but also provides order-disorder probabilities that can be treated as the confidence level of the prediction. DisPredict outperforms other top-performing predictors in predicting both binary annotation and probability. The competitive performance of DisPredict is mainly due to a methodology that combines three elements: first, the radial basis function (RBF) kernel, which implicitly maps the feature space into infinite dimensions; second, and most importantly, the optimization of the parameters; and third, the novel monogram (MG) and bigram (BG) features, which assisted in determining an optimal and effective class-separating hyperplane.
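For readers unfamiliar with the RBF kernel, the standard form is K(x, z) = exp(−γ‖x − z‖²). The sketch below evaluates it in plain Python. The value of gamma is hypothetical, and the assumption that the optimized parameters are the usual SVM cost C and kernel width gamma is ours; this section does not list them.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(vectors, gamma):
    """Pairwise kernel values for a list of feature vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]

# Toy 2-D feature vectors and a hypothetical gamma, for illustration only.
K = gram_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], gamma=0.5)
```

Larger gamma makes the kernel value fall off faster with distance, so each training residue influences a smaller neighbourhood of feature space; this is why the kernel width is normally tuned together with the cost parameter.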
This overall performance of DisPredict is also supported by a comprehensive set of features that captures the sequential (amino acid composition) and structural characteristics of ordered and disordered residues or proteins. We used SPINE-X [65] to generate the secondary-structure-related features. The distinguishing property of our feature set in comparison with existing predictors is the inclusion of monogram (MG) and bigram (BG) features computed from the PSSM. When a region of a protein is evolutionarily conserved in a fold, all the proteins within that fold are likely to have a conserved group of MGs and BGs. As some intrinsically disordered regions are conserved, adding these features provides important structural and evolutionary characteristics. By determining the appropriate window size, we have also included the effect of interactions due to contacts among neighboring residues.
The robust performance of DisPredict is also supported by training and testing the predictor with multiple datasets: SL477 and SL171, and MxD444 and MxD134. The datasets used to train DisPredict encompass disorder annotation from several complementary sources (X-ray and NMR defined disorder from PDB and DisProt) as well as disordered regions of various lengths. The SL dataset comprises 81 fully disordered proteins (IDPs), while the rest of the chains contain 928 disordered regions (IDRs). On the other hand, the MxD dataset is composed of 55 fully disordered chains, 4 fully ordered chains and 385 chains that contain both structured and disordered regions, which include 730 disordered regions (IDRs). Furthermore, 70% of the IDRs within partially disordered proteins are short (≤ 30 residues) and 30% of them are long (> 30 residues). This combination of disordered regions of several lengths (Fig 8) in training supports the consistent performance of DisPredict for disordered regions of all sizes as well as different types of disordered residues.
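The short/long split described above can be sketched as a pass over a per-residue annotation string. This is a minimal illustration, assuming a hypothetical encoding in which 'D' marks a disordered residue and any other character an ordered one; the 30-residue cutoff follows the text, and the function names are ours.

```python
def disordered_regions(annotation, disorder_char="D"):
    """Return 1-based inclusive (start, end) spans of consecutive disordered residues."""
    regions, start = [], None
    for i, ch in enumerate(annotation, start=1):
        if ch == disorder_char and start is None:
            start = i                       # a new disordered run begins here
        elif ch != disorder_char and start is not None:
            regions.append((start, i - 1))  # the run ended at the previous residue
            start = None
    if start is not None:                   # run extends to the C-terminus
        regions.append((start, len(annotation)))
    return regions

def classify_by_length(regions, cutoff=30):
    """Split regions into short (<= cutoff residues) and long (> cutoff residues)."""
    short = [r for r in regions if r[1] - r[0] + 1 <= cutoff]
    long_ = [r for r in regions if r[1] - r[0] + 1 > cutoff]
    return short, long_

# Toy annotation: a 3-residue IDR followed by a 31-residue IDR.
toy = "OODDDOO" + "D" * 31
short_idrs, long_idrs = classify_by_length(disordered_regions(toy))
```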
It is interesting to note that, regardless of cross validation or independent test, DisPredict performs relatively better when trained on the SL477 dataset than on MxD444 (Table 6). To gain further insight into this discrepancy, we investigated how the annotation provided in each dataset correlates with the actual structural characteristics of disordered and ordered residues. Disordered residues are distinguished from ordered residues by a low content of secondary structure [8,28], and therefore a higher probability of coil residues than helical or beta strand residues, and disordered regions are likely to have a large solvent accessible (exposed) area [55]. We plotted the fraction of secondary structure content against the fraction of exposed residues for disordered and ordered regions of all lengths in Fig 9. We employed the predicted probability of each residue being coil and the predicted per-residue solvent accessibility provided by SPINE-X [65], since not all residues have defined coordinates (structure) from which to compute secondary structure and solvent accessibility.
We calculated the average coil probability (P_coil) for each ordered or disordered region and computed the fraction of exposed residues with greater than 25% solvent accessibility (F_exposed) of that region. In this analysis, we discarded 5 residues from the N- and C-terminal regions of each protein sequence, as they are mostly found on the surface of a protein chain (not buried in the core) and are more likely to be affected by interaction with nearby structured protein, yielding a highly flexible and dynamic conformation. The plots for both datasets show that the ordered regions are mostly concentrated in the portion with relatively low coil probability, 0.3 ≤ P_coil < 0.5 (high content of well-defined helical or strand residues), and low exposure, 0.2 ≤ F_exposed < 0.5. In contrast, the disordered regions are abundant in the area of high coil probability, 0.5 ≤ P_coil ≤ 0.9 (low content of helical or strand residues), and high exposure, 0.5 ≤ F_exposed ≤ 1.0. However, we found an intrinsic difference between these two datasets in their annotation of residues as ordered and disordered. This difference is evident in the top right location of the correlation plot, 0.6 ≤ P_coil ≤ 0.8 and 0.4 ≤ P_coil ≤ 0.9, designated for disordered regions. For the SL477 dataset (Fig 9(A)), disordered regions predominate over ordered regions in this top right location of the plot. In contrast, the same location of the plot is occupied by both ordered and disordered regions in the case of MxD444.
Fig 9. The x-axis and y-axis correspond to the probability of having well defined secondary structure (in terms of probability being coil) and fraction of exposed residues of that region, respectively.
We further quantified the difference: 13% of the data in MxD444's ordered set are likely to be coil as well as highly exposed, compared with 6% of the data in SL477's ordered set. This higher proportion of misleading annotation in the MxD444 dataset yields a lower signal-to-noise ratio (SNR) of 87/13, compared to 94/6 for SL477, which is the most compelling reason for the better performance of DisPredict when trained on SL477 rather than MxD444. As the prediction produced by DisPredict is capable of detecting such discrepancies in the native annotation of the datasets, it can be utilized as a reliable source of correct annotation of ordered and disordered residues. We should also note that similar proportions, 11% and 13%, of the disordered data are mixed with the ordered residues in the low coil probability region of the plot for the MxD444 and SL477 datasets, respectively.
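The per-region quantities used in this analysis can be sketched as follows. This is an illustrative reading of the description, not the authors' script: it assumes per-residue coil probabilities and relative solvent accessibilities (on a 0 to 1 scale) are already available as lists, for example from a predictor such as SPINE-X, and the function names are hypothetical.

```python
def trim_terminals(n_residues, start, end, n_trim=5):
    """Clip a 1-based inclusive region so it excludes n_trim residues at each terminus."""
    s = max(start, n_trim + 1)
    e = min(end, n_residues - n_trim)
    return (s, e) if s <= e else None   # None when nothing is left after clipping

def region_stats(coil_prob, rel_asa, start, end, exposure_cutoff=0.25):
    """Average coil probability and fraction of exposed residues for one region."""
    p = coil_prob[start - 1:end]
    a = rel_asa[start - 1:end]
    p_coil = sum(p) / len(p)
    # A residue counts as exposed when its accessibility exceeds the cutoff.
    f_exposed = sum(1 for v in a if v > exposure_cutoff) / len(a)
    return p_coil, f_exposed

# Toy per-residue values, for illustration only.
coil = [0.2, 0.4, 0.6, 0.8]
asa = [0.1, 0.3, 0.5, 0.2]
p_coil, f_exposed = region_stats(coil, asa, start=2, end=4)
```

Expressed as ratios, the two signal-to-noise figures quoted in the text are 87/13 ≈ 6.7 for MxD444 and 94/6 ≈ 15.7 for SL477.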
We would like to highlight that amino acid compositions may vary between datasets as well as between short (≤ 30 residues) and long (> 30 residues) disordered regions [28,29]. Specifically, short disordered regions are enriched with aspartic acid (D), glycine (G) and serine (S). In contrast, glutamic acid (E), lysine (K) and proline (P) are likely to be abundant in long disordered regions. To give further insight into this residue composition and confirm the ability of DisPredict to detect the residue preferences of short and long disordered regions, we determined the residue composition profile for our two test datasets, SL171 (Fig 10(A)) and MxD134 (Fig 10(B)). Note that these two datasets contain experimentally annotated disorder from two different sources: SL171 contains sequences with disorder annotation from DisProt, while MxD134 contains annotation from PDB. The composition profile consists of the actual ratio (r_a) and predicted ratio (r_p) of each amino acid type out of the total annotated and predicted disordered residues, respectively.
The composition profile demonstrates that SL171's disordered residue set contains relatively high ratios of amino acid types E (10%) and K (9%), which are long-disorder-prone residues. In contrast, MxD134's disordered residue set is enriched in amino acid types S (11%), G (10%) and D (9%), known as short-disorder-prone residues. Another significant difference between the intrinsic compositions of these two datasets is the proportion of histidine (H). Disorder annotation from PDB includes a higher ratio of H from H-tags (8% in MxD134, compared to 2% in SL171), which are sometimes used for protein purification. The predicted proportions of all these amino acids given by DisPredict indicate its capability of detecting residues in disordered regions of all lengths accurately, with no significant over-prediction. Moreover, DisPredict could also accurately predict methionine (M) at the highly flexible N-terminal region. To further quantify DisPredict's performance in detecting residue composition, we evaluated the Root Mean Square Error (RMSE) and Pearson Correlation Coefficient (PCC) between the actual and predicted ratios (r_a and r_p) across the amino acid types. For the MxD134 test dataset, we found an RMSE of 0.0046, which was higher than the RMSE of 0.0018 computed for SL171. However, the correspondence between the actual composition and the composition predicted by DisPredict, measured with PCC (p-value < 10^-5), was strongly positive for both: 0.9976 and 0.9897 for the SL171 and MxD134 datasets, respectively. It is important to note that this consistent result comes from the independent test, where the dataset used to train DisPredict shared low sequence identity (at most 10%) with the test dataset, which once again indicates the strength of the classification methodology of DisPredict.
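The composition comparison can be sketched as below. This is a minimal illustration, assuming the ratios are computed over the 20 standard amino acids among residues labelled disordered (1), with r_a taken from the annotation and r_p from the prediction; the helper names and toy data are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence, labels):
    """Ratio of each amino acid type among residues labelled disordered (1)."""
    counts = Counter(aa for aa, lab in zip(sequence, labels) if lab == 1)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]

def rmse(a, b):
    """Root mean square difference between two equal-length ratio profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Toy sequence with annotated and predicted disorder labels, for illustration only.
seq = "MSEKGSDE"
r_a = composition(seq, [1, 1, 1, 1, 0, 0, 0, 0])
r_p = composition(seq, [1, 1, 1, 1, 1, 0, 0, 0])
```

A low RMSE together with a PCC near 1 indicates that the predicted disordered set has nearly the same amino acid make-up as the annotated one.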
Finally, accurate prediction of disorder has useful implications for proteomic studies because disorder is directly involved in the proper function of a protein. Successful detection of the disordered regions of a protein is considered a first step in drug design to combat critical diseases. We have built DisPredict using the canonical SVM classifier with an RBF kernel and established it as a successful predictor of disorder using benchmark datasets. In addition, our case studies show the biologically relevant performance of DisPredict.