Estimation of Position Specific Energy as a Feature of Protein Residues from Sequence Alone for Structural Classification

A set of features computed from the primary amino acid sequence of proteins is crucial in inducing a machine learning model that is capable of accurately predicting three-dimensional protein structures. Existing protein structure prediction problems need features that can capture the complexity of molecular-level interactions. With this in view, we propose a novel approach to estimate the position specific estimated energy (PSEE) of a residue using contact energy and predicted relative solvent accessibility (RSA). Furthermore, we demonstrate that PSEE can be reasonably estimated from sequence information alone. PSEE is useful in identifying the structured as well as the unstructured, or intrinsically disordered, regions of a protein by computing favorable and unfavorable energy, respectively, characterized by an appropriate threshold. The most intriguing finding, verified empirically, is that the PSEE feature can effectively classify disordered versus ordered residues and can segregate residues of different secondary structure types by computing the constituent energies. PSEE values for each amino acid strongly correlate with the hydrophobicity value of the corresponding amino acid. Further, PSEE can be used to detect the existence of critical binding regions that undergo disorder-to-order transitions to perform crucial biological functions. As an application of the PSEE feature to disorder prediction, we rigorously tested and found that a support vector machine model informed by a set of features including PSEE consistently outperforms a model with the identical set of features with PSEE removed. In addition, the new disorder predictor, DisPredict2, shows competitive performance in predicting protein disorder when compared with six existing disordered protein predictors.


Introduction
Proteins, being the fundamental structural macromolecules of a cell, are involved in most cell functions. A fully functional protein is usually one that is appropriately twisted, coiled and folded into a specific three-dimensional conformation. The three-dimensional structures of proteins specify their associated functions [1,2]. It is well known from the literature that the primary protein sequence alone has the essential information needed to determine its corresponding secondary and tertiary structures [3]. However, proteins may misfold under some physicochemical conditions that can alter the usual structural state [4][5][6]. Moreover, there is an abundance of proteins that are natively unstructured either at the regional level or sequence-wide, known as intrinsically disordered regions (IDRs) or intrinsically disordered proteins (IDPs), respectively [5,6]. IDRs and IDPs become biologically active through disorder-to-structure transitions [4,7-13]. The connection of IDPs with critical human diseases, such as cancer, cardiovascular diseases, neurodegenerative diseases, genetic diseases, diabetes, amyloidosis and others, has created research areas such as prediction of protein disorder, identification of induced folding regions or binding sites in disordered proteins, and drug discovery.
IDPs and protein sequences with IDRs that participate in important biological functions are increasing fast in number [14]. However, the experimental annotation of disordered residues is currently progressing slowly [15,16]. Thus the computational tools [15][16][17][18][19][20][21][22][23][24][25][26][27] for predicting disordered residues using sequence-based features play an alternative and vital role in understanding the functions of disordered proteins and disordered protein regions. These tools, which use machine learning algorithms, are useful in computational biology as they can produce results quickly. The critical assessment of protein structure prediction, popularly known as the CASP competitions [28][29][30], has evaluated the performance of existing disorder predictors biennially since 2002. These predictive tools require a set of features that capture the distinguishing characteristics of the ordered and disordered residues to be predicted. In this article, we propose a novel and effective feature for characterizing different structural descriptors of protein residues from the primary sequence.
Anfinsen's thermodynamic hypothesis [3] explains that a protein gains the lowest free energy in its natively stable structure. The structural stability of proteins requires a large number of inter-residual interactions. Such interactions among residues result in short range hydrogen-bond formation, van der Waals interactions as well as electrostatic interactions between protein atoms. Inter-residual contact energies between residues in proteins can be estimated from the residue-residue contacts observed in the crystal structures of globular proteins [31,32]. Attempts have been made to predict the pairwise contact energy values among 20 different amino acids from protein sequence [33]. Energy acts as a measure of proteins' structural stability. Lower free energy (especially negative energy) is favorable for stabilizing the folded state of a protein, whereas an unstable protein gains higher free energy (especially positive energy) that is unfavorable for its folded state.
In this work, we introduce a novel approach for predicting the position specific energy that each residue of a protein sequence contributes to its total energy. We predict this energy per residue from the protein's primary sequence alone and term it Position Specific Estimated Energy (PSEE). Note that PSEE does not require a known structure to compute energy, unlike the energy functions [34][35][36]. The computation of PSEE considers the potential contact partners (amino acids) and the contact energies in the neighborhood of the primary protein sequence, as well as the relative exposure of the target residue and its partners. Our empirical results indicate that PSEE can serve as a valuable feature for the prediction of disordered proteins, secondary structure types and accessible surface area, where mapping 1D sequence information to 3D structural information is essential. As an application, we enhance our disordered protein predictor, DisPredict [16], with the new PSEE feature and name the result DisPredict2. DisPredict2 outperforms DisPredict in predicting disordered residues and shows competitive results against six existing disordered protein predictors in the literature.

Extraction of Position Specific Estimated Energy (PSEE) from Sequence
The free energy of a protein chain is a function of effective inter-residual contacts in its three dimensional conformation. Thomas and Dill described an iterative method [31] to extract interaction potentials, named ENERGI, from a set of protein structures obtained from Protein Data Bank (PDB) [14]. Initially, the 20 × 20 contact energy matrix in [31] is derived from known structures of 37 protein chains. A similar approach is applied in [33] to recalculate the contact energies between all possible pairs of 20 different amino acids using known structures of 785 proteins from PDB. However, the amino acid composition in the primary structure of protein determines its native structure with favorable energy. Therefore, it is believed that the pairwise contact energy can be extracted from the amino acid sequence [33]. The predicted pairwise contact energies in [33] are derived using the primary structures (amino acid sequences) of 674 proteins by least square fitting with the contact energies derived from the tertiary structures of 785 proteins. The actual and predicted energies are found to have a linear relationship, explained in [33].
Here, we present a novel idea of extracting the position specific estimated energy (PSEE) contribution of each residue in a protein from its sequence alone. The idea behind predicting pairwise energies in [33] is that the energy contribution of a residue depends on the amino acid type of that residue as well as the types of its partners in the sequence. We hypothesized that the position specific energy of a protein residue includes the contact effects with different types of amino acids within a neighborhood along the primary sequence. Therefore, we utilize the energy matrix (P) derived in [33] and shown in Table 1 to include the effect of having a variable count of residues of different amino acid types that can form favorable contacts with the target residue. Further, we hypothesized that the position specific energy contribution of a protein residue is related to the relative solvent accessibility (RSA) of the target residue and the residues within its neighborhood region. The RSA of a residue is used to determine its proportional exposure (pExp) or proportional burial (pBur), which defines its effective contact surface and can therefore characterize the local environment of that residue in the tertiary structure. In the process of protein folding, the hydrophobic amino acids, having less pExp, act as a driving force to develop the core of the tertiary structure, while the hydrophilic residues usually stay on the surface of the protein with relatively higher pExp. Thus, the pExp (or pBur) of a residue can provide useful information for capturing the local solvent effects and can help in computing the favorable (negative) energy contribution of that residue in its native structure. Let $AA_i$ be the $i$-th amino acid residue of the protein sequence, where $i \in \{1, \ldots, L\}$ and $L$ is the length of that protein sequence.
Table 1. Predicted pairwise contact energy matrix derived in [33].
$N_i$ is the neighborhood region around $AA_i$ that consists of the contact partner residues of $AA_i$. $N_i$ includes the contact radius (CR) number of residues on either side of a target residue ($AA_i$). Thus the size of $N_i$ is equal to 2CR. The predicted pairwise contact energy between $AA_i$ and $AA_j$ is denoted by $P(AA_i, AA_j)$, where $AA_j$ belongs to $N_i$. We weight this contact potential by the proportional burial of the contact partners to capture the essential contact effect in the estimation of the position specific energy of the target residue $AA_i$. Therefore, $PSEE(AA_i)$ is formulated as Eq 1.
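The neighborhood-based estimate described above can be illustrated with a short sketch. Because Eq 1 itself is not reproduced in this text, the code assumes the simplest form consistent with the description: the pairwise energy $P(AA_i, AA_j)$ of every partner within CR positions is weighted by the proportional burial of that partner and summed. The function name `psee`, the toy matrix `TOY_P`, the burial values and the truncation of the window at the chain termini are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy energies for two residue types only; the real 20 x 20
# matrix P is the one referred to as Table 1 (derived in [33]).
TOY_P: Dict[Tuple[str, str], float] = {
    ("L", "L"): -1.0, ("L", "K"): 0.5,
    ("K", "L"): 0.5, ("K", "K"): 1.0,
}


def psee(seq: str, p_bur: List[float],
         energy: Dict[Tuple[str, str], float], cr: int = 9) -> List[float]:
    """Sum burial-weighted pairwise energies over the partners of each residue.

    Assumed form (not taken from the source):
        PSEE(AA_i) = sum over j in N_i of P(AA_i, AA_j) * pBur(AA_j)
    The window is simply truncated at the chain termini (also an assumption).
    """
    if len(seq) != len(p_bur):
        raise ValueError("seq and p_bur must have the same length")
    energies = []
    for i, aa_i in enumerate(seq):
        first = max(0, i - cr)
        last = min(len(seq) - 1, i + cr)
        total = 0.0
        for j in range(first, last + 1):
            if j == i:
                continue  # the target residue is not its own partner
            total += energy[(aa_i, seq[j])] * p_bur[j]
        energies.append(total)
    return energies


if __name__ == "__main__":
    # Three-residue toy chain with a contact radius of 1.
    print(psee("LLK", [1.0, 0.5, 0.0], TOY_P, cr=1))
```

With a real energy matrix and predicted burial values, the same loop would yield one energy value per residue, which is the quantity the remainder of this section thresholds and averages.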
Computation of proportional exposure (or, burial). RSA of a protein residue is calculated by normalizing the accessible surface area (ASA) of that residue by the surface area of the same type of residue in a reference state. We used the ASA normalizing values derived in [37] using Gly-X-Gly tripeptide as the reference state for a given residue X. Therefore, the proportional exposure (pExp) and burial (pBur) can be expressed by the Eqs 2 and 3.
$$\mathrm{pExp}(AA_i) = \frac{\text{predicted ASA}(AA_i)}{\text{ASA}(AA_i)\ \text{in the extended conformation Gly-X-Gly}} \tag{2}$$
$$\mathrm{pBur}(AA_i) = 1 - \mathrm{pExp}(AA_i) \tag{3}$$
The ASA normalization values are listed in Table 2. We utilized a new ASA predictor framework, REGAd³p [34], to generate the predicted ASA of the residues. REGAd³p [34] is a new real-value ASA predictor from protein sequence alone that showed a maximum Pearson correlation coefficient (PCC) of 0.76 on a blind test dataset.
Table 2. ASA normalization values for 20 amino acids in Å², proposed in [37].
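As a small worked illustration of the normalization, the sketch below divides a predicted ASA by a reference ASA and takes burial as the complement of exposure. The reference value used in the example is a hypothetical placeholder, not an entry of Table 2, and clipping the ratio to [0, 1] is an added assumption for predicted values that exceed the reference.

```python
def proportional_exposure(predicted_asa: float, reference_asa: float) -> float:
    """Predicted ASA divided by the Gly-X-Gly reference ASA of the residue type.

    The result is clipped to [0, 1]; the clipping is an assumption made here.
    """
    if reference_asa <= 0.0:
        raise ValueError("reference_asa must be positive")
    ratio = predicted_asa / reference_asa
    return min(max(ratio, 0.0), 1.0)


def proportional_burial(predicted_asa: float, reference_asa: float) -> float:
    """Complement of the proportional exposure."""
    return 1.0 - proportional_exposure(predicted_asa, reference_asa)


if __name__ == "__main__":
    # Hypothetical numbers: predicted ASA of 55 and a reference ASA of 110,
    # both in square angstroms.
    print(proportional_exposure(55.0, 110.0), proportional_burial(55.0, 110.0))
```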
Determining contact radius (CR). The PSEE of a residue serves as a measure of the structural stability of that residue at its specific position. Structurally stable proteins, and likewise their residues, attain an energetically favorable (negative) condition compared to their unstructured counterparts. The quantification of PSEE by Eq 1 involves determining the contact radius (CR) of the neighborhood around the target residue. It is assumed that the target residue forms effective local contacts with the CR number of residues on either side of it. To determine the CR parameter for the computation of PSEE, we applied PSEE as a feature to characterize the structured (ordered) and unstructured (disordered) residues. We performed experiments to search for the best CR value in the range of 4 to 30. We executed this experiment on the DisProt database [38] of disordered proteins, which stores manually curated annotations of ordered and disordered residues. The recent release, DisProt version 6.02, contains 694 proteins with 1539 disordered regions. We excluded three chains from this set (Id: DP00688, DP00195 and DP00642) as they contain unknown amino acids, such as X, B and Z. Furthermore, the Cysteine (C) amino acid, being highly reactive due to its sulfhydryl group, caused abnormal PSEE values for some residues of 11 more protein sequences. A very high Cysteine-Cysteine pairwise interaction energy is also explicit in Table 1. Thus we excluded those 11 chains while tuning the value of CR. This filtering resulted in a list of 680 protein chains from the DisProt database [38], which we label the DisProt680 dataset. Subsequently, we computed the mean PSEE, formulated by Eq 4, of the DisProt annotated ordered (o) and disordered (d) residues.
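$$\overline{\mathrm{PSEE}}(o) = \frac{1}{n_o}\sum_{i=1}^{n_o} \mathrm{PSEE}(AA_i^{o}), \qquad \overline{\mathrm{PSEE}}(d) = \frac{1}{n_d}\sum_{i=1}^{n_d} \mathrm{PSEE}(AA_i^{d}) \tag{4}$$
Here, $n_o$ and $n_d$ are the total numbers of ordered and disordered residues, respectively, and the sums run over the ordered residues $AA_i^{o}$ and the disordered residues $AA_i^{d}$. We computed $\overline{\mathrm{PSEE}}(o)$ and $\overline{\mathrm{PSEE}}(d)$ for CR values of 4 to 30. For each value of CR, we define the threshold, t(PSEE), for PSEE based identification of ordered and disordered residues as the value that is equally distant from $\overline{\mathrm{PSEE}}(o)$ and $\overline{\mathrm{PSEE}}(d)$. Fig 1 shows $\overline{\mathrm{PSEE}}(o)$, $\overline{\mathrm{PSEE}}(d)$ and t(PSEE) for CR from 4 to 30.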
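The class means and the midpoint threshold can be sketched in a few lines. The label characters, the function names and the rule that a value exactly on the threshold counts as ordered are illustrative assumptions; the numbers in the usage example are toy values, not results from the DisProt680 dataset.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def psee_threshold(values: Sequence[float],
                   labels: Sequence[str]) -> Tuple[float, float, float]:
    """Return (mean ordered PSEE, mean disordered PSEE, midpoint threshold).

    `labels` holds "o" for ordered and "d" for disordered residues.
    """
    ordered = [v for v, lab in zip(values, labels) if lab == "o"]
    disordered = [v for v, lab in zip(values, labels) if lab == "d"]
    mean_o = mean(ordered)
    mean_d = mean(disordered)
    # The threshold is equally distant from the two class means.
    return mean_o, mean_d, (mean_o + mean_d) / 2.0


def classify(values: Sequence[float], threshold: float) -> List[str]:
    """Call a residue disordered when its energy lies above the threshold."""
    return ["d" if v > threshold else "o" for v in values]


if __name__ == "__main__":
    toy_values = [-1.0, -0.5, 0.0, 0.5]   # toy PSEE values
    toy_labels = ["o", "o", "d", "d"]
    m_o, m_d, t = psee_threshold(toy_values, toy_labels)
    print(m_o, m_d, t, classify(toy_values, t))
```

Repeating this for every candidate CR and scoring the resulting labels is one way to carry out the CR search described in this subsection.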
Fig 1 illustrates that PSEE identifies the energetically induced gap between structured and unstructured residues and clearly draws the separation line in terms of t(PSEE) for all values of CR. For CR values from 4 to 30, $\overline{\mathrm{PSEE}}(o)$ ranges from -0.51 to -0.58, whereas $\overline{\mathrm{PSEE}}(d)$ ranges from -0.13 to -0.15. Therefore, PSEE could recognize the energetically favorable (negative) condition of structured residues. We utilize t(PSEE) of the corresponding CR values to classify ordered versus disordered residues and thereby determine the CR value that best distinguishes the PSEE values of ordered and disordered residues. We plot the PSEE based disorder classification performance in terms of balanced accuracy (ACC), precision (PPV) and Matthews correlation coefficient (MCC) in Fig 2. We carried out this preliminary classification based on PSEE only to identify the effective CR value, thus we ignored the actual numerical values of the performance metrics here. Fig 2 shows that PSEE values calculated with a CR value of 9 classify disordered residues most accurately on the DisProt680 dataset. Thus we selected 9 as the best CR value and used it for the rest of the experiments in this work.
Datasets. We trained DisPredict2 with the same dataset as was utilized in order to train DisPredict [16] to have an accurate assessment of the effectiveness of the novel feature PSEE. DisPredict2 is trained with 477 protein sequences of the Short-Long (SL) [16,39] dataset. The SL477 dataset contains protein chains from DisProt [38] database. 50% of the disorder regions in this dataset are short with less than or equal to 20 residues, and the rest are long. The allowable similarity between protein sequence pairs is 25%. SL477 dataset consists of approximately 25%, 34% and 40% of residues annotated as disordered, ordered and unknown. The unknown residues are annotated as 'X'. We ignored X residues for training and evaluation purposes.
We tested and compared the performance of DisPredict2 with that of DisPredict [16] based on four independent datasets, DD73 [16], CASP8, CASP9 and CASP10. The DD73 dataset was prepared by us and we used it as a holdout dataset in [16]. While the training dataset, SL477, was extracted from the protein chains of the DisProt database version 5.0, DD73 accommodates 48 proteins from DisProt database versions 5.1 to 6.02. The remaining 25 single chain proteins were extracted from PDB [14] with the following criteria: i) X-ray structures with resolution 3.0 Å, ii) length ≥ 50 residues, and iii) 30% sequence identity cut-off. Later we removed sequences with more than 25% pairwise sequence similarity using BLASTCLUST (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html) from the NCBI-BLAST package [40]. Among the 73 protein chains, 37 are fully disordered, 23 are fully ordered and 13 have both ordered and disordered regions. For DisPredict2, we utilized the DD73 dataset for both independent evaluation of the predictor and optimization of the threshold for disordered residue classification. However, the CASP datasets were kept completely independent, and we did not carry out any optimization on those datasets.
The CASP8 dataset contains 122 protein chains; of which, 103 chains are X-ray derived protein structures and 19 chains are NMR structures. This dataset has approximately 11% disordered residues, and the rest of the residues are structured. We used 111 protein chains of the CASP9 dataset to test and compare DisPredict2 versus DisPredict [16]. For this dataset, only 10% of the total residues were annotated disordered. The CASP9 dataset has a similar proportion of X-ray and NMR derived protein structures. In the CASP10, 94 protein chains were used to assess the disorder predictors. For all CASP datasets, a residue is considered disordered if it lacks spatial coordinates or shows a high conformational variability across different X-ray structures or NMR models.
Feature set. In DisPredict2 we appended PSEE to the 56 features per residue used in DisPredict [16]. Therefore, we have 57 features per residue in DisPredict2. The residue level information includes: (i) amino acid type, encoded by a single value, as all the necessary information for the correct folding of a protein can be extracted from its amino acid sequence [3]; (ii) seven physicochemical properties [41] of amino acids, as different types of disordered regions (short or long) in a protein are found to have distinct physicochemical properties; (iii) twenty PSSM (position specific scoring matrix) values indicating the evolutionary information conserved at each residue position of a protein sequence; (iv) three predicted secondary structure (helix, beta and coil) probabilities from SPINE-X [42], one predicted relative surface area [43] and two predicted backbone torsion angle (phi, psi) fluctuations [35], since disordered residues are characterized by a lack of stable secondary structure, high exposed area and higher fluctuations of torsion angles; (v) one monogram and twenty bigrams computed from PSSM [44] representing the conserved evolutionary information at the three-dimensional structural level; (vi) one indicator for terminal residues, where the five residues at the N terminal and at the C terminal are indicated by -1.0 to -0.2 and +0.2 to +1.0, respectively, with a step size of 0.2; and (vii) one position specific estimated energy (PSEE) value. Finally, before feeding the features into the classifier, the information of 10 neighboring residues on either side of the target residue was aggregated using a sliding window of 21, resulting in 21 × 57 = 1197 features per residue.
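The window aggregation can be made concrete with a minimal sketch that concatenates the 57 per-residue features of the target residue and its 10 neighbors on each side into one vector of 21 × 57 = 1197 values. How positions beyond the chain termini are filled is not stated in the text; zero padding is assumed here, and the function name `window_features` is hypothetical.

```python
from typing import List


def window_features(per_residue: List[List[float]],
                    half_window: int = 10) -> List[List[float]]:
    """Concatenate each residue's features with those of its neighbors.

    Positions that fall outside the chain are filled with zeros, which is
    an assumption of this sketch rather than a detail given in the text.
    """
    if not per_residue:
        return []
    n_features = len(per_residue[0])
    padding = [0.0] * n_features
    windows = []
    for i in range(len(per_residue)):
        row: List[float] = []
        for j in range(i - half_window, i + half_window + 1):
            inside = 0 <= j < len(per_residue)
            row.extend(per_residue[j] if inside else padding)
        windows.append(row)
    return windows


if __name__ == "__main__":
    # Toy chain of 30 residues, each with 57 identical placeholder features.
    chain = [[float(i)] * 57 for i in range(30)]
    print(len(window_features(chain)[0]))
```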
Predictor framework and performance evaluation. We developed DisPredict2 using the support vector machine (SVM) algorithm, following our initially designed DisPredict predictor [16]. In order to evaluate the contribution of the proposed PSEE feature in DisPredict2, we used the same parameter optimization and training procedure for the SVM algorithm as in DisPredict [16]. Moreover, the same datasets were used, with the PSEE feature appended to the previously used feature set. SVM with a radial basis function (RBF) kernel simultaneously minimizes the empirical classification error (training error) and the generalized error (test error) by maximizing the geometric margin of the separating hyperplane. The DisPredict2 predictor framework has three levels. The first level is the parameter tuning that determines the optimal values of two parameters of the SVM classifier, namely C and γ, where C is the cost of misclassification that penalizes the feature space points on the wrong side of the decision boundary and γ is the parameter of the RBF kernel. The parameter selection was done by a grid search using 5% of the training dataset, guided by 5-fold cross validation that optimizes the accuracy (fraction of correctly predicted residues). The best parameter values found by the grid search are C = 0.5 and γ = 0.0078125. The second level of DisPredict2 development involves the prediction model that generates both binary annotations and real valued probabilities of order versus disorder for each residue. A probability p in the range 0.5 ≤ p ≤ 1.0 is considered a disorder probability and one in the range 0.0 ≤ p < 0.5 is considered an order probability. The first and second levels of the predictor were developed using LIBSVM [33]. The third level of the predictor optimizes the threshold for disorder classification and adjusts the predicted annotation of each residue based on the optimized threshold.
We employed Youden's J statistic [45] to find the optimal threshold for disorder prediction by analyzing the receiver operating characteristic (ROC) curve using the pROC package [46]. This statistic determines the optimal cut-off that maximizes the distance from the identity (diagonal) line. The optimality criterion is formulated as
$$\max(\text{sensitivities} + \text{specificities}) \tag{5}$$
To make our predictor robust, we carried out the threshold optimization with an independent test dataset, DD73. The best threshold value found is 0.79. Therefore, we curated the annotation output given by the SVM model using 0.79 ≤ p ≤ 1.0 as disorder probability and 0.0 ≤ p < 0.79 as order probability. Further, we scaled the probability range [0.0, 0.79) into [0.0, 0.5) for the ordered residues and [0.79, 1.0] into [0.5, 1.0] for the disordered residues to make the output of DisPredict2 more natural for binary classification.
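The threshold selection and the subsequent rescaling can be sketched as follows. This is a plain Python stand-in for the ROC analysis, not a reproduction of the pROC package: candidate cut-offs are taken to be the observed probabilities themselves, and the rescaling is assumed to be piecewise linear, which the text does not state explicitly. Function names are hypothetical.

```python
from typing import Sequence


def youden_threshold(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Return the cut-off that maximizes sensitivity + specificity.

    `labels` uses 1 for disordered and 0 for ordered residues; a residue is
    predicted disordered when its probability is >= the cut-off.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_cut, best_score = 0.0, -1.0
    for cut in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < cut and y == 0)
        score = tp / n_pos + tn / n_neg
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut


def rescale(p: float, cut: float = 0.79) -> float:
    """Map [0, cut) onto [0, 0.5) and [cut, 1] onto [0.5, 1] linearly."""
    if not 0.0 < cut < 1.0:
        raise ValueError("cut must lie strictly between 0 and 1")
    if p < cut:
        return 0.5 * p / cut
    return 0.5 + 0.5 * (p - cut) / (1.0 - cut)


if __name__ == "__main__":
    toy_probs = [0.1, 0.4, 0.6, 0.9]   # toy SVM outputs
    toy_labels = [0, 0, 1, 1]
    print(youden_threshold(toy_probs, toy_labels), rescale(0.79))
```

After rescaling, a fixed cut-off of 0.5 reproduces the decisions made with the optimized threshold on the original scale.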
The binary outputs given by DisPredict2 are evaluated and compared using the measures listed in Table 3. MCC is considered the most balanced measure for binary classification [29]. Additionally, we computed the Area Under the ROC Curve (AUC), considered as the measure for the probability assignment. We further plotted the ROC curves and Precision-Recall curves. The AUC values and both curve plots were generated using the ROCR package [47].
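For reference, the three binary measures used throughout the comparisons can be computed from confusion-matrix counts as sketched below. Table 3 is not reproduced in this text, so the sketch relies on the standard textbook definitions of balanced accuracy, precision and MCC; returning 0.0 when a denominator vanishes is a convention chosen here.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Balanced accuracy (ACC), precision (PPV) and MCC from raw counts."""
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (sensitivity + specificity) / 2.0,
        "PPV": _ratio(tp, tp + fp),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Toy counts for an imbalanced set: 10 disordered and 90 ordered residues.
    print(binary_metrics(tp=5, tn=85, fp=5, fn=5))
```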

Results
In this section, we highlight the usefulness of PSEE for characterizing the structural stability of protein residues. Our results show that PSEE can effectively distinguish ordered and disordered residues, residues of three different secondary structures (helix, beta and coil) as well as residues with different physical properties (hydrophobic and hydrophilic). Therefore, PSEE can effectively extract useful biological information from sequence, making it a useful feature for machine learning based computational tools for disorder prediction, secondary structure prediction, residue exposure prediction, contact prediction, binding region prediction, etc. Further, we report the predictive performance of DisPredict2, an updated version of the disorder predictor DisPredict [16].

Discriminatory Capacity of PSEE
Ordered and disordered residues. Fig 3(A) shows the mean PSEE of ordered and disordered residues of the DisProt680 dataset with a contact radius of 9 on either side of the target residue. The absolute gap between $\overline{\mathrm{PSEE}}(o)$ and $\overline{\mathrm{PSEE}}(d)$ is 0.363, which makes PSEE a reasonable feature to classify ordered versus disordered residues.
Further, we investigated the PSEE values at the regional level. Fig 3(B) plots the PSEE values for IDRs and ordered regions (ORs), computed as the average PSEE values of the respective residues of the regions. The average PSEE value for all IDRs is -0.391 and that for ORs is -1.00. The black dashed line in Fig 3(B) shows the separation line, computed as the middle value (-0.698) of the two average PSEE values for all IDRs and ORs. Therefore, the region below -0.698 is energetically favorable, whereas the region above it is unfavorable. Fig 3(B) shows that the PSEE values for some IDRs fall into the favorable region as well. To investigate this further, we segregated the IDRs into four types depending on their length: IDRs with < 5 residues, 5-20 residues, 21-39 residues and ≥ 40 residues. Then we computed the average PSEE for all IDRs having similar lengths. Fig 4 shows the average PSEE for all ORs, all IDRs and the 4 different types of IDRs, along with the separation threshold shown in Fig 3(B). The relatively longer IDRs with 21-39 and ≥ 40 residues have PSEE values of -0.373 and -0.274, which are more unfavorable (less negative) than the value obtained when considering all IDRs, -0.391. Therefore, PSEE is useful in identifying long disordered regions. It is important to note that the average PSEE for the shorter IDRs with < 5 residues, -0.544, is close to the separation line, -0.698; thus these shorter IDRs tend to have favorable energy. These short disordered regions are often called binding sites, which are biologically important as they undergo disorder-to-order transitions by interacting with various partners. Identifying the binding sites in disordered regions is one of the most recent research areas due to the functional importance of binding sites. Our result shows that PSEE can be used to detect the existence of such binding regions.
Helix, beta and coil residues. To demonstrate the performance of PSEE in capturing the structural differences among residues of three different secondary structure types (helix, beta and coil), we computed the mean PSEE for helix (h), beta (e) and coil (c) residues using Eq 6.
$$\overline{\mathrm{PSEE}}(s) = \frac{1}{n_s}\sum_{i=1}^{n_s} \mathrm{PSEE}(AA_i^{s}), \qquad s \in \{h, e, c\} \tag{6}$$
Here, $n_s$ is the total number of residues of secondary structure type $s$ and the sum runs over those residues.
We applied a new secondary structure predictor, called MetaSSPred [48], to generate predicted annotations for helix (h), beta (e) and coil (c) residues. MetaSSPred [48] is a balanced secondary structure predictor that can overcome the under-prediction of the less dominant beta residues in the datasets. Helix and beta residues are usually located in the core of the protein, having favorable energy. Beta residues are more structured compared to helix residues. On the other hand, coil residues stay on the surface areas of proteins and are highly flexible, having unfavorable energy. Fig 5(A) shows $\overline{\mathrm{PSEE}}(h)$, $\overline{\mathrm{PSEE}}(e)$ and $\overline{\mathrm{PSEE}}(c)$ for residues of the DisProt680 dataset. Beta residues have the most negative PSEE and coil residues the least negative PSEE, whereas helix residues lie between the two. This result supports the usefulness of PSEE in identifying residues of different secondary structures. To further confirm this, we repeated a similar experiment on another dataset, generated by us in [34] specifically for secondary structure analysis. This dataset is called the secondary structure dataset (SSD) and contains 1299 protein sequences with known structures from PDB. We ran DSSP [49] to generate the actual annotations of secondary structures for the residues of the SSD1299 dataset and MetaSSPred [48] for the predicted annotations. The eight class annotations provided by DSSP were converted into three classes using a mapping similar to that given in [34,48]. The mean PSEE values for the residues in the SSD1299 dataset are shown in Fig 5(B). PSEE consistently distinguished the three types of residues annotated by DSSP as well as by MetaSSPred for the SSD1299 dataset. Therefore, PSEE will serve as a useful feature for secondary structure prediction.
Hydrophobic and hydrophilic residues. Hydrophobic (H) amino acids build up the core of a protein, and the hydrophilic or Polar (P) ones preferentially cover the surface of the protein and are in contact with solvent due to their ability to form hydrogen bonds. Therefore, the hydrophobic residues gain energetically favorable conditions compared to hydrophilic residues. Hydrophobic amino acids are A, G, I, L, M, F, P, W, Y and V, whereas the hydrophilic amino acids are R, N, D, C, Q, E, H, K, S and T. We computed the mean PSEE for the H and P type residues of both the DisProt680 dataset and the SSD1299 dataset. Fig 6 shows that for both datasets, the mean PSEE values for hydrophobic and hydrophilic residues are negative and positive, respectively. Thus PSEE effectively discriminates hydrophobic and hydrophilic residues. As the hydrophobicity of the residues is directly related to the ASA of the residues, PSEE can serve as a useful feature for ASA prediction [34].
We further collected the hydrophobicity index for the 20 different amino acids from [41] and computed the mean PSEE for residues of each of the 20 amino acid types in the SSD1299 dataset. Residues with positive hydrophobicity are expected to have a negative mean PSEE. Fig 7 shows the correlation between the hydrophobicity index and the mean PSEE of the 20 amino acid types, with a correlation coefficient (CC) equal to -0.86. This result emphasizes that the (aggregated) PSEE is strongly correlated with the physical property, hydrophobicity, of the amino acid residues, which in turn confirms that the proposed approach does not deviate significantly from the statistics obtained in previous work [41]. A negative CC value is desirable, as the high (positive) hydrophobicity of a residue indicates structural stability and thus a favorable (negative) energy contribution. Proline is referred to as hydrophobic; however, it is found more often in turns (coils) with unstable structure than in helices or beta sheets. Thus it has positive hydrophobicity as well as a positive PSEE, which corresponds to unstable structure.
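The correlation check described above can be sketched with a plain Pearson coefficient over per-amino-acid values. The text does not state which correlation coefficient was used, so Pearson is an assumption, and the numbers in the usage example are hypothetical placeholders rather than the hydrophobicity indices of [41] or the measured mean PSEE values.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-amino-acid values, for illustration only.
    toy_hydrophobicity = [1.8, 0.4, -0.7, -3.5]
    toy_mean_psee = [-1.1, -0.3, 0.2, 0.9]
    print(pearson(toy_hydrophobicity, toy_mean_psee))
```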

Disorder Prediction Performance by DisPredict2
In this section, we measure the benefits of using PSEE as a feature for structure (or disorder) classification and prediction by comparing DisPredict2 with seven other state-of-the-art disorder predictors. These predictors include our initial disorder predictor, DisPredict [16], as well as SPINE-D [19], MFDp [20], MFDp2 [21], Espritz [15], IUPred-Long (IUPred-L) and IUPred-Short (IUPred-S) [33]. SPINE-D [19] is a two-layered neural network based technique that was initially developed for three state prediction (disordered residues in short and long regions, ordered residues) and later reduced to two state prediction (disordered vs ordered residues). Espritz [15] is a high throughput predictor that uses a recursive neural network. MFDp [20] and MFDp2 [21] are meta predictors that combine the outputs of different complementary disorder predictors to obtain a further curated prediction. MFDp [20] combines four predicted disorder probabilities from IUPred-L [26,33], IUPred-S [26,33], DISOPRED2 [22] and DISOclust [50], while its incremental version, MFDp2 [21], further incorporates sequence based predicted disorder content from DisCon [52]. IUPred-L [26,33] and IUPred-S [26,33] predict disordered residues in long and short regions, respectively, using predicted interaction energies. The formulation used in [26,33] included a sequential local environment by involving interactions with potential partners. Our formulation of PSEE further improves the pairwise energy based feature by strategically combining the proportional burial information of the potential partners, which determines the local structural environment. For a comprehensive comparison, we separately ranked the predictors in terms of balanced accuracy (ACC), precision (PPV), Matthews correlation coefficient (MCC) and Area Under the ROC Curve (AUC). We gave the same rank to all predictors having similar scores.
We assigned a cumulative score ($S_c$) as the summation of ranks according to the different metrics and determined the final rank according to that cumulative score. The results highlight that DisPredict2 is competitive with the different neural network based methods, meta-predictors and predictors that use predicted pairwise energy as a feature. Moreover, the comparative performance analysis of DisPredict2 versus DisPredict is provided to highlight the utility of PSEE as a feature for disorder prediction.
Table 4 shows the performance comparison based on the DD73 dataset. This dataset is collected from both DisProt [38] and PDB [14], and is independent of the training dataset, SL477. DisPredict2 was assigned rank 1 in terms of ACC, MCC and AUC and achieved the highest $S_c$ with a final rank of 1. MFDp2 gave the highest PPV only; however, it finally ranked 2 according to the overall performance. Moreover, DisPredict2 provided 0.41%, 6.35%, 3.48% and 1.36% improvement over DisPredict in terms of ACC, PPV, MCC and AUC, respectively. These improvements highlight the benefits of using PSEE as a feature. Fig 8 compares the ROC curves and precision-recall curves given by the predictors. DisPredict2, DisPredict and SPINE-D gave comparable ROC curves outperforming the others, while DisPredict2, DisPredict, MFDp and MFDp2 gave better precision for the recall range 0.3 to 0.8 than the others.
Table 5 shows the performance of the predictors based on the CASP8 dataset. SPINE-D stood first in terms of ACC and AUC scores; however, it gave 33.5% and 7.4% lower PPV and MCC than MFDp2, which ranked 1 according to these two scores. DisPredict2 showed comparable performance in terms of all the metrics, attained the best cumulative score and was finally ranked 1. Thus the overall performance of DisPredict2 is promising.
Furthermore, DisPredict2 gave 0.38% lower ACC than DisPredict, while yielding 18.73%, 8.81% and 2.17% higher PPV, MCC and AUC, respectively. Fig 9 compares the ROC curves and precision-recall curves. SPINE-D, Espritz, DisPredict2 and MFDp2 gave competitive ROC curves, while SPINE-D gave the best precision-recall curve.
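The ranking scheme used for Tables 4 to 7 (one rank per metric, shared ranks for similar scores, and a cumulative score S_c obtained by summing the ranks) can be sketched as follows. This is only an illustration under stated assumptions: the tie tolerance `tol`, the dense-ranking convention and the function names are hypothetical, since the text states only that predictors with similar scores receive the same rank, and the scores in the example are placeholders rather than values from the tables.

```python
from typing import Dict, Tuple


def dense_ranks(scores: Dict[str, float], tol: float = 1e-3) -> Dict[str, int]:
    """Rank items by score, higher is better.

    An item shares the rank of the previous item when its score is within
    `tol` of it (assumed meaning of "similar scores").
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks: Dict[str, int] = {}
    rank = 0
    prev = None
    for name, score in ordered:
        if prev is None or prev - score > tol:
            rank += 1  # clearly worse than the previous score: next rank
        ranks[name] = rank
        prev = score
    return ranks


def cumulative_ranking(
    table: Dict[str, Dict[str, float]], tol: float = 1e-3
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Sum per-metric ranks into S_c and derive the final rank.

    `table[metric][predictor]` holds a score where higher is better.
    A lower S_c is better here because rank 1 is the best rank.
    """
    totals: Dict[str, int] = {}
    for scores in table.values():
        for name, r in dense_ranks(scores, tol).items():
            totals[name] = totals.get(name, 0) + r
    # Negate S_c so that dense_ranks (higher is better) can be reused;
    # tol=0 means only exactly equal cumulative scores share a final rank.
    final = dense_ranks({n: -float(t) for n, t in totals.items()}, tol=0.0)
    return totals, final


# Placeholder scores for three hypothetical predictors A, B and C.
example = {
    "ACC": {"A": 0.80, "B": 0.75, "C": 0.75},
    "MCC": {"A": 0.50, "B": 0.55, "C": 0.40},
}
totals, final = cumulative_ranking(example)
print(totals)  # cumulative score S_c per predictor
print(final)   # final rank per predictor
```

In this toy example A and B end up with the same cumulative score and therefore share the first position, which mirrors how several predictors can be ranked 1st on the same dataset.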
The comparative performances of the predictors on 111 protein chains of the CASP9 dataset are reported in Table 6. This is a highly imbalanced dataset with approximately 10% of the residues annotated as disordered. MCC is regarded as the best measure for evaluating prediction performance on such an imbalanced dataset, as it does not favor over-prediction of the dominating class [29]. DisPredict2 achieved the best MCC and precision (PPV) scores on the CASP9 dataset, while ranking 3rd according to ACC and AUC. Conversely, SPINE-D gave the best ACC and AUC; however, it provided 26.5% lower precision than DisPredict2. DisPredict2 obtained the 1st position in the final ranking, with cumulative score differences of 2 and 4 from Espritz and SPINE-D, which took the 2nd and 3rd positions, respectively. Moreover, DisPredict2 with PSEE performed 20%, 5.76% and 1.69% better than DisPredict in terms of PPV, MCC and AUC, respectively, with a slightly lower (2.66%) accuracy. In Fig 10 [...] outcomes) at some threshold values. However, the PPV values show an increasing trend afterwards.
Table 7 illustrates the performance comparison on the CASP10 dataset. This dataset has only 6.2% of the residues annotated as disordered. DisPredict2 achieved reasonable, although not the best, ranks in terms of all the scores. In contrast, SPINE-D gave the highest ACC and AUC values with very low precision (ranked 6th). Similarly, MFDp2 showed the best precision with low ACC (ranked 6th), and Espritz gave the best MCC with low ACC (ranked 5th). The cumulative scores of DisPredict2, SPINE-D and Espritz were the same; therefore, all three were finally ranked 1st. Moreover, the performance of DisPredict2 is 39.06%, 15.73% and 3.58% higher than that of DisPredict in terms of PPV, MCC and AUC, respectively. Therefore, DisPredict2 turns out to be a better disorder predictor than DisPredict [16], with PSEE as the only additional feature.
Fig 11 shows that SPINE-D consistently resulted in better ROC and precision-recall curves, with the highest AUC and ACC values in Table 7, whereas the curves of DisPredict, DisPredict2 and Espritz were comparable.
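The remark that MCC does not reward over-prediction on imbalanced datasets can be illustrated with a small calculation. The sketch below uses the standard textbook definitions of balanced accuracy, precision and MCC computed from a confusion matrix; it is assumed, not confirmed here, that these match the definitions used for Tables 4 to 7. The two confusion matrices are invented for a hypothetical set of 1000 residues with 10% disorder and do not correspond to any predictor in the tables.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Balanced accuracy, precision (PPV) and MCC from confusion-matrix counts.

    The positive class is "disordered". Standard definitions are used.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    acc = 0.5 * (sensitivity + specificity)  # balanced accuracy
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PPV": ppv, "MCC": mcc}


# Hypothetical predictor that over-predicts disorder: high recall, many
# false positives among the 900 ordered residues.
over = binary_metrics(tp=90, fp=270, tn=630, fn=10)

# Hypothetical conservative predictor: lower recall, few false positives.
conservative = binary_metrics(tp=60, fp=20, tn=880, fn=40)

for name, m in (("over-predicting", over), ("conservative", conservative)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this invented example the over-predicting classifier has the slightly higher balanced accuracy but a much lower PPV and MCC, which is the kind of pattern that makes a predictor rank first on ACC while ranking low on precision.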
Amyloidogenic region (AR) prediction by DisPredict2. To emphasize the biological significance of the outputs provided by DisPredict2, we collected 7 sequences from AMYPdb [51] and computed the disorder probabilities of their residues using DisPredict2. These protein sequences contain amyloidogenic regions (ARs) that are insoluble but can interact improperly to form amyloids. ARs play an important role in protein aggregation, and they are directly linked with critical human diseases such as neurological disorders. Fig 12 shows the location and description of the ARs, the mean and standard deviation of the disorder probabilities of the AR residues, and the probability plots for the proteins. The mean disorder probabilities for the seven amyloidogenic regions range from 0.213 to 0.776, with an average of 0.45 (approximately in the middle of the probability range) and a standard deviation of 0.203. Therefore, DisPredict2 identified the disorder (without amyloid formation) to order (with amyloid formation) transitions and the associated structural flexibility of the amyloidogenic regions.

Discussion
In this paper, we describe the extraction of a position specific estimated energy, named PSEE, for each residue of a protein based on sequence information alone. The quantification of PSEE includes the interaction effect of the target residue within a neighborhood in terms of pairwise contact energies between different amino acid types. We define the estimated neighborhood size in terms of the number of residues on either side of the target residue with which it can form favorable contacts. Furthermore, PSEE utilizes the predicted relative exposure (or burial) of a residue to approximate the local three-dimensional conformational position and stability of the residue. Our results show that PSEE is very effective in characterizing ordered (structurally stable) and disordered (structurally unstable) residues as well as regions in protein sequences. Moreover, a fine-grained analysis highlights that the average PSEE of the residues of the binding sites in disordered regions is well separable from those of disordered or ordered regions. Therefore, PSEE detects the existence of critical binding regions in disordered proteins that undergo disorder-to-order transitions and perform crucial biological functions [52]. PSEE is also effective in distinguishing the residues of three different secondary structure types (helix, beta and coil) in two different datasets. Residues with complementary physical properties, such as hydrophobic and hydrophilic residues, are well separated by PSEE. In addition, PSEE correlates strongly with the hydrophobicity index of the 20 amino acid types.
Here, we further discuss the capacity of PSEE to capture multiple structural properties of the residues within the DisProt680 dataset. Fig 13 shows the correlation between pExp (or pBur) and PSEE of disordered and ordered regions. The vertical dashed line is the separation (-0.698) of PSEE for ORs and IDRs, and the horizontal dash-dotted line indicates the separation between exposed and buried residues. We assumed that the residues with relative exposure less than 25%, computed by Eq 2, are buried. We collected the ASA of the residues of the DisProt680 dataset by running REGAd3p [34]. The region to the left of the vertical line is energetically favorable; most of the ordered regions (blue circles) have PSEE in this region, and most of the disordered regions (red diamonds) have PSEE on the right side. Specifically, the first quadrant (top-right corner) of the plot is the major distribution area of the disordered regions, with unfavorable (positive) energy and higher exposure. On the other hand, the third quadrant (bottom-left corner) of the plot is the essential region for ordered regions, with favorable (negative) energy and lower exposure. It is evident in Fig 13 that the PSEE values of most of the disordered regions are in the first quadrant. Therefore, PSEE can capture the exposure property of the residues and, at the same time, can categorize them as ordered or disordered. However, the other quadrants also contain some disordered regions. Fig 14 shows a similar correlation analysis between the coil-like tendency and PSEE of disordered and ordered regions. We collected the coil probability of the residues of the DisProt680 dataset by running MetaSSPred [48] and assumed that the residues with higher than 50% coil probability have a flexible structure.
Therefore, the first quadrant (top-right corner) of the plot is the essential area for disordered regions, with unfavorable (positive) energy and high coil probability. On the other hand, the third quadrant (bottom-left corner) of the plot is the essential region for ordered regions, with favorable (negative) energy and low coil probability. Fig 14 shows that most of the PSEE values for ordered regions fall in the third quadrant, whereas those of disordered regions fall in the first quadrant. However, in both Figs 13 and 14, the other quadrants also contain some disordered regions. This can be caused by mis-annotation of disorder [16] in the DisProt database or by the disorder-to-order transition of binding sites.
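The quadrant reading of Figs 13 and 14 can be expressed as a simple rule. The sketch below assigns a region to a quadrant from its PSEE value and a second property, using the PSEE separation of -0.698 together with the 25% relative exposure cutoff (Fig 13) or the 50% coil probability cutoff (Fig 14). The function name, the treatment of values lying exactly on a cutoff and the sample values are assumptions made for illustration; they are not taken from the DisProt680 data.

```python
from collections import Counter
from typing import Iterable, Tuple

PSEE_CUT = -0.698      # separation between ORs and IDRs reported in the text
EXPOSURE_CUT = 0.25    # relative exposure below 25% is treated as buried
COIL_CUT = 0.50        # coil probability above 50% is treated as flexible


def quadrant(psee: float, y: float, y_cut: float, psee_cut: float = PSEE_CUT) -> int:
    """Return the quadrant (1 to 4) of a point in a PSEE-versus-property plot.

    1: top-right    (unfavorable energy, high exposure or coil probability)
    2: top-left     (favorable energy, high exposure or coil probability)
    3: bottom-left  (favorable energy, low exposure or coil probability)
    4: bottom-right (unfavorable energy, low exposure or coil probability)
    Points exactly on a cutoff are assigned to the right/top side (assumption).
    """
    right = psee >= psee_cut
    top = y >= y_cut
    if right and top:
        return 1
    if not right and top:
        return 2
    if not right and not top:
        return 3
    return 4


def quadrant_counts(points: Iterable[Tuple[float, float]], y_cut: float) -> Counter:
    """Count how many (PSEE, property) points fall in each quadrant."""
    return Counter(quadrant(p, y, y_cut) for p, y in points)


# Invented (PSEE, relative exposure) pairs for a handful of regions.
sample = [(0.20, 0.60), (-0.10, 0.45), (-1.50, 0.10), (-1.20, 0.40), (0.05, 0.10)]
print(quadrant_counts(sample, EXPOSURE_CUT))
```

Counting annotated disordered and ordered regions per quadrant in this way would give a numerical summary of how much of each class lies outside its expected quadrant.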
This promising correlation between different structural properties and the PSEE of protein residues motivated us to propose PSEE as a feature for the development of predictive tools in bioinformatics and computational biology. To validate our argument, we constructed DisPredict2, a new protein disorder predictor, by integrating PSEE into the feature set of an existing protein disorder predictor, DisPredict [16]. DisPredict2 demonstrated improved performance over DisPredict [16] and six other disorder predictors on four different datasets, including the CASP8, CASP9 and CASP10 datasets. Moreover, the disorder probability output given by DisPredict2 reflects the flexible structural transformation of the amyloidogenic regions of proteins. Therefore, we believe that both the new position specific residue feature, PSEE, and the disorder predictor, DisPredict2, will be effective in gaining insights into protein structures and their respective functions.