Identify Beta-Hairpin Motifs with Quadratic Discriminant Algorithm Based on the Chemical Shifts

Successful prediction of the beta-hairpin motif is helpful for fold recognition. Several algorithms have been proposed for predicting beta-hairpin motifs, but the parameters they use are derived primarily from amino acid sequences. Here, we propose a novel model for predicting beta-hairpin structure based on chemical shifts. First, we analyzed the statistical distribution of the chemical shifts of six nuclei in beta-hairpin and not beta-hairpin motifs. Second, we used these chemical shifts as features, combined with three algorithms, to predict beta-hairpin structure. Finally, we achieved the best prediction with the quadratic discriminant analysis algorithm, namely a sensitivity of 92% and a specificity of 94% with a Matthews correlation coefficient of 0.85 in three-fold cross-validation, which is clearly superior to the same method applied to the 20 amino acid compositions. Our findings show that the chemical shift is an effective parameter for beta-hairpin prediction and suggest that quadratic discriminant analysis is a powerful algorithm for this task.


Introduction
Protein function is inherently correlated with protein structure, so the prediction of protein structure is an active research field in bioinformatics. At present, it is still difficult to predict the spatial structure directly from the primary structure. The successful prediction of protein super-secondary structure is, however, a key step in spatial structure prediction. Protein super-secondary structure motifs are composed of a few regular secondary structural elements connected by loops. These structural motifs play an important role in protein folding and stability because a large number of them exist in protein spatial structures. Generally speaking, the empirical prediction of protein super-secondary structure consists of two parts: one is the prediction of different structural types from amino acid sequences [1][2][3]; the other is the prediction of structural motifs [4][5][6][7]. In this article we concentrate on the latter. The prediction of the beta-hairpin motif is helpful for identifying the fold of a protein of unknown structure. In the past decade, many researchers have explored methods for beta-hairpin prediction [6][7][8][9][10]. However, the features used in these studies were mainly derived from amino acid compositions or dipeptide compositions. In this study, we introduce a novel feature, chemical shifts (CSs), to predict beta-hairpin motifs. The chemical shift describes the local chemical environment of nuclear spins in nuclear magnetic resonance [11]. Consequently, researchers have used it for the determination of biomolecular structures and in molecular dynamics studies [12][13][14][15][16][17]. Moreover, several works have used chemical shifts for protein structure prediction [18][19][20][21][22][23][24][25][26] and for the prediction of protein backbone and side chain torsion angles [27], showing that the chemical shift is a powerful parameter for determining protein structural information.
In this paper, we use CSs as parameters, combined with quadratic discriminant analysis (QDA), to predict beta-hairpin motifs. On the benchmark dataset, using the CSs of six nuclei as features with the QDA algorithm, we achieved a sensitivity of 92%, a specificity of 94% and an overall prediction accuracy of 87% in three-fold cross-validation. For comparison with another parameter, we also performed the prediction using the 20 amino acid compositions (AAC) as inputs to QDA. The results showed that CSs outperform the 20 AAC in the prediction of beta-hairpins. Machine learning algorithms have also been used in the prediction of beta-hairpin motifs [6][7][8][9][10]. Therefore, to test our method and facilitate comparison with other methods, we performed the prediction using the same six CSs as features for the support vector machine (SVM) and random forest (RF) algorithms in the same cross-validation. The comparison showed that QDA is more accurate than the other two algorithms.

Database
All of the CS data used in this paper were retrieved from the re-referenced protein chemical shift database RefDB [28]. The dataset was constructed in the following steps. First, only proteins in RefDB whose sequences are 100% identical to those in the corresponding Protein Data Bank (PDB) files were considered. Second, only proteins with beta-hairpin or beta-link (called not beta-hairpin) motif information in the ArchDB40 database [29] were considered. Third, only proteins with assigned CSs for six nuclei (C, Cα, Cβ, HN, Hα, N) were considered. Finally, we used the PISCES program [30] to remove highly similar sequences. These procedures yielded 123 proteins, of which 87% (107 sequences) have less than 25% sequence identity; the sequence identity of the remainder ranges from 25% to 30%. Because the CSs of all six nuclei had to be available at the same time, we finally obtained from the 123 proteins 157 beta-hairpin fragments, with lengths ranging from 7 to 38 amino acid residues, and 75 not beta-hairpin fragments, with lengths ranging from 8 to 40 amino acid residues. The PDB IDs of the 123 proteins and the CS data of the 157 beta-hairpin fragments and 75 not beta-hairpin fragments are listed in the Supplementary Materials S1-S3 files.

Feature parameter
For each fragment of length $l$ in the two data subsets {beta-hairpin, not beta-hairpin}, we calculated the averaged CS of each of the six nuclei using the following formula:

$$t_m = \frac{1}{l}\sum_{j=1}^{l} CS_m^{j} \qquad (1)$$

Here $l \in [7, 38]$ in the beta-hairpin dataset and $l \in [8, 40]$ in the not beta-hairpin dataset; $m$ = C, Cα, Cβ, HN, Hα, N; $j$ indexes the amino acid positions in the fragment; and $CS_m^{j}$ is the chemical shift of nucleus $m$ at position $j$. A sequence fragment is thereby converted into a six-dimensional vector R: {$t_m$}.
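As an illustration of this averaging step, the following minimal Python sketch converts a fragment into its six-dimensional feature vector. The function name `fragment_features`, the nucleus labels in `NUCLEI` and the toy shift values are hypothetical, and the sketch assumes that every residue in the fragment has an assigned shift for all six nuclei.

```python
from statistics import fmean

# Nucleus order follows the text: C, C-alpha, C-beta, H^N, H-alpha, N.
NUCLEI = ("C", "CA", "CB", "HN", "HA", "N")

def fragment_features(residue_shifts):
    """Return the per-nucleus mean chemical shifts of one fragment.

    residue_shifts: list of dicts, one per residue position j, mapping each
    nucleus label to its chemical shift in ppm.
    """
    if not residue_shifts:
        raise ValueError("fragment must contain at least one residue")
    # One average per nucleus m over all l positions of the fragment.
    return [fmean(res[m] for res in residue_shifts) for m in NUCLEI]

# Toy three-residue fragment with made-up shift values (ppm).
fragment = [
    {"C": 175.0, "CA": 55.0, "CB": 40.0, "HN": 8.5, "HA": 5.0, "N": 120.0},
    {"C": 174.0, "CA": 57.0, "CB": 42.0, "HN": 8.0, "HA": 4.5, "N": 122.0},
    {"C": 176.0, "CA": 56.0, "CB": 41.0, "HN": 9.0, "HA": 4.0, "N": 124.0},
]
features = fragment_features(fragment)
print(features)
```

Each fragment, whatever its length, is thus reduced to a vector of fixed dimension, which is what allows fragments of 7 to 40 residues to be compared by the same classifier.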

Statistical distribution
Under the assumption of a normal distribution, the analysis of variance (ANOVA) can be used to test whether there is a significant difference between two or more groups of samples [19,31] in the database. In this paper, the ANOVA is defined by Eq (2), where $MS_T$, $MS_B$ and $MS_W$ denote the mean squares of the total, between groups and within a group, respectively. The statistic, called the F-value, is the ratio of $MS_B$ to $MS_W$:

$$F = \frac{MS_B}{MS_W} \qquad (3)$$

From Eq (3), the F-value grows as $MS_B$ becomes increasingly larger than $MS_W$. A large F-value therefore indicates significant differences between groups, whereas a small one indicates a lack of differences.
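The F-value can be illustrated with a short Python sketch. It assumes the conventional one-way ANOVA mean squares, with $k - 1$ degrees of freedom between groups and $N - k$ within groups; this convention is an assumption rather than a transcription of Eq (2). The function name and the toy values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, MS_B, MS_W) for a one-way ANOVA over lists of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    # Within-group sum of squares: values around their own group mean.
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)      # assumed degrees of freedom: k - 1
    ms_within = ss_within / (n_total - k)  # assumed degrees of freedom: N - k
    return ms_between / ms_within, ms_between, ms_within

# Made-up average shifts (ppm) of one nucleus in the two groups.
hairpin = [8.0, 8.5, 9.0]
not_hairpin = [7.0, 7.5, 8.0]
f_value, ms_b, ms_w = one_way_anova_f([hairpin, not_hairpin])
print(f_value, ms_b, ms_w)
```

For these toy values the group means are 8.5 and 7.5, giving $MS_B = 1.5$, $MS_W = 0.25$ and $F = 6$.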

Quadratic discriminant analysis (QDA)
As mentioned above [6][7][8][9][10], various parameters such as amino acid compositions and dipeptide compositions have been employed in the prediction of beta-hairpins. Here, we used CSs as features to predict beta-hairpin motifs.
QDA [32][33][34][35] is an effective algorithm that has been widely applied in genomic and proteomic bioinformatics in recent years, so we used it here to perform the prediction.
For a sequence X to be classified, we calculated the averaged CSs of the six nuclei using Eq (1), so that the sequence is converted into a six-dimensional vector R: {$t_m$}. We then integrated the six-dimensional vector by using QDA. Consider classifying X into one of two groups (beta-hairpin, not beta-hairpin). The discriminant function between group $i$ and group $j$ is defined by

$$\xi = \ln\frac{p(w_i|X)}{p(w_j|X)} \qquad (4)$$

According to Bayes' theorem, we deduce

$$\xi = \ln\frac{p_i}{p_j} - \frac{\delta_i - \delta_j}{2} - \frac{1}{2}\ln\frac{|S_i|}{|S_j|} \qquad (5)$$

Set

$$\eta_v = \ln p_v - \frac{\delta_v}{2} - \frac{1}{2}\ln|S_v| \qquad (6)$$

where

$$\delta_v = (R - \mu_v)^{T} S_v^{-1} (R - \mu_v) \qquad (7)$$

Here $v$ = beta-hairpin, not beta-hairpin; $p_v$ denotes the number of samples in group $v$; $\delta_v$ is the squared Mahalanobis distance between R and $\mu_v$ with respect to $S_v$; $\mu_v$ denotes the chemical shift values of the six nuclei R: {$t_m$} averaged over group $v$; and $|S_v|$ is the determinant of the matrix $S_v$. Both $\mu_v$ and $|S_v|$ are calculated on the training set. The six-dimensional vector $\mu_v$ can be written

$$\mu_v = \left(\mu^{v}_{C}, \mu^{v}_{C_\alpha}, \mu^{v}_{C_\beta}, \mu^{v}_{H_N}, \mu^{v}_{H_\alpha}, \mu^{v}_{N}\right)^{T} \qquad (8)$$

$$\mu^{v}_{m} = \frac{1}{p_v}\sum_{n=1}^{p_v} t^{n}_{m} \qquad (9)$$

where $t^{n}_{m}$ denotes the average CS of nucleus $m$ for the $n$-th sequence in group $v$. The covariance matrix $S_v$ is of dimension 6 × 6 and quantifies the correlations between the chemical shifts of the six nuclei:

$$S_v = \left(s^{v}_{mk}\right)_{6 \times 6} \qquad (10)$$

where the element $s^{v}_{mk}$, given by Eq (11), is the covariance between the average CSs of nuclei $m$ and $k$ over the $p_v$ sequences of group $v$. From Eq (6) and Eq (7), we conclude

$$\xi = \eta_i - \eta_j$$

It can easily be proved that $p(w_k|X)$ is the maximum of $p(w_v|X)$ if $\eta_k$ is the largest of the $\eta_v$ ($v$ = beta-hairpin, not beta-hairpin). We then predict that X belongs to group $k$.

Statistical results inevitably fluctuate. To correct the predicted results, we define the coefficient of the error-allowed scope, R, in Eq (12), where $\eta_{corr}$ denotes the value of $\eta$ for the class to which X actually belongs and $\eta_{wro}$ denotes the value of $\eta$ for the other class. With an appropriately chosen R, a sequence X within the error-allowed scope can be classified correctly by using Eq (12).
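The decision rule can be sketched in plain Python as follows. The sketch assumes the standard QDA score $\eta_v = \ln p_v - \delta_v/2 - \frac{1}{2}\ln|S_v|$ built from the quantities defined above ($p_v$, $\delta_v$, $|S_v|$), and it assumes the sample covariance with the $p_v - 1$ normalisation. It does not implement the correction coefficient of Eq (12). All function names are hypothetical, and the toy data are two-dimensional and made up purely to keep the example short; the same code accepts six-dimensional vectors.

```python
import math

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mu):
    # Sample covariance; the (n - 1) normalisation is an assumption.
    n, d = len(rows), len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

def solve_and_logdet(S, b):
    """Solve S x = b by Gaussian elimination with partial pivoting.

    Returns (x, ln|det S|), so neither the inverse nor the determinant
    has to be formed explicitly.
    """
    d = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(S)]
    logdet = 0.0
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        A[c], A[p] = A[p], A[c]
        logdet += math.log(abs(A[c][c]))
        for r in range(c + 1, d):
            f = A[r][c] / A[c][c]
            for k in range(c, d + 1):
                A[r][k] -= f * A[c][k]
    x = [0.0] * d
    for i in reversed(range(d)):
        x[i] = (A[i][d] - sum(A[i][k] * x[k] for k in range(i + 1, d))) / A[i][i]
    return x, logdet

def fit(groups):
    """groups: dict mapping a class label to its training feature vectors."""
    model = {}
    for label, rows in groups.items():
        mu = mean_vector(rows)
        model[label] = (len(rows), mu, covariance(rows, mu))
    return model

def eta(R, p, mu, S):
    diff = [r - m for r, m in zip(R, mu)]
    x, logdet = solve_and_logdet(S, diff)
    delta = sum(a * b for a, b in zip(diff, x))  # squared Mahalanobis distance
    return math.log(p) - delta / 2 - logdet / 2

def predict(model, R):
    # Assign R to the class with the largest score eta_v.
    return max(model, key=lambda v: eta(R, *model[v]))

# Made-up two-dimensional training vectors.
training = {
    "beta-hairpin": [[8.0, 120.0], [8.4, 121.0], [8.8, 120.5], [8.2, 122.0]],
    "not beta-hairpin": [[7.0, 124.0], [7.4, 125.0], [7.2, 123.5], [7.6, 124.5]],
}
model = fit(training)
print(predict(model, [8.3, 121.0]), "|", predict(model, [7.3, 124.2]))
```

Solving the linear system instead of inverting $S_v$ is a design choice made here for numerical convenience; it yields the same $\delta_v$.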

Performance evaluation
In statistical prediction, the jackknife test is considered the most rigorous test method [36] and has been widely used to evaluate the performance of various predictors [37][38][39][40][41]. However, because the jackknife test is time-consuming and the goal of this paper is to introduce a new model for beta-hairpin prediction, we adopted three-fold cross-validation to evaluate the performance of our method. We randomly divided the dataset into three parts, two of which were used for training and the remaining one for testing. The process was repeated three times so that each part served once as the test set, and the final performance was calculated by averaging over the three test sets. The sensitivity (Sn), specificity (Sp), overall accuracy (Acc) and Matthews correlation coefficient (MCC) were used to evaluate the predictive performance of our approach.
These measures are computed from four counts: true positive (TP) denotes the number of correctly predicted beta-hairpin motifs, false negative (FN) denotes the number of beta-hairpin motifs misclassified as not beta-hairpin motifs, false positive (FP) denotes the number of not beta-hairpin motifs misclassified as beta-hairpin motifs, and true negative (TN) denotes the number of correctly predicted not beta-hairpin motifs.
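For reference, the four measures can be computed from these counts as in the sketch below. It uses the conventional textbook definitions, Sn = TP/(TP+FN), Sp = TN/(TN+FP), Acc = (TP+TN)/(TP+FN+FP+TN) and the usual MCC formula, which are an assumption and may differ in detail from the definitions used in this paper. The counts in the example are hypothetical and are not results of this study.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Conventional Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                   # sensitivity (recall of the positive class)
    sp = tn / (tn + fp)                   # specificity (recall of the negative class)
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
print(metrics)
```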

Statistical distribution of the average CSs of six nuclei
We analyzed the average chemical shifts of the six nuclei in the beta-hairpin and not beta-hairpin datasets. As shown in Fig 1, the CSs of the six nuclei are distributed differently in the two datasets. The average chemical shift values of the C, Cα, Cβ, Hα and N nuclei are higher in the not beta-hairpin dataset than in the beta-hairpin dataset, whereas the average chemical shift value of the HN nucleus is lower in the not beta-hairpin dataset. To further investigate whether the average CSs of the six nuclei differ significantly between the two datasets, the analysis of variance (ANOVA) [19,31] can be applied, provided that the data are normally distributed. Although many test statistics are approximately normally distributed for large samples (generally more than 30 samples) under the central limit theorem, we carried out a statistical check to strictly verify the normality assumption. The quantile-quantile (Q-Q) plot or probability-probability (P-P) plot is often used to check the validity of a distributional assumption for a dataset [42]. In a P-P plot, if the data indeed follow the assumed normal distribution, the points fall approximately on the diagonal line. The results demonstrated that the sampling distributions of the six-nuclei CSs obey the normal distribution (see Supplementary Material S4 file), so ANOVA can be applied. Table 1 records the F-values of the six nuclei and the corresponding p-values. All six p-values are less than 0.05. This result shows that the average CSs of the six nuclei differ significantly between beta-hairpin and not beta-hairpin structures, suggesting that beta-hairpin motifs can be discriminated from not beta-hairpin sequences based on the CSs of the six nuclei.

Prediction of beta-hairpin based on the CSs of six nuclei
The results in Table 1 suggest that the CSs of the six nuclei are capable of predicting beta-hairpins. Therefore, we examined the prediction accuracy obtained with the CSs of the six nuclei using the QDA algorithm. For the benchmark dataset, we calculated the average chemical shift values using Eq (1), converting the sequences of the two data subsets into six-dimensional vectors. In the training sets, the determinant and the inverse of the covariance matrix $S_v$ were calculated, together with the six-dimensional mean vector $\mu_v$ of each subset. Given a sequence X in the testing sets, we calculated $\eta_v$ by using Eqs (6)-(11) and compared the results; the class of X was determined by the maximum of $\eta_v$ ($v$ = beta-hairpin, not beta-hairpin). Finally, the coefficient R given in Eq (12) was used to correct the predicted results. The current study used R < 0.2. The results of three-fold cross-validation are listed in Table 2.
From Table 2, we can see that the sensitivity, specificity and total accuracy are 92%, 94% and 87%, respectively, indicating that the chemical shift is a good parameter for beta-hairpin prediction.
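The three-fold protocol behind these figures can be sketched as follows. The sketch assumes a random partition of the pooled 232 fragments (157 + 75); whether the original split was stratified by class is not stated, so this is only one plausible reading. The function name and the seed are hypothetical.

```python
import random

def three_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for a random three-fold partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into three nearly equal folds.
    folds = [idx[i::3] for i in range(3)]
    for t in range(3):
        test = folds[t]
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, test

splits = list(three_fold_splits(157 + 75))
for train, test in splits:
    print(len(train), len(test))
```

Each fragment appears in exactly one test fold, and the reported performance is the average over the three test folds.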
The chemical shift is an easily obtained experimental datum. However, the chemical shift values of a sequence are not always complete, for a multitude of reasons: often, chemical shifts can only be assigned partially or are missing. To assess the impact of incomplete chemical shift assignment and to determine the importance of the chemical shift of each nucleus, we performed the prediction after removing the CS of each of the six nuclei in turn, so that the CSs of the remaining five nuclei served as features for predicting beta-hairpins. The results are listed in Table 3.
Table 3 shows that leaving out any one CS feature degrades the results relative to using all six CSs as features. If all six CSs are used, we reach an overall prediction accuracy of 87% (see Table 2). The absence of one CS leads to a significant decrease in prediction accuracy, ranging from 4% for missing C, Cβ or HN shifts to 15% for missing N shifts. Interestingly, the overall accuracy is worst when the CS of the N nucleus is left out, which illustrates that N is the most important feature for predicting beta-hairpins. According to the overall accuracy, we rank the importance of the nuclei in this paper as N > Hα > Cα > C > HN > Cβ.
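The leave-one-nucleus-out analysis amounts to dropping one column of the feature vectors and ranking the nuclei by the resulting loss in accuracy. A minimal sketch is given below; the helper names are hypothetical, and the accuracies in the example are invented for illustration and are not the values of Table 3.

```python
NUCLEI = ("C", "CA", "CB", "HN", "HA", "N")

def drop_nucleus(vectors, omitted):
    """Remove the column of one nucleus from six-dimensional feature vectors."""
    keep = [i for i, m in enumerate(NUCLEI) if m != omitted]
    return [[v[i] for i in keep] for v in vectors]

def rank_by_accuracy_drop(full_acc, acc_without):
    """Order nuclei from the largest to the smallest accuracy loss when omitted."""
    return sorted(acc_without, key=lambda m: full_acc - acc_without[m], reverse=True)

# One toy feature vector; leaving out N keeps the first five columns.
reduced = drop_nucleus([[175.0, 56.0, 41.0, 8.5, 4.5, 122.0]], "N")

# Invented accuracies for three of the nuclei, for illustration only.
ranking = rank_by_accuracy_drop(0.90, {"C": 0.86, "CA": 0.84, "N": 0.75})
print(reduced, ranking)
```

In a full analysis, each reduced feature set would be passed through the same training and cross-validation procedure as the six-nucleus model before the accuracies are compared.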

Comparison with other feature
To test our method and facilitate comparison with another feature, we used the 20 amino acid compositions (AAC) as inputs to QDA. In this case, $\mu$ is a twenty-dimensional mean vector and $S_v$ denotes the 20 × 20 covariance matrix. The results are also recorded in Table 2. The comparison shows that CSs perform better than the 20 AAC for beta-hairpin prediction.

Comparison with other approaches
Several approaches have been developed for predicting beta-hairpin motifs [7][8][9][10]. However, because they were evaluated on different databases, it is difficult to compare our results directly with other published results. We therefore examined the predictive performance of other algorithms using the same CSs of six nuclei as features. At present, the support vector machine (SVM) and random forest (RF) are arguably the most widely used classification techniques in the life sciences [43][44][45][46]. In this paper, we implemented the SVM and RF algorithms using the R software package. The results are listed in Table 4, which shows that QDA yields the best outcome when the six CSs are used as features. Therefore, we propose using QDA to predict beta-hairpin motifs.

Conclusion
In this paper, we have introduced a model for predicting beta-hairpin motifs based on CSs. By analyzing the statistical distributions of the six-nuclei CSs in the beta-hairpin and not beta-hairpin datasets, we found that the CSs of the six nuclei differ significantly between beta-hairpin and not beta-hairpin motifs. Using six CSs as features with quadratic discriminant analysis in three-fold cross-validation, we achieved the best prediction, namely a sensitivity (Sn) of 92%, a specificity (Sp) of 94% and a total accuracy (Acc) of 87% with a Matthews correlation coefficient (MCC) of 0.85. These results show that the chemical shift is indeed an effective parameter for the prediction of beta-hairpin motifs. Moreover, we performed the prediction by combining the CSs of five different nuclei, and the results show that the CS of each nucleus has a different influence on the prediction of beta-hairpin structures. Our model is both simple and easy to apply. We hope that it will assist the investigation of the topology of protein structures in the near future [47][48][49]. As demonstrated in a series of recent publications on new prediction methods [50][51][52][53], user-friendly and publicly accessible web servers significantly enhance their impact [54]. We shall therefore make efforts in our future work to provide a web server for the prediction method presented in this paper.

Supporting Information
S1 File. 123 proteins used in this paper.