SVMTriP: A Method to Predict Antigenic Epitopes Using Support Vector Machine to Integrate Tri-Peptide Similarity and Propensity

Identifying protein surface regions preferentially recognizable by antibodies (antigenic epitopes) is at the heart of new immuno-diagnostic reagent discovery and vaccine design, and computational methods for antigenic epitope prediction provide crucial means to serve this purpose. Many linear B-cell epitope prediction methods were developed, such as BepiPred, ABCPred, AAP, BCPred, BayesB, BEOracle/BROracle, and BEST, towards this goal. However, effective immunological research demands more robust performance of the prediction method than what the current algorithms could provide. In this work, a new method to predict linear antigenic epitopes is developed; Support Vector Machine has been utilized by combining the Tri-peptide similarity and Propensity scores (SVMTriP). Applied to non-redundant B-cell linear epitopes extracted from IEDB, SVMTriP achieves a sensitivity of 80.1% and a precision of 55.2% with a five-fold cross-validation. The AUC value is 0.702. The combination of similarity and propensity of tri-peptide subsequences can improve the prediction performance for linear B-cell epitopes. Moreover, SVMTriP is capable of recognizing viral peptides from a human protein sequence background. A web server based on our method is constructed for public use. The server and all datasets used in the current study are available at http://sysbio.unl.edu/SVMTriP.


Introduction
By secreting antibodies against antigens, B-cells play an important role in the immune system's fight against invasive pathogenic organisms or substances. Antigenic epitopes are regions of the protein surface that are preferentially recognized by B-cell antibodies [1]. Prediction of antigenic epitopes is useful for investigating the mechanisms of the body's self-protection systems and can aid the design of vaccine components and immuno-diagnostic reagents [2].
Usually, B-cell antigenic epitopes are classified as either continuous or discontinuous. A continuous (also called linear) epitope is a consecutive fragment from the protein sequence; a discontinuous epitope is composed of several fragments that are scattered along the protein sequence but still form an antigen-binding interface in 3D. The boundary between continuous and discontinuous epitopes is vague; a continuous fragment in a discontinuous epitope can be considered as a continuous epitope. Currently, the majority of available epitope prediction methods focus on continuous epitopes due to the relative simplicity of the problem and the convenience of available investigation methods, in which the amino acid sequence of a protein is taken as the input. Such prediction methods are based upon amino acid properties including hydrophilicity [3,4], solvent accessibility [5], secondary structure [6], flexibility [7], and antigenicity [8]. In addition, building on epitope databases such as IEDB [9], Bcipep [10], and FIMM [11], some methods use machine learning approaches, such as Hidden Markov Model (HMM) [12], Artificial Neural Network (ANN) [13], and Support Vector Machine (SVM) [14,15], to locate linear epitopes. Available predictors include PREDITOP [8,16], PEOPLE [17], BEPITOPE [18], BepiPred [12], ABCPred [13], AAP [14], BCPred [15], BayesB [19], BEOracle/BROracle [20], and BEST [21].
In this work, a new linear B-cell epitope prediction method is developed using the SVM method to integrate the Tri-peptide similarity and Propensity scores (SVMTriP). SVMTriP is tested for varied epitope sequence lengths. With the five-fold cross-validation, SVMTriP achieves a sensitivity (Sn) of 80.1% and a precision (P) of 55.2% for sequences with 20 amino acids (AA), which are higher than those of AAP [14] and BCPred [15].

Prediction performance
SVMTriP is trained and tested with different epitope lengths, and for each length the SVM parameters have their own optimal values. For example, for the 20AA-length case, SVMTriP reaches its optimal performance at c = 32, g = 0.05, and p = 0.5 for the SVM model, with Sn = 80.1% ± 2.1% and P = 55.2% ± 1.0% at the point with the maximal F-measure, 0.693. All results are shown in Table 1. Although SVMTriP reaches its maximal F-measure at different points for different lengths of epitope sequences, the precision values for different lengths are similar. The sensitivity increases significantly as the length of the epitope sequences increases. The areas under the receiver operating characteristic curves (AUC) range from 0.674 to 0.702. Based on the results of multiple evaluation methods (Table 1), SVMTriP performs best for the 18AA- and 20AA-length cases. However, most of the experimentally determined epitopes from IEDB [9] have fewer than 20 AA residues. One possible reason why SVMTriP favors long sequences is that a long sequence contains more tri-peptides and may therefore show a more detectable frequency tendency. Another possibility is that the epitopic amino acid residues in experimentally determined epitopes are subsets of all real epitopic residues. Based on the testing results, 20AA is set as the default epitope length for SVMTriP to search for putative epitopes on the web server.
For comparison, AAP and BCPred are implemented locally based on their method descriptions [14,15], and trained/tested with the same dataset and the same five-fold cross-validation procedure for the 20AA case. The results are listed in Table 2. Compared with BCPred and AAP, SVMTriP has a similar precision value but significantly improved sensitivity at the point with the maximal F-measure. Figure 1 shows the receiver operating characteristic (ROC) curves for the three methods, from which one may notice that SVMTriP has a significantly larger true positive rate than BCPred and AAP in the region of low false positive rate. The AUC values are 0.667, 0.667, and 0.702 for AAP, BCPred, and SVMTriP, respectively. The AUC value of SVMTriP is significantly higher than those of the other two methods; the p-values of the comparisons against AAP and BCPred are $2.17 \times 10^{-5}$ and $1.58 \times 10^{-5}$, respectively.
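The AUC has a direct probabilistic reading: it is the chance that a randomly chosen epitope receives a higher score than a randomly chosen non-epitope. The following is a minimal sketch of that definition, not the program used in this study; the function name `auc_from_scores` and the example scores are hypothetical.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties contribute one half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical classifier scores for three epitopes and three non-epitopes.
    print(auc_from_scores([0.9, 0.8, 0.3], [0.5, 0.2, 0.1]))
```

The quadratic loop is kept for clarity; for datasets of several thousand sequences a rank-based formulation would be faster.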

Top weighted tri-peptides
The prediction model relies on the occurrence-frequency distribution of tri-peptides in the tri-peptide space, i.e. all combinations of any three amino acids. Table 3 lists the tri-peptides with the top 20 weights in the optimal SVM model for 20AA-length epitopes. All of the top-ranked tri-peptides contain Glutamine or Proline, whereas the occurrence frequencies of Glutamine and Proline in known linear epitopes (20AA) are only 8.1% and 6.84%, respectively. In the background of all proteins, the occurrence frequencies of Glutamine and Proline are 3.84% and 3.44% [22], which are not significantly different from the values in linear epitopes. However, the distribution patterns of the combined amino acids are quite different between epitopes and non-epitope peptides. Therefore, the tri-peptides containing Glutamine or Proline may play an important role in epitope recognition by B-cell antibodies. The SVMTriP algorithm uses this difference to distinguish linear epitopes from other parts of protein sequences.

Tendency of prediction between virus and human proteins
Independent testing of different epitope prediction methods is challenging because of the limited number of known epitopes. In this study, we devise an alternative independent test method. In the training set, most epitopes are from viruses or bacteria, and their corresponding antibodies are mainly human antibodies. A basic property of the human immune system is its capability to distinguish pathogenic agents, viral or bacterial, from the innate structures of the human body. All known B-cell epitopes in the training set came from the response of the whole immune system, including the response of CD4 T helper cells. To simulate the human immune system, a successfully trained epitope prediction method should behave in the same way, i.e. be able to distinguish pathogenic proteins from human proteins. In other words, a successful prediction algorithm should preferentially score virus proteins higher than human proteins. To implement this test, $10^5$ 20AA-length peptides are collected from virus and human proteins: $5 \times 10^4$ peptides are randomly selected from 391,466 virus proteins and the others from 81,967 human proteins in the Refseq protein database [23]. AAP, BCPred, and SVMTriP are applied to these virus and human peptides, and the top-ranked peptides are returned. The fractions of virus peptides among different numbers of returned peptides are shown in Figure 2. In Table 1

Discussion
Prediction with tri-peptide propensity alone
The propensity of tri-peptides alone is tested and the result is shown in Table 4. The prediction sensitivity and precision are 56.5% and 61.0%, respectively, similar to those of AAP, which is based on bi-peptide propensity and yielded a sensitivity of 59.8% and a precision of 58.5% for the same test set. This result indicates that combining similarity scores is essential for the tri-peptide model to achieve a better performance.

Prediction with tri-peptide similarity alone
The tri-peptide similarity scores can be calculated with either the Blosum62 or the PAM160 matrix. The performance of the two matrices for the tri-peptide model is evaluated with the same five-fold cross-validation procedure for 20AA-length epitopes. The results are shown in Table 4. Without the propensity score, the Blosum62 matrix performs similarly to the PAM160 matrix. However, when combined with the propensity score, the Blosum62 matrix leads to a higher prediction performance.

Discrete tri-peptide subsequence models
We also implement a method that uses the space of tetrapeptide subsequences with one mismatch, i.e. discrete tri-peptide subsequences. For this case, the subsequences are considered in patterns like A_AA or AA_A, where 'A' represents an amino acid residue to be considered, and '_' represents a residue position that is ignored in the comparison. The number of SVM attributes is still $20^3$, which is identical to that of the tri-peptide model. Interestingly, without considering propensity scores, the subsequence models of the A_AA and AA_A patterns have sensitivity and precision similar to those of the tri-peptide model. However, combining similarity and propensity significantly enhances the performance of the tri-peptide model, while adding the propensity does not increase sensitivity or precision for the A_AA and AA_A patterns. The result is shown in Table 4. This finding indicates that the propensity is more important for the tri-peptide model than for the discrete tri-peptide subsequence model.
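As an illustration of the A_AA and AA_A patterns, the sketch below extracts discrete tri-peptides from a sequence by sliding a four-residue window and dropping the position marked '_'. The function name `gapped_tripeptides` is hypothetical, and the code only shows the pattern extraction, not the SVM model built on it.

```python
def gapped_tripeptides(sequence, pattern):
    """Extract subsequences from `sequence` according to `pattern`.

    `pattern` uses 'A' for a residue position that is kept and '_' for a
    position that is ignored, e.g. 'A_AA', 'AA_A', or 'AAA' for ordinary
    contiguous tri-peptides.
    """
    keep = [k for k, ch in enumerate(pattern) if ch != "_"]
    width = len(pattern)
    return [
        "".join(sequence[start + k] for k in keep)
        for start in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    print(gapped_tripeptides("ACDEF", "A_AA"))  # skips the 2nd residue of each window
    print(gapped_tripeptides("ACDEF", "AA_A"))  # skips the 3rd residue of each window
```

Because each pattern still yields three-letter strings over the 20 amino acids, the attribute space keeps the same size as in the contiguous tri-peptide model.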

Conclusion
The performance for linear B-cell epitope prediction is improved by concurrently using similarity and propensity of the tri-peptide subsequences.
Table 3. Weights of tri-peptides in the optimal SVM model.

Datasets
The dataset is constructed by extracting non-redundant linear B-cell epitopes from IEDB [9], because it is frequently updated and has a large number of linear epitopes. A total of 65,456 B-cell linear epitopes are downloaded from IEDB (version June 11th, 2012). Identical epitopes and those possibly related to T-cells are removed. The full-length sequences of the corresponding antigens are also collected. Epitope sequences of various lengths, including 10AA, 12AA, 14AA, 16AA, 18AA, and 20AA, are obtained by trimming long experimentally measured epitopes or by attaching more amino acid residues to both ends of short epitopes according to the full-length sequences. For a given length, epitope sequences with ≥30% similarity, measured by BLAST [24], are clustered together and only one of them is kept as an epitope sequence in the dataset. Finally, the dataset for each length has a total of 4925 non-redundant epitope sequences. For the negative dataset, the same number of equal-length subsequences is extracted from the non-epitopic segments of the corresponding antigen sequences.
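The trimming and extension step can be sketched as follows. The exact rule is not spelled out in the text, so this sketch assumes the fixed-length window is centred on the annotated epitope and shifted inward when it would run past either end of the antigen; the function name `fit_to_length` and the 0-based, end-exclusive coordinates are also assumptions.

```python
def fit_to_length(full_seq, start, end, target_len):
    """Return a `target_len` window of `full_seq` centred on the epitope.

    `start` is inclusive and `end` exclusive (0-based). Short epitopes are
    extended on both sides using the antigen sequence; long ones are trimmed
    on both sides. The window is clamped to the sequence boundaries.
    """
    if not 0 <= start < end <= len(full_seq):
        raise ValueError("epitope coordinates are outside the sequence")
    if target_len > len(full_seq):
        raise ValueError("antigen is shorter than the requested length")
    # (start + end) is twice the epitope centre; integer maths avoids floats.
    new_start = (start + end - target_len) // 2
    new_start = max(0, min(new_start, len(full_seq) - target_len))
    return full_seq[new_start:new_start + target_len]


if __name__ == "__main__":
    antigen = "ABCDEFGHIJKLMNOPQRST"  # placeholder letters, not a real protein
    print(fit_to_length(antigen, 8, 12, 10))   # 4-residue epitope extended to 10
    print(fit_to_length(antigen, 2, 16, 10))   # 14-residue epitope trimmed to 10
```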

Support Vector Machine Setup
Attribute encoding. The tri-peptide subsequence space is used to encode the SVM attributes. This kernel has a space of $20^3$ attributes for both tri-peptide substring and propensity. The score of the i-th attribute, K(i), is defined in Equation (1) as the tri-peptide subsequence similarity kernel, T(i), modulated by the corresponding tri-peptide subsequence propensity, P(i), of the i-th tri-peptide subsequence. The tri-peptide subsequence similarity kernel is defined as:
$$T(i) = \sum_{j} W(i) : V_j \qquad (2)$$
where W(i) denotes the tri-peptide that represents the i-th attribute, and $V_j$ denotes the j-th tri-peptide in the tri-peptide subsequence space of the input sequence. The symbol ':' denotes the similarity score of two tri-peptides, i.e. the sum of the three similarity scores for the three aligned amino acid pairs taken from a Blosum/PAM matrix. For example, if the length of a given epitope candidate is 20 AA, the tri-peptide subsequence similarity kernel for the i-th attribute is generated by summing the similarity scores of 18 pairs of tri-peptides; each pair consists of one tri-peptide from the input sequence and the tri-peptide that represents the i-th attribute in the tri-peptide subsequence space. This subsequence kernel was previously used to predict protein subcellular localization by Lei and Dai [25]. The propensity of the tri-peptide subsequence representing the i-th attribute is calculated as in Equation (3) from f(i) and F(i), where f(i) is the frequency of the i-th type of tri-peptide in the positive epitopes, and F(i) is the frequency of the i-th type of tri-peptide in $5 \times 10^4$ protein sequences randomly selected from the Refseq database [23].
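The attribute encoding can be illustrated with a small sketch. Several parts are assumptions rather than details confirmed by the text: "modulated" is read here as multiplication of the similarity kernel by the propensity, the helper `toy_similarity` (1 for identical residues, 0 otherwise) is a stand-in for a real Blosum62 or PAM lookup, and a two-letter alphabet replaces the 20 amino acids so that the output stays readable.

```python
from itertools import product


def tripeptides(seq):
    """All overlapping tri-peptides of `seq` (a 20AA sequence yields 18)."""
    return [seq[j:j + 3] for j in range(len(seq) - 2)]


def pair_score(w, v, sim):
    """The ':' operation: sum of three residue-pair similarity scores."""
    return sum(sim(a, b) for a, b in zip(w, v))


def similarity_kernel(seq, w, sim):
    """T(i): similarity of attribute tri-peptide `w` summed over the input."""
    return sum(pair_score(w, v, sim) for v in tripeptides(seq))


def attribute_vector(seq, alphabet, sim, propensity):
    """K(i) for every tri-peptide over `alphabet`.

    ASSUMPTION: K(i) = T(i) * P(i); tri-peptides missing from `propensity`
    get a neutral weight of 1.0.
    """
    return {
        w: similarity_kernel(seq, w, sim) * propensity.get(w, 1.0)
        for w in ("".join(t) for t in product(alphabet, repeat=3))
    }


def toy_similarity(a, b):
    """Hypothetical stand-in for a substitution-matrix lookup."""
    return 1 if a == b else 0


if __name__ == "__main__":
    vec = attribute_vector("AAAC", "AC", toy_similarity, {"AAA": 2.0})
    for tri, score in sorted(vec.items()):
        print(tri, score)
```

With the full amino acid alphabet the same loop produces the $20^3$ = 8000 attributes described above.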
Training/Prediction procedure. The SVM training in this work uses the SVMlight package implemented by Joachims (http://svmlight.joachims.org/) [26]. All SVM parameters are optimized by a grid search (c = 2 210,21 , g = 2 212,23 , and p = 2 25,22 ). For each grid point of the triplets, a five-fold cross-validation procedure is employed to evaluate the performance of the trained SVM model. To carry out the five-fold validation procedure, the total of 4925 positive epitopes is split into five groups such that any two epitope sequences from two different groups have no more than 20% sequence similarity. At each triplet point, the maximum F-measure is calculated. The optimal parameter set is the grid point with the largest maximum F-measure. During the five-fold validation procedure, the five test results are used to calculate the mean values and 95% confidence intervals of sensitivity, precision, and the maximal F-measure. For the online server, the prediction model is obtained by training on the whole dataset with equal numbers of positive and negative epitopes. To make predictions for a given full-length protein sequence, the sliding window method is employed to obtain subsequences of variable lengths, including 10AA, 12AA, 14AA, 16AA, 18AA, and 20AA. For each subsequence, SVMTriP calculates a score, and a positive score indicates that the subsequence is a putative antigenic epitope.
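The sliding-window scan used at prediction time can be sketched as below. This is an illustration under assumptions, not the server's code: `score_fn` is a hypothetical placeholder for the trained SVM model, and the function names are invented for this example.

```python
def sliding_windows(sequence, lengths=(10, 12, 14, 16, 18, 20)):
    """Yield (start, length, subsequence) for every window of each length."""
    for length in lengths:
        for start in range(len(sequence) - length + 1):
            yield start, length, sequence[start:start + length]


def putative_epitopes(sequence, score_fn, lengths=(10, 12, 14, 16, 18, 20)):
    """Keep the windows whose score is positive.

    `score_fn` is a placeholder for the trained model: it takes a
    subsequence and returns a real-valued score.
    """
    return [
        (start, length, sub)
        for start, length, sub in sliding_windows(sequence, lengths)
        if score_fn(sub) > 0
    ]


if __name__ == "__main__":
    # Dummy scorer for demonstration only: favours windows rich in 'P'.
    demo_score = lambda sub: sub.count("P") - 3
    protein = "MKTPPQPLPSPAGTRRLLVVAAPPQQPPEE"  # made-up sequence
    for hit in putative_epitopes(protein, demo_score, lengths=(10,)):
        print(hit)
```

Sequences shorter than a given window length simply produce no windows for that length.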

Evaluation methods
The statistical terms, sensitivity (Sn), precision (P), and F-measure, are defined in the following equations:
$$Sn = \frac{TP}{TP + FN}$$
$$P = \frac{TP}{TP + FP}$$
$$F\text{-measure} = \frac{2 \times Sn \times P}{Sn + P}$$
where TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively. The F-measure is used to determine the optimal prediction results. A Java program available at http://pages.cs.wisc.edu/~richm/programs/AUC/ is used to calculate the AUC. The online tool StAR [27,28] is used to test whether the difference between the ROC curves resulting from two methods is statistically significant.
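These statistics can be computed directly from confusion-matrix counts. The sketch below assumes the standard definition of the F-measure as the harmonic mean of sensitivity and precision; the function names and the example counts are hypothetical and are not taken from the tables in this study.

```python
def sensitivity(tp, fn):
    """Fraction of true epitopes that are predicted as epitopes."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of predicted epitopes that are true epitopes."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(sn, p):
    """Harmonic mean of sensitivity and precision."""
    return 2 * sn * p / (sn + p) if (sn + p) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only.
    tp, fn, fp = 80, 20, 60
    sn, p = sensitivity(tp, fn), precision(tp, fp)
    print(f"Sn={sn:.3f}  P={p:.3f}  F={f_measure(sn, p):.3f}")
```

Sweeping a score threshold and keeping the threshold with the largest `f_measure` gives the "point with the maximal F-measure" used to select results.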