Accurate Prediction of Immunogenic T-Cell Epitopes from Epitope Sequences Using the Genetic Algorithm-Based Ensemble Learning

Background T-cell epitopes play the important role in T-cell immune response, and they are critical components in the epitope-based vaccine design. Immunogenicity is the ability to trigger an immune response. The accurate prediction of immunogenic T-cell epitopes is significant for designing useful vaccines and understanding the immune system. Methods In this paper, we attempt to differentiate immunogenic epitopes from non-immunogenic epitopes based on their primary structures. First of all, we explore a variety of sequence-derived features, and analyze their relationship with epitope immunogenicity. To effectively utilize various features, a genetic algorithm (GA)-based ensemble method is proposed to determine the optimal feature subset and develop the high-accuracy ensemble model. In the GA optimization, a chromosome is to represent a feature subset in the search space. For each feature subset, the selected features are utilized to construct the base predictors, and an ensemble model is developed by taking the average of outputs from base predictors. The objective of GA is to search for the optimal feature subset, which leads to the ensemble model with the best cross validation AUC (area under ROC curve) on the training set. Results Two datasets named ‘IMMA2’ and ‘PAAQD’ are adopted as the benchmark datasets. Compared with the state-of-the-art methods POPI, POPISK, PAAQD and our previous method, the GA-based ensemble method produces much better performances, achieving the AUC score of 0.846 on IMMA2 dataset and the AUC score of 0.829 on PAAQD dataset. The statistical analysis demonstrates the performance improvements of GA-based ensemble method are statistically significant. Conclusions The proposed method is a promising tool for predicting the immunogenic epitopes. The source codes and datasets are available in S1 File.


Methods
In this paper, we attempt to differentiate immunogenic epitopes from non-immunogenic epitopes based on their primary structures. First of all, we explore a variety of sequence-derived features, and analyze their relationship with epitope immunogenicity. To effectively utilize various features, a genetic algorithm (GA)-based ensemble method is proposed to determine the optimal feature subset and develop the high-accuracy ensemble model. In the GA optimization, a chromosome is to represent a feature subset in the search space. For each feature subset, the selected features are utilized to construct the base predictors, and an ensemble model is developed by taking the average of outputs from base predictors. The objective of GA is to search for the optimal feature subset, which leads to the ensemble model with the best cross validation AUC (area under ROC curve) on the training set.

Results
Two datasets named 'IMMA2' and 'PAAQD' are adopted as the benchmark datasets. Compared with the state-of-the-art methods POPI, POPISK, PAAQD and our previous method, the GA-based ensemble method produces much better performances, achieving the AUC score of 0.846 on IMMA2 dataset and the AUC score of 0.829 on PAAQD dataset. The statistical analysis demonstrates the performance improvements of GA-based ensemble method are statistically significant.

Conclusions
The proposed method is a promising tool for predicting the immunogenic epitopes. The source codes and datasets are available in S1 File.

Background
A vaccine is a biological preparation, which stimulates the production of antibodies to induce immunity to a particular disease. There are different types of vaccines. The epitope-based vaccine is a new kind of vaccine that recently attracts the wide interests. Critical components in manufacturing epitope-based vaccines are epitopes, which are designed to trigger the immune responses of T-cells or B-cells.
The intracellular antigen-processing pathway for T-cell immune responses is a complex procedure. At first, antigens are cleaved into short peptides, and some peptides are transported into the endoplasmic reticulum (ER) by the antigen presenting proteins. Then, some peptides will bind to major histocompatibility complex (MHC) molecules and form the MHC-peptide complexes. Finally, the complexes are presented on the cell surface to induce the immune response.
T-cell epitopes are defined as the antigen segments that bind to major histocompatibility molecules. The major histocompatibility complex (MHC) is the cell surface molecules in vertebrates that are encoded by a specified gene family. The MHC molecules are of two sorts: MHC-I and MHC-II. MHC-I molecules usually present epitopes of 9 amino acids, whereas epitopes binding to MHC-II may consist of 12-25 amino acids. In the study, we discuss the MHC-I restricted T-cell epitopes, which are also known as 'CTL epitopes'. In the following context, T-cell epitopes refer to the CTL epitopes.
In the design of vaccines, the primary consideration is to reduce risk and retain capability of inducing immune responses. Immunogenicity is the ability to trigger immune responses. Some studies showed that epitopes have the potential of activating the immune response but have not always been immunogenic. In other words, some epitopes can activate the immune response, and the others cannot. Because epitopes are classified into immunogenic epitopes and non-immunogenic epitopes, the work of predicting immunogenic epitopes is challenging and valuable.
A lot of studies [22][23][24] have been focused on crystal structures of the MHC-peptide complexes, but few useful conclusions were drawn because of the limited number of complex structures. Considering the fact that there are much more epitope sequences than epitope structures in databases, researchers make efforts to develop machine-learning prediction models based on epitope sequences. As far as we know, four machine-learning methods were proposed to predict the immunogenic epitopes. POPI [25] is the first method for immunogenic epitope prediction. This method selected 23 informative amino acid propensities from AAIndex database, and then utilized support vector machine classifier (SVM) to construct the prediction models. POPISK [26] is a SVM-based method using the weighted degree string kernel. PAAQD [27] adopted two novel sequence-derived features (the amino acid pairwise contact potentials and the quantum topological molecular similarity), and the SVM-based predictor was constructed. In the previous work [28], we proposed the average scoring ensemble method by combining seven sequence-derived features. Specifically, individual feature-based classifiers were used as base predictors, and the ensemble model takes the average of the outputs from base predictors for prediction.
To the best of our knowledge, there are some key issues for developing high-accuracy models. Firstly, the accuracy of models is highly dependent on the diversity of features. In order to achieve high-accuracy models, we should consider as many sequence-derived features as possible. Secondly, how to effectively combine various features to build high-accuracy model is very challenging. Considering the redundant information between features, the model using all features does not necessarily lead to the best result than a model using a subset of features. Therefore, we need to solve the problem that which features should be selected for modeling.
In this paper, we address above issues to make accurate predictions for the immunogenic epitopes. In order to obtain the diversity of features, we consider 18 sequence-derived features, which describe the sequential and structural characteristics of epitope sequences. Then, a genetic algorithm (GA)-based ensemble method is proposed to simultaneously determine the optimal feature subset for ensemble learning and develop the high-accuracy ensemble model. In the GA optimization, a chromosome is to represent a subset of features in the search space. For each chromosome, the selected features are adopted to construct base predictors, and then an ensemble model is developed by taking the average of outputs from base predictors. The fitness score of a chromosome is the 10-fold cross validation AUC (area under ROC curve) of the corresponding ensemble model on the training set. After an initial population is generated, the population is updated by three operators: selection, mutation and crossover. The objective of GA is to search for the optimal feature subset in the large search space, which leads to the ensemble model with best AUC score. Compared with the state-of-the-art methods, the proposed GA-based ensemble method yields much better performances on benchmark datasets, indicating it is a promising tool for the immunogenic epitope prediction.

Datasets
To the best of our knowledge, there are several immune databases such as SYFPEITHI [29] and IEDB [30], which contain immunogenic epitopes. SYFPEITHI is a database with more than 7000 MHC-binding peptide sequences. In addition, SYFPEITHI can provide the retrieval of sequences and the epitope prediction. The Immune Epitope Database (IEDB) contains a catalog of experimentally characterized T-cell epitopes. Moreover, the epitope structures, source antigens and epitope-derived organisms are annotated. We retrieved MHC-binding peptides with immunogenicity information from above databases, and compiled them in terms of MHC alleles. However, after removing duplicate entries and high homology sequences, only the MHC allele HLA-A2 contains enough epitope sequences (more than 100 sequences) for the study on the immunogenic epitopes. Therefore, we directly adopt the datasets used in the previous studies [25][26][27][28].
We assess the performances of the proposed method on two publicly available datasets. One dataset has been used in the development of POPISK [26]. It consists of 558 HLA-A2 restricted immunogenic epitopes and 527 non-immunogenic epitopes. This dataset named as 'IMMA2 dataset' is available at http://iclab.life.nctu.edu.tw/POPISK/download.php. The other dataset has been applied to PAAQD [27], and we name it 'PAAQD dataset'. This dataset contains 278 HLA-A2 restricted immunogenic epitopes and 101 non-immunogenic epitopes, available at http://pirun.ku.ac.th/~fsciiok/PAAQD.rar. Two datasets have no duplicate sequences, and IMMA2 dataset contains much more epitope sequences than PAAQD dataset. We use IMMA2 dataset to analyze and evaluate the features, and compare the proposed methods with state-ofthe-art methods by using two benchmark datasets.

Sequence-Derived Features
Our work is to differentiate immunogenic epitopes from non-immunogenic epitopes. The first step for the immunogenic epitope prediction is to represent the protein sequences with certain encoding scheme. In this work, we extract 18 protein sequence-derived features that are commonly used to predict protein functions, with the aim of obtaining diversity. Seven out of 18 features have been utilized for the immunogenic epitope prediction, while the rest are taken into account for the first time.
Amino acid pairwise contact potentials (AAPPs) and quantum topological molecular similarity (QTMS): AAPPs describes the potentials between peptide amino acids and MHC amino acids. QTMS includes both physical and topological properties of peptides. Saethang et al. [27] used AAPPs and QTMS descriptors to represent epitopes in the immunogenic epitope prediction. Details about AAPPs and QTMS are available in [27].
Amino acid composition (AAC): the amino acid composition denotes the percentage of each of 20 amino acids in a sequence [32], and the AAC of a sequence is a 20-dimensional numeric vector.
Amino acid pair profile: the amino acid pair profile [33] of a sequence is a 400-dimensional number vector, and each dimension is the percentage of an amino acid pair pattern.
Sparse profile: by encoding each amino acid type as a 20-bit vector with 19 bits set to zero and one bit set to one, the sparse profile [34] of a sequence is obtained by merging the bit vectors for its amino acids.
Pairwise similarity profile: the pairwise similarity profile [35] of a sequence is represented by a numerical vector, which consists of the Smith-Waterman pairwise similarity scores between it and all sequences in the dataset.
Composition (CTDC), transition (CTDT), and distribution (CTDD): by dividing 20 amino acids into three groups in terms of different amino acid attributes, CTDC, CTDT, and CTDD [36] are respectively used to describe the global percent composition of each group in a sequence, global percent composition of group changes along the entire length of the sequence, and distribution pattern of each group along the sequence.
Autocorrelation: autocorrelation descriptors [37] are defined based on the distribution of amino acid properties along sequences. There are three kinds of autocorrelation, namely Normalized Moreau-Broto autocorrelation, Geary autocorrelation and Moran autocorrelation.
Quasi-sequence-order (QSO): QSO [38] is to incorporate physicochemical distance matrix between each pair of the 20 amino acids, and it reflects the sequence order coupling number based on the physicochemical distance.
Pseudo amino acid Composition (PseAA): PseAA [39] is proposed to avoid losing the sequence-order information. PseAA of a sequence consists of two components. The first component represents amino acid composition while the other component reflects the sequence-order information.
Amphiphilic Pseudo Amino Acid Composition (AmPseAA): AmPseAA [40] is an extension of Pseudo-Amino Acid Composition, and it integrates a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain.
Secondary structures (SS) and relative accessible surface areas (RASA): secondary structures and relative accessible surface areas of residues in a sequence are calculated by the SABLE program [41]. The predicted SS is denoted as H, E or C (helix, sheet, coil). We use (1, 0, 0), (0, 1, 0) and (0, 0, 1) to represent the residue that belongs to three types of secondary structures, respectively. The predicted RASA of a residue is a real value between 0 and 100, representing the percentage of exposed area of the residue over its full area.
These features are summarized in Table 1.

The GA-Based Ensemble Method
Combining informative features helps to make high-accuracy prediction, because various features describe different characteristics of immunogenic epitopes. However, redundant information or unwanted noise is the main concern for feature combination.
In machine learning, the work that combines various features is also known as feature fusion, whose purpose is to exploit features and remove the redundant information. Merging various feature vectors is a simple and widely used feature fusion approach, whereas the ensemble learning is a sophisticated technique. Recently, ensemble learning attracts more and more interests in bioinformatics for their unique advantages in dealing with high-dimensional and complicated data [42][43]. In this paper, we design a genetic algorithm (GA)-based ensemble method for the immunogenic epitope prediction. There are two crucial components for designing the ensemble system, including base predictors and the rule for integrating base predictors. To develop the base predictors, the training sequences can be respectively encoded into different sets of feature vectors by using different features, and the individual feature-based classifiers are built based on the different sets of feature vectors. Here, the random forest (RF) [44] is adopted as the basic classification engine. Then, the individual feature-based RF classifiers are used as the base predictors. Further, we adopt the average scoring ensemble rule to integrate base predictors, and it takes the average of outputs from base predictors to make predictions.
Considering that each feature corresponds to a base predictor, there are 18 candidate base predictors for the ensemble learning. As discussed above, the ensemble based on all features does not always lead to the best result than an ensemble model based on a subset of features.
The key to our ensemble method is to select optimal features for base predictors, which can lead to the best ensemble model. There are 262144 (2 18 ) possible feature subsets for 18 features, and it is difficult to make the exhaustive search in such large search space. The genetic algorithm (GA) is a search approach that mimics the process of natural selection. GA can effectively search the interesting space and easily solve complex problems without requiring the prior knowledge about the space and the problem. Here, we adopt GA to simultaneously determine the optimal feature subset and build the high-accuracy ensemble model.  of the corresponding feature. A good fitness function is essential to assessing the performance of each chromosome. For each chromosome, base predictors are built by using selected features in the chromosome, and then an average scoring ensemble model is developed. An internal 5-fold cross validation is implemented for the ensemble model on the training set, and AUC of the ensemble model is taken as the fitness score of the chromosome. After randomly generating an initial population, the population is updated by three operators: selection, crossover and mutation, according to the fitness scores of chromosomes. GA optimization is to search for the chromosome that maximizes the AUC score.
Here, we use the Matlab GA toolbox and Matlab random forest toolbox (http://code.google. com/p/randomforest-matlab/) to implement the GA optimization and random forest classifiers. The default parameters are adopted for random forests.

Performance Evaluation Metrics
The proposed methods are constructed on the benchmark datasets, and evaluated by the 10-fold cross-validation (10-CV). In the 10-CV, a dataset is randomly split into 10 subsets with equal size. In each fold, one subset is used as the testing data and the rest is treated as training data. The prediction model is trained on the training data, and then it is applied to the testing data. This procedure is repeated until every subset is ever used for testing.
Here, we adopt several metrics to measure the performances of prediction models, namely sensitivity (SN), specificity (SP), accuracy (ACC), Matthew's correlation coefficient (MCC) and the area under the ROC curve (AUC). These metrics are defined as follows.
where TP, TN, FP and FN are the number of true positives, the number of true negatives, the number of false positives and the number of false negatives. The ROC curve is plotted by using the false positive rate (1-specificity) against the true positive rate (sensitivity) for different cutoff thresholds. AUC evaluates the performances regardless of any threshold, and it is adopted as the primary metric in this study.

Evaluation of Various Features
Before constructing prediction models, we investigate the sequence-derived features and analyze their capabilities of discriminating immunogenic epitopes from nonimmunogenic epitopes.
As shown in Table 1, the dimensions of six sequence-derived features (e.g., indexed from F11 to F16) are related to a parameter λ, while the rest of features have fixed dimensions. For the features indexed from F11 to F13, the parameter λ denotes the lag of the autocorrelation; while for the features indexed from F14 to F16, the parameter λ denotes the additional length or dimensions of sequence order factors. Note that λ is an integer and 1< = λ< = L-1, where L is the sequence length (e.g., L = 9 in our work). To test the impact of λ on six features, the prediction models are constructed based on these features, for which different values of λ ranging from 1 to 8 are adopted. We conduct 20 independent runs of 10-CV for each model to avoid the bias of the data split, and the mean scores demonstrated in Table 2. It is observed that six features are to some extent influenced by the parameter λ. Take the Geary autocorrelation as an example, the variation of λ can cause a difference of approximately 7% AUC. The optimal values of λ that yield best results are used for six features in the following context.
Further, 18 sequence-derived features are evaluated on the IMMA2 datasets. More specifically, prediction models are respectively constructed by encoding epitope sequences with various features, and the AUC scores of individual feature-based models help to rank features. As shown in Table 3, AUC scores of individual feature-based models range from 0.58 to 0.78. Accordingly, the performances of these features are categorized into three groups: 'Exceptional' (>0.7),'Good' (< = 0.7 and >0.6) and 'Poor' (< = 0.6). In general, there are 10 'exceptional' features, 7 'good' features and 1 'poor' feature. The predicted relative accessible surface area (RASA) yields the best results, indicating the structural information is the most important clue that recognizes the immunogenic epitopes.

Analysis on Redundant Information between Features
The sequence-derived features display different capability of predicting immunogenic epitopes. However, the redundant information or correlation between features may have the negative impact on combining these features.
To investigate the correlations between features, the individual feature-based models are evaluated on IMMA2 dataset by 20 independent runs of 10-CV, and Pearson correlation coefficients are calculated between the AUC scores of any two individual feature-based models. In statistics, Pearson correlation coefficient is a widely used measure of the linear dependence between two variables. The correlation coefficient is a real value between -1 and 1, where 1 is totally positive correlation, 0 is no correlation, and −1 is totally negative correlation. Here, we take the absolute values of correlation coefficients to measure the redundant information or correlations between features. The results in Table 4 show some features are highly correlated, such as F6 and F1, F6 and F7, indicating the difficulty of utilizing these features.
In order to test the negative impact of redundant information, we directly combine the features by merging feature vectors. Here, seven feature combinations are generated by adding one feature once, in descending order of their AUC scores presented in Table 3. For simplicity, we do not test all feature combination, but it would not influence our analysis. For each feature combination, different groups of feature vectors are merged, and then the prediction model is constructed. According to Table 5, the model based on AAPPs and RASA yields the best results. Therefore, we can conclude that more features do not necessarily lead to better performance.

Performances of GA-based Ensemble Method
Given a variety of sequence-derived features, selecting features that are used for base predictors is the key to building high-accuracy ensemble model. Therefore, we apply GA to determine the optimal subset and develop the ensemble model. The configurations for GA are described as follows. The initial population has 100 chromosomes. In the update of population, the elitist strategy is used for the selection operator, and default parameters in the Matlab GA toolbox are adopted for the mutation probability and crossover probability. The population update may terminate when the change of best fitness scores is less than the default value of 1E-6 or the max generation number of 100 is reached.
The results of GA-based ensemble method on IMMA2 dataset and PAAQD dataset are given in Table 6. The GA-based ensemble method produces the AUC score of 0.846 on IMMA2 dataset and AUC score of 0.829 on PAAQD dataset. We compare the ensemble models with the individual feature-based models, according to Table 2 and Table 6. Clearly, the GA-based ensemble method produces much better results, indicating this ensemble approach can effectively combine various features to enhance performances.
Further, we analyze the optimal subsets that are identified by GA. In each fold of 10-CV, the optimal feature subset is automatically determined by internal cross validation on the training set. There are 200 optimal feature subsets (20×10) for 20 independent runs of 10-CV. Here, we count the frequencies of features appearing in the optimal subsets, and then we may have some useful observations from the results in Table 7. Firstly, these optimal feature subsets are different, because of the different training data. Secondly, the optimal feature subsets do not consist entirely of the highly ranked features. Instead, an optimal subset may include some 'poor' features. For example, the secondary structure (F18) is ever included in 18 optimal subsets. Thirdly, size of optimal subsets ranges from 2 features to 5 features. In conclusion, the optimal feature subset for the ensemble model depends on the training data, and identifying the optimal feature subset is necessary for building high-accuracy models.

Comparison with Benchmark Methods
As far as we know, four computational methods (POPI [25], POPISK [26], PAAQD [27] and our previous method [28]) were proposed to predict immunogenic epitopes. Therefore, we adopt these methods as benchmark methods for comparison.
In this paper, we compare the proposed GA-based ensemble method with four state-of-theart methods on IMMA2 dataset and PAAQD dataset. The models are evaluated by 10-fold cross validation. In addition, the significance of improvements is tested by statistical techniques. First of all, we have to obtain the 10-CV performances of different methods. The source codes for POPI and POPISK are not publicly available, and it is difficult to correctly replicate these methods because details are ambiguous. For the fair comparison, we have to directly adopt their results, which were reported in [25,26]. Although 20 independent runs of 10-CV were ever implemented for POPI and POPISK on IMMA2 dataset, only the average scores are available in the publications, and some scores such as sensitivity and specificity are not provided. The standalone R package of PAAQD can be downloaded at http://pirun.ku.ac.th/~fsciiok/ PAAQD.rar, and we could replicate PAAQD to obtain the results on IMMA2 dataset and PAAQD dataset. Therefore, we conduct 20 independent runs of 10-CV for PAAQD, our previous method and GA-based ensemble method to obtain the scores.
As shown in Table 8, POPI [25], POPISK [26], PAAQD [27] and our previous method [28] produce the AUC scores of 0.64, 0.74, 0.747 and 0.766 on the IMMA2 dataset, while the GAbased ensemble method produces the AUC score of 0.846. The GA-based ensemble method also yields much better performances than PAAQD [27] and our previous method [28] on the PAAQD dataset, enhancing the AUC scores from 0.773 to 0.829.
Although the GA-based ensemble method yields much better performances than benchmark methods, we further adopt the statistical technique to test the significance of performance improvements. Since we implemented 20 independent runs of 10-CV for PAAQD method, our previous method and GA-based ensemble method, we obtain 20 samples for each method by considering the AUC score in a run as a sample. Therefore, the paired t-test is used to compare  Index  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  F12  F13  F14  F15  F16  F17  F18   Frequencies  4  0  17  3  145  77  97  0  1  30  0  7  0  2  1  24  200  11 doi:10.1371/journal.pone.0128194.t007 GA-based ensemble method with PAAQD and our previous method. As discussed above, only the average scores of 20 runs were given for POPI and POPISK [25,26]. Following the work of PAAQD [27], we have to take the one-sample t-test to test 20 samples of the GA-based ensemble method against the average AUC scores of POPI and POPISK. The results in Table 9 indicate that the improvements of the GA-based ensemble method over benchmark methods are statistically significant (P<0.05 in terms of AUC scores). There are some reasons for the superior performances of the GA-based ensemble method. Firstly, we consider a great number of sequence-derived features, which can guarantee the diversity for ensemble learning. Secondly, the GA-based ensemble method automatically determines the optimal feature subsets for base predictors, for the purpose of incorporating the useful information and reducing the information redundancy. Thirdly, the ensemble model equally treats base predictors, and the ensemble rule does not use any prior knowledge that may affect the performances.

Conclusion
The prediction of T-cell immunogenic epitopes is a crucial task in the immunoinformatics, which can facilitate the epitope-based vaccine design. In this paper, we explore a wide variety of sequence-derived features that make differences between immunogenic epitopes and nonimmunogenic epitopes. Then, we propose the GA-based ensemble method to automatically select the optimal feature subset for base predictors, and develop the ensemble model for the immunogenic epitope prediction. When compared with the state-of-the-art methods POPI, POPISK, PAAQD and our previous method, the GA-based ensemble method produces much better performances on the benchmark datasets, and the t-test analysis demonstrates that the performance improvements of the GA-based ensemble method are statistically significant. Nevertheless, there are some remaining spaces for improving our work. Currently, only the MHC allele HLA-A2 has sufficient immunogenic epitopes for the computational work. In the near future, more immunogenic epitopes will be available, and the computational experiments would be conducted on other MHC alleles. The source codes and datasets are available in supporting information files (S1 File).
Supporting Information S1 File. The source codes and datasets for the GA-based ensemble method. (ZIP)