Abstract
Cancer immunotherapy enhances the body’s natural immune system to combat cancer, offering the advantage of reduced side effects compared to traditional treatments because of its high selectivity and efficacy. Utilizing computational methods to identify tumor T cell antigens (TTCAs) is valuable in unraveling the biological mechanisms and enhancing the effectiveness of immunotherapy. In this study, we present ENCAP, a predictor for TTCAs based on ensemble classifiers and diverse sequence features. Sequences were encoded as feature vectors of 4349 entries based on 57 different feature types, followed by feature engineering and hyperparameter optimization for machine learning models. The selected feature subsets of ENCAP are primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. Two publicly available datasets were used for performance evaluation. ENCAP yields an AUC (Area Under the ROC Curve) of 0.768 and an MCC (Matthews Correlation Coefficient) of 0.522 on the first independent test set. On the second test set, it achieves an AUC of 0.960 and an MCC of 0.789. Performance evaluations show that ENCAP generates 4.8% and 13.5% improvements in MCC over the state-of-the-art methods on two popular TTCA datasets, respectively. For the third test dataset of 71 experimentally validated TTCAs from the literature, ENCAP yields a prediction accuracy of 0.873, achieving improvements ranging from 12% to 25.7% compared to three state-of-the-art methods. In general, the prediction accuracy is higher for sequences with fewer hydrophobic residues and more hydrophilic and charged residues. The source code of ENCAP is freely available at https://github.com/YnnJ456/ENCAP.
Citation: Yu J-C, Ni K, Chen C-T (2024) ENCAP: Computational prediction of tumor T cell antigens with ensemble classifiers and diverse sequence features. PLoS ONE 19(7): e0307176. https://doi.org/10.1371/journal.pone.0307176
Editor: Shahid Akbar, Abdul Wali Khan University Mardan, PAKISTAN
Received: March 22, 2024; Accepted: July 1, 2024; Published: July 18, 2024
Copyright: © 2024 Yu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code of ENCAP and the datasets are freely available at https://github.com/YnnJ456/ENCAP.
Funding: This work was supported by National Science and Technology Council, Taiwan under Grant NSTC 112-2221-E-468-019. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Cancer is a major public health challenge that is becoming increasingly prevalent. It is the leading cause of death in many countries and is a major barrier to increasing life expectancy [1]. According to GLOBOCAN (Global Cancer Statistics) 2020, there were 19.3 million new cancer cases diagnosed in 2020 and almost 10.0 million cancer deaths [2]. GLOBOCAN predicts that the number of cancer cases will increase to 28.4 million in 2040. Modern cancer therapies such as chemotherapy, radiation therapy, hormonal therapy, and targeted therapy are lethal to tumor cells, but they can destroy normal cells in the body and produce harmful reactions such as nausea, hair loss, and fatigue [3–6]. In addition, cancer cells can develop resistance to these therapies over time, making it difficult to treat the cancer.
Targeted immunotherapy is one of the promising treatment options because of its high selectivity, efficacy, and reduced side effects [6]. T cells have the ability to identify and eliminate tumor antigens, which are presented by major histocompatibility complex (MHC) class I and class II molecules situated on the surface of antigen-presenting cells [7]. Previous studies show that peptides derived from tumor-associated antigens can induce a peptide-specific T cell immune response which destroys cancer cells [8–10]. Therefore, accurate identification of T cell antigens is an important task in developing highly efficient cancer peptide vaccines [11].
Although several in vivo and in vitro assays have been developed for the characterization of tumor T cell antigens (TTCAs), they demand a significant investment of resources, time, and labor. On the other hand, computational methods are attracting increasing attention due to their great convenience and high efficiency [12–15]. TTAgP 1.0 uses a random forest (RF) model with relatively simple features including amino acid composition (AAC) and several physicochemical features derived from peptides for the identification of TTCAs [16]. iTTCA-Hybrid uses support vector machines (SVM) and RF with five types of features including AAC, dipeptide composition, pseudo amino acid composition, distribution of amino acid properties in sequences, and physicochemical properties derived from the AAindex database [17] for TTCA prediction [18]. iTTCA-RF [19] considers more features, such as global protein sequence descriptors, grouped amino acid composition, grouped dipeptide composition, grouped tripeptide composition, and pseudo-amino acid composition. These features are then processed by MRMD (Maximum-Relevance-Maximum-Distance) [20] to determine an optimal feature subset of 263 numeric values for training and independent test based on RF. TAP 1.0 [21] applies information gain [22] on 544 features from AAindex, resulting in a reduced subset of 10 properties for training and testing based on 15 machine learning (ML) algorithms, within which quadratic discriminant analysis outperforms the rest. iTTCA-MFF [23] uses 50 physicochemical properties and pairwise energy content matrix for encoding, and LASSO (Least Absolute Shrinkage and Selection Operator) algorithm to select the optimal feature subset. The selected features are provided to an SVM model to identify TTCA.
Though existing studies have achieved a certain degree of success, some of them suffer from a limited selection of feature types and lack a systematic approach to feature subset selection and ML model optimization. More importantly, there is still potential for improvement in prediction accuracy. In this study, we present ENCAP (ENsemble predictors for tumor T Cell Antigen Prediction), a TTCA predictor constructed with ensemble classifiers on diverse sequence-based features. A total of 57 different feature types were considered for feature engineering, producing a vector of 4349 numeric values for each sequence. Next, feature ranking was performed and the feature subset for the follow-up ML process was determined with a heuristic feature subset selection algorithm. Six different ensemble classifiers, such as random forest and extreme gradient boosting, were applied in this study. The hyperparameters of the ML models were then optimized with a Bayesian optimization algorithm based on the selected feature subset. Notably, our ML models have been applied to two different TTCA datasets which consist of TTCAs obtained from TANTIGEN [24, 25] but have different criteria for the acquisition of non-TTCAs. For the first dataset, our predictor achieves a Matthews correlation coefficient (MCC) of 0.522 for independent test, a 4.8% improvement over the state-of-the-art method, iTTCA-RF. For the second dataset, our predictor achieves an MCC of 0.789 for independent test, a 13.5% improvement over the state-of-the-art method, PSRTTCA. For the third independent test dataset of 71 experimentally validated TTCAs collected from the literature, ENCAP achieves accuracy up to 0.873, producing improvements of between 12% and 25.7% over three state-of-the-art methods.
Furthermore, the feature analysis indicates that the feature subset of ENCAP is primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. It is demonstrated that the differences in the criteria for non-TTCAs between the two datasets can lead to differences in the selected feature subset, especially the compositional features. This situation also highlights the effectiveness of ENCAP in performing adaptive feature selection for the target dataset.
Materials and methods
Datasets
In this study, we employed two publicly available datasets to develop and assess ENCAP. The first dataset [18], containing 592 TTCAs and 393 non-TTCAs, was randomly partitioned into a cross validation (CV) dataset and an independent test set using an 80%/20% split. The CV dataset, termed DS1-CV, consisted of 470 TTCAs and 318 non-TTCAs; the independent test set, termed DS1-IND, consisted of 122 TTCAs and 75 non-TTCAs. The second dataset consisted of two subsets: DS2-CV, comprising 474 TTCAs and 474 non-TTCAs for cross validation, and DS2-IND, comprising an additional 118 TTCAs and 118 non-TTCAs for independent test [26]. The peptide length distributions of TTCAs and non-TTCAs for DS1 and DS2 are illustrated in S1 and S2 Figs, respectively. It can be observed that the lengths of a majority of the TTCAs and non-TTCAs fall within a range of 8 to 10 residues.
For both DS1 and DS2, the TTCAs were obtained from TANTIGEN (Tumor T-cell Antigen Database), and the non-TTCAs were obtained from Immune Epitope Database (IEDB) [27, 28]. The two datasets primarily differ in the acquisition of non-TTCAs. Non-TTCAs from DS1 were obtained with a single criterion: T cell antigens with no known association with any disease. In contrast, non-TTCAs from DS2 were collected using multiple criteria, such as antigens not associated with terms such as ‘tumor’, ‘cancer’, ‘carcinoma’, and ‘metastasis’, validation through in vitro and in vivo assays, presentation by MHC-I (Major Histocompatibility Complex-I) molecules, having Homo sapiens as antigen source and host, and a length of 9 to 14 amino acids. Given that DS1 and DS2 have been employed in multiple studies, we assessed our models using both datasets.
Apart from the two datasets, an additional test dataset of 73 experimentally validated TTCAs collected from the literature was obtained from a previous study [26]. Two of the sequences were removed because they contained non-standard amino acids, resulting in 71 TTCAs, termed DS3-IND. This dataset served as an additional test set for benchmark analysis of prediction models trained with DS1-CV and DS2-CV, as well as existing predictors.
Overview of the method
ENCAP utilizes a two-stage process to construct the prediction model, as illustrated in Fig 1. The first stage is feature engineering and the second stage is hyperparameter optimization for ML models. In the first stage, sequences from the CV dataset were converted to various compositional and physicochemical features, followed by data normalization and feature ranking. Next, the best feature number was determined with cross validation using 7 ML models. The N selected features were combined with M features based on motifs, which were obtained with a motif searching algorithm. The N+M features were deemed the best feature subset, which were then used for hyperparameter optimization of ML models in stage 2. A total of 6 ensemble prediction models were considered, including CatBoost (CB) [29], gradient boosting classifier (GBC) [30], extra trees (ET) [31], extreme gradient boosting (XGB) [32], light gradient boosting machine (LGBM) [33], and random forest (RF) [34]. Apart from the above, a statistical method, linear discriminant analysis (LDA) [35], was also used for comparison. The hyperparameters of the 7 ML models were optimized with a Bayesian optimization algorithm on the CV dataset using the selected feature subset from the first stage. After the hyperparameters for each ML model were determined, 10-fold CV and independent test were performed and the results were benchmarked with various evaluation measures. The pseudocode of ENCAP is shown in S1 Text.
Feature encoding
Converting amino acid sequences into numerical representations is crucial in accurate prediction of TTCA using machine learning algorithms. Various compositional and physicochemical features were used to encode peptides, including peptide length, amino acid composition (AAC), di-peptide composition (DPC), composition of k-spaced amino acid pairs (CKSAAP) [36], amphiphilic pseudo amino acid composition (APAAC) [37–39], composition of k-spaced amino acid group pairs (CKSAAGP) [40, 41], Composition-Transition-Distribution (CTD) [42], conjoint triad (CTriad) [43], dipeptide deviation from expected mean (DDE) [44], grouped amino acid composition (GAAC) [45], grouped di-peptide composition (GDPC) [46], distance distribution of repeats (DDR) [47], residue repeats index (RRI) [47], Shannon entropy of a residue (SER) [47], Shannon entropy of a property (SEP) [47], quasi-sequence order (QSO) [48], instability index [49], Boman index [50, 51], isoelectric point [52], and aromaticity. These features were obtained with iFeature [53], pFeature [47], and modlAMP [54] packages. The entire list of 57 feature types and the size of each of them (number of numeric values) are listed in S1 Table. A feature vector of 4349 entries was generated for each sequence.
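As an illustration of the simplest of these encodings, amino acid composition (AAC) maps a peptide to the fractions of the 20 standard residues. The sketch below is our own minimal illustration, not the iFeature implementation:

```python
# Minimal sketch of amino acid composition (AAC) encoding.
# The 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Return the fraction of each standard residue in the peptide (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]
```

The other compositional encodings (DPC, CKSAAP, etc.) generalize this idea to residue pairs and gapped pairs, which is why the concatenated vector grows to thousands of entries.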
Using a diverse set of feature types for sequence encoding is advantageous because different feature types can capture distinct aspects of the peptide sequences, such as composition, physicochemical properties, and structural patterns. By combining multiple feature types, a more comprehensive representation of the sequences is generated, which can mitigate biases introduced by individual encoding methods and potentially improve the generalization ability of ML models.
Apart from the above features, a binary vector associated with sequence motifs was generated. Sequence motifs of DS1 and DS2 were obtained with STREME [55], a motif searching tool which finds ungapped motifs that are enriched in the dataset. STREME was applied to positive and negative samples separately to obtain respective list of motifs ranging from 3 to 20 amino acids. After combining the two motif lists, a binary vector was generated in which the presence and absence of a motif in a given sequence were represented by 1 and 0, respectively. The binary vector was also taken as input for machine learning models.
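The motif encoding itself is straightforward: each motif contributes one bit per sequence. A minimal sketch (the helper name is ours, and the motifs shown are illustrative):

```python
def motif_vector(seq: str, motifs: list[str]) -> list[int]:
    """Binary presence/absence vector: 1 if the motif occurs as a substring of seq."""
    return [1 if m in seq else 0 for m in motifs]
```

For example, a peptide containing FATP and LLD but not IPRHL would be encoded as [1, 0, 1] against the motif list ["FATP", "IPRHL", "LLD"].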
Normalization
Robust normalization was applied to the compositional and physicochemical features with RobustScaler from the Scikit-learn package [56] for Python. RobustScaler converts a feature value xi to yi using the following equation:

yi = (xi − Median(X)) / (Q3(X) − Q1(X))    (1)

in which Median(X), Q3(X), and Q1(X) stand for the median, the 3rd quartile, and the 1st quartile of feature X, respectively. The scaled values have a median of zero and an interquartile range of one across all sequences.
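A numerical sketch of Eq (1) with NumPy (not the production code, which uses Scikit-learn's RobustScaler) illustrates why the interquartile range is preferred over the standard deviation: a single outlier barely moves the median and quartiles.

```python
import numpy as np

def robust_scale(x: np.ndarray) -> np.ndarray:
    """Eq (1): y_i = (x_i - Median(X)) / (Q3(X) - Q1(X))."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - med) / (q3 - q1)

# The outlier 100.0 does not distort the scale of the other values:
# median = 3, Q1 = 2, Q3 = 4, so y = (x - 3) / 2.
y = robust_scale(np.array([1.0, 2.0, 3.0, 4.0, 100.0]))
```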
Machine learning methods
Ensemble predictors were used in this study because of their accuracy, robustness to noise, and good generalizability. We used 6 ensemble predictors, including RF, ET, GBC, LGBM, XGB, and CB. In addition, a statistical model, LDA, was also employed for comparison. RF comprises an extensive ensemble of decision trees, and its prediction is obtained by averaging the outcomes produced by individual trees. ET is similar to RF except for two notable distinctions: 1) every tree is trained using the entire training dataset rather than a bootstrap sample, and 2) the top-down node splitting is determined through a random process, as opposed to relying on a scoring function like information gain or Gini impurity. GBC, LGBM, XGB, and CB are different implementations of boosting algorithms that combine several weak learners into a strong learner, in which each new model is trained with gradient descent to minimize a loss function, such as mean squared error or cross-entropy, on the errors of the preceding models. LDA attempts to find a linear combination of features that best separates the classes in a dataset while minimizing the variance within each class. Implementation of these models was based on the Scikit-learn package of Python.
Heuristic feature subset selection
The Boruta package [57] implements a wrapper approach utilizing a random forest algorithm for feature ranking. The algorithm iteratively eliminates features that are statistically less relevant than randomized features, resulting in a ranked feature list based on feature importance. The algorithm with its default parameters was applied to analyze features (except for the motifs) for DS1-CV and DS2-CV, yielding two ranked feature lists.
Next, a heuristic algorithm was applied to determine the best feature numbers of DS1-CV and DS2-CV for machine learning. For a given dataset, the ranked features were expressed as an ordered set F = {f1, f2, …, f4335}, sorted in descending order of feature importance. The feature subset of the top N features is defined as SN = {f1, f2, …, fN}, where fi ∈ F. Iterative five-fold cross validation runs using SN were performed with the six ensemble predictors, i.e., RF, ET, GBC, LGBM, XGB, and CB. The default hyperparameters of the Scikit-learn package were used for all the models. The cross validation experiments were carried out with PyCaret [58] and the results were evaluated with MCC. Let MCCN,i be the MCC for ML model i using SN. The best MCC based on SN is defined as

MCC*N = max1≤i≤6 MCCN,i    (2)

where ML model i corresponds to any of the six ML models (RF, ET, GBC, LGBM, XGB, and CB). The process was repeated iteratively, with N ranging from 50 to 410 in increments of 20. The best feature subset (BFS) is defined as

BFS = SN*, where N* = argmaxN∈{50, 70, …, 410} MCC*N    (3)
As a result, two feature subsets were generated by applying the heuristic algorithm to DS1-CV and DS2-CV. The pseudocode of the algorithm is shown in S2 Text. The purpose of feature subset selection is to determine the best feature number for each dataset while avoiding the lengthy process of machine learning model optimization; this is why the default hyperparameters of the six machine learning models were used for this task. Moreover, using ML models based on different rationales has the advantage that the selected feature subsets are not biased towards a particular model.
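The heuristic can be sketched as a simple grid search over N. In the sketch below, `evaluate_mcc` is a stand-in for the five-fold PyCaret cross validation described above, and the function names are ours:

```python
def best_feature_subset(ranked_features, models, evaluate_mcc):
    """Grid search over N in {50, 70, ..., 410}: for each N, take the best
    model MCC over the six ensembles (Eq 2), then keep the N that maximizes
    that score (Eq 3)."""
    best_n, best_mcc = None, float("-inf")
    for n in range(50, 411, 20):
        subset = ranked_features[:n]
        # MCC*_N = max over the six ensemble models
        mcc_star = max(evaluate_mcc(model, subset) for model in models)
        if mcc_star > best_mcc:
            best_n, best_mcc = n, mcc_star
    return ranked_features[:best_n], best_mcc
```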
Hyperparameter optimization and independent test
Once the feature subset was determined, hyperparameters of ML models were optimized with Optuna [59], an automatic hyperparameter optimization software package for sampling search space and pruning unpromising trials. We used the tree-structured Parzen estimator (TPE) algorithm [60], a Bayesian optimization method, as the search algorithm of Optuna because it is more likely to reach global optimum than randomized search, and is more time-efficient than grid search. The default parameters of TPE were used during the process. After the hyperparameters were determined for each model, ten-fold cross validation was performed to evaluate each predictor.
To perform independent test, an additional training process using the optimized parameters of each model on the entire DS1-CV or DS2-CV was performed. The resulting models, with the advantage of using 10% more training data compared to the models obtained from cross validation, were used to conduct independent test on DS1-IND and DS2-IND.
Performance evaluation
For benchmark comparison, prediction results were evaluated with accuracy, precision, recall (sensitivity), specificity, F1-score, and MCC (Matthews Correlation Coefficient), defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

Specificity = TN / (TN + FP)    (7)

F1-score = 2 × Precision × Recall / (Precision + Recall)    (8)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (9)

where TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy, precision, recall, specificity, and F1-score range from 0 to 1, where a larger value indicates better predictive capability. MCC ranges from -1 to 1, representing completely negative and completely positive correlations, respectively, and an MCC of 0 represents random prediction. In addition, AUC (Area Under the receiver operating characteristic Curve) [61], a non-parametric and threshold-independent measure, was also included for evaluation. An AUC of 0.5 corresponds to random guessing, whereas values approaching 1 indicate good prediction performance.
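The confusion-matrix measures of Eqs (4)–(9) can be computed in a few lines; this generic helper (not ENCAP code) makes the definitions concrete:

```python
import math

def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, specificity, F1-score, and MCC (Eqs 4-9)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

Note that MCC, unlike accuracy, stays at 0 for a classifier that guesses at random on a balanced confusion matrix, which is why it is the primary selection criterion in this study.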
Results
Selected features and motifs
Cross validation results of DS1-CV and DS2-CV using different feature numbers were evaluated with MCC, and the results are shown in S3 Fig. The analysis reveals that selecting 150 features for DS1-CV and 210 features for DS2-CV results in the highest MCC, thereby constituting the selected feature subset. Using the selected feature subset, the best MCCs achieved by the 7 machine learning models are 0.535 and 0.757 for DS1-CV and DS2-CV, respectively.
STREME generated 9 motifs for TTCAs from DS1-CV, including FATP, AEEA, ALQP, AVI, LMK, SLADTN, VFGI, RLAE, LLD, and 2 motifs for non-TTCAs from DS1-CV, including IPRHL and ALAG. For DS2-CV, STREME yielded 3 motifs from TTCAs, including LMK, ADV, and AGIGIL, and 5 motifs from non-TTCAs, including AQID, EFP, AVADE, VDEV, and ITD. Interestingly, the motifs from both datasets are quite different because the non-TTCA sequences were collected based on different criteria, resulting in dissimilar sequence patterns for the negatives. The sizes of feature vectors for DS1-CV and DS2-CV are 161 (150+11) and 218 (210+8), respectively.
Performance evaluation for cross validation
The hyperparameters of the 7 ML models trained with DS1-CV are listed in S2 Table. Evaluation results of cross validation on DS1-CV for all the ML models are shown in Table 1. It can be observed that ET yields accuracy, precision, F1-score, and MCC of 0.760, 0.751, 0.821, and 0.488, respectively, all of which are the highest among all the methods. CB and GBC have AUCs of 0.793 and 0.803, respectively, comparable to ET, but their MCCs are 3.3% and 3.5% lower than that of ET. XGB has the highest recall but suffers from the lowest precision among all the methods.
Bold face indicates the highest value among all the methods.
The hyperparameters of the 7 ML models trained with DS2-CV are listed in S3 Table. Evaluation results of cross validation on DS2-CV for all the ML models are shown in Table 2. ET yields accuracy, precision, F1-score, and MCC of 0.852, 0.856, 0.852, and 0.708, respectively, all of which are again the highest among all the methods. GBC, RF, CB, and XGB show marginally lower MCC and AUC (within a 1% difference), indicating that these ensemble machine learning models are comparable to ET in cross validation. Comparing the evaluation results of DS1-CV with those of DS2-CV, it is obvious that the specificities of all the models on DS1-CV are below 0.6, significantly lower than those obtained on DS2-CV, which exceed 0.82. Moreover, all the models produce ~10% lower precision for DS1 compared to DS2. These observations indicate a higher propensity for false positive predictions on DS1, a characteristic specific to the dataset.
Bold face indicates the highest value among all the methods.
Performance evaluation for independent test
Benchmark results for independent test on DS1-IND with the 7 ML models are shown in Table 3. Results for iTTCA-Hybrid, TTAgP 1.0, and LDA are also listed in the table for comparison because they were developed and benchmarked on the same dataset. It can be seen that the best performing method, GBC, generates the highest MCC of 0.522 and the highest accuracy of 0.766. CB and ET yield slightly lower MCCs of 0.508 and 0.488, respectively. We further analyzed the correlation between the prediction confidence (or probability output) of each model and the true positive rate, calculated as the number of actual TTCAs divided by the total number of sequences whose prediction confidence falls within a given range. It can be observed from S4A Fig that the prediction confidence of each model is in general positively correlated with the true positive rate. LDA has a higher true positive rate in the confidence range of 0 to 0.2 than all the other methods; that is, many actual TTCAs receive low confidence from LDA, which explains why LDA produces the lowest MCC for independent test. GBC, CB, and ET outperform the state-of-the-art method, iTTCA-RF, in MCC by 4.8%, 3.4%, and 1.4%, respectively. iTTCA-RF has the highest precision and specificity among all the methods, and yet it generates a recall of 0.836, considerably lower than the recalls of GBC, CB, and ET, which range from 0.975 to 1.000. This is the major reason why GBC, CB, and ET outperform iTTCA-RF in terms of MCC.
Bold face indicates the highest value among all the methods.
Table 4 shows the benchmark results for independent test on DS2-IND with the 7 ML models and two other existing methods, PSRTTCA and TAP 1.0, both of which were developed and benchmarked on the same dataset. It can be seen that ET is the best performing model: its MCC, AUC, accuracy, and recall are 0.789, 0.960, 0.890, and 0.966, respectively, all of which are the highest among all the methods. All 7 models present good correlations between the prediction confidence and the true positive rate, as illustrated in S4B Fig. Similar to the prediction results on DS1-IND, ET, GBC, and CB are the top 3 methods judging by MCC, and they outperform the state-of-the-art method, PSRTTCA, by 13.5%, 12.7%, and 10.5% in MCC, respectively. Moreover, the improvements of ET, GBC, and CB over PSRTTCA in AUC are 7.5%, 6.6%, and 7.2%, respectively. The precision of PSRTTCA is 0.827, comparable to that of ET, GBC, and CB, and yet its recall is 0.822, which is 14.4%, 14.4%, and 11.0% lower than that of the three methods, respectively. This explains why PSRTTCA has a much lower MCC than the three methods. The above results on DS1-IND and DS2-IND indicate that significantly improved recall is the major reason why the proposed methods compare favorably with the state-of-the-art approaches.
Bold face indicates the highest value among all the methods.
To predict sequences from DS3-IND (consisting of 71 experimentally validated TTCAs collected from the literature), we employed ET and GBC trained with DS1-CV (termed DS1-ET and DS1-GBC, respectively) because the two are the best models for cross validation and independent test on DS1, respectively, judging by MCC. ET trained with DS2-CV, termed DS2-ET, was also used for benchmark comparison because the model outperforms the rest for both cross validation and independent test on DS2. DS1-ET, DS1-GBC, and DS2-ET correctly predict 62, 60, and 58 TTCAs from the dataset, respectively, corresponding to accuracies of 0.873, 0.845, and 0.817. On the other hand, PSRTTCA, iTTCA-RF, iTTCA-Hybrid, and TAP 1.0 predict 55, 52, 52, and 45 TTCAs out of the 71 actual TTCAs in DS3-IND. The accuracy of the four existing methods ranges from 0.616 to 0.753, indicating that our proposed methods significantly outperform existing methods in prediction accuracy for DS3-IND.
Discussion
Data visualization with selected features
In this study, each sequence was originally encoded with various biological, physicochemical, and compositional features, resulting in a feature vector of 4349 numeric values. The improved performances of our predictors on both DS1-IND and DS2-IND are largely attributed to the selection of numerical features. Through feature engineering, subsets of 150 and 210 features were obtained for DS1 and DS2, respectively. t-distributed stochastic neighbor embedding (t-SNE) [62, 63] was applied to validate the efficacy of the selected feature subsets for discriminating TTCAs. Fig 2 illustrates the t-SNE distributions of TTCAs and non-TTCAs on 2D planes using either the complete set of 4349 features or the selected feature subsets. Within these plots, red and blue spots represent TTCAs and non-TTCAs, respectively. Notably, Fig 2B and 2D highlight the enhanced separation achieved by the selected feature subsets, facilitating clearer differentiation between the distributions of TTCAs and non-TTCAs. Conversely, when employing all 4349 features for t-SNE analysis, the distributions of TTCAs and non-TTCAs exhibit serious overlap, as shown in Fig 2A and 2C. These results indicate the importance and efficacy of the selected feature subsets, which can benefit ML models in accurate prediction of TTCAs.
t-SNE distributions of (A) DS1-CV using 4349 features, (B) DS1-CV using 150 selected features, (C) DS2-CV using 4349 features, and (D) DS2-CV using 210 selected features.
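A minimal t-SNE sketch with Scikit-learn on synthetic two-class data illustrates the projection step; the real analysis used the 4349-feature and selected-subset encodings of the actual peptides, and the dimensions here are arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-ins for TTCA / non-TTCA feature vectors (10-D here).
pos = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
neg = rng.normal(loc=3.0, scale=1.0, size=(50, 10))
X = np.vstack([pos, neg])

# Project to 2-D for visualization, as in Fig 2.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

The 2-D `embedding` can then be scatter-plotted with class colors to reproduce the style of Fig 2.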
Analysis of selected feature types
The entire lists of the 161 selected features for DS1 and the 218 selected features for DS2 are shown in S4 and S5 Tables, respectively. The former belong to 16 different feature types and the latter belong to 25 different feature types (excluding motifs), among which 8 feature types are in common. The top 5 most frequent feature types of DS1 and DS2 are shown in Table 5. It can be observed that 4 of the top 5 most frequent feature types for DS1 and DS2 belong to the common feature types, indicating a certain degree of similarity between the two datasets. The entire lists of feature types used for machine learning on DS1 and DS2 are shown in S6 Table. These feature types are mostly physicochemical properties; for example, CTD (Composition-Transition-Distribution) [42] is constructed with properties including hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility. Other selected features of physicochemical properties include ABHPRK (Acidic, Basic, Hydrophobic, Polar, aRomatic, Kink-inducer) [64], MSW (amino acid scale based on a principal component analysis of the Molecular Surface-based Weighted holistic invariant molecular descriptor) [65], Z5 [66], and Z3 [67]. Apart from the general physicochemical properties, hydrophobicity appears to be one of the important attributes among the selected features. For example, the Aliphatic_Index [68] and OVPC_Aliphatic (OVerlapping Property Composition) [69] for DS1 and APAAC (Amphiphilic Pseudo-Amino Acid Composition) [48] for DS2 are related to hydrophobicity.
The major difference between the feature types for DS1 and DS2 is the presence of compositional features. For example, DS1 consists of 91 DDE features (Dipeptide Deviation from Expected mean) [70] while none are selected for DS2. DS2 consists of a much smaller number of compositional features, such as 5 APAAC features (Amphiphilic Pseudo-Amino Acid Composition), 2 AAC features (Amino Acid Composition), and 2 GAAC (Grouped Amino Acid Composition) [45] features. Such difference is attributed to the feature ranking algorithm, Boruta, which ranks the contribution of each feature based on its discriminating capability for a given dataset. Specifically, the differences in the criteria for non-TTCAs between DS1 and DS2 lead to different sequence patterns, thus altering the feature ranks generated by Boruta. The situation also highlights the flexibility of our approach in selecting suitable features for the target dataset.
Feature importance analysis
SHAP (SHapley Additive exPlanations) [71] analysis is known as a powerful framework for providing information about how features affect the output of a model. Positive SHAP values signify that a feature contributed to augmenting the model’s output, making a positive prediction more likely, whereas negative values indicate the feature’s role in diminishing the output, increasing the likelihood of a negative prediction. The absolute magnitude of the SHAP values quantifies the extent of each feature’s influence, namely, feature importance. Fig 3 illustrates the SHAP values for the top 20 features based on DS1 and DS2. It can be seen that the top 20 features for DS1 (Fig 3A) are mostly associated with physicochemical properties such as MSW, Z3, and Z5. On the other hand, 8 of the top 20 features for DS2 (Fig 3B) are OVPs corresponding to the hydrophobic, polar, charged, and aromatic residues at the N-terminus of peptides (N1, N2, and N3). Several CTD features corresponding to solvent accessibility and hydrophobicity also have high SHAP values for DS2. The analysis reveals different feature contributions for the two datasets, reflecting different underlying patterns or characteristics that are relevant to the prediction task.
The beeswarm plots of SHAP values for the top 20 features based on A) DS1 and B) DS2.
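The additive attribution underlying SHAP can be made concrete on a toy model: for a handful of features, the Shapley value of each feature is its average marginal contribution over all orderings in which features are added. The sketch below computes this by brute force with a hypothetical value function `f` standing in for a trained model (real SHAP implementations use efficient approximations such as TreeExplainer):

```python
from itertools import permutations

def shapley_values(f, features):
    """Exact Shapley values: average marginal contribution of each
    feature over all orderings in which features are added."""
    phi = {x: 0.0 for x in features}
    orders = list(permutations(features))
    for order in orders:
        present = frozenset()
        for x in order:
            phi[x] += f(present | {x}) - f(present)  # marginal contribution
            present = present | {x}
    return {x: v / len(orders) for x, v in phi.items()}

# Toy "model": hydrophobicity raises the score, charge lowers it,
# and the two interact when both are present.
def f(s):
    score = 0.0
    if "hydrophobic" in s: score += 2.0
    if "charged" in s:     score -= 1.0
    if {"hydrophobic", "charged"} <= s: score += 0.5
    return score

phi = shapley_values(f, ["hydrophobic", "charged", "aromatic"])
```

By construction the attributions sum to the difference between the model output with all features present and with none, which is the additivity property that makes beeswarm plots like Fig 3 interpretable.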
Peptide property with respect to prediction capability
Prediction results of the three most accurate models on the independent tests (CB, GBC, and ET) were further analyzed with respect to three peptide properties, characterized by the ratios of hydrophobic, hydrophilic, and charged residues within a peptide. Hydrophobic amino acids comprise V, I, L, M, F, W, and C; hydrophilic amino acids comprise R, N, D, E, Q, H, K, S, and T; charged amino acids comprise E, D, R, K, and H. The analysis was applied to peptides from DS1-IND and DS2-IND, and the results are illustrated in Fig 4. It can be observed that the prediction capability of the ML models improves for peptides with fewer hydrophobic residues and more hydrophilic and charged residues. Moreover, the trends for DS1-IND (Fig 4A–4C) and DS2-IND (Fig 4D–4F) are remarkably similar. The analysis also suggests directions for future enhancement, particularly refining the prediction accuracy for peptides with a higher number of hydrophobic residues and fewer hydrophilic and charged residues.
Panels A, B, and C correspond to ratios of hydrophobic, hydrophilic, and charged amino acids for peptides from DS1-IND. Panels D, E, and F correspond to ratios of hydrophobic, hydrophilic, and charged amino acids for peptides from DS2-IND.
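The three residue-ratio properties used in this analysis follow directly from the residue classes listed above and can be computed from a sequence as follows (a minimal sketch; the function name is our own):

```python
# Residue classes as defined in the analysis above.
HYDROPHOBIC = set("VILMFWC")
HYDROPHILIC = set("RNDEQHKST")
CHARGED = set("EDRKH")

def residue_ratios(peptide):
    """Return the fractions of hydrophobic, hydrophilic, and charged
    residues in a peptide sequence (one-letter amino acid codes)."""
    n = len(peptide)
    return {
        "hydrophobic": sum(aa in HYDROPHOBIC for aa in peptide) / n,
        "hydrophilic": sum(aa in HYDROPHILIC for aa in peptide) / n,
        "charged": sum(aa in CHARGED for aa in peptide) / n,
    }

# Example: V and L are hydrophobic; K and E are both hydrophilic and charged.
ratios = residue_ratios("VLKEKA")
```

Note that the classes overlap (all five charged residues are also hydrophilic), so the three ratios need not sum to one.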
Conclusion
Cancer immunotherapy is a type of cancer treatment that uses the body’s own immune system to fight cancer cells. Accurate identification of TTCAs is an important task for cancer immunotherapy because peptides derived from TTCAs can induce a peptide-specific T cell immune response that neutralizes tumor cells. In this study, we present ENCAP, an ML method built with ensemble classifiers and diverse sequence features for TTCA prediction. In the feature engineering stage, sequences were encoded into a total of 4349 numeric values from 57 different feature types, followed by feature ranking with Boruta and a heuristic feature subset selection process. In addition, sequence motifs from TTCAs were used to generate bit vectors as part of the features. The second stage used the selected feature subset to optimize the hyperparameters of the 7 ML models (CB, GBC, ET, XGB, LGBM, RF, and LDA) with a Bayesian optimization approach. ENCAP was applied to two public TTCA datasets, DS1 and DS2, which use quite different criteria for the acquisition of non-TTCAs. For DS1-IND, the best ML model, XGB, yields 0.766, 0.729, 0.992, and 0.522 for accuracy, precision, recall, and MCC, respectively. For DS2-IND, the best ML model, ET, yields 0.890, 0.838, 0.966, and 0.789 for accuracy, precision, recall, and MCC, respectively. The improvements in MCC over the state-of-the-art methods on the independent tests of DS1 and DS2 are 4.8% and 13.5%, respectively. This improvement is largely attributable to the considerable increase in recall for the proposed models. For DS3, the dataset of 71 experimentally validated TTCAs gathered from the literature, our top-performing models exhibit accuracy ranging from 0.817 to 0.873, whereas existing methods yield accuracy within the range of 0.616 to 0.753. These results suggest that our models produce significant improvements over existing methods in identifying TTCAs.
The efficacy of the selected features is further validated with t-SNE plots, in which the 150 and 210 selected features from DS1 and DS2, respectively, demonstrate enhanced separation compared to the original features. Our feature analysis reveals that the selected feature subsets consist mostly of physicochemical properties, along with several features specifically corresponding to hydrophobicity and amphiphilicity. The feature subsets for DS1 and DS2 mainly differ in the selection of compositional features. This shows that different criteria for non-TTCAs can result in different sequence patterns, which in turn affect the discriminating capability of features. The SHAP analysis likewise reveals different feature contributions for the two datasets, reflecting different underlying patterns or characteristics relevant to the prediction task.
This study is the first to conduct separate benchmark comparisons of predictors on two TTCA datasets. Our results not only demonstrate significant improvements in prediction accuracy on three independent tests but also provide evidence for the effectiveness of ENCAP in adaptively selecting features for the target dataset. Furthermore, we demonstrate that prediction accuracy is limited for peptides with a higher proportion of hydrophobic residues and a lower occurrence of hydrophilic and charged residues, indicating a need for further improvement. As more experimentally validated TTCAs become available, employing advanced deep learning architectures and large language models may enhance the performance of TTCA prediction. These techniques could potentially capture complex sequence patterns and long-range dependencies that are challenging for traditional machine learning models. Ultimately, continued research at the intersection of cancer immunotherapy and machine learning could pave the way for more effective identification of TTCA candidates, accelerating the development of novel immunotherapeutic strategies against cancer.
Supporting information
S1 Fig.
Peptide length distributions of TTCAs and non-TTCAs for A) DS1-CV and B) DS1-IND.
https://doi.org/10.1371/journal.pone.0307176.s001
(DOCX)
S2 Fig.
Peptide length distributions of TTCAs and non-TTCAs for A) DS2-CV and B) DS2-IND.
https://doi.org/10.1371/journal.pone.0307176.s002
(DOCX)
S3 Fig.
MCCs of the best machine learning model evaluated with cross validation using different feature numbers on A) DS1-CV and B) DS2-CV.
https://doi.org/10.1371/journal.pone.0307176.s003
(DOCX)
S4 Fig.
Prediction confidence of 7 ML models on (A) DS1-IND and (B) DS2-IND.
https://doi.org/10.1371/journal.pone.0307176.s004
(DOCX)
S1 Table. Details of 57 feature types used in the study.
https://doi.org/10.1371/journal.pone.0307176.s005
(DOCX)
S2 Table. Hyperparameters of ML models trained with DS1-CV.
https://doi.org/10.1371/journal.pone.0307176.s006
(DOCX)
S3 Table. Hyperparameters of ML models trained with DS2-CV.
https://doi.org/10.1371/journal.pone.0307176.s007
(DOCX)
S4 Table. List of the 161 selected features for DS1.
https://doi.org/10.1371/journal.pone.0307176.s008
(DOCX)
S5 Table. List of the 218 selected features for DS2.
https://doi.org/10.1371/journal.pone.0307176.s009
(DOCX)
S6 Table. The selected numbers and sizes of feature types from the selected feature subsets of DS1 and DS2.
https://doi.org/10.1371/journal.pone.0307176.s010
(DOCX)
S2 Text. Pseudocode of the feature subset selection algorithm.
https://doi.org/10.1371/journal.pone.0307176.s012
(DOCX)
References
- 1. Bray F, Laversanne M, Weiderpass E, Soerjomataram I. The ever-increasing importance of cancer as a leading cause of premature death worldwide. Cancer. 2021;127: 3029–3030. pmid:34086348
- 2. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71: 209–249. pmid:33538338
- 3. Harris F, Dennison SR, Singh J, Phoenix DA. On the selectivity and efficacy of defense peptides with respect to cancer cells. Med Res Rev. 2013;33: 190–234. pmid:21922503
- 4. Thundimadathil J. Cancer treatment using peptides: current therapies and future prospects. J Amino Acids. 2012;2012: 967347. pmid:23316341
- 5. Yu L, Wang M, Yang Y, Xu F, Zhang X, Xie F, et al. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLOS Comput Biol. 2021;17: e1008696. pmid:33561121
- 6. Couzin-Frankel J. Cancer Immunotherapy. Science. 2013;342: 1432–1433. pmid:24357284
- 7. Liu Y, Ouyang X, Xiao Z-X, Zhang L, Cao Y. A Review on the Methods of Peptide-MHC Binding Prediction. Curr Bioinforma. 2020;15: 878–888.
- 8. Mizukoshi E, Nakamoto Y, Arai K, Yamashita T, Sakai A, Sakai Y, et al. Comparative analysis of various tumor-associated antigen-specific t-cell responses in patients with hepatocellular carcinoma. Hepatol Baltim Md. 2011;53: 1206–1216. pmid:21480325
- 9. Yang J, Zhang Q, Li K, Yin H, Zheng J-N. Composite peptide-based vaccines for cancer immunotherapy (Review). Int J Mol Med. 2015;35: 17–23. pmid:25395173
- 10. Kumai T, Lee S, Cho H-I, Sultan H, Kobayashi H, Harabuchi Y, et al. Optimization of Peptide Vaccines to Induce Robust Antitumor CD4 T-cell Responses. Cancer Immunol Res. 2017;5: 72–83. pmid:27941004
- 11. Liu W, Tang H, Li L, Wang X, Yu Z, Li J. Peptide‐based therapeutic cancer vaccine: Current trends in clinical application. Cell Prolif. 2021;54: e13025. pmid:33754407
- 12. Ali F, Akbar S, Ghulam A, Maher ZA, Unar A, Talpur DB. AFP-CMBPred: Computational identification of antifreeze proteins by extending consensus sequences into multi-blocks evolutionary information. Comput Biol Med. 2021;139: 105006. pmid:34749096
- 13. Raza A, Uddin J, Almuhaimeed A, Akbar S, Zou Q, Ahmad A. AIPs-SnTCN: Predicting Anti-Inflammatory Peptides Using fastText and Transformer Encoder-Based Hybrid Word Embedding with Self-Normalized Temporal Convolutional Networks. J Chem Inf Model. 2023;63: 6537–6554. pmid:37905969
- 14. Akbar S, Hayat M, Tahir M, Khan S, Alarfaj FK. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med. 2022;131: 102349. pmid:36100346
- 15. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151: 102860. pmid:38552379
- 16. Beltrán Lissabet JF, Herrera Belén L, Farias JG. TTAgP 1.0: A computational tool for the specific prediction of tumor T cell antigens. Comput Biol Chem. 2019;83: 107103. pmid:31437642
- 17. Kawashima S, Ogata H, Kanehisa M. AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999;27: 368–369. pmid:9847231
- 18. Charoenkwan P, Nantasenamat C, Hasan MM, Shoombuatong W. iTTCA-Hybrid: Improved and robust identification of tumor T cell antigens by utilizing hybrid feature representation. Anal Biochem. 2020;599: 113747. pmid:32333902
- 19. Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med. 2021;19. pmid:34706730
- 20. He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinforma. 2020;15: 1213–1221.
- 21. Herrera-Bravo J, Herrera Belén L, Farias JG, Beltrán JF. TAP 1.0: A robust immunoinformatic tool for the prediction of tumor T-cell antigens based on AAindex properties. Comput Biol Chem. 2021;91: 107452. pmid:33592504
- 22. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1: 81–106.
- 23. Zou H, Yang F, Yin Z. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics. 2022;74: 447–454. pmid:35246701
- 24. Olsen LR, Tongchusak S, Lin H, Reinherz EL, Brusic V, Zhang GL. TANTIGEN: a comprehensive database of tumor T cell antigens. Cancer Immunol Immunother CII. 2017;66: 731–735. pmid:28280852
- 25. Zhang G, Chitkushev L, Olsen LR, Keskin DB, Brusic V. TANTIGEN 2.0: a knowledge base of tumor T cell antigens and epitopes. BMC Bioinformatics. 2021;22: 40. pmid:33849445
- 26. Charoenkwan P, Pipattanaboon C, Nantasenamat C, Hasan MM, Moni MA, Lio’ P, et al. PSRTTCA: A new approach for improving the prediction and characterization of tumor T cell antigens using propensity score representation learning. Comput Biol Med. 2022;152: 106368. pmid:36481763
- 27. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015;43: D405–D412. pmid:25300482
- 28. Fleri W, Paul S, Dhanda SK, Mahajan S, Xu X, Peters B, et al. The Immune Epitope Database and Analysis Resource in Epitope Discovery and Synthetic Vaccine Design. Front Immunol. 2017;8. Available: https://www.frontiersin.org/articles/10.3389/fimmu.2017.00278 pmid:28352270
- 29. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. ArXiv. 2018 [cited 3 Apr 2023]. Available: https://www.semanticscholar.org/paper/CatBoost%3A-gradient-boosting-with-categorical-Dorogush-Ershov/f5fbcd9ff72c5820a21b9d6871d2a3d475c9bb7f
- 30. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001;29: 1189–1232.
- 31. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63: 3–42.
- 32. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
- 33. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017.
- 34. Ho TK. Random decision forests. Proceedings of 3rd International Conference on Document Analysis and Recognition. 1995. pp. 278–282 vol.1.
- 35. Fix E, Hodges J. Discriminatory Analysis—Nonparametric Discrimination: Consistency Properties. University of California, Berkeley; 1951 Feb.
- 36. Chen K, Kurgan LA, Ruan J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct Biol. 2007;7: 25. pmid:17437643
- 37. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43: 246–255. pmid:11288174
- 38. Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of Hydrophobic Interactions to Protein Stability. J Mol Biol. 2011;408: 514–528. pmid:21377472
- 39. Chou K-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21: 10–19. pmid:15308540
- 40. Liu L-M, Xu Y, Chou K-C. iPGK-PseAAC: Identify Lysine Phosphoglycerylation Sites in Proteins by Incorporating Four Different Tiers of Amino Acid Pairwise Coupling Information into the General PseAAC. Med Chem Shariqah United Arab Emir. 2017;13: 552–559. pmid:28521678
- 41. Chen X, Qiu J-D, Shi S-P, Suo S-B, Huang S-Y, Liang R-P. Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinforma Oxf Engl. 2013;29: 1614–1622. pmid:23626001
- 42. Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A. 1995;92: 8700–8704. pmid:7568000
- 43. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104: 4337–4341. pmid:17360525
- 44. Saravanan V, Gautham N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. Omics J Integr Biol. 2015;19: 648–658. pmid:26406767
- 45. Lee T-Y, Lin Z-Q, Hsieh S-J, Bretaña NA, Lu C-T. Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics. 2011;27: 1780–1787. pmid:21551145
- 46. Sun J-N, Yang H-Y, Yao J, Ding H, Han S-G, Wu C-Y, et al. Prediction of Cyclin Protein Using Two-Step Feature Selection Technique. IEEE Access. 2020;8: 109535–109542.
- 47. Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, et al. Pfeature: A Tool for Computing Wide Range of Protein Features and Building Prediction Models. J Comput Biol J Comput Mol Cell Biol. 2022. pmid:36251780
- 48. Chou KC. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun. 2000;278: 477–483. pmid:11097861
- 49. Guruprasad K, Reddy BV, Pandit MW. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990;4: 155–161. pmid:2075190
- 50. Boman HG, Wade D, Boman IA, Wåhlin B, Merrifield RB. Antibacterial and antimalarial properties of peptides that are cecropin-melittin hybrids. FEBS Lett. 1989;259: 103–106. pmid:2689223
- 51. Wolfenden R. Experimental measures of amino acid hydrophobicity and the topology of transmembrane and globular proteins. J Gen Physiol. 2007;129: 357–362. pmid:17438117
- 52. Haynes WM, editor. CRC Handbook of Chemistry and Physics. 95th ed. Boca Raton: CRC Press; 2014. https://doi.org/10.1201/b17118
- 53. Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinforma Oxf Engl. 2018;34: 2499–2502. pmid:29528364
- 54. Müller AT, Gabernet G, Hiss JA, Schneider G. modlAMP: Python for antimicrobial peptides. Bioinforma Oxf Engl. 2017;33: 2753–2755. pmid:28472272
- 55. Bailey TL. STREME: accurate and versatile sequence motif discovery. Birol I, editor. Bioinformatics. 2021;37: 2834–2840. pmid:33760053
- 56. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
- 57. Kursa MB, Rudnicki WR. Feature Selection with the Boruta Package. J Stat Softw. 2010;36: 1–13.
- 58. Ali M. PyCaret: An open source, low-code machine learning library in Python. 2020. Available: https://www.pycaret.org
- 59. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY, USA: Association for Computing Machinery; 2019. pp. 2623–2631. https://doi.org/10.1145/3292500.3330701
- 60. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2011.
- 61. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143: 29–36. pmid:7063747
- 62. van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9: 2579–2605.
- 63. van der Maaten L. Accelerating t-SNE using Tree-Based Algorithms. J Mach Learn Res. 2014;15: 3221–3245.
- 64. Timmons PB, Hewage CM. ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides. Biomed Pharmacother. 2021;133: 111051. pmid:33254015
- 65. Zaliani A, Gancia E. MS-WHIM Scores for Amino Acids: A New 3D-Description for Peptide QSAR and QSPR Studies. J Chem Inf Comput Sci. 1999;39: 525–533.
- 66. Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S. New Chemical Descriptors Relevant for the Design of Biologically Active Peptides. A Multivariate Characterization of 87 Amino Acids. J Med Chem. 1998;41: 2481–2491. pmid:9651153
- 67. Hellberg S, Sjoestroem M, Skagerberg B, Wold S. Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem. 1987;30: 1126–1135. pmid:3599020
- 68. Ikai A. Thermostability and Aliphatic Index of Globular Proteins. J Biochem (Tokyo). 1980;88: 1895–1898. pmid:7462208
- 69. Manavalan B, Basith S, Shin TH, Wei L, Lee G. mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation. Bioinforma Oxf Engl. 2019;35: 2757–2765. pmid:30590410
- 70. Garg A, Raghava GPS. A Machine Learning Based Method for the Prediction of Secretory Proteins Using Amino Acid Composition, Their Order and Similarity-Search. In Silico Biol. 2008;8: 129–140. pmid:18928201
- 71. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 4768–4777.