Abstract
Cancer immunotherapy enhances the body’s natural immune system to combat cancer, offering the advantage of reduced side effects compared to traditional treatments because of its high selectivity and efficacy. Utilizing computational methods to identify tumor T cell antigens (TTCAs) is valuable in unraveling the biological mechanisms and enhancing the effectiveness of immunotherapy. In this study, we present ENCAP, a predictor for TTCAs based on ensemble classifiers and diverse sequence features. Sequences were encoded as feature vectors of 4349 entries based on 57 different feature types, followed by feature engineering and hyperparameter optimization for machine learning models. The selected feature subsets of ENCAP are primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. Two publicly available datasets were used for performance evaluation. ENCAP yields an AUC (Area Under the ROC Curve) of 0.768 and an MCC (Matthews Correlation Coefficient) of 0.522 on the first independent test set. On the second test set, it achieves an AUC of 0.960 and an MCC of 0.789. Performance evaluations show that ENCAP generates 4.8% and 13.5% improvements in MCC over the state-of-the-art methods on two popular TTCA datasets, respectively. For the third test dataset of 71 experimentally validated TTCAs from the literature, ENCAP yields a prediction accuracy of 0.873, achieving improvements ranging from 12% to 25.7% compared to three state-of-the-art methods. In general, the prediction accuracy is higher for sequences with fewer hydrophobic residues and more hydrophilic and charged residues. The source code of ENCAP is freely available at https://github.com/YnnJ456/ENCAP.
Citation: Yu J-C, Ni K, Chen C-T (2024) ENCAP: Computational prediction of tumor T cell antigens with ensemble classifiers and diverse sequence features. PLoS ONE 19(7): e0307176. https://doi.org/10.1371/journal.pone.0307176
Editor: Shahid Akbar, Abdul Wali Khan University Mardan, PAKISTAN
Received: March 22, 2024; Accepted: July 1, 2024; Published: July 18, 2024
Copyright: © 2024 Yu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code of ENCAP and the datasets are freely available at https://github.com/YnnJ456/ENCAP.
Funding: This work was supported by National Science and Technology Council, Taiwan under Grant NSTC 112-2221-E-468-019. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Cancer is a major public health challenge that is becoming increasingly prevalent. It is the leading cause of death in many countries and is a major barrier to increasing life expectancy [1]. According to GLOBOCAN (Global Cancer Statistics) 2020, there were 19.3 million new cancer cases diagnosed in 2020 and almost 10.0 million cancer deaths [2]. GLOBOCAN predicts that the number of cancer cases will increase to 28.4 million in 2040. Modern cancer therapies such as chemotherapy, radiation therapy, hormonal therapy, and targeted therapy are lethal to tumor cells, but they can destroy normal cells in the body and produce harmful reactions such as nausea, hair loss, and fatigue [3–6]. In addition, cancer cells can develop resistance to these therapies over time, making it difficult to treat the cancer.
Targeted immunotherapy is one of the promising treatment options because of its high selectivity, efficacy, and reduced side effects [6]. T cells have the ability to identify and eliminate tumor antigens, which are presented by major histocompatibility complex (MHC) class I and class II molecules situated on the surface of antigen-presenting cells [7]. Previous studies show that peptides derived from tumor-associated antigens can induce a peptide-specific T cell immune response which destroys cancer cells [8–10]. Therefore, accurate identification of T cell antigens is an important task in developing highly efficient cancer peptide vaccines [11].
Although several in vivo and in vitro assays have been developed for the characterization of tumor T cell antigens (TTCAs), they demand a significant investment of resources, time, and labor. On the other hand, computational methods are attracting increasing attention due to their great convenience and high efficiency [12–15]. TTAgP 1.0 uses a random forest (RF) model with relatively simple features including amino acid composition (AAC) and several physicochemical features derived from peptides for the identification of TTCAs [16]. iTTCA-Hybrid uses support vector machines (SVM) and RF with five types of features including AAC, dipeptide composition, pseudo amino acid composition, distribution of amino acid properties in sequences, and physicochemical properties derived from the AAindex database [17] for TTCA prediction [18]. iTTCA-RF [19] considers more features, such as global protein sequence descriptors, grouped amino acid composition, grouped dipeptide composition, grouped tripeptide composition, and pseudo-amino acid composition. These features are then processed by MRMD (Maximum-Relevance-Maximum-Distance) [20] to determine an optimal feature subset of 263 numeric values for training and independent test based on RF. TAP 1.0 [21] applies information gain [22] on 544 features from AAindex, resulting in a reduced subset of 10 properties for training and testing based on 15 machine learning (ML) algorithms, within which quadratic discriminant analysis outperforms the rest. iTTCA-MFF [23] uses 50 physicochemical properties and pairwise energy content matrix for encoding, and LASSO (Least Absolute Shrinkage and Selection Operator) algorithm to select the optimal feature subset. The selected features are provided to an SVM model to identify TTCA.
Though existing studies have achieved a certain degree of success, some of them suffer from a limited selection of feature types and lack a systematic approach to feature subset selection and ML model optimization. More importantly, there is still potential for improvement in prediction accuracy. In this study, we present ENCAP (ENsemble predictors for tumor T Cell Antigen Prediction), a TTCA predictor constructed with ensemble classifiers on diverse sequence-based features. A total of 57 different feature types were considered for feature engineering, producing a vector of 4349 numeric values for each sequence. Next, feature ranking was performed and the feature subset for the follow-up ML process was determined with a heuristic feature subset selection algorithm. Six different ensemble classifiers, such as random forest and extreme gradient boosting, were applied in this study. The hyperparameters of the ML models were then optimized with a Bayesian optimization algorithm based on the selected feature subset. Notably, our ML models have been applied to two different TTCA datasets which consist of TTCAs obtained from TANTIGEN [24, 25] but have different criteria for the acquisition of non-TTCAs. For the first dataset, our predictor achieves a Matthews correlation coefficient (MCC) of 0.522 for independent test, a 4.8% improvement over the state-of-the-art method, iTTCA-RF. For the second dataset, our predictor achieves an MCC of 0.789 for independent test, a 13.5% improvement over the state-of-the-art method, PSRTTCA. For the third independent test dataset of 71 experimentally validated TTCAs collected from the literature, ENCAP achieves accuracy up to 0.873, producing improvements of between 12% and 25.7% over three state-of-the-art methods.
Furthermore, the feature analysis indicates that the feature subset of ENCAP is primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. It is demonstrated that the differences in the criteria for non-TTCAs between the two datasets can lead to differences in the selected feature subset, especially the compositional features. This situation also highlights the effectiveness of ENCAP in performing adaptive feature selection for the target dataset.
Materials and methods
Datasets
In this study, we employed two publicly available datasets to develop and assess ENCAP. The first dataset [18], containing 592 TTCAs and 393 non-TTCAs, was randomly partitioned into a cross validation (CV) dataset and an independent test set using an 80%/20% split. The CV dataset, termed DS1-CV, consisted of 470 TTCAs and 318 non-TTCAs; the independent test set, termed DS1-IND, consisted of 122 TTCAs and 75 non-TTCAs. The second dataset consisted of two subsets: DS2-CV, comprising 474 TTCAs and 474 non-TTCAs for cross validation, and DS2-IND, comprising an additional 118 TTCAs and 118 non-TTCAs for independent test [26]. The peptide length distributions of TTCAs and non-TTCAs for DS1 and DS2 are illustrated in S1 and S2 Figs, respectively. It can be observed that the lengths of a majority of the TTCAs and non-TTCAs fall within a range of 8 to 10 residues.
For both DS1 and DS2, the TTCAs were obtained from TANTIGEN (Tumor T-cell Antigen Database), and the non-TTCAs were obtained from Immune Epitope Database (IEDB) [27, 28]. The two datasets primarily differ in the acquisition of non-TTCAs. Non-TTCAs from DS1 were obtained with a single criterion: T cell antigens with no known association with any disease. In contrast, non-TTCAs from DS2 were collected using multiple criteria, such as antigens not associated with terms such as ‘tumor’, ‘cancer’, ‘carcinoma’, and ‘metastasis’, validation through in vitro and in vivo assays, presentation by MHC-I (Major Histocompatibility Complex-I) molecules, having Homo sapiens as antigen source and host, and a length of 9 to 14 amino acids. Given that DS1 and DS2 have been employed in multiple studies, we assessed our models using both datasets.
Apart from the two datasets, an additional test dataset of 73 experimentally validated TTCAs collected from the literature was obtained from a previous study [26]. Two of the sequences were removed because they contained non-standard amino acids, resulting in 71 TTCAs, termed DS3-IND. This dataset served as an additional test set for benchmark analysis of prediction models trained with DS1-CV and DS2-CV, as well as existing predictors.
Overview of the method
ENCAP utilizes a two-stage process to construct the prediction model, as illustrated in Fig 1. The first stage is feature engineering and the second stage is hyperparameter optimization for ML models. In the first stage, sequences from the CV dataset were converted to various compositional and physicochemical features, followed by data normalization and feature ranking. Next, the best feature number was determined with cross validation using 7 ML models. The N selected features were combined with M features based on motifs, which were obtained with a motif searching algorithm. The N+M features were deemed the best feature subset, which were then used for hyperparameter optimization of ML models in stage 2. A total of 6 ensemble prediction models were considered, including CatBoost (CB) [29], gradient boosting classifier (GBC) [30], extra trees (ET) [31], extreme gradient boosting (XGB) [32], light gradient boosting machine (LGBM) [33], and random forest (RF) [34]. Apart from the above, a statistical method, linear discriminant analysis (LDA) [35], was also used for comparison. The hyperparameters of the 7 ML models were optimized with a Bayesian optimization algorithm on the CV dataset using the selected feature subset from the first stage. After the hyperparameters for each ML model were determined, 10-fold CV and independent test were performed and the results were benchmarked with various evaluation measures. The pseudocode of ENCAP is shown in S1 Text.
Feature encoding
Converting amino acid sequences into numerical representations is crucial in accurate prediction of TTCA using machine learning algorithms. Various compositional and physicochemical features were used to encode peptides, including peptide length, amino acid composition (AAC), di-peptide composition (DPC), composition of k-spaced amino acid pairs (CKSAAP) [36], amphiphilic pseudo amino acid composition (APAAC) [37–39], composition of k-spaced amino acid group pairs (CKSAAGP) [40, 41], Composition-Transition-Distribution (CTD) [42], conjoint triad (CTriad) [43], dipeptide deviation from expected mean (DDE) [44], grouped amino acid composition (GAAC) [45], grouped di-peptide composition (GDPC) [46], distance distribution of repeats (DDR) [47], residue repeats index (RRI) [47], Shannon entropy of a residue (SER) [47], Shannon entropy of a property (SEP) [47], quasi-sequence order (QSO) [48], instability index [49], Boman index [50, 51], isoelectric point [52], and aromaticity. These features were obtained with iFeature [53], pFeature [47], and modlAMP [54] packages. The entire list of 57 feature types and the size of each of them (number of numeric values) are listed in S1 Table. A feature vector of 4349 entries was generated for each sequence.
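As an illustration of the simplest of these encodings, amino acid composition (AAC) maps a peptide to the fractions of the 20 standard residues. The sketch below is our own minimal illustration, not the iFeature implementation:

```python
# Minimal sketch of amino acid composition (AAC) encoding.
# The 20 standard residues in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Return the fraction of each standard residue in the peptide (20 values)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]
```

The other compositional encodings (DPC, CKSAAP, etc.) generalize this idea to residue pairs and gapped pairs, which is why the concatenated vector grows to thousands of entries.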
Using a diverse set of feature types for sequence encoding is advantageous because different feature types can capture distinct aspects of the peptide sequences, such as composition, physicochemical properties, and structural patterns. By combining multiple feature types, a more comprehensive representation of the sequences is generated, which can mitigate biases introduced by individual encoding methods and potentially improve the generalization ability of ML models.
Apart from the above features, a binary vector associated with sequence motifs was generated. Sequence motifs of DS1 and DS2 were obtained with STREME [55], a motif searching tool which finds ungapped motifs that are enriched in the dataset. STREME was applied to positive and negative samples separately to obtain respective list of motifs ranging from 3 to 20 amino acids. After combining the two motif lists, a binary vector was generated in which the presence and absence of a motif in a given sequence were represented by 1 and 0, respectively. The binary vector was also taken as input for machine learning models.
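The motif encoding itself is straightforward: each motif contributes one bit per sequence. A minimal sketch (the helper name is ours, and the motifs shown are illustrative):

```python
def motif_vector(seq: str, motifs: list[str]) -> list[int]:
    """Binary presence/absence vector: 1 if the motif occurs as a substring of seq."""
    return [1 if m in seq else 0 for m in motifs]
```

For example, a peptide containing FATP and LLD but not IPRHL would be encoded as [1, 0, 1] against the motif list ["FATP", "IPRHL", "LLD"].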
Normalization
Robust normalization was applied to the compositional and physicochemical features with RobustScaler from the Scikit-learn package [56] for Python. RobustScaler converts a feature value xi to yi using the following equation:

yi = (xi − Median(X)) / (Q3(X) − Q1(X))    (1)

in which Median(X), Q3(X), and Q1(X) stand for the median, the 3rd quartile, and the 1st quartile of feature X, respectively. The scaled values have a median of zero and an interquartile range of one across all sequences.
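A numerical sketch of Eq (1) with NumPy (not the production code, which uses Scikit-learn's RobustScaler) illustrates why the interquartile range is preferred over the standard deviation: a single outlier barely moves the median and quartiles.

```python
import numpy as np

def robust_scale(x: np.ndarray) -> np.ndarray:
    """Eq (1): y_i = (x_i - Median(X)) / (Q3(X) - Q1(X))."""
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return (x - med) / (q3 - q1)

# The outlier 100.0 does not distort the scale of the other values:
# median = 3, Q1 = 2, Q3 = 4, so y = (x - 3) / 2.
y = robust_scale(np.array([1.0, 2.0, 3.0, 4.0, 100.0]))
```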
Machine learning methods
Ensemble predictors were used in this study because of their accuracy, robustness to noise, and good generalizability. We used 6 ensemble predictors, including RF, ET, GBC, LGBM, XGB, and CB. In addition, a statistical model, LDA, was also employed for comparison. RF comprises an extensive ensemble of decision trees, and its prediction is obtained by averaging the outcomes produced by individual trees. ET is similar to RF except for two notable distinctions: 1) every tree is trained using the entire training dataset rather than a bootstrap sample, and 2) the top-down node splitting is determined through a random process, as opposed to relying on a scoring function like information gain or Gini impurity. GBC, LGBM, XGB, and CB are different implementations of boosting algorithms that combine several weak learners into a strong learner, in which each new model is trained with gradient descent to minimize a loss function, such as mean squared error or cross-entropy, on the errors of the preceding models. LDA attempts to find a linear combination of features that best separates the classes in a dataset while minimizing the variance within each class. Implementation of these models was based on the Scikit-learn package of Python.
Heuristic feature subset selection
The Boruta package [57] implements a wrapper approach utilizing a random forest algorithm for feature ranking. The algorithm iteratively eliminates features that are statistically less relevant than randomized features, resulting in a ranked feature list based on feature importance. The algorithm with its default parameters was applied to analyze features (except for the motifs) for DS1-CV and DS2-CV, yielding two ranked feature lists.
Next, a heuristic algorithm was applied to determine the best feature numbers of DS1-CV and DS2-CV for machine learning. For a given dataset, the ranked features were expressed as an ordered set F = {f1, f2, …, f4335}, sorted in descending order of feature importance. The feature subset of the top N features is defined as SN = {f1, f2, …, fN}, where fi ∈ F. Iterative five-fold cross validation runs using SN were performed with the six ensemble predictors, i.e., RF, ET, GBC, LGBM, XGB, and CB. The default hyperparameters of the Scikit-learn package were used for all the models. The cross validation experiments were carried out with PyCaret [58] and the results were evaluated with MCC. Let MCCN,i be the MCC for ML model i using SN. The best MCC based on SN is defined as

MCC*N = max1≤i≤6 MCCN,i    (2)

where ML model i corresponds to any of the six ML models (RF, ET, GBC, LGBM, XGB, and CB). The process was repeated iteratively, with N ranging from 50 to 410 in increments of 20. The best feature subset (BFS) is defined as

BFS = SN*, where N* = argmaxN∈{50, 70, …, 410} MCC*N    (3)
As a result, two feature subsets were generated by applying the heuristic algorithm to DS1-CV and DS2-CV. The pseudocode of the algorithm is shown in S2 Text. The purpose of feature subset selection is to determine the best feature number for each dataset while avoiding the lengthy process of machine learning model optimization; this is why the default hyperparameters of the six machine learning models were used for this task. Moreover, using ML models based on different rationales has the advantage that the selected feature subsets are not biased towards a particular model.
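The heuristic can be sketched as a simple grid search over N. In the sketch below, `evaluate_mcc` is a stand-in for the five-fold PyCaret cross validation described above, and the function names are ours:

```python
def best_feature_subset(ranked_features, models, evaluate_mcc):
    """Grid search over N in {50, 70, ..., 410}: for each N, take the best
    model MCC over the six ensembles (Eq 2), then keep the N that maximizes
    that score (Eq 3)."""
    best_n, best_mcc = None, float("-inf")
    for n in range(50, 411, 20):
        subset = ranked_features[:n]
        # MCC*_N = max over the six ensemble models
        mcc_star = max(evaluate_mcc(model, subset) for model in models)
        if mcc_star > best_mcc:
            best_n, best_mcc = n, mcc_star
    return ranked_features[:best_n], best_mcc
```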
Hyperparameter optimization and independent test
Once the feature subset was determined, hyperparameters of ML models were optimized with Optuna [59], an automatic hyperparameter optimization software package for sampling search space and pruning unpromising trials. We used the tree-structured Parzen estimator (TPE) algorithm [60], a Bayesian optimization method, as the search algorithm of Optuna because it is more likely to reach global optimum than randomized search, and is more time-efficient than grid search. The default parameters of TPE were used during the process. After the hyperparameters were determined for each model, ten-fold cross validation was performed to evaluate each predictor.
To perform independent test, an additional training process using the optimized parameters of each model on the entire DS1-CV or DS2-CV was performed. The resulting models, with the advantage of using 10% more training data compared to the models obtained from cross validation, were used to conduct independent test on DS1-IND and DS2-IND.
Performance evaluation
For benchmark comparison, prediction results were evaluated with accuracy, precision, recall (sensitivity), specificity, F1-score, and MCC (Matthews Correlation Coefficient), defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision = TP / (TP + FP)    (5)

Recall = TP / (TP + FN)    (6)

Specificity = TN / (TN + FP)    (7)

F1-score = 2 × Precision × Recall / (Precision + Recall)    (8)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (9)

where TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy, precision, recall, specificity, and F1-score range from 0 to 1, where a larger value indicates better predictive capability. MCC ranges from -1 to 1, representing completely negative and completely positive correlations, respectively, and an MCC of 0 represents random prediction. In addition, AUC (Area Under the receiver operating characteristic Curve) [61], a non-parametric and threshold-independent measure, was also included for evaluation. An AUC of 0.5 corresponds to random guessing, whereas values approaching 1 indicate good prediction performance.
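The confusion-matrix measures of Eqs (4)–(9) can be computed in a few lines; this generic helper (not ENCAP code) makes the definitions concrete:

```python
import math

def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall, specificity, F1-score, and MCC (Eqs 4-9)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

Note that MCC, unlike accuracy, stays at 0 for a classifier that guesses at random on a balanced confusion matrix, which is why it is the primary selection criterion in this study.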
Results
Selected features and motifs
Cross validation results of DS1-CV and DS2-CV using different feature numbers were evaluated with MCC, and the results are shown in S3 Fig. The analysis reveals that selecting 150 features for DS1-CV and 210 features for DS2-CV results in the highest MCC, thereby constituting the selected feature subset. Using the selected feature subset, the best MCCs achieved by the 7 machine learning models are 0.535 and 0.757 for DS1-CV and DS2-CV, respectively.
STREME generated 9 motifs for TTCAs from DS1-CV, including FATP, AEEA, ALQP, AVI, LMK, SLADTN, VFGI, RLAE, LLD, and 2 motifs for non-TTCAs from DS1-CV, including IPRHL and ALAG. For DS2-CV, STREME yielded 3 motifs from TTCAs, including LMK, ADV, and AGIGIL, and 5 motifs from non-TTCAs, including AQID, EFP, AVADE, VDEV, and ITD. Interestingly, the motifs from both datasets are quite different because the non-TTCA sequences were collected based on different criteria, resulting in dissimilar sequence patterns for the negatives. The sizes of feature vectors for DS1-CV and DS2-CV are 161 (150+11) and 218 (210+8), respectively.
Performance evaluation for cross validation
The hyperparameters of the 7 ML models trained with DS1-CV are listed in S2 Table. Evaluation results of cross validation on DS1-CV for all the ML models are shown in Table 1. It can be observed that ET yields accuracy, precision, F1-score, and MCC of 0.760, 0.751, 0.821, and 0.488, respectively, all of which are the highest among all the methods. CB and GBC have AUCs of 0.793 and 0.803, respectively, comparable to ET, but their MCCs are 3.3% and 3.5% lower than that of ET. XGB has the highest recall but suffers from the lowest precision among all the methods.
Bold face indicates the highest value among all the methods.
The hyperparameters of the 7 ML models trained with DS2-CV are listed in S3 Table. Evaluation results of cross validation on DS2-CV for all the ML models are shown in Table 2. ET yields accuracy, precision, F1-score, and MCC of 0.852, 0.856, 0.852, and 0.708, respectively, all of which are again the highest among all the methods. GBC, RF, CB, and XGB show marginally lower MCC and AUC (within a 1% difference), indicating that these ensemble machine learning models are comparable to ET in cross validation. Comparing the evaluation results of DS1-CV with those of DS2-CV, it is obvious that the specificities of all the models on DS1-CV are below 0.6, significantly lower than those obtained on DS2-CV, which exceed 0.82. Moreover, all the models produce ~10% lower precision for DS1 compared to DS2. These observations indicate a higher propensity for false positive predictions on DS1, a characteristic specific to the dataset.
Bold face indicates the highest value among all the methods.
Performance evaluation for independent test
Benchmark results for independent test on DS1-IND with the 7 ML models are shown in Table 3. Results for iTTCA-Hybrid, TTAgP 1.0, and LDA are also listed in the table for comparison because they were developed and benchmarked on the same dataset. It can be seen that the best performing method, GBC, generates the highest MCC of 0.522 and the highest accuracy of 0.766. CB and ET yield slightly lower MCCs of 0.508 and 0.488, respectively. We further analyzed the correlation between the prediction confidence (or probability output) of each model and the true positive rate, calculated as the number of actual TTCAs divided by the total number of sequences whose prediction confidence falls within a given range. It can be observed from S4A Fig that the prediction confidence of each model is in general positively correlated with the true positive rate. LDA has a higher true positive rate in the confidence range of 0 to 0.2 than all the other methods; that is, many actual TTCAs receive low confidence from LDA, which explains why LDA produces the lowest MCC for independent test. GBC, CB, and ET outperform the state-of-the-art method, iTTCA-RF, in MCC by 4.8%, 3.4%, and 1.4%, respectively. iTTCA-RF has the highest precision and specificity among all the methods, and yet it generates a recall of 0.836, considerably lower than the recalls of GBC, CB, and ET, which range from 0.975 to 1.000. This is the major reason why GBC, CB, and ET outperform iTTCA-RF in terms of MCC.
Bold face indicates the highest value among all the methods.
Table 4 shows the benchmark results for independent test on DS2-IND with the 7 ML models and two other existing methods, PSRTTCA and TAP 1.0, both of which were developed and benchmarked on the same dataset. It can be seen that ET is the best performing model: its MCC, AUC, accuracy, and recall are 0.789, 0.960, 0.890, and 0.966, respectively, all of which are the highest among all the methods. All 7 models present good correlations between the prediction confidence and the true positive rate, as illustrated in S4B Fig. Similar to the prediction results on DS1-IND, ET, GBC, and CB are the top 3 methods judging by MCC, and they outperform the state-of-the-art method, PSRTTCA, by 13.5%, 12.7%, and 10.5% in MCC, respectively. Moreover, the improvements of ET, GBC, and CB over PSRTTCA in AUC are 7.5%, 6.6%, and 7.2%, respectively. The precision of PSRTTCA is 0.827, comparable to that of ET, GBC, and CB, and yet its recall is 0.822, which is 14.4%, 14.4%, and 11.0% lower than that of the three methods, respectively. This explains why PSRTTCA has a much lower MCC than the three methods. The above results on DS1-IND and DS2-IND indicate that significantly improved recall is the major reason why the proposed methods compare favorably with the state-of-the-art approaches.
Bold face indicates the highest value among all the methods.
To predict sequences from DS3-IND (consisting of 71 experimentally validated TTCAs collected from the literature), we employed ET and GBC trained with DS1-CV (termed DS1-ET and DS1-GBC, respectively) because the two are the best models for cross validation and independent test on DS1, respectively, judging by MCC. ET trained with DS2-CV, termed DS2-ET, was also used for benchmark comparison because the model outperforms the rest for both cross validation and independent test on DS2. DS1-ET, DS1-GBC, and DS2-ET correctly predict 62, 60, and 58 TTCAs from the dataset, respectively, corresponding to accuracies of 0.873, 0.845, and 0.817. On the other hand, PSRTTCA, iTTCA-RF, iTTCA-Hybrid, and TAP 1.0 predict 55, 52, 52, and 45 TTCAs out of the 71 actual TTCAs in DS3-IND. The accuracy of the four existing methods ranges from 0.616 to 0.753, indicating that our proposed methods significantly outperform existing methods in prediction accuracy for DS3-IND.
Discussion
Data visualization with selected features
In this study, each sequence was originally encoded with various biological, physicochemical, and compositional features, resulting in a feature vector of 4349 numeric values. The improved performances of our predictors on both DS1-IND and DS2-IND are largely attributed to the selection of numerical features. Through feature engineering, subsets of 150 and 210 features were obtained for DS1 and DS2, respectively. t-distributed stochastic neighbor embedding (t-SNE) [62, 63] was applied to validate the efficacy of the selected feature subsets for discriminating TTCAs. Fig 2 illustrates the t-SNE distributions of TTCAs and non-TTCAs on 2D planes using either the complete set of 4349 features or the selected feature subsets. Within these plots, red and blue spots represent TTCAs and non-TTCAs, respectively. Notably, Fig 2B and 2D highlight the enhanced separation achieved by the selected feature subsets, facilitating clearer differentiation between the distributions of TTCAs and non-TTCAs. Conversely, when employing all 4349 features for t-SNE analysis, the distributions of TTCAs and non-TTCAs exhibit serious overlap, as shown in Fig 2A and 2C. These results indicate the importance and efficacy of the selected feature subsets, which can benefit ML models in accurate prediction of TTCAs.
t-SNE distributions of (A) DS1-CV using 4349 features, (B) DS1-CV using 150 selected features, (C) DS2-CV using 4349 features, and (D) DS2-CV using 210 selected features.
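A minimal t-SNE sketch with Scikit-learn on synthetic two-class data illustrates the projection step; the real analysis used the 4349-feature and selected-subset encodings of the actual peptides, and the dimensions here are arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-ins for TTCA / non-TTCA feature vectors (10-D here).
pos = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
neg = rng.normal(loc=3.0, scale=1.0, size=(50, 10))
X = np.vstack([pos, neg])

# Project to 2-D for visualization, as in Fig 2.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

The 2-D `embedding` can then be scatter-plotted with class colors to reproduce the style of Fig 2.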
Analysis of selected feature types
The entire lists of the 161 selected features for DS1 and the 218 selected features for DS2 are shown in S4 and S5 Tables, respectively. The former belong to 16 different feature types and the latter belong to 25 different feature types (excluding motifs), among which 8 feature types are in common. The top 5 most frequent feature types of DS1 and DS2 are shown in Table 5. It can be observed that 4 of the top 5 most frequent feature types for DS1 and DS2 belong to the common feature types, indicating a certain degree of similarity between the two datasets. The entire lists of feature types used for machine learning on DS1 and DS2 are shown in S6 Table. These feature types are mostly physicochemical properties; for example, CTD (Composition-Transition-Distribution) [42] is constructed with properties including hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility. Other selected features of physicochemical properties include ABHPRK (Acidic, Basic, Hydrophobic, Polar, aRomatic, Kink-inducer) [64], MSW (amino acid scale based on a principal component analysis of the Molecular Surface-based Weighted holistic invariant molecular descriptor) [65], Z5 [66], and Z3 [67]. Apart from the general physicochemical properties, hydrophobicity appears to be one of the important attributes among the selected features. For example, the Aliphatic_Index [68] and OVPC_Aliphatic (OVerlapping Property Composition) [69] for DS1 and APAAC (Amphiphilic Pseudo-Amino Acid Composition) [48] for DS2 are related to hydrophobicity.
The major difference between the feature types for DS1 and DS2 is the presence of compositional features. For example, DS1 consists of 91 DDE features (Dipeptide Deviation from Expected mean) [70] while none are selected for DS2. DS2 consists of a much smaller number of compositional features, such as 5 APAAC features (Amphiphilic Pseudo-Amino Acid Composition), 2 AAC features (Amino Acid Composition), and 2 GAAC (Grouped Amino Acid Composition) [45] features. Such difference is attributed to the feature ranking algorithm, Boruta, which ranks the contribution of each feature based on its discriminating capability for a given dataset. Specifically, the differences in the criteria for non-TTCAs between DS1 and DS2 lead to different sequence patterns, thus altering the feature ranks generated by Boruta. The situation also highlights the flexibility of our approach in selecting suitable features for the target dataset.
Feature importance analysis
SHAP (SHapley Additive exPlanations) [71] analysis is known as a powerful framework for providing information about how features affect the output of a model. Positive SHAP values signify that a feature contributed to augmenting the model’s output, making a positive prediction more likely, whereas negative values indicate the feature’s role in diminishing the output, increasing the likelihood of a negative prediction. The absolute magnitude of the SHAP values quantifies the extent of each feature’s influence, namely, feature importance. Fig 3 illustrates the SHAP values for the top 20 features based on DS1 and DS2. It can be seen that the top 20 features for DS1 (Fig 3A) are mostly associated with physicochemical properties such as MSW, Z3, and Z5. On the other hand, 8 of the top 20 features for DS2 (Fig 3B) are OVPs corresponding to the hydrophobic, polar, charged, and aromatic residues at the N-terminus of peptides (N1, N2, and N3). Several CTD features corresponding to solvent accessibility and hydrophobicity also have high SHAP values for DS2. The analysis reveals different feature contributions for the two datasets, reflecting different underlying patterns or characteristics that are relevant to the prediction task.
The beeswarm plots of SHAP values for the top 20 features based on A) DS1 and B) DS2.
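The additive attribution underlying SHAP can be made concrete on a toy model: for a handful of features, the Shapley value of each feature is its average marginal contribution over all orderings in which features are added. The sketch below computes this by brute force with a hypothetical value function `f` standing in for a trained model (real SHAP implementations use efficient approximations such as TreeExplainer):

```python
from itertools import permutations

def shapley_values(f, features):
    """Exact Shapley values: average marginal contribution of each
    feature over all orderings in which features are added."""
    phi = {x: 0.0 for x in features}
    orders = list(permutations(features))
    for order in orders:
        present = frozenset()
        for x in order:
            phi[x] += f(present | {x}) - f(present)  # marginal contribution
            present = present | {x}
    return {x: v / len(orders) for x, v in phi.items()}

# Toy "model": hydrophobicity raises the score, charge lowers it,
# and the two interact when both are present.
def f(s):
    score = 0.0
    if "hydrophobic" in s: score += 2.0
    if "charged" in s:     score -= 1.0
    if {"hydrophobic", "charged"} <= s: score += 0.5
    return score

phi = shapley_values(f, ["hydrophobic", "charged", "aromatic"])
```

By construction the attributions sum to the difference between the model output with all features present and with none, which is the additivity property that makes beeswarm plots like Fig 3 interpretable.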
Peptide property with respect to prediction capability
Prediction results of the three most accurate models on the independent tests (CB, GBC, and ET) were further analyzed with respect to three peptide properties, characterized by the ratios of hydrophobic, hydrophilic, and charged residues within a peptide. Hydrophobic amino acids comprise V, I, L, M, F, W, and C; hydrophilic amino acids comprise R, N, D, E, Q, H, K, S, and T; charged amino acids comprise E, D, R, K, and H. The analysis was applied to peptides from DS1-IND and DS2-IND, and the results are illustrated in Fig 4. It can be observed that the prediction capability of the ML models improves for peptides with fewer hydrophobic residues and more hydrophilic and charged residues. Moreover, the trends for DS1-IND (Fig 4A–4C) and DS2-IND (Fig 4D–4F) are remarkably similar. The analysis also suggests directions for future enhancement, particularly refining the prediction accuracy for peptides with a higher number of hydrophobic residues and fewer hydrophilic and charged residues.
Panels A, B, and C correspond to ratios of hydrophobic, hydrophilic, and charged amino acids for peptides from DS1-IND. Panels D, E, and F correspond to ratios of hydrophobic, hydrophilic, and charged amino acids for peptides from DS2-IND.
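The three residue-ratio properties used in this analysis follow directly from the residue classes listed above and can be computed from a sequence as follows (a minimal sketch; the function name is our own):

```python
# Residue classes as defined in the analysis above.
HYDROPHOBIC = set("VILMFWC")
HYDROPHILIC = set("RNDEQHKST")
CHARGED = set("EDRKH")

def residue_ratios(peptide):
    """Return the fractions of hydrophobic, hydrophilic, and charged
    residues in a peptide sequence (one-letter amino acid codes)."""
    n = len(peptide)
    return {
        "hydrophobic": sum(aa in HYDROPHOBIC for aa in peptide) / n,
        "hydrophilic": sum(aa in HYDROPHILIC for aa in peptide) / n,
        "charged": sum(aa in CHARGED for aa in peptide) / n,
    }

# Example: V and L are hydrophobic; K and E are both hydrophilic and charged.
ratios = residue_ratios("VLKEKA")
```

Note that the classes overlap (all five charged residues are also hydrophilic), so the three ratios need not sum to one.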
Conclusion
Cancer immunotherapy is a type of cancer treatment that uses the body’s own immune system to fight cancer cells. Accurate identification of TTCAs is an important task for cancer immunotherapy because peptides derived from TTCAs can induce a peptide-specific T cell immune response that neutralizes tumor cells. In this study, we present ENCAP, an ML method built with ensemble classifiers and diverse sequence features for TTCA prediction. In the feature engineering stage, sequences were encoded into a total of 4349 numeric values from 57 different feature types, followed by feature ranking with Boruta and a heuristic feature subset selection process. In addition, sequence motifs from TTCAs were used to generate bit vectors as part of the features. The second stage used the selected feature subset to optimize the hyperparameters of the 7 ML models (CB, GBC, ET, XGB, LGBM, RF, and LDA) with a Bayesian optimization approach. ENCAP was applied to two public TTCA datasets, DS1 and DS2, which use quite different criteria for the acquisition of non-TTCAs. For DS1-IND, the best ML model, XGB, yields 0.766, 0.729, 0.992, and 0.522 for accuracy, precision, recall, and MCC, respectively. For DS2-IND, the best ML model, ET, yields 0.890, 0.838, 0.966, and 0.789 for accuracy, precision, recall, and MCC, respectively. The improvements in MCC over the state-of-the-art methods on the independent tests of DS1 and DS2 are 4.8% and 13.5%, respectively. This improvement is largely attributable to the considerable increase in recall for the proposed models. For DS3, the dataset of 71 experimentally validated TTCAs gathered from the literature, our top-performing models exhibit accuracy ranging from 0.817 to 0.873, whereas existing methods yield accuracy within the range of 0.616 to 0.753. These results suggest that our models produce significant improvements over existing methods in identifying TTCAs.
The efficacy of the selected features is further validated with t-SNE plots, in which the 150 and 210 selected features from DS1 and DS2, respectively, demonstrate enhanced separation compared to the original features. Our feature analysis reveals that the selected feature subsets consist mostly of physicochemical properties, along with several features specifically corresponding to hydrophobicity and amphiphilicity. The feature subsets for DS1 and DS2 mainly differ in the selection of compositional features. This shows that different criteria for non-TTCAs can result in different sequence patterns, which in turn affect the discriminating capability of features. The SHAP analysis likewise reveals different feature contributions for the two datasets, reflecting different underlying patterns or characteristics relevant to the prediction task.
This study is the first to conduct separate benchmark comparisons of predictors on two TTCA datasets. Our results not only demonstrate significant improvements in prediction accuracy on three independent tests but also provide evidence for the effectiveness of ENCAP in adaptively selecting features for the target dataset. Furthermore, we demonstrate that prediction accuracy is limited for peptides with a higher proportion of hydrophobic residues and a lower occurrence of hydrophilic and charged residues, indicating a need for further improvement. As more experimentally validated TTCAs become available, employing advanced deep learning architectures and large language models may enhance the performance of TTCA prediction. These techniques could potentially capture complex sequence patterns and long-range dependencies that are challenging for traditional machine learning models. Ultimately, continued research at the intersection of cancer immunotherapy and machine learning could pave the way for more effective identification of TTCA candidates, accelerating the development of novel immunotherapeutic strategies against cancer.
Supporting information
S1 Fig.
Peptide length distributions of TTCAs and non-TTCAs for A) DS1-CV and B) DS1-IND.
https://doi.org/10.1371/journal.pone.0307176.s001
(DOCX)
S2 Fig.
Peptide length distributions of TTCAs and non-TTCAs for A) DS2-CV and B) DS2-IND.
https://doi.org/10.1371/journal.pone.0307176.s002
(DOCX)
S3 Fig.
MCCs of the best machine learning model evaluated with cross validation using different feature numbers on A) DS1-CV and B) DS2-CV.
https://doi.org/10.1371/journal.pone.0307176.s003
(DOCX)
S4 Fig.
Prediction confidence of 7 ML models on (A) DS1-IND and (B) DS2-IND.
https://doi.org/10.1371/journal.pone.0307176.s004
(DOCX)
S1 Table. Details of 57 feature types used in the study.
https://doi.org/10.1371/journal.pone.0307176.s005
(DOCX)
S2 Table. Hyperparameters of ML models trained with DS1-CV.
https://doi.org/10.1371/journal.pone.0307176.s006
(DOCX)
S3 Table. Hyperparameters of ML models trained with DS2-CV.
https://doi.org/10.1371/journal.pone.0307176.s007
(DOCX)
S4 Table. List of the 161 selected features for DS1.
https://doi.org/10.1371/journal.pone.0307176.s008
(DOCX)
S5 Table. List of the 218 selected features for DS2.
https://doi.org/10.1371/journal.pone.0307176.s009
(DOCX)
S6 Table. The selected numbers and sizes of feature types from the selected feature subsets of DS1 and DS2.
https://doi.org/10.1371/journal.pone.0307176.s010
(DOCX)
S2 Text. Pseudocode of the feature subset selection algorithm.
https://doi.org/10.1371/journal.pone.0307176.s012
(DOCX)
References
- 1. Bray F, Laversanne M, Weiderpass E, Soerjomataram I. The ever-increasing importance of cancer as a leading cause of premature death worldwide. Cancer. 2021;127: 3029–3030. pmid:34086348
- 2. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71: 209–249. pmid:33538338
- 3. Harris F, Dennison SR, Singh J, Phoenix DA. On the selectivity and efficacy of defense peptides with respect to cancer cells. Med Res Rev. 2013;33: 190–234. pmid:21922503
- 4. Thundimadathil J. Cancer treatment using peptides: current therapies and future prospects. J Amino Acids. 2012;2012: 967347. pmid:23316341
- 5. Yu L, Wang M, Yang Y, Xu F, Zhang X, Xie F, et al. Predicting therapeutic drugs for hepatocellular carcinoma based on tissue-specific pathways. PLOS Comput Biol. 2021;17: e1008696. pmid:33561121
- 6. Couzin-Frankel J. Cancer Immunotherapy. Science. 2013;342: 1432–1433. pmid:24357284
- 7. Liu Y, Ouyang X, Xiao Z-X, Zhang L, Cao Y. A Review on the Methods of Peptide-MHC Binding Prediction. Curr Bioinforma. 2020;15: 878–888.
- 8. Mizukoshi E, Nakamoto Y, Arai K, Yamashita T, Sakai A, Sakai Y, et al. Comparative analysis of various tumor-associated antigen-specific t-cell responses in patients with hepatocellular carcinoma. Hepatol Baltim Md. 2011;53: 1206–1216. pmid:21480325
- 9. Yang J, Zhang Q, Li K, Yin H, Zheng J-N. Composite peptide-based vaccines for cancer immunotherapy (Review). Int J Mol Med. 2015;35: 17–23. pmid:25395173
- 10. Kumai T, Lee S, Cho H-I, Sultan H, Kobayashi H, Harabuchi Y, et al. Optimization of Peptide Vaccines to Induce Robust Antitumor CD4 T-cell Responses. Cancer Immunol Res. 2017;5: 72–83. pmid:27941004
- 11. Liu W, Tang H, Li L, Wang X, Yu Z, Li J. Peptide‐based therapeutic cancer vaccine: Current trends in clinical application. Cell Prolif. 2021;54: e13025. pmid:33754407
- 12. Ali F, Akbar S, Ghulam A, Maher ZA, Unar A, Talpur DB. AFP-CMBPred: Computational identification of antifreeze proteins by extending consensus sequences into multi-blocks evolutionary information. Comput Biol Med. 2021;139: 105006. pmid:34749096
- 13. Raza A, Uddin J, Almuhaimeed A, Akbar S, Zou Q, Ahmad A. AIPs-SnTCN: Predicting Anti-Inflammatory Peptides Using fastText and Transformer Encoder-Based Hybrid Word Embedding with Self-Normalized Temporal Convolutional Networks. J Chem Inf Model. 2023;63: 6537–6554. pmid:37905969
- 14. Akbar S, Hayat M, Tahir M, Khan S, Alarfaj FK. cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model. Artif Intell Med. 2022;131: 102349. pmid:36100346
- 15. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151: 102860. pmid:38552379
- 16. Beltrán Lissabet JF, Herrera Belén L, Farias JG. TTAgP 1.0: A computational tool for the specific prediction of tumor T cell antigens. Comput Biol Chem. 2019;83: 107103. pmid:31437642
- 17. Kawashima S, Ogata H, Kanehisa M. AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999;27: 368–369. pmid:9847231
- 18. Charoenkwan P, Nantasenamat C, Hasan MM, Shoombuatong W. iTTCA-Hybrid: Improved and robust identification of tumor T cell antigens by utilizing hybrid feature representation. Anal Biochem. 2020;599: 113747. pmid:32333902
- 19. Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med. 2021;19. pmid:34706730
- 20. He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinforma. 2020;15: 1213–1221.
- 21. Herrera-Bravo J, Herrera Belén L, Farias JG, Beltrán JF. TAP 1.0: A robust immunoinformatic tool for the prediction of tumor T-cell antigens based on AAindex properties. Comput Biol Chem. 2021;91: 107452. pmid:33592504
- 22. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1: 81–106.
- 23. Zou H, Yang F, Yin Z. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics. 2022;74: 447–454. pmid:35246701
- 24. Olsen LR, Tongchusak S, Lin H, Reinherz EL, Brusic V, Zhang GL. TANTIGEN: a comprehensive database of tumor T cell antigens. Cancer Immunol Immunother CII. 2017;66: 731–735. pmid:28280852
- 25. Zhang G, Chitkushev L, Olsen LR, Keskin DB, Brusic V. TANTIGEN 2.0: a knowledge base of tumor T cell antigens and epitopes. BMC Bioinformatics. 2021;22: 40. pmid:33849445
- 26. Charoenkwan P, Pipattanaboon C, Nantasenamat C, Hasan MM, Moni MA, Lio’ P, et al. PSRTTCA: A new approach for improving the prediction and characterization of tumor T cell antigens using propensity score representation learning. Comput Biol Med. 2022;152: 106368. pmid:36481763
- 27. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015;43: D405–D412. pmid:25300482
- 28. Fleri W, Paul S, Dhanda SK, Mahajan S, Xu X, Peters B, et al. The Immune Epitope Database and Analysis Resource in Epitope Discovery and Synthetic Vaccine Design. Front Immunol. 2017;8. Available: https://www.frontiersin.org/articles/10.3389/fimmu.2017.00278 pmid:28352270
- 29. Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. ArXiv. 2018 [cited 3 Apr 2023]. Available: https://www.semanticscholar.org/paper/CatBoost%3A-gradient-boosting-with-categorical-Dorogush-Ershov/f5fbcd9ff72c5820a21b9d6871d2a3d475c9bb7f
- 30. Friedman JH. Greedy function approximation: A gradient boosting machine. Ann Stat. 2001;29: 1189–1232.
- 31. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63: 3–42.
- 32. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. pp. 785–794.
- 33. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017.
- 34. Ho TK. Random decision forests. Proceedings of 3rd International Conference on Document Analysis and Recognition. 1995. pp. 278–282 vol.1.
- 35. Fix E, Hodges J. Discriminatory Analysis—Nonparametric Discrimination: Consistency Properties. University of California, Berkeley; 1951 Feb.
- 36. Chen K, Kurgan LA, Ruan J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct Biol. 2007;7: 25. pmid:17437643
- 37. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43: 246–255. pmid:11288174
- 38. Pace CN, Fu H, Fryar KL, Landua J, Trevino SR, Shirley BA, et al. Contribution of Hydrophobic Interactions to Protein Stability. J Mol Biol. 2011;408: 514–528. pmid:21377472
- 39. Chou K-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21: 10–19. pmid:15308540
- 40. Liu L-M, Xu Y, Chou K-C. iPGK-PseAAC: Identify Lysine Phosphoglycerylation Sites in Proteins by Incorporating Four Different Tiers of Amino Acid Pairwise Coupling Information into the General PseAAC. Med Chem Shariqah United Arab Emir. 2017;13: 552–559. pmid:28521678
- 41. Chen X, Qiu J-D, Shi S-P, Suo S-B, Huang S-Y, Liang R-P. Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinforma Oxf Engl. 2013;29: 1614–1622. pmid:23626001
- 42. Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A. 1995;92: 8700–8704. pmid:7568000
- 43. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104: 4337–4341. pmid:17360525
- 44. Saravanan V, Gautham N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. Omics J Integr Biol. 2015;19: 648–658. pmid:26406767
- 45. Lee T-Y, Lin Z-Q, Hsieh S-J, Bretaña NA, Lu C-T. Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics. 2011;27: 1780–1787. pmid:21551145
- 46. Sun J-N, Yang H-Y, Yao J, Ding H, Han S-G, Wu C-Y, et al. Prediction of Cyclin Protein Using Two-Step Feature Selection Technique. IEEE Access. 2020;8: 109535–109542.
- 47. Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, et al. Pfeature: A Tool for Computing Wide Range of Protein Features and Building Prediction Models. J Comput Biol J Comput Mol Cell Biol. 2022. pmid:36251780
- 48. Chou KC. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun. 2000;278: 477–483. pmid:11097861
- 49. Guruprasad K, Reddy BV, Pandit MW. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990;4: 155–161. pmid:2075190
- 50. Boman HG, Wade D, Boman IA, Wåhlin B, Merrifield RB. Antibacterial and antimalarial properties of peptides that are cecropin-melittin hybrids. FEBS Lett. 1989;259: 103–106. pmid:2689223
- 51. Wolfenden R. Experimental measures of amino acid hydrophobicity and the topology of transmembrane and globular proteins. J Gen Physiol. 2007;129: 357–362. pmid:17438117
- 52. Haynes WM, editor. CRC Handbook of Chemistry and Physics. 95th ed. Boca Raton: CRC Press; 2014. https://doi.org/10.1201/b17118
- 53. Chen Z, Zhao P, Li F, Leier A, Marquez-Lago TT, Wang Y, et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinforma Oxf Engl. 2018;34: 2499–2502. pmid:29528364
- 54. Müller AT, Gabernet G, Hiss JA, Schneider G. modlAMP: Python for antimicrobial peptides. Bioinforma Oxf Engl. 2017;33: 2753–2755. pmid:28472272
- 55. Bailey TL. STREME: accurate and versatile sequence motif discovery. Birol I, editor. Bioinformatics. 2021;37: 2834–2840. pmid:33760053
- 56. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
- 57. Kursa MB, Rudnicki WR. Feature Selection with the Boruta Package. J Stat Softw. 2010;36: 1–13.
- 58. Ali M. PyCaret: An open source, low-code machine learning library in Python. 2020. Available: https://www.pycaret.org
- 59. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, NY, USA: Association for Computing Machinery; 2019. pp. 2623–2631. https://doi.org/10.1145/3292500.3330701
- 60. Bergstra J, Bardenet R, Bengio Y, Kégl B. Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2011.
- 61. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143: 29–36. pmid:7063747
- 62. van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9: 2579–2605.
- 63. van der Maaten L. Accelerating t-SNE using Tree-Based Algorithms. J Mach Learn Res. 2014;15: 3221–3245.
- 64. Timmons PB, Hewage CM. ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides. Biomed Pharmacother. 2021;133: 111051. pmid:33254015
- 65. Zaliani A, Gancia E. MS-WHIM Scores for Amino Acids: A New 3D-Description for Peptide QSAR and QSPR Studies. J Chem Inf Comput Sci. 1999;39: 525–533.
- 66. Sandberg M, Eriksson L, Jonsson J, Sjöström M, Wold S. New Chemical Descriptors Relevant for the Design of Biologically Active Peptides. A Multivariate Characterization of 87 Amino Acids. J Med Chem. 1998;41: 2481–2491. pmid:9651153
- 67. Hellberg S, Sjoestroem M, Skagerberg B, Wold S. Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem. 1987;30: 1126–1135. pmid:3599020
- 68. Ikai A. Thermostability and Aliphatic Index of Globular Proteins. J Biochem (Tokyo). 1980;88: 1895–1898. pmid:7462208
- 69. Manavalan B, Basith S, Shin TH, Wei L, Lee G. mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation. Bioinforma Oxf Engl. 2019;35: 2757–2765. pmid:30590410
- 70. Garg A, Raghava GPS. A Machine Learning Based Method for the Prediction of Secretory Proteins Using Amino Acid Composition, Their Order and Similarity-Search. In Silico Biol. 2008;8: 129–140. pmid:18928201
- 71. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. pp. 4768–4777.