Abstract
Interleukin (IL)-13 is a recently identified cytokine. Since IL-13 contributes to the severity of COVID-19 and alters crucial biological processes, it is urgent to explore novel molecules or peptides capable of inducing IL-13. Computational prediction has received attention as a complementary method to in-vivo and in-vitro experimental identification of IL-13-inducing peptides, because experimental identification is time-consuming, laborious, and expensive. A few computational tools have been presented, including IL13Pred and iIL13Pred. To increase the prediction capability, we have developed PredIL13, a cutting-edge ensemble learning method with the latest ESM-2 protein language model. This method stacked the probability scores output by 168 single-feature machine/deep learning models, and then trained a logistic regression-based meta-classifier with the stacked probability score vectors. The key technology was to implement ESM-2 and to select the optimal single-feature models according to their absolute weight coefficient for logistic regression (AWCLR), an indicator of the importance of each single-feature model. In particular, the sequential deletion of single-feature models based on the iterative AWCLR ranking (SDIWC) method constructed the meta-classifier consisting of the top 16 single-feature models, named PredIL13, while considering the model's accuracy. PredIL13 greatly outperformed the state-of-the-art predictors and thus is an invaluable tool for accelerating the detection of IL-13-inducing peptides within the human genome.
Citation: Kurata H, Harun-Or-Roshid M, Tsukiyama S, Maeda K (2024) PredIL13: Stacking a variety of machine and deep learning methods with ESM-2 language model for identifying IL13-inducing peptides. PLoS ONE 19(8): e0309078. https://doi.org/10.1371/journal.pone.0309078
Editor: Shahid Akbar, Abdul Wali Khan University Mardan, PAKISTAN
Received: May 23, 2024; Accepted: August 5, 2024; Published: August 22, 2024
Copyright: © 2024 Kurata et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source codes are freely accessible at https://github.com/kuratahiroyuki/PredIL13. The web application is freely available at http://kurata35.bio.kyutech.ac.jp/PredIL13.
Funding: This work was supported by the Japan Society for the Promotion of Science (JSPS), grant number 22H03688. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
A cascade of cytokines, marked by the overproduction of inflammatory signaling molecules such as Interleukin (IL)-1, IL-2, IL-6, IL-13, IL-17, Interferon-γ, and TNF-α, has been identified as a physiological and pathological factor intricately linked to the severity of Coronavirus disease 2019 (COVID-19) [1, 2]. In-vitro experimental investigations, supported by in-vivo data and enriched by insights from single-cell RNA-sequencing, revealed that inflammatory agents present in the blood serum of COVID-19 patients trigger dysfunction in endothelial cells. Such dysfunction is closely connected to COVID-19-related endotheliopathy, providing evidence of the detrimental functions caused by inflammatory cytokines [3].
IL-13 has emerged as one of the cytokines recently identified as contributors to the severity of COVID-19 [4, 5]. IL-13, a versatile cytokine, is released by T-Helper 2 (Th-2) cells, basophils, mast cells, eosinophils, and natural killer cells. Analogous to IL-4, this cytokine plays a crucial role in Th-2-mediated immunity, encompassing responses to allergic reactions and parasitic infections [6]. Indeed, IL-13 triggers the class switch to IgG4 and IgE antibodies in naive human B cells [7] and proves indispensable in expelling gastrointestinal nematodes [8]. It stands out as a pivotal mediator of the airway inflammation observed in conditions such as asthma and reactive airway diseases [9].
Since IL-13 contributes to the severity of COVID-19 and alters crucial biological processes, it is urgent to explore novel molecules capable of modulating IL-13. Computational or in-silico prediction has received attention as a complementary method to in-vivo and in-vitro experimental identification of IL-13-inducing peptides [10]. To date, only a few predictors have been presented, and the development of IL-13 predictors has just begun. Jain et al. introduced the first predictor, IL13Pred, in 2022, which was designed to categorize peptides into those inducing IL-13 and those lacking this property [11]. They presented a benchmark dataset comprising 313 experimentally validated IL-13-inducing peptides retrieved from the immune epitope database [12]. In addition, 2908 non-IL-13-inducing peptides were extracted from the same database as the negative dataset. They used the Pfeature algorithm to compute 9151 features for each peptide and executed feature selection using a linear support vector classifier with the L1 penalty, identifying 95 relevant features. They then employed a decision tree-based algorithm to predict IL-13-inducing peptides. Arora et al. developed iIL13Pred, which used seven conventional machine learning (ML) classifiers: decision tree, Gaussian Naïve Bayes, k-Nearest Neighbour (KN), Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGB), while introducing a multivariate feature selection approach [13].
In this study, we have developed PredIL13, a cutting-edge ensemble learning model that accurately identifies IL-13-inducing peptides, as shown in Fig 1. This method stacks the probability scores generated by 168 single-feature machine/deep learning models, and then trains a logistic regression-based meta-classifier with the stacked probability score vectors. The key technology was to implement the Evolutionary Scale Modeling-2 (ESM-2) [14] language model and to select the optimal single-feature models according to their absolute weight coefficient for logistic regression (AWCLR), an indicator of the importance of each single-feature model. In particular, the sequential deletion of single-feature models based on the iterative AWCLR ranking (SDIWC) method constructed the meta-classifier with the top 16 single-feature models, named PredIL13, while considering the prediction accuracy. The SDIWC method enables us to intelligibly determine the optimal number of single-feature models. PredIL13 greatly outperformed the state-of-the-art predictors and thus is an invaluable tool for accelerating the detection of IL-13-inducing peptides within the human genome. To aid the scientific community in identifying latent IL-13-inducing peptides, we provide a freely accessible web application and standalone programs of the proposed predictor. The web application is freely available at http://kurata35.bio.kyutech.ac.jp/PredIL13. The source codes are freely accessible at https://github.com/kuratahiroyuki/PredIL13.
This process comprises the following steps: preparing the dataset, extracting features, training the model, evaluating its performance, and creating a web application and standalone programs.
Materials and methods
Overall framework of PredIL13
Fig 1 provides an overview of the process employed in the development of the PredIL13 predictor. This multifaceted procedure consists of the following steps: dataset preparation, construction of single-feature models using ML and DL methods, construction of the meta-classifier that stacks the single-feature models, performance evaluation, and construction of the web application and standalone programs.
Dataset preparation
We used the datasets employed in the previous studies [11, 13]. For the sake of comparison, all the datasets, including the positive and negative datasets used in this study, were obtained from the original study [11]. The positive dataset included 313 IL-13-inducing peptides, whereas the negative dataset included 2908 non-IL-13-inducing peptides. The entire dataset was randomly divided into training and test datasets at a ratio of 4 to 1. The training dataset was used for 5-fold cross validation (CV), where the ratio of the training and validation folds was 4 to 1. Although the ratio of positive to negative samples in the dataset was imbalanced, we used neither over-sampling nor under-sampling methods.
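As a sketch of this partitioning scheme (the `split_dataset` helper and the fixed seed below are our own illustration, not the original code), an 80/20 random split can be written as:

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=1):
    """Randomly partition samples into training and test sets at a 4:1 ratio."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# The 5-fold CV inside the training set then further splits it 4:1
# into training and validation folds at each iteration.
```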
Feature encoding method
Amino acid sequences of different lengths were encoded using 21 different encoding techniques. These methods included Amino Acid Composition (AAC), Di-Peptide Composition (DPC), Composition of k-spaced Amino Acid Pairs (CKSAAP), Grouped Amino Acid Composition (GAAC), Conjoint Triad (CTriad), Composition/Transition/Distribution (CTD) Composition (CTDC), CTD Transition (CTDT), CTD Distribution (CTDD), Binary Encoding (BE), Enhanced Amino Acid Composition (EAAC), Amino Acid indices (AAindex), BLOSUM62, Z-Scale (ZSCALE), Evolutionary Scale Modeling-2 (ESM-2), and Word2Vec (W2V) with different Kmers. All these sequence encoding techniques can be readily computed using open-source software packages, including iLearn [15]. These encoding techniques capture a variety of properties, including compositional, position-order, evolutionary, and physicochemical characteristics, as well as linguistic patterns/distributions. Each of these encoding methodologies brings a unique lens to the sequence and contributes to a comprehensive and detailed analysis. Details of each encoding method are described below:
Composition-based encoding
Generally, the Kmer encoding represents an amino acid sequence {Ri} (i = 1, 2, …, L, where L is the length of the sequence) by the occurrence frequencies of consecutive amino acids of length k. The Kmer encoding provides 20^k features for each sequence, given by:
Kmer(qk) = f(AAk) / (L − k + 1)    (1)
where qk is any one of the consecutive Kmer amino acids and f(AAk) is the occurrence number of qk [15, 16]. For example, at Kmer = 2 a sequence is encoded as 400 (20^2) descriptors (features) representing the frequencies of all dipeptides (2-mers). Amino Acid Composition (AAC) is the variant with Kmer = 1 that gives 20 features for a sequence. Di-Peptide Composition (DPC) and Tri-Peptide Composition (TPC) are the variants with Kmer = 2 and Kmer = 3, giving 400 (20^2) and 8000 (20^3) features, respectively.
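A minimal sketch of this composition encoding (our own illustrative implementation, not the iLearn code) is:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_encoding(seq, k=1):
    """Occurrence frequencies of all 20**k consecutive k-mers, in a fixed order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of consecutive k-mers in the sequence
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[km] / total for km in kmers]

aac = kmer_encoding("ACDA", k=1)  # AAC: 20 features
dpc = kmer_encoding("ACDA", k=2)  # DPC: 400 features
```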
Pseudo-Amino Acid Composition (PAAC)
PAAC encodes an amino acid sequence mainly using a matrix of amino-acid frequencies and can deal with proteins lacking significant sequence homology to other proteins [17]. Compared to AAC, PAAC has 25 descriptors that can incorporate local sequence-order information as a series of rank-different correlation factors along a protein sequence.
Composition of k-spaced Amino Acid Pairs (CKSAAP)
The CKSAAP feature encoding calculates the frequency of amino acid pairs separated by any k residues (k = 0, 1, 2, …, 5) [18]. The default maximum value of k is 5. For example, it generates 400 descriptors that correspond to DPC at k = 0.
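A sketch of this pair-counting scheme (our own illustration; the iLearn implementation differs in details) is:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]  # the 400 amino-acid pairs

def cksaap(seq, kmax=5):
    """Frequencies of residue pairs separated by k residues, for k = 0..kmax."""
    feats = []
    for k in range(kmax + 1):
        total = len(seq) - k - 1  # number of k-spaced pairs in the sequence
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(total):
            counts[seq[i] + seq[i + k + 1]] += 1
        feats.extend(counts[p] / total for p in PAIRS)
    return feats
```

With the default kmax = 5, a sequence is encoded as 6 × 400 = 2400 features, and the k = 0 block reduces to DPC, as noted above.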
Grouped Amino Acid Composition (GAAC)
The 20 amino acid types are categorized into five classes according to their physicochemical properties, including hydrophobicity, charge, and molecular size [19]. The five classes are the aliphatic group (g1: GAVLMI), aromatic group (g2: FYW), positively charged group (g3: KRH), negatively charged group (g4: DE), and uncharged group (g5: STCPNQ). GAAC encodes an amino acid sequence as the occurrence frequencies of the different grouped amino acids, generating 5 features. The 400 di-peptide types are categorized into 25 classes based on the physicochemical properties in the same manner as GAAC. Grouped Di-Peptide Composition (GDPC) encodes an amino acid sequence as the occurrence frequencies of the different grouped dipeptides, generating 25 features. Grouped Tri-Peptide Composition (GTPC) encodes an amino acid sequence as the occurrence frequencies of the different grouped tripeptides, using 125 descriptors.
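The five-group composition above can be sketched as follows (an illustrative implementation of GAAC using the group definitions in the text):

```python
GROUPS = {
    "g1": "GAVLMI",  # aliphatic
    "g2": "FYW",     # aromatic
    "g3": "KRH",     # positively charged
    "g4": "DE",      # negatively charged
    "g5": "STCPNQ",  # uncharged
}

def gaac(seq):
    """GAAC: occurrence frequency of each of the five groups (5 features)."""
    return [sum(seq.count(aa) for aa in members) / len(seq)
            for members in GROUPS.values()]
```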
Conjoint Triad (CTriad)
The twenty standard amino acids can be divided into 7 groups based on the dipoles and volumes of their side chains. The Conjoint Triad descriptor (CTriad) considers the properties of one amino acid and its vicinal amino acids by regarding any three continuous amino acids as a single group, generating a 7 × 7 × 7 = 343-D feature vector for a sequence [20].
Composition/Transition/Distribution (CTD)
The Composition, Transition and Distribution (CTD) features represent the amino acid distribution patterns of seven structural or physicochemical properties, including hydrophobicity, normalized Van der Waals Volume, polarity, polarizability, charge, secondary structures and solvent accessibility [21, 22]. The CTD descriptors are calculated by transforming the amino acid sequence into a vector of the specific structural or physicochemical properties of amino acid residues, where 20 amino acids are classified into three groups (polar, neutral and hydrophobic) for each of the seven different physicochemical features.
The CTD encoding represents an amino acid sequence as a combination of three descriptor sets: Composition, Transition, and Distribution, obtained by the corresponding encodings CTDC, CTDT, and CTDD. The CTDC calculates the grouped amino acid composition for each property and generates a 39-D vector for each sequence. The CTDT calculates the grouped amino acid transitions for each property to generate a 39-D feature vector. The CTDD calculates 15 values for each property: the Distribution descriptor consists of five values for each of the three groups, defined as the positions of the first, 25, 50, 75, and 100% of that group's occurrences, expressed as fractions of the entire sequence. It generates a 195-D feature vector.
Position-based encoding
Binary Encoding (BE).
BE converts a single amino acid into a 20-dimensional binary vector. For example, the amino acids A, C, and Y are represented as (10000000000000000000), (01000000000000000000), and (00000000000000000001), respectively. Therefore, an amino acid sequence with a length of L can be represented as a 20L-dimensional feature vector.
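This one-hot scheme can be sketched as follows (our own illustration):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def binary_encode(seq):
    """BE: each residue -> 20-D one-hot vector; a length-L sequence -> 20L features."""
    vec = []
    for aa in seq:
        onehot = [0] * 20
        onehot[AA.index(aa)] = 1
        vec.extend(onehot)
    return vec
```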
NN.
NN encoding (nn.Embedding in PyTorch) converts each amino acid into an index, which is then mapped to a fixed-size feature vector via a learned lookup table that stores embeddings for a fixed dictionary and embedding size [23].
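A pure-Python analogue of this lookup-table behavior (illustrative only; the actual model uses PyTorch's trainable nn.Embedding, and the dimension of 8 here is an arbitrary stand-in for the tuned hyperparameter) is:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8  # arbitrary illustrative size, not the value used in the study

rng = random.Random(0)
# One EMBED_DIM vector per amino-acid index, analogous to
# torch.nn.Embedding(num_embeddings=20, embedding_dim=EMBED_DIM).
table = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in AA]

def nn_encode(seq):
    """Map each residue to its index, then to its embedding (L x EMBED_DIM)."""
    return [table[AA.index(aa)] for aa in seq]
```

During training, nn.Embedding updates these vectors by backpropagation, whereas this sketch keeps them fixed.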
EAAC.
The Enhanced Amino Acid Composition (EAAC) feature calculates the AAC within a sequence window of fixed length k (the default value is 5) that continuously slides from the N- to the C-terminus of each peptide [15]. EAAC encodes a sequence of length L as a 20 × (L − k + 1)-dimensional feature vector, given by:
EAAC(qi, winm) = N(qi, winm) / N(winm), winm = {Rm, Rm+1, …, Rm+k−1}, m = 1, 2, …, L − k + 1    (2)
where Rm represents the mth amino acid in a sequence and qi ∈ {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}.
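The sliding-window composition can be sketched as follows (our own illustration):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def eaac(seq, window=5):
    """EAAC: AAC computed within a window sliding from the N- to the C-terminus."""
    feats = []
    for m in range(len(seq) - window + 1):
        win = seq[m:m + window]
        feats.extend(win.count(aa) / window for aa in AA)
    return feats
```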
Amino acid index
Physicochemical properties of amino acids are the most intuitive features for representing biochemical reactions and have been extensively applied in bioinformatics research. The Amino Acid indices database (AAindex) collects many published indices representing physicochemical properties of amino acids [24]. Each physicochemical property has a set of 20 numerical values, one for each amino acid. Currently, 544 physicochemical properties can be retrieved from the AAindex database. After removing the physicochemical properties with NA (not available) values for any amino acid, AAindex generates a vector of 531 values for each amino acid residue in a sequence.
BLOSUM62
The BLOcks SUbstitution Matrix (BLOSUM) is a substitution matrix used for the sequence alignment of proteins; it scores alignments between evolutionarily divergent protein sequences [25]. It scans the BLOCKS database for highly conserved regions of protein families, counts the relative frequencies of amino acids, and calculates their substitution probabilities as a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. BLOSUM62 is the matrix built using sequences with less than 62% similarity. BLOSUM62 generates a 20L-D feature vector for a sequence of length L, where each row of the BLOSUM62 matrix is adopted to encode one of the 20 amino acids.
Z-Scale (ZSCALE)
Z-Scale characterizes each amino acid in a sequence by five physicochemical descriptors [26] and generates a 5L-D feature vector for each sequence. It improves the original Z-scales [27] by introducing two more Z-scales.
Language model
ESM-2 is a transformer-based language model that uses an attention mechanism to learn interaction patterns between pairs of amino acids [14]. ESM-2 can leverage evolutionary information from diverse protein sequences, enabling accurate prediction of 3D structures. ESM-2 was trained with 15 billion parameters on protein sequences from the UniRef database [28]; during training, 15% of the amino acids are masked and the model is tasked with predicting them from the remaining 85% of each sequence.
W2V was invented to obtain distributed representations of words in the field of natural language processing [29]. In W2V, the weights in a neural network are determined by learning the context of words, providing distributed representations that encode different linguistic regularities and patterns. In this study, each Kmer in an amino acid sequence is regarded as a single word, and each peptide sequence is represented by multiple consecutive Kmer amino acids. Here, Kmer amino acids and peptide sequences correspond to words and sentences in natural language, respectively. We trained a Skip-gram-based W2V model on the SWISS-PROT database [30] to learn the appearance patterns of Kmers by using the gensim Python package [31]. This study used Kmers of 1, 2, 3, and 4, a feature size of 128, 100 epochs, a window size of 40, and sg of 1. The model with Kmer = K is named W2V_K.
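The Kmer-to-word mapping above can be sketched as follows (illustrative; the resulting token lists would then be passed to gensim's Word2Vec with sg = 1 for skip-gram training):

```python
def kmer_sentence(seq, k=3):
    """Treat each consecutive k-mer as a 'word': a peptide becomes a 'sentence'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_sentence("ACDEF", k=3)
# Lists like `tokens` are the 'sentences' used to train the skip-gram W2V model.
```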
ML and DL classifiers
We employed seven different classifiers: Random Forest (RF) [32], Extreme Gradient Boosting (XGB) [33], LightGBM (LGBM) [34], Support Vector Machine (SVM) [35], K-nearest neighbor (KN), Naive Bayes (NB), and Logistic regression (LR). In addition, we used three DL methods: Transformer encoder network (TX), Convolutional Neural Network (CNN) and Bidirectional Long Short Term Memory (bLSTM) algorithm [36].
RF.
RF is an ensemble learning method that constructs numerous decision trees in the training phase. This approach employs bagging or bootstrap aggregating, generating multiple subsets of the original dataset randomly with replacement. A decision tree model is then established for each subset. During the prediction phase, RF combines the outputs of each tree to deliver a conclusive prediction. RF effectively addresses overfitting by reducing variance without augmenting bias, establishing itself as a robust tool for diverse applications.
XGB.
XGB stands out as a high-performance, adaptable, and easily transportable gradient boosting method. In a step-by-step fashion, XGB assembles an ensemble of weak prediction models, typically in the form of decision trees. Each subsequent tree rectifies the prediction errors of its predecessor. To counteract overfitting, XGB incorporates a regularization parameter, simplifying the model for enhanced robustness across a spectrum of predictive tasks.
LGBM.
LGBM stands as a gradient boosting platform harnessing tree-based learning algorithms, presenting heightened efficiency and swiftness compared to counterparts such as XGB and CatBoost. The distinctive features of LGBM encompass adept handling of expansive datasets, superior effectiveness, and rapid execution. What distinguishes it is the employment of a histogram-based algorithm, discretizing continuous feature values into distinct bins, a departure from conventional tree-based methods. This approach accelerates the training procedure while concurrently reducing memory consumption.
SVM.
SVM’s fundamental idea involves placing each data point in an N-dimensional space (N denotes the number of features) and defining a hyperplane that effectively separates the data points into distinct classes. The selected hyperplane seeks to maximize the margin between these classes. In cases where the data is not linearly separable, SVM employs a kernel trick to transform the input space into a higher dimension, enabling a hyperplane to delineate the data. The learning model constructs a boundary line that segregates data points into various classes. In binary classification, this decision boundary adopts the approach of creating the widest street, maximizing the distance to the closest data points from each class.
LR.
LR is used to predict the likelihood of categorical dependent variables, which makes it more of a classification algorithm than a regression one. Specifically, it is used when the dependent variable is binary, having two potential outcomes. LR models the likelihood that each input belongs to a specific category, outputting a value between 0 and 1. The probability p is defined by:
p = 1 / (1 + exp(−(β0 + Σi βi xi)))    (3)
where βi is the weight coefficient for the explanatory variables.
LR serves as a primary tool for establishing a decision boundary in binary classification, enabling the prediction of the corresponding class for a new set of features. An intriguing aspect of logistic regression lies in its employment of the sigmoid function as the estimator for the target class.
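The sigmoid computation of Eq (3) can be sketched as follows (an illustrative function; the study uses scikit-learn's logistic regression):

```python
import math

def logistic_probability(x, beta0, beta):
    """p = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i))), the sigmoid of Eq (3)."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))
```

A probability above a chosen threshold assigns the input to the positive class.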
KN.
KN is a nonparametric supervised learning algorithm that predicts the class or value of unknown data points based on the K most similar data points in the training dataset. The similarity between the data points is typically measured using Euclidean distance. The performance of KN depends on the right choice of a value of K.
NB.
NB stands out as a well-known supervised ML technique leveraging Bayes’ theorem and assuming the independence of input features. It does not represent a singular algorithm but rather a family of algorithms united by the common principle that the classification of any two features is independent.
TX.
TX consists of an encoder and decoder that are created to process sequential input data, such as natural language for text translation and summarization. As a DL model, TX employs an attention mechanism that differentially weights the importance of each word in the input text [37]. The encoder layer consists of a multi-head attention network and a feed-forward network. In this study, we used the encoder of TX and set the number of attention heads and layers to 4 and 4, respectively [36].
CNN.
The CNN consists of convolutional and pooling layers. In the convolutional layer, significant features can be extracted based on filters. The pooling layer provides robust prediction with respect to pattern modification and suppresses overfitting by compressing the information. As proposed in a previous study [38], we applied CNN architectures consisting of two convolutional layers and two max-pooling layers.
bLSTM.
Recurrent neural networks are useful for making predictions about interdependent data such as time series, but they are not suitable for learning long-term dependencies because of gradient vanishing and explosion. To address this issue, LSTM introduces gate structures and memory cells; extending the LSTM units in two directions yields the bidirectional LSTM (BiLSTM) approach [39].
The seven ML and three DL methods were implemented using scikit-learn [40] and PyTorch [23], respectively. For the DL methods, we used the Adam optimizer with the binary cross-entropy loss function. The hyperparameters for each method were optimized during the training process, as shown in S1 Table. We utilized a grid search coupled with 5-fold CV to refine these hyperparameters; a comprehensive overview of this process can be found in our previous study [36].
Meta-classifier
By connecting the seven ML classifiers and three DL ones to the 22 encoding methods, we generated 168 single-feature models via 5-fold CV on the training dataset. We then constructed a meta-classifier that stacks the predicted probabilities from each single-feature model to attain accurate prediction [41, 42]. The meta-classifier was trained via 5-fold CV so that the final predicted probabilities fit the class labels. We investigated seven meta-classifiers: LR, SVM, LGBM, XGB, RF, NB, and KN.
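The stacking step can be sketched as follows (the single-feature models here are mock stand-ins for illustration, not the trained models of the study):

```python
def stack_scores(single_models, sequences):
    """Build the meta-classifier input: one row per peptide, one probability
    score per single-feature model."""
    return [[model(seq) for model in single_models] for seq in sequences]

# Illustrative stand-ins for trained single-feature models:
models = [lambda s: len(s) / 50.0, lambda s: s.count("K") / len(s)]
features = stack_scores(models, ["ACDK", "KKRH"])
```

The meta-classifier (LR in the final model) is then trained on these stacked score vectors against the class labels.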
It is critically important to select appropriate single-feature models out of the 168 models. Generally, feature selection is a dimensionality reduction technique that selects a subset of features with the best predictive power; in this study, note that the features correspond to single-feature models. Feature selection is used to prevent overfitting and to improve interpretability with fewer features. There are multiple ways to select features: iterative changes of a feature subset sequentially add or remove features until no improvement in prediction occurs; ranking of features based on an intrinsic characteristic such as mutual information is used to select the top-ranked features; and feature importance learned by an ML model during training is used to rank features. We propose the following three feature selection methods that rank the single-feature models in terms of prediction performance (e.g., AUC) or their importance (weight coefficient) during the training process.
Sequential addition of single-feature models based on the AUC ranking (SAAUC)
The SAAUC selects the top X single-feature models in the descending order of the AUC values during the training process, where X is incremented by one. LR, SVM, LGBM, XGB, RF, NB, and KN are used as a meta-classifier.
Sequential addition of single-feature models based on the AWCLR ranking (SAWC)
The SAWC selects the top X single-feature models in the descending order of the AWCLR that corresponds to the importance of each single-feature model. LR is employed as a meta-classifier, because the importance is explicitly defined as the weight coefficients of LR by Eq (3), which indicate the contribution of each single-feature model to accurate prediction.
Sequential deletion of single-feature models based on the iterative AWCLR ranking (SDIWC)
The SDIWC selects the top X models in the descending order of the AWCLR, while iteratively updating the AWCLR ranking as follows, because the AWCLR ranking can vary with the employed single-feature model subset: (1) Calculate the AWCLR of all single-feature models with an LR meta-classifier. (2) Sort the single-feature models in descending order of the AWCLR. (3) Remove the single-feature model with the lowest AWCLR. (4) Input the remaining single-feature models into the LR meta-classifier. (5) Repeat (2)-(4) until only one single-feature model remains.
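The iterative deletion loop can be sketched as follows (a minimal sketch: `fit_lr` stands in for retraining the LR meta-classifier on the stacked scores and returning one weight per model; the mock weights below are for illustration only):

```python
def sdiwc(models, fit_lr):
    """SDIWC sketch: iteratively drop the model with the smallest |LR weight|.
    `fit_lr(models)` is assumed to retrain the LR meta-classifier on the
    stacked probability scores of `models` and return one weight per model."""
    subsets = []  # recorded top-X subsets, for X = len(models) down to 1
    while models:
        subsets.append(list(models))
        weights = fit_lr(models)
        ranked = sorted(zip(models, weights), key=lambda mw: abs(mw[1]),
                        reverse=True)
        models = [m for m, _ in ranked[:-1]]  # delete the lowest-AWCLR model
    return subsets

# Mock meta-classifier returning fixed weights for illustration:
w = {"m1": 0.9, "m2": -0.1, "m3": 0.5}
subsets = sdiwc(["m1", "m2", "m3"], lambda ms: [w[m] for m in ms])
```

The best subset (top X) is then chosen as the one whose meta-classifier attains the highest AUC.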
Evaluation
The effectiveness of the proposed predictive models was assessed using seven established statistical measures, each offering different insights into the prediction performance: accuracy (ACC), sensitivity (SEN), specificity (SPE), precision (PRE), Matthews correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC). Comprehensive descriptions and mathematical formulations of these measures are available in previous studies [36]. The threshold that classifies the probability scores into positive and negative samples was adjusted to maximize MCC.
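The MCC-maximizing threshold search can be sketched as follows (our own illustration of the idea, not the study's evaluation code):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, labels):
    """Return the probability-score threshold that maximizes MCC."""
    best_mcc, best_t = -1.0, 0.5
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        m = mcc(tp, tn, fp, fn)
        if m > best_mcc:
            best_mcc, best_t = m, t
    return best_t
```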
Results and discussion
Single-feature models: Construction and independent evaluation
Generally, the performance of ML and DL classifiers depends on their learning algorithms and encoding methods; classifiers trained with the same encoding method vary in performance with their algorithms. To cover the variety of features produced by combining different learning methods with different encodings, we constructed 168 single-feature models by combining the seven ML and three DL classifiers with 22 distinct feature encodings, including the language models. For instance, the LR model trained with the AAC encoding is regarded as a single-feature model. All the single-feature models were trained on the training dataset via 5-fold CV and then evaluated on the independent test dataset.
As depicted in Fig 2, we characterized the prediction performance (AUC, MCC) of the 168 single-feature models on the training and test datasets. Details of their prediction performances are shown in S2 Table. The performance trend of each single-feature model was almost consistent between the training and test datasets, indicating that the models were well trained. The LGBM showed high performance (AUC > 0.8) with the composition-based encodings (AAC, DPC, CKSAAP, GAAC, GDPC, GTPC, CTDC, CTDD), the position-based encodings (BLOSUM62, AAindex, ZSCALE), and the language models (ESM-2, W2V). In particular, the LGBM with ESM-2 provided remarkable performance: AUCs of 0.847 and 0.843 and MCCs of 0.528 and 0.438 on the training and test datasets, respectively. The RF showed high performance with AAindex, ESM-2, and W2V. The boundary-based models (SVM, LR) presented high performance with BLOSUM62 and the language models (ESM-2, W2V). On the other hand, the NB and KN showed lower performance with all the encodings.
The models were built by employing 7 ML classifiers and 3 DL classifiers with 21 different encoding methods. Matthews correlation coefficient (MCC), accuracy (ACC), and area under the curve (AUC) on the training and test datasets are illustrated for each single-feature model. "-1" indicates that no learning model was built.
Among the DL methods, TX and CNN with the language models (ESM-2, W2V) presented high performance, whereas bLSTM showed lower performance than TX and CNN. Use of the language models enhanced the performance more than use of BE and NN. The TX models with W2V_3 and W2V_4 and the CNN models with W2V_1 and W2V_2 presented high performance. The performances of TX and CNN with the language models were competitive with those of LGBM.
To clearly illustrate the prediction capability of all 168 single-feature models, we ranked them according to their AUC values on the training dataset (S3 Table). The LGBM with ESM-2 was the first-ranked model, and most of the LGBM-based models were placed at high ranks. The second-ranked model was the CNN with W2V_1. In general, the language models (ESM-2, W2V) and composition-based encodings were placed at high ranks for many ML and DL methods.
Feature selection for meta-classifier construction
Since stacking of single-feature models is typically expected to enhance the prediction performance [41, 42], we proposed three feature selection methods: SAAUC, SAWC, and SDIWC (see Methods). First, by using the SAAUC method, we stacked the probability vectors generated by the top X single-feature models according to the AUC ranking, then inputted them into a meta-classifier: LR (Fig 3), SVM, LGBM, XGB, RF, KN, or NB (S1 Fig). The meta-classifier was trained on the training dataset while incrementing X by one. If the change in AUC with respect to X had a single peak or a saturation curve, it would be easy to select the optimal X, which corresponds to the peak or the point that just attains saturation. For LR, the AUC and MCC values tended to increase with X, although they fluctuated considerably. The AUC of the top 2 model dropped when CNN-W2V_1 was added to the top 1 model of LGBM-ESM-2, suggesting that CNN-W2V_1 and LGBM-ESM-2 are an incompatible pair. Nevertheless, the prediction performance could be improved by increasing the number of single-feature models. The other meta-classifiers showed the same trend in AUC and MCC with respect to X. Since one single-feature model produces one score for each sequence, the dimension of the stacked score vectors (X) is much smaller than the number of samples. This could cause overfitting, but the prediction performances of all seven classifiers on the test dataset were consistent with those on the validation dataset and no overfitting was observed, showing the robustness and generality of the prediction model. Of the seven meta-classifiers, LR was the best predictor in terms of prediction performance on both the training and test datasets. Thus, we selected LR as the meta-classifier (S4 Table).
LR is employed as the meta-classifier and X is incremented by one from one to 168. (A) validation dataset; (B) test datasets.
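The SAAUC stacking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the probability-score matrix, sample counts, and labels are synthetic stand-ins, and the columns are assumed to be pre-sorted by each single-feature model's AUC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-in: probability scores from 168 single-feature models
# (rows = samples, columns = models), with columns sorted by AUC ranking.
n_samples, n_models = 300, 168
y = rng.integers(0, 2, n_samples)
# Simulated informative scores: each model leaks the label with noise.
scores = 0.4 * y[:, None] + rng.random((n_samples, n_models))

def saauc_stack(scores, y, top_x):
    """Stack the top-X models' probability vectors and train an LR meta-classifier."""
    X_stack = scores[:, :top_x]  # columns are already in AUC rank order
    meta = LogisticRegression(max_iter=1000).fit(X_stack, y)
    return meta, roc_auc_score(y, meta.predict_proba(X_stack)[:, 1])

# Increment X by one, as in the SAAUC procedure (here only up to X = 10).
aucs = [saauc_stack(scores, y, x)[1] for x in range(1, 11)]
```

In the paper, X runs from 1 to 168 and the AUC/MCC curves in Fig 3 are plotted against X; the sketch only reproduces the mechanics of stacking score vectors into a meta-classifier.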
AWCLR-based feature selection
To suppress the fluctuations in AUC and MCC and to reveal the effectiveness of stacking, we focused on the importance of each single-feature model during training; i.e., we applied the SAWC method to a series of LR-based meta-classifiers (X = {1, 2,…, 168}) and calculated the weight coefficients defined by Eq (3) as the importance. Specifically, we retrieved the absolute weight coefficients of LR (AWCLR) for the 168 single-feature models from the meta-classifier with X = 168, as shown in Fig 4. Interestingly, the AWCLR-based ranking of the single-feature models differed from their AUC-based ranking. We sorted the single-feature models in descending order of AWCLR (S5 Table) and stacked the top X single-feature models according to the AWCLR ranking to evaluate their performance on the training and test datasets, as shown in Fig 5. The AUC initially dropped, then increased, reaching a plateau at the top 58 (X = 58). The AWCLR ranking-based method (SAWC) thus proved to be a better indicator for feature selection than the SAAUC. The fluctuations in performance were suppressed at X < 80, although the performance still fluctuated at X > 80.
The models were built by employing 7 ML classifiers and 3 DL classifiers with different 21 encoding methods. The AWCLR indicates the contribution of each single-feature model to the prediction performance.
LR is employed as the meta-classifier and X is incremented by one from one to 168. (A) validation dataset; (B) test datasets.
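Retrieving the AWCLR can be sketched as below, assuming (as our reading of the method suggests) that the importance of each single-feature model is the absolute value of its fitted LR coefficient in the full meta-classifier; the score matrix here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical probability-score matrix from all 168 single-feature models.
n_samples, n_models = 300, 168
y = rng.integers(0, 2, n_samples)
# Give each synthetic model a different signal strength so the AWCLRs differ.
scores = 0.4 * y[:, None] * rng.random(n_models) + rng.random((n_samples, n_models))

# Train the full meta-classifier (X = 168) and take |coefficients| as AWCLR.
meta = LogisticRegression(max_iter=1000).fit(scores, y)
awclr = np.abs(meta.coef_[0])  # one importance value per single-feature model

# Sort single-feature models in descending order of AWCLR (the SAWC ranking).
sawc_ranking = np.argsort(awclr)[::-1]
```

Because LR is linear in the stacked scores, the coefficient magnitude directly measures how strongly each single-feature model shifts the log-odds, which is what makes the AWCLR an intelligible importance indicator.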
To further reduce the fluctuations in performance, we applied the SDIWC method to the series of LR-based meta-classifiers. The prediction performance of the top X models varied smoothly, as shown in Fig 6. The AUC initially dropped, then increased, reaching a plateau at X = 58. The initial fall is likely caused by an incompatibility between the top 2 models (LGBM-ESM-2 and LGBM-W2V_4). This smooth AUC curve enabled us to readily select the optimal single-feature models. Table 1 summarizes the performance of the top X model with the highest AUC for each of the three feature selection methods. We selected the top 58 model as the best model and designated it PredIL13 (top 58). PredIL13 achieved a consistent level of performance between training and test, indicating its generality and robustness on the test dataset. The two meta-classifiers built by SAWC and by SDIWC had the same optimal number (58) of single-feature models. This agreement might be coincidental, because the AWCLR ranking changes with the subset of single-feature models employed.
LR is employed as the meta-classifier and X is incremented by one from one to 168. (A) validation dataset; (B) test datasets.
The validation dataset is used for training.
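The SDIWC loop can be sketched as follows, under our reading of the procedure: after each fit, delete the single-feature model with the smallest AWCLR, refit, and re-rank, recording one AUC per subset size. The data are synthetic and the model count is reduced to 20 for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_samples, n_models = 200, 20  # small synthetic stand-in for the 168 models
y = rng.integers(0, 2, n_samples)
scores = 0.4 * y[:, None] * rng.random(n_models) + rng.random((n_samples, n_models))

def sdiwc(scores, y):
    """Sequentially delete the model with the smallest AWCLR, refitting the
    LR meta-classifier and re-ranking after every deletion."""
    remaining = list(range(scores.shape[1]))
    auc_per_size = {}
    while remaining:
        X = scores[:, remaining]
        meta = LogisticRegression(max_iter=1000).fit(X, y)
        auc_per_size[len(remaining)] = roc_auc_score(y, meta.predict_proba(X)[:, 1])
        # Drop the single-feature model with the smallest |LR coefficient|.
        remaining.pop(int(np.argmin(np.abs(meta.coef_[0]))))
    return auc_per_size

aucs = sdiwc(scores, y)
best_size = max(aucs, key=aucs.get)  # optimal number of single-feature models
```

Re-ranking after every deletion is what distinguishes SDIWC from SAWC, which computes the AWCLR ranking only once from the full X = 168 meta-classifier.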
Compared with the first-ranked single-feature model (LGBM-ESM-2), the stacking method greatly enhanced the prediction performance, demonstrating the effectiveness of the SDIWC method. On the independent test dataset, LGBM-ESM-2 showed SEN of 0.286, SPE of 0.989, PRE of 0.796, ACC of 0.920, MCC of 0.438, AUC of 0.843, and AUPRC of 0.573, whereas PredIL13 achieved SEN of 0.327, SPE of 0.993, PRE of 0.849, ACC of 0.928, MCC of 0.499, AUC of 0.899, and AUPRC of 0.635.
Importance of each single-feature model
To understand the contribution of each single-feature model to prediction performance, we analyzed the heatmap of AWCLR (Fig 4) in comparison with the heatmap of AUC (Fig 2). Interestingly, all the DL-based models dropped to lower ranks in the AWCLR ranking; for example, CNN-W2V_1 fell from the second-best model in the AUC ranking to 116th in the AWCLR ranking, and the highest-ranked DL models in the AUC ranking (bLSTM-ESM-2, TX-W2V_1, CNN-W2V_4) fell to 82nd, 83rd, and 84th in the AWCLR ranking. Consequently, the top 58 model did not include any DL methods. ML and DL could not work together to enhance performance, suggesting some incompatibility between the probability score vectors generated by the ML and DL models. Many LGBM-based single-feature models, as well as SVM- and LR-based models with the language models (ESM-2, W2V), were placed at high ranks in the AWCLR ranking, indicating that tree-based and boundary-based models are effective in the stacking approach.
Feature selection criteria other than the AWCLR, which represents the contribution of each single-feature model (feature) to the log-odds of the binary classification, are known. For example, SHapley Additive exPlanations (SHAP) analysis provides a more nuanced view by showing how each feature contributes to individual predictions (S2 Fig). In this study, we used the AWCLR because it has a theoretically intelligible basis; SHAP analysis will be considered elsewhere.
Analysis of the top 58 single-feature models
Since the number of single-feature models included in the top 58 was relatively large, indicating model complexity, it would be better to reduce the number of selected models. To further conduct feature selection, we examined the probability distributions generated by the top 58 single-feature models for the positive and negative samples during training, as shown in Fig 7. The top 29 models were mainly based on LGBM and SVM, and their probability scores were clearly separated between the positive and negative samples. In contrast, the probability scores of the 30th to 58th models were not clearly separated. In particular, in the KN-based single-feature models (KN-CKSAAP, KN-DPC, KN-ZSCALE, KN-PAAC, KN-Ctriad, KN-AAC, KN-EAAC, KN-CTDC, KN-BLOSUM62, KN-GDPC, KN-W2V_2, KN-CTDT), the probability profiles overlapped considerably between the positive and negative samples. In the NB-based models (NB-CKSAAP, NB-DPC, NB-GTPC), the probability scores for the negative samples were broadly scattered from 0 to 1. These KN- and NB-based models were confirmed to have low AUC, MCC, and ACC values (Fig 2), indicating that their contributions are rather small.
The single-feature models were arranged in the descending order of AWCLR from the left to the right. (Upper panel) The models from the top 1 to top 29. (Lower panel) the models from the top 30 to 58.
To further select single-feature models, we proposed deleting the single-feature models with ACC < 0.902, based on the following rationale: if all peptides are predicted to be negative, an ACC of 0.902 (= 2908/(2908+313)) is obtained on our imbalanced dataset (see Dataset preparation), so any model with a lower ACC performs no better than this trivial baseline. Following this deletion process, we retained 72 of the 168 single-feature models with ACC ≥ 0.902 and applied SDIWC to them, as shown in Fig 8. The AUC increased after X = 3 and peaked at X = 16 during training, which intelligibly selected the top 16 model as the best model; we named it PredIL13 (top 16). Looking at the top 16 single-feature models (Table 2), composition-based features (AAC, PAAC, DPC, CKSAAP, CTDD) were found to be effective in detecting IL13-inducing activity. In particular, ESM-2 and the multi-Kmer (1, 2, 3, and 4) W2V were very effective, enabling the prediction model to perceive a wide range of contextual information and amino acid sequence patterns at different scales. This feature selection was also effective in enhancing the interpretability of the meta-classifier.
The 72 single-feature models with ACC ≥ 0.902 on the training dataset were employed. LR is employed as the meta-classifier and X is incremented by one from one to 72. (A) validation dataset; (B) test datasets.
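The all-negative baseline that fixes the 0.902 cut-off is a short arithmetic check; the per-model accuracies below are hypothetical examples, not values from the paper.

```python
# On the imbalanced dataset (313 positives, 2908 negatives), a trivial
# classifier that predicts every peptide as negative achieves:
n_pos, n_neg = 313, 2908
baseline_acc = n_neg / (n_pos + n_neg)  # about 0.9028

# Hypothetical per-model accuracies; keep only models that at least
# match the trivial all-negative baseline.
model_acc = {"LGBM-ESM-2": 0.951, "KN-DPC": 0.884, "SVM-AAC": 0.931}
kept = {name for name, acc in model_acc.items() if acc >= baseline_acc}
```

Any single-feature model with ACC below this baseline is doing no better than always predicting the majority class, which motivates the pre-filtering down to 72 models before SDIWC is reapplied.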
Comparison of PredIL13 with existing predictors
To characterize the proposed methods from a fair perspective, we benchmarked their performance against two state-of-the-art predictors, IL13Pred and iIL13Pred, using the same test dataset. The single-feature model LGBM-ESM-2 showed a slightly higher AUC than IL13Pred and iIL13Pred, despite its relative simplicity. This observation emphasizes that the latest language model, ESM-2, is effective in extracting critically important patterns and distributions of amino acids.
Our proposed meta-classifier, PredIL13, yielded impressive results (Table 3). While PredIL13 (top 58, top 16) showed lower sensitivity than IL13Pred and iIL13Pred, it distinctly surpassed them in the key performance indicators of SPE, ACC, MCC, and AUC. In particular, PredIL13 (top 16) had a remarkable advantage: a 9.2% higher AUC than iIL13Pred. PredIL13 thus makes substantial progress in distinguishing IL13-inducing peptides, demonstrating its ability to capture preferred patterns and distributions in the sequence data, and improves on the state-of-the-art predictors in the field of IL13-inducing peptide classification.
Since ESM-2 required a large amount of memory (60 GB in our study), we investigated whether the ESM-2 encoding is indispensable. We removed the two single-feature models that used ESM-2, i.e., LGBM-ESM-2 and LR-ESM-2, from the top 16 models to build PredIL13 (top 14). While PredIL13 (top 14) showed slightly decreased AUC and AUPRC on the test dataset, it increased ACC and MCC. This ESM-2-free meta-classifier showed performance competitive with PredIL13 (top 16), indicating that it was sufficiently effective despite omitting the top 1 model (LGBM-ESM-2). We recommend that users employ the ESM-2-free model (top 14) when their servers are short of memory.
Note that SEN decreased in PredIL13 compared with IL13Pred and iIL13Pred. We consider that the low SEN results from the evaluation process of the meta-classifier. During evaluation, the number of negative samples predicted as positive (FP) decreased, while the number of positive samples predicted as negative (FN) increased. This is because the employed datasets are imbalanced and the threshold that classifies the probability scores into positive and negative classes is set so that the evaluation process maximizes MCC on the training dataset.
Application to SARS-CoV-2 spike proteins
To demonstrate the prediction capability of PredIL13, we tested it on the experimentally validated IL-13-inducing peptides derived from the Immune Epitope Database [12, 13]. IL13Pred and iIL13Pred correctly predicted 19 and 37 of the 68 experimentally validated IL-13-inducing peptides, corresponding to SENs of 0.28 and 0.54, respectively. Our PredIL13 correctly predicted 29 peptides as IL13-inducers (S6 Table), a SEN of 0.43. Thus, PredIL13 predicted IL13-inducing peptides more accurately than IL13Pred, but less accurately than iIL13Pred. This decreased SEN would be caused by the high SPE and PRE values (Table 3); PredIL13 has the advantage of increased SPE and PRE.
Limitation
One limitation of this study is that the number of experimentally validated IL-13-inducing peptides is small. Thus, a larger-scale dataset needs to be constructed to increase the prediction performance. Furthermore, using such a large dataset, we plan to construct a generative AI to design de novo peptide sequences and to ensure that the generated sequences are biologically functional and potentially beneficial for medication.
Conclusion
We have proposed PredIL13, a novel, efficient computational method based on a stacking (ensemble) strategy, specifically designed for predicting human IL13-inducing peptides. This method stacked 168 single-feature ML/DL models by combining their probability scores, trained the LR-based meta-classifier with the combined probability vectors, and selected a subset of the single-feature models that maximized the prediction performance. We proposed the SDIWC method to efficiently select the optimal single-feature models. From the trained meta-classifier, we retrieved the AWCLR, the importance of each single-feature model, and sorted the single-feature models in descending order of AWCLR. The SDIWC method selected the top 58 models (PredIL13 (top 58)) while iteratively updating the AWCLR ranking. To further select features, we applied the SDIWC method to the single-feature models with ACC ≥ 0.902, finally obtaining the LR-based meta-classifier consisting of the top 16 single-feature models (PredIL13 (top 16)). The SDIWC method enabled us to intelligibly select the optimal single-feature models.
Importantly, the proposed single-feature model selection method revealed critical features responsible for IL13-inducing activity. Looking at the top 16 single-feature models, the linguistic approaches of ESM-2 and W2V, employed by the LGBM and SVM models, conferred a great advantage in improving the model's performance. ESM-2 and the multi-Kmer W2V enabled the model to perceive a wide range of contextual information and amino acid sequence patterns at different scales. In particular, ESM-2 provided the most important feature for identifying IL13-inducing activity. Interestingly, in the meta-classifier, the DL methods ranked highly by AUC moved to lower ranks in the AWCLR ranking, indicating that the importance of the DL methods decreased despite the high performance of the DL-based single-feature models; the resultant PredIL13 did not include any DL methods.
PredIL13 outperformed the state-of-the-art predictors and is thus an invaluable tool for accelerating the detection of human IL13-inducing peptides. PredIL13 represents a computational strategy for the high-throughput and accurate prediction of human IL13-inducing peptides, and the proposed selection method can be applied to a variety of sequence-based functional prediction tasks, such as peptide therapeutics.
Supporting information
S1 Fig. Prediction performance of six meta-classifiers built by the SAAUC method on the validation and test datasets.
(A) LGBM; (B) XGB; (C) RF; (D) SVM; (E) NB; (F) KN.
https://doi.org/10.1371/journal.pone.0309078.s001
(PDF)
S2 Fig. SHAP analysis of the SDIWC-stacked classifier consisting of top 72 single-feature models.
https://doi.org/10.1371/journal.pone.0309078.s002
(PDF)
S1 Table. Hyperparameter tuning for ML and DL.
https://doi.org/10.1371/journal.pone.0309078.s003
(XLSX)
S2 Table. Prediction performance of 168 single feature models on the training and test datasets.
https://doi.org/10.1371/journal.pone.0309078.s004
(XLSX)
S3 Table. Ranking of 168 single-feature models in the descending order of AUC on the training dataset.
https://doi.org/10.1371/journal.pone.0309078.s005
(XLSX)
S4 Table. Prediction performance of seven meta-classifiers on the training and test datasets.
The SAAUC method is employed. The validation dataset is used for training.
https://doi.org/10.1371/journal.pone.0309078.s006
(XLSX)
S5 Table. Ranking of 168 single-feature models in the descending order of AWCLR on the training dataset.
https://doi.org/10.1371/journal.pone.0309078.s007
(XLSX)
S6 Table. Prediction of experimentally validated IL13-inducing peptides by PredIL13 (top 16).
https://doi.org/10.1371/journal.pone.0309078.s008
(XLSX)
References
- 1. Del Valle DM, Kim-Schulze S, Huang HH, Beckmann ND, Nirenberg S, Wang B, et al. An inflammatory cytokine signature predicts COVID-19 severity and survival. Nat Med. 2020;26(10):1636–43. Epub 2020/08/26. pmid:32839624.
- 2. Costela-Ruiz VJ, Illescas-Montes R, Puerta-Puerta JM, Ruiz C, Melguizo-Rodriguez L. SARS-CoV-2 infection: The role of cytokines in COVID-19 disease. Cytokine Growth Factor Rev. 2020;54:62–75. Epub 2020/06/10. pmid:32513566.
- 3. Khatun MS, Qin X, Pociask DA, Kolls JK. SARS-CoV2 Endotheliopathy: Insights from Single Cell RNAseq. Am J Respir Crit Care Med. 2022;206(9):1178–9. Epub 2022/07/16. pmid:35839476.
- 4. Donlan AN, Sutherland TE, Marie C, Preissner S, Bradley BT, Carpenter RM, et al. IL-13 is a driver of COVID-19 severity. JCI Insight. 2021;6(15). Epub 2021/06/30. pmid:34185704.
- 5. Morrison CB, Edwards CE, Shaffer KM, Araba KC, Wykoff JA, Williams DR, et al. SARS-CoV-2 infection of airway cells causes intense viral and cell shedding, two spreading mechanisms affected by IL-13. Proc Natl Acad Sci U S A. 2022;119(16):e2119680119. Epub 2022/03/31. pmid:35353667.
- 6. Junttila IS. Tuning the Cytokine Responses: An Update on Interleukin (IL)-4 and IL-13 Receptor Complexes. Front Immunol. 2018;9:888. Epub 2018/06/23. pmid:29930549.
- 7. Punnonen J, Aversa G, Cocks BG, McKenzie AN, Menon S, Zurawski G, et al. Interleukin 13 induces interleukin 4-independent IgG4 and IgE synthesis and CD23 expression by human B cells. Proc Natl Acad Sci U S A. 1993;90(8):3730–4. Epub 1993/04/15. pmid:8097323.
- 8. McKenzie GJ, Bancroft A, Grencis RK, McKenzie AN. A distinct role for interleukin-13 in Th2-cell-mediated immune responses. Curr Biol. 1998;8(6):339–42. Epub 1998/03/25. pmid:9512421.
- 9. Li L, Xia Y, Nguyen A, Lai YH, Feng L, Mosmann TR, et al. Effects of Th2 cytokines on chemokine expression in the lung: IL-13 potently induces eotaxin expression by airway epithelial cells. J Immunol. 1999;162(5):2477–87. Epub 1999/03/11. pmid:10072486.
- 10. Gupta S, Sharma AK, Shastri V, Madhu MK, Sharma VK. Prediction of anti-inflammatory proteins/peptides: an insilico approach. J Transl Med. 2017;15(1):7. Epub 2017/01/07. pmid:28057002.
- 11. Jain S, Dhall A, Patiyal S, Raghava GPS. IL13Pred: A method for predicting immunoregulatory cytokine IL-13 inducing peptides. Comput Biol Med. 2022;143:105297. Epub 2022/02/14. pmid:35152041.
- 12. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019;47(D1):D339–D43. Epub 2018/10/26. pmid:30357391.
- 13. Arora P, Periwal N, Goyal Y, Sood V, Kaur B. iIL13Pred: improved prediction of IL-13 inducing peptides using popular machine learning classifiers. BMC Bioinformatics. 2023;24(1):141. Epub 2023/04/12. pmid:37041520.
- 14. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. Epub 2023/03/18. pmid:36927031.
- 15. Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform. 2020;21(3):1047–57. Epub 2019/05/09. pmid:31067315.
- 16. Bhasin M, Raghava GP. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem. 2004;279(22):23262–6. Epub 2004/03/25. pmid:15039428.
- 17. Chou KC. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins. 2001;43(3):246–55. Epub 2001/04/05. pmid:11288174.
- 18. Chen K, Kurgan LA, Ruan J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct Biol. 2007;7:25. Epub 2007/04/18. pmid:17437643.
- 19. Lee TY, Lin ZQ, Hsieh SJ, Bretana NA, Lu CT. Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics. 2011;27(13):1780–7. Epub 2011/05/10. pmid:21551145.
- 20. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, et al. Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A. 2007;104(11):4337–41. Epub 2007/03/16. pmid:17360525.
- 21. Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A. 1995;92(19):8700–4. Epub 1995/09/12. pmid:7568000
- 22. Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim SH. Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins. 1999;35(4):401–7. Epub 1999/06/26. pmid:10382667.
- 23. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. 2019:1–12.
- 24. Kawashima S, Kanehisa M. AAindex: amino acid index database. Nucleic Acids Res. 2000;28(1):374. Epub 1999/12/11. pmid:10592278.
- 25. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. Epub 1992/11/15. pmid:1438297.
- 26. Sandberg M, Eriksson L, Jonsson J, Sjostrom M, Wold S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J Med Chem. 1998;41(14):2481–91. Epub 1998/07/03. pmid:9651153
- 27. Hellberg S, Sjostrom M, Skagerberg B, Wold S. Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem. 1987;30(7):1126–35. Epub 1987/07/01. pmid:3599020.
- 28. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8. Epub 2007/03/24. pmid:17379688.
- 29. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. arXiv. 2013:1310.4546.
- 30. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31(1):365–70. Epub 2003/01/10. pmid:12520024.
- 31. Rehurek R, Sojka P. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic. 2011;3.
- 32. Breiman L. Random Forests. Machine Learning. 2001;45:5–35.
- 33. Chen T, Guestrin C, editors. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD 2016; 2016; New York: ACM Press.
- 34. Ke G, Meng Q, Finley T, Wang T, Chen W, Ye Q, et al., editors. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017; Long Beach, CA, USA: Curran Associates Inc.
- 35. Yang ZR. Biological applications of support vector machines. Brief Bioinform. 2004;5(4):328–38. Epub 2004/12/21. pmid:15606969.
- 36. Kurata H, Tsukiyama S, Manavalan B. iACVP: markedly enhanced identification of anti-coronavirus peptides using a dataset-specific word2vec model. Brief Bioinform. 2022;23(4). Epub 2022/07/01. pmid:35772910.
- 37. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv. 2018:1810.04805.
- 38. Wang S, Weng S, Ma J, Tang Q. DeepCNF-D: Predicting Protein Order/Disorder Regions by Weighted Deep Convolutional Neural Fields. Int J Mol Sci. 2015;16(8):17315–30. Epub 2015/08/01. pmid:26230689.
- 39. Tsukiyama S, Hasan MM, Fujii S, Kurata H. LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec. Brief Bioinform. 2021;22(6). Epub 2021/06/24. pmid:34160596.
- 40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. JMLR. 2011;12:2825–30.
- 41. Hasan MM, Alam MA, Shoombuatong W, Deng HW, Manavalan B, Kurata H. NeuroPred-FRL: an interpretable prediction model for identifying neuropeptide using feature representation learning. Brief Bioinform. 2021;22(6). Epub 2021/05/12. pmid:33975333.
- 42. Harun-Or-Roshid M, Maeda K, Phan LT, Manavalan B, Kurata H. Stack-DHUpred: Advancing the accuracy of dihydrouridine modification sites detection via stacking approach. Comput Biol Med. 2023;169:107848. Epub 2023/12/26. pmid:38145601.