Optimizing lipocalin sequence classification with ensemble deep learning models

  • Yonglin Zhang,

    Roles Formal analysis, Funding acquisition, Investigation, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ‡ These authors share first authorship.

    Affiliation Department of Pharmacy, Affiliated Hospital of North Sichuan Medical College, Nanchong, Sichuan, China

  • Lezheng Yu,

    Roles Conceptualization, Data curation, Funding acquisition, Methodology, Writing – original draft

    ‡ These authors share first authorship.

    Affiliation School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China

  • Li Xue,

    Roles Formal analysis, Supervision, Validation

    Affiliation School of Public Health, Southwest Medical University, Luzhou, China

  • Fengjuan Liu,

    Roles Formal analysis, Supervision

    Affiliation School of Geography and Resources, Guizhou Education University, Guiyang, China

  • Runyu Jing,

    Roles Formal analysis, Methodology, Software, Validation, Visualization, Writing – review & editing

    jingryedu@gmail.com (RJ); ljs@swmu.edu.cn (JL)

    Affiliation School of Mathematics and Big Data, Guizhou Education University, Guiyang, China

  • Jiesi Luo

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Writing – original draft, Writing – review & editing

    jingryedu@gmail.com (RJ); ljs@swmu.edu.cn (JL)

    Affiliations School of Basic Medical Sciences, Southwest Medical University, Luzhou, Sichuan, China, Sichuan Key Medical Laboratory of New Drug Discovery and Druggability Evaluation, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Southwest Medical University, Luzhou, Sichuan, China

Abstract

Deep learning (DL) has become a powerful tool for the recognition and classification of biological sequences. However, conventional single-architecture models often struggle with suboptimal predictive performance and high computational costs. To address these challenges, we present EnsembleDL-Lipo, an innovative ensemble deep learning framework that combines Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) to enhance the identification of lipocalin sequences. Lipocalins are multifunctional extracellular proteins involved in various diseases and stress responses, and their low sequence similarity and occurrence in the ‘twilight zone’ of sequence alignment present significant hurdles for accurate classification. These challenges necessitate efficient computational methods to complement traditional, labor-intensive experimental approaches. EnsembleDL-Lipo overcomes these issues by leveraging a set of PSSM-based features to train a large ensemble of deep learning models. The framework integrates multiple feature representations derived from position-specific scoring matrices (PSSMs), optimizing classification performance across diverse sequence patterns. The model achieved superior results on the training dataset, with an accuracy (ACC) of 97.65%, recall of 97.10%, Matthews correlation coefficient (MCC) of 0.95, and area under the curve (AUC) of 0.99. Validation on an independent test set further confirmed the robustness of the model, yielding an ACC of 95.79%, recall of 90.48%, MCC of 0.92, and AUC of 0.97. These results demonstrate that EnsembleDL-Lipo is a highly effective and computationally efficient tool for lipocalin sequence identification, significantly outperforming existing methods and offering strong potential for applications in biomarker discovery.

Introduction

Lipocalins are a subgroup within the larger calycin family, a class of small secreted proteins known for their affinity for hydrophobic molecules. They are found across all kingdoms of life except Archaea [1]. These proteins typically range in length from 165 to 200 amino acid residues and have a molecular weight of approximately 20 kDa [2, 3]. Despite generally exhibiting low amino acid sequence identity (usually not exceeding 30%), lipocalins are characterized by three structurally conserved regions (SCRs) and share significant similarities in their three-dimensional structures [4, 5]. Their sequences and distribution are highly diverse, and they play pivotal roles in a variety of biological processes, including stress and immune responses, retinoid binding, pheromone transport, prostaglandin synthesis, tumorigenesis, and apoptosis [6–12]. Certain lipocalins, such as α1-microglobulin [13], apolipoprotein D, complement C8 gamma, lipocalin 2 [14], orosomucoid, protein HC, prostaglandin D synthase, retinol-binding protein, and tear lipocalin [15], have been identified as biomarkers for various diseases. Consequently, there is an urgent need for efficient and accurate methods to identify lipocalins, which will aid in understanding their diverse functions and facilitate the development of novel therapeutics.

Accurately identifying and classifying lipocalin proteins is a challenging task in computational bioinformatics due to their structural and functional diversity, as well as the complex relationships between their sequences. Traditional experimental approaches for protein classification are labor-intensive and cannot keep pace with the growing volume of sequence data generated by high-throughput sequencing technologies. Computational methods, particularly those based on machine learning, offer a scalable solution to this challenge. However, the development of reliable computational models for lipocalin classification is hindered by the limited availability of annotated datasets, potential biases in feature selection, and the need for robust algorithms capable of generalizing across diverse lipocalin families. Muthukumar and colleagues utilized MALDI-TOF/MS to validate the presence of a 14.5 kDa lipocalin protein in the urine of female rodents, establishing a correlation between its expression in urine and the phases of the estrous cycle [16]. Yao et al. conducted a detailed characterization of rLcn13, a member of the rat epididymal lipocalin family [17]. The identification of the rLcn13 lipocalin protein involved various experimental procedures, including breeding white mice, cloning serum, immunohistochemical staining, and reverse transcription quantitative PCR (RT-qPCR). To reduce time and infrastructure costs, several computational methods employing machine learning algorithms have been developed as more accessible solutions for lipocalin identification. Ramana and Gupta employed a support vector machine (SVM) approach named LipocalinPred, utilizing amino acid composition (AAC), dipeptide composition (DPC), secondary structure composition (SSC), and position-specific scoring matrix (PSSM) as input features [18]. The integrated features of PSSM and SSC produced the best model, with an overall accuracy of 90.72%, sensitivity of 88.97%, and specificity of 92.16%. 
Pugalenthi et al. introduced an SVM-based tool, LipoPred, demonstrating its effectiveness in predicting lipocalin proteins, achieving an accuracy of 88.61%, sensitivity of 89.26%, specificity of 85.27%, and Matthews correlation coefficient (MCC) of 0.74 [19]. Nath and Subbiah leveraged diverse balanced training sets and classifier fusion schemes to enhance prediction performance [20]. Using Random Forest (RF) and K-nearest neighbor (KNN) classifiers, along with AAC, attribute group composition, and rationalized n-grams as features, they achieved high performance on test sets. Zulfiqar et al. utilized an RF-based approach, incorporating six types of features to predict lipocalins, achieving an impressive accuracy of 95.03% and an area under the curve (AUC) of 0.987 during 10-fold cross-validation [21]. While the traditional machine learning algorithms mentioned have shown promise, exploring improved alternative models remains essential for further enhancing predictive performance.

Recently, deep learning (DL), a significant sub-discipline of machine learning, has been successfully applied to the identification and classification of various biological sequences [22, 23]. For example, StackedEnC-AOP, which employs stacked ensemble techniques, achieves high predictive performance for antioxidant peptides, demonstrating the potential of combining multiple architectures [24]. A novel approach, DeepAVPTPPred, applies deep learning to predict antiviral peptides (AVPs) using sequence-based features, achieving high accuracy and generalization [25]. Additionally, an innovative stacked ensemble deep learning approach has been tailored for predicting antiviral peptides, further demonstrating the utility of stacked models in complex biological sequence classification tasks [26]. Applications of deep learning in genomics include transcription factor binding, DNase sensitivity, CpG methylation, and predicting the effects of genetic variation on gene regulatory mechanisms, such as DNA accessibility and splicing [27]. In proteomics, deep learning plays a crucial role in predicting protein structure, classifying protein sequences, determining protein subcellular localization, and identifying peptides. Therefore, constructing DL models to enhance the predictive performance of lipocalin protein classification is of substantial interest. DL model architectures primarily include convolutional neural networks (CNNs), recurrent neural networks (RNNs) with bidirectional long short-term memory (BiLSTM) or bidirectional gated recurrent units (BiGRU), and combinations of these networks (e.g., CNN-BiLSTM and CNN-BiGRU). Classical deep neural networks (DNNs), evolved from artificial neural networks (ANNs), typically utilize sequence, structure, function, and other features as input. Several studies have reported that ensemble frameworks incorporating different DL architectures tend to achieve superior predictive performance compared to single architectures [28–30].

In this study, we address these challenges by developing a deep learning-based framework with tailored feature selection and sequence encoding strategies to enable accurate and interpretable classification of lipocalin sequences. We employed Convolutional Neural Network (CNN) and Deep Neural Network (DNN) architectures to construct the ensemble framework, EnsembleDL-Lipo, for the precise identification of lipocalins from their primary sequences. The CNN architecture utilized a dictionary encoding method to extract protein sequence information, while the DNN architecture employed nine PSSM-based features to represent protein sequences. A total of 511 unique deep learning models were generated, one for each non-empty combination of the nine PSSM-based features (2^9 − 1 = 511), and their performance in lipocalin recognition was evaluated, with particular emphasis on the ten best-performing models. By integrating these individual models with varying input features, we developed the ensemble deep learning model, combining a CNN model with dictionary encoding and several DNN models using three specific PSSM-based features (DFMCA_PSSM, DPC-PSSM, and PSSM-AC). To ascertain the most effective approach for lipocalin recognition, the performance of a single high-accuracy deep learning model was compared with the ensemble deep learning framework. The efficiency of the ensemble method in identifying lipocalin proteins was assessed using a training dataset consisting of 212 positive samples and 211 negative samples, alongside an independent test dataset containing 42 lipocalins and 53 non-lipocalins. Finally, to demonstrate the superior discriminatory capabilities of the ensemble approach, its performance metrics, such as accuracy (ACC), F-value, recall, precision (PRE), and Matthews correlation coefficient (MCC), were compared with those of the previously established Lipo-RF and LipocalinPred models. 
We propose a tailored feature selection strategy, systematically evaluating 511 feature combinations to identify the optimal biologically relevant feature set, and integrate an ensemble learning framework to enhance robustness and generalization across diverse lipocalin sequences. Our model is benchmarked against existing approaches, demonstrating its state-of-the-art performance and practical utility in bioinformatics applications.

Materials and methods

Generation of training and test dataset

Accurate and reliable deep learning prediction models rely on high-quality datasets. The dataset curated by Zulfiqar and colleagues (available at https://zenodo.org/record/5844993#.YeAL7fgRVPZ) serves as an example of a large, balanced dataset meticulously selected for training and testing deep learning models designed to identify lipocalins [21]. A total of 614 protein sequences, equally divided between 307 lipocalins and 307 non-lipocalins, were sourced from the UniProt database. An identity threshold of 40% was set, and redundant sequences were removed using the CD-HIT method [31]. The resulting training dataset consisted of 212 positive and 211 negative samples. The performance of the developed deep learning models was evaluated using an independent test dataset, which included 42 lipocalins and 53 non-lipocalins.

Feature encoding schemes

This study aims to develop an ensemble deep learning approach for the accurate and efficient identification of lipocalin proteins from sequence data. A standardized dataset is essential for the effective extraction of relevant features from these sequences. To identify the optimal feature set for encoding lipocalin protein sequences, we employed a dictionary encoding method in combination with nine PSSM-based features.

Dictionary encoding

Each protein sequence is encoded numerically, with the 20 natural amino acids represented by integers from 1 to 20 (e.g., alanine as 1) [32], while unknown or pseudo-amino acids are encoded as 0. This method transforms each protein sequence into an L-dimensional vector, where L is the sequence length.
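The dictionary encoding described above can be illustrated with a minimal Python sketch. The alphabet order used here is an assumption: the text only states that alanine maps to 1 and that unknown residues map to 0, so the remaining assignments (alphabetical by one-letter code) are hypothetical, as is the function name.

```python
# Hypothetical residue order; the paper only fixes alanine -> 1 and unknown -> 0.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def dictionary_encode(sequence):
    """Map each residue to an integer in 1..20; any other symbol becomes 0."""
    return [AA_TO_INDEX.get(residue, 0) for residue in sequence.upper()]


encoded = dictionary_encode("ACDXY")
print(encoded)  # [1, 2, 3, 0, 20]
```

The output vector has one entry per residue, so its length equals the sequence length L.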

Evolutionary information-based features

Previous research has consistently demonstrated that incorporating evolutionary information significantly enhances the performance of classifiers in protein recognition [33–35]. We therefore incorporated evolutionary information through the position-specific scoring matrix (PSSM) [36], which captures evolutionary patterns within protein sequences. Using the PSI-BLAST program [37], homologous sequences were searched in the Swiss-Prot or NCBI non-redundant protein databases, followed by multiple sequence alignment to generate an initial PSSM. The matrix rows correspond to amino acid residue positions, columns denote residue names, and values represent the binary logarithms of residue frequencies from the alignments. Positive values indicate conserved residues, while negative values indicate non-conserved ones. Various matrix transformations were applied to extract nine PSSM-based features. These features were calculated using the PSSMCOOL package, with details available in the PSSMCOOL documentation [38].
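To make the idea of a matrix transformation concrete, the following sketch computes two of the nine descriptors, AAC-PSSM and DPC-PSSM, from an L x 20 matrix. It follows the commonly cited definitions (column means, and mean products of scores at adjacent positions) and is an assumption about the general form only; the study itself used the PSSMCOOL package, which may apply additional normalization. The function names and the toy matrix are hypothetical.

```python
def aac_pssm(pssm):
    """Column means of an L x 20 PSSM, giving 20 values."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def dpc_pssm(pssm):
    """Mean product of scores at adjacent positions, giving 400 values."""
    length = len(pssm)
    if length < 2:
        raise ValueError("DPC-PSSM needs at least two residues")
    features = []
    for a in range(20):
        for b in range(20):
            total = sum(pssm[i][a] * pssm[i + 1][b] for i in range(length - 1))
            features.append(total / (length - 1))
    return features


# Toy 3-residue matrix with made-up scores, used only to show output sizes.
toy = [[float(i + j) for j in range(20)] for i in range(3)]
print(len(aac_pssm(toy)), len(dpc_pssm(toy)))  # 20 400
```

Both descriptors have a fixed length regardless of L, which is what allows sequences of different lengths to feed the same DNN input layer.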

Single deep learning model architectures

The study explored the feasibility of two distinct deep learning architectures by employing 10 individual models to differentiate lipocalin proteins. In the CNN architecture, protein sequences were encoded using an amino acid dictionary representation, and a probability score between 0 and 1 was computed to classify lipocalins. In the DNN architecture, nine PSSM-based features were used as input, resulting in the development of nine distinct DNN models.

Ensemble deep learning framework design

In recent years, ensemble deep learning frameworks have gained significant popularity for analyzing various biological sequences due to their superior predictive capabilities compared to individual model architectures [39–41]. The ensemble deep learning model developed in this study integrates two primary architectures with diverse input features. Protein sequences of lipocalins are encoded using an amino acid dictionary embedding representation, which serves as the foundation of the model. This encoded representation is then processed by convolutional layers to capture both local sequence information and global sequence patterns. Discrete feature sets, such as PSSM-based descriptors, are processed by the DNN layer, where all feature sets are aligned to identical dimensions. Through a flattening operation, sequential features are transformed into one-dimensional arrays for each sample. The core architecture of the ensemble deep learning framework includes CNN and DNN, with various combinations of PSSM features enhancing the model’s performance up to the final fully connected layer. The 511 DNN models integrated into the core architecture utilize a combination of PSSM features, including AAC-PSSM, DP-PSSM, DPC-PSSM, Pse-PSSM, PSSM-AC, PSSM400, SVD_PSSM, Single_Average, and DFMCA_PSSM. To improve the reliability of the model, each training iteration is repeated five times. This approach reduces potential errors caused by random factors, thereby minimizing sampling and model fitting inconsistencies.
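The figure of 511 corresponds to the number of non-empty subsets of nine features (2^9 − 1). A short sketch of that enumeration is given below; it only lists the feature combinations and does not reproduce how autoBioSeqpy builds or trains the corresponding models. The function name is hypothetical.

```python
from itertools import combinations

FEATURES = [
    "AAC-PSSM", "DP-PSSM", "DPC-PSSM", "Pse-PSSM", "PSSM-AC",
    "PSSM400", "SVD_PSSM", "Single_Average", "DFMCA_PSSM",
]


def feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


subsets = list(feature_subsets(FEATURES))
print(len(subsets))  # 511
```

Each tuple in `subsets` would define the DNN inputs of one candidate ensemble.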

Implementation

All single deep learning models and our ensemble deep learning framework were designed, trained, and assessed using the autoBioSeqpy tool [42]. The commands were executed on a Windows 10 workstation equipped with an NVIDIA GeForce RTX GPU and CUDA 10.2.95.

Performance assessment

To comprehensively assess the performance of our ensemble deep learning framework, five metrics commonly used in bioinformatics and computational biology are adopted in this study: accuracy (ACC), F-value, Recall, precision (PRE), and Matthews correlation coefficient (MCC). They are defined as follows:

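\begin{align}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\text{F-value} &= \frac{2 \times \mathrm{PRE} \times \mathrm{Recall}}{\mathrm{PRE} + \mathrm{Recall}} \tag{2} \\
\mathrm{Recall} &= \frac{TP}{TP + FN} \tag{3} \\
\mathrm{PRE} &= \frac{TP}{TP + FP} \tag{4} \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}
\end{align}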

where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative, respectively.
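As a worked check of these five metrics, the sketch below evaluates them from confusion-matrix counts. The counts are derived from the independent test results reported later in the paper (42 lipocalins, 53 non-lipocalins, four lipocalins misclassified, precision of 100%), which implies TP = 38, FN = 4, FP = 0, and TN = 53. The function name is hypothetical, and no guard is included for zero denominators.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, F-value, Recall, PRE and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_value = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "F": f_value, "Recall": recall, "PRE": precision, "MCC": mcc}


# Counts derived from the reported independent test results.
metrics = classification_metrics(tp=38, tn=53, fp=0, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded, these values match the reported ACC of 95.79%, F-value of 95.00%, recall of 90.48%, precision of 100.00%, and MCC of 0.92.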

In addition, receiver operating characteristic (ROC) curves are used for a visual performance comparison. As two key quantitative indexes of overall performance, the areas under the ROC curve (AUC) and the precision-recall (PR) curve are also computed, and are shown in the ROC and PR plots, respectively.

Results and discussion

Performance comparison of different single deep learning models

The ten distinct single deep learning models were analyzed, comprising one CNN model and nine DNN models (AAC-PSSM, DFMCA-PSSM, DP-PSSM, DPC-PSSM, Pse-PSSM, AC-PSSM, PSSM400, SVD-PSSM, Single-Average), with the detailed structure of each model depicted in Fig 1. To facilitate robust and equitable comparisons, each model underwent rigorous optimization via hyperparameter tuning (S1 Fig). The comprehensive prediction outcomes of the ten deep learning models on the training dataset are presented in Fig 2. Notably, among the ten deep learning models evaluated, the DNN model with the input feature DPC-PSSM (DNN_DPC-PSSM) exhibited superior predictive performance, achieving an accuracy (ACC) of 93.18%, an F-value of 93.18%, a Recall rate of 92.62%, and a Matthews correlation coefficient (MCC) of 0.86. In comparison, the DNN model incorporating PSSM-AC showed slightly lower prediction accuracy, with ACC, F-value, Recall, and MCC values of 92.24%, 92.45%, 92.58%, and 0.85, respectively. Following these models, the CNN architecture leveraging dictionary encoding demonstrated relatively strong predictive performance, achieving the highest average precision (PRE) score of 95.27% and ranking third in average accuracy (ACC) (91.76%), F-value (91.14%), and MCC (0.84). However, the DNN model with SVD-PSSM as the input feature yielded the poorest average prediction performance, with an average precision of 26.16% and ACC, F-value, Recall, and MCC values of 48.24%, 8.89%, 6.00%, and -0.06, respectively. The superior performance of the DNN architecture with DPC-PSSM features can be attributed to the more detailed and informative nature of these features, which provide a richer representation of sequence characteristics. This enables the model to capture more relevant patterns in the lipocalin sequences. 
In contrast, SVD-PSSM shows poorer performance, likely due to its more compressed feature representation, which may result in a loss of important sequence information, leading to suboptimal predictions.

Fig 1. Overview of all deep learning (DL) architectures, comprising one Convolutional Neural Network (CNN) and nine Deep Neural Network (DNN) model configurations.

https://doi.org/10.1371/journal.pone.0319329.g001

Fig 2. Performance comparison of different deep learning models on the training dataset.

https://doi.org/10.1371/journal.pone.0319329.g002

Creation and determination of the ensemble deep learning framework

The ensemble deep learning framework exhibits remarkable superiority in predicting lipocalin proteins, owing to its integration of distinct advantages from diverse deep learning algorithms. Leveraging the CNN core structure and a tandem DNN architecture, we exhaustively explored all feasible combinations of nine PSSM-based features (AAC-PSSM, DFMCA_PSSM, DP_PSSM, DPC_PSSM, Pse_PSSM, PSSM400, PSSM-AC, Single_Average, and SVD_PSSM), culminating in the development of an ensemble comprising 511 deep learning models. The comprehensive predictive results of these ensemble deep learning models on the training dataset are detailed in the S1 Table, with Fig 3 summarizing the prediction performance of the top 10 ensemble models. These outcomes underscore the enhanced performance potential of ensemble deep learning frameworks over single deep learning models, showcasing the varied predictive performance achieved through the fusion of diverse feature inputs.

Fig 3. The performance of the top 10 ensemble models on the training dataset.

https://doi.org/10.1371/journal.pone.0319329.g003

A detailed analysis from the S1 Table reveals that the amalgamated deep learning framework utilizing DFMCA_PSSM, DPC_PSSM, PSSM-AC as input features demonstrated the optimal prediction performance, achieving ACC, F-value, recall, precision (PRE), and MCC values of 97.65%, 97.53%, 97.10%, 97.96%, and 0.95, respectively. In contrast, the ensemble framework encompassing all nine features attained an average ACC of 78.84% and an MCC of 0.86, notably lower than the former framework. Among the top ten ensemble models highlighted in Fig 3, five models were constructed based on a combination of four features, two models derived from a combination of three features, and three other models stemmed from combinations of two, five, and six features, respectively. Notably, no deep algorithmic prediction model incorporated more than seven features. It is important to emphasize that an increasing number of PSSM-based feature combinations within the ensemble framework does not always result in superior predictive performance, indicating the presence of redundant information across these features. Ultimately, the deep learning composite framework leveraging the feature combination of DFMCA_PSSM, DPC_PSSM, and PSSM-AC was designated as the ultimate ensemble model for lipocalin protein identification, referred to as EnsembleDL-Lipo (Fig 4).

Performance evaluation of EnsembleDL-Lipo using the independent test dataset

The generalization performance of the EnsembleDL-Lipo deep learning framework for predicting lipocalins was further assessed using an independent test dataset comprising 42 lipocalin and 53 non-lipocalin proteins. The results consistently demonstrated the superior predictive capability of the ensemble framework, with only four lipocalins erroneously identified as non-lipocalins, yielding an overall accuracy (ACC) of 95.79%, an F-value of 95.00%, a recall of 90.48%, a precision (PRE) of 100.00%, and a Matthews correlation coefficient (MCC) of 0.92. Additionally, the receiver operating characteristic (ROC) and precision-recall (PR) curves were employed to assess the predictive performance of EnsembleDL-Lipo (Figs 5A and 5B). The areas under the ROC and PR curves were calculated as 0.97 and 0.98, respectively. An acc-loss curve (Fig 5C) was generated for further analysis. Furthermore, the layer UMAP technique was utilized to dissect the ensemble deep learning framework (Fig 5D) [43]. The 2D UMAP maps highlight the distinct distribution of lipocalin proteins (red point cloud, label 1) and non-lipocalin proteins (purple point cloud, label 0), with the latter positioned in the latent space of the final hidden layer. To enhance the dispersal of data points in the projection, specific UMAP parameters such as metric, n_neighbors, and min_dist were adjusted to cosine, 28, and 0.8, respectively. It is crucial to recognize that the accurate identification of lipocalin proteins was achieved through the combined features extracted by the ensemble deep learning framework. This empirical evidence underscores the exceptional efficacy and robustness of our proposed ensemble-based deep learning methodology.

Fig 5. The ROC, PR, accuracy-loss and UMAP of EnsembleDL-Lipo on an independent test set.

https://doi.org/10.1371/journal.pone.0319329.g005

Performance comparison of EnsembleDL-Lipo with the latest methods

In this study, we conducted a comparative analysis of the prediction performance of the EnsembleDL-Lipo deep learning algorithm proposed in this work with the Lipo-RF and Lipocalin-Pred methods for lipocalin proteins. The experimental results of the three methods are summarized in Table 1, utilizing four evaluation metrics: ACC, Recall, MCC, and AUC. The results of the Lipo-RF and Lipocalin-Pred methods were directly obtained from the work of Zulfiqar et al. [21]. For the training dataset, the Lipo-RF method achieved ACC, Recall, MCC, and AUC values of 95.03%, 96.20%, 0.90, and 0.99, respectively. Conversely, the EnsembleDL-Lipo approach proposed in this work exhibited superior performance with ACC, Recall, MCC, and AUC values of 97.65%, 97.10%, 0.95, and 0.99, respectively. The enhanced performance of the EnsembleDL-Lipo method is further highlighted on the independent test dataset, although the 91.73% recall obtained by Lipo-RF slightly outperforms the 90.48% achieved by EnsembleDL-Lipo. From the data presented in Table 1, it is evident that the EnsembleDL-Lipo approach significantly improved ACC, MCC, and AUC by 5.89-10.06%, 4.92-14.12%, and 1.35-4.75% compared to Lipo-RF and Lipocalin-Pred, respectively. The effectiveness and utility of the EnsembleDL-Lipo technique for predicting lipocalin proteins were reaffirmed through the comparison of our self-designed protocol with other state-of-the-art methods based on various assessment metrics. These results emphasize the superior predictive capabilities and performance of the EnsembleDL-Lipo deep learning framework in the domain of lipocalin protein identification.

Table 1. Performance comparison of EnsembleDL-Lipo and two existing methods on the training and independent test datasets.

https://doi.org/10.1371/journal.pone.0319329.t001

Conclusion

In this study, a novel deep learning ensemble model named EnsembleDL-Lipo was developed for the accurate identification of lipocalin proteins. Lipocalins are a diverse group of secreted proteins known for their role in binding and transporting various small hydrophobic molecules, serving as biomarkers for a range of diseases [44–46]. While traditional machine learning algorithms like SVM, RF, and KNN have demonstrated some success in identifying lipocalins, there remains a need to improve predictive performance. The EnsembleDL-Lipo deep learning architecture combines CNN and DNN models to build classifiers, testing 511 different feature combinations. The model utilizing the ‘DFMCA_PSSM+DPC-PSSM+PSSM-AC’ feature set demonstrated the best overall performance on the training dataset, achieving ACC, F-value, recall, and MCC scores of 97.65%, 97.53%, 97.10%, and 0.95, respectively. The predictive power of EnsembleDL-Lipo was further validated on an independent test dataset, where it achieved an ACC of 95.79%, an F-value of 95.00%, a recall of 90.48%, a precision of 100.00%, and an MCC of 0.92, with only four misclassifications of lipocalins as non-lipocalins. Evaluation through ROC and PR curves yielded area scores of 0.97 and 0.98, respectively, showcasing the robust predictive performance of EnsembleDL-Lipo.

A comparative analysis with existing methods, Lipo-RF and Lipocalin-Pred, revealed the superiority of EnsembleDL-Lipo across both training and independent test datasets, exhibiting improvements in ACC, MCC, and AUC ranging from 5.89% to 10.06%, 4.92% to 14.12%, and 1.35% to 4.75%, respectively. In summary, the proposed EnsembleDL-Lipo deep learning approach offers an efficient computational method for identifying lipocalin proteins, potentially aiding in elucidating their diverse functions and facilitating the discovery of novel therapeutics. The source code and benchmark data for this study are freely available at https://github.com/jingry/autoBioSeqpy/tree/2.0/examples/Lipo.

Supporting information

S1 Fig. Hyperparameter optimization for 10 single-feature deep learning models.

(A) Performance evaluation of CNN architectures across varying convolutional kernel sizes and filter counts. (B) Comparative analysis of DNN architectures using different protein descriptors and hidden layer configurations.

https://doi.org/10.1371/journal.pone.0319329.s001

(TIF)

S1 Table. Experimental results of the 511 combinations on the training dataset.

https://doi.org/10.1371/journal.pone.0319329.s002

(XLSX)

Acknowledgments

The authors would like to acknowledge the support provided by the Joint Project of Luzhou Municipal People’s Government and Southwest Medical University.

References

  1. Diez-Hermano S, Ganfornina MD, Skerra A, Gutiérrez G, Sanchez D. An Evolutionary Perspective of the Lipocalin Protein Family. Front Physiol. 2021;12:718983. pmid:34497539
  2. Grzyb J, Latowski D, Strzałka K. Lipocalins - a family portrait. J Plant Physiol. 2006;163(9):895–915. pmid:16504339
  3. Lakshmi B, Mishra M, Srinivasan N, Archunan G. Structure-Based Phylogenetic Analysis of the Lipocalin Superfamily. PLoS One. 2015;10(8):e0135507. pmid:26263546
  4. Flower DR, North AC, Attwood TK. Structure and sequence relationships in the lipocalins and related proteins. Protein Sci. 1993;2(5):753–61. pmid:7684291
  5. Flower DR, North AC, Sansom CE. The lipocalin protein family: structural and sequence overview. Biochim Biophys Acta. 2000;1482(1–2):9–24. pmid:11058743
  6. Schiefner A, Skerra A. The menagerie of human lipocalins: a natural protein scaffold for molecular recognition of physiological compounds. Acc Chem Res. 2015;48(4):976–85.
  7. Parmar T, Parmar VM, Perusek L, Georges A, Takahashi M, Crabb JW, et al. Lipocalin 2 Plays an Important Role in Regulating Inflammation in Retinal Degeneration. J Immunol. 2018;200(9):3128–41. pmid:29602770
  8. Guardado S, Ojeda-Juárez D, Kaul M, Nordgren TM. Comprehensive review of lipocalin 2-mediated effects in lung inflammation. Am J Physiol Lung Cell Mol Physiol. 2021;321(4):L726–33. pmid:34468208
  9. Zhang J, Wang Z, Zhang H, Li S, Li J, Liu H, et al. The role of lipocalin 2 in brain injury and recovery after ischemic and hemorrhagic stroke. Front Mol Neurosci. 2022;15:930526. pmid:36187347
  10. Gupta U, Ghosh S, Wallace CT, Shang P, Xin Y, Nair AP, et al. Increased LCN2 (lipocalin 2) in the RPE decreases autophagy and activates inflammasome-ferroptosis processes in a mouse model of dry AMD. Autophagy. 2023;19(1):92–111. pmid:35473441
  11. Xu S, Venge P. Lipocalins as biochemical markers of disease. Biochim Biophys Acta. 2000;1482(1–2):298–307. pmid:11058770
  12. Ghosh S, Stepicheva N, Yazdankhah M, Shang P, Watson AM, Hose S, et al. The role of lipocalin-2 in age-related macular degeneration (AMD). Cell Mol Life Sci. 2020;77(5):835–51. pmid:31901947
  13. Anderson UD, Olsson MG, Rutardóttir S, Centlow M, Kristensen KH, Isberg PE, et al. Fetal hemoglobin and α1-microglobulin as first- and early second-trimester predictive biomarkers for preeclampsia. Am J Obstet Gynecol. 2011;204(6):520.e1-5. pmid:21439542
  14. Chakraborty S, Kaur S, Guha S, Batra SK. The multifaceted roles of neutrophil gelatinase associated lipocalin (NGAL) in inflammation and cancer. Biochim Biophys Acta. 2012;1826(1):129–69. pmid:22513004
  15. Dartt DA. Tear lipocalin: structure and function. Ocul Surf. 2011;9(3):126–38. pmid:21791187
  16. Muthukumar S, Rajesh D, Saibaba G, Alagesan A, Rengarajan RL, Archunan G. Urinary lipocalin protein in a female rodent with correlation to phases in the estrous cycle: an experimental study accompanied by in silico analysis. PLoS One. 2013;8(8):e71357. pmid:23967199
  17. Yao G, Xie S, Wan X, Zhang L, Liu Q, Hu S. Identification, characterization and expression analysis of rLcn13, an epididymal lipocalin in rats. Acta Biochim Biophys Sin (Shanghai). 2023;55(2):314–21. pmid:36762499
  18. Ramana J, Gupta D. LipocalinPred: a SVM-based method for prediction of lipocalins. BMC Bioinformatics. 2009;10:445. pmid:20030857
  19. Pugalenthi G, Kandaswamy KK, Suganthan PN, Archunan G, Sowdhamini R. Identification of functionally diverse lipocalin proteins from sequence information using support vector machine. Amino Acids. 2010;39(3):777–83.
  20. Nath A, Subbiah K. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput Biol Chem. 2015;59 Pt A:101–10. pmid:26433483
  21. Zulfiqar H, Ahmed Z, Ma C-Y, Khan RS, Grace-Mercure BK, Yu X-L, et al. Comprehensive Prediction of Lipocalin Proteins Using Artificial Intelligence Strategy. Front Biosci (Landmark Ed). 2022;27(3):84. pmid:35345316
  22. 22. Yu L, Xue L, Liu F, Li Y, Jing R, Luo J. The applications of deep learning algorithms on in silico druggable proteins identification. J Adv Res. 2022;41:219–31. pmid:36328750
  23. 23. Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, et al. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol. 2022;13:843425. pmid:35401453
  24. 24. Rukh G, Akbar S, Rehman G, Alarfaj FK, Zou Q. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC Bioinformatics. 2024;25(1):256. pmid:39098908
  25. 25. Ullah M, Akbar S, Raza A, Zou Q. DeepAVP-TPPred: identification of antiviral peptides using transformed image-based localized descriptors and binary tree growth algorithm. Bioinformatics. 2024;40(5):btae305. pmid:38710482
  26. 26. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151:102860. pmid:38552379
  27. 27. Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinformatics. 2024;25(1):102. pmid:38454333
  28. 28. Chen Z, Duan J, Kang L, Qiu G. Class-Imbalanced Deep Learning via a Class-Balanced Ensemble. IEEE Trans Neural Netw Learn Syst. 2022;33(10):5626–40. pmid:33900923
  29. 29. Zhao J, Vaios E, Wang Y, Yang Z, Cui Y, Reitman ZJ, et al. Dose-Incorporated Deep Ensemble Learning for Improving Brain Metastasis Stereotactic Radiosurgery Outcome Prediction. Int J Radiat Oncol Biol Phys. 2024;120(2):603–13. pmid:38615888
  30. 30. Yu T-H, Su B-H, Battalora LC, Liu S, Tseng YJ. Ensemble modeling with machine learning and deep learning to provide interpretable generalized rules for classifying CNS drugs with high prediction power. Brief Bioinform. 2022;23(1):bbab377. pmid:34530437
  31. 31. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610
  32. 32. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics. 2018;34(16):2740–7. pmid:29590297
  33. 33. Luo J, Yu L, Guo Y, Li M. Functional classification of secreted proteins by position specific scoring matrix and auto covariance. Chemom Intell Lab Syst. 2012;110(1):163–7.
  34. 34. Kabir M, Arif M, Ahmad S, Ali Z, Swati ZNK, Yu D-J. Intelligent computational method for discrimination of anticancer peptides by incorporating sequential and evolutionary profiles information. Chemom Intell Lab Syst. 2018;182:158–65.
  35. 35. Patiyal S, Dhall A, Bajaj K, Sahu H, Raghava GPS. Prediction of RNA-interacting residues in a protein using CNN and evolutionary profile. Brief Bioinform. 2023;24(1):bbac538. pmid:36516298
  36. 36. Fu H, Cao Z, Li M, Wang S. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics. 2020;21(1):597. pmid:32859150
  37. 37. Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem Sci. 1998;23(11):444–7. pmid:9852764
  38. 38. Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol Methods Protoc. 2022;7(1):bpac008. pmid:35388370
  39. 39. Jin G, Sha H, Feng Y, Cheng Q, Huang J. GSEN: An ensemble deep learning benchmark model for urban hotspots spatiotemporal prediction. Neurocomputing. 2021;455:353–67.
  40. 40. Zhou H, Wekesa JS, Luan Y, Meng J. PRPI-SC: an ensemble deep learning model for predicting plant lncRNA-protein interactions. BMC Bioinformatics. 2021;22(Suppl 3):415. pmid:34429059
  41. 41. Aybey E, Gümüş Ö. SENSDeep: An Ensemble Deep Learning Method for Protein-Protein Interaction Sites Prediction. Interdiscip Sci. 2023;15(1):55–87. pmid:36346583
  42. 42. Jing R, Li Y, Xue L, Liu F, Li M, Luo J. autoBioSeqpy: A Deep Learning Tool for the Classification of Biological Sequences. J Chem Inf Model. 2020;60(8):3755–64. pmid:32786512
  43. 43. Jing R, Xue L, Li M, Yu L, Luo J. layerUMAP: A tool for visualizing and understanding deep learning models in biological sequence classification using UMAP. iScience. 2022;25(12):105530. pmid:36425757
  44. 44. Achatz S, Jarasch A, Skerra A. Structural plasticity in the loop region of engineered lipocalins with novel ligand specificities, so-called Anticalins. J Struct Biol X. 2021;6:100054. pmid:34988429
  45. 45. Nasioudis D, Witkin SS. Neutrophil gelatinase-associated lipocalin and innate immune responses to bacterial infections. Med Microbiol Immunol. 2015;204(4):471–9. pmid:25716557
  46. 46. Zhao R-Y, Wei P-J, Sun X, Zhang D-H, He Q-Y, Liu J, et al. Role of lipocalin 2 in stroke. Neurobiol Dis. 2023;179:106044. pmid:36804285