Abstract
Deep learning (DL) has become a powerful tool for the recognition and classification of biological sequences. However, conventional single-architecture models often struggle with suboptimal predictive performance and high computational costs. To address these challenges, we present EnsembleDL-Lipo, an innovative ensemble deep learning framework that combines Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) to enhance the identification of lipocalin sequences. Lipocalins are multifunctional extracellular proteins involved in various diseases and stress responses, and their low sequence similarity and occurrence in the ‘twilight zone’ of sequence alignment present significant hurdles for accurate classification. These challenges necessitate efficient computational methods to complement traditional, labor-intensive experimental approaches. EnsembleDL-Lipo overcomes these issues by leveraging a set of PSSM-based features to train a large ensemble of deep learning models. The framework integrates multiple feature representations derived from position-specific scoring matrices (PSSMs), optimizing classification performance across diverse sequence patterns. The model achieved superior results on the training dataset, with an accuracy (ACC) of 97.65%, recall of 97.10%, Matthews correlation coefficient (MCC) of 0.95, and area under the curve (AUC) of 0.99. Validation on an independent test set further confirmed the robustness of the model, yielding an ACC of 95.79%, recall of 90.48%, MCC of 0.92, and AUC of 0.97. These results demonstrate that EnsembleDL-Lipo is a highly effective and computationally efficient tool for lipocalin sequence identification, significantly outperforming existing methods and offering strong potential for applications in biomarker discovery.
Citation: Zhang Y, Yu L, Xue L, Liu F, Jing R, Luo J (2025) Optimizing lipocalin sequence classification with ensemble deep learning models. PLoS ONE 20(4): e0319329. https://doi.org/10.1371/journal.pone.0319329
Editor: Nagarajan Raju, Emory University, UNITED STATES OF AMERICA
Received: December 3, 2024; Accepted: January 30, 2025; Published: April 16, 2025
Copyright: © 2025 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code and benchmark data for this study are freely available at https://github.com/jingry/autoBioSeqpy/tree/2.0/examples/Lipo.
Funding: This study received funding from the National Natural Science Foundation of China (No. 21803045 and No. 22203057), the Science and Technology Department Fund of Guizhou Province ([2017]5790-07), the Natural Science Foundation of the Department of Education of Guizhou Province ([2021]021), the Joint Project of Luzhou Municipal People’s Government and Southwest Medical University (2020LZXNYDJ39), and the Research and Development Fund Project of North Sichuan Medical College (CBY22-QNA38). Jiesi Luo provided funding for Project 1 and Project 5 (the National Natural Science Foundation of China, No. 21803045; and the Joint Project of Luzhou Municipal People’s Government and Southwest Medical University, 2020LZXNYDJ39). Jiesi Luo played a significant role in guiding the overall research direction and ensuring its alignment with scientific goals. Their support enabled the research team to explore new methodologies and reach the conclusions presented in this paper. Runyu Jing was supported by the National Natural Science Foundation of China (Project No. 22203057) for this research, contributing through developing methodology and experimental design, optimizing software and algorithms, validating results via statistical analysis and cross-verification, visualizing data with advanced computational tools, and ensuring scientific rigor and clarity through critical manuscript review and editing. Lezheng Yu supported Project 3 and Project 4 (the Science and Technology Department Fund of Guizhou Province, [2017]5790-07; and the Natural Science Foundation of the Department of Education of Guizhou Province, [2021]021). Lezheng Yu’s contribution was essential in facilitating the resources needed for experimental validation and supporting the research team in addressing specific regional health challenges. Zhang Yonglin funded Project 6 (the Research and Development Fund Project of North Sichuan Medical College, CBY22-QNA38). 
Yonglin Zhang contributed his expertise in data analysis and provided critical feedback on the interpretation of results, enriching the overall research approach. We would also like to thank two anonymous reviewers.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Lipocalins are a subgroup within the larger calycin family, a class of small secreted proteins known for their affinity for hydrophobic molecules. They are found across all kingdoms of life except Archaea [1]. These proteins typically range in length from 165 to 200 amino acid residues and have a molecular weight of approximately 20 kDa [2, 3]. Despite generally exhibiting low amino acid sequence identity (usually not exceeding 30%), lipocalins are characterized by three structurally conserved regions (SCRs) and share significant similarities in their three-dimensional structures [4, 5]. Their sequences and distribution are highly diverse, and they play pivotal roles in a variety of biological processes, including stress and immune responses, retinoid binding, pheromone transport, prostaglandin synthesis, tumorigenesis, and apoptosis [6–12]. Certain lipocalins, such as α1-microglobulin [13], apolipoprotein D, complement C8 gamma, lipocalin 2 [14], orosomucoid, protein HC, prostaglandin D synthase, retinol-binding protein, and tear lipocalin [15], have been identified as biomarkers for various diseases. Consequently, there is an urgent need for efficient and accurate methods to identify lipocalins, which will aid in understanding their diverse functions and facilitate the development of novel therapeutics.
Accurately identifying and classifying lipocalin proteins is a challenging task in computational bioinformatics due to their structural and functional diversity, as well as the complex relationships between their sequences. Traditional experimental approaches for protein classification are labor-intensive and cannot keep pace with the growing volume of sequence data generated by high-throughput sequencing technologies. Computational methods, particularly those based on machine learning, offer a scalable solution to this challenge. However, the development of reliable computational models for lipocalin classification is hindered by the limited availability of annotated datasets, potential biases in feature selection, and the need for robust algorithms capable of generalizing across diverse lipocalin families. Muthukumar and colleagues utilized MALDI-TOF/MS to validate the presence of a 14.5 kDa lipocalin protein in the urine of female rodents, establishing a correlation between its expression in urine and the phases of the estrous cycle [16]. Yao et al. conducted a detailed characterization of rLcn13, a member of the rat epididymal lipocalin family [17]. The identification of the rLcn13 lipocalin protein involved various experimental procedures, including breeding white mice, cloning serum, immunohistochemical staining, and reverse transcription quantitative PCR (RT-qPCR). To reduce time and infrastructure costs, several computational methods employing machine learning algorithms have been developed as more accessible solutions for lipocalin identification. Ramana and Gupta employed a support vector machine (SVM) approach named LipocalinPred, utilizing amino acid composition (AAC), dipeptide composition (DPC), secondary structure composition (SSC), and position-specific scoring matrix (PSSM) as input features [18]. The integrated features of PSSM and SSC produced the best model, with an overall accuracy of 90.72%, sensitivity of 88.97%, and specificity of 92.16%. 
Pugalenthi et al. introduced an SVM-based tool, LipoPred, demonstrating its effectiveness in predicting lipocalin proteins, achieving an accuracy of 88.61%, sensitivity of 89.26%, specificity of 85.27%, and Matthews correlation coefficient (MCC) of 0.74 [19]. Nath and Subbiah leveraged diverse balanced training sets and classifier fusion schemes to enhance prediction performance [20]. Using Random Forest (RF) and K-nearest neighbor (KNN) classifiers, along with AAC, attribute group composition, and rationalized n-grams as features, they achieved high performance on test sets. Zulfiqar et al. utilized an RF-based approach, incorporating six types of features to predict lipocalins, achieving an impressive accuracy of 95.03% and an area under the curve (AUC) of 0.987 during 10-fold cross-validation [21]. While the traditional machine learning algorithms mentioned have shown promise, exploring improved alternative models remains essential for further enhancing predictive performance.
Recently, deep learning (DL), a significant sub-discipline of machine learning, has been successfully applied to the identification and classification of various biological sequences [22, 23]. For example, StackedEnC-AOP, which employs stacked ensemble techniques, achieves high predictive performance for antioxidant peptides, demonstrating the potential of combining multiple architectures [24]. A novel approach, DeepAVPTPPred, applies deep learning to predict antiviral peptides (AVPs) using sequence-based features, achieving high accuracy and generalization [25]. Additionally, an innovative stacked ensemble deep learning approach has been tailored for predicting antiviral peptides, further demonstrating the utility of stacked models in complex biological sequence classification tasks [26]. Applications of deep learning in genomics include transcription factor binding, DNase sensitivity, CpG methylation, and predicting the effects of genetic variation on gene regulatory mechanisms, such as DNA accessibility and splicing [27]. In proteomics, deep learning plays a crucial role in predicting protein structure, classifying protein sequences, determining protein subcellular localization, and identifying peptides. Therefore, constructing DL models to enhance the predictive performance of lipocalin protein classification is of substantial interest. DL model architectures primarily include convolutional neural networks (CNNs), recurrent neural networks (RNNs) with bidirectional long short-term memory (BiLSTM) or bidirectional gated recurrent units (BiGRU), and combinations of these networks (e.g., CNN-BiLSTM and CNN-BiGRU). Classical deep neural networks (DNNs), evolved from artificial neural networks (ANNs), typically utilize sequence, structure, function, and other features as input. Several studies have reported that ensemble frameworks incorporating different DL architectures tend to achieve superior predictive performance compared to single architectures [28–30].
In this study, we address these challenges by developing a deep learning-based framework with tailored feature selection and sequence encoding strategies to enable accurate and interpretable classification of lipocalin sequences. We employed Convolutional Neural Network (CNN) and Deep Neural Network (DNN) architectures to construct the ensemble framework, EnsembleDL-Lipo, for the precise identification of lipocalins from their primary sequences. The CNN architecture used a dictionary encoding method to extract protein sequence information, while the DNN architecture employed nine PSSM-based features to represent protein sequences. A total of 511 unique deep learning models were generated by enumerating every non-empty combination of the nine PSSM-based features (2^9 − 1 = 511), and their performance in lipocalin recognition was evaluated, with particular emphasis on the ten best-performing models. By integrating these individual models with varying input features, we developed the ensemble deep learning model, combining a CNN model with dictionary encoding and several DNN models using three specific PSSM-based features (DFMCA_PSSM, DPC-PSSM, and PSSM-AC). To determine the most effective approach for lipocalin recognition, the performance of a single high-accuracy deep learning model was compared with that of the ensemble deep learning framework. The ensemble method was assessed using a training dataset consisting of 212 positive samples and 211 negative samples, alongside an independent test dataset containing 42 lipocalins and 53 non-lipocalins. Finally, to demonstrate the discriminatory capabilities of the ensemble approach, its performance metrics, such as accuracy (ACC), F-value, recall, precision (PRE), and Matthews correlation coefficient (MCC), were compared with those of the previously established Lipo-RF and LipocalinPred models. 
We propose a tailored feature selection strategy, systematically evaluating 511 feature combinations to identify the optimal biologically relevant feature set, and integrate an ensemble learning framework to enhance robustness and generalization across diverse lipocalin sequences. Our model is benchmarked against existing approaches, demonstrating its state-of-the-art performance and practical utility in bioinformatics applications.
Materials and methods
Generation of training and test dataset
Accurate and reliable deep learning prediction models rely on high-quality datasets. The dataset curated by Zulfiqar and colleagues (available at https://zenodo.org/record/5844993#.YeAL7fgRVPZ) is a balanced dataset selected for training and testing deep learning models designed to identify lipocalins [21]. A total of 614 protein sequences, equally divided between 307 lipocalins and 307 non-lipocalins, were sourced from the UniProt database. Redundant sequences were removed with CD-HIT at a sequence identity threshold of 40% [31]. The resulting training dataset consisted of 212 positive and 211 negative samples. The performance of the developed deep learning models was evaluated on an independent test dataset, which included 42 lipocalins and 53 non-lipocalins.
Feature encoding schemes
This study aims to develop an ensemble deep learning approach for the accurate and efficient identification of lipocalin proteins from sequence data. A standardized dataset is essential for the effective extraction of relevant features from these sequences. To identify the optimal feature set for encoding lipocalin protein sequences, we employed a dictionary encoding method in combination with nine PSSM-based features.
Dictionary encoding
Each protein sequence is encoded numerically, with the 20 natural amino acids represented by integers from 1 to 20 (e.g., alanine as 1) [32], while unknown or pseudo-amino acids are encoded as 0. This method transforms each protein sequence into an L-dimensional vector, where L is the sequence length.
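The dictionary encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the autoBioSeqpy implementation: only the mapping of alanine to 1 is stated in the text, so the ordering of the remaining residues (alphabetical by one-letter code) and the names `AMINO_ACIDS`, `AA_TO_INDEX`, and `dictionary_encode` are assumptions made for the example.

```python
# Assumed alphabet order: only "alanine = 1" comes from the text; the
# rest of the ordering (alphabetical one-letter codes) is hypothetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def dictionary_encode(sequence):
    """Map each residue to an integer in 1..20; any other symbol becomes 0."""
    return [AA_TO_INDEX.get(residue, 0) for residue in sequence.upper()]


# A sequence of length L becomes a vector of length L.
encoded = dictionary_encode("ACXY")
print(encoded)  # [1, 2, 0, 20]
```

Here `X` stands in for an unknown residue and is therefore encoded as 0, while the output length always equals the sequence length.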
Evolutionary information-based features
Previous research has consistently demonstrated that incorporating evolutionary information significantly enhances the performance of classifiers in protein recognition [33–35]. We therefore incorporated evolutionary information through the position-specific scoring matrix (PSSM) [36], which captures evolutionary patterns within protein sequences. Using the PSI-BLAST program [37], homologous sequences were searched in the Swiss-Prot or NCBI non-redundant protein databases, followed by multiple sequence alignment to generate an initial PSSM. The matrix rows correspond to amino acid residue positions, the columns to residue types, and the values to the binary logarithms of residue frequencies in the alignments. Positive values indicate conserved residues, while negative values indicate non-conserved ones. Various matrix transformations were then applied to extract nine PSSM-based features, which were calculated using the PSSMCOOL package; details are available in the PSSMCOOL documentation [38].
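To make the idea of a matrix transformation concrete, the sketch below turns an L × 20 PSSM into two fixed-length descriptors. It assumes the definitions commonly used in the literature for AAC-PSSM (the mean of each column) and DPC-PSSM (the mean product of scores at adjacent positions for every pair of columns); the PSSMCOOL implementation may differ in normalization or scaling, and the function names and toy matrix are hypothetical.

```python
def aac_pssm(pssm):
    """Average every column over all L rows (20 values for a 20-column PSSM)."""
    n_rows, n_cols = len(pssm), len(pssm[0])
    return [sum(row[j] for row in pssm) / n_rows for j in range(n_cols)]


def dpc_pssm(pssm):
    """Mean product of column a at position i and column b at position i + 1.

    Yields n_cols * n_cols values (400 for a 20-column PSSM).
    Requires at least two rows.
    """
    n_rows, n_cols = len(pssm), len(pssm[0])
    features = []
    for a in range(n_cols):
        for b in range(n_cols):
            total = sum(pssm[i][a] * pssm[i + 1][b] for i in range(n_rows - 1))
            features.append(total / (n_rows - 1))
    return features


# Toy 3-residue "PSSM" with made-up scores, only to show the output sizes.
toy_pssm = [[(i + j) % 5 - 2 for j in range(20)] for i in range(3)]
print(len(aac_pssm(toy_pssm)), len(dpc_pssm(toy_pssm)))  # 20 400
```

Both descriptors have a length that does not depend on the sequence length L, which is what allows proteins of different lengths to be fed to the same fully connected network.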
Single deep learning model architectures
The study explored the feasibility of two distinct deep learning architectures by employing 10 individual models to differentiate lipocalin proteins. In the CNN architecture, protein sequences were encoded using an amino acid dictionary representation, and a probability score between 0 and 1 was computed to classify lipocalins. In the DNN architecture, nine PSSM-based features were used as input, resulting in the development of nine distinct DNN models.
Ensemble deep learning framework design
In recent years, ensemble deep learning frameworks have gained significant popularity for analyzing various biological sequences due to their superior predictive capabilities compared to individual model architectures [39–41]. The ensemble deep learning model developed in this study integrates two primary architectures with diverse input features. Protein sequences of lipocalins are encoded using an amino acid dictionary embedding representation, which serves as the foundation of the model. This encoded representation is then processed by convolutional layers to capture both local sequence information and global sequence patterns. Discrete feature sets, such as PSSM-based descriptors, are processed by the DNN layer, where all feature sets are aligned to identical dimensions. Through a flattening operation, sequential features are transformed into one-dimensional arrays for each sample. The core architecture of the ensemble deep learning framework includes CNN and DNN, with various combinations of PSSM features enhancing the model’s performance up to the final fully connected layer. The 511 DNN models integrated into the core architecture utilize a combination of PSSM features, including AAC-PSSM, DP-PSSM, DPC-PSSM, Pse-PSSM, PSSM-AC, PSSM400, SVD_PSSM, Single_Average, and DFMCA_PSSM. To improve the reliability of the model, each training iteration is repeated five times. This approach reduces potential errors caused by random factors, thereby minimizing sampling and model fitting inconsistencies.
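The figure of 511 models follows from taking every non-empty subset of the nine PSSM-based features (2^9 − 1 = 511). The sketch below enumerates those subsets with the standard library; the feature names are copied from the list above, while the function name `all_feature_subsets` is introduced here only for illustration and is not part of autoBioSeqpy.

```python
from itertools import combinations

PSSM_FEATURES = [
    "AAC-PSSM", "DP-PSSM", "DPC-PSSM", "Pse-PSSM", "PSSM-AC",
    "PSSM400", "SVD_PSSM", "Single_Average", "DFMCA_PSSM",
]


def all_feature_subsets(features):
    """Return every non-empty subset of the features, ordered by subset size."""
    subsets = []
    for size in range(1, len(features) + 1):
        subsets.extend(combinations(features, size))
    return subsets


subsets = all_feature_subsets(PSSM_FEATURES)
print(len(subsets))  # 511
```

Each tuple in `subsets` corresponds to one candidate set of DNN inputs that is paired with the CNN branch, so the search space is small enough to evaluate exhaustively.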
Implementation
All single deep learning models and our ensemble deep learning framework were designed, trained, and assessed using the autoBioSeqpy tool [42]. The commands were executed on a Windows 10 workstation equipped with an NVIDIA GeForce RTX GPU and CUDA 10.2.95.
Performance assessment
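To comprehensively assess the performance of our ensemble deep learning framework, we adopted five metrics commonly used in bioinformatics and computational biology: accuracy (ACC), F-value, recall, precision (PRE), and the Matthews correlation coefficient (MCC). They are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-value} = \frac{2 \times \mathrm{PRE} \times \mathrm{Recall}}{\mathrm{PRE} + \mathrm{Recall}}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.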
In addition, receiver operating characteristic (ROC) curves were used for a visual comparison of performance. As two key quantitative indexes of overall performance, the areas under the ROC curve (AUC) and under the precision-recall (PR) curve were also computed and are shown in the ROC and PR plots, respectively.
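The five threshold-based metrics can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below is an illustration rather than the evaluation code used in the study. The example counts are inferred from the independent-test description given later in the text (42 lipocalins, 53 non-lipocalins, four lipocalins missed, and a precision of 100%, which implies no false positives); the function name and the zero-division fallbacks are assumptions.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, PRE, Recall, F-value, and MCC as fractions (not percentages)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_value = 2 * pre * recall / (pre + recall) if (pre + recall) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PRE": pre, "Recall": recall, "F-value": f_value, "MCC": mcc}


# Counts inferred from the independent test set: 38 of 42 lipocalins found,
# all 53 non-lipocalins correctly rejected.
m = classification_metrics(tp=38, tn=53, fp=0, fn=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the function gives an ACC of 95.79%, a recall of 90.48%, an F-value of 95.00%, a precision of 100%, and an MCC of 0.92, matching the values reported for the independent test set.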
Results and discussion
Performance comparison of different single deep learning models
Ten single deep learning models were analyzed, comprising one CNN model and nine DNN models (AAC-PSSM, DFMCA-PSSM, DP-PSSM, DPC-PSSM, Pse-PSSM, PSSM-AC, PSSM400, SVD-PSSM, Single-Average), with the detailed structure of each model depicted in Fig 1. To ensure robust and fair comparisons, each model was optimized via hyperparameter tuning (S1 Fig). The prediction results of the ten models on the training dataset are presented in Fig 2. Among them, the DNN model with DPC-PSSM as the input feature (DNN_DPC-PSSM) exhibited the best predictive performance, achieving an accuracy (ACC) of 93.18%, an F-value of 93.18%, a recall of 92.62%, and a Matthews correlation coefficient (MCC) of 0.86. The DNN model using PSSM-AC performed slightly worse, with ACC, F-value, recall, and MCC values of 92.24%, 92.45%, 92.58%, and 0.85, respectively. The CNN architecture with dictionary encoding followed, achieving the highest average precision (PRE) of 95.27% and ranking third in average ACC (91.76%), F-value (91.14%), and MCC (0.84). The DNN model with SVD-PSSM as the input feature yielded the poorest average prediction performance, with an average precision of 26.16% and ACC, F-value, recall, and MCC values of 48.24%, 8.89%, 6.00%, and -0.06, respectively. The superior performance of the DNN architecture with DPC-PSSM features can be attributed to the more detailed and informative nature of these features, which provide a richer representation of sequence characteristics and enable the model to capture more relevant patterns in the lipocalin sequences. 
In contrast, SVD-PSSM shows poorer performance, likely due to its more compressed feature representation, which may result in a loss of important sequence information, leading to suboptimal predictions.
Creation and determination of the ensemble deep learning framework
The ensemble deep learning framework integrates the complementary strengths of different deep learning architectures for predicting lipocalin proteins. Building on the CNN core structure and a tandem DNN architecture, we exhaustively explored all non-empty combinations of the nine PSSM-based features (AAC-PSSM, DFMCA_PSSM, DP_PSSM, DPC_PSSM, Pse_PSSM, PSSM400, PSSM-AC, Single_Average, and SVD_PSSM), yielding 511 candidate ensemble models. The complete prediction results of these models on the training dataset are given in S1 Table, and Fig 3 summarizes the prediction performance of the top 10 ensemble models. These results show that ensemble frameworks can outperform single deep learning models and that predictive performance varies with the combination of feature inputs.
A detailed analysis of S1 Table reveals that the ensemble framework using DFMCA_PSSM, DPC_PSSM, and PSSM-AC as input features achieved the best prediction performance, with ACC, F-value, recall, precision (PRE), and MCC values of 97.65%, 97.53%, 97.10%, 97.96%, and 0.95, respectively. In contrast, the ensemble framework encompassing all nine features attained an average ACC of 78.84% and an MCC of 0.86, notably lower than the former framework. Among the top ten ensemble models highlighted in Fig 3, five were built from combinations of four features, two from combinations of three features, and the remaining three from combinations of two, five, and six features, respectively. Notably, none of these models incorporated more than seven features. Adding more PSSM-based features to the ensemble framework therefore does not always improve predictive performance, which indicates that these features carry redundant information. Ultimately, the framework using the feature combination of DFMCA_PSSM, DPC_PSSM, and PSSM-AC was selected as the final ensemble model for lipocalin protein identification, referred to as EnsembleDL-Lipo (Fig 4).
Performance evaluation of EnsembleDL-Lipo using the independent test dataset
The generalization performance of EnsembleDL-Lipo was further assessed using an independent test dataset comprising 42 lipocalin and 53 non-lipocalin proteins. The ensemble framework again performed strongly, with only four lipocalins erroneously identified as non-lipocalins, yielding an overall accuracy (ACC) of 95.79%, an F-value of 95.00%, a recall of 90.48%, a precision (PRE) of 100.00%, and a Matthews correlation coefficient (MCC) of 0.92. Additionally, receiver operating characteristic (ROC) and precision-recall (PR) curves were used to assess the predictive performance of EnsembleDL-Lipo (Figs 5A and 5B). The areas under the ROC and PR curves were 0.97 and 0.98, respectively. An accuracy-loss curve (Fig 5C) was generated for further analysis. Furthermore, the layer UMAP technique was used to examine the ensemble deep learning framework (Fig 5D) [43]. The 2D UMAP maps show that lipocalin proteins (red point cloud, label 1) and non-lipocalin proteins (purple point cloud, label 0) occupy distinct regions in the latent space of the final hidden layer. To spread the data points in the projection, the UMAP parameters metric, n_neighbors, and min_dist were set to cosine, 28, and 0.8, respectively. The accurate identification of lipocalin proteins thus relies on the combined features extracted by the ensemble deep learning framework. This evidence supports the efficacy and robustness of the proposed ensemble-based deep learning methodology.
Performance comparison of EnsembleDL-Lipo with the latest methods
We compared the prediction performance of EnsembleDL-Lipo with that of the Lipo-RF and LipocalinPred methods for lipocalin proteins. The results of the three methods are summarized in Table 1 using four evaluation metrics: ACC, recall, MCC, and AUC. The results of Lipo-RF and LipocalinPred were taken directly from the work of Zulfiqar et al. [21]. On the training dataset, Lipo-RF achieved ACC, recall, MCC, and AUC values of 95.03%, 96.20%, 0.90, and 0.99, respectively, whereas EnsembleDL-Lipo achieved 97.65%, 97.10%, 0.95, and 0.99. EnsembleDL-Lipo also performed better on the independent test dataset, although the 91.73% recall obtained by Lipo-RF is slightly higher than the 90.48% achieved by EnsembleDL-Lipo. As Table 1 shows, EnsembleDL-Lipo improved ACC, MCC, and AUC by 5.89–10.06%, 4.92–14.12%, and 1.35–4.75%, respectively, compared to Lipo-RF and LipocalinPred. These results indicate that EnsembleDL-Lipo is an effective and practical tool for lipocalin protein identification relative to existing methods.
Conclusion
In this study, a novel deep learning ensemble model named EnsembleDL-Lipo was developed for the accurate identification of lipocalin proteins. Lipocalins are a diverse group of secreted proteins known for their role in binding and transporting various small hydrophobic molecules, serving as biomarkers for a range of diseases [44–46]. While traditional machine learning algorithms like SVM, RF, and KNN have demonstrated some success in identifying lipocalins, there remains a need to improve predictive performance. The EnsembleDL-Lipo deep learning architecture combines CNN and DNN models to build classifiers, testing 511 different feature combinations. The model utilizing the ‘DFMCA_PSSM+DPC-PSSM+PSSM-AC’ feature set demonstrated the best overall performance on the training dataset, achieving ACC, F-value, recall, and MCC scores of 97.65%, 97.53%, 97.10%, and 0.95, respectively. The predictive power of EnsembleDL-Lipo was further validated on an independent test dataset, where it achieved an ACC of 95.79%, an F-value of 95.00%, a recall of 90.48%, a precision of 100.00%, and an MCC of 0.92, with only four misclassifications of lipocalins as non-lipocalins. Evaluation through ROC and PR curves yielded impressive area scores of 0.97 and 0.98, respectively, showcasing the robust predictive performance of EnsembleDL-Lipo.
A comparative analysis with existing methods, Lipo-RF and LipocalinPred, showed that EnsembleDL-Lipo performed better across both training and independent test datasets, with improvements in ACC, MCC, and AUC ranging from 5.89% to 10.06%, 4.92% to 14.12%, and 1.35% to 4.75%, respectively. In summary, the proposed EnsembleDL-Lipo deep learning approach offers an efficient computational method for identifying lipocalin proteins, potentially aiding in elucidating their diverse functions and facilitating the discovery of novel therapeutics. The source code and benchmark data for this study are freely available at https://github.com/jingry/autoBioSeqpy/tree/2.0/examples/Lipo.
Supporting information
S1 Fig. Hyperparameter optimization for 10 single-feature deep learning models.
(A) Performance evaluation of CNN architectures across varying convolutional kernel sizes and filter counts. (B) Comparative analysis of DNN architectures using different protein descriptors and hidden layer configurations.
https://doi.org/10.1371/journal.pone.0319329.s001
(TIF)
S1 Table. Experimental results of the 511 combinations on the training dataset.
https://doi.org/10.1371/journal.pone.0319329.s002
(XLSX)
Acknowledgments
The authors would like to acknowledge the support provided by the Joint Project of Luzhou Municipal People’s Government and Southwest Medical University.
References
- 1. Diez-Hermano S, Ganfornina MD, Skerra A, Gutiérrez G, Sanchez D. An Evolutionary Perspective of the Lipocalin Protein Family. Front Physiol. 2021;12:718983. pmid:34497539
- 2. Grzyb J, Latowski D, Strzałka K. Lipocalins - a family portrait. J Plant Physiol. 2006;163(9):895–915. pmid:16504339
- 3. Lakshmi B, Mishra M, Srinivasan N, Archunan G. Structure-Based Phylogenetic Analysis of the Lipocalin Superfamily. PLoS One. 2015;10(8):e0135507. pmid:26263546
- 4. Flower DR, North AC, Attwood TK. Structure and sequence relationships in the lipocalins and related proteins. Protein Sci. 1993;2(5):753–61. pmid:7684291
- 5. Flower DR, North AC, Sansom CE. The lipocalin protein family: structural and sequence overview. Biochim Biophys Acta. 2000;1482(1–2):9–24. pmid:11058743
- 6. Schiefner A, Skerra A. The menagerie of human lipocalins: a natural protein scaffold for molecular recognition of physiological compounds. Acc Chem Res. 2015;48(4):976–85.
- 7. Parmar T, Parmar VM, Perusek L, Georges A, Takahashi M, Crabb JW, et al. Lipocalin 2 Plays an Important Role in Regulating Inflammation in Retinal Degeneration. J Immunol. 2018;200(9):3128–41. pmid:29602770
- 8. Guardado S, Ojeda-Juárez D, Kaul M, Nordgren TM. Comprehensive review of lipocalin 2-mediated effects in lung inflammation. Am J Physiol Lung Cell Mol Physiol. 2021;321(4):L726–33. pmid:34468208
- 9. Zhang J, Wang Z, Zhang H, Li S, Li J, Liu H, et al. The role of lipocalin 2 in brain injury and recovery after ischemic and hemorrhagic stroke. Front Mol Neurosci. 2022;15:930526. pmid:36187347
- 10. Gupta U, Ghosh S, Wallace CT, Shang P, Xin Y, Nair AP, et al. Increased LCN2 (lipocalin 2) in the RPE decreases autophagy and activates inflammasome-ferroptosis processes in a mouse model of dry AMD. Autophagy. 2023;19(1):92–111. pmid:35473441
- 11. Xu S, Venge P. Lipocalins as biochemical markers of disease. Biochim Biophys Acta. 2000;1482(1–2):298–307. pmid:11058770
- 12. Ghosh S, Stepicheva N, Yazdankhah M, Shang P, Watson AM, Hose S, et al. The role of lipocalin-2 in age-related macular degeneration (AMD). Cell Mol Life Sci. 2020;77(5):835–51. pmid:31901947
- 13. Anderson UD, Olsson MG, Rutardóttir S, Centlow M, Kristensen KH, Isberg PE, et al. Fetal hemoglobin and α1-microglobulin as first- and early second-trimester predictive biomarkers for preeclampsia. Am J Obstet Gynecol. 2011;204(6):520.e1-5. pmid:21439542
- 14. Chakraborty S, Kaur S, Guha S, Batra SK. The multifaceted roles of neutrophil gelatinase associated lipocalin (NGAL) in inflammation and cancer. Biochim Biophys Acta. 2012;1826(1):129–69. pmid:22513004
- 15. Dartt DA. Tear lipocalin: structure and function. Ocul Surf. 2011;9(3):126–38. pmid:21791187
- 16. Muthukumar S, Rajesh D, Saibaba G, Alagesan A, Rengarajan RL, Archunan G. Urinary lipocalin protein in a female rodent with correlation to phases in the estrous cycle: an experimental study accompanied by in silico analysis. PLoS One. 2013;8(8):e71357. pmid:23967199
- 17. Yao G, Xie S, Wan X, Zhang L, Liu Q, Hu S. Identification, characterization and expression analysis of rLcn13, an epididymal lipocalin in rats. Acta Biochim Biophys Sin (Shanghai). 2023;55(2):314–21. pmid:36762499
- 18. Ramana J, Gupta D. LipocalinPred: a SVM-based method for prediction of lipocalins. BMC Bioinformatics. 2009;10:445. pmid:20030857
- 19. Pugalenthi G, Kandaswamy KK, Suganthan PN, Archunan G, Sowdhamini R. Identification of functionally diverse lipocalin proteins from sequence information using support vector machine. Amino Acids. 2010;39(3):777–83.
- 20. Nath A, Subbiah K. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput Biol Chem. 2015;59 Pt A:101–10. pmid:26433483
- 21. Zulfiqar H, Ahmed Z, Ma C-Y, Khan RS, Grace-Mercure BK, Yu X-L, et al. Comprehensive Prediction of Lipocalin Proteins Using Artificial Intelligence Strategy. Front Biosci (Landmark Ed). 2022;27(3):84. pmid:35345316
- 22. Yu L, Xue L, Liu F, Li Y, Jing R, Luo J. The applications of deep learning algorithms on in silico druggable proteins identification. J Adv Res. 2022;41:219–31. pmid:36328750
- 23. Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, et al. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol. 2022;13:843425. pmid:35401453
- 24. Rukh G, Akbar S, Rehman G, Alarfaj FK, Zou Q. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC Bioinformatics. 2024;25(1):256. pmid:39098908
- 25. Ullah M, Akbar S, Raza A, Zou Q. DeepAVP-TPPred: identification of antiviral peptides using transformed image-based localized descriptors and binary tree growth algorithm. Bioinformatics. 2024;40(5):btae305. pmid:38710482
- 26. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: Predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151:102860. pmid:38552379
- 27. Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinformatics. 2024;25(1):102. pmid:38454333
- 28. Chen Z, Duan J, Kang L, Qiu G. Class-Imbalanced Deep Learning via a Class-Balanced Ensemble. IEEE Trans Neural Netw Learn Syst. 2022;33(10):5626–40. pmid:33900923
- 29. Zhao J, Vaios E, Wang Y, Yang Z, Cui Y, Reitman ZJ, et al. Dose-Incorporated Deep Ensemble Learning for Improving Brain Metastasis Stereotactic Radiosurgery Outcome Prediction. Int J Radiat Oncol Biol Phys. 2024;120(2):603–13. pmid:38615888
- 30. Yu T-H, Su B-H, Battalora LC, Liu S, Tseng YJ. Ensemble modeling with machine learning and deep learning to provide interpretable generalized rules for classifying CNS drugs with high prediction power. Brief Bioinform. 2022;23(1):bbab377. pmid:34530437
- 31. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610
- 32. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics. 2018;34(16):2740–7. pmid:29590297
- 33. Luo J, Yu L, Guo Y, Li M. Functional classification of secreted proteins by position specific scoring matrix and auto covariance. Chemom Intell Lab Syst. 2012;110(1):163–7.
- 34. Kabir M, Arif M, Ahmad S, Ali Z, Swati ZNK, Yu D-J. Intelligent computational method for discrimination of anticancer peptides by incorporating sequential and evolutionary profiles information. Chemom Intell Lab Syst. 2018;182:158–65.
- 35. Patiyal S, Dhall A, Bajaj K, Sahu H, Raghava GPS. Prediction of RNA-interacting residues in a protein using CNN and evolutionary profile. Brief Bioinform. 2023;24(1):bbac538. pmid:36516298
- 36. Fu H, Cao Z, Li M, Wang S. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genomics. 2020;21(1):597. pmid:32859150
- 37. Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem Sci. 1998;23(11):444–7. pmid:9852764
- 38. Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: a comprehensive R package for generating evolutionary-based descriptors of protein sequences from PSSM profiles. Biol Methods Protoc. 2022;7(1):bpac008. pmid:35388370
- 39. Jin G, Sha H, Feng Y, Cheng Q, Huang J. GSEN: An ensemble deep learning benchmark model for urban hotspots spatiotemporal prediction. Neurocomputing. 2021;455:353–67.
- 40. Zhou H, Wekesa JS, Luan Y, Meng J. PRPI-SC: an ensemble deep learning model for predicting plant lncRNA-protein interactions. BMC Bioinformatics. 2021;22(Suppl 3):415. pmid:34429059
- 41. Aybey E, Gümüş Ö. SENSDeep: An Ensemble Deep Learning Method for Protein-Protein Interaction Sites Prediction. Interdiscip Sci. 2023;15(1):55–87. pmid:36346583
- 42. Jing R, Li Y, Xue L, Liu F, Li M, Luo J. autoBioSeqpy: A Deep Learning Tool for the Classification of Biological Sequences. J Chem Inf Model. 2020;60(8):3755–64. pmid:32786512
- 43. Jing R, Xue L, Li M, Yu L, Luo J. layerUMAP: A tool for visualizing and understanding deep learning models in biological sequence classification using UMAP. iScience. 2022;25(12):105530. pmid:36425757
- 44. Achatz S, Jarasch A, Skerra A. Structural plasticity in the loop region of engineered lipocalins with novel ligand specificities, so-called Anticalins. J Struct Biol X. 2021;6:100054. pmid:34988429
- 45. Nasioudis D, Witkin SS. Neutrophil gelatinase-associated lipocalin and innate immune responses to bacterial infections. Med Microbiol Immunol. 2015;204(4):471–9. pmid:25716557
- 46. Zhao R-Y, Wei P-J, Sun X, Zhang D-H, He Q-Y, Liu J, et al. Role of lipocalin 2 in stroke. Neurobiol Dis. 2023;179:106044. pmid:36804285