Abstract
Tyrosinase plays a central role in melanin biosynthesis, and its dysregulation has been implicated in the pathogenesis of various pigmentation disorders. The precise identification of tyrosinase inhibitory peptides (TIPs) is critical, as these bioactive molecules hold significant potential for therapeutic and cosmetic applications, including the treatment of hyperpigmentation and the development of skin-whitening agents. To date, computational methods have received significant attention as a complement to experimental methods for the in silico identification of TIPs, reducing the need for extensive material resources and labor-intensive processes. In this study, we propose an innovative computational approach, BLSAM-TIP, which combines a bidirectional long short-term memory (BiLSTM) network and a self-attention mechanism (SAM) for accurate and large-scale identification of TIPs. In BLSAM-TIP, we first employed various multi-source feature embeddings, including conventional feature encodings, natural language processing-based encodings, and protein language model-based encodings, to encode comprehensive information about TIPs. Secondly, we integrated these feature embeddings to enhance feature representation, while a feature selection method was applied to optimize the hybrid features. Thirdly, the BiLSTM-SAM architecture was specially developed to highlight the crucial features. Finally, the features from BiLSTM-SAM were fed to a deep neural network (DNN) to identify TIPs. Experimental results on an independent test dataset demonstrate that BLSAM-TIP attains superior predictive performance compared to existing methods, with a balanced accuracy of 0.936, MCC of 0.922, and AUC of 0.988. These results indicate that this new method is an accurate and efficient tool for identifying TIPs. Our proposed method is available at https://github.com/saeed344/BLSAM-TIP for TIP identification and reproducibility purposes.
Citation: Ahmed S, Schaduangrat N, Chumnanpuen P, Mahmud SMH, Goh KOM, Shoombuatong W (2025) BLSAM-TIP: Improved and robust identification of tyrosinase inhibitory peptides by integrating bidirectional LSTM with self-attention mechanism. PLoS One 20(10): e0333614. https://doi.org/10.1371/journal.pone.0333614
Editor: Yunhe Wang, Hebei University of Technology, CHINA
Received: January 15, 2025; Accepted: September 15, 2025; Published: October 8, 2025
Copyright: © 2025 Ahmed et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code of BLSAM-TIP and all the data used in this study are freely available at https://figshare.com/s/e0ddad96bb9a366b373d and https://github.com/saeed344/BLSAM-TIP.
Funding: This project is funded by the National Research Council of Thailand and Mahidol University (N42A660380), and Mahidol University Partnering Initiative under the MU-KMUTT Biomedical Engineering & Biomaterials Research Consortium. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors declare that they have no competing interests.
1. Introduction
Tyrosinase is a crucial enzyme involved in the biosynthesis of melanin, catalyzing the initial steps of melanogenesis in mammals and contributing to enzymatic browning in fruits and vegetables [1–3]. This copper-containing oxidase facilitates the oxidation of phenolic compounds, leading to the formation of melanin, which plays an essential role in pigmentation and protection against UV radiation [1]. However, excessive melanin production can result in hyperpigmentation disorders, and the browning of fruits during storage can lead to economic losses in the food industry. Consequently, there is a significant interest in developing tyrosinase inhibitors as therapeutic agents for skin conditions and as preservatives in food products [4,5]. The overproduction of melanin can lead to hyperpigmentation disorders, such as melasma and age spots, which pose both cosmetic and medical concerns [6,7]. As a result, there has been growing interest in identifying effective inhibitors of tyrosinase to mitigate these conditions [8,9]. The search for effective tyrosinase inhibitors has led to the identification of various natural and synthetic compounds. These inhibitors can be classified into different categories based on their mechanisms of action and chemical structures. Some inhibitors act as competitive or non-competitive agents, while others may irreversibly bind to the enzyme, effectively inactivating it during catalysis. For instance, kojic acid is one of the most well-studied tyrosinase inhibitors and serves as a benchmark for evaluating the efficacy of new compounds [10,11].
Tyrosinase inhibitory peptides (TIPs) have emerged as promising candidates for reducing melanin production. These peptides, typically composed of 3–20 amino acids, can effectively inhibit tyrosinase activity. Recent studies have demonstrated that various bioactive peptides derived from natural sources exhibit strong tyrosinase inhibitory properties, offering a safer alternative to traditional chemical inhibitors such as hydroquinone and kojic acid, which may cause adverse side effects [12,13]. The mechanisms by which TIPs exert their inhibitory effects are multifaceted. These peptides can bind to the active site of tyrosinase, leading to competitive inhibition or, in some cases, irreversible inhibition. Additionally, TIPs may modulate signaling pathways involved in melanogenesis, further enhancing their effectiveness in treating hyperpigmentation [9,14].
The methodologies employed for identifying TIPs can be broadly categorized into in vitro and in silico approaches, each with distinct advantages and limitations. Current experimental methods face significant challenges in high-throughput screening due to their labor-intensive and expensive nature [8,15]. Recently, advancements in computational methods, such as machine learning (ML) algorithms, have facilitated the prediction and identification of novel TIPs. These approaches allow researchers to screen thousands of peptides based on their structural properties and predicted anti-tyrosinase activities, demonstrating high accuracy rates [8,12,15]. By integrating bioinformatics with peptide research, these advancements are driving more precise and efficient strategies for tyrosinase inhibition. Notable examples include ML-based methods such as TIP-KNN and TIP-RF [15], as well as TIPred [8]. Comprehensive details on these cutting-edge techniques are provided in earlier studies [8]. Despite ongoing improvements in the predictive performance of these advanced methods [8,15], their practical effectiveness in real-world applications remains inadequate. Key challenges include the limited availability of known TIPs and issues related to imbalanced learning.
Although the existing methods facilitate the identification of TIPs, several challenges remain to be addressed. First, relying on single feature descriptor is insufficient for capturing the comprehensive information of TIPs [16–20]. Second, protein language models (PLMs), inspired by natural language models (LMs), have recently shown effectiveness in generating peptide sequence representations [19,21,22]. Since PLMs are pre-trained on extensive protein databases such as BFD [23,24], UniRef [25], and Pfam [26], which collectively contain over a billion protein sequences, they can extract comprehensive and valuable information. Regrettably, no studies have yet employed PLMs to generate feature representations for TIPs. Third, the imbalance between TIPs and non-TIPs in datasets can adversely affect the prediction performance of the models. Finally, the overall prediction accuracy and robustness of existing methods remains inadequate, highlighting the need for further improvements.
To address these deficiencies, a novel computational approach, termed BLSAM-TIP, leveraging a combination of bidirectional long short-term memory (BiLSTM) and a self-attention mechanism (SAM), is proposed for the accurate and large-scale identification of TIPs (Fig 1). The major contributions of the proposed model can be summarized in the following four aspects. First, to capture multi-view and comprehensive information about TIPs, various feature encoding schemes were employed, encompassing sequential information, graphical information, statistical information, contextual information, and protein semantic information. Second, the synthetic minority oversampling technique (SMOTE) was utilized to address the impact of data imbalance on the model’s performance. Additionally, the least absolute shrinkage and selection operator (LASSO) method was applied to optimize the combined features, potentially enhancing the model performance. Third, the BiLSTM-SAM-DNN architecture was specially constructed to reduce interference from irrelevant information and subsequently employed to identify TIPs. Fourth, benchmark experiments on the independent test set illustrated that BLSAM-TIP significantly outperformed existing state-of-the-art methods, achieving a balanced accuracy (BACC) of 0.936, Matthews correlation coefficient (MCC) of 0.870, and an area under the receiver operating characteristic curve (AUC) of 0.988.
(A) Data construction. (B) Overall framework of BLSAM-TIP. (C) Performance evaluation and ablation experiments.
2. Materials and methods
2.1 Data collection and curation
The existing predictors for TIPs were developed and fine-tuned using a limited dataset of TIPs and non-TIPs, as detailed in S1 Table. Developing a high-accuracy predictive model necessitates a larger sample size [27–29]. To construct an updated and high-quality dataset, specific filtering criteria were applied to the initial TIPs and non-TIPs: (i) peptide sequences containing unusual letters such as ‘B’, ‘U’, ‘X’, or ‘Z’ were eliminated; and (ii) duplicate peptide sequences were removed. After applying these filters, a refined dataset comprising 206 TIPs and 502 non-TIPs was compiled. The TIPs were sourced from our previous research [8] and six recently published studies [12,15,30–34], while 401 non-TIPs were taken from our earlier work [8] and Qin et al. [35]. Since TIPs are typically shorter than 20 amino acid residues [9], both TIP and non-TIP datasets were restricted to peptide lengths ranging from 2 to 20 amino acids. For the establishment of training and independent test datasets, we adhered to the criteria set forth by Charoenkwan et al. [8]. The training dataset consisted of 164 TIPs and 401 non-TIPs, while the independent test dataset included 42 TIPs and 101 non-TIPs. Additional details regarding the composition of the training and independent test datasets can be found in S1 Table.
2.2 Feature encoding scheme
To capture comprehensive information about TIPs, we employed six feature encoding methods from different perspectives, including conventional feature encodings, NLP-based encodings, and protein language model-based encodings. For conventional feature encoding, we applied a novel feature extraction method called FEGS, which is capable of capturing graphical and statistical information [36]. FEGS integrates two interpretable feature descriptors (i.e., amino acid composition (AAC) and dipeptide composition (DPC)) with the physicochemical properties of amino acids. Based on FEGS, any peptide sequence is encoded as a 578-D feature vector. In recent years, embedding methods inspired by NLP techniques have gained attention in the field of bioinformatics and computational biology [37–40]. These methods provide contextual information for peptide sequences [37,38,41]. Among them, FastText is a powerful embedding method that leverages morphological information to address the issue of out-of-vocabulary words [42,43], thereby improving performance in downstream tasks. Herein, FastText generates a 120-D feature vector for peptide sequences. With advancements in NLP techniques and the availability of millions of protein sequences, PLMs have been increasingly employed as embedding extractors. In this study, we utilized four well-known PLMs, namely bidirectional encoder representations from transformers (BERT) [44], ProtT5-U50, ProtT5-BFD, and ESM-2, to encode peptide sequences into feature embeddings (i.e., distributed vector representations). The text-to-text transfer transformer (T5) [45] architecture was used to develop both ProtT5-BFD and ProtT5-U50. ProtT5-U50 was trained on Uniref50 [25], which contains 45 million protein sequences, while ProtT5-BFD was trained on BFD [46], a database comprising 2.1 billion protein sequences.
To account for the relatively small size of the training dataset, we used ESM-2, which was trained on the UR50/D 2021_04 dataset (called esm2_t6_8M_UR50D). Additionally, we used the esm2_t33_650M_UR50D model [47], which is based on the BERT architecture and was trained on Uniref50. Using these PLMs, peptide sequences were encoded as feature vectors of varying dimensions: 768-D, 1024-D, 1024-D, and 320-D for BERT, ProtT5-U50, ProtT5-BFD, and ESM-2, respectively.
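The full FEGS transform involves graphical representations beyond the scope of a short example, but its two interpretable components, AAC and DPC, are straightforward to compute. The sketch below is an illustrative implementation in pure Python, not the authors' code; the function names are ours.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list:
    """Amino acid composition: 20-D vector of residue frequencies."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq: str) -> list:
    """Dipeptide composition: 400-D vector of dipeptide frequencies."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in counts]

# A peptide maps to a 420-D interpretable descriptor:
vec = aac("ACDKA") + dpc("ACDKA")
```

Both vectors are frequency distributions, so each sums to 1 for any non-empty peptide.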
2.3 Feature selection method
In the field of bioinformatics, feature selection plays an important role in enhancing model efficiency and addressing overfitting [48–50]. Robert Tibshirani introduced a well-regarded feature selection method called LASSO, which performs feature selection and regularization simultaneously and has proven effective in identifying beneficial features from high-dimensional data [51]. Given $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^{n}$ as the features and classes, the linear regression model is defined as follows:

$$y = X\beta + \varepsilon$$

where $X = (x_1, x_2, \ldots, x_p)$, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^{T}$, and $\varepsilon$ is the error term. In the LASSO method, the goal is to determine the optimal value of $\beta$ under a special penalty constraint. The LASSO estimate is defined as follows:

$$\hat{\beta} = \underset{\beta}{\arg\min}\; \lVert y - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{1}$$

where $\lVert \cdot \rVert_{2}$ represents the Euclidean norm and $\lambda \ge 0$ controls the strength of the penalty.
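To make the estimator concrete, the following is a minimal illustrative sketch (not the implementation used in BLSAM-TIP, which relies on standard libraries): the soft-thresholding operator at the heart of LASSO, plus a basic coordinate-descent loop that assumes the feature columns are scaled to unit norm.

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding operator: shrinks coefficients toward zero,
    setting those with magnitude below lam exactly to zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO, assuming unit-norm columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j's current contribution
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam)
    return beta
```

The exact zeros produced by soft-thresholding are what make LASSO a feature selection method: features whose coefficients shrink to zero are discarded.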
2.4 Bidirectional long short-term memory and self-attention mechanism
Long short-term memory (LSTM) networks can learn long-term sequential features without requiring a large number of handcrafted features, unlike traditional ML models, which often depend on additional features to improve performance. The LSTM was developed to address the vanishing gradient problem [52,53], a challenge encountered in recurrent neural networks (RNNs) [54]. LSTMs use memory cells to decide which information to retain and which to discard, enabling them to capture long-range contextual information effectively. The structure of an LSTM typically contains three gates, namely the input gate ($i_t$), the forget gate ($f_t$), and the output gate ($o_t$), together with the memory cell ($c_t$). At time $t$, the formulations of the LSTM structure can be defined as follows:

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $W_i$, $W_f$, and $W_o$ represent the weights of $i_t$, $f_t$, and $o_t$, respectively, while $b_i$, $b_f$, and $b_o$ are the biases of the input gate, forget gate, and output gate, respectively; $\sigma$ denotes the sigmoid function and $\odot$ element-wise multiplication. $c_t$ is the updated cell state, generated from the previous cell state $c_{t-1}$ and the candidate state $\tilde{c}_t$, and $h_t$ is the hidden state. Rather than using a unidirectional LSTM, we applied BiLSTM, which consists of two LSTM layers, one processing the sequence in the forward direction and the other in the backward direction. This design enables BiLSTM to capture both past and future contexts, thereby achieving better prediction performance than standard LSTMs.
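The gate equations and the bidirectional design can be sketched in a few lines of NumPy. This is an illustrative toy with random weights, not the trained BiLSTM used in BLSAM-TIP; for compactness, the four gate pre-activations share one stacked weight matrix W.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four stacked gate
    pre-activations (input, forget, output, candidate)."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    c_tilde = np.tanh(z[3*d:])
    c_t = f * c_prev + i * c_tilde      # updated cell state
    h_t = o * np.tanh(c_t)              # hidden state
    return h_t, c_t

def bilstm(seq, d, rng=np.random.default_rng(0)):
    """Toy BiLSTM: run one LSTM forward and one backward over the
    sequence and concatenate the hidden states at each position."""
    k = seq[0].shape[0]
    Wf, bf = rng.normal(size=(4 * d, d + k)), np.zeros(4 * d)
    Wb, bb = rng.normal(size=(4 * d, d + k)), np.zeros(4 * d)
    hf = hb = np.zeros(d)
    cf = cb = np.zeros(d)
    fwd, bwd = [], []
    for x in seq:
        hf, cf = lstm_step(x, hf, cf, Wf, bf)
        fwd.append(hf)
    for x in reversed(seq):
        hb, cb = lstm_step(x, hb, cb, Wb, bb)
        bwd.append(hb)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]
```

Each output position thus carries a 2d-dimensional representation combining past (forward pass) and future (backward pass) context.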
2.5 Self-attention mechanism
To date, the attention mechanism has effectively helped models highlight important parts of sentences in several NLP tasks, such as aspect-level sentiment classification. Specifically, the attention mechanism can automatically extract significant word embeddings from text sequences during model training [55,56]. Several previous studies have demonstrated its successful application in enhancing model performance in bioinformatics and computational biology [48,57,58]. Thus, after obtaining the features generated from BiLSTM, we employed the SAM to strengthen specific BiLSTM-based feature representations. In the SAM structure, given an input $X$, three standard matrices Query ($Q$), Key ($K$), and Value ($V$) are generated, and the attention output is calculated as follows:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$. Here, $d_k$ represents the dimensionality of $Q$ and $K$, while $W_Q$, $W_K$, and $W_V$ are the weight matrices used to compute $Q$, $K$, and $V$, respectively.
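The scaled dot-product formulation admits a direct NumPy transcription. This is an illustrative sketch only; the projection matrices here are arbitrary arguments, whereas in the actual model they are learned weights.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention weights per row
    return A @ V, A
```

Each row of A is a probability distribution over the input positions, so the output at every position is a weighted mixture of the value vectors, with the weights emphasizing the most relevant features.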
2.6 The overall framework and performance of BLSAM-TIP
Fig 1 illustrates the overall framework and performance of the proposed BLSAM-TIP model for identifying TIPs. As shown in Fig 1, BLSAM-TIP is a DL-based prediction model where the input is a query peptide sequence, and the output is the confidence score for TIP identification. The BLSAM-TIP framework consists of two main procedures: (i) multi-view feature extraction and optimization, and (ii) TIP identification using the BiLSTM-SAM-DNN architecture.
Procedure I: Multi-view feature extraction and optimization.
The input peptide sequence is processed using various feature encoding methods, including FastText, BERT, ProtT5-U50, ProtT5-BFD, ESM-2, and FEGS. These methods generate feature vectors of dimensions 120-D, 768-D, 1024-D, 1024-D, 320-D, and 578-D, respectively. These diverse feature representations capture different types of information, such as sequential information, graphical information, contextual information, and protein semantic information. To comprehensively represent multi-view information beneficial for TIP identification, we combined the above-mentioned feature representations. Given the imbalance between TIP and non-TIP samples (i.e., 164 TIPs and 401 non-TIPs), the learning accuracy and model performance might be impaired [59,60]. Thus, to address data imbalance, the SMOTE method was employed to oversample TIPs [61]. As a result, we obtained a hybrid feature vector containing 3848 features. To eliminate noise and irrelevant information, several well-known feature selection methods were applied, generating various feature subsets. Finally, the optimal feature subset was selected based on the best-performing cross-validation MCC value.
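The SMOTE oversampling step can be sketched as follows, assuming a NumPy array of minority-class (TIP) feature vectors; this mirrors the idea of the method in [61] rather than the exact library implementation used in practice.

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=np.random.default_rng(1)):
    """Minimal SMOTE: each synthetic sample is a random interpolation
    between a minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the minority-class region of the feature space rather than being mere duplicates.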
Procedure II: TIP identification using the BiLSTM-SAM-DNN architecture.
The BiLSTM-SAM architecture, which combines BiLSTM and SAM, was specifically designed to mitigate interference from irrelevant information and enhance prediction performance. Finally, the BiLSTM-SAM-based feature representations were input into deep neural networks (DNN) for the identification of TIPs [48,62]. The performance of BLSAM-TIP and related prediction models was evaluated using several metrics, including BACC, AUC, MCC, F1, area under the precision-recall curve (AUPR), sensitivity (SN), and specificity (SP) [63–68]. Additional details about these seven performance metrics are provided in the Supplementary information.
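For reference, the threshold-based metrics among these can be computed directly from the confusion matrix; the helper below is an illustrative sketch (AUC and AUPR additionally require ranking the confidence scores, which is omitted here).

```python
import math

def metrics(tp, fp, tn, fn):
    """Threshold-based classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)            # sensitivity (recall)
    sp = tn / (tn + fp)            # specificity
    bacc = (sn + sp) / 2           # balanced accuracy
    prec = tp / (tp + fp)          # precision
    f1 = 2 * prec * sn / (prec + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SN": sn, "SP": sp, "BACC": bacc, "F1": f1, "MCC": mcc}
```

BACC and MCC are the headline metrics here because they remain informative on imbalanced test sets, where plain accuracy can be inflated by the majority (non-TIP) class.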
3. Results and discussion
3.1 Performance evaluation of different feature representations
Here, we selected six feature extraction methods (i.e., FastText, BERT, ProtT5-U50, ProtT5-BFD, ESM-2, and FEGS) to capture critical information about TIPs from multiple perspectives, including sequential, graphical, contextual, and protein semantic information. Specifically, the feature representations derived from these methods were processed using the SMOTE method to address the class-imbalance problem in the training dataset (164 TIPs and 401 non-TIPs) [59,60]. Finally, the processed feature representations were fed into the BiLSTM-SAM architecture. To evaluate the representational capability of these feature extraction methods, we assessed their performance in terms of ACC, AUC, AUPR, SN, SP, MCC, and F1 scores through a five-fold cross-validation test, as detailed in Table 1. As shown in Table 1, the MCC values of FastText, BERT, ProtT5-U50, ProtT5-BFD, ESM-2, and FEGS are 0.904, 0.963, 0.981, 0.964, 0.968, and 0.743, respectively. Interestingly, the top-three feature representations were obtained from PLMs. Among these, ProtT5-U50 provided the best feature representation, achieving ACC, SN, SP, F1, AUC, and AUPR values of 0.990, 0.985, 0.995, 0.990, 0.998, and 0.998, respectively.
3.2 Determination of optimal feature subsets
Rather than employing only the best-performing feature (ProtT5-U50) to develop the final model, we combined six feature representations into a single hybrid feature vector (named Hybrid) to enrich the feature space and capture more comprehensive information about TIPs. However, the performance of the Hybrid was lower than that of ProtT5-U50 (as shown in Table 1). The likely reason for this decline is that the Hybrid significantly increased data dimensionality, introducing noise and potentially degrading the model performance. To address this challenge, we applied eight distinct feature selection methods [50,69–71] to the training dataset. These methods included LASSO, mRMR, random projection (RP), truncated singular value decomposition (TSVD), elastic net (EN), graph autoencoders (GAE), principal component analysis (PCA), and spectral embedding (SE), generating eight different feature subsets. For ease of discussion, these feature subsets are referred to as LASSO_FS, mRMR_FS, RP_FS, TSVD_FS, EN_FS, GAE_FS, PCA_FS, and SE_FS, respectively. Specifically, we trained nine individual BiLSTM-SAM-based models, each using one of the feature subsets. Prediction results were evaluated via cross-validation on the training dataset and via the independent test on the independent test dataset. The feature dimensions of the subsets were 259, 400, 230, 700, 700, 301, 37, and 500 for EN_FS, GAE_FS, LASSO_FS, mRMR_FS, PCA_FS, RP_FS, SE_FS, and TSVD_FS, respectively. The optimal feature subset, determined by the highest cross-validation MCC, was considered the most effective for TIP identification.
Fig 2 and Table 2 summarize the prediction results of the selected feature selection methods and the control, where the control refers to the hybrid feature vector. The performance of all the feature subsets exceeded that of the control, with the sole exception of the SE-based feature subset. This indicates the effectiveness of the feature selection methods. As seen in Table 2, five feature selection methods, encompassing EN_FS, LASSO_FS, RP_FS, PCA_FS, and mRMR_FS, achieved cross-validation MCC values greater than 0.980. To evaluate their generalization ability, the performance of these methods was further assessed on the independent test dataset. The corresponding MCC values were 0.841, 0.922, 0.697, 0.651, and 0.712, respectively. Overall, LASSO_FS exhibited optimal performance across both validation strategies. Notably, the MCC values of LASSO_FS were 9.02% and 33.44% higher than the control in terms of the cross-validation and independent tests, respectively. Moreover, on the independent test, the ACC, SN, SP, MCC, F1, AUC, and AUPR values of LASSO_FS were 16.20, 14.58, 17.02, 33.44, 21.26, 14.32, and 20.07% higher than the control, respectively. Thus, we utilized the LASSO_FS subset, comprising 230 selected features, to optimize the proposed BLSAM-TIP model herein.
Comparisons of the ROC curve, AUC value, PR curve, and AUPR value on the training (A, B) and independent test (C, D) datasets.
3.3 Analysis of the contribution of our multi-view features
As mentioned above, our proposed feature subset (LASSO_FS), which combines multi-view information, is a 230-D feature vector derived from 5 FastText-based, 65 BERT-based, 41 ProtT5-U50-based, 46 ProtT5-BFD-based, 12 ESM-2-based, and 61 FEGS-based features (as shown in Fig 3). To investigate the effectiveness of LASSO_FS, we compared its performance with six baseline feature descriptors. The performance results for LASSO_FS and the compared feature descriptors in both cross-validation and independent tests are summarized in Fig 4 and Table 3. As observed in Table 3, LASSO_FS achieved the best overall predictive performance across both cross-validation and independent tests. Compared to the best-performing baseline feature descriptor (ProtT5-U50) in the independent test, LASSO_FS achieved ACC, MCC, F1, AUC, and AUPR values of 0.965, 0.922, 0.948, 0.988, and 0.982, respectively, representing improvements of 10.56, 24.17, 16.58, 7.00, and 11.42%.
The number (A) and proportion (B) of each type of feature embedding selected from the optimal feature set.
Comparisons of the ROC curve, AUC value, PR curve, and AUPR value on the training (A, B) and independent test (C, D) datasets.
To further illustrate the effectiveness of LASSO_FS in distinguishing TIPs from non-TIPs, we visualized the distribution of TIPs and non-TIPs using the t-distributed stochastic neighbor embedding (t-SNE) method, which reduces the original feature space to a two-dimensional space [72]. Herein, seven t-SNE plots were created, as shown in Fig 5. It is evident that LASSO_FS (Fig 5G) forms two clear clusters of TIPs and non-TIPs, whereas unclear clusters were observed in the feature spaces of FastText, BERT, ProtT5-U50, ProtT5-BFD, ESM-2, and FEGS. Overall, these analysis results are sufficient to indicate that LASSO_FS effectively captures essential and sufficient information about TIPs. This explains why the proposed BLSAM-TIP model trained with LASSO_FS can precisely classify TIPs with great prediction performance.
3.4 Performance comparison between BiLSTM-TIP and several conventional ML and DL models
To elucidate the effectiveness and robustness of the proposed BLSAM-TIP model, we compared its performance with several ML and DL classifiers using the same training and independent test datasets to ensure a fair comparison. In addition, all the compared ML and DL classifiers were constructed using the LASSO_FS feature subset and their optimal parameters, with the grid search space for each ML and DL classifier summarized in S2 and S3 Tables, respectively. Herein, we selected 12 ML methods (i.e., NB, DT, RF, KNN, ADA, LGBM, GBDT, XGB, MLP, ET, LR, and SVM) and 7 DL methods (i.e., CNN-BiLSTM, BiLSTM, DNN, BiGRU, CNN, GRU, and LSTM) to conduct the comparative experiments [18–20,73]. To date, these ML and DL methods have been widely and successfully applied to address numerous research questions in bioinformatics [20,38,48,49,74]. From Fig 6, Tables 4, 5, and S4 and S5 Tables, several key observations can be drawn as follows: (i) Among the top-five classifiers, almost all were based on DL methods (i.e., BiGRU, CNN, GRU, and LSTM), with the sole exception of SVM. The MCC values for BiGRU, CNN, SVM, GRU, and LSTM were 0.946, 0.949, 0.953, 0.968, and 0.978, respectively; (ii) On the independent test dataset, CNN and SVM still outperformed other classifiers, achieving MCC values of 0.841 and 0.861, respectively. This observation suggests that DL methods are particularly effective in leveraging information from large-scale datasets to attain impressive performance [38,74]; and (iii) When comparing BLSAM-TIP with CNN and SVM, BLSAM-TIP demonstrated slightly better performance. Specifically, BLSAM-TIP outperformed the compared models by 2.82–3.52% in ACC, 6.07–8.07% in MCC, 5.19–5.71% in F1, and 10.42–14.58% in SN. In summary, the proposed BLSAM-TIP model is more effective than several conventional ML and DL models in the identification of TIPs, especially in terms of performance on the independent test.
These results indicate the excellent generalization ability and robustness of BLSAM-TIP.
(A-B) Performance comparison of BLSAM-TIP with conventional ML models. (C-D) Performance comparison of BLSAM-TIP with conventional DL models.
3.5 Performance comparison between BLSAM-TIP and the existing methods
To reveal the excellent performance of the proposed BLSAM-TIP model, we compared it with existing methods, including TIP-KNN [15], TIP-RF [15], and TIPred [8], using the independent test, as summarized in Fig 7. Since TIP-KNN and TIP-RF are not available as online web servers for TIP identification, we implemented KNN-based and RF-based classifiers in conjunction with the selected feature encodings (i.e., AAC, DPC, and PCP). For TIPred, we evaluated its web server using the default threshold. As can be seen from Fig 7, BLSAM-TIP significantly outperformed all existing methods across nearly all performance metrics, including ACC, BACC, SN, MCC, AUC, and AUPR. To be specific, compared to the runner-up TIPred, BLSAM-TIP attained improvements of 7.53, 17.26, 9.18, 3.46, and 4.83% in BACC, SN, MCC, AUC, and AUPR values, respectively. Interestingly, the outstanding SN value of BLSAM-TIP underscores its ability to effectively minimize false negatives. Taken together, these results confirm that BLSAM-TIP delivers more stable and superior performance than the existing methods.
3.6 Ablation study
In this section, we performed ablation experiments to assess the contribution of each component of our proposed computational approach to the accurate identification of TIPs. Specifically, we compared BLSAM-TIP with two modified versions: (i) BLSAM-TIP (-SAM), a version of BLSAM-TIP trained without the use of SAM, and (ii) BLSAM-TIP (-LASSO_FS), a version of BLSAM-TIP trained using the original hybrid feature vector containing 3848 features instead of the optimized LASSO_FS subset. As can be noticed from Fig 8, BLSAM-TIP outperformed its modified versions in terms of ACC, SN, SP, MCC, and F1 scores in both cross-validation and independent tests. While BLSAM-TIP achieved comparable AUC and AUPR values to BLSAM-TIP (-SAM) in the cross-validation test, it demonstrated superior performance in the independent test. Specifically, BLSAM-TIP achieved ACC, SN, MCC, and F1 scores that were 4.23, 8.33, 9.57, and 6.42% higher than those of BLSAM-TIP (-SAM), respectively. These results confirm that the proposed computational approach benefits from the inclusion of individual components, such as SAM and the LASSO_FS feature subset, enabling it to attain more accurate and robust TIP identification.
3.7 Case study
As can be seen in the above experiments, BLSAM-TIP consistently achieved stable and superior performance in TIP identification. In this section, we conducted case studies to investigate how effectively BLSAM-TIP can identify novel TIPs in unknown samples. Initially, we collected 11 experimentally validated TIPs from previous studies [75–82], while 67 new non-TIPs were peptides that had been experimentally validated as low or non-active, ranging from 5 to 20 amino acids in length [12,83] (S6 Table). Notably, these new peptides were not included in the training or independent test datasets, ensuring an unbiased assessment of our model’s generalization ability. The detailed prediction results of BLSAM-TIP and the existing methods (i.e., TIP-KNN, TIP-RF, and TIPred) in terms of the case studies are recorded in S7 Table. As can be seen from S7 Table, BLSAM-TIP outperformed all the existing methods. Specifically, when compared with TIP-KNN and TIP-RF, BLSAM-TIP (ACC of 0.833) significantly outperformed these compared methods (ACC ranging from 0.295 to 0.654). This capability is important for prioritizing and ranking novel TIPs among large sets of uncharacterized peptides. Altogether, BLSAM-TIP shows clear superiority over the compared methods and holds promise as a computational tool for TIP identification.
4. Conclusions
This study presents BLSAM-TIP, a novel computational approach for the accurate and large-scale identification of TIPs by combining BiLSTM with SAM. Both cross-validation and independent tests confirm that BLSAM-TIP is an accurate and robust computational tool. In terms of the independent test, BLSAM-TIP significantly outperformed state-of-the-art methods for TIP identification, achieving a BACC of 0.936, MCC of 0.870, and AUC of 0.988. The excellent performance of BLSAM-TIP can be attributed to four major reasons: (i) Several feature encoding schemes from several perspectives are employed to capture multi-view and sufficient information about TIPs, including sequential, graphical, statistical, contextual, and protein semantic information; (ii) The SMOTE method is applied to handle the issue of class imbalance effectively; (iii) The LASSO-based feature subset contains excellent discriminative information, which contributes to significant performance improvements; and (iv) The BiLSTM-SAM-DNN architecture can effectively leverage the strengths of individual components to attain more accurate and stable TIP identification. Although BLSAM-TIP has greatly enhanced and facilitated TIP identification, there is still ample room for further improvement. One possible extension is to incorporate interpretable feature representations (such as physicochemical properties (PCPs) or amino acid and dipeptide propensities) into the current feature subset. Another potential enhancement is to implement the BLSAM-TIP web server to facilitate the in-silico identification of peptides with tyrosinase inhibitory properties.
Supporting information
S1 Table. A number of TIPs and non-TIPs used for developing three TIP predictors.
https://doi.org/10.1371/journal.pone.0333614.s001
(DOCX)
S2 Table. Information of parameter settings for 12 ML methods used in this study.
https://doi.org/10.1371/journal.pone.0333614.s002
(DOCX)
S3 Table. Information of parameter settings for five DL methods used in this study.
https://doi.org/10.1371/journal.pone.0333614.s003
(DOCX)
S4 Table. Comparison of the prediction results of BLSAM-TIP and conventional ML methods over the cross-validation and independent tests.
https://doi.org/10.1371/journal.pone.0333614.s004
(DOCX)
S5 Table. Comparison of the prediction results of BLSAM-TIP and conventional DL methods over the cross-validation and independent tests.
https://doi.org/10.1371/journal.pone.0333614.s005
(DOCX)
S6 Table. Detailed information of new TIPs and non-TIPs in the case studies.
https://doi.org/10.1371/journal.pone.0333614.s006
(DOCX)
S7 Table. Detailed prediction results of TIP-KNN, TIP-RF, TIPred, and BLSAM-TIP on case studies.
https://doi.org/10.1371/journal.pone.0333614.s007
(DOCX)
References
- 1. Pathak MA, Jimbow K, Szabo G, Fitzpatrick TB. Sunlight and melanin pigmentation. Photochem Photobiol Rev. 1976;1:211–39.
- 2. Chai W, Wei W, Hu X, Bai Q, Guo Y, Zhang M, et al. Inhibitory effect and molecular mechanism on tyrosinase and browning of fresh-cut apple by longan shell tannins. Int J Biol Macromol. 2024;274(Pt 2):133326. pmid:38925198
- 3. Gandía-Herrero F, Jiménez M, Cabanes J, García-Carmona F, Escribano J. Tyrosinase inhibitory activity of cucumber compounds: enzymes responsible for browning in cucumber. J Agric Food Chem. 2003;51(26):7764–9.
- 4. Baber MA, Crist CM, Devolve NL, Patrone JD. Tyrosinase inhibitors: a perspective. Molecules. 2023;28(15):5762.
- 5. Chang T-S. An updated review of tyrosinase inhibitors. Int J Mol Sci. 2009;10(6):2440–75. pmid:19582213
- 6. Tayier N, Qin N-Y, Zhao L-N, Zeng Y, Wang Y, Hu G, et al. Theoretical exploring of a molecular mechanism for melanin inhibitory activity of calycosin in Zebrafish. Molecules. 2021;26(22):6998. pmid:34834088
- 7. Ando H, Kondoh H, Ichihashi M, Hearing VJ. Approaches to identify inhibitors of melanin biosynthesis via the quality control of tyrosinase. J Invest Dermatol. 2007;127(4):751–61. pmid:17218941
- 8. Charoenkwan P, Kongsompong S, Schaduangrat N, Chumnanpuen P, Shoombuatong W. TIPred: a novel stacked ensemble approach for the accelerated discovery of tyrosinase inhibitory peptides. BMC Bioinform. 2023;24(1):356. pmid:37735626
- 9. Song Y, Chen S, Li L, Zeng Y, Hu X. The hypopigmentation mechanism of tyrosinase inhibitory peptides derived from food proteins: an overview. Molecules. 2022;27(9):2710. pmid:35566061
- 10. Lee SY, Baek N, Nam T. Natural, semisynthetic and synthetic tyrosinase inhibitors. J Enzyme Inhib Med Chem. 2016;31(1):1–13. pmid:25683082
- 11. Deri B, et al. The unravelling of the complex pattern of tyrosinase inhibition. Sci Rep. 2016;6(1):1–10.
- 12. Kongsompong S, E-Kobon T, Taengphan W, Sangkhawasi M, Khongkow M, Chumnanpuen P. Computer-aided virtual screening and in vitro validation of biomimetic tyrosinase inhibitory peptides from abalone peptidome. Int J Mol Sci. 2023;24(4):3154. pmid:36834568
- 13. Hassan M, Shahzadi S, Kloczkowski A. Tyrosinase inhibitors naturally present in plants and synthetic modifications of these natural products as anti-melanogenic agents: a review. Molecules. 2023;28(1):378.
- 14. Wang W, Lin H, Shen W, Qin X, Gao J, Cao W, et al. Optimization of a novel tyrosinase inhibitory peptide from atrina pectinata mantle and its molecular inhibitory mechanism. Foods. 2023;12(21):3884. pmid:37959003
- 15. Kongsompong S, E-Kobon T, Chumnanpuen P. K-nearest neighbor and random forest-based prediction of putative tyrosinase inhibitory peptides of abalone haliotis diversicolor. Molecules. 2021;26(12):3671. pmid:34208619
- 16. Charoenkwan P, Ahmed S, Nantasenamat C, Quinn JMW, Moni MA, Lio’ P, et al. AMYPred-FRL is a novel approach for accurate prediction of amyloid proteins by using feature representation learning. Sci Rep. 2022;12(1):7697. pmid:35546347
- 17. Charoenkwan P, Nantasenamat C, Hasan MM, Moni MA, Lio’ P, Manavalan B, et al. StackDPPIV: a novel computational approach for accurate prediction of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides. Methods. 2022;204:189–98. pmid:34883239
- 18. Charoenkwan P, Schaduangrat N, Manavalan B, Shoombuatong W. M3S-ALG: Improved and robust prediction of allergenicity of chemical compounds by using a novel multi-step stacking strategy. Future Gener Comp Syst. 2025;162:107455.
- 19. Shoombuatong W, Homdee N, Schaduangrat N, Chumnanpuen P. Leveraging a meta-learning approach to advance the accuracy of Nav blocking peptides prediction. Sci Rep. 2024;14(1):4463. pmid:38396246
- 20. Shoombuatong W, Meewan I, Mookdarsanit L, Schaduangrat N. Stack-HDAC3i: a high-precision identification of HDAC3 inhibitors by exploiting a stacked ensemble-learning framework. Methods. 2024;230:147–57. pmid:39191338
- 21. Pham NT, Zhang Y, Rakkiyappan R, Manavalan B. HOTGpred: enhancing human O-linked threonine glycosylation prediction using integrated pretrained protein language model-based features and multi-stage feature selection approach. Comput Biol Med. 2024;179:108859. pmid:39029431
- 22. Zhu Y-H, Liu Z, Liu Y, Ji Z, Yu D-J. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform. 2024;25(2):bbae040. pmid:38349057
- 23. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):2542. pmid:29959318
- 24. Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods. 2019;16(7):603–6. pmid:31235882
- 25. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23(10):1282–8. pmid:17379688
- 26. Bateman A, et al. The Pfam protein families database. Nucleic Acids Res. 2004;32(suppl_1):D138–41.
- 27. Wei L, Zhou C, Chen H, Song J, Su R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16. pmid:29868903
- 28. Rao B, Zhou C, Zhang G, Su R, Wei L. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform. 2020;21(5):1846–55. pmid:31729528
- 29. Charoenkwan P, Chiangjong W, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. StackIL6: a stacking ensemble model for improving the prediction of IL-6 inducing peptides. Brief Bioinform. 2021;22(6):bbab172. pmid:33963832
- 30. Ledwoń P, Goldeman W, Hałdys K, Jewgiński M, Calamai G, Rossowska J, et al. Tripeptides conjugated with thiosemicarbazones: new inhibitors of tyrosinase for cosmeceutical use. J Enzyme Inhib Med Chem. 2023;38(1):2193676. pmid:37146256
- 31. Le NQK, Li W, Cao Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief Bioinform. 2023;24(5):bbad319. pmid:37649385
- 32. Song Y, Li J, Tian H, Xiang H, Chen S, Li L, et al. Copper chelating peptides derived from tilapia (Oreochromis niloticus) skin as tyrosinase inhibitor: biological evaluation, in silico investigation and in vivo effects. Food Res Int. 2023;163:112307. pmid:36596203
- 33. Liu Y, Liu Y, Wang GA, Cheng Y, Bi S, Zhu X. BERT-Kgly: a bidirectional encoder representations from transformers (BERT)-based model for predicting lysine glycation site for homo sapiens. Front Bioinform. 2022;2.
- 34. Xue W, Liu X, Zhao W, Yu Z. Identification and molecular mechanism of novel tyrosinase inhibitory peptides from collagen. J Food Sci. 2022;87(6):2744–56. pmid:35603815
- 35. Qin D, Jiao L, Wang R, Zhao Y, Hao Y, Liang G. Prediction of antioxidant peptides using a quantitative structure-activity relationship predictor (AnOxPP) based on bidirectional long short-term memory neural network and interpretable amino acid descriptors. Comput Biol Med. 2023;154:106591. pmid:36701965
- 36. Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J. FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinform. 2021;22:1–15.
- 37. Raza A, Uddin J, Almuhaimeed A, Akbar S, Zou Q, Ahmad A. AIPs-SnTCN: Predicting anti-inflammatory peptides using fastText and transformer encoder-based hybrid word embedding with self-normalized temporal convolutional networks. J Chem Inf Model. 2023;63(21):6537–54.
- 38. Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics. 2021;37(17):2556–62. pmid:33638635
- 39. Zulfiqar H, Sun Z-J, Huang Q-L, Yuan S-S, Lv H, Dao F-Y, et al. Deep-4mCW2V: a sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli. Methods. 2022;203:558–63. pmid:34352373
- 40. Le NQK. Leveraging transformers‐based language models in proteome bioinformatics. Proteomics. 2023;23(23–24):2300011.
- 41. Do DT, Le TQT, Le NQK. Using deep neural networks and biological subwords to detect protein S-sulfenylation sites. Brief Bioinform. 2021;22(3):bbaa128. pmid:32613242
- 42. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. 2016.
- 43. Asgari E, Mofrad MRK. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One. 2015;10(11):e0141287. pmid:26555596
- 44. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
- 45. Raffel C, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1–67.
- 46. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
- 47. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
- 48. Zhang T, Jia J, Chen C, Zhang Y, Yu B. BiGRUD-SA: protein S-sulfenylation sites prediction based on BiGRU and self-attention. Comput Biol Med. 2023;163:107145. pmid:37336062
- 49. Zhang X, Wang Y, Wei Q, He S, Salhi A, Yu B. DRBPPred-GAT: Accurate prediction of DNA-binding proteins and RNA-binding proteins based on graph multi-head attention network. Knowledge-Based Syst. 2024;285:111354.
- 50. Zou Q, Zeng J, Cao L, Ji R. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing. 2016;173:346–54.
- 51. Li Y, Chen Z, Wang Q, Lv X, Cheng Z, Wu Y, et al. Identification of hub proteins in cerebrospinal fluid as potential biomarkers of Alzheimer’s disease by integrated bioinformatics. J Neurol. 2023;270(3):1487–500. pmid:36396814
- 52. Hochreiter S, Schmidhuber J. LSTM can solve hard long time lag problems. Adv Neural Inform Process Syst. 1996;9.
- 53. Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl-Based Syst. 1998;6(2):107–16.
- 54. Medsker LR, Jain LC, editors. Recurrent neural networks: design and applications. CRC Press; 2001.
- 55. Gibbons FX. Self-attention and behavior: a review and theoretical update. Adv Exp Soc Psychol. 1990;23:249–303.
- 56. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inform Process Syst. 2017;30.
- 57. Arif R, Kanwal S, Ahmed S, Kabir M. A computational predictor for accurate identification of tumor homing peptides by integrating sequential and deep BiLSTM features. Interdiscip Sci. 2024;16(2):503–18. pmid:38733473
- 58. Peng D, Zhang D, Liu C, Lu J. BG-SAC: entity relationship classification model based on self-attention supported capsule networks. Appl Soft Comput. 2020;91:106186.
- 59. Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI J. 2022;21:11–29. pmid:35145365
- 60. Schaduangrat N, Nantasenamat C, Prachayasittikul V, Shoombuatong W. ACPred: a computational tool for the prediction and analysis of anticancer peptides. Molecules. 2019;24(10):1973. pmid:31121946
- 61. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- 62. Yan L, Wang M, Zhou H, Liu Y, Yu B. AntiCVP-Deep: identify anti-coronavirus peptides between different negative datasets based on self-attention and deep learning. Biomed Signal Process Control. 2024;90:105909.
- 63. Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5(9):1315–6. pmid:20736804
- 64. Ge F, Arif M, Yan Z, Alahmadi H, Worachartcheewan A, Yu D-J, et al. MMPatho: leveraging multilevel consensus and evolutionary information for enhanced missense mutation pathogenic prediction. J Chem Inf Model. 2023;63(22):7239–57. pmid:37947586
- 65. Azadpour M, McKay CM, Smith RL. Estimating confidence intervals for information transfer analysis of confusion matrices. J Acoust Soc Am. 2014;135(3):EL140–6.
- 66. Akbar S, Ullah M, Raza A, Zou Q, Alghamdi W. DeepAIPs-Pred: predicting anti-inflammatory peptides using local evolutionary transformation images and structural embedding-based optimal descriptors with self-normalized BiTCNs. J Chem Inf Model. 2024;64(24):9609–25. pmid:39625463
- 67. Akbar S, Zou Q, Raza A, Alarfaj FK. iAFPs-Mv-BiTCN: predicting antifungal peptides using self-attention transformer embedding and transform evolutionary based multi-view features with bidirectional temporal convolutional networks. Artif Intell Med. 2024;151:102860. pmid:38552379
- 68. Akbar S, Raza A, Zou Q. Deepstacked-AVPs: predicting antiviral peptides using tri-segment evolutionary profile and word embedding based multi-perspective features with deep stacking model. BMC Bioinform. 2024;25(1):102. pmid:38454333
- 69. Sun X, Jin T, Chen C, Cui X, Ma Q, Yu B. RBPro-RF: use Chou’s 5-steps rule to predict RNA-binding proteins via random forest with elastic net. Chemometrics Intelligent Lab Syst. 2020;197:103919.
- 70. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267–88.
- 71. Bingham E, Mannila H. Random projection in dimensionality reduction: applications to image and text data. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001. 245–50.
- 72. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.
- 73. Charoenkwan P, Schaduangrat N, Shoombuatong W. StackTTCA: a stacking ensemble learning-based framework for accurate and high-throughput identification of tumor T cell antigens. BMC Bioinform. 2023;24(1):301. pmid:37507654
- 74. Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, et al. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform. 2021;22(3):bbaa125. pmid:32599617
- 75. Zhao Y, Zhang T, Ning Y, Wang D, Li F, Fan Y, et al. Identification and molecular mechanism of novel tyrosinase inhibitory peptides from the hydrolysate of “Fengdan” peony (Paeonia ostii) seed meal proteins: peptidomics and in silico analysis. LWT. 2023;180:114695.
- 76. Kubglomsong S, Theerakulkait C, Reed RL, Yang L, Maier CS, Stevens JF. Isolation and identification of tyrosinase-inhibitory and copper-chelating peptides from hydrolyzed rice-bran-derived albumin. J Agric Food Chem. 2018;66(31):8346–54. pmid:30016586
- 77. Chen H, Yao Y, Xie T, Guo H, Chen S, Zhang Y, et al. Identification of tyrosinase inhibitory peptides from sea cucumber (Apostichopus japonicus) collagen by in silico methods and study of their molecular mechanism. Curr Protein Pept Sci. 2023;24(9):758–66. pmid:37350006
- 78. Joompang A, Anwised P, Klaynongsruang S, Taemaitree L, Wanthong A, Choowongkomon K, et al. Rational design of an N-terminal cysteine-containing tetrapeptide that inhibits tyrosinase and evaluation of its mechanism of action. Curr Res Food Sci. 2023;7:100598. pmid:37790858
- 79. Putri SA, Maharani R, Maksum IP, Siahaan TJ. Peptide design for enhanced anti-melanogenesis: optimizing molecular weight, polarity, and cyclization. Drug Des Devel Ther. 2025:645–70.
- 80. Yu Z, Lv H, Zhou M, Fu P, Zhao W. Identification and molecular docking of tyrosinase inhibitory peptides from allophycocyanin in Spirulina platensis. J Sci Food Agric. 2024;104(6):3648–53. pmid:38224494
- 81. Song EC, Park C, Shin Y, Kim WK, Kim SB, Cho S. Neurog1-derived peptides RMNE1 and DualPep-Shine penetrate the skin and inhibit melanin synthesis by regulating MITF transcription. Int J Mol Sci. 2023;24(7):6158. pmid:37047130
- 82. Li J, Yin S, Wei Z, Xiao Z, Kang Z, Wu Y, et al. Newly identified peptide Nigrocin-OA27 inhibits UVB induced melanin production via the MITF/TYR pathway. Peptides. 2024;177:171215. pmid:38608837
- 83. Olsen TH, Yesiltas B, Marin FI, Pertseva M, García-Moreno PJ, Gregersen S, et al. AnOxPePred: using deep learning for the prediction of antioxidative properties of peptides. Sci Rep. 2020;10(1):21471. pmid:33293615