
Classification of virulence factors based on dual-channel neural networks with pre-trained language models

  • Guanghui Li ,

    Contributed equally to this work with: Guanghui Li, Peiyang Song

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Writing – original draft

    ghli16@hnu.edu.cn (GL); alcs417@sdnu.edu.cn (CL)

    Affiliation School of Information and Software Engineering, East China Jiaotong University, Nanchang, China

  • Peiyang Song ,

    Contributed equally to this work with: Guanghui Li, Peiyang Song

    Roles Conceptualization, Data curation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliation School of Information and Software Engineering, East China Jiaotong University, Nanchang, China

  • Jiawei Luo,

    Roles Writing – review & editing

    Affiliation College of Computer Science and Electronic Engineering, Hunan University, Changsha, China

  • Cheng Liang

    Roles Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing

    ghli16@hnu.edu.cn (GL); alcs417@sdnu.edu.cn (CL)

    Affiliation School of Information Science and Engineering, Shandong Normal University, Jinan, China

Abstract

Virulence factors (VFs) are crucial molecules that enable pathogens to cause infection and disease in a host. They allow pathogens to evade the host’s immune defenses and facilitate the progression of infection through various mechanisms. With the increasing prevalence of antibiotic-resistant strains and the emergence of new and re-emerging infectious agents, the classification of VFs has become more critical. This study presents PLM-GNN, an innovative dual-channel model designed for precise classification of VFs, focusing on the seven most numerous types. It integrates a structure channel, which employs a geometric graph neural network to capture the three-dimensional structure features of VFs, and a sequence channel that utilizes a pre-trained language model with Convolutional Neural Network (CNN) and Transformer architectures to extract local and global features from VF sequences, respectively. On the independent test set, the method achieved an accuracy of 86.47%, an F1 score of 86.20% and an Area Under the Receiver Operating Characteristic Curve (AUC) of 97.20%, validating its effectiveness. In conclusion, PLM-GNN can precisely classify the seven major VFs, offering a novel approach for studying their functions.

Introduction

Infectious diseases continue to represent a growing and persistent threat to human health, placing considerable strain on global public health systems [1]. Bacterial infections are primarily mediated by VFs, which facilitate pathogens in initiating and maintaining infections. The horizontal gene transfer capacity of VFs across distinct bacterial strains or species significantly elevates the likelihood of novel pathotype emergence, rendering these occurrences nearly unavoidable [2]. Therefore, identifying the VFs of pathogenic bacteria is essential for understanding the mechanisms of pathogenesis and for discovering potential targets for novel drug development and vaccine design, thereby enhancing our capacity to prevent and treat infectious diseases [3].

Moreover, the diversity and complexity of VFs necessitate a multifaceted classification system to effectively categorize and understand these elements. VFs encompass diverse categories, comprising secreted proteins (e.g., protein toxins and enzymes) as well as cell-surface components such as capsular polysaccharides, lipopolysaccharides, and outer membrane proteins [4]. A systematic categorization of VFs not only enhances the comprehension of molecular mechanisms driving bacterial pathogenesis but also supports the discovery of innovative therapeutic interventions and potential vaccine development [5].

The establishment of comprehensive VF databases, such as VFDB [6], Victors [7], and MvirDB [8], has made the identification of VFs increasingly feasible. A predominant strategy currently employed for VF prediction relies on sequence similarity to known VFs. For example, Li et al. created VRprofile [9], a tool capable of identifying homologs of conserved gene clusters through HMMER [10] or Basic Local Alignment Search Tool for Proteins (BLASTp) searches, enabling the prediction of virulence factors and antibiotic resistance genes within the genomes of pathogenic bacteria. Furthermore, Liu et al. [2] developed the VFanalyzer platform, an online tool that employs a hybrid methodology integrating BLAST [11] and hidden Markov models (HMMs) to perform iterative sequence similarity analyses within the VFDB. This system is designed to improve the precision of VF identification in bacterial genomes. Despite the widespread adoption of sequence alignment-based approaches for VF prediction, these methods exhibit significant limitations. Specifically, their reliance on sequence similarity limits detection to VFs that are conserved relative to known examples, while novel VFs with divergent evolutionary origins often remain undetected.

To overcome these challenges, recent studies have adopted machine learning and deep learning frameworks. Garg et al. [12] introduced VirulentPred, a prediction method designed to identify bacterial virulence proteins using a two-layer cascade Support Vector Machine (SVM). In this approach, the initial layer of classifiers is trained on a variety of sequence features and Position-Specific Scoring Matrices (PSSM). The outputs from these classifiers are then fed into a second-layer SVM for additional training and final prediction. Gupta et al. [13] developed MP3, a standalone tool and web server that employs an integrated approach combining SVM and Hidden Markov Models to perform rapid, sensitive, and accurate predictions of pathogenic proteins. Rentzsch et al. [14] proposed a novel strategy for selecting negative samples to construct the dataset, combining sequence similarity with machine learning models, which yielded promising results. DeepVF [15] thoroughly explored a diverse set of heterogeneous features using well-established machine learning algorithms. They utilized four traditional machine learning algorithms and three deep learning methods to train 62 baseline models. These models’ strengths were then effectively integrated, and a stacking strategy was applied to identify the VFs. Singh et al. [16] proposed VF-Pred, which integrates a novel sequence alignment-based feature (SeqAlignment) to significantly enhance the accuracy of machine learning-driven predictions. The model leveraged 982 features derived from rigorous feature engineering and implemented a downstream ensemble strategy, integrating outputs from 25 distinct models to enhance VF identification. In the classification of VFs, Zheng et al. [17] constructed a comprehensive dataset containing 160,495 virulence protein sequences categorized into 3,446 classes. Based on this dataset, a neural network architecture was developed, comprising two convolutional layers alternating with two max-pooling layers and a subsequent multilayer perceptron (MLP) for VF classification.

With advancements in technology, several pre-trained language models [18–20] for proteins have emerged. These models were trained on large datasets using stacked Transformer blocks, ultimately generating embeddings for each amino acid. Previous studies have indicated that protein representations extracted from pre-trained language models achieve state-of-the-art performance across multiple tasks [21,22]. Sun et al. [23] developed DTVF, which utilized features generated by ProtT5 and integrated two channels—Convolutional Neural Networks and Long Short-Term Memory (LSTM). By incorporating an attention mechanism, this approach significantly enhanced the accuracy of VF recognition. However, the aforementioned model primarily emphasizes sequence information, neglecting the three-dimensional (3D) structure of the VFs.

The sequence-structure-function paradigm posits that the amino acid sequence dictates the protein’s spatial configuration, which in turn governs its function [24]. Consequently, the 3D structure plays a crucial role in classifying VFs. Recently, ESMFold [19], developed by Meta AI, has demonstrated the ability to generate 3D structures directly from protein sequences, achieving accuracy comparable to AlphaFold2 [25] but with greater speed and efficiency. Unlike AlphaFold2, which depends on multiple sequence alignment (MSA) data for its predictions, ESMFold can accurately predict 3D structures without requiring additional MSA information. This provides a significant advantage in scenarios where obtaining sufficient homologous sequences is challenging and opens new avenues for structure-based approaches to VF prediction. For example, GTAE-VF [26] leveraged 3D structures predicted by ESMFold to transform the VF recognition challenge into a graph-level prediction task. As a Graph Transformer Autoencoder, GTAE-VF integrated the strengths of the Graph Convolutional Network (GCN [27]) and the Transformer [28] architecture, enabling full pairwise message passing. This allowed GTAE-VF to learn both local and global information adaptively, thereby capturing long-range dependencies and latent patterns more effectively. GTAE-VF achieved robust and reliable prediction accuracy, validating the utility of 3D structures for VF identification. This approach demonstrates that integrating advanced structural predictions with graph-based machine learning models can significantly enhance our understanding and predictive power in microbial pathogenesis.

Previous approaches exhibit notable limitations: 1) The majority of existing models are confined to binary classification of VFs (e.g., classifying them as “VFs” or “non-VFs”), with limited exploration of multi-class classification for VFs. 2) Moreover, current mainstream models depend exclusively on either sequence-based or structure-based methods, neglecting the integration of both sequence and structure information into a cohesive framework. 3) Currently, models for identifying VFs that utilize structure information rely solely on the distances between alpha carbon atoms to construct contact graphs, failing to fully exploit the geometric information available [29].

Therefore, in this study, we propose the PLM-GNN, a dual-channel framework for classifying VFs by integrating pre-trained language models and geometric graph neural networks. In the sequence channel, ESM-2 is employed to extract 1280-dimensional feature representations for each amino acid, capturing complex sequence patterns, deep semantic information, and evolutionary relationships. These features are then processed through a two-layer one-dimensional convolutional neural network [30] to derive local sequence features, followed by a Transformer encoder to extract global sequence dependencies. In the structure channel, ESMFold is utilized to predict the three-dimensional structure information of VFs, while ProtT5 enhances node-level representations. These structure features are subsequently fed into a geometric graph neural network for further processing. Finally, the sequence and structure embeddings are fused and passed through an MLP [31] for classification. Extensive experimental results demonstrate that PLM-GNN effectively captures discriminative features of various VF categories and achieves state-of-the-art performance in multi-class classification tasks.
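To make the fusion step concrete, the sketch below shows how the two channel embeddings could be combined by direct addition and classified with an MLP in PyTorch. The hidden dimension, dropout rate, and layer layout are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical sketch: fuse sequence and structure embeddings by addition,
    then classify into the seven VF categories with an MLP."""
    def __init__(self, dim: int = 256, n_classes: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(dim, n_classes))

    def forward(self, e_seq: torch.Tensor, e_struct: torch.Tensor) -> torch.Tensor:
        # direct-addition fusion of the two channel embeddings, then MLP
        return self.mlp(e_seq + e_struct)

# usage with dummy 256-d embeddings for a batch of 4 proteins
logits = FusionHead()(torch.randn(4, 256), torch.randn(4, 256))
```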

Our key contributions are summarized as follows:

  • Existing models for VF identification and classification rely exclusively on either sequence or structure information, failing to leverage the complementary nature of both data types. Our approach addresses this limitation by introducing a dual-channel framework that integrates sequence and structure data.
  • We propose PLM-GNN, a novel dual-channel model for VF classification, which achieves superior performance by effectively fusing sequence and structure features.
  • Comprehensive experiments on multiple datasets demonstrate that PLM-GNN not only excels in VF classification but also generalizes well to other protein classification tasks, highlighting its versatility and robustness.

Methods

Dataset description

The VFDB core database comprises 14 categories of VFs, totaling 4,236 sequences. Among these, the seven least-represented categories collectively contain only 216 sequences (averaging just 30 per category). Given this limited data, it is likely that deep learning models would struggle to learn adequate features from such sparse categories. Therefore, we selected the seven most abundant VF categories, comprising a total of 4,020 sequences, to constitute our initial dataset.

To enhance the generalization capacity and computational efficiency of the dataset, CD-HIT [32] was employed to remove VFs sharing sequence homology exceeding 90%. Due to the current GPU memory capacity being limited to 24 GB, the ESMFold model can only predict three-dimensional structures of proteins with amino acid sequences no longer than 1,240 residues. Therefore, we further removed sequences longer than 1,240 amino acids, which amounted to 128 entries, accounting for 3.29% of the resulting dataset. The length distribution of the resulting dataset, as shown in Fig 1, indicates that long VFs constitute only a minimal proportion. Additionally, we analyzed the length distribution for each VF type, as illustrated in Fig 2. The results reveal that very long VFs (exceeding 1240 amino acid residues) account for a relatively small proportion of each class. Based on these findings, we reasoned that the data processing strategy would have minimal impact on the overall prediction performance. To ensure that the class distribution—particularly that of minority classes—in the training, validation, and test sets remained consistent with the original dataset, we applied stratified sampling. This method effectively prevents issues such as underrepresentation or complete absence of minority class samples in any subset, which might otherwise arise from random partitioning. For the data splitting ratios, we compared three different configurations: 7.5:1.5:1.5, 8:1:1, and 9:0.5:0.5. Based on the results shown in Fig 2, the model performed best under the 8:1:1 ratio, which was therefore selected for stratified sampling. This strategy ensures that each subset is representative and reliable, thereby supporting robust model training and enabling unbiased performance evaluation. The resulting data distribution is presented in Table 1.
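As an illustration of the stratified 8:1:1 partitioning described above, the following sketch uses scikit-learn's train_test_split with the stratify option; the variable names and the fixed seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

def stratified_811_split(ids, labels, seed=42):
    """Split sample ids into train/val/test (8:1:1) while preserving the
    class proportions of the original dataset (stratified sampling)."""
    train_ids, tmp_ids, train_y, tmp_y = train_test_split(
        ids, labels, test_size=0.2, stratify=labels, random_state=seed)
    # split the remaining 20% in half -> 10% validation, 10% test
    val_ids, test_ids, _, _ = train_test_split(
        tmp_ids, tmp_y, test_size=0.5, stratify=tmp_y, random_state=seed)
    return train_ids, val_ids, test_ids
```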

Table 1. Statistics on the number of VF classes and the distribution of the dataset.

https://doi.org/10.1371/journal.pone.0340194.t001

Fig 1. The distribution of sequence lengths, including frequency distribution and cumulative percentage distribution.

https://doi.org/10.1371/journal.pone.0340194.g001

An overview of PLM-GNN

The overall workflow of PLM-GNN is illustrated in Fig 3, which consists of four main parts: (1) obtaining the initial sequence feature representation of VFs; (2) representing the VF structures predicted by ESMFold as graph data structures; (3) learning the features of the two modalities through the sequence and structure channels, respectively; and (4) predicting the category of VFs through an MLP.

Fig 3. Framework of PLM-GNN model.

A) The PLM-GNN model adopts a dual-channel architecture to extract sequence and structure features; B) In the sequence channel, a CNN extracts local features, while a Transformer captures global features. The final sequence embedding is obtained through average pooling. C) In the structure channel, node and edge features are first aligned in dimension via linear transformations. They are then refined through a TransformerConv layer and a FeedForward network. Finally, attention pooling is applied to generate the structure embedding.

https://doi.org/10.1371/journal.pone.0340194.g003

Extracting initial protein sequence features

The ESM-2 model is primarily built on the Transformer architecture. The version we use contains 650 million parameters and stacks 33 layers of Transformer encoders during training. Researchers have proposed that such large-scale pre-trained models, owing to their vast number of parameters, exhibit emergent behavior when trained on massive collections of amino acid sequences. This behavior enables the model to capture potential structural, functional, and evolutionary information within amino acid sequences [33].

For each input protein sequence, the ESM-2 model generates a 1280-dimensional feature representation for each amino acid. Therefore, the feature representation for each virulence protein sequence is $X^{(0)} \in \mathbb{R}^{L \times 1280}$, where $X^{(0)}$ denotes the embedding of the sequence generated by ESM-2, $\mathbb{R}$ denotes the set of real numbers, and $L$ denotes the length of the VF sequence. These high-dimensional embeddings not only encapsulate the intrinsic characteristics of the sequences but also embed biological significance automatically extracted through deep learning, thereby providing a robust representational foundation for downstream tasks.
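As an illustration of this step, the sketch below extracts per-residue 1280-dimensional features with the 650M-parameter ESM-2 model via the fair-esm package; the example sequence is hypothetical, and the exact extraction code used in this study may differ.

```python
import torch
import esm  # fair-esm package

# Load ESM-2 (650M parameters, 33 Transformer encoder layers)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seqs = [("vf_example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # hypothetical VF sequence
_, _, tokens = batch_converter(seqs)
with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Drop the BOS/EOS tokens so each amino acid keeps one 1280-d vector: shape (L, 1280)
x0 = out["representations"][33][0, 1:len(seqs[0][1]) + 1]
```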

Modelling the graph data structure of VFs

First, we utilized ESMFold to predict the three-dimensional structure of the VFs. Next, we constructed the contact map by connecting pairs of alpha-carbon (Cα) atoms that are within 15 Å of each other. We then used additional atomic information to calculate the geometric features of the protein’s nodes and edges: the node features include dihedral angles, bond angles, radial basis function embeddings, and node directions, totaling 184 dimensions, while the edge features include radial basis function embeddings, edge directions, edge orientations, and edge positional encodings, totaling 450 dimensions. To enhance the learning of node representations, we concatenate the 1024-dimensional features from ProtT5 to the node features.

Therefore, we define the node features as $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ represents the $i$-th Cα node. We then define the edge features between $v_i$ and $v_j$ as $e_{ij}$, where $v_i$ and $v_j$ represent the $i$-th and $j$-th Cα nodes, respectively. Thus, we obtain a graph $G = (V, E)$, where $V$ is the set of vertices, $n$ is the number of Cα nodes of the protein sequence, and $E$ is the set of edges. A brief description of the node and edge features is provided in Table 2, and the detailed calculation methods can be found in the S2 File titled “Detailed Derivation of Geometric Graph Features” in the Supplementary information.
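A minimal sketch of the contact-map construction described above is given below; it assumes the Cα coordinates have already been parsed from the ESMFold-predicted structure and applies the 15 Å distance cutoff.

```python
import numpy as np

def contact_edges(ca_coords: np.ndarray, cutoff: float = 15.0) -> np.ndarray:
    """Connect every pair of Calpha atoms closer than `cutoff` angstroms.
    `ca_coords` has shape (n_residues, 3); returns an edge index of shape (2, n_edges)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                           # pairwise Calpha distances
    mask = (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)   # drop self-loops
    src, dst = np.nonzero(mask)
    return np.stack([src, dst])
```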

Table 2. The node and edge features for graph representation.

https://doi.org/10.1371/journal.pone.0340194.t002

Extracting local and global sequence features

Traditional sequence processing models, such as LSTM [34] and Gated Recurrent Unit (GRU) [35], mitigate the vanishing gradient problem through gating mechanisms, thereby better capturing long-term dependencies. However, these models still face challenges with very long sequences: as sequence length increases, the effectiveness of the gating mechanisms diminishes, leading to reduced information transmission efficiency and ultimately impacting model performance. For example, in Recurrent Neural Network (RNN) [36], hidden states can only retain information from a limited number of time steps, making it difficult to effectively capture long-term dependencies. In contrast, we choose to use CNNs and Transformers for extracting protein sequence features. CNNs expand their receptive fields through multiple convolutional layers to efficiently capture local features [37,38]; Transformers leverage self-attention mechanisms, allowing each position to directly connect with all other positions without being constrained by sequence length, excelling particularly in capturing long-distance dependencies. Therefore, we will comprehensively extract VF sequence features from both local and global perspectives.

For the original feature sequence $X^{(0)}$, which denotes the embedding of each sequence generated by ESM-2, we first use a two-layer one-dimensional convolution to extract local features. The definition of extracting local features using one-dimensional convolution is as follows:

$$X^{(1)} = \mathrm{Conv1D}\left(X^{(0)}; W_1\right) \tag{1}$$
$$X^{(2)} = \mathrm{Conv1D}\left(X^{(1)}; W_2\right) \tag{2}$$

where $W_1$ and $W_2$ represent the learnable parameters (the convolutional kernels) and Conv1D denotes a one-dimensional convolution. We then apply LayerNorm to the features $X^{(2)}$ obtained from the two-layer CNN, resulting in $X^{(3)}$, which is fed into the multi-head attention mechanism to capture global patterns of the VF sequence. Here, LayerNorm indicates Layer Normalization. The mathematical formulation of multi-head attention is defined as:

$$Q_i = X^{(3)} W_i^{Q} \tag{3}$$
$$K_i = X^{(3)} W_i^{K} \tag{4}$$
$$V_i = X^{(3)} W_i^{V} \tag{5}$$
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{6}$$
$$X^{(4)} = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W_3 \tag{7}$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W_3$ represent the learnable parameters and $d_k$ represents the dimension of $K_i$. The subsequent computational step integrates nonlinear transformations via activation functions, facilitating the model’s ability to capture intricate feature representations. The mathematical formulation of the FeedForward layer is defined as:

$$X^{(5)} = \mathrm{Dropout}\!\left(\mathrm{ReLU}\!\left(X^{(4)} W_4 + b_4\right)\right) \tag{8}$$
$$X^{(6)} = X^{(5)} W_5 + b_5 \tag{9}$$
$$E_S = \mathrm{Avgpooling}\!\left(X^{(6)}\right) \tag{10}$$

where $W_4$, $b_4$, $W_5$, and $b_5$ represent the learnable parameters; Dropout denotes a technique for randomly deactivating neurons during training to prevent overfitting; ReLU refers to the rectified linear unit activation function that introduces nonlinearity; and Avgpooling signifies the average pooling operation. $E_S$ denotes the final output embedding from the sequence channel.
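A compact PyTorch sketch of this sequence channel is shown below; the kernel sizes and hidden width are assumptions, while the overall layout (two Conv1D layers, LayerNorm, a Transformer encoder layer with four attention heads, and average pooling) follows the description above.

```python
import torch
import torch.nn as nn

class SequenceChannel(nn.Module):
    """Sketch of the sequence channel: local features via two Conv1D layers,
    global dependencies via a Transformer encoder layer, then average pooling."""
    def __init__(self, in_dim: int = 1280, hid: int = 256, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hid, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=3, padding=1), nn.ReLU())
        self.norm = nn.LayerNorm(hid)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=hid, nhead=heads, dropout=0.3, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, 1280) ESM-2 embeddings
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local features (Eqs 1-2)
        x = self.encoder(self.norm(x))                    # global attention (Eqs 3-7)
        return x.mean(dim=1)                              # average pooling -> E_S
```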

Extracting structure features

Geometric graph neural networks demonstrate high efficacy in modeling the inherent structural characteristics of proteins, such as bond lengths, angles, and dihedral angles [39,40], which are essential for understanding the spatial relationships within a protein’s 3D structure [41]. By incorporating such geometric features, geometric graph neural networks provide a more detailed representation of a protein’s topological structure and its interactions with other molecules [42]. The integration of geometric graph neural networks with ESMFold-derived structures facilitates a thorough exploration of protein topology, thereby improving our ability to predict and analyze protein function with greater precision and depth [43]. Remarkably, geometric graph neural networks have exhibited outstanding efficacy in diverse computational tasks [44–46].

Therefore, we employed geometric graph neural networks to extract structure features. The node and edge features obtained from the protein’s 3D structure, together with the node features derived from ProtT5, form the graph $G$. We first input the edge and node features into an encoder module to obtain the node and edge embeddings. The formula for this process is as follows:

$$h_i^{(0)} = W_7\,\mathrm{ReLU}\!\left(W_6 v_i + b_6\right) + b_7 \tag{11}$$
$$e_{ij}^{(0)} = W_9\,\mathrm{ReLU}\!\left(W_8 e_{ij} + b_8\right) + b_9 \tag{12}$$

where $W_6$, $b_6$, $W_7$, $b_7$, $W_8$, $b_8$, $W_9$, and $b_9$ represent the learnable parameters, and $h_i^{(0)}$ and $e_{ij}^{(0)}$ represent the encoded node and edge features, both with a dimensionality of 256. We then use the Graph Transformer [47] to extract structure information from the obtained graph, with the specific formulas as follows:

$$\alpha_{ij} = \underset{j \in N(i)}{\mathrm{softmax}}\!\left(\frac{\left(W_{10} h_i^{(0)}\right)^{\top}\left(W_{11} h_j^{(0)} + W_{12} e_{ij}^{(0)}\right)}{\sqrt{d}}\right) \tag{13}$$
$$h_i^{(1)} = W_{13} h_i^{(0)} + \sum_{j \in N(i)} \alpha_{ij}\left(W_{14} h_j^{(0)} + W_{12} e_{ij}^{(0)}\right) \tag{14}$$

where $W_{10}$, $W_{11}$, $W_{12}$, $W_{13}$, and $W_{14}$ represent the learnable parameters, $\alpha_{ij}$ represents the attention coefficients, and $N(i)$ represents the set of all nodes that are connected to node $i$ by an edge. The extracted features are propagated through residual connections and a FeedForward neural network, followed by feature refinement via the context module. The corresponding mathematical expressions are defined below:

$$u_i = W_{16}\,\mathrm{ReLU}\!\left(W_{15} h_i^{(1)} + b_{15}\right) + b_{16} \tag{15}$$
$$h_i^{(2)} = h_i^{(1)} + u_i \tag{16}$$
$$g = \mathrm{sigmoid}\!\left(W_{18}\,\mathrm{ReLU}\!\left(W_{17}\,\frac{1}{n}\sum_{i=1}^{n} h_i^{(2)} + b_{17}\right) + b_{18}\right) \tag{17}$$
$$h_i^{(3)} = g \odot h_i^{(2)} \tag{18}$$

where $W_{15}$, $b_{15}$, $W_{16}$, $b_{16}$, $W_{17}$, $b_{17}$, $W_{18}$, and $b_{18}$ represent the learnable parameters and sigmoid denotes the activation function that maps outputs to [0,1]. Finally, we obtain the structure embedding $E_G$ through attention pooling, as given by the following formula:

$$E_G = \sum_{i=1}^{n} \mathrm{softmax}_i\!\left(w_a^{\top} h_i^{(3)}\right) h_i^{(3)} \tag{19}$$

where $w_a$ is a learnable attention vector.
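The structure channel can be sketched with PyTorch Geometric as follows, assuming a recent PyG release that provides TransformerConv and AttentionalAggregation; the layer sizes and the simplified residual/feed-forward blocks are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import TransformerConv
from torch_geometric.nn.aggr import AttentionalAggregation

class StructureChannel(nn.Module):
    """Sketch of the structure channel: linear encoders align node (184 geometric
    + 1024 ProtT5 dims) and edge (450-d) features to 256-d, TransformerConv performs
    edge-aware message passing, and attention pooling yields the graph embedding E_G."""
    def __init__(self, node_dim: int = 184 + 1024, edge_dim: int = 450,
                 hid: int = 256, heads: int = 4):
        super().__init__()
        self.node_enc = nn.Linear(node_dim, hid)
        self.edge_enc = nn.Linear(edge_dim, hid)
        self.conv = TransformerConv(hid, hid // heads, heads=heads, edge_dim=hid)
        self.ffn = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.pool = AttentionalAggregation(gate_nn=nn.Linear(hid, 1))

    def forward(self, x, edge_index, edge_attr, batch):
        h, e = self.node_enc(x), self.edge_enc(edge_attr)   # encoders (Eqs 11-12)
        h = h + self.conv(h, edge_index, e)                 # message passing (Eqs 13-14)
        h = h + self.ffn(h)                                 # feed-forward refinement
        return self.pool(h, batch)                          # attention pooling -> E_G
```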

Classifying VFs and weighted cross-entropy loss

We compared three fusion strategies for integrating the sequence features and structure features learned through the dual-channel feature extraction module. The first strategy was concatenation, which combines features from different sources along the feature dimension to form a longer fused feature vector. The second strategy was direct addition, where the features obtained from the two channels are directly summed for fusion. The third strategy employed an attention mechanism, which dynamically computes the weights of sequence features and geometric structural features, achieving adaptive feature combination through weighted fusion. As shown in S5 Fig in S3 File, the direct addition strategy yielded the best performance and was therefore ultimately adopted for multi-channel feature fusion. Specifically, the embeddings derived from the two parallel channels are summed, followed by an MLP for final classification of the VFs. The corresponding mathematical formulation is provided below:

$$\hat{y} = \mathrm{MLP}\!\left(E_S + E_G\right) \tag{20}$$

To optimize the classification performance, we carefully design the loss function to address potential challenges in the training process. In conventional multi-class classification tasks, cross-entropy loss is widely adopted as the standard loss function. However, our dataset exhibits substantial variation in sample distribution across categories, potentially causing model predictions to favor majority classes. To mitigate this imbalance, we implemented weighted cross-entropy loss [48], which assigns category-specific weights to enhance the model’s focus on underrepresented samples. The mathematical formulation of the loss function is provided below:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w_c\,\mathbb{1}\!\left(y_i = c\right)\log \hat{p}_{i,c} \tag{21}$$

where $N$ represents the total number of samples in the dataset, $C$ denotes the total number of classes, and $w_c$ is the weight assigned to class $c$, which helps adjust the importance of each class based on its frequency. The actual class label of the $i$-th sample is denoted by $y_i$, whereas $\hat{p}_{i,c}$ represents the model’s estimated probability that the $i$-th sample is classified under class $c$. Lastly, $\mathbb{1}(y_i = c)$ is an indicator function that takes the value 1 when the true label $y_i$ matches the class $c$, and 0 otherwise. For each class $c$, the weight is calculated as follows:

$$w_c' = \frac{1}{\mathrm{count}_c + \epsilon} \tag{22}$$
$$w_c = \frac{C\, w_c'}{\sum_{c'=1}^{C} w_{c'}'} \tag{23}$$

where $\mathrm{count}_c$ is the number of occurrences of class $c$ and $\epsilon$ is a very small value ($10^{-6}$) used to prevent division by zero.
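A small sketch of the class-weight computation and the weighted cross-entropy loss is given below; the normalization of the weights is an assumption consistent with Eqs (22)–(23).

```python
import numpy as np
import torch
import torch.nn as nn

def class_weights(labels: np.ndarray, n_classes: int, eps: float = 1e-6) -> torch.Tensor:
    """Inverse-frequency class weights, rescaled so that they sum to n_classes."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    w = 1.0 / (counts + eps)        # Eq (22): rarer classes receive larger weights
    w = w / w.sum() * n_classes     # Eq (23): normalization (assumed form)
    return torch.tensor(w, dtype=torch.float32)

# toy labels for the 7 VF classes; in practice these are the training-set labels
train_labels = np.array([0, 0, 0, 1, 1, 2, 3, 4, 5, 6])
criterion = nn.CrossEntropyLoss(weight=class_weights(train_labels, n_classes=7))
```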

Results

Evaluation metrics

During the validation and independent testing stages, we employed a diverse set of evaluation metrics to comprehensively analyze model performance, comprising Accuracy (ACC), confusion matrix analysis, and Receiver Operating Characteristic (ROC) curve evaluation. To address dataset imbalance, weighted variants of Precision, Recall, and F1-score were adopted to ensure a balanced assessment of model efficacy. The mathematical definitions for ACC, Precision, Recall, and F1-score used in this study are provided below:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{24}$$
$$\mathrm{Precision} = \frac{\sum_{i=1}^{C} \mathrm{Support}_i \cdot \mathrm{Precision}_i}{\sum_{i=1}^{C} \mathrm{Support}_i}, \quad \mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{25}$$
$$\mathrm{Recall} = \frac{\sum_{i=1}^{C} \mathrm{Support}_i \cdot \mathrm{Recall}_i}{\sum_{i=1}^{C} \mathrm{Support}_i}, \quad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{26}$$
$$\mathrm{F1} = \frac{\sum_{i=1}^{C} \mathrm{Support}_i \cdot \mathrm{F1}_i}{\sum_{i=1}^{C} \mathrm{Support}_i}, \quad \mathrm{F1}_i = \frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i} \tag{27}$$

where $i$ denotes the $i$-th class, $C$ is the number of classes, $\mathrm{Support}_i$ is the number of samples in class $i$, and $TP$, $TN$, $FP$, and $FN$ represent the counts of true positives, true negatives, false positives, and false negatives, respectively. All metrics are calculated using the sklearn [49] package.
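The weighted metrics can be reproduced with scikit-learn as sketched below; the one-vs-rest averaging for the multi-class AUC is an assumption, since the exact averaging scheme is not specified above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate(y_true, y_pred, y_prob):
    """Compute ACC plus support-weighted Precision, Recall, F1 and multi-class AUC.
    `y_prob` holds one row of per-class probabilities per sample."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted")
    return {"ACC": acc, "Precision": p, "Recall": r, "F1": f1, "AUC": auc}
```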

Parameter settings

The PLM-GNN model was trained and tested on a single GeForce RTX 4090 (24 GB) GPU. During training, the best parameters were selected based on the minimum validation loss. The hyperparameters of our model were configured as follows: a learning rate of 0.00005, a batch size of 32, and a fixed random seed of 42. In terms of model architecture, we utilized a single Transformer layer with a 4-head multi-head attention mechanism, a dropout rate of 0.3, and a single-layer MLP. During training, we adopted the Cosine Annealing strategy as the learning rate scheduler [50] to dynamically adjust the learning rate. This approach facilitates faster convergence in the initial training phase while maintaining stable optimization in later stages, thereby reducing the risk of the model converging to suboptimal local minima and improving its generalization performance. To further enhance model generalization, an early stopping criterion with a patience parameter of 5 was integrated into the training protocol to mitigate overfitting risks.
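The training configuration described above can be sketched as follows; the optimizer choice (Adam) and the epoch budget are assumptions, and the placeholder model stands in for PLM-GNN.

```python
import torch

model = torch.nn.Linear(256, 7)  # placeholder standing in for PLM-GNN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

best_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(200):
    # ... one epoch of training on the stratified training split goes here ...
    val_loss = 0.0  # placeholder: compute the weighted cross-entropy on the validation set
    scheduler.step()
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_plm_gnn.pt")  # keep the best parameters
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping with patience 5
```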

The performance of PLM-GNN on independent test sets

To comprehensively assess the VF classification capability of our model, a dual evaluation framework incorporating both validation and independent test sets was utilized. During training, the most effective parameter configurations were iteratively retained based on validation set performance. This approach guaranteed robust generalizability, enabling the model to attain near-optimal predictive accuracy on previously unencountered data instances.

The experimental results of our proposed multi-classification method for VFs, termed PLM-GNN, are shown in Fig 4A. The results demonstrate that the model achieved an accuracy of 88.56% on the validation set and also performed well on the independent test set, with an accuracy of 86.47%. To provide a more intuitive reflection of the model’s overall performance, we plotted a confusion matrix as Fig 4B, which displays the number of correctly and incorrectly predicted samples for each category. From the confusion matrix, it can be observed that the model correctly predicts a large number of samples for the top six VF categories. However, for the Biofilm category, due to its limited sample size, the model’s predictive performance for this category is relatively lower, highlighting the impact of data imbalance on model performance.

Fig 4. The performance of PLM-GNN from three perspectives.

(A) Comparative evaluation of PLM-GNN’s classification metrics across validation and test datasets. (B) The confusion matrix of PLM-GNN on the test set. (C) The ROC curves for each type of VF by PLM-GNN on the test set.

https://doi.org/10.1371/journal.pone.0340194.g004

To systematically validate the model’s discriminative capacity across diverse VF categories, ROC curves were generated (Fig 4C). All categories achieved AUC values exceeding 0.9, demonstrating robust inter-class differentiation capability in multi-class scenarios. This quantitative evidence substantiates the model’s superior classification performance and operational reliability. Notably, despite the small sample size of the Biofilm category, its corresponding ROC curve still exhibited a high degree of discriminative capability.

Comparison with other methods in terms of VF classification

To comprehensively evaluate the performance of PLM-GNN, we selected several sequence processing models and graph neural network models for comparison. The sequence processing models included CNN [30], GRU [35], LSTM [34], Bidirectional Long Short-Term Memory (BiLSTM) [51], Transformer [52,53], CNN-LSTM, CNN-BiLSTM, and CNN-GRU. The graph neural network models included GCN [27], Graph Attention Network (GAT) [54], and Graph Transformer [47]. For the sequence processing models, we utilized features generated by ESM-2. These features were input into the sequence models to extract higher-level representations, which were then passed through an MLP for classification. For the graph neural network models, we employed features generated by ProtT5, which were concatenated with geometric features derived from the PDB structures to form a combined feature set. These features were processed by the graph neural network models to extract relevant information, followed by classification using an MLP. Subsequently, we performed hyperparameter tuning on these baseline models. The detailed tuning ranges and the final selected parameters are summarized in S1 Table in S3 File, while the corresponding performance on the test set is reported in Table 3. To comprehensively evaluate the performance of the proposed model, we selected the two best-performing baseline models from the test set and conducted a statistical significance analysis for PLM-GNN. The independent test set was partitioned into five equal subsets, with four subsets used for testing in each iteration, producing five distinct evaluation datasets. Independent samples t-tests were performed on these datasets. As illustrated in Fig 5, PLM-GNN exhibited consistent and statistically significant advantages over both GCN and GAT across multiple performance metrics (all p-values < 0.05), confirming its statistical superiority over these baseline models.

Table 3. Comparison results of different methods on the independent test.

https://doi.org/10.1371/journal.pone.0340194.t003

Fig 5. Statistical significance comparison of PLM-GNN with GCN and GAT in VF classification.

https://doi.org/10.1371/journal.pone.0340194.g005

To further evaluate the models’ ability to distinguish between classes, we analyze the clustering patterns in the t-SNE visualizations. As depicted in Fig 6, models such as GAT, CNN-LSTM, CNN-BiLSTM, CNN-GRU, LSTM and BiLSTM exhibit poorly defined class boundaries, suggesting limited capability in accurately differentiating classes, which consequently reduces classification accuracy. Similarly, the t-SNE visualizations of Graph Transformer, GCN, GRU, CNN, LSTM and Transformer reveal overlapping clusters of samples from multiple classes, indicating challenges in correctly assigning samples to their respective classes. In contrast, the t-SNE visualization of PLM-GNN shows well-separated class clusters, underscoring its enhanced ability to effectively integrate features from both sequential and structure channels. This result highlights the clear advantage of PLM-GNN in classification tasks.

Fig 6. The t-SNE clustering results of embeddings generated by PLM-GNN and other methods.

https://doi.org/10.1371/journal.pone.0340194.g006

Selection of protein pretrained language models

Among the currently popular protein pretrained language models are ESM-2, ProtT5, and ProtBert [20]. In this study, we conducted a feature selection experiment based on a VF multi-class classification model to determine the most effective representation combination for this task. Specifically, each of the three pretrained models was independently applied to the model’s two input channels: the sequence channel and the structural channel. The evaluation results are presented in Table 4, where the first part of each combination refers to the model used for the sequence channel, and the second part indicates the node features used in the structural channel. These results on the independent test set demonstrate that employing ESM-2 for the sequence channel and ProtT5 for the structural channel achieves the best performance. Accordingly, this combination of pretrained language models was selected for the final model configuration.

Table 4. Combinations of pretrained language models and the performance.

https://doi.org/10.1371/journal.pone.0340194.t004

Ablation experiments

To validate the effectiveness of the dual-channel architecture in our model, we performed ablation experiments to assess four model variants: (1) PLM-GNN -w/o Transformer, which substitutes the Transformer module with an MLP in the sequence channel while maintaining the original architecture of the structure channel; (2) PLM-GNN -w/o CNN, which replaces the CNN module with an MLP in the sequence channel, preserving the structure channel; (3) PLM-GNN -w/o Seq, which removes the entire sequence channel and exclusively utilizes the structure channel; and (4) PLM-GNN -w/o GNN, which eliminates the structure channel and retains only the sequence channel for feature learning. The experimental results presented in Table 5 demonstrate that the full PLM-GNN model consistently outperforms these ablated variants. Notably, removing the Transformer module leads to a marked drop in performance, highlighting its critical role in capturing the global contextual features of VFs. Similarly, the exclusion of the CNN module results in a significant performance decline, underscoring its importance in extracting local sequence-level patterns. Furthermore, the full model surpasses the PLM-GNN-w/o Seq variant, emphasizing the essential contribution of protein sequence information to the identification of complex internal patterns and key discriminative features. In addition, the full model also outperforms the PLM-GNN-w/o GNN variant, which highlights the significance of structural information in modeling the three-dimensional conformation of proteins and their internal interactions. These findings collectively confirm that each module plays an indispensable role in achieving optimal performance.

Interpretability of the learned representations via t-SNE visualization

To comprehensively investigate the learning mechanisms and feature extraction capabilities of the PLM-GNN model, we first introduced its dual-channel architecture. The PLM-GNN model processes features from two distinct sources through independent channels: one channel focuses on sequence features extracted by ESM-2, while the other handles structure features generated by a geometric graph neural network. This design enables the model to capture critical information from diverse perspectives, thereby facilitating a more holistic understanding of VF characteristics.

Initially, we employed t-SNE [55], a widely used dimensionality reduction technique, to project the sequence features obtained from ESM-2 into a two-dimensional space for visualization. As illustrated in Fig 7A, the original features in this reduced-dimensional space exhibit overlapping distributions, with sample points from different categories intermixed and indistinguishable. This suggests that the original features alone are inadequate for clearly differentiating between various VF categories. In contrast, when the learned representations from the trained PLM-GNN model were projected into a two-dimensional space using t-SNE (Fig 7D), the separation between categories improved significantly, with distinct clustering patterns emerging. This outcome demonstrates that the PLM-GNN model effectively learns discriminative features suitable for VF prediction, thereby enhancing classification performance.

Further analysis of the individual performances of the two channels (Fig 7B and 7C) reveals that each channel independently achieves clear classification of VFs after processing its respective features. This not only highlights the efficacy of the two channels but also underscores their complementary roles in feature extraction. Collectively, these experiments demonstrate the effectiveness of the PLM-GNN model in extracting meaningful information from complex raw features and learning advanced representations that improve classification performance, while also validating the dual-channel design.

To further validate the interpretability of the PLM-GNN model, we conducted a statistical experiment based on attention scores extracted from the sequence channel. We selected the T4SS category within the dataset for analysis. Initially, we extracted the top five amino acids with the highest attention scores in each sequence. Subsequently, we conducted a statistical analysis of the overall distribution of these high-frequency amino acids across the entire T4SS category. The results showed that the five most frequently occurring amino acids were phenylalanine (F), threonine (T), proline (P), serine (S), and glycine (G). These amino acids play critical roles in the biological functions of the T4SS system. Among them, F is widely present in the hydrophobic motifs at the C-terminus of effector proteins, mediating binding with the LvgA protein and subsequent transport processes [56]; T and S serve as important phosphorylation sites involved in regulating T4SS assembly; while P and G, as structure-stabilizing amino acids, play key roles in the transmembrane channel structure of T4SS [57]. These findings consistently demonstrate that the attention mechanism of the PLM-GNN model can effectively capture key amino acid features related to transport, regulation, and structure in the T4SS system, reflecting its strong biological interpretability.
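The attention-score analysis can be sketched as follows; `attention_scores` is assumed to hold one per-residue score vector per T4SS sequence, extracted from the sequence channel's attention weights.

```python
from collections import Counter

def top_attended_residues(sequences, attention_scores, k=5):
    """For each sequence, take the k residues with the highest attention scores,
    then count amino-acid frequencies over the whole category (e.g., T4SS)."""
    counter = Counter()
    for seq, scores in zip(sequences, attention_scores):
        top_idx = sorted(range(len(seq)), key=lambda j: scores[j], reverse=True)[:k]
        counter.update(seq[j] for j in top_idx)
    return counter.most_common()

# toy usage: one 5-residue sequence with per-residue attention scores, k = 3
print(top_attended_residues(["FTPSG"], [[0.9, 0.2, 0.8, 0.7, 0.1]], k=3))
```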

Analysis of potential causes for model prediction errors

To investigate the causes of model prediction errors, in the S1 File we computed both Pearson and Spearman correlation coefficients between the misprediction label vector of the PLM-GNN and the Predicted Local Distance Difference Test (pLDDT) [25] score vector. The results indicated no significant correlation between PLM-GNN classification errors and pLDDT scores.

Next, we examined the influence of species on prediction errors. As shown in S2 Fig in S3 File, no particular species exhibited notably prominent misprediction. We then further analyzed whether the length of virulence factor sequences is associated with prediction errors. Sequences were grouped at intervals of 200 in length, and the performance metrics for each group were calculated, as shown in S3 Fig in S3 File. The results revealed that the performance of sequences between 1000 and 1240 in length was not significantly lower than that of sequences in the 200–400 range, indicating that prediction errors are not related to sequence length.

Additionally, through t-SNE visualization of the fused features from the model (Fig 7D), we observed that samples from some different categories clustered together in the embedding space, indicating feature similarity. For instance, the features of the Immune modulation and Nutritional/Metabolic factor categories were closely distributed. It can therefore be inferred that the model’s prediction errors primarily stem from the proximity of certain categories in the feature space and their representational similarity, resulting in insufficient inter-category distinguishability.

Applied to effector delivery system classification

Effector delivery systems are specialized mechanisms used by bacteria to transport proteins, known as effectors, into host cells or the extracellular environment. These systems play crucial roles in bacterial pathogenesis and symbiosis by facilitating interactions between bacteria and their hosts [58,59]. The recognition of effector delivery systems is crucial as it uncovers how pathogens manipulate host cells to facilitate infection, thereby providing key insights for developing new therapeutic strategies, vaccines, and biological defense measures. Moreover, by predicting the categories of effector delivery systems, we can also validate the effectiveness of our model in classifying VFs.

We first extracted the sequences belonging to the Effector delivery system category from the multi-classification dataset, resulting in a total of 1,580 sequences. Due to the limited number of sequences in the T5SS category, we focused our classification analysis on the five categories of T2SS, T3SS, T4SS, T6SS, and T7SS. After filtering, the sequence counts for T2SS, T3SS, T4SS, T6SS, and T7SS were 86, 649, 561, 206, and 71, respectively, amounting to a total of 1,573 sequences. These sequences were then divided into training, test, and validation sets in an 8:1:1 ratio. The detailed distribution of the data is presented in Table 6.

Table 6. The distribution of the Effector delivery system dataset.

https://doi.org/10.1371/journal.pone.0340194.t006

From the confusion matrix in Fig 8A, it is evident that the model performs well in predicting the categories of T2SS, T3SS, T4SS, and T6SS. However, its performance for the T7SS category is slightly weaker, primarily due to the limited number of samples in this category. Furthermore, the ROC curve in Fig 8B demonstrates that the model can accurately distinguish among the various categories in the Effector delivery system dataset. By leveraging the complementary strengths of sequence and structure data, PLM-GNN not only significantly enhances the model’s discriminative capabilities but also ensures its broad applicability across diverse datasets.

Fig 8. The performance on the test set for the five types of Effector delivery system.

https://doi.org/10.1371/journal.pone.0340194.g008

Performance on binary classification of VFs

Since research on VFs usually requires first determining whether a protein is a VF before conducting more detailed classification, we retrained and evaluated the model on a binary classification dataset to ensure that a complete VF classification pipeline can be achieved with our model. Here, we used the DeepVF [15] dataset, from which we obtained a dataset containing 2,873 VFs and 2,872 non-VFs. Subsequently, we divided it into a training set and a validation set in an 8:2 ratio. Additionally, to better compare performance with other models, we chose the DeepVF test set for testing.

We then compared PLM-GNN with several state-of-the-art VF identification models on the same test set, including BLAST [11], MP3 [13], PBVF [14], VirulentPred [12], DeepVF, VF-Pred [16], DTVF [23] and GTAE-VF [26]. As shown in Table 7, PLM-GNN achieved higher accuracy scores than the other eight models, with accuracy improvements of 11.2%, 20.2%, 6.8%, 25.5%, 5.0%, 2.7%, 1.6%, and 1.3%, respectively. Overall, our model demonstrates consistently superior performance across key evaluation metrics, underscoring the effectiveness of the proposed dual-channel architecture.

Table 7. Comparison of PLM-GNN’s performance with existing methods on the VF identification task using the same test set.

https://doi.org/10.1371/journal.pone.0340194.t007

Accurately classifying VFs with remote homology

In biochemistry, the function of a protein is predominantly governed by its three-dimensional structure rather than its amino acid sequence alone. Certain proteins exhibit low sequence similarity while sharing structure similarities, resulting in analogous functional characteristics. Structure alignment of proteins offers profound insights into functional relationships across extended evolutionary distances, a feat often unachievable through sequence-based alignment methods [60,61]. Protein pairs that have dissimilar sequences but comparable structures—identified by a sequence identity of less than 0.3 and a TM-score exceeding 0.5—are referred to as “remote homologs” [62–64].

As demonstrated in Fig 9, the PLM-GNN model exhibits a strong capability to learn from structurally similar VFs and identify remote homology even under low sequence similarity, enabling accurate functional classification. This highlights the model’s capacity to extract essential information from 3D protein structures, effectively recognizing and categorizing functionally analogous VFs beyond sequence-based constraints. To quantitatively validate this ability, we constructed a dedicated dataset containing 93 VFs from the test set that show remote homology to sequences in the training set. Our model achieved 91.4% accuracy, correctly predicting 85 out of 93 instances. In comparison, the standalone sequence channel achieved 79 correct predictions, and the standalone structure channel achieved 70. These results demonstrate that the dual-channel architecture—which integrates complementary sequence and structure features—significantly enhances the model’s performance in detecting remote homology.

Fig 9. Identify remote homology VFs.

(Train refers to the Effector delivery system type VFs in the training set, while Test refers to the Effector delivery system type VFs correctly predicted in the test set.)

https://doi.org/10.1371/journal.pone.0340194.g009

Discussion and conclusion

This study proposes a novel dual-channel model, PLM-GNN, which independently captures the sequence features and structural features of VFs through a dedicated CNN-Transformer channel and a geometric graph neural network channel, respectively. By integrating the representations from both channels, the model achieved state-of-the-art performance in both multi-class and binary VF classification tasks, surpassing existing methods, as validated through comparative evaluation and t-SNE visualization. PLM-GNN also demonstrated strong performance in predicting effector delivery systems, highlighting its broad application potential in improving protein functional annotation. While the model may facilitate future studies aimed at identifying virulence-related mechanisms or therapeutic targets [65], its immediate contribution is providing a more accurate and comprehensive classification tool for VFs.

However, several important limitations of this study should be noted. First, constrained by the dataset scale, the model was trained only on the seven most abundant VF categories from the VFDB core dataset, excluding seven rarer VF categories. This inevitably limits the model’s generalization capability for under-represented VF categories, and its performance on these categories remains unclear. Second, the model underperformed in predicting minority classes such as effector delivery systems (e.g., T7SS), highlighting the challenges posed by highly imbalanced data. Future work will explore solutions such as few-shot learning or employing generative adversarial networks (GANs) to generate samples for rare categories. Third, due to GPU memory constraints, proteins longer than 1240 amino acids were excluded, which may prevent the model from learning long-sequence VFs with complex domains. Future plans include utilizing GPUs with larger memory capacity and exploring strategies such as segmenting long-sequence VFs for processing. Looking ahead, we will focus on expanding the model’s application to all categories in VFDB, enhancing the scalability of the model architecture, and integrating features from emerging large language models for amino acid sequences (e.g., Nucleotide Transformer [66]) to further improve performance.

Supporting information

S1 File. Distribution of pLDDT and correlation with model performance.

https://doi.org/10.1371/journal.pone.0340194.s001

(DOCX)

S2 File. Detailed derivation of geometric graph features.

https://doi.org/10.1371/journal.pone.0340194.s002

(DOCX)

S3 File. Supplemental figures and tables.

https://doi.org/10.1371/journal.pone.0340194.s003

(DOCX)

References

  1. 1. Becker K, Hu Y, Biller-Andorno N. Infectious diseases - a global challenge. Int J Med Microbiol. 2006;296(4–5):179–85. pmid:16446113
  2. 2. Liu B, Zheng D, Jin Q, Chen L, Yang J. VFDB 2019: a comparative pathogenomic platform with an interactive web interface. Nucleic Acids Res. 2019;47(D1):D687–92. pmid:30395255
  3. 3. Wu H-J, Wang AH-J, Jennings MP. Discovery of virulence factors of pathogenic bacteria. Curr Opin Chem Biol. 2008;12(1):93–101. pmid:18284925
  4. 4. Chen L, Yang J, Yu J, Yao Z, Sun L, Shen Y, et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 2005;33:D325–8. pmid:15608208
  5. 5. Rasko DA, Sperandio V. Anti-virulence strategies to combat bacteria-mediated disease. Nat Rev Drug Discov. 2010;9(2):117–28. pmid:20081869
  6. 6. Liu B, Zheng D, Zhou S, Chen L, Yang J. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 2022;50(D1):D912–7. pmid:34850947
  7. 7. Sayers S, Li L, Ong E, Deng S, Fu G, Lin Y, et al. Victors: a web-based knowledge base of virulence factors in human and animal pathogens. Nucleic Acids Res. 2019;47(D1):D693–700. pmid:30365026
  8. 8. Zhou CE, Smith J, Lam M, Zemla A, Dyer MD, Slezak T. MvirDB--a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defence applications. Nucleic Acids Res. 2007;35:D391–4. pmid:17090593
  9. 9. Li J, Tai C, Deng Z, Zhong W, He Y, Ou H-Y. VRprofile: gene-cluster-detection-based profiling of virulence and antibiotic resistance traits encoded within genome sequences of pathogenic bacteria. Brief Bioinform. 2018;19(4):566–74. pmid:28077405
  10. 10. Finn RD, Clements J, Arndt W, Miller BL, Wheeler TJ, Schreiber F, et al. HMMER web server: 2015 update. Nucleic Acids Res. 2015;43(W1):W30–8. pmid:25943547
  11. 11. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic Acids Res. 2008;36(Web Server issue):W5–9. pmid:18440982
  12. 12. Garg A, Gupta D. VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics. 2008;9:62. pmid:18226234
  13. 13. Gupta A, Kapil R, Dhakan DB, Sharma VK. MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS One. 2014;9(4):e93907. pmid:24736651
  14. 14. Rentzsch R, Deneke C, Nitsche A, Renard BY. Predicting bacterial virulence factors - evaluation of machine learning and negative data strategies. Brief Bioinform. 2020;21(5):1596–608. pmid:32978619
  15. 15. Xie R, Li J, Wang J, Dai W, Leier A, Marquez-Lago TT, et al. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform. 2021;22(3):bbaa125. pmid:32599617
  16. 16. Singh S, Le NQK, Wang C. VF-Pred: Predicting virulence factor using sequence alignment percentage and ensemble learning models. Comput Biol Med. 2024;168:107662. pmid:37979206
  17. 17. Zheng D, Pang G, Liu B, Chen L, Yang J. Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors. Bioinformatics. 2020;36(12):3693–702. pmid:32251507
  18. 18. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. pmid:33876751
  19. 19. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. pmid:36927031
  20. 20. Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell. 2021;44:1–1. pmid:34232869
  21. 21. Zhang Y, Guan J, Li C, Wang Z, Deng Z, Gasser RB, et al. DeepSecE: a deep-learning-based framework for multiclass prediction of secreted proteins in Gram-negative bacteria. Research (Wash D C). 2023;6:0258. pmid:37886621
  22. 22. Bai P, Li G, Luo J, Liang C. Deep learning model for protein multi-label subcellular localization and function prediction based on multi-task collaborative training. Brief Bioinform. 2024;25(6):bbae568. pmid:39489606
  23. 23. Sun J, Yin H, Ju C, Wang Y, Yang Z. DTVF: a user-friendly tool for virulence factor prediction based on ProtT5 and deep transfer learning models. Genes (Basel). 2024;15(9):1170. pmid:39336761
  24. 24. Anfinsen CB. Principles that govern the folding of protein chains. Science. 1973;181(4096):223–30. pmid:4124164
  25. 25. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
  26. 26. Li G, Bai P, Chen J, Liang C. Identifying virulence factors using graph transformer autoencoder with ESMFold-predicted structures. Comput Biol Med. 2024;170:108062. pmid:38308869
  27. 27. Kipf TN, Welling M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv. 2016.
  28. 28. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. In: arXiv.org. 2017.
  29. 29. Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P. Geometric deep learning: going beyond euclidean data. IEEE Signal Process Mag. 2017;34(4):18–42.
  30. 30. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1(4):541–51.
  31. 31. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–6.
  32. 32. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. pmid:23060610
  33. 33. Luo Z, Wang R, Sun Y, Liu J, Chen Z, Zhang Y-J. Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction. Brief Bioinform. 2024;25(2):bbad534. pmid:38279650
  34. 34. Graves A. Long Short-Term Memory. Stud Comput Intell. 2012;385:37–45.
  35. 35. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv (Cornell University). 2014. pp. 1406.
  36. 36. Grossberg S. Recurrent neural networks. Scholarpedia. 2013;8(2):1888.
  37. 37. Koushik J. Understanding Convolutional Neural Networks. arXiv. 2016.
  38. 38. Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.
  39. 39. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers. 1983;22(12):2577–637.
  40. 40. Engh RA, Huber R. Accurate bond and angle parameters for X-ray protein structure refinement. Acta Crystallogr A Found Crystallogr. 1991;47(4):392–400.
  41. 41. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural Message Passing for Quantum Chemistry. arXiv (Cornell University). 2017.
  42. 42. Lemieux GS-P, Paquet E, Viktor HL, Michalowski W. Geometric deep learning for protein–protein interaction predictions. IEEE Access. 2022;10:90045–55.
  43. 43. Bronstein MM, Bruna J, Cohen T, Veličković P. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv. Cornell University. 2021.
  44. 44. Song Y, Yuan Q, Chen S, Zeng Y, Zhao H, Yang Y. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat Commun. 2024;15(1):8180. pmid:39294165
  45. 45. Mi J, Wang H, Li J, Sun J, Li C, Wan J, et al. GGN-GO: geometric graph networks for predicting protein function by multi-scale structure features. Brief Bioinform. 2024;25(6):bbae559. pmid:39487084
  46. 46. Song Y, Yuan Q, Zhao H, Yang Y. Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Brief Bioinform. 2023;24(6):bbad360. pmid:37824738
  47. 47. Shi Y, Huang Z, Feng S, Zhong H, Wang W, Sun Y. Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. 2021.
  48. 48. Sudre CH, Li W, Vercauteren T, Ourselin S, Jorge Cardoso M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. Cardoso MJ, Arbel T, Carneiro G, Syeda-Mahmood T, Tavares JMRS, Moradi M, et al., editors. In: Springer Link [Internet]. Cham: Springer International Publishing; 2017. pp. 240–8.
  49. 49. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12.
  50. 50. Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. arXiv e-prints. 2016. pp. 1608.
  51. 51. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.
  52. 52. Pacal I, Attallah O. Hybrid deep learning model for automated colorectal cancer detection using local and global feature extraction. Knowl-Based Syst. 2025;319:113625.
  53. 53. Pacal I, Attallah O. InceptionNeXt-Transformer: a novel multi-scale deep feature learning architecture for multimodal breast cancer diagnosis. Biomed Signal Process Control. 2025;110:108116.
  54. 54. Velickovic P, Cucurull G, Casanova A, Romero A, Lio’ P, Bengio Y. Graph attention networks. In: ICLR. 2018.
  55. 55. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11).
  56. 56. Álvarez-Rodríguez I, Arana L, Ugarte-Uribe B, Gómez-Rubio E, Martín-Santamaría S, Garbisu C, et al. Type IV coupling proteins as potential targets to control the dissemination of antibiotic resistance. Front Mol Biosci. 2020;7:201. pmid:32903459
  57. 57. Costa TRD, Harb L, Khara P, Zeng L, Hu B, Christie PJ. Type IV secretion systems: Advances in structure, function, and activation. Mol Microbiol. 2021;115(3):436–52. pmid:33326642
  58. 58. Wagner S, Grin I, Malmsheimer S, Singh N, Torres-Vargas CE, Westerhausen S. Bacterial type III secretion systems: a complex device for the delivery of bacterial effector proteins into eukaryotic host cells. FEMS Microbiol Lett. 2018;365(19):fny201. pmid:30107569
  59. 59. Burkinshaw BJ, Liang X, Wong M, Le ANH, Lam L, Dong TG. A type VI secretion system effector delivery mechanism dependent on PAAR and a chaperone-co-chaperone complex. Nat Microbiol. 2018;3(5):632–40. pmid:29632369
  60. 60. Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, et al. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nat Commun. 2024;15(1):2775. pmid:38555371
  61. 61. Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, et al. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol. 2024;42(6):975–85. pmid:37679542
  62. 62. Li G, Zhou J, Luo J, Liang C. Accurate prediction of virulence factors using pre-train protein language model and ensemble learning. BMC Genomics. 2025;26(1):517. pmid:40399812
  63. 63. Xu J, Zhang Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics. 2010;26(7):889–95. pmid:20164152
  64. 64. Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J. On the origin and highly likely completeness of single-domain protein structures. Proc Natl Acad Sci U S A. 2006;103(8):2605–10. pmid:16478803
  65. 65. Dickey SW, Cheung GYC, Otto M. Different drugs for bad bugs: antivirulence strategies in the age of antibiotic resistance. Nat Rev Drug Discov. 2017;16(7):457–71.
  66. 66. Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, Grzywaczewski AH, Oteri F, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods. 2025;22(2):287–97. pmid:39609566