MsgaBpred: A B-cell epitope predictor integrating AlphaFold3-predicted structures with multi-scale GCNs and pre-trained language model ESM-C

doi:10.1371/journal.pcbi.1014195

Fig 1.

The overall framework of the proposed MsgaBpred model.

(A) Data Collection. A total of 245 candidate unbound antigens were curated through rigorous filtering. Among them, 200 antigens were used for training and the remaining 45 for testing. (B) Feature Extractor. The antigen sequences are fed into the pretrained language model ESM-C to obtain evolutionary embeddings. Simultaneously, the sequences are submitted to AlphaFold3 to predict the 3D structures. These predicted structures are then processed through the inverse folding model ESM-IF1 and the DSSP algorithm to extract structural embeddings. Finally, the ESM-C embeddings, ESM-IF1 outputs, and DSSP-derived features are concatenated to form comprehensive antigen representations. All pretrained models are used as frozen feature extractors; only the additive attention, multi-scale GCN, and MLP modules are trained. (C) Model Architecture. The extracted features serve as inputs for two distinct modules: (1) an additive attention module, which captures global dependencies and key residue importance; and (2) a protein graph module where nodes correspond to amino acid residues, and edges represent spatial as well as sequential relationships inferred from the structure predicted by AlphaFold3. Node and edge features are processed by a two-layer multi-scale graph convolutional network (GCN). The outputs from both the attention module and the GCN are then concatenated and fed into a multi-layer perceptron (MLP) for final prediction. (D) Multi-Scale GCN. The multi-scale GCN consists of two stacked layers, each employing three parallel graph convolution kernels with receptive field sizes of 1, 2, and 3 hops to capture structural dependencies at multiple scales. The outputs from each kernel are fused via a linear transformation. This multi-scale design enhances the model’s ability to capture both short- and long-range interactions across the protein graph.
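
The multi-scale GCN described in panel (D) can be sketched in a few lines of NumPy. This is an illustration only, not the authors' implementation: the symmetric normalization of the adjacency matrix with self-loops, the ReLU activation, the choice to concatenate the three branches before the linear fusion, and all names and dimensions (`normalized_adjacency`, `multi_scale_gcn_layer`, `build_toy_inputs`, the 6-residue chain graph) are assumptions that the caption does not specify.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetrically normalize A + I (standard GCN normalization; assumed here)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def multi_scale_gcn_layer(x, adj, hop_weights, fuse_weight):
    """One layer with parallel 1-, 2-, ... k-hop kernels fused by a linear map.

    x:           (n_residues, d_in) node features
    adj:         (n_residues, n_residues) binary adjacency matrix
    hop_weights: list of (d_in, d_hid) matrices, one per hop size
    fuse_weight: (len(hop_weights) * d_hid, d_out) fusion matrix
    """
    a_norm = normalized_adjacency(adj)
    branches = []
    propagated = x
    for weight in hop_weights:
        propagated = a_norm @ propagated      # extend the receptive field by one hop
        branches.append(propagated @ weight)  # kernel for this hop size
    fused = np.concatenate(branches, axis=1) @ fuse_weight  # linear fusion (assumed form)
    return np.maximum(fused, 0.0)             # ReLU (assumed activation)


def build_toy_inputs(n_residues=6, d_in=8, d_hid=4, d_out=5, seed=0):
    """Hypothetical toy data: a chain graph with random features and weights."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((n_residues, n_residues))
    for i in range(n_residues - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0   # sequential neighbors only
    x = rng.normal(size=(n_residues, d_in))
    hop_weights = [rng.normal(size=(d_in, d_hid)) for _ in range(3)]
    fuse_weight = rng.normal(size=(3 * d_hid, d_out))
    return x, adj, hop_weights, fuse_weight
```

Calling `multi_scale_gcn_layer(*build_toy_inputs())` returns one fused feature vector per residue. Applying the function twice, with the second layer's input width set to the first layer's output width, would mirror the two stacked layers mentioned in the caption.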

Fig 2.

A. The performance of using different feature combinations on the independent test data.

The evaluation metrics include AUC, AUPR, Precision, F1-score, MCC, and BACC. B. Ablation study results showing the performance of the full model and three variant architectures across five metrics: AUC, AUPR, Precision, F1-score, and MCC. "w/o" denotes the variant without the corresponding module.
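
For reference, the threshold-dependent metrics named above (Precision, F1-score, MCC, and BACC) follow directly from the confusion counts of a binary epitope/non-epitope prediction. The sketch below uses their standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions for illustration. AUC and AUPR are computed from ranked scores rather than hard labels and are not covered here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = epitope, 0 = non-epitope)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred):
    """Precision, F1, MCC, and balanced accuracy from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    bacc = (recall + specificity) / 2
    return {"precision": precision, "f1": f1, "mcc": mcc, "bacc": bacc}
```

MCC and BACC are commonly reported alongside F1 when classes are imbalanced, as is typical for per-residue epitope labels, because both account for true negatives.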

Fig 3.

A. Performance of MsgaBpred compared to state-of-the-art approaches on the Epitope3D benchmark.

B. Comparison of MsgaBpred with state-of-the-art methods on three distinct test datasets: DiscoTope3_Foldx, DiscoTope3_Solved, and DiscoTope3_Af2. The blue bars, which represent MsgaBpred, are the highest in all three panels, and the bars for the other methods are considerably lower.

Table 1.

Model performance results with statistical analysis. Delta denotes the difference between MsgaBpred and EpiGraph.

Table 2.

The performance of using different sequence representations on the Epitope3D dataset.

Table 3.

The performance of using different structure representations on the Epitope3D dataset.

Fig 4.

A. B-cell epitope prediction performance of MsgaBpred using different sources of protein structural information.

A heatmap visualizes performance across evaluation metrics, with color intensity corresponding to the ranking of each model; darker colors indicate better performance. B. A positive association was identified between the quality of the AlphaFold3-predicted structures (as measured by GDT) and the performance of MsgaBpred on the Epitope3D test set. Black scatter points indicate the GDT and AUC values for each individual protein, while the red line illustrates the relationship between GDT and AUC.

Table 4.

The predictive performance of different methods on the example antigen (PDB ID: 3BIK, Chain: A).

Fig 5.

Visualization results of MsgaBpred, EpiGraph, GraphBepi, and DiscoTope-3.0 on test data (PDB ID: 3BIK, Chain: A). True positives, false negatives, and false positives are colored in green, red, and yellow, respectively.
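
The coloring scheme in this caption amounts to a set comparison between annotated and predicted epitope residues. A minimal sketch follows, assuming residues are identified by integer positions; the function name `color_residues` and the residue numbers in the usage note are hypothetical and are not taken from 3BIK.

```python
def color_residues(true_epitope, predicted_epitope):
    """Map each residue to a color: TP green, FN red, FP yellow.

    Residues that are neither annotated nor predicted (true negatives)
    are left out of the mapping and would keep the default color.
    """
    true_set = set(true_epitope)
    pred_set = set(predicted_epitope)
    colors = {}
    for residue in true_set & pred_set:
        colors[residue] = "green"   # true positive
    for residue in true_set - pred_set:
        colors[residue] = "red"     # false negative
    for residue in pred_set - true_set:
        colors[residue] = "yellow"  # false positive
    return colors
```

For example, with annotated residues `{10, 11, 12}` and predicted residues `{11, 12, 13}`, residue 10 is a false negative, 11 and 12 are true positives, and 13 is a false positive.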

Fig 6.

Visualization results of MsgaBpred on test data (PDB ID: 1TFX, Chain: C; PDB ID: 1OQE, Chain: L). True positives are colored in green. Residues are colored in red.
