Figures
Abstract
Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within training data, these models often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical to ensure real-world applications. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model empirical risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis we use several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and -2.2, and ERGO II (pre-trained autoencoder) and ERGO II (LSTM). In this work, we introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task definition, which is more interesting when measuring the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting using sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. This could then be used to estimate a confidence score on predictions on novel and unseen peptides, based on how different they are from the training ones. Additionally, our results may hint that employing 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors.
Citation: Castorina LV, Grazioli F, Machart P, Mösch A, Errica F (2025) Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis. PLoS One 20(5): e0324011. https://doi.org/10.1371/journal.pone.0324011
Editor: Amit Kumar, University of Cagliari, ITALY
Received: March 20, 2024; Accepted: April 19, 2025; Published: May 20, 2025
Copyright: © 2025 Castorina et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data for this study are publicly available from the GitHub repository (https://github.com/nec-research/peptide-distance-analysis).
Funding: Funding for this project was provided by NEC Labs Europe GmbH.
Competing interests: Funding for this project was provided by NEC Labs Europe GmbH.
Introduction
The immune system has evolved to recognize different pathogens, such as viruses, which enter cells and exploit their resources for replication. The immune response depends on how well the system can distinguish between healthy and infected/aberrant cells. It does so by leveraging specific molecules on the cell surface called Major Histocompatibility Complexes (MHC). Class I or class II MHCs are presented depending on the cell type [1–3]. An antigenic peptide is displayed by the MHC forming a peptide-bound MHC (pMHC). These peptides are derived from the proteasome, a complex of protease enzymes in cells which break down proteins into short peptides [4].
In infected cells, the presented peptides may be derived from viral proteins, whereas in healthy cells, peptides are derived from housekeeping proteins [5]. T cells can recognize viral peptides in the pMHC by binding it with the T Cell Receptors (TCRs), leading to an immune response [4]. TCRs are able to recognize many different peptides via diverse sequences at the variable regions of their and
chains [6]. TCR binding to the pMHC primarily occurs at the Complementarity-Determining Regions (CDRs). The peptide recognition is primarily mediated by the CDR3. The CDR3-
is derived from the alleles of the V and J genes; the D gene, in addition to the V and J, is involved in shaping the sequence and structure of the CDR3-
, [7, 8]. These alleles can be recombined extensively, ensuring a high TCR repertoire diversity and allowing for a broad T cell-based immune response [9]. The exposure of a naive T cell to an antigen leads to its activation and to the development of a memory T cell population with the same TCR. This allows for a long-lasting immune response [10, 11]. A visual overview of the TCR-pMHC complex is presented in the S1 File.
Immunotherapies targeting cancer or viral infections use binding specificity to activate T cells, enhancing the immune system response. Binding specificity ensures that T cell binding occurs uniquely on the target, to kill say cancerous or infected cells, and avoid healthy cells. Research in immunotherapy covers two main categories: adoptive T cell therapy and T cell inducing vaccines. In adoptive T cell therapy, cancer patients receive specific T cells targeting and destroying the tumour [12, 13]. T cell inducing vaccines leverage antigenic peptides or other classes of antigens to trigger specific T cell development against pathogens, such as viruses [14, 15].
Immunotherapy design requires an in-depth understanding of the biochemical interactions between TCRs and pMHCs to accurately stimulate the immune system. The development of computational models that predict whether TCRs and pMHCs interact would allow for a faster in silico screening of sequences and consequent design of TCR sequences that bind specific target pMHCs. Recent advancements in machine learning (ML) lead to the development of TCR binding predictors [16–22]. The data input of these models consists of tuples of short amino acid sequences such as (peptide, CDR3-) or (peptide, CDR3-
, CDR3-
, MHC). The task is usually formalized as a binary classification between binding or non-binding pairs.
Several computational studies on TCR binding prediction employ data from the Immune Epitope Database (IEDB) [23], VDJdb [24] and McPAS-TCR [25]. These databases mainly contain CDR3- sequences, but often lack information on CDR3-
. Additionally, public TCR-pMHC interaction datasets are often limited in diversity and size. They also present sequence bias towards commonly studied viruses or binding/non-binding pairs bias from experiment setups. Furthermore, laboratory methods that validate the TCR-pMHC interactions, such as surface plasmon resonance, titration calorimetry and fluorescence anisotropy, are usually resource-intensive and time-consuming, hindering the creation of larger datasets [26].
Generally, machine learning (ML) TCR binding predictors achieve high test performance when evaluated on test sets originating from the same source as the training set. However, various studies [27–29] have shown that these methods exhibit weak cross-dataset generalization, meaning, models performance is significantly lower when training and test samples come from different distributions. Furthermore, [27] investigated the effect of different splitting techniques on the TChard dataset, which combines samples from IEDB, VDJdb and McPAS-TCR. They also investigated the effect of considering negative samples from wet lab assays and from random shuffling of the positive tuples. They observed that, when using a “vanilla” Random Split (RS) with negative samples from wet lab assays, ML-based models achieve an estimated area under the receiver operator characteristic (AUROC) larger than on the test set. In this setting, the sets of CDR3 sequences in positive and negative samples are disjoint. This allows the models to memorize whether a CDR3 sequence was observed in either the negative or positive samples at training time. [27] then showed that a simple countermeasure to this problem consists in creating negative samples by randomly shuffling the positive tuples. In this setting, when the RS is employed, models achieve an AUROC score between
and
. Nevertheless, using the RS, peptides and CDR3 sequences may appear in both the training and test sets, leading to inflated estimates of the real-world model generalization capabilities.
Hence, to approximately test model performance for unseen peptides, they propose an alternative splitting method, named Hard Split (HS). The HS ensures that test peptides are never observed at training time. Therefore, the peptide sequences in the training, validation and test sets are unique to their respective sets. When using the HS and negative samples are obtained via random shuffling, the model performs barely better than random guesses, with an AUROC smaller than .
To ensure real-world applicability, the HS, as shown by [27], aims at evaluating models’ generalization abilities under the toughest setup possible. In fact, new peptides arising from new pathogens may differ significantly from those used at training time. ML models can only be safely employed for broad real-world applications if they show sufficient generalization to unseen sequences.
In practice, however, RS and HS represent two extremes of a wider spectrum of data splitting options. Assessing other splitting options can provide further insights on the robustness of ML models.
Our work is driven by the following research question: can we estimate the performance of ML models given an unseen test peptide and its distance from the training ones? We hypothesize that, as the difference in distributions between training and test sets increases, the prediction task becomes gradually harder. Knowing how well ML predictors generalize to an increasing shift in distributions between training and test sets would provide a tangible metric to measure when it is appropriate and to which extent we can employ these methods in the real world.
For this reason, in this work we introduce a new dataset splitting algorithm called Distance Split (DS). Analogously to the HS, peptides placed in the training set are absent from the test set and vice versa. However, given a distance metric, we control the distances between sequences in the training and in the test sets. We increase such distance to test performance on peptides that are “further apart” from the ones seen by the model at training time, making the task increasingly harder.
Previous efforts have predominantly relied on sequence-based metrics, such as Levenshtein distance and BLOSUM substitution matrices, to measure similarity between peptides [30–32]. However, sequence metrics may not fully capture the structural aspects of TCR-peptide interactions, which are essential for accurate binding predictions. The integration of 3D structural information, such as Root Mean Square Deviation (RMSD) between peptides, could offer a more comprehensive approach to evaluating model performance on unseen peptides. Therefore, we explore different distance metrics, i.e., sequence metrics such as Levenshtein and BLOSUM [33], as well as shape metrics such as RMSD [34], using the predicted structures of peptides. As opposed to binning the HS based on distance, the DS allows for control over the median distance of the peptides in the test and validation.
We show that the increased peptide distance between the training and test split directly correlates with the models performances. In particular, increased RMSD (shape) distance between training and test peptides leads to decreased performance. Surprisingly, for BLOSUM (sequence) distance we see the inverse relationship. In real-world scenarios, when predictions on new unseen viral epitopes are required, we believe that leveraging RMSD between the 3D structures of the training and test peptides could serve as a valuable indicator of model reliability, with higher RMSD implying increased prediction uncertainty. Conversely, the inverse relationship observed with BLOSUM distance suggests that incorporating sequence-diverse training data may actually improve model generalization to novel peptides. Thus, a combination of structural and sequence-based metrics could provide a balanced approach, enhancing both the robustness and reliability of TCR-pMHC binding predictions.
Materials and methods
Dataset creation
With the goal of vaccine development against viruses, we select a viral subset of the VDJdb dataset, focusing on human host and MHC class I [35]. We omit MHC class II due to the small number of available samples. The dataset includes peptides from SARS-CoV-2, Influenza A, Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV) and Cytomegalovirus (CMV). We discard data points which do not include both the CDR3- and the peptide. The dataset contains 52 unique MHC A and 1 MHC B alleles, 16,504 unique CDR3-
sequences, 28,831 unique CDR3-
sequences and 757 unique peptides, for a total of 34,415 binding samples. To generate non-binding (i.e., negative) samples, we randomly shuffle the available (peptide, CDR3-
) pairs. This process is commonly employed in the literature [19, 36] and leverages the hypothesis that random pairs will most likely not bind. To create a balanced dataset, we randomly generate 36,641 samples of non-binding combinations of CDR3-
and peptide sequences, to increase the total number of data points to 65,946. In this study, the data consists of pairs of (peptide, CDR3-
) and a binding/non-binding label.
Obtaining 3D structures of peptides.
The VDJdb dataset contains information on the primary structure of proteins, i.e., the sequence of amino acids. In reality, these sequences exist as 3D shapes and as part of a complex [35]. However, public datasets like VDJDB lack the 3D shapes of peptides. We use ESMFold, a light-weight sequence-to-shape Language Model [37] and its online API1 to generate 3D structures for the peptide sequences. For structures that gave errors, we use OmegaFold [38], a similar sequence-to-shape Language Model. We use the shapes to calculate the 3D distance between peptides, as described in the following section.
The distance split algorithm
Given a distance metric, the goal of the Distance Split (DS) algorithm is to create data splits of the available (peptide, CDR3-) samples, enforcing a specified median peptide distance between training and test (and validation).
As distance metrics, we use Levenshtein, BLOSUM and RMSD. The Levenshtein distance is a string edit distance measuring the minimum number of edits required to change one peptide sequence into another [39]. The BLOSUM distance is a substitution matrix-based metric that quantifies the evolutionary similarity between two peptide sequences by comparing their amino acid global alignments [40]. The RMSD distance is a 3D structural metric that measures the average deviation between the atomic positions of two superimposed peptide structures, providing an estimate of their conformational similarity. We calculated the RMSD shape distance using PyMol [34]. PyMol aligns any pair of peptide 3D structures computes the RMSD between C of the two structures. The C
atoms are the backbone carbon atoms in each amino acid, which are commonly used in RMSD calculations to provide a simplified yet accurate representation of the overall peptide structure. We focus on C
atoms to avoid over-penalizing minor deviations in side chains, as protein folding models can generate physically impossible conformations that would otherwise lead to inflated RMSD values [41].
For each metric (Levenshtein, BLOSUM, RMSD), we calculate the pairwise distances between all 757 peptides, resulting in three distance matrices: ,
. For each distance matrix, we then calculate the row-wise median
. Each value of
corresponds to the median distance between a given peptide and all other peptides. We then select three bin ranges over the cumulative distribution of the median distances of
, defined by a lower and an upper bound
distance, i.e., (0,33), (33,66) and (66,100). Given the median vector
, for each interval, we compute the distances corresponding to the lower and upper percentiles, i.e., dl and du, for bl and bu, respectively. We then filter out all the peptides whose median distance falls outside the interval of interest. The sets of peptides that fall inside this percentile interval are then sampled for test and validation. For each training-validation-test split, we use a 90-5-5 ratio. The DS algorithm iteratively samples the validation and test sets based on this ratio and the total available data. To guide this process, we calculate the expected number of data points for each split Ntrain, Ntest,
, which we refer to as the test and validation budget.
We select the test samples by iteratively sampling peptides from the specified percentile interval. In each iteration, we randomly choose a peptide from this set, and move all corresponding (peptide, CDR3-) pairs to the test set. This process continues until the target number of test samples (test budget) is reached. Peptides that were not selected are used to form the validation set, following the same iterative procedure. Then, the remaining peptides are assigned to the training set.
Additionally, we enforce a minimum and maximum count for peptides to be included in the validation and test sets. This is meant to constrain the peptide diversity in the set. In this work, we use a minimum count of 5 maximum count of 5,000. The complete algorithm for the DS is available in 1.
Using the DS, as well as the RS and HS, we generate training, test and validation splits of the available data points for all distance metrics, ,
and
. As shown in S4 Fig, the average number of unique peptides is roughly the same across splits.
TCR-peptide interaction prediction models
We select 5 state-of-the-art models for TCR-peptide interaction prediction: Attentive Variational Information Bottleneck (AVIB) [27], NetTCR-2.0 [36], NetTCR-2.2 [42], ERGO II (with pre-trained TCR autoencoder), and ERGO II-LSTM [16] (see details in S1 File, Sect 2). With the exception of ERGO, all models encode the TCR and peptide sequences using the BLOSUM substitution matrix [40].
Algorithm 1. Distance Split (DS) algorithm based on peptide distances within specified percentile bounds.
In AVIB, the TCR and Peptide encoders estimate Gaussian posteriors over a latent space. The various posteriors are then combined using an attention mechanism to estimate a single multi-sequence posterior distribution, which is then used to estimate the binding probability [27]. In NetTCR-2.0, the TCR
and peptide encoders are encoded using several convolutional and global max pooling layers. The encoding is then concatenated and passed through two fully connected layers [36]. NetTCR-2.2 introduces a dropout layer before the fully connected layers, ReLU over Sigmoid, and 64 units rather than 32 for the fully connected layer [42]. In the ERGO II model, the TCR
is encoded using a pre-trained autoencoder. ERGO II-LSTM encodes the peptide using an LSTM encoder. The encodings are then passed to two fully connected layers [16].
Results
We compute the pairwise distance between all peptides using Levenshtein, BLOSUM and RMSD distance for the viral VDJdb dataset. Descriptive statistics of the metrics are summarized in Table 1 and the median visualized in Fig 1. The peptides count distribution is heavily skewed towards a count of less than 10 instances, with several outliers, as shown by the high standard deviation. The median RMSD has a bimodal distribution with peaks at 0.0 and
0.5 Å. The median BLOSUM is normally distributed with most values ranging from 50 to 100. Finally, the median Levenshtein distance is
8, and shows very little distribution of values.
Distances are RMSD (3D shape), BLOSUM (sequence), and Levenshtein (sequence), respectively.
Then, we calculate the correlation between these distances in Fig 2. Levenshtein and BLOSUM distances are very significantly correlated (Spearman : 0.32, p-value: 0), as expected since BLOSUM distance uses a sequence alignment. However, additional physico-chemical information is implicitly accounted for in BLOSUM, which considers evolutionary substitution patterns of amino acids [40]. This inclusion captures the likelihood of biologically relevant substitutions based on factors like size, charge, and hydrophobicity, which are not reflected in simple sequence-based metrics like Levenshtein distance. We observe no correlation between RMSD and sequence-based metrics.
Levenshtein and BLOSUM are sequence-based distances, while RMSD is a shape-based distance. There is a positive correlation between sequence distances and no correlation between shape and sequence distances.
For our experiments, we use five state-of-the-art deep learning models for TCR-peptide interaction prediction: NetTCR-2.0 [36], NetTCR-2.2 [42], AVIB [19], ERGO II and ERGO II-LSTM [16]. We evaluate how increasing the distance between the training and test set affects their performance. We train and test all models with RS, HS as baselines and and DS. For the DS, we use lower and upper bound pairs over the cumulative distribution of the median distances: (0, 33), (33, 66) and (66, 100). We repeat each split with 5 different random seeds and the average results and 95% confidence intervals are shown in Fig 3. More detailed results with Levenshtein DS, overall and per-peptide metrics are available in S1 File Sect 2.
The commonly used Random Split allows test peptides to be observed also during training. In the Hard Split, peptides are exclusively allocated to either the training or test set. In the Distance Split, we enforce specific median distances between the training and test peptides, using either sequence distance (BLOSUM) or shape distance (RMSD). The training-test distance between peptides is controlled by selecting three percentile intervals over the cumulative median peptide-peptide distance distribution: (0,33), (33,66) and (66,100). Mean and 95% confidence intervals are calculated over 5 repeated experiments with different random seeds. More detailed results are available in S1 File Sect 2, with Levenshtein DS (S15 Fig), overall and per-peptide metrics (S1 File Sects 2.2.2 and 2.2.3).
As shown in Fig 3, AVIB and NetTCR-2.2 perform similarly in the sequence and shape DS splits, followed by NetTCR-2.0 and both ERGO II models. As expected, all models perform best in the RS, as peptides in the training set also appear in the test set. On the other hand, performance on the HS is significantly lower compared to the RS.
Regarding the DS splits, as the median distance between the training and test sets increases, model performance worsens in RMSD-based DS splits, as shown by significant Spearman correlations (p < 0.05) for NetTCR-2.2, NetTCR-2.0, and AVIB, all within a 95% confidence interval. In contrast, we observe the opposite trend for BLOSUM-based DS splits, where models perform better as the median BLOSUM distance between the training and test sets increases, with all p-values below 0.05. In general, the highest performance in the DS splits is slightly higher in the RMSD (0,33) bin, than in the BLOSUM (66,100) bin, except for ERGO II (LSTM). There is generally no significant effect on the performance trend when using Levenshtein-based distance split, except for ERGO II (LSTM) which shows a similar trend to BLOSUM (see S14 Fig).
Discussion
The TCR-peptide/pMHC interaction prediction problem presents evaluation challenges [27]. When data is randomly split, models can achieve over-optimistic test performance, as test sequences can be observed at training time.
As a solution, [27] proposed the Hard Split (HS), which consists in randomly sampling test peptides and moving all instances of that peptide to the test set. The HS ensures that test peptides are never seen at training time, thus simulating what happens in the real world, for example, when new peptides arise from new pathogens.
This raises the question of whether it is possible to infer how well a model will generalize to an unseen peptide, by knowing how different the novel peptide is from those seen at training time.
We divide the peptides into different training and test sets using a distance-based algorithm using a sequence metric (BLOSUM) and a shape metric (RMSD). To obtain the RMSD, we generate the 3D structures of the antigenic peptides using ESMFold and OmegaFold [37, 38]. We then use PyMol to align and calculate the Root Mean Squared Distance (RMSD) between the C atoms in the protein backbone [34]. We use C
RMSD, as opposed to all-atoms RMSD, as it is computationally less intensive and produces similar results [43]. Additionally, predicted 3D structures can (and often will) have artifacts such as physically-unlikely positioning of the amino acid side chains, which could unfairly increase the all-atom RMSD [41].
To simulate real-world scenarios involving unseen peptides in a controlled manner, we propose the Distance Split (DS) approach (see Algorithm 1). Using the DS, we chose each test split to guarantee a given median distance between peptides in the training and test peptides. By using the median, some peptides with properties similar to those in the training set may still be included in the test set, thereby allowing for a degree of overlap while maintaining sufficient variability. Nevertheless, when employing the DS, analogously to the HS, test and validation peptides cannot appear at training time. For more stringent conditions, the minimum distance could be used with the DS to ensure that all test peptides are distinctly different from those in the training set.
Specifically, the test peptides are chosen by sampling from a subset of peptides, whose median distance from the other peptides is between a lower and an upper bound. In our experiments, we consider the following intervals over the cumulative median distance distribution: (0,33), (33,66) and (66,100).
As shown in Fig 2, we observe no correlation between RMSD and BLOSUM. Several factors could explain this. The 3D structure, as opposed to the amino acid sequence, is more relevant to describe protein-protein interactions. For example, certain angles in structures could make specific areas interact more than others. Related to this, some small differences in sequence metrics may result in larger structural differences, for instance, if two oppositely charged amino acids are placed next to each other. The 3D structure and RMSD metric can therefore provide a stronger signal for binding prediction models and may be used for testing models performance in out-of-distribution settings, or for the development of new models. Interestingly, we observe that models’ performance is slightly lower for DS in the ranges (33,66) and (66,100) compared to that of the HS. The HS ensures that both test and training peptides are unique. However, even with this uniqueness, test and training peptides might still possess a certain degree of similarity. This similarity provides the model with additional chemical information about the peptide. In contrast, DS enforces a threshold distance between the training and test sets, thereby controlling the difficulty.
Obtaining a 3D structure of a protein often requires lengthy experiments to crystallize it, which may take days, months, or years [44]. With the advent of sequence-to-shape models like AlphaFold [45], OmegaFold [38] and ESMFold [37], this lengthy process can often be reduced to minutes or hours and makes it possible to obtain a fairly accurate 3D shapes of proteins. Structural validation in the lab of these models is currently underway, with some researchers reporting partial success, even when solving for previously unknown structures [46]. Successful applications of sequence-to-shape models in protein research include vaccine design [47], binding affinity ranking [48], protein sequence design [49, 50] and benchmarking [51].
The structures produced by sequence-to-shape models like the ones mentioned above, may be biased towards the training sequences in the Protein Data Bank (PDB), which is composed of larger structures compared to peptides, which are made up of just a few amino acids in length [52]. Nevertheless, in this study, we use the shape as a representation of the real shape and assume that differences between two shapes would be consistent. This analysis could be extended to the full VDJdb dataset to test how the findings transfer to non-viral peptides.
In the structure-based DS, as the median RMSD distance between training and test peptide is increased, the model performance generally worsens. On the other hand, we observe the opposite effect in sequence-based DS, where increasing the BLOSUM sequence between training-test leads to better generalization performance (See Fig 3).
The immune system has evolved to recognize multitudes of pathogens, some of which may have never existed. The TCR-pMHC interaction is therefore degenerate, meaning that many TCRs can recognize the same peptide and that a TCR can recognize millions of peptides, even if the sequences are not identical [53]. Additionally, TCRs may bind different peptides using different binding modes [54]. However, as shown in Fig 2, we find no correlation between sequence and shape distance metrics for the peptide. Coupled with the results of Fig 3, this may imply that TCRs bind conformationally similar viral peptides even if the sequences are different. This is supported by the fact that the peptide binding pocket of MHCs have anchor positions, like the B pocket, which can accommodate different secondary anchor amino acids without altering the overall peptide conformation [55, 56]. This flexibility allows TCRs to recognize structurally similar peptides despite sequence variations, suggesting that structural rather than purely sequence-based features drive TCR recognition efficiency in some cases [55, 56]. Additionally, viruses, like SARS-CoV-2, evolve by changing their sequences to escape immune detection while keeping the structural features needed to interact with host cells. For example, mutations in the spike protein allow SARS-CoV-2 to evade the immune system but still bind effectively to the ACE2 receptor, ensuring the virus remains functional [57, 58]. This suggests that TCRs may bind structurally similar peptides even when their sequences differ, which could explain why models trained on sequence-diverse data (high BLOSUM distance) show improved generalization. In the BLOSUM (0,33) split, where training peptides share higher sequence similarity, models may overfit to sequence-based patterns, potentially memorizing sequence motifs rather than learning underlying structural determinants of binding, leading to lower test performance. This may explain why models trained on more sequence-diverse peptides (33,66) and (66,100) generalize better, as they are forced to rely on broader structural and contextual features.
Additionally, in the field of protein design, benchmarking has revealed that even with a sequence similarity below 50%, the predicted 3D structures look very similar [51]. These results suggest that BLOSUM-based sequence augmentation, aimed to generate similar binding peptides for a given TCR, may reduce generalization. In contrast, a shape-based augmentation, for example by using protein sequence design models to generate new peptides, may enhance generalization to unseen peptides [49, 59]. Additionally, given that the peptide sequence and shape are not correlated, including the 3D shape may increase the overall performance, as shown by recent papers [60–63] However, as TCR and MHC structures are generally similar, it may be worthwhile to limit the 3D information to the regions that vary in shape, like the peptide, the CDRs, and the binding pocket to reduce the model complexity and maintain performance [62]. Future work can explore the differences in structural distance between models such as AlphaFold, ESMFold and OmegaFold in the context of peptides, to further evaluate biases and divergences of structures generated by different sequence-to-shape models.
A short-coming of the BLOSUM distance for the peptide is that changes in the anchor amino acids should be weighted more than changes outside the anchors. Future work could explore a more peptide-centric metric of distance based on anchor positions. Similarly, a short-coming of the RMSD is that due to differences in peptide lengths, it is not possible to normalize the distance by the length of the peptide. In our pairwise comparison, however, most (about 87%)2 of the peptide distances were same-length comparison. Furthermore, the RMSD is computed on predicted structures, which may contain modeling artifacts. These artifacts could introduce noise in structure-based splits, potentially leading to misleading structure-based generalization patterns. Future work could explore the RMSD of the core amino acids of the peptide and using length-normalized RMSD like RMSD100 [64].
Additionally, datasets with highly clustered peptides may present disproportionately low median distances, leading to unintended effects where the test set consists largely of local clusters rather than globally diverse peptides. Conversely, outliers with high median distances will always be placed in test or validation, making these sets more challenging than intended. We mitigate this by enforcing minimum and maximum peptide counts. Future work could explore clustering-based analysis of the peptide datasets or clustering to produce balanced peptide representation in training-test splits.
Despite the limitations of both sequence- and structure-based metrics, the DS algorithm is designed to be metric-agnostic, meaning it can be applied to any distance function that quantifies peptide similarity (see S1 File Sect 2.2.4 for results obtained using the Min operation instead of Mean in the DS algorithm). However, the choice of distance metric significantly impacts the nature of the training-test split and, consequently, model generalization. Future work could explore more biologically or immunologically relevant distances, such as a BLOSUM matrix derived from TCR-pMHC interaction data or alternative physico-chemical distance metrics [65].
Sequence-to-shape models could also be used to predict full TCR complexes for binding prediction. However, current methods for structural complex modeling are computationally expensive. While they may be comparably faster than molecular dynamics simulations,2 248,630 matching length pairs comparisons out of 285,090 using 757 peptides it is currently unfeasible to test all possible combinations of TCRs, peptides and pMHCs for the binding prediction task. Future work in lightweight representations for proteins, such as fragments, could help bridge this gap, allowing for faster and more efficient incorporation of structural features in TCR binding prediction [66].
Conclusion
Our results suggest that the use of 3D shapes in the context of TCR-pMHC interaction prediction could help reduce the uncertainty about the generalization capabilities of ML models to unseen sequences. Given enough computational resources, 3D shapes could be predicted for the whole TCR structure, as well as for the MHCs presenting the peptide. This would enable the design of models that, given the individual structures of each input sequence, will predict more accurate binding interactions.
Supporting information
S1 File. Supplementary Tables and Figures
Supplementary Tables and Figures. Supplementary figures and tables referenced in the main text, including TCR-pMHC complex diagrams, additional model results, and extended correlation tables. See also Supplementary Sects 2.1–2.3 for details
https://doi.org/10.1371/journal.pone.0324011.s001
(PDF)
References
- 1.
Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K. Molecular biology of the cell. WW Norton & Company. 2017.
- 2. Rowen L, Koop BF, Hood L. The complete 685-kilobase DNA sequence of the human beta T cell receptor locus. Science. 1996;272(5269):1755–62. pmid:8650574
- 3. Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, et al. Identifying specificity groups in the T cell receptor repertoire. Nature. 2017;547(7661):94–8. pmid:28636589
- 4. Hewitt EW. The MHC class I antigen presentation pathway: strategies for viral immune evasion. Immunology. 2003;110(2):163–9. pmid:14511229
- 5. Eisenlohr LC, Huang L, Golovina TN. Rethinking peptide supply to MHC class I molecules. Nat Rev Immunol. 2007;7(5):403–10. pmid:17457346
- 6. Smith-Garvin JE, Koretzky GA, Jordan MS. T cell activation. Annu Rev Immunol. 2009;27:591–619. pmid:19132916
- 7. Feng D, Bond CJ, Ely LK, Maynard J, Garcia KC. Structural evidence for a germline-encoded T cell receptor-major histocompatibility complex interaction ``codon’’. Nat Immunol. 2007;8(9):975–83. pmid:17694060
- 8. Rossjohn J, Gras S, Miles JJ, Turner SJ, Godfrey DI, McCluskey J. T cell antigen receptor recognition of antigen-presenting molecules. Annu Rev Immunol. 2015;33:169–200. pmid:25493333
- 9. Qi Q, Liu Y, Cheng Y, Glanville J, Zhang D, Lee J-Y, et al. Diversity and clonal selection in the human T-cell repertoire. Proc Natl Acad Sci U S A. 2014;111(36):13139–44. pmid:25157137
- 10. Jameson SC, Masopust D. Understanding subset diversity in T cell memory. Immunity. 2018;48(2):214–26. pmid:29466754
- 11. Omilusik KD, Goldrath AW. Remembering to remember: T cell memory maintenance and plasticity. Curr Opin Immunol. 2019;58:89–97. pmid:31170601
- 12. He Q, Jiang X, Zhou X, Weng J. Targeting cancers through TCR-peptide/MHC interactions. J Hematol Oncol. 2019;12(1):139. pmid:31852498
- 13. Waldman AD, Fritz JM, Lenardo MJ. A guide to cancer immunotherapy: from T cell basic science to clinical practice. Nat Rev Immunol. 2020;20(11):651–68. pmid:32433532
- 14. Gilbert SC. T-cell-inducing vaccines - what’s the future. Immunology. 2012;135(1):19–26. pmid:22044118
- 15. Heslop HE, Leen AM. T-cell therapy for viral infections. Hematol Am Soc Hematol Educ Program. 2013;2013:342–7. pmid:24319202
- 16. Springer I, Tickotsky N, Louzoun Y. Contribution of T cell receptor alpha and beta CDR3, MHC typing, V and J genes to peptide binding prediction. Front Immunol. 2021;12:664514. pmid:33981311
- 17. Montemurro A, Schuster V, Povlsen HR, Bentzen AK, Jurtz V, Chronister WD, et al. NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data. Commun Biol. 2021;4(1):1060. pmid:34508155
- 18. 3984625115748Weber A, Born J, Rodriguez Martínez M. TITAN: T-cell receptor specificity prediction with bimodal attention networks. Bioinformatics. 2021;37(Suppl_1):i237–44. pmid:34252922
- 19. Grazioli F, Machart P, Mösch A, Li K, Castorina LV, Pfeifer N, et al. Attentive variational information bottleneck for TCR-peptide interaction prediction. Bioinformatics. 2023;39(1):btac820. pmid:36571499
- 20. Springer I, Besser H, Tickotsky-Moskovitz N, Dvorkin S, Louzoun Y. Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs. Front Immunol. 2020;11:1803. pmid:32983088
- 21. Koyama K, Hashimoto K, Nagao C, Mizuguchi K. Attention network for predicting T-cell receptor-peptide binding can associate attention with interpretable protein structural properties. Front Bioinform. 2023;3:1274599. pmid:38170146
- 22. Springer I, Tickotsky N, Louzoun Y. Contribution of T cell receptor alpha and beta CDR3, MHC typing, v and j genes to peptide binding prediction. Front Immunol. 2021;12:664514. pmid:33981311
- 23. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019;47(D1):D339–43. pmid:30357391
- 24. Bagaev DV, Vroomans RMA, Samir J, Stervbo U, Rius C, Dolton G, et al. VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium. Nucleic Acids Res. 2020;48(D1):D1057–62. pmid:31588507
- 25. Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N. McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics. 2017;33(18):2924–9. pmid:28481982
- 26. Piepenbrink KH, Gloor BE, Armstrong KM, Baker BM. Methods for quantifying T cell receptor binding affinities and thermodynamics. Methods Enzymol. 2009;466:359–81. pmid:21609868
- 27. Grazioli F, Mösch A, Machart P, Li K, Alqassem I, O’Donnell TJ, et al. On TCR binding predictors failing to generalize to unseen peptides. Front Immunol. 2022;13:1014256. pmid:36341448
- 28. Deng L, Ly C, Abdollahi S, Zhao Y, Prinz I, Bonn S. Performance comparison of TCR-pMHC prediction tools reveals a strong data dependency. Front Immunol. 2023;14:1128326. pmid:37143667
- 29. Mahajan S, Yan Z, Jespersen MC, Jensen KK, Marcatili P, Nielsen M, et al. Benchmark datasets of immune receptor-epitope structural complexes. BMC Bioinformatics. 2019;20(1):490. pmid:31601176
- 30. Korpela D, Jokinen E, Dumitrescu A, Huuhtanen J, Mustjoki S, Lähdesmäki H. EPIC-TRACE: predicting TCR binding to unseen epitopes using attention and contextualized embeddings. Bioinformatics. 2023;39(12):btad743. pmid:38070156
- 31. Moris P, De Pauw J, Postovskaya A, Gielis S, De Neuter N, Bittremieux W, et al. Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief Bioinform. 2021;22(4):bbaa318. pmid:33346826
- 32. Croce G, Bobisse S, Moreno DL, Schmidt J, Guillame P, Harari A, et al. Deep learning predictions of TCR-epitope interactions reveal epitope-specific chains in dual alpha T cells. Nat Commun. 2024;15(1):3211. pmid:38615042
- 33. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. pmid:19304878
- 34.
Schrödinger LLC. The PyMOL molecular graphics system, version 1.8. 2015.
- 35. Shugay M, Bagaev DV, Zvyagin IV, Vroomans RM, Crawford JC, Dolton G, et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Res. 2018;46(D1):D419–27. pmid:28977646
- 36. Jurtz V, Jessen L, Bentzen A, Jespersen M, Mahajan S, Vita R. NetTCR: sequence-based prediction of TCR binding to peptide-MHC complexes using convolutional neural networks. BioRxiv. 2018:433706.
- 37.
Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic level protein structure with a language model. Cold Spring Harbor Laboratory. 2022. https://doi.org/10.1101/2022.07.20.500902
- 38.
Wu R, Ding F, Wang R, Shen R, Zhang X, Luo S, et al. High-resolutionde novostructure prediction from primary sequence. Cold Spring Harbor Laboratory. 2022. https://doi.org/10.1101/2022.07.21.500999
- 39. Wagner RA, Fischer MJ. The string-to-string correction problem. J ACM. 1974;21(1):168–73.
- 40. .Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915–9. pmid:1438297
- 41. Moore PB, Hendrickson WA, Henderson R, Brunger AT. The protein-folding problem: not yet solved. Science. 2022;375(6580):507. pmid:35113705
- 42.
Jensen MF, Nielsen M. NetTCR 2.2 - improved TCR specificity predictions by combining pan- and peptide-specific training strategies, loss-scaling and integration of sequence similarity. eLife Sciences Publications, Ltd. 2024. https://doi.org/10.7554/elife.93934.2
- 43. Mechelke M, Habeck M. Robust probabilistic superposition and comparison of protein structures. BMC Bioinformatics. 2010;11:363. pmid:20594332
- 44. Ballone A, Lau RA, Zweipfenning FPA, Ottmann C. A new soaking procedure for X-ray crystallographic structural determination of protein-peptide complexes. Acta Crystallogr F Struct Biol Commun. 2020;76(Pt 10):501–7. pmid:33006579
- 45. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
- 46. van Breugel M, Rosa E Silva I, Andreeva A. Structural validation and assessment of AlphaFold2 predictions for centrosomal and centriolar proteins and their complexes. Commun Biol. 2022;5(1):312. pmid:35383272
- 47.
Goswami A, Kumar SM, Ullah S, Gore MM. De novodesign of anti-variant COVID-19 Vaccine. Cold Spring Harbor Laboratory. 2022. https://doi.org/10.1101/2022.10.20.513049
- 48. Chang L, Perez A. Ranking peptide binders by affinity with AlphaFold**. Angewandte Chemie. 2023;135(7).
- 49. .Castorina LV, Ünal SM, Subr K, Wood CW. TIMED-Design: flexible and accessible protein sequence design with convolutional neural networks. Protein Eng Des Sel. 2024;37:gzae002. pmid:38288671
- 50.
Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. Cold Spring Harbor Laboratory. 2022. https://doi.org/10.1101/2022.12.09.519842
- 51. Castorina LV, Petrenas R, Subr K, Wood CW. PDBench: evaluating computational methods for protein-sequence design. Bioinformatics. 2023;39(1):btad027. pmid:36637198
- 52. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol. 2003;10(12):980. pmid:14634627
- 53. Lee CH, Salio M, Napolitani G, Ogg G, Simmons A, Koohy H. Predicting cross-reactivity and antigen specificity of T cell receptors. Front Immunol. 2020;11:565096. pmid:33193332
- 54. Coles CH, Mulvaney RM, Malla S, Walker A, Smith KJ, Lloyd A, et al. TCRs with distinct specificity profiles use different binding modes to engage an identical peptide-HLA complex. J Immunol. 2020;204(7):1943–53. pmid:32102902
- 55. Kersh GJ, Miley MJ, Nelson CA, Grakoui A, Horvath S, Donermeyer DL, et al. Structural and functional consequences of altering a peptide MHC anchor residue. J Immunol. 2001;166(5):3345–54. pmid:11207290
- 56. Nguyen AT, Szeto C, Gras S. The pockets guide to HLA class I molecules. Biochem Soc Trans. 2021;49(5):2319–31. pmid:34581761
- 57. Barton MI, MacGowan SA, Kutuzov MA, Dushek O, Barton GJ, van der Merwe PA. Effects of common mutations in the SARS-CoV-2 Spike RBD and its ligand, the human ACE2 receptor on binding affinity and kinetics. Elife. 2021;10:e70658. pmid:34435953
- 58. Xue S, Han Y, Wu F, Wang Q. Mutations in the SARS-CoV-2 spike receptor binding domain and their delicate balance between ACE2 affinity and antibody evasion. Protein Cell. 2024;15(6):403–18. pmid:38442025
- 59. Qi Y, Zhang JZH. DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. J Chem Inf Model. 2020;60(3):1245–52. pmid:32126171
- 60. Ji H, Wang X-X, Zhang Q, Zhang C, Zhang H-M. Predicting TCR sequences for unseen antigen epitopes using structural and sequence features. Brief Bioinform. 2024;25(3):bbae210. pmid:38711371
- 61. Li F, Qian X, Zhu X, Lai X, Zhang X, Wang J. TCRcost: a deep learning model utilizing TCR 3D structure for enhanced of TCR-peptide binding. Front Genet. 2024;15:1346784. pmid:39415981
- 62. Bradley P. Structure-based prediction of T cell receptor:peptide-MHC interactions. Elife. 2023;12:e82813. pmid:36661395
- 63.
Pham M-DN, Su CT-T, Nguyen T-N, Nguyen H-N, Nguyen DDA, Giang H, et al. epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction. Cold Spring Harbor Laboratory. 2024. https://doi.org/10.1101/2024.05.18.594806
- 64. Carugo O. Statistical validation of the root-mean-square-distance, a measure of protein structural proximity. Protein Eng Des Sel. 2007;20(1):33–7. pmid:17218333
- 65. Postovskaya A, Vercauteren K, Meysman P, Laukens K. tcrBLOSUM: an amino acid substitution matrix for sensitive alignment of distant epitope-specific TCRs. Brief Bioinform. 2024;26(1):bbae602. pmid:39576224
- 66.
Castorina LV, Wood CW, Subr K. From atoms to fragments: a coarse representation for functional and efficient protein design. Cold Spring Harbor Laboratory. 2025. https://doi.org/10.1101/2025.03.19.644162