Autoencoder based local T cell repertoire density can be used to classify samples and T cell receptors

Recent advances in T cell repertoire (TCR) sequencing allow for the characterization of repertoire properties, as well as the frequency and sharing of specific TCR. However, there is no efficient measure for the local density of a given TCR. TCRs are often described either through their Complementary Determining region 3 (CDR3) sequences, or theirV/J usage, or their clone size. We here show that the local repertoire density can be estimated using a combined representation of these components through distance conserving autoencoders and Kernel Density Estimates (KDE). We present ELATE–an Encoder-based LocAl Tcr dEnsity and show that the resulting density of a sample can be used as a novel measure to study repertoire properties. The cross-density between two samples can be used as a similarity matrix to fully characterize samples from the same host. Finally, the same projection in combination with machine learning algorithms can be used to predict TCR-peptide binding through the local density of known TCRs binding a specific target.


Introduction
T-cell receptor (TCR) repertoire formation results from two major procedures: Rearrangements of the TCR α and β chains gene segments (V (D) and J) and formation of the junctions between them [1]. During segment-joining, nucleotides are inserted and deleted at the junctions between pairs of rearranging genes, forming the highly variable complementarity determining region 3 (CDR3) that is directly implicated in antigen recognition, and further augmenting diversity [1,2].
To test the possibility of estimating the local density of T cell receptor sequences as defined by their CDR3 sequence, their V gene, and their clone size, we developed the ELATE encoder. This analysis is mostly focused on the beta chain since it has been assumed the repertoire diversity and its binding to target MHC-peptide are determined more by the beta chain than the alpha chain [19], and there are more beta than alpha-beta published datasets However, ELATE can be equally applied to the alpha chain.  [20] based approach to handle different lengths was also tested but had a lower decoder accuracy. Similarly, when comparing encoder and LSTM for predicting TCR-peptide binding, encoders provided better results [21]. We have also tested other approaches, such as aligning the CDR3 to the C and N termini and zero padding in the middle or limiting the analysis to CDR3 sequences of a given length. All such attempts led to a lower accuracy.

ELATE can produce accurate reconstruction and also accurate TCR distances
The decoder predicts the entire input vector, including the stop signal and what comes after it. All values after the predicted stop signal are then zeroed. To evaluate the performance of the AE, we used the percentage of sequences successfully reconstructed with up to K errors, where a mismatch in length was defined as an error for all Ks. Note that the current definition does not take into account the similarity between amino acids as measured through either their evolutionary distance or through their biochemical distances [22]. Such distances can be easily integrated in both the training and evaluation stages of ELATE.
In all the observed datasets, very few pairs of CDR3 sequences differ by less than 2 AA one from each other (more precisely have an edit distance of less than 2). Indeed, when the distribution of edit distance between every pair of sequences from pairs of random samples is calculated, the distributions are consistently normal, with practically no values below 2 (see a representing example in Fig 2A). Thus, even an AE producing 1 or 2 errors would lead to a representation closer to the original CDR3 than the vast majority of other sequences in a different sample. Note that the absolute number of non-shared CDR3 sequences within 2AA of a CDR3 sequence from another sample depends on the sampling depth. We thus report the fraction of sequences successfully reconstructed with 0, 1, or 2 errors (Fig 2).
To test the applicability of the AE, several previously published datasets with different categories were used in this analysis. The first set studied was the GLIPH dataset which is composed of seven TCR samples, where all TCRs in a sample share specificity to the same peptide [23]. Two more datasets of tumor-bearing mice from ImmunoMap and TILs [24] were used. The first had two categories: CD8 T cells responding to TRP2, a shared self-peptide tumor antigen, and SIY, a model foreign-antigen. The second is composed of samples taken from mice treated with RT and 9H10 immunotherapies and a combination of both. Also, a set of sorted naïve and memory cells (Benny Chain dataset), CD4+ and CD8+ from alpha and beta chains of three donors was collected [25]. To explore variations over time, we used the response to the yellow fever vaccine from different time-points in three pairs of identical twins [26]. Finally, a collection of TCRs with one sample from each host with any type of cancer and samples from healthy donors were collected from the Immune Epitope Database (IEDB). The cancer analysis was performed to show the applicability of ELATE to TCR samples not directly taken from deep sequencing experiments. The GLIPH and Benny Chain datasets (Table 1) do not contain V usage in their repertoire and the GLIPH dataset has no frequencies of clones.
While ELATE can quite precisely encode the CDR3 distribution ( Fig 2C and 2D), it fails to reproduce the distance between TCRs (Fig 2E). To handle that, we produced a novel distance maintaining encoder (EM). In this framework, besides minimizing the reconstruction error. An embedding loss (loss DIS , see Methods) of the sum over pairs of TCRs of the difference between the Euclidean distance between the embeddings and the original Euclidean distance A. Autoencoder description. The autoencoder is composed of two models: encoder and decoder. Encoder's input: a vector of n dimensions of concatenated one-hot (oh) vectors, representing a CDR3 sequence. Encoder's output (decoder's input): the embedded representation with 30 dimensions. Decoder's output: the reconstructed vector. The autoencoder is trained by minimizing the reconstruction MSE (Mean Squared Error). The classifier is a combination of the encoder (with fixed weights) and 4 more fully connected layers. The classifier is trained by minimizing the BCE loss (Binary Cross Entropy). B. Data preprocessing and distance maintaining. Encoder's input: Two 1× n one-hot representations of TCR sequences (x 1 , x 2 ). Each amino acid is represented by a 21 long one-hot representation, that includes the possibility of the end of the CDR3, followed by a one hot representation for the stop signal for the end of the CDR3 and, when applicable, a n V dimensional one-hot vector (where n V is the number of all possible Vs in a given dataset) for the V gene (optionally). The encoder's output and decoder's input is a 1× 30 embedded vector with lower dimension z for each input sequence (z 1 , z 2 ). Decoder's output: a 1× n reconstructed one-hot vector for each input (x 0 1 , x 0 2 ). In the EMB model, an additional loss term is added to reduce the difference between the original distances (Euclidean distances between every x 1 and x j ) and embedded vectors distances (Euclidean distances between every z i and z j ). The distance is computed on the output vector. However, when reporting the results, the definition of the accuracy is based on a vector after the max argument on each 21 positions is performed. https://doi.org/10.1371/journal.pcbi.1009225.g001

PLOS COMPUTATIONAL BIOLOGY
between the CDR3 one-hot vectors is added at the hidden layer. When the V gene of each TCR was provided, a V gene-specific E was also concatenated to the CDR3 encoder. The V-based encoder can be used with or without the embedding loss.
The accuracies with zero, one, and two mismatches for all four models (E, VE, EM, VEM) are shown for the test set of all datasets (except for GLIPH and Benny Chain when analyzing V) in Fig 2C, 2D, 2G and 2H. None of the receptors in the test set were used for either training or hyper-parameter tuning. The accuracy of E is consistently much higher than the EM, with or without V, with high accuracy of >90% in multiple data sets for 2 mismatches accuracy. The models with the additional embedding constraints have lower accuracy (Fig 2G and 2H), as expected given the need to minimize in parallel two loss functions. Similarly, encoding with V gives a lower accuracy than without it, again as expected from the larger number of degrees of freedom. However, as will be further shown, the encoding with the embedding is better for multiple applications. To test whether the addition of the distance loss does maintain the distance distribution, the pairwise edit distances of the input one-hot vectors of the CDR3s in the original set and the all distances in the new representations set were compared (Fig 2E and  2F). The correlation is higher in the constrained model ( Fig 2F) with a Spearman correlation of 0.9 (p<1.e-100), compared with the regular AE (correlation of 0.5, Fig 2E). All accuracies of E for the different datasets with no mismatches (fully matched, red), 1 mismatch (orange), and 2 mismatches (blue). D. VE accuracy. All accuracies of VE for the different datasets (except for datasets without Vs). E. Encoder and CDR3 distances correlation. Pairwise Euclidean distances in the set of CDR3 one-hot vectors (Xs) in the x axis and pairwise Euclidean distances in the embedded set (Zs) predicted by E in the y axis. The legend contains the Spearman correlation between the axes. F. Encoder's distances correlation when combined with distances. Pairwise Euclidean distances in the set of CDR3 one-hot vectors (Xs) in the x axis and pairwise Euclidean distances in the embedded set (Zs) predicted by EM in the y axis. G. EM accuracy.

ELATE can be used to detect features of single clones
To highlight the importance of the encoding, we first tested whether the encoder reveals the reported selection against specific AA in CDR3 sequences [27]. Samples composed of 100 TCR per sample from the Benny Chain dataset were projected using the AE. To properly visualize the projection, TSNE was used to lower the dimensions to 2 ( Fig 3A). Although we used TCR from different samples, consistently, TCRs with Cysteines (red color) were projected as clusters ( Fig 3A). To quantify the clustering, we divided all TCRs in the sample into two subgroups: all TCRs that contain Cysteine, and TCRs that do not contain Cysteine in the CDR3 beyond the first position. We computed the distance between each TCR to the closest TCR that contains a Cysteine. We performed T-test on these two subgroups (p-value = 3.3e-94, average with Cysteine = 5.072, average without Cysteine = 6.558). Moreover, CDR3 sequences containing a Glycine have lower densities than other CDR3 sequences ( Fig 3G). A similar difference is observed for Cysteine, but it fails to reach a significance level. For Proline, we observe an interesting difference, where TCRs with a single Proline are denser than either without or with 2 ( Fig 3G). We currently have no clear explanation for that.
Next, we tested whether the AE captures the CDR3 length distribution of the repertoire. Indeed, if distances are maintained, similar length CDR sequences should have close projections. Moreover, CDR3 with much longer or shorter than average sequences should be again projected to the periphery, as they are also less frequent and selected against [27,28]. Again, the AE projects the mid-length CDR3 sequences to the center of the distribution and the long and short CDR3 sequences to the periphery ( Fig 3B, where each color is a different length).
Public TCRs have been described in a variety of immune responses [29]. A public clone is a clone shared by different donors. Again, if the encoded projections represent the functionality of the repertoire, one would expect public clones to be more central. We calculated the radius from the center of mass of all projected TCRs in Benny Chain dataset. Note that such an analysis is only possible following the translation to real-valued representations. We computed the radius of TCRs vs their publicity, where the publicity of each sequence is defined as the number of samples containing it, out of 56 samples ( Fig 3C). Indeed, public clones are more central (spearman correlation, p<0.001). We further tested if public clones are in denser regions of the projection, where the self-density was defined as the density of other TCR projections from the sample around the clone. Density was estimated using a Kernel Density Estimation (KDE). Again public clones reside in denser regions than private clones (spearman correlation, p<0.001) (Fig 3C and 3D).
Finally, we checked the overall characterization of the TCR repertoire as a possible biomarker for different diseases [30], for example, a biomarker for cervical cancer [11]. We expect TCR samples (those are not actual rep-seq, but rather samples of reported TCRs) in such conditions to show distinct projections compared with random repertoires. To evaluate the scattering of different categories in the embedded space, we used the AE projections in two disease state datasets. Again, the vectors were projected to 2 dimensions using TSNE to visualize their positions. When analyzing the GLIPH data set, all of the T cells' clones that recognize the TPRVTGGGAM peptide are significantly clustered together (Fig 3E). In the cancer data set, when projecting the Cancer vectors to 2 dimensions, a separation between most TIL and naïve TCRs emerged (Fig 3F, and t-test with the y-axis position p<0.001). In contrast with the previous tests, the cancer samples are from the IEDB and are enriched for disease-specific TCRs.
One can now use this density to estimate whether peptide-specific TCRs are in denser regions than random peptides from the same sample. For almost all targets, TCRs specific for this target have a higher density than random TCRs, not defined to bind a known target. This could . The x axis is the number of samples containing this TCR clone and the y axis is the distance from the center of all projections. The Spearman correlation is in the legend. D. Self-density vs Publicity (Benny Chain dataset). Axes: y-The KDE between a TCR clone and its sample, x-number of samples that contain this TCR clone. E. Encoder's projection TSNE be a sampling bias or a real difference, but the current method is the first that allows for such a measurement (Fig 3G).

ELATE with KDE can be used to cluster samples
The autoencoder can be used to compare samples and not only to compare TCRs within a sample. Again, we used a constant sample size (100 here) to avoid sample size bias. We then defined a similarity between samples and tested whether this similarity can be used to detect which sample comes from the same host. The similarity was defined using the AE, the Kullback-Leibler (KL) divergence of the V gene usage on the sample, or the edit distance (ED) between CDR3 of different samples. The ED distance between each sequence from one sample and each sequence from the other was computed. The distance between the two samples was defined as the average of all minimum ED values (see Methods). For the AE-based distance, the similarity between samples A and B is defined as the average KDE-based density of sample A projected sequences based on the projected sequences of sample B. The distance was computed for different variants of the AE, including the basic autoencoder (E), the autoencoder combined with V (VE), or with CDR3 distances (EM) and combined with both (VEM). The KDE was performed twice, one time with the original set of vectors, and again when the clones frequency was incorporated (see Methods). We analyzed the vaccine dataset. This dataset contained samples from three pairs of identical twins. The samples were taken at six different time points before and after a yellow fever vaccination. Hierarchical clustering was performed using the distance/similarity definitions above. The clustering that best reproduces the division to hosts is VE (Fig 4A). Moreover, twins are closer to each other in the hierarchy than unrelated donors. When computing the distances using only the V usage (KL distance), the clusters are shuffled, with some distinction of twins but no clear separation of hosts (Fig 4B). To quantify the distance method that best characterizes hosts, we tested the hierarchical clusters host ID frequency and time-points entropies for each method (Fig 4C and 4D). All AEs produce a better detection of hosts (lower entropy) than ED and KL methods, and the basic autoencoder combined with V (VE) without the frequencies is the best method among all models ( Fig 4C). However, none of the methods separates samples by time-points (Fig 4D). To test the similarity within and between hosts and between twins using VE, we divided all distances into three groups: within-host, between twins, and between non-twin hosts. We performed an ANOVA on the distances (p = 3.735e-108). We then compared each group using a two-sample twosided t-test, and indeed the within-host distances (3.1e13 +/-4.8e12) are much shorter than the between twins distance (1.9e14 +/-9.7e12, p = 1.36e-48), and those are much shorter than the between non-twins (3.0e14 +/-6.0e12, p = 2.6e-16).

ELATE with KDE can be used for supervised sequence classification
Another important application of AE is sequence classification. A classifier was built on top of the AE. The classifier is a combination of the encoder with three fully connected layers, which replace the decoder (Fig 1). After the autoencoder was trained to predict the vector representation of the CDR3 set, the classifier layers are trained using a binary cross-entropy loss on the class detection problem. Such a classifier was built and trained for each of our four (GLIPH). TCRs with specificity to TRPVTGGGAM peptide are colored in red. The other groups are colored in grey. F. Encoder's projection TSNE (Cancer dataset). TCRs from hosts with cancer are colored in red and TCRs from samples collected from healthy hosts are colored in grey. G. Average self-density per feature. The average self-density of all 7 peptides in GLIPH data set and for TCRs with a specific number of G (Glycine), C (Cysteine) and P (Proline). The pvalue of t-test is shown as � (<0.05), �� (<0.01), ��� (< 0.001).
https://doi.org/10.1371/journal.pcbi.1009225.g003 autoencoders. Each classifier was trained as one vs all classifier (i.e. for each class, this class vs all others), for each of the categories (Table 1). All test set AUC values are shown in Table 2.
We compared the first two datasets with DeepTCR results for a single sequence classifier. When analyzing GLIPH and Benny Chain datasets, no Vs are provided, so there are no models with the combination of V. As for the last three datasets (Benny Chain, vaccine, and cancer), we have no comparison since there are no previous classifications for these repertoires. Deep-TCR's based classifiers typically outperform ELATE on GLIPH. However, ELATE based classifier outperforms the Deep-TCR's sequence classifier when classifying TRP2 and SIY samples (ImmunoMap) with an average AUC of 0.81 vs 0.71. ELATE also performs best when handling TCR samples taken from healthy hosts and hosts with cancer, with an average AUC of 0.925. Here, different combinations of the autoencoder show different results in different datasets and there is no clear pattern or preference for any model. No significant difference was

PLOS COMPUTATIONAL BIOLOGY
found between the EM or the VEM (whenever V was available) and the DeepTCR (pvalue = 0.418).

Self-density estimate using ELATE
Finally, we checked whether the AE can be used to determine how dense is a given sample. This can be defined by the average similarity between TCR in a dataset, or the frequency of other TCRs from the same dataset around each TCR. We analyzed again the Benny Chain data set of TCR repertoires, and compared alpha to beta chains, naïve to memory and CD4 to CD8 cells self-densities. The self-density of a given sample S is formally defined as the KDE-based density between S and itself, where each TCR is ignored for the computation of the density at its own projection. The average density of all groups was normalized to 0 over all projections for the ease of presentation (Fig 5).
We computed the log of the self KDE values among all samples for the two autoencoders (E, EM), with or without the frequencies, alpha chains have a lower self-density than beta chains. This is surprising given the theoretically much larger beta repertoire. When the embedding is used, a clear hierarchy can be observed alpha is less dense than beta, CD4 is on average less dense than CD8 and naïve are slightly less dense than memory. In the encoder or when the clone size is used, the same order is kept, but with a much noisier separation. Thus, again the embedding seems to be the method to represent receptors as real-valued vectors.

Discussion
Characterization of the repertoire can serve as an indicator for immune monitoring and prognosis assessment for different diseases [11]. Next-generation sequencing (NGS) allows massive

PLOS COMPUTATIONAL BIOLOGY
analysis of the TCR repertoire with high-throughput and provides descriptive and comparative analysis for million TCRs. The repertoire can be analyzed through different features such as the CDR3 set or V gene usage and more. However, current methods of analyzing the repertoire are focused on one measure or another. We have here proposed ELATE a method to integrate them. ELATE, is a projection of TCRs that combines different aspects of the repertoire into a coherent real-valued representation of each clone. ELATE conserves important features of the repertoire. It can be used with unsupervised and supervised methods to distinguish different groups and assign a sequence to a group. ELATE is highly efficient and can be applied to repertoires with millions of clones using standard hardware. Representing clones as vectors in real space has many advantages, including the possibility of measuring the repertoire density to detect sparse vs dense regions, computing the similarity between repertoires, using clones for machine learning applications, which are often based on real-valued input and visualizations. We have here shown that ELATE can indeed perform PLOS COMPUTATIONAL BIOLOGY these different tasks, and provide important insight on the repertoire, including among many others: A) The effect of amino acids in the CDR3 region on the repertoire density (with a decreasing density as a function of the number of Glycines or Cysteines), or the increased density around TCR recognizing cognate antigens.
B) The possibility of classifying single clones to specific cognate antigens or specific compartments.
C) The estimate of the repertoire density and the difference between the density of the repertoire in naïve and memory compartments.
D) The similarity between repertoire and showing that samples from different compartments and time points of the same donor are much more similar to other samples from the same donor than to any other sample. Samples of twins are much more similar than samples of non-twins.
While we have shown some typical applications of ELATE, the range of possible applications is much greater. Each flavor of ELATE has advantages and limitations. Distance maintaining versions are better for density estimates but produce worse sequence reproduction. Adding the V gene to the encoder improves the quality of clone classification but further reduces the reproduction accuracy. Incorporating the clone frequency helps to distinguish between the average density of some compartments, but often ruins the distance definition between compartments. Thus, for each application of ELATE, one must consider which element should be introduced.
We have used standard parameters for ELATE to ensure optimal reconstruction in the vanilla flavor of ELATE (AE). One may further improve the accuracy of different tasks by tuning the projection parameters for this specific task. However, we suggest that using a uniform projection within and between experiments is important to allow for a consistent method to compare repertoires among different settings.
We have tested other variants of such encoders both in the current analysis, and a TCR-pMHC binding prediction [21], mainly an LSTM, where the output of the last amino acid is the encoding. The LSTM accuracy was lower both here and in ERGO. Two main explanations may be suggested A) There is a limited structure in the center of the CDR3. B) The beginning and end of the CDR3 are practically constant, so there is no need for an LSTM to capture that.
A previous variational autoencoder (VAE) has been proposed in the literature-DeepTCR [12]. The autoencoder used here is different than the VAE used in DeepTCR. The main differences between ELATE and DeepTCR are that in ELATE all CDR3 sequences are converted to one-hot vectors as opposed to DeepTCR, where the sequences are converted to vectors using a trainable embedding layer. Given the wide CDR3 length distribution, ELATE uses right zeropadding to combine different lengths, and a novel loss function to optimize the projection. Finally, and most crucially, ELATE conserves distances, in contrast with all existing methods, allowing for the usage of KDE methods on the projection. At the technical level, ELATE uses only fully connected layers, while DeepTCR uses convolutional layers.

Studied samples
Six previously published datasets with different categories were collected to test ELATE (Table 3): • GLIPH (Glanville) [23]. TCR sequences were taken from T-cell binding to 7 different target peptides. Each peptide has one sample of TCRs with a high probability of sharing specificity to it. The TCRs from different individuals were clustered using the GLIPH algorithm.
• Cancer [32]. A collection of TCRs from human hosts with any cancer and healthy hosts was downloaded from the Immune Epitope Database (IEDB) (See Github for FASTA file of healthy and cancer TCRs).
• Not all datasets contain resolved Vs or clone frequencies. All categories, number of samples, and whether V and frequencies are available are summarized in Table 3 All datasets are available at https://github.com/louzounlab/Autoencoder.

Data curation
TCR sequencing files were collected from the sources above. The samples were parsed to compositions of CDR3 Amino Acid sequences and when available the V usage and the frequency per clone. Sequences with non-IUPAC letters (X, � , _, #) were ignored. For each dataset, we used the pre-processing performed in the original manuscript. The J usage was ignored here for the sake of clarity. It can be incorporated the same way as V. Table 4 is a detailed glossary of all technical terms used in the analysis.

Data preprocessing
The full flow of analysis is described in Fig 1. Since TCRs have varying lengths of CDR3 sequences, right zero-padding was added to all TCRs to generate a constant length representation. A stop signal (!) was first added at the right end of each sequence of length l. The sequence was then converted to a one-hot vector. Each character had a corresponding 21 dimensions one-hot vector (for 20 possible amino acids (AA) and an additional position for the stop signal). The one hot was right zero-padded to produce a constant length representation (Fig 1). Note that other alignment options could be considered, such as performing a multiple sequence alignment on the CDR3 sequences before the encoding. However, the proposed methods provided the highest accuracy on average on the tested samples. When combining V usage with the CDR3 sequences, the V genes were represented as n V dimensional one-hot vectors, where n V is the number of all possible Vs in a given dataset (each dataset has a different number of different recognized Vs based on the algorithm used for the clone analysis and the sequencing length). The CDR3 and V one-hot vectors were then concatenated. The final representation is the concatenated one-hot vectors in this order: AA 1 , AA 2, . . .., AA l , stop codon, zero-padding, and (optionally) V. The clone size was never used for the training of the autoencoder.

Autoencoder (AE) training
The AE of ELATE (all acronyms are detailed in Table 5) has three internal fully connected symmetric encoder and decoder layers of 300, 100, and 30 nodes. Each layer has a dropout rate of 0.1 and an 'Elu' activation function. Beyond that, there are three additional layers in the decoder (Fig 1A): A) A reshape layer (to n/21 x 21) that only changes the output vector to a matrix, where each row represents a position in the CDR3, or the V gene (Technically, the V is represented as different vectors, since it has a different dimension than the AA).
B) A fully connected layer with 'softmax' activation function on each row separately C) Another reshape layer (back to nx1 dimensions).
The AE was trained with an 'Adam' optimizer with a learning rate of 0.0001 and a Mean Square Error (MSE) loss (Eq 1). The number of epochs for the training varied for different datasets, where N is the number of samples in the training set.

Training AE to maintain the distance
To combine the original CDR3s distances in the training process, we first computed the NxN size distance matrix D N of all Euclidean distances among all pairs of one-hot representation of TCRs from all samples in the training set. Then, batches of b samples were generated from the training set along with their corresponding b size distances matrices D b produced from D N . Then, to conserve the distances in the embedded space, the Euclidean distances between pairs of projections and the original Euclidean distance between the original pairs in the input set (Eq 2). was added to the loss function (loss DIS ). where z i and z j are the embedded vectors of x i and x j and D ij is the Euclidean distance between x i and x j .

Autoencoder accuracy
To test the AE prediction, we used the fraction of sequences properly reconstructed, instead of the loss function used for the learning process (See Fig 1B). The reconstructed sequence was produced by replacing the softmax with a rigid max in layer B above, and a translation of the one-hot back to a sequence, stopping at the leftmost stop signal or at the end of the CDR3. All positions beyond the leftmost stop signal were ignored. If the length of the reconstructed vector was different than the length of the original vector, the entire reconstruction was defined as an error. Otherwise, the error level was defined as the number of positions differing between the original and reconstructed sequence. Thus, an error rate of 15% for one mismatch implies that 15% of all sequences either had a wrong length or the proper length and one mismatch.

Training classifier
The classifier is composed of the previously described encoder, combined with four fully connected layers with 30, 15, 10 and 1 dimensions, with a dropout of 0.1 after the second and third layers, and a Tanh activation-function except for the last layer, where we use a sigmoid. The classifier is trained with similar parameters as the encoder and a cross-entropy loss (Eq 3) (Fig 1A). When the classifier was trained, the encoder weights were fixed. For multi-class, the encoder was trained as one vs all.
loss BinaryCrossentropy ¼ À 1 N P N i¼1 y i logðpðy i ÞÞ þ ð1 À y i Þlogð1 À pðy i ÞÞ 3 The classifier was trained as one vs all classifier, for each of the categories in the various datasets.
The TCRs were divided into a training fraction composed of 500 TCR CDR3 sequences from each sample and all the remaining CDR3 sequences (the vast majority) as a test fraction. Hyperparameter tuning was performed using a single dataset and the E encoder.

KL similarity
When analyzing only V gene usage, the V-gene distributions were computed for a random set of R clones from each sample. Then, the distance between the distribution of a pair of samples p and q was defined to be their KL divergence [33] (Eq 4).

ED similarity
A two-sample edit distance was determined as the average of all minimum values per row in E, where E is an RXR distance matrix of the edit distances (Levenshtein distance) of R random clones in both samples. Levenshtein distance represents the number of insertions, deletions, and substitutions required to change one word to another 29 . When computing the self-distance, the diagonal values were ignored. Note that the edit distance is different than the one-hot distance.

KDE similarity
Given two projections of samples, S1 = {x 1 , x 2 , . . ., x R } and S2 = {y 1 , y 2 , . . ., y R } (or a single projection if S1 and S2 are the same), we estimated the local density using a Kernel Density Estimator KDE [18]. However, in contrast with standard KDE, we here found the density of each projection in S1 with respect to the projections in S2 (Eq 5).
where K is the kernel, and h> 0 is a smoothing parameter called the bandwidth. x is the projection of a TCR in sample S1, while y are all projections in sample S2 We can define the average density of the projections S1 in S2 as the average densities for all vectors in S1. When measuring the density of samples in S1 vs S2 in combination with the frequencies, each TCR clone was multiplied by its clone size (Eq 6).
Also, the new representation of the data allowed us to define for the first time the average self-density of a given repertoire, as the density when S1 and S2 are the same, and ignoring for each projected sequence its own contribution to its density. Here, we used Gaussian kernel and h = 1.06 � STD(S) � len(S) �� (-1/5) [34].

Statistical analysis
The difference between the various methods in our analysis was tested using two-way ANOVA [35], and Tukey's test. An ANOVA test was performed with all autoencoder models. Then, Tukey's test was performed between all methods and categories pairs to find which are significantly different. The clustering of samples was done using Hierarchical cluster analysis (HCA) [36] using average linkage on the KL, ED, and KDE distances matrices.

Visualization with TSNE
T-distributed Stochastic Neighbor Embedding (TSNE) [37] is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two-or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. We here use TSNE to visualize the embedded vectors of 30 dimensions in two dimensions.