Comparing T cell receptor repertoires using optimal transport

doi:10.1371/journal.pcbi.1010681

Fig 1.

A schematic of TCR distribution comparison.

Each symbol represents a TCR in an abstract space in which distance is defined via TCRdist [30], and the two regions represent two population repertoires of interest. Each repertoire is given its own color (here orange and green). The purple arrow marks a region of the green repertoire's TCR distribution that has no close equivalent in the orange repertoire; our optimal transport methods are designed to identify such regions.

Fig 2.

An illustration of our optimal transport formulation of TCR repertoire comparison.

(A) A schematic of two TCR repertoires, R₁ and R₂, residing in an abstract space defined by TCRdist. The circle adjacent to each TCR displays its clonotype abundance. TCRdist values are shown (in green) from a single TCR to each of the TCRs in the other repertoire, although a TCRdist value is defined between each pair. (B) The mathematical objects that describe the setup illustrated in (A): the matrix of pairwise TCRdist values, a vector of distribution mass values for each TCR in R₁, a vector of distribution mass values for each TCR in R₂, and the optimal transport matrix P*.
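
The objects in panel (B) can be made concrete with a small numerical sketch. The names `D` (pairwise distance matrix), `a` and `b` (mass vectors for the two repertoires), and the function `optimal_transport_plan` are hypothetical labels chosen for this illustration, and the toy distances are made up rather than real TCRdist values. The sketch solves the exact transport linear program with SciPy, which is not necessarily the solver used in the paper.

```python
import numpy as np
from scipy.optimize import linprog


def optimal_transport_plan(D, a, b):
    """Return (P, cost) minimizing sum(D * P) with row sums a and column sums b."""
    D = np.asarray(D, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = D.shape
    # P is flattened row-major, so entry (i, j) sits at index i * m + j.
    # Row-sum constraints: sum_j P[i, j] = a[i]
    row_constraints = np.kron(np.eye(n), np.ones((1, m)))
    # Column-sum constraints: sum_i P[i, j] = b[j]
    col_constraints = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        D.ravel(),
        A_eq=np.vstack([row_constraints, col_constraints]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise ValueError(result.message)
    return result.x.reshape(n, m), result.fun


# Toy example: two TCRs per repertoire, equal mass on each TCR.
toy_D = [[1.0, 3.0], [4.0, 2.0]]
toy_a = [0.5, 0.5]
toy_b = [0.5, 0.5]
plan, cost = optimal_transport_plan(toy_D, toy_a, toy_b)
print(plan)
print(cost)
```

In this toy case each TCR sends all of its mass to its nearest counterpart, so the plan is diagonal. Abundance-weighted mass vectors would simply replace the uniform `toy_a` and `toy_b`.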

Fig 3.

A schematic of our clustering procedure in Algorithm 1.

Each point is a TCR portrayed in an abstract 2-D space, where the distance between points is determined by TCRdist. Our procedure starts by identifying the maximally lonely TCR t_max according to Eq (10). In each iteration, we step out s units of TCRdist, and compute the mean loneliness of all TCRs within the annulus defined by the current and previous radii (or ball in the first step). By construction of Eq (10), we expect the loneliness values to steadily decrease as we move away from t_max, until we arrive at a radius where the loneliness values have stabilized. This “breakpoint radius” thus defines the radius of our cluster.
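
The outward walk described in this caption can be sketched as follows. This is a minimal illustration only: the loneliness values are taken as given inputs rather than computed from Eq (10), and the stopping rule (consecutive annulus means differing by at most `tol`) is an assumption made for this sketch, not the breakpoint criterion of Algorithm 1. The function name `breakpoint_radius` and all parameter names are hypothetical.

```python
import numpy as np


def breakpoint_radius(dist_to_tmax, loneliness, step, tol=0.1):
    """Step outward from t_max in annuli of width `step`.

    Returns the inner radius of the first annulus whose mean loneliness
    differs from the previous non-empty annulus mean by at most `tol`
    (an assumed stabilization rule), or the largest distance if the
    means never stabilize.
    """
    dist = np.asarray(dist_to_tmax, dtype=float)
    lon = np.asarray(loneliness, dtype=float)
    max_radius = dist.max()
    prev_mean = None
    k = 1
    while (k - 1) * step < max_radius:
        inner, outer = (k - 1) * step, k * step
        if k == 1:
            in_shell = dist <= outer  # first step is a ball around t_max
        else:
            in_shell = (dist > inner) & (dist <= outer)
        if in_shell.any():
            shell_mean = lon[in_shell].mean()
            if prev_mean is not None and abs(shell_mean - prev_mean) <= tol:
                return inner
            prev_mean = shell_mean
        k += 1
    return max_radius


# Toy data: loneliness decays away from t_max, then flattens.
toy_dist = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_loneliness = [10.0, 8.0, 5.0, 2.0, 2.0, 2.0]
radius = breakpoint_radius(toy_dist, toy_loneliness, step=1.0)
cluster_members = [i for i, d in enumerate(toy_dist) if d <= radius]
print(radius, cluster_members)
```

A regression-based breakpoint estimate could replace the tolerance rule; the annulus bookkeeping would stay the same.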

Fig 4.

Visualizations of TRBV gene frequency statistics and CDR3aa sequence logos for the top three lonely clusters of the combined repertoire analysis: (A) OT-Tremont, (B) OT-Revere, (C) OT-Ida.

The height of each stack within the sequence logo is proportional to the level of that position’s conservation, and the height of each amino acid is proportional to that amino acid’s frequency in that position. The rows below each sequence logo display the occupancies, insertion probabilities, and expected insertion lengths of the respective positions.

Fig 5.

Plots of several statistics that describe the across-repertoire dynamics of the OT-Tremont, OT-Revere, and OT-Ida clusters.

(A) Distributions of cluster prevalence across combined repertoires, stratified by cluster and cell type. The DN distributions are consistently to the right of the CD4 distributions, showing that our algorithm is finding motifs that are highly represented in the DN repertoire compared to the CD4 repertoire. (B) Distributions of neighborhood loneliness scores across individual repertoires, stratified by cell type group (background/foreground) and cluster. N/A means not assigned to a cluster. We again see that the motifs distinguish DN sequences from CD4 sequences above the level of per-repertoire variation.

Fig 6.

A hierarchical clustering tree built from the matrix of optimal transport repertoire distances, with the tips colored by cell subpopulation: green for DN, orange for CD8, and blue for CD4.
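
As a rough illustration of how such a tree can be built from a repertoire distance matrix, the sketch below uses SciPy's hierarchical clustering. The distance values and repertoire labels are invented for the example, and average linkage is an assumption; the caption does not state which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric matrix of pairwise repertoire distances.
repertoire_labels = ["DN_1", "DN_2", "CD4_1", "CD4_2"]
repertoire_dist = np.array(
    [
        [0.0, 1.0, 5.0, 6.0],
        [1.0, 0.0, 5.5, 6.5],
        [5.0, 5.5, 0.0, 1.2],
        [6.0, 6.5, 1.2, 0.0],
    ]
)

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(repertoire_dist, checks=True)
tree = linkage(condensed, method="average")  # linkage method is an assumption

# Cut the tree into two groups to check which repertoires cluster together.
groups = fcluster(tree, t=2, criterion="maxclust")
for label, group in zip(repertoire_labels, groups):
    print(label, group)
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the tree, with tip colors then assigned from the subpopulation labels.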

Fig 7.

Visualizations of the relationship between replicate and randomization z-scores.

(A) Scatterplot of replicate z-scores versus randomization z-scores. (B) Marginal density estimates of replicate z-scores and randomization z-scores. We see strong evidence of a significantly positive linear relationship between these quantities, suggesting that our randomization procedure is able to identify significant differences between repertoire datasets.

Fig 8.

Various hit rate statistics for the YFV benchmark analysis.

(A) Hit rates of our responsive TCR inferences grouped by reference timepoint and cluster rank. (B) Aggregate hit rates of our responsive TCR inferences grouped by reference timepoint and cluster rank. (C) Hit rates of our responsive TCR inferences grouped by reference timepoint and donor, for cluster rank ≤2. (D) Hit rates of our responsive TCR inferences grouped by reference timepoint and donor, for cluster rank ≤10.

Table 1.

Counts of matches between our inferred responsive yellow fever virus (YFV) sequences and either YFV or cytomegalovirus (CMV) sequences obtained from VDJdb, where the CMV sequences are used as a control. Also provided are analogous counts for responsive sequences inferred by Pogorelyy et al. [20]. Columns S1–Q2 correspond to the six subjects discussed in [20] and in the Materials and Methods section.

Fig 9.

Recovery of “spike-in” epitope-specific TCRs from background naive TCRs based on optimal transport (left) or the ALICE algorithm (right).

The bars summarize AUROC values for 10 random replicates.
