Identifying T Cell Receptors from High-Throughput Sequencing: Dealing with Promiscuity in TCRα and TCRβ Pairing

Characterisation of the T cell receptors (TCR) involved in immune responses is important for the design of vaccines and immunotherapies for cancer and autoimmune disease. The specificity of the interaction between the TCR heterodimer and its peptide-MHC ligand derives largely from the juxtaposed hypervariable CDR3 regions on the TCRα and TCRβ chains, and obtaining the paired sequences of these regions is a standard for functionally defining the TCR. A brute force approach to identifying the TCRs in a population of T cells is to use high-throughput single-cell sequencing, but currently this process remains costly and risks missing small clones. Alternatively, CDR3α and CDR3β sequences can be associated using their frequency of co-occurrence in independent samples, but this approach can be confounded by the sharing of CDR3α and CDR3β across clones, commonly observed within epitope-specific T cell populations. The accurate, exhaustive, and economical recovery of TCR sequences from such populations therefore remains a challenging problem. Here we describe an algorithm for performing frequency-based pairing (alphabetr) that accommodates CDR3α- and CDR3β-sharing, cells expressing two TCRα chains, and multiple forms of sequencing error. The algorithm also yields accurate estimates of clonal frequencies.

Similarly, let be a random variable that is equal to 1 if clone appears only once in the sample, and zero otherwise. ( = 1) is ( / )(1 − 1/ ) −1 , and so the expected number of chains that appear only once is ⋅ ⟨ ⟩ = (1 − 1/ ) −1 . Figure A shows the dependence of these quantities on sample size using the current best estimate of in humans, 10 8 [3]. We can then estimate typical TCR clone size distributions obtained from bulk sequencing of T cells recovered from human blood samples. One millilitre of human blood yields typically 0.5 × 10 6 to 1.8 × 10 6 T cells, of which 0.1 × 10 6 to 0.8 × 10 6 are naive CD4 T cells and 0.03 × 10 6 to 0.2 × 10 6 are naive CD8 T cells. A 10 mL blood sample will then contain between 10 6 and 10 7 naive T cells. Eq 1.1 predicts that this will yield 1-10% of the chain repertoire, with correspondingly 99.5% and 95% of the chains recovered deriving from a single cell. In a 50 mL sample, we expect to cover between 5% and 40% of the chain repertoire, with correspondingly 98% and 80% of the chains deriving from only one cell.

conditions
In order to evaluate the robustness of alphabetr, we performed simulations to measure its ability to determine TCR pairs from a wide range of antigen-specific T cell populations with different levels of sharing, with different numbers of distinct clones, and without the presence of dual TCR chains.
Figures B-C show the results of simulations of populations with higher and lower levels of TCR -and TCR -sharing respectively. All other parameters and error models were identical to those in the main text. The higher sharing simulations in Figure B assumed 20% of TCR and TCR chains shared by two clones, 10% of TCR and TCR shared by three clones, and 10% of TCR and TCR shared by four clones (a total of 40% of all distinct chains being shared by at least two clones). The lower sharing simulations in Figure C included 5% of TCR and TCR chains shared by two different clones and 5% of TCR and TCR shared by three clones (a total of 10% of all distinct chains being shared by at least two clones). Higher sharing levels have a minimal effect on top and tail depths while causing an absolute increase in false pairing rates of only 1%-2%. Lower sharing levels also appear to have a minimal effect on top and tail depth while causing an absolute decrease in false pairing rates of approximately 1%.
Figures D-E show the results of simulations of populations comprising 3000 and 500 clones respectively to assess the effect of increased and decreased diversity on the performance of alphabetr. Error models and levels of sharing were as those used in the main text. Figure D shows that alphabetr maintains the same top depth for more diverse populations while obtaining slightly lower tail depths, which is not surprising given that larger total sample sizes are needed to achieve coverage of the larger tail of more diverse populations. There is no substantial difference in false pairing rates. In Figure E, the simulations show that alphabetr has similar top depths and higher tail depths for low-diversity populations. The higher false pairing rates seen here are due to the fact that using both of the mixed sampling strategies described in the main text (Table 2) involve sufficiently large numbers of cells per well. For low-diversity populations, common clones will appear together in wells very often with these sample sizes, creating ambiguity in pairing and increasing the apparent degree of sharing of chains between clones. Figure F shows the results of simulations of populations with no dual-TCR clones since these have been rarely reported in the literature. All other aspects of the simulated data are the same as those reported in the main text. For 5 plates and a moderate consensus threshold of 0.6, the simulations show up to 98% recovery of common clones and 61% recovery of rare clones with false pairing rates of less than 3.4%. The false pairing rate falls below 1% at a stringent threshold of 0.9.

mation regarding the probability that CDR3 regions fail to be sequenced
We performed simulations to determine the sensitivity of clonal frequency estimation to inaccuracies in estimates of the average drop rate of CDR3 and CDR3 sequences. These simulations involved creating data sets of 5 plates, the high-mixed sampling strategy and a mean drop rate of = 0.15. We then performed frequency estimation on these data sets, as described in Methods Section M2, using mean drop rates of = 0.15, = 0.08, and = 0. We show the mean bias in these frequency estimates, where Bias = Estimated frequency − True frequency True frequency The maximum likelihood approach for clonal frequency estimation involves modelling how a clone is sampled in the wells of the plates.
Since we assume conservatively that sequencing does not give any quantitative information about the number of times a clone is sampled in a well, the data fed to alphabetr need to indicate only whether a given CDR3 or CDR3 sequence is present in each well. Let denote the number of distinct sample sizes placed in the wells; = { 1 , 2 , … , } be the set of distinct sample sizes where is number of cells per well; and = { 1 , 2 , … , } be the set where represents the number of wells with sample size . Let denote the clone with chains and , and let denote the number of wells of size cells per well that contain and .
The likelihood of clone appearing in wells of sample size cells for = 1, … , is the probability of the clone being sampled in out of the possible wells, which is (1 − (clone sampled in well of size )) − We define = 1 − (clone sampled in well of size ), which the probability of clone not being sampled in a well of size . Since the appearances are independent, the probability of observing is determined by summing Eq (4.1) for all sample sizes and is given by which is Eq (3) in the main text. The probability is calculated by adding the probabilities of all of the events that would result in the clone not being sampled in the well, which are: • : clone not being sampled at all • : clone is sampled ≤ times (resulting in copies of chains and ), all copies of are dropped, and at least 1 copy of is not dropped • : clone is sampled ≤ times, at least 1 copy of chain is not dropped, and all copies of are dropped • : clone is sampled ≤ times, and all copies of and copies of are dropped.
The probability of event is given by Events and are symmetric, and the probability of these events is the probability that the clone will be sampled ≤ times multiplied by the probability of dropping all of one of the chains and not dropping at least one of the other chains. This is given by where is the probability of dropping of one of the component chains. The probability of event is derived similarly: Summing these yields Eq (4) in Methods of the main text.
The likelihood for dual clones is obtained in a similar fashion, where is calculated by summing the probabilities of the clone not being sampled at all and of being sampled but dropping one, two, or all three of the clone's chains.
The ratios in Eq (7) in Methods involve the expected number of wells in which the three chains 1 2 co-occur under the assumption that they derive from two -sharing clones 1 and 2 . We derive that quantity here. Let and be two clones that share . Let and denote the events of sampling clones and in a well of size cells respectively and and denote the complement of these events. The probability of sampling both clones in a well of cells is then In calculating Eq (5.2), including the effect of stochastic dropping of chains results in large multinomial coefficients that cannot be computed efficiently for wells with larger sample sizes (approximately ≥ 50 cells per well). Heuristically, however, neglecting the drop rate has no impact on discrimination of -sharing and dual TCR . We then have By substituting Eqs (5.3)-(5.5) into Eq (5.2) and multiplying by (the total number of wells of size ), we obtain the expected number of wells of size that contain both clones: By summing this quantity over all wells and sample sizes, we obtain Eq (8) in the main text, which forms the denominator of the ratio (Eq (7)). With inclusion of the drop rate, this ratio should be close to 1 for a true -sharing pair; neglecting dropping shifts this ratio to higher values, as seen in the left-hand cluster in Figure G, but discrimination of -sharing and dual TCR is still possible. The computational limitation on the calculation of multinomial coefficients does not exist for wells with smaller sample sizes, and so likelihoods can be directly calculated for clones that appear in these smaller wells. This approach is discussed in Section 6 of S1 Text.  Figure G: Discriminating between -sharing and dual TCR clones. For each candidate -sharing clone pair ( 1 , 2 ) we calculate the ratio of the observed number of co-occurences of the three chains to the number expected under the hypothesis they are indeed distinct clones. The latter uses the estimated frequencies of the two clones. These ratios typically partition into a lower set of values that represent sharers and those in which co-occurrences are more common (dual TCR ). Note that the lower cluster is not centred on one; this is because calculating the expected number of co-occurrences of all three chains under the two hypotheses with a non-zero drop rate is computationally intractable due to large multinomial coefficients. However, assuming = 0 biases both estimates, and -means still resolves the two clusters of ratios.

der the hypotheses of TCR -sharing and dual TCR
Discriminating between relatively abundant -sharing clones and dual TCR clones requires comparing the likelihoods of the data under these two hypotheses. Let and be a pair of candidate TCRs that share the same chain with frequencies and respectively. Given the data have wells of distinct sample sizes, we record the numbers of wells of each sample size that contain all three chains ( , , ) or contain only two of the three. For chains , , and , let , , , and ∅ denote the events of finding exactly chains , , , finding exactly chains and , finding exactly chain , and finding none of the chains in a well of sample size respectively. As before, let denote the number of wells with sample size cells per well, and let be the number of distinct sample sizes.
The likelihood of observing the data = { 1 , 2 , 3 , , ∶ = 1, 2, … , } under the hypothesis that and represent two -sharing clones is ℒ( |clone with frequency , clone with frequency ) = The likelihood of observing the data = { 1 , 2 , 3 , , ∶ = 1, 2, … , } under the hypothesis that and represent one dual TCR clone is The derivation of ( ) term for the two -sharing clone hypothesis is shown below, and the other terms can be derived in a similar fashion. We begin by writing down the events that would result in a well containing the chains and exactly (illustrated in Figure H  For event , we first calculate the probability of the two independent events of (i) sampling clone without sampling clone and (ii) not dropping all of the and chains of the sampled clone. For the former, we calculate the probability of sampling clone from 1 to times while not sampling . Each of cells in the well has a probability of of being clone and a probability 1 − − of not being clone or , which follows a binomial distribution. For the latter, each chain has a probability of being dropped, so if cells are sampled as clone , then the probability of dropping all chains from those cells is 1 − (and similarly 1 − for the chains. Combining these, we get We then calculate the probability of (i) sampling clone 1 times, sampling clone 2 times, and sampling any other clone − 1 − 2 times, (ii) not dropping all of the and chains of the sampled cells, and (iii) dropping all of the and chains of the sampled cells. This is a multinomial distribution of the three sampling events multiplied by the probability of not dropping all of the chains of clone while dropping all of the chains of , which is Adding these two together, we get For event B, we calculate the probability of 1 cells being sampled as , 2 cells being sampled as , and − 1 − 2 cells being sampled as neither of the two clones, where 1 ≥ 1, 2 ≥, 1 + 2 ≤ . This looks like a multinomial distribution of three events with probabilities , , and 1 − − occurring 1 , 2 , and − 1 − 2 times. This is multiplied by the probability of not dropping all of the chains from the 1 cells of clone , not dropping all of the chains from the 2 cells of clone , and dropping all chains from the 2 cells of clone . Then For event C, we calculate a similar multinomial probability as above and multiply it by the probability of dropping all chains from the 1 cells of clone , not dropping all chains from the 1 cells of clone , dropping all of the chains from 2 cells of clone , and not dropping all of the chains from the 2 cells of clone . From this, Since ( ) = ( ) + ( ) + ( ), we add all three expressions to obtain the term as stated above.
We assessed empirically that if the difference between the logarithms of the likelihoods in Eq (6.2) and Eq (6.1) is greater than or equal to 10, then chains , , and should be assumed to comprise a dual TCR clone. As noted before, the multinomial coefficients contained in these equations are computationally limiting for wells of large sample sizes, and so calculations of these likelihoods include only wells of sample sizes less than 50 cells per well. Since these wells are most likely to contain the common clones, this approach is applicable to distinguishing common -sharing and dual TCR clones.

populations with different clonal size distributions
We compared the depths achieved by single-cell sequencing and alphabetr by sampling cells from the same synthetic T cell populations. In addition to the comparison discussed in the main text (using a population of 2100 clones with 25 clones comprising the top 50%), we explored different skewnesses -that is, populations of 2100 clones with 5, 10, and 50 clones comprising the top 50% ( Figure I). All three sets of simulations recapitulate our conclusions that substantially larger numbers of samples need to be sequenced with single-cell methods to match the depths achieved by alphabetr, particularly for rare clones.

phocyte populations
The TIL sequencing data described by Howie et al. [4] provided an opportunity to test alphabetr on a real TCR sequencing dataset of restricted diversity. The data were obtained by sampling T cells from nine different tumour samples into one 96-well plate and determining the CDR3 and CDR3 sequences in each well. In addition, bulk sequencing of the CDR3 and CDR3 sequences was performed on PMBCs from each patient. This allowed for the association of a subset of the sequences found in the mixed 96-well plate with their tumour sources.
Because the mixed plate contained 561452 unique CDR3 nucleotide sequences and 955987 unique CDR3 nucleotide sequences, these data are too diverse for direct input into alphabetr. We therefore made tumour-specific virtual plates by matching CDR3 regions in the wells to the libraries of CDR3 sequences obtained from bulk sequencing of PMBC sampled from each patient. This was done by exact matching of the first 76 bases of the nucleotide sequences of CDR3 pairSEQ sequences and the last 76 bases of the nucleotide sequences of the CDR3 libraries of each tumour sample. For the CDR3 , matching regions were 81 bases in length. These choices reflect the different reads utilised by pairSEQ and immunoSEQ sequencing. Each plate of tumour-specific chains was then analysed using alphabetr to obtain pairs.
We compared the pairs from alphabetr to those pairs determined in ref. [4] for which both the CDR3 and CDR3 chains were explicitly associated with exactly one tumour. Howie et al. also reported clones for which only one chain was associated with a tumour and the other was not found in any of the samples. Since alphabetr was applied only to the chains from the mixed plate that were definitively associated with one tumour, we removed the partially-matched clones called in ref. [4] from our comparative analysis.   Figure 7 in the main text are recapitulated in these tumour-level comparisons; under the sampling strategy used in ref. [4], alphabetr is less efficient at identifying rare clones, and slightly more efficient at identifying clones present in more than 25% of the wells.