Population variability in the generation and selection of T-cell repertoires

The diversity of T-cell receptor (TCR) repertoires is achieved by a combination of two intrinsically stochastic steps: random receptor generation by VDJ recombination, and selection based on the recognition of random self-peptides presented on the major histocompatibility complex. These processes lead to a large receptor variability within and between individuals. However, the characterization of the variability is hampered by the limited size of the sampled repertoires. We introduce a new software tool SONIA to facilitate inference of individual-specific computational models for the generation and selection of the TCR beta chain (TRB) from sequenced repertoires of 651 individuals, separating and quantifying the variability of the two processes of generation and selection in the population. We find not only that most of the variability is driven by the VDJ generation process, but there is a large degree of consistency between individuals with the inter-individual variance of repertoires being about ∼2% of the intra-individual variance. Known viral-specific TCRs follow the same generation and selection statistics as all TCRs.


Author summary
The adaptive immune system is a naturally diverse set of many T cells with the potential to activate the organisms defense against specific threats. T cells express different surface receptors that can specifically bind molecules from viruses, bacteria or cancer cells. Using statistical models we learned the statistics of the processes generating this diversity from a large cohort of 651 individuals, including random generation and selection of T cells and their receptors. We identify the different sources of the observed variability, separating generation and selection effects. For this purpose, we developed a new computational tool SONIA that quantifies selection patterns in any sample of T cells by comparing statistics to background samples. We find common sources of variability in the population, showing that the variability in the population is secondary compared to the diversity of T cells

Introduction
Most organisms live in a similar environment, facing common pathogenic threats. However, the adaptive immune system, based on the stochastic VDJ recombination process, is a naturally diverse system, supporting both repertoire variability within the individual, and variability across the population [1]. Quantifying both types of variability, and understanding how they support a robust immune response, are still open questions. Determining the variability under normal healthy conditions is a crucial step for understanding the immune system in compromised situations such as infections, autoimmune diseases, and cancer. The adaptive immune system reacts specifically against a variety of different threats to the organism. This is achieved by maintaining a large ensemble of T cells, each having a different receptor that binds distinct subsets of antigens. The adaptive immune system maintains this diversity by generating a large repertoire of cells with different receptors [2][3][4] and then selecting them according to their binding properties. The first step of selection occurs in the thymus. Cells carrying receptors that bind too strongly or too weakly to the host's own proteins do not pass this selection [5,6]. The remaining cells are let out into the periphery and undergo selection for binding of foreign antigen which results in cell proliferation. In all cases, T cell receptors (TCR) bind to antigen fragments presented as short peptides on the major histocomptability complex (MHC) of presenting cells [7]. Each human individual has 6 types of MHC molecules encoded by the very polymorphic human leukocyte antigen (HLA) locus. All of these processes-receptor generation, selection, and peptide presentation-are stochastic in nature and depend on the host's genetic background.
High-throughput T cell repertoire sequencing (RepSeq) provides a census of the T cell repertoire found in a blood or tissue sample [8][9][10][11]. These samples are generally indicative of the true repertoire, and comparing them over a population yields similarities predicted based on MHCs, pathogenic history and general properties of the generation process [12,13]. Due to the large diversity of possible TCRs, different samples, even ones taken from the same individual under the same conditions, will often differ substantially due to statistical noise. As a result, characterization of a repertoire sample is often more reliably done by statistically modeling the underlying generation and selection processes instead of working with raw TCR sequences and read counts. In this paper we take such an approach to characterize the diversity of the human T-cell receptor beta chain (TRB) repertoire. This approach allows us to disentangle the two processes of generation and selection, and to quantify their relative contribution to the overall variability across individuals. Our results provide a quantification of natural TCR diversity which is essential for studying adaptive immunity in clinical contexts.

Data source and modeling strategy
We analyzed previously published RepSeq data from a large cohort study [14] consisting of TRB nucleotide sequences from blood samples of 651 healthy individuals. Sample sizes ranged from 50,000 to 400,000 unique CDR3 amino acid beta chains. For each individual i, we learned an individual-specific generation model, which describes the probability of generating a given amino-acid sequence σ by VDJ recombination, P i gen ðsÞ, and an individual-specific selection model, Q i (σ), defined as the fitness of each sequence upon selection. The resulting probability distribution of receptor sequences is P i post ðsÞ ¼ Q i ðsÞP i gen ðsÞ (Fig 1A). To learn these models, TRB nucleotide sequences were divided into productive and non-productive sequences, where productive sequences are defined as being in frame with no stop codon. The pipeline is summarized in Fig 1B. We applied the IGoR algorithm [15] to non-productive TRBs of each individual to learn P i gen ðsÞ. Productive sequences were used to learn individual-specific selection models Q i (σ) as in Ref. [17], by comparing them with simulated productive sequences generated from the individual specific generation models. A new software package, SONIA, was developed to perform the Q i (σ) inference. For each sequence in each individual the algorithm computes two probabilities: its generation probability, P i gen , and its post-selection probability in the periphery, P i post . We then use them to estimate the intra-and inter-person variability.

Individual variability of VDJ recombination statistics
The model of VDJ recombination, P gen , assigns a probability to each VDJ recombination scenario [4], where a scenario is a particular choice of the various recombination events: germline gene choice (V, D, and J), the number of deletions to those germline genes at the V-D and D-J junctions, and the number and identities of the untemplated, inserted nucleotides at each of the junctions (called N1 for the V-D junction, and N2 for the D-J junction). A detailed description of the model is given in the Methods section. Each recombination scenario determines a particular nucleotide sequence. The generation probability of a sequence is then the sum of all recombination scenarios that result in that sequence. Since the scenario is a hidden variable of the observed nucleotide sequence, we can use the Expectation Maximization algorithm to infer the maximum-likelihood estimator of the model parameters [4] using IGoR [15]. This model of VDJ recombination has been demonstrated to capture the statistics and correlations of unselected sequences [4,15]. Productive sequences then translate into an amino-acid sequences σ, and we denote by P gen (σ) the probability of generation of σ conditioned on it being productive, equal to the sum of the generation probabilities of all possible nucleotide variants divided by the probability to generate a productive sequence (an abuse of notation relative to the strict definition of P gen with no conditioning on being productive).
We find that the generation models learned from different individuals in our cohort, P i gen , are consistently similar to each other, with more variation in the gene usage than in the junctional diversity statistics (Fig 2). The distributions of the number of inserted N1 and N2 nucleotides vary little (Fig 2A). The biases of the untemplated inserted nucleotides, governed by a Markov model where the choice of each inserted base pair depends stochastically on the previous insertion [4], is also conserved across individuals (Fig 2B). Note that these probabilities are also similar for the N1 and N2 insertions provided that N2 is read in the anti-sense. Likewise, gene specific deletion profiles have very low variability (Fig 2E). By contrast, gene usage shows greater yet moderate inter-individual variability (Fig 2C and 2D). Overall, these results confirm the large level of reproducibility of the generation process over a large cohort.
We then asked whether these small individual variations in the recombination statistics were correlated as a result of shared biological mechanisms or genetic factors. We found that the numbers of insertions at the two junctions were highly correlated with each other (Pearson's r = 0.79), meaning that individuals that tend to have longer N1 insertions also tend to have longer N2 insertions on average (Fig 3A). N1 insertions were also slightly longer by *0.17 insertions on average. The variance of the number of insertions calculated over the repertoire of one individual is extremely correlated to its mean (Pearson's coefficients of 0.88 and 0.87, Fig 3B), suggesting a single individual-specific parameter controlling both N1 and N2  [14]. For each person i = A, B, C, . . . we define a personalized TRB generation model P i gen , and a personalized selection model Q i (σ), as both processes are expected to vary across individuals as a function of their immune history and genetic background, in particular their HLA type for selection. The generation model allows us to evaluate the probability of generating each receptor sequence σ in each individual i. Q i (σ) tells us how likely a given receptor amino-acid sequence σ is to pass thymic selection in a given individual. Combined together, the two models give the probability of a given TRB amino acid sequence in the repertoire of a given person P i post ðsÞ ¼ Q i ðsÞP i gen ðsÞ. (B) To learn these models, for each individual we separated sequences into productive and nonproductive sequences. Nonproductive sequences are free of selection effects and were used to learn the generation model, P gen , using the IGoR software [15]. Most productive sequences are subject to selection and were used to learn the selection model, Q, by matching the statistics of the data with those of sequences generated synthetically with P gen (using the OLGA software [16]) and weighted by Q. Once the model is learned, the probabilities of amino-acid TRB sequences pre-and post-selection can be calculated using OLGA and SONIA.
To quantify other correlations we calculated Pearson's correlation coefficient over the population between combinations of various parameters. In order to determine significance and account for the finite cohort size we also compute a 'shuffled' Pearson's coefficient for each parameter combination by scrambling the individuals to destroy correlations. Fig  To determine which types of parameters co-vary the most, we computed the rescaled standard deviation of the Pearson's correlation coefficients of all combinations of parameter types ( Fig 3D). This analysis reveals that V gene usage co-varies with itself, D and J usages are also correlated with each other, as well as N1 length with N2 length, and the insertion biases at N1 and N2 with each other.

Learning models of thymic selection with SONIA
After VDJ recombination, new T cells go through an initial selection process in the thymus before being released as naive T cells to the periphery. Positive thymic selection selects for functionally useful receptors, while negative selection removes T cells that recognize self-peptides to avoid auto-immunity. T cells that pass this thymic selection are then released into the periphery where some are activated and further selected by interacting with antigens. Selection skews the statistics of the repertoire of TRB sequences in quantifiable ways. This can be seen by comparing the length distribution of the Complementarity Determining Region 3 (CDR3, running from a conserved cysteine near the end of V segment through a conserved phenylalanine near the beginning of the J segment) of productive sequences drawn from the generation model to observed sequences ( Fig 4A). We observe a substantial narrowing of the distribution post-selection, eliminating sequences much longer or shorter than 14-15 amino acids [17].
To characterize these differences more systematically, we use a statistical model of selection to account for differences between the repertoire generated from the raw VDJ recombination (pre-selection) and the observed repertoire of productive sequences (post-selection). Since selection acts on the functionality of a receptor we restrict ourselves to productive amino acid sequence statistics. Mathematically, we require that the post-selection distribution, P post = Q (σ)P gen (σ), agrees with the statistics of productive sequences in the frequency of a select set of features, f 2 F , while remaining as close as possible to P gen (where distance is measured by the Kullback-Leibler divergence, ∑ σ P post (σ) ln(P post (σ)/P gen (σ))). This is done by choosing the sequence-specific selection factors Q(σ) which can be shown to take the form (see Methods): where F ðsÞ � F is the subset of features present in sequence σ. Solving for the factors q f that match the frequencies of features in the data is equivalent to maximum likelihood estimation (MLE). Features may be the presence of a given amino-acid at a given position, the use of a particular V or J gene, a particular CDR3 length, or any combination thereof. For example, some of the features of the TRB designated by (CASSGRQGVATQYF, TRBV06-05, TRBJ02-05) are 'CDR3 length 14', 'S in position 3 from the left', 'Y in position -2 from the right' and 'V gene is TRBV06-05'.
To facilitate the definition and learning of such selection models, we introduce the software package SONIA. SONIA allows for a flexible definition of model features and infers the selection factors q f using MLE. The input to SONIA is a list of selected amino acid sequences and, if needed, their V and J gene choice. By default SONIA uses P gen as provided by an IGoR inferred model (using OLGA as a generation engine [16]), but it can also take as an input a custom sample of pre-selection sequences. This can be useful for identifying selection pressures during immune challenges using different choices of pre and post-selection repertoires (see Methods for details).
We applied SONIA using two models corresponding to two choices of feature sets. In the LengthPosition model [17], features include all possible choices of combinations of V and J genes, all possible CDR3 lengths, as well as amino acids usage at each position and length ( Fig  4B, top). This choice allows for great flexibility at the cost of many parameters. The LengthPosition model replicates the results of Ref. [17] (Fig 4C).
The number of parameters can be reduced by noting that selection pressures on amino acids near the 5' (left) or 3' (right) end of the CDR3 appear to depend only on their relative position to that end, regardless of CDR3 length (Fig 4C). The Left+Right model exploits that regularity by defining features of amino-acid usage at positions relative to the 5' end of the CDR3 (denoted by a positive index), or to its 3' end (denoted by a negative index). This model has much fewer parameters, since features are defined for left and right positions regardless of CDR3 length, and can be written as a special parametrization of the LengthPosition model, in which each amino acid contributes to the selection factor through the product of a left and a right factor (Fig 4B, bottom).
To evaluate the accuracy of the Left+Right model, we computed its predictions for the frequencies of amino acid usages at each position and length (Fig 4D, see also S4 Fig for overall amino-acid usage). These statistics are by construction matched by the LengthPosition model but not necessarily by the Left+Right model, and thus provide a good test of the validity of the parameter reduction it affords. While predictions from VDJ generation model (blue dots) do not reproduce the empirical frequencies well, highlighting the need of a selection model, both the LengthPosition (red dots) and the Left+Right (black dots) models match the data well. As the Left+Right model captures the observed behavior with fewer parameters, we will work with this model for the remainder of the paper.

Population variability
To quantify more precisely the variability of the generation and selection processes across 651 individuals, we computed the distributions of log 10 P gen , log 10 Q, and log 10 P post for each individual (see Methods). Fig 5A-5C show the results as a density map over the entire population, indicating strong consistency between individuals. The distributions over sequences from the model (obtained by sampling from P post using importance sampling, black curve) agree very well with those obtained from the data (red). By contrast, sequences generated from P gen , without selection factors (blue) fail to reproduce the data. The shift to high Q values from the pre-to the post-selection model is present by construction in the distribution of the Q (Fig 5B), because the post-selection ensemble should be enriched in high selection factors. However, a similar shift to higher probabilities from pre-to post-selection is indicative of a correlation between the generation probability, P gen , and the selection factor, Q (Table 1, S5 Fig). This correlation suggests that evolution has shaped VDJ recombination to favor sequences that are likely to pass thymic selection, as previously argued [17]. Fig 5D summarizes the distributions of probabilities P in different probability ensembles of decreasing diversity: raw VDJ recombination scenarios (black), generated nucleotide sequences (green), pre-selection productive amino acid sequences (blue, same as the blue curves in Fig 5A), and post-selection productive amino acid sequences (red, the mean of which is the black curve in Fig 5C). The negative of the mean of log 10 (P) is, up to a ln(2)/ln (10) factor, equal to the Shannon entropy of the distribution expressed in bits, h−log 2 (P)i P . Fig 5E shows the distribution of these entropies across the population. The width of the distributions of log 10 P is strongly correlated with their means across individuals and also from pre-selection to post-selection (Fig 5F), suggesting again a single parameter driving individual variability, possibly the average number of N insertions.
We also plot the P post distributions of TRBs from the VDJdb database that are known to be specific to human viruses [19] (Fig 6). There does not appear to be a substantial shift in the post-selection probability of these viral-specific sequences as compared to productive TRBs from blood. A similar absence of bias was previously reported for the distribution of generation probabilities [16], suggesting that the VDJ recombination process is not explicitly skewed towards generating these viral-specific sequences. Our results further show that selection also does not seem to be biased to select for sequences specific to these viral epitopes. The distributions for all individuals are visualized using a density map indicating the local density of probability distribution curves over the cohort. (D) Density maps of the model distributions for the VDJ recombination scenarios, P scenario , the nucleotide sequences, P nt gen , the productive amino-acid sequences upon generation, P gen , and post-selection amino-acid sequences P post , over the population. The same convention for the density map is used. Error bars for (A-D) are the standard deviation over the population. (E) Distributions of the Shannon entropies of P scenario , P nt gen , P gen , and P post over the population. (F) Mean vs variance of log 10 P gen and log 10 P post over both productive and generated sequences for each individual (the curves in A and C). The x-axis highlights the shift to slightly higher average probabilities going from non-productive to observed productive sequences and from P gen to P post . The linear relationship between the mean and variance suggests the possibility of a single effective parameter explaining variability in the population.

Quantifying overall variability and its contribution due to generation and selection
The overall variability in the TRB repertoire can be characterized both between and within individuals in the population, by calculating the variance of the distribution of log 10 P post , which gives a measure of the typical fold-variation. Since log 10 P post (σ) = log 10 P gen (σ) + log 10 Q(σ), this variance can be decomposed as: Varðlog 10 P post Þ ¼ Varðlog 10 P gen Þ þ Varðlog 10 QÞ þ2Covðlog 10 P gen ; log 10 QÞ: To quantify the range of repertoire variability within an individual we calculate the variances and covariance of log 10 P i gen , log 10 Q i , log 10 P i post over the data sequences, and synthetic sequences for each individual. Table 1 summarizes the average of these variances over the 651  individuals. 80% of the variation comes from the generation process, with the remainder mostly stemming from a strong correlation between selection and generation, as previously discussed (Fig 5A and 5C, S5 Fig).
Variations in the probabilities of given sequences across individuals (averaged over sequences, see Methods for details) are much lower (Table 2), highlighting the high level of consistency in the population. The total variance of 0.091 in log 10 P post corresponds to relative variations of 10 � ffi ffi ffi ffi ffi ffi ffi 0:091 p 2 ðÀ 50%; þ100%Þ in the probability of sequences. While those differences are substantial in absolute terms, they are 1.6% of the variance over sequences within an individual (� 5.4, see Table 1). Much of this variance again stems from VDJ generation.
To further characterize variability, we learned 'consensus' or 'universal' models from sequences sampled randomly from each individual. To this end we inferred a consensus VDJ generation model (P univ gen ) from out-of-frame sequences, and a consensus Left+Right SONIA model (Q univ and P univ post ) from the productive sequences (Methods). We then compared each individual model to the universal model using the Jensen-Shannon divergence, an information-theoretic measure of distance between probability distributions expressed in bits and directly comparable to entropies (Methods). The distributions of JSD(P i gen ; P univ gen ) and JSD (P i post ; P univ post ) over the cohort highlight the consistency of these models with most individuals having < 0.3 bits JSD from both P univ gen and P univ post (Fig 7). This should be compared to the associated entropies of > 30 bits for either distribution (Fig 5E).

Discussion
By applying distinct computational procedures to the nonproductive and productive sequences of the TCR repertoires of a large cohort of 651 donors, we were able to learn individual-specific models of repertoires, separating the processes of generation and selection. This allowed us to quantify precisely the variability of each process within the population. Overall, while all underlying biological process are stochastic in nature, the emerging parameters are actually reproducible with low variability.
We found that the TRB generation process varied only moderately between individuals, with two main drivers: gene usage and average length of untemplated insertions. Because insertions contribute a lot to the generation probability, the latter is the main driver of variability in the distribution of P gen itself. V, D, and J gene usage variability may be due to variations in the regulatory signals, both genetic and epigenetic, that control the operation of the Recombination-Activating Gene (RAG) protein that initiates the recombination process [20]. It may also be due to variations in gene copy numbers of the gene segment, as was observed in the related case of the IgH locus [21]. Variations in the mean number of insertions could be attributed to differences in expression of TdT as well as other proteins involved in the nonhomologous end joining pathway [22]. We found that the inferred selection models were also variable between individuals, but the magnitude of these variations remains limited, which may be surprising considering that different individual's repertoires are subject to different selective pressures due to diverse HLA backgrounds. Overall, the ratio of the total variability across individuals to the variability across sequences within an individual (measured by the variance of the logarithm of the sequence probability) was only 1.6%, of which about 85% came from variations in the generation process, and 15% from the selection process. We also found that sequences that were previously identified to be specific to human viruses did not differ in their generation or selection probability from generic sequences from blood, finding no evidence in our models for an evolutionary mechanism to favor such viral-specific sequences (as suggested in [23]), neither in the process of VDJ recombination, nor through selection.
Previous work demonstrated that the statistical effects of thymic selection on a repertoire were well captured by a model where selection acts independently on each amino acid, regardless of the sequence context [17]. In this work we define a large class of models and use a new specific model (Left+Right) to train the selection models to differentiate between pre-selection and peripheral repertoires. To infer these models we used unsorted, productive, T cell clonotypes sampled from blood. As these cells were sampled in the periphery, they have undergone initial selection in the thymus, and possibly somatic expansion after antigen experience. The variability in the inferred parameters for the selection models is not large between the individuals in the data cohort, identifying reproducible features. This suggests that the main statistical effects captured by our model are consistent across many individuals. This leads us to think that the statistical features of selection are probably driven by positive thymic selection for amino acids that makes a folding functional receptor. The effect of HLA specific positive and negative selection, or selection by foreign antigens, on the other hand, might not be well captured by this kind of a model, which focuses on finding broad sequence features rather than specific sequences to harness more statistical power, although variations in the V and J selection factors may reflect HLA types. Our approach thus complements the strategy of looking for associations of particular TCRs with HLA type, which was previously applied to the same dataset [12]. An obvious limitation of this and other studies of that dataset is that it comprises a restricted subset of the human population.
While in this study we used SONIA for the purpose of comparing peripheral to pre-selection repertoires, the software is written to be flexible in several ways. First, if can be used to infer selection factors between any two repertoires (observed or generated), by inferring selection factors that match the statistics of the two samples. Second, SONIA can go beyond selection pressures on single amino acids, allowing features of pairs or motifs of amino acids. Finally, SONIA can be applied to other chains than TRB, notably the alpha chain of the TCR (TRA) as well as immunoglobulin IgH.
SONIA's flexibility opens up the possibility of using SONIA to find statistical correlations in various biological or clinical contexts. SONIA could be applied to samples that are known to have responded to some perturbations, for example after vaccination or infection [24,25]. In such a context clone sizes may be crucial to identify the underlying changes. To facilitate this, SONIA can also infer selection factors from read-count weighted repertoires. A major challenge in the field of immune repertoire profiling remains to decipher the specificity of the TCR-pMHC interaction. Vaccine design, immunotherapy and therapy for autoimmune conditions would all greatly benefit from the ability to find or design TCRs with known specificity. In the last couple of decades experimental methods have been developed for identifying TCRs specific to given antigens [26][27][28][29]. Based on accumulated TCR binding data [19], computational methods have been proposed recently that can find clusters of similarly reactive TCRs [25,[28][29][30], or to predict TCR specificity to a given epitope using machine-learning techniques [31][32][33][34]. SONIA could be used to learn flexible models of these antigen-specific TCR subsets and to study their organization. It could also be applied to identify specific selective pressures in particular subsets, defined by HLA specificity, pathogenic history, clinical status, T-cell phenotype (naive, effector, memory, CD4, CD8, regulatory T cells), or to differentiate distinct samples from the same individual, such as blood, tissue, or tumor samples.

Data
The data used for the inference of both the VDJ generation models and the subsequent selection models are the Adaptive Biotechnologies sequenced TRB repertoires of Emerson et. al. [14]. An initial quality control pass was done over the 666 individuals to ensure at least 10,000 unique out of frame sequences to be used to infer the VDJ generation model. 651 individuals passed this threshold and all were used in the subsequent analyses.
All analyses were done on unique nucleotide reads, discarding any cell count information. This is done to ensure that each sequence is reflective of a single recombination event, which is an important restriction when modeling VDJ recombination and thymic selection. For some selection modeling purposes (e.g. modeling antigen exposure), cell counts may be incorporated.
In practice, amino-acid sequences are reduced to the full amino acid CDR3 sequence and possible V and J genes, possibly more than one choice when the read aligned to multiple genes.
Sequences were determined to be productive and used in the selection analysis if they had a non-zero P gen . Beyond being an in-frame sequence without stop codons, this requires that a sequences retains the conserved residues defining the CDR3 region (Cysteine on the 5 0 end, Phenylalanine on the 3 0 end) as well as aligning to non-pseudo V and J genes.

Generation model
The generation model is defined at the level of the recombination scenarios in order to reflect the underlying biology of VDJ recombination. Each recombination scenario is defined by the gene choice (V, D, and J); deletions/palindromic insertions for each gene (d V , d D , d 0 D , and d J ); and the sequence of non-templated nucleotides at each junction (m 1 ; . . . ; m ' VD and n 1 ; . . . ; n ' DJ ). The probability of a recombination scenario is given in the factorized form: This model factorization, originally from Murugan et al, has been shown to capture the relevant correlations between the different recombination events in TRB [4]. The probability of a nucleotide sequence x is given by: and the probability of a productive amino-acid sequence is: where F = ∑ scenario|prod P scenario is the total probability that a random recombination event is productive (in-frame, no stop codons, preserves conserved residues, and does not use pseudogenes as germline gene choices). F can be computed directly from a generative model using OLGA [16].

Selection model
To minimize the Kullback-Leibler distance between P post and P gen while enforcing the constraints P s:f 2F ðsÞ P post ðsÞ � P post ðf Þ ¼ P data ðf Þ for each f, we extremize the following Lagrangian: where λ f are Lagrange multipliers constraining the frequencies of f, while μ ensures the normalization of P post . This extremization yields the form of P post : Defining q f ¼ e l f , and Z = e −μ , we obtain Eq 1. Given that form, the Lagrange multipliers must be adjusted to satisfy the constraints. Doing so is equivalent to maximizing the likelihood of the data under the model: where N is the number of data sequences. This can be shown by noting that the gradient of the log-likelihood, cancels when the constraints are satisfied.

SONIA implementation
SONIA is a python software built to define and infer feature-defined selection models. SONIA has built in procedures for defining and identifying sequence features of CDR3 sequences. SONIA also ships with the prepackaged selection models of LengthPosition and Left+Right features. With a feature model defined, SONIA takes as an input a list of productive amino acid CDR3s, along with any aligned V/J genes. This list of observed CDR3s can be either reduced to unique sequences (useful when learning thymic selection and the background statistics are based on unique sequences) or sequences taken with their clonality to account for a non-flat clone size distribution. As an optional input, SONIA can read in baseline CDR3 and aligned V/J genes to use as the background that the selection model is learned from. Alternatively, OLGA's sequence generation machinery [16] is built into SONIA so a generation model can be specified and background sequences automatically generated. SONIA has built-in methods to compute the feature marginals over the data sequences, background sequences, and the selection model. These marginals are use to fit the selection model iteratively using TensorFlow keras [35,36] with the Kullback-Leibler divergence as a loss function. We checked the convergence of the algorithm and its satisfying of the constraints after convergence (S6 Fig).
An inferred SONIA model can be used to compute overall selection factors Q of any sequence. In combination with OLGA, SONIA can compute P post and to generate selected sequences through rejection sampling.

Distributions of probabilities
We produced the distributions of P gen , Q, and P post shown in Fig 5A-5C by comparing the productive data sequences of each individual to a synthetic sample of productive sequences generated from P i gen of that individual using OLGA [16]. The number of generated sequences for each individual were matched to the number of productive data sequences. For each dataset, we calculated P i gen using OLGA, and Q i and P i post using SONIA's Left+Right model. The Qweighted curves are determined by weighting each generated sequence by its selection factor Q i and then renormalizing.
For Fig 5D, we used 300,000 scenarios, nucleotide, and amino acid sequences were generated from each individual's VDJ generation model. Again, we used OLGA to compute the various generation probabilities P i , where P i is P i scenario , P i;nt gen , or P i gen . Entropy was estimated as −hlog 2 P i i over the respective generated sample. For the post-selection ensemble (P i post ), the distributions were weighted by Q i computed by SONIA, and the entropy was calculated as −hQ(σ) log 2 [P gen (σ)Q(σ)]i over the generated amino acid sequences.

Inference and probability computation
Overall workflow is summarized in Fig 1B. VDJ generation models were all inferred using IGoR [15]. Amino acid P gen distributions were all computed using OLGA [16] according to the specified IGoR model parameters. All generated sequences were drawn from the corresponding VDJ generation model using OLGA. Lastly, selection models were all inferred, and evaluated using SONIA. The code for all processes is available on GitHub:

Quantifying variability
To produce the variances and covariances of Table 1 we took the productive data sequences from each individual along with an equivalent number of synthetic sequences drawn from the individual's VDJ generation model. For each sequence we computed P univ gen , Q univ , and P univ post using the consensus models. The variance and covariance of each quantity was computed over both the data sequences and generated sequences for each individual. These variances and covariances were then averaged over the individual cohort to yield the numbers in Table 1.
Error bars are the standard deviation over the cohort. For Table 2, we learned a consensus VDJ generation model P univ gen from nonproductive sequences sampled randomly from all individuals. 300,000 productive sequences were drawn from P univ gen to serve as a generated sequence pool. For data sequences we used 326,000 productive sequences sampled randomly from all individuals. We calculated for each sequence σ the individual specific P i gen , Q i , and P i post for each individual, then calculated the variances and covariances over i. Finally we averaged the results over the sequences σ from each pool.
The Jensen-Shannon divergence between two distribution P 1 and P 2 is defined as: with � P ¼ ðP 1 þ P 2 Þ=2. Scatter plots of log 10 (Q univ ) vs log 10 ðP univ gen Þ for (A) generated sequences drawn from P univ gen and (B) data sequences used to infer log 10 (Q univ ). The color scale indicates the local probability density of the points (on a log scale). This visualizes the correlation of P gen and Q as described in Tab. I. Q univ and P univ post are 'universal' models learned from sequences randomly drawn from all individuals. (PDF)