Replicate immunosequencing as a robust probe of B cell repertoire diversity

Fundamental to quantitative characterization of the B cell receptor repertoire is clonal diversity - the number of distinct somatically recombined receptors present in the repertoire and their relative abundances, defining the search space available for immune response. This study synthesizes flow cytometry and immunosequencing to study memory and naive B cells from the peripheral blood of three adults. A combinatorial experimental design was employed, constituting a sample abundance probe robust to amplification stochasticity, a crucial quantitative advance over previous sequencing studies of diversity. These data are leveraged to interrogate repertoire diversity, motivating an extension of a canonical diversity model in ecology and corpus linguistics. Maximum likelihood diversity estimates are provided for memory and naive B cell repertoires. Both evince domination by rare clones and regimes of power law scaling in abundance. Memory clones have more disparate repertoire abundances than naive clones, and most naive clones undergo no proliferation prior to antigen recognition.


I. INTRODUCTION
Under threat from diverse pathogens having regeneration times orders of magnitude shorter than human lifetimes, the human adaptive immune system achieves specific response through random somatic recombination of sequence encoding specificity of antigen receptors on T and B lymphocytes. Recognition triggers an immune response, including cellular proliferation of the antigenspecific clone, recruitment to immunological memory, and, for B cells, production of antigen-specific antibodies. The repertoire of clones and their abundances evolve throughout an individual's life in response to exposures. Due to somatic hypermutation of the B cell receptor (BCR) triggered by antigen stimulation, B cells somatically evolve increasing antigen specificity as they proliferate. The repertoire of receptors is therefore extremely diverse.
Previous approaches to antigen receptor repertoire diversity estimation redeploy methods developed in the ecology and corpus linguistics literature to estimate species diversity and vocabulary size (see review [1]), respectively. Specifically, Poisson abundance models, with both parametric and nonparametric estimators, are used. Although conceptually erroneous, mark-recapture formulae have also been applied [2]. Antigen receptor repertoires more closely achieve the idealizations of these models than the their original applications; populations are very large and well-mixed, and detection probabilities are homogeneous. However, studies suffer from limitations in sequencing data that blunt sophisticated computational approaches.
Robins et al. [3] assessed T cell receptor (TCR) richness from high-throughput immunosequencing data using a nonparametric empirical Bayes method requiring divergent series regularization [4,5] (a substantially improved regularization technique, applied to estimating the molecular complexity of PCR libraries, is advanced in [6]). However, the sequencing read count assigned to each unique TCR (after error correction) was associated with its clonal abundance in the sample. This introduces noise and bias, since each single template is stochastically amplified by PCR. Although this high-throughput study captured the diversity of a realistic biological sample, inference of repertoire richness was problematic due to limited quantitation of sample abundance for each clone.
Rempala et al. [7,8] employed a likelihood model and posterior inference for mouse TCR richness using singlecell sequencing to quantitate sample abundance of T cell clones. Although this approach allows for precise quantitation of sample abundance, it is so low throughput (one cell per well on a 96-well plate) that diversity estimation was only possible for transgenic mice engineered to have dramatically limited TCR diversity. Although quantitatively principled, severe experimental limitations restricted the study to less biologically relevant repertoires.

II. EXPERIMENTAL DESIGN
In the present study a high-throughput and quantitatively robust (albeit indirect) probe of B cell clone sample abundance was devised. B cells from three adults were sorted into memory and naive populations (with two naive replicates for subject 1), each with ∼ 10 7 cells (Fig. 1a). Extracted DNA from each sample was evenly partitioned into 188 PCR replicates for amplification and uniquely barcoded for immunosequencing of the rearranged IgH locus [9], identifying clones by unique CDR3 sequence in their B cell receptor. Instead of relying on sequencing read counts to estimate a clone's sample abundance, we use its occupancy -the number of replicates it is observed in. In the regime of small occupancies, this approximates digital cell counting -a clone observed in only one replicate almost surely has a sample abundance of one cell. For larger sample abundances, co-occupancies become more probable, so occupancy in-  creasingly underestimates abundance. Clones with sample abundance much larger than the number of replicates will saturate, appearing in all replicates.
To address possible template quantity variation across replicates and non-detection effects, we selected the subset of 150 replicates for each sample having minimum variance in the number of unique clones. Removing replicates with outlying allocations of cells or underperforming amplification is necessary to avoid breaking exchange symmetries invoked in our model.

III. MODEL
We advance a combinatorial extension of a well-studied model of sample abundance, enabling application to occupancy data. After introducing a parameterization of this extended model, a maximum likelihood diversity estimation is introduced, validated with simulations, and applied to BCR repertoire occupancy data to infer both richness (the number of clonal species) and an index of relative diversity (evenness of clone abundances).

A. Poisson abundance model of replicate occupancy
As is canon in the ecology and corpus linguistics literature, we begin by modeling sampling from a diverse population as a superposition of homogeneous Poisson processes. A mixing measure, µ(λ), characterizes the distribution of Poisson rates over all categories (B cell clones, as identified by productively rearranged IgH CDR3 segment, in our case). Since a clone's Poisson rate, λ, is given by its repertoire fraction times the number of cells sampled, µ(λ) is tantamount to the repertoire clonal abundance distribution. Homogeneity entails the approximation that the repertoire is effectively an infinite reservoir (or is being sampled with replacement), such that the data is not sensitive to depletion of the population fractions of the sampled clones. An equivalent urn model has an urn with an infinite number of balls, an unknown finite number of ball colors, and nonzero fractional abundances for each color. The total number of balls (cells) sampled from the urn (repertoire) is taken to be a Poisson sample from a multinomial population. The marginal distributions of sample cellular abundance, j, of each clone are then independently and identically distributed as To model replicate occupancy, we assume that each sampled cell is randomly assigned to one of L possible replicates with equal probability (Fig. 1b). Because the sample material was partitioned equally among the L replicates, it is not strictly correct to assume that each cell is assigned to a replicate independently. However, for large samples this approximation is very accurate. If the sample contains N cells, then under this model the number of cells in each replicate is binomially distributed with N trials and success probability 1/L. The coefficient of variation is (L − 1)/N . The number of replicates used in this study was 150, and about 10 million cells were sequenced for all samples, leading to a coefficient of variation of about 0.004.
The distribution of a clone's replicate occupancy, i, conditioned on sample abundance, j, is then determined combinatorially as This is simply the ratio of the number of ways to partition j cells into i out of L replicates, divided by the total number of ways to allocate j cells among L replicates. j i denote Stirling numbers of the second kind, which count the number of ways to partition j distinguishable objects into i indistinguishable nonempty subsets.
Marginalizing over the hidden sample abundance gives the distribution of each clone's occupancy as where we have exchanged the order of integration and summation, and identified the sum as a well-studied exponential generating function for the Stirling numbers ([10], p.83). In formal power series notation, j i = j! z j (e z − 1) i /i! . We consider a finite-dimensional subspace of measures, µ θ (λ), parameterized by the vector θ, and thus write (1) as For a repertoire with clonal diversity S, the sample occupancy of each clone is drawn from distribution (2). Let l 1 , l 2 , . . . , l S denote the replicate occupancies of S labelled clones. For a very diverse repertoire and a limited sample, many clones will not be sampled, and thus have occupancy zero (the missing species). Due to exchangeability of the clone labels, it is sufficient to consider the frequencies of nonzero occupancies, defined by the vector indicator random variable o = (o 1 , o 2 , . . . , o L ), with o i = |{c ∈ {1, 2, . . . , S} : l c = i}| (the number of clones occupying exactly i replicates).
We may write a multinomial likelihood function as There are S − s missing species.

B. Parameterization
Antigen receptor repertoires have been observed to follow Zipf's law [11]: the logarithms of the frequencies of clones are inversely proportional to the logarithms of their ranks by frequency. As a continuous analog of this discrete power law behavior, we make the parametric ansatz dµ(λ) ∝ λ γ−1 exp − λa λ − λ λ b dλ. The exponential factors cut off scaling behavior from below and above, and correspond to minimum and maximum abundances in the repertoire. This distribution, properly normalized, is the generalized inverse Gaussian [12]. For Poisson abundance models, the parameters λ a and λ b are strongly asymptotically correlated in the likelihood for fixed γ [13,14]. This manifests as a ridge in parameter space that confounds likelihood maximization. A transformation that minimizes off-diagonal components of the Fisher information matrix is therefore introduced, resulting in the more orthogonal parameterization with parameter vector θ = (γ, ω, ξ). K γ (ω) denotes the modified Bessel function of the second kind, arising by imposing normalization. Excellent fits to naive and memory occupancy data were obtained with mixtures of two such distributions (see section IV). Lognormal and Pareto distributions were also considered, but produced substantially worse results. Under the parameterization (4), the distribution (2) becomes .

C. Maximum likelihood diversity estimation
Direct maximization of the likelihood (3) is computationally formidable, as it constitutes a mixed integer nonlinear programming problem. However, it may be factorized in the suggestive form where we define the binomial and zero-truncated multinomial An approach to approximate maximization of L (θ, S|o), proposed by Sanathanan as conditional maximum likelihood estimation [15], is to first computê which is independent of S and can be obtained by nonlinear numerical maximization of the log-likelihood m (θ|o) = log L m (θ|o). A constrained gradient ascent algorithm [16] was used in the present work. Differentiation gives gradient components of the form which may be evaluated by quadrature for the parameterization (4). Having computedθ, it remains to maximize the richness piece of the likelihood. A lemma due to Chapman [17] can be invoked to givê Sanathanan's articulation of an asymptotic theory for the estimatorŜ showed it to be equivalent to direct maximization of L (θ, S|o) for large S. A corresponding approach was taken by Rodrigues [18] in an empirical Bayes treatment to approximate a posterior distribution for S. The density µ θ (λ) is viewed as a prior which is realized in the repertoire for large S. Due to the large diversity of the BCR repertoire, we employ the diversity estimator S, first investigating its accuracy via simulation.

D. Shannon diversity in a Poisson abundance model
To quantify the degree of uniformity in repertoire clonal abundance we derive a standard entropy-based index of diversity applied to a Poisson abundance model. For a repertoire with richness S and clone-wise population fractions given by π 1 , π 2 , . . . π S , the Shannon index [19] is defined as the information entropy of the clonewise abundance distribution.
The maximum entropy, H o = log S, occurs when π i = 1/S for all clones. The Shannon equitability is defined as E H = H/H o = H/ log S. E H ranges on the unit interval, with unity denoting maximally uniform abundance. It is a measure of disparity in abundance, with smaller values indicating more disparity.
In a Poisson abundance model, each clone, i, is assigned a Poisson frequency, λ i , which is related to its population fraction, π i , by λ i = n π i , where n denotes the expected sample size.
With the measure parameterized by θ this becomes where we've defined the integral

The Shannon index for a Poisson abundance model is then
The equitability is then evaluated at the MLE as Finally, we define a measure of clonality as the complement of this quantity C ranges from 0 (minimally clonal, equal abundance across clones) to 1 (maximally clonal). For parameterization (4) the necessary integrals may be evaluated in terms of modified Bessel functions as and It is trivial to extended this to a model with two mixed generalized inverse Gaussians.

A. Simulation validation
To validate our methodology for inferring richness, simulations were performed by generating random draws from the likelihood (2). Fig. 2 shows violin plots for fractional error in diversity estimation for four sets of 100 simulations. Each violin is for a set of 100 simulations with identical diversity (S) and shows the distribution of the fractional error of the MLE, (Ŝ − S)/S.
The values of S for the four sets are 10 6 , 10 7 , 10 8 , and 10 9 . For all four sets, the expected sample size is fixed at about 4.7 million cells. This is achieved by tuning the scale parameter ξ inversely as S. This is necessary to address a property of the sampling model: the expected sample size is proportional to both S and ξ, but we want expected sample size to be the same in all simulations as we tune S. Remaining parameters were fixed at γ = −1 and ω = 0.01 across all simulations (values similar to those arising in analyzing real data). Even at the high end of diversity, the expected error is only a few percent, demonstrating the efficacy of conditional maximum likelihood estimation in estimating an unknown population parameter.

B. B cell diversity estimation
Results of the diversity estimation procedure described above are presented in Fig. 3 and Table. I. MLE richness and clonality inferences are shown in Fig. 3a. Occupancy data with visualized fits of the MLE are shown in Fig. 3b. Excellent fits are obtained for all data sets, as assessed by comparison to expectation values o i =Ŝ rθ(i), i = 1, 2, . . . , L, and with variation characterized by the quantile functions of the binomial marginal of likelihood (3) atŜ andθ. Estimated richness, clonality, and parameterization values are very consistent between the two samples of subject 1's naive BCR repertoire, and take on characteristic values according to cell population.

V. DISCUSSION
By synthesizing flow cytometry and replicate immunosequencing, approximate digital cell counting of memory and naive B cell repertoires of three adults was enabled, providing the deepest and most quantitatively robust characterization of the repertoire yet available. Diversity of the repertoire was inferred using a novel likelihood model devised for replicate-based presenceabsence data. Estimates of both clonal richness and evenness of abundance distributions were attained, showing consistency across individuals, but distinct clustering by cell population. Across naive samples, the estimated richness is similar to the expected number of total naive B cells in circulation, suggesting that the typical naive B cell undergoes no proliferation prior to antigen stimulation. Memory richness is consistent with several divisions on average, but higher disparity in abundance (indicated by lower clonality), likely corresponding to clonal expansions in response to antigen stimulation.  subject populationŜ (10 9 ) θ 1 θ 2 α C(Ŝ,θ)