Predicting B cell receptor substitution profiles using public repertoire data

B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same “clonal family”) are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called “substitution profiles”, are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many large, high-quality B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method “Substitution Profiles Using Related Families” (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on two external datasets. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) that implements these ideas and provides easy prediction using our pre-fit models.


Model Interpretation
In this subsection, we provide statistical motivation for our penalized regression model, which can be interpreted as specifying an ensemble of multinomial logistic regression models at each AHo position. We use some of the notation from the methods section and introduce new notation as needed. We begin by describing the structure of the multinomial logistic regression models and then discuss how we perform model averaging with these component models to form the estimator F(X) defined in the methods section. We conclude this subsection by showing that our regularized minimization problem can be characterized as a maximum a posteriori (MAP) inference problem.
Suppose we observe M amino acids at the jth AHo position for the ith CF; for simplicity, we let y_1, ..., y_M denote the observed amino acids. We assume that y_1, ..., y_M are drawn independently from a common multinomial distribution with 20 possible categories and define a logistic regression model for the amino acid probabilities that does not include covariates. The standard way to formulate such a model is as follows:

$$P(y_m = c) = \frac{\exp\left(\beta^{(c)}\right)}{\sum_{c'=1}^{20} \exp\left(\beta^{(c')}\right)},$$

where c indexes a particular amino acid, β^{(20)} ≡ 0, and m = 1, ..., M. We can equivalently represent the model as:

$$\log\left(\frac{P(y_m = c)}{P(y_m = 20)}\right) = \beta^{(c)},$$

where m = 1, ..., M. If we let p_c denote the observed proportion of amino acid c in the sample, then it is easy to show that maximizing the multinomial likelihood of the M observations with respect to the β^{(c)} parameters leads to the following parameter estimates:

$$\hat{\beta}^{(c)} = \log\left(\frac{p_c}{p_{20}}\right),$$

which implies that:

$$\hat{P}(y_m = c) = p_c,$$

where $\hat{P}(y_m = c)$ represents the logistic regression estimate of P(y_m = c). Therefore, this multinomial logistic regression model provides simple, intuitive estimates of the amino acid probabilities. While these logistic regression estimates may seem trivial, the underlying framework allows for easy integration of CF-specific and site-specific information into our model. The above model considers the amino acid probabilities at only one AHo position, so one could fit this multinomial logistic regression model at each of the 149 AHo positions in a CF to obtain a complete substitution profile estimate. In this paper, each CF-specific substitution profile estimate is a maximum likelihood estimate (MLE) obtained by fitting 149 multinomial regression models to the observed amino acid data in the CF. Given that our proposed modeling procedure computes a per-site weighted average between the subsampled profiles X and all the external profile estimates X*, one can interpret this model F(X) as defining an ensemble of multinomial logistic regression estimates at each AHo position.
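As a concrete illustration, the identity between the covariate-free multinomial logistic MLE and the observed amino acid proportions can be checked numerically. The following Python sketch (not part of SPURF; it uses 4 categories in place of the 20 amino acids, with the last category as the reference) fits the intercept-only model in closed form and recovers p_c:

```python
# Numeric check that the intercept-only multinomial logistic regression MLE
# recovers the observed amino acid proportions. Uses 4 categories for brevity
# instead of the 20 amino acids; the last category plays the role of the
# reference with beta^(ref) = 0.
import numpy as np

counts = np.array([5, 3, 0, 2], dtype=float)  # observed amino acids at one site
p = counts / counts.sum()                     # observed proportions p_c

# MLE: beta^(c) = log(p_c / p_ref), with the last category as reference.
# Add a small epsilon so unobserved categories do not produce -inf.
eps = 1e-12
beta = np.log((p + eps) / (p[-1] + eps))

# Map the fitted coefficients back to probabilities via the softmax.
probs = np.exp(beta) / np.exp(beta).sum()

print(np.allclose(probs, p, atol=1e-6))  # True: the logistic MLE is just p_c
```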
To specify this relationship more clearly, we denote the likelihood of Y_{i,j,•} as follows:

$$Y_{i,j,\bullet} \mid \alpha_{j,\bullet} \sim \mathrm{MVN}\left(\mu_{i,j,\bullet},\, \sigma^2 I_{20}\right),$$

where MVN denotes the multivariate normal distribution, μ_{i,j,•} signifies the mean vector of Y_{i,j,•} and defines our ensemble model, σ² represents an unknown variance parameter, I_{20} symbolizes the 20 × 20 identity matrix, and p represents the number of external profiles in X*. Note that μ_{i,j,•} depends on the multinomial logistic regression estimates described previously, as both X and X* contain the MLE-based substitution profile estimates. In addition, the form of μ_{i,j,•} relates to our previous definition of F(X) through the identity μ_{•,j,•} = f(X_{•,j,•}; α_{j,•}). We can also integrate the inequality constraints on the ensemble weights α_{j,l} into the likelihood function by including the indicator term

$$1_{\alpha} \equiv 1\left\{\, 0 \le \alpha_{j,l} \le 1;\; 0 \le \sum_{l=1}^{p} \alpha_{j,l} \le 1;\; \forall\, j, l \,\right\}.$$

The lasso penalties can be incorporated into our model through the use of sparsity-inducing prior distributions.
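To make the ensemble mean concrete, here is a small Python sketch of one assumed form of μ_{i,j,•} at a single AHo position: a per-site weighted average that places weight α_{j,l} on each external profile and the leftover weight 1 − Σ_l α_{j,l} on the CF's own subsampled profile. The explicit weighting formula, the shrunken dimensions (4 amino acids, p = 2 external profiles), and the variable names are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the ensemble mean mu at one AHo position j: a convex combination
# of the CF's own subsampled profile row and p external profile rows, with
# leftover weight 1 - sum_l alpha[j, l] kept on the CF's own profile.
import numpy as np

rng = np.random.default_rng(0)
n_aa, p = 4, 2

X_row = rng.dirichlet(np.ones(n_aa))                # subsampled CF profile at site j
X_star_rows = rng.dirichlet(np.ones(n_aa), size=p)  # p external profile estimates

alpha_j = np.array([0.3, 0.5])                      # ensemble weights at site j
assert np.all(alpha_j >= 0) and alpha_j.sum() <= 1  # the indicator constraint 1_alpha

mu = (1 - alpha_j.sum()) * X_row + alpha_j @ X_star_rows

# Because mu is a convex combination of probability vectors, it is itself a
# valid substitution profile (nonnegative, sums to 1).
print(np.isclose(mu.sum(), 1.0), np.all(mu >= 0))
```

This also makes clear why the constraints in 1_α matter: without them, mu could leave the probability simplex.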
Specifically, the priors placed on α can be represented as:

$$P(\alpha) \propto \exp\left(-\lambda_1 \sum_{j,l} \left|\alpha_{j,l}\right|\right) \exp\left(-\lambda_2 \sum_{l=1}^{p} \left\|\Delta^{(d)} \alpha_{\bullet,l}\right\|_1\right),$$

where λ_1, λ_2 ≥ 0 and d ∈ ℕ are the same tuning parameters specified in the methods section and Δ^{(d)} denotes the dth-order difference operator across AHo positions (Park and Casella, 2008; Kyung et al., 2010; Faulkner et al., 2018). These Laplace-like prior distributions can be expressed as scale mixtures of normal distributions with independent gamma-distributed variances (Kyung et al., 2010); for a more comprehensive discussion of shrinkage priors of this form, we refer readers to Faulkner et al. (2018). The posterior P(α|Y) can then be written as:

$$P(\alpha \mid Y) \propto \left[\prod_{i,j} P\left(Y_{i,j,\bullet} \mid \alpha_{j,\bullet}\right)\right] P(\alpha)\, 1_{\alpha}.$$

Note that we assume the Y_{i,j,•} vectors are independent conditional on α_{j,•} and that the prior distribution on α factorizes across l ∈ {1, ..., p}. The MAP estimate of the ensemble weights α is obtained by maximizing P(α|Y) and is equivalent to the estimate that minimizes the regularized objective function shown in the methods section: taking the negative logarithm of the posterior on α yields our penalized minimization problem (up to an additive constant in α).
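The MAP-equals-penalized-minimization claim can be sanity-checked numerically: under a Gaussian likelihood and Laplace-type priors, the negative log-posterior and the penalized least-squares objective differ only by a constant in α. The Python sketch below is not the paper's model; the dimensions, λ values, linear map A, and a first-order difference penalty (standing in for the general dth-order one) are all illustrative assumptions. It verifies that differences of the two criteria across two candidate weight vectors coincide, so the constants cancel:

```python
# Toy check that maximizing the posterior P(alpha | Y) is the same as
# minimizing the penalized least-squares objective: their difference is a
# constant in alpha, so differences across candidate alphas agree exactly.
import numpy as np

rng = np.random.default_rng(1)
n, lam1, lam2 = 6, 0.4, 0.2
Y = rng.normal(size=n)          # stand-in for one flattened Y slice
A = rng.normal(size=(n, 3))     # maps alpha to the ensemble mean mu = A @ alpha
D = np.diff(np.eye(3), axis=0)  # first-order (d = 1) difference operator

def objective(alpha):
    # Penalized least-squares objective: fit term + lasso + difference penalty.
    mu = A @ alpha
    return 0.5 * np.sum((Y - mu) ** 2) + lam1 * np.abs(alpha).sum() \
        + lam2 * np.abs(D @ alpha).sum()

def neg_log_posterior(alpha):
    # Full negative log-posterior, keeping the alpha-independent normalizing
    # constants of the Gaussian (sigma^2 = 1) and Laplace densities.
    mu = A @ alpha
    ll = 0.5 * n * np.log(2 * np.pi) + 0.5 * np.sum((Y - mu) ** 2)
    lp1 = -3 * np.log(lam1 / 2) + lam1 * np.abs(alpha).sum()
    lp2 = -2 * np.log(lam2 / 2) + lam2 * np.abs(D @ alpha).sum()
    return ll + lp1 + lp2

a1 = rng.uniform(0, 0.3, size=3)
a2 = rng.uniform(0, 0.3, size=3)
print(np.isclose(objective(a1) - objective(a2),
                 neg_log_posterior(a1) - neg_log_posterior(a2)))  # True
```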
Of course, there are limitations to this interpretation of our modeling framework. For instance, Y_{i,j,•} is a frequency vector, yet we model its likelihood using a multivariate normal distribution, which has support over all real numbers; we could potentially remedy this problem by instead modeling Y_{i,j,•} with a Dirichlet distribution, whose negative log-likelihood resembles a cross-entropy loss function. In addition, we specify priors on α that also have support over the real line, which is not realistic. Finally, our assumption that the Y_{i,j,•} vectors are conditionally independent given α_{j,•} is made solely for ease of presentation and does not hold in practice because the substitution profile data in both X and X* are correlated across AHo positions. Despite these issues with our statistical representation of the penalized regression model, our results demonstrate that the model is useful for predicting CF-specific substitution profiles in data-sparse situations.
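The remark about the Dirichlet alternative can be made precise: if the concentration parameters are set to a_c = 1 + κ q_c for some target profile q (a parameterization chosen here purely for illustration; κ, q, and the test vectors are assumptions, not part of SPURF), then the y-dependent part of the Dirichlet negative log-likelihood is exactly κ times the cross-entropy −Σ_c q_c log y_c. A small Python sketch:

```python
# Illustration that the Dirichlet negative log-likelihood resembles a
# cross-entropy loss: with concentration a_c = 1 + kappa * q_c, the
# y-dependent part of the NLL is exactly kappa times -sum_c q_c log y_c,
# so NLL differences between candidate profiles match scaled cross-entropy
# differences (the log-normalizer cancels).
import numpy as np
from math import lgamma

def dirichlet_nll(y, a):
    # -log density of Dirichlet(a) evaluated at probability vector y
    log_norm = sum(lgamma(ac) for ac in a) - lgamma(sum(a))
    return log_norm - np.sum((a - 1) * np.log(y))

def cross_entropy(q, y):
    return -np.sum(q * np.log(y))

q = np.array([0.6, 0.3, 0.1])   # "target" profile
kappa = 10.0
a = 1 + kappa * q               # Dirichlet concentration parameters

y1 = np.array([0.5, 0.4, 0.1])
y2 = np.array([0.2, 0.2, 0.6])

lhs = dirichlet_nll(y1, a) - dirichlet_nll(y2, a)
rhs = kappa * (cross_entropy(q, y1) - cross_entropy(q, y2))
print(np.isclose(lhs, rhs))  # True
```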