## Figures

## Abstract

The global effort to sequence millions of SARS-CoV-2 genomes has provided an unprecedented view of viral evolution. Characterizing how selection acts on SARS-CoV-2 is critical to developing effective, long-lasting vaccines and other treatments, but the scale and complexity of genomic surveillance data make rigorous analysis challenging. To meet this challenge, we develop Bayesian Viral Allele Selection (BVAS), a principled and scalable probabilistic method for inferring the genetic determinants of differential viral fitness and the relative growth rates of viral lineages, including newly emergent lineages. After demonstrating the accuracy and efficacy of our method through simulation, we apply BVAS to 6.9 million SARS-CoV-2 genomes. We identify numerous mutations that increase fitness, including previously identified mutations in the SARS-CoV-2 Spike and Nucleocapsid proteins, as well as mutations in non-structural proteins whose contribution to fitness is less well characterized. In addition, we extend our baseline model to identify mutations whose fitness exhibits strong dependence on vaccination status as well as pairwise interaction effects, i.e. epistasis. Strikingly, both these analyses point to the pivotal role played by the N501 residue in the Spike protein. Our method, which couples Bayesian variable selection with a diffusion approximation in allele frequency space, lays a foundation for identifying fitness-associated mutations under the assumption that most alleles are neutral.

## Author summary

The SARS-CoV-2 pandemic has been shaped by the repeated emergence of new viral lineages and mutations. Methods to identify emerging variants of epidemiological significance and characterize mutational determinants of enhanced fitness are important for public health. We develop Bayesian Viral Allele Selection (BVAS), a method that leverages the millions of SARS-CoV-2 viral genomes that have been sequenced across the globe to identify mutations linked to increased viral fitness. Ranked lists of top BVAS hits can be used to prioritize lineages and mutations for follow-up study in the lab. By providing a genome-wide view of the evolution of SARS-CoV-2, our principled probabilistic model complements more targeted experimental approaches for elucidating the functional consequences of different viral mutations. We argue that running BVAS periodically as part of a real-time genomic surveillance program could provide valuable information for public health authorities about new lineages as they emerge.

**Citation: **Jankowiak M, Obermeyer FH, Lemieux JE (2022) Inferring selection effects in SARS-CoV-2 with Bayesian Viral Allele Selection. PLoS Genet 18(12):
e1010540.
https://doi.org/10.1371/journal.pgen.1010540

**Editor: **Takashi Gojobori,
National Institute of Genetics, JAPAN

**Received: **June 15, 2022; **Accepted: **November 23, 2022; **Published: ** December 12, 2022

**Copyright: ** © 2022 Jankowiak et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The SARS-CoV-2 data used in our analysis are provided by GISAID. A complete list of accession numbers for the viral genomes used in our study is publicly available: https://github.com/broadinstitute/bvas/raw/main/paper/accession_ids.txt.xz The UShER tree used in our pre-processing pipeline is publicly available: https://hgwdev.gi.ucsc.edu/~angie/9f94a7b/. The vaccination data we use are provided by OWID: https://github.com/owid/covid-19-data/. An open-source implementation of our analysis code is available at: https://github.com/broadinstitute/bvas. The initial portion of our data pre-processing pipeline relies on open source code described by Obermeyer et al. (2022): https://github.com/broadinstitute/pyro-cov Allele-level and lineage-level inference results from our main BVAS analysis are publicly available: https://github.com/broadinstitute/bvas/raw/main/paper/allele_summary.csv https://github.com/broadinstitute/bvas/raw/main/paper/growth_rates_summary.csv.

**Funding: **This work was supported in part by grants from MassCPR Viral Variants Program and CDC BAA 75D30120C09605 (to J.E.L). - https://www.cdc.gov/amd/whats-new/cdc-announces-awards-SARS-CoV-2-sequencing-SPHERES-initiative.html - https://masscpr.hms.harvard.edu/ The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## 1 Introduction

The SARS-CoV-2 pandemic has seen the repeated emergence of new viral lineages with higher fitness, where fitness includes any attribute that affects the lineage’s growth, including its basic reproduction number and generation time. The virus has evolved into numerous sublineages that are characterized by distinct phenotypes including enhanced pathogenicity, increased escape from convalescent and vaccine-acquired immunity, differential host tropism, and altered biochemical interaction with cell surface machinery. For example, the Spike mutation S:D614G, found in nearly all Variants of Concern, is associated with higher SARS-CoV-2 loads [1, 2]. Other mutations such as S:N439R, S:N501Y, and S:E484K, have been linked, respectively, to increased transmissibility [3], enhanced binding to ACE2 [4], and antibody escape [5, 6]. For the vast majority of observed mutations, however, links to SARS-CoV-2 fitness are unknown and functional consequences remain uncharacterized.

Fortunately, the SARS-CoV-2 pandemic has prompted a global genomic surveillance program of unprecedented scope and scale, with more than 10 million virus genomes sequenced to date. This growing quantity of genomic surveillance data provides a unique opportunity to interrogate the dynamics of viral infection and quantify selective forces acting on lineages and mutations. Current methods to analyze such data typically rely on phylogenetic analysis or parametric growth models. Phylogenetic methods usually rely on computationally expensive Markov Chain Monte Carlo (MCMC) for inference, with the result that handling more than ∼5000 samples becomes infeasible [7, 8]. By contrast, parametric growth models are scalable to large datasets but typically do not systematically account for competition between multiple lineages, have little to say about newly emergent lineages, and cannot pinpoint the genetic determinants of differential fitness [9, 10].

Two recently developed methods address some of these shortcomings. The first method, PyR_{0}, is a hierarchical Bayesian parametric growth model that jointly estimates growth rates for multiple lineages across multiple geographic regions [11]. Since PyR_{0} regresses growth rates against genotype, it can also make inferences about the genetic determinants of differential fitness. Moreover, since PyR_{0} relies on variational inference it can be applied to large datasets. However, variational inference also results in poor uncertainty estimates and the parametric likelihood that underlies PyR_{0} makes ad hoc assumptions about the noise characteristics of surveillance data. The second method, which we refer to as MAP, likewise regresses growth rates against genotype but instead utilizes an elegant diffusion-based likelihood that is better suited to the stochastic dynamics of viral transmission [12]. However, unlike PyR_{0} MAP does not assume that most alleles are approximately neutral, i.e. that most alleles have little or no effect on viral fitness. As a result MAP risks inferring non-negligible selection effects for implausibly many alleles.

In the following we set out to formulate a method—Bayesian Viral Allele Selection—that combines and improves upon the respective strengths of both PyR_{0} and MAP and achieves the following desiderata. First, we retain both methods’ scalability to large datasets and their ability to account for competition between multiple co-circulating lineages. Second, we retain both methods’ ability to infer the genetic determinants of differential fitness, which is important both for understanding the biology of transmission and pathogenesis and for predicting the fitness of emergent lineages. Third, we incorporate the sparsity assumption of PyR_{0}—namely that most alleles are approximately neutral—while adopting the principled diffusion-based likelihood that underlies MAP. Finally we discard variational inference in favor of efficient MCMC so as to obtain more plausible uncertainty estimates, which provide crucial nuance for public health agencies.

To establish the operational characteristics of our method we perform a large suite of simulations, including detailed comparisons to PyR_{0}, MAP, and another diffusion-based method we introduce (Laplace). We find that Bayesian Viral Allele Selection (BVAS) performs well across the board, with notable advantages of BVAS being its robustness to hyperparameter choices, its satisfactory uncertainty estimates and the fact that it offers interpretable Posterior Inclusion Probabilities that can be used to prioritize alleles for follow-up study.

We apply BVAS to 6.9 million SARS-CoV-2 genomes obtained through April 18^{th}, 2022, noting that, to the best of our knowledge, this is the largest such analysis to date. Our genome wide analysis identifies known functional hot spots in the SARS-CoV-2 genome like the receptor-binding domain (RBD) in the S gene as well as additional hits in regions of the genome whose function is less well understood like the ORF1ab polyprotein. We argue, based on a retrospective backtesting analysis, that running BVAS periodically as part of a real-time genomic surveillance program could provide valuable estimates of the growth rates of new lineages as they emerge. In addition, we conduct an analysis that allows for vaccination-dependent selection effects and find tantalizing evidence that S:N501Y exhibits vaccination-dependent differential fitness. Finally, we conduct an analysis that aims to identify pairs of mutations whose fitness effect is not additive (i.e. epistasis), which likewise points to the important role played by the RBD residue N501.

## 2 Models and methods

### 2.1 Viral infection as diffusion

The starting point for both MAP and BVAS is a branching process that encodes the dynamics of infected individuals at time *t* stochastically generating secondary infections at time *t* + 1. (A more detailed exposition of this and the following sections can be found in S1 File.) SARS-CoV and SARS-CoV-2 are known to exhibit super-spreading [13, 14] whereby a minority of infected individuals causes the majority of secondary infections. To account for this behavior the number of secondary infections is assumed to be governed by a Negative Binomial distribution, which has a large variance for small values of the dispersion parameter *k*. In particular we assume that if a given individual is infected with a variant *v* with reproduction number *R*_{v}, the number of secondary infections due to that individual has mean *R*_{v} and variance . If we let *n*_{v}(*t*) denote the total number of individuals at time *t* infected with variant *v*, our assumptions result in the following discrete time process:
(1)
To connect these dynamics to genotype, we assume that variants are characterized by *A* alleles and that each variant *v* is encoded as a binary vector **g**_{v} ∈ {0, 1}^{A}. We then express *R*_{v} as *R*_{v} = *R*_{0}(1 + Δ*R*_{v}), where *R*_{0} corresponds to the wild-type variant, and assume that Δ*R*_{v} is governed by a linear additive model
(2)
where are allele-level selection coefficients. Note that we consider quadratic effects in Sec. 4.7. If we transform from case counts *n*_{v}(*t*) to allele frequencies *x*_{a}(*t*), Lee et al. [12] show that the dynamics in Eq (1) are equivalent to the following diffusion process in allele frequency space
(3)
where is the *A*-dimensional drift, given by
(4)
The *A* × *A* diffusion matrix **Λ**(*t*) is given
(5)
where *x*_{ab}(*t*) is the fraction of infected individuals at time *t* who carry alleles *a* and *b*. Finally *ν* is the effective population size given by
(6)
where *n* is the total number of infected individuals. Importantly, the equivalence of Eqs (1) and (3) holds in the diffusion limit of large *n*.

It is important to note that the use of diffusion processes similar to that in Eq (3) has a long history in population genetics, including seminal work by Kimura [15] as well recent applications that employ diffusion-based likelihoods in the context of statistical inference [16–19].

### 2.2 MAP

The simplest model that utilizes the diffusion-based likelihood in Eq (3) is formulated as follows (we refer the reader to [12] for additional discussion). First we place a Multivariate-Normal prior on the selection coefficients ** β** (7)
where

*τ*> 0 is the prior precision and is the

*A*×

*A*identity matrix. For observed incremental allele frequency changes (8) the likelihood is given by (9) where we have assumed that

*ν*is constant across time. Since

**appears linearly in the drift**

*β***d**(

*t*|

**) and the prior is Multivariate-Normal, the corresponding maximum a posteriori (MAP) estimate is available in closed form: (10) An attractive property of this estimator is that it can be computed in time and is thus quite fast on modern hardware, at least for**

*β**A*up to

*A*∼ 10

^{4}−10

^{5}. An unattractive property of this estimator is that it can perform poorly in the high-dimensional regime,

*A*≫ 1, since we expect most alleles to be neutral, but will generally be non-zero for all

*a*.

### 2.3 Bayesian Viral Allele Selection

We now introduce our method: Bayesian Viral Allele Selection (BVAS). We expect most alleles to be nearly neutral (*β*_{a} ≈ 0) and we would like to explicitly include this assumption in our model. To do so we utilize the modeling motif of Bayesian Variable Selection [20]:
(11)
where *a* = 1, …, *A* and *t* = 1, …, *T* − 1. Here each Bernoulli latent variable *γ*_{a} ∈ {0, 1} controls whether the *a*^{th} coefficient *β*_{a} is included (*γ*_{a} = 1) or excluded (*γ*_{a} = 0) from the model; in other words it controls whether the *a*^{th} allele is neutral or not. The hyperparameter *h* ∈ (0, 1) controls the overall level of sparsity; in particular *S* ≡ *hA* is the expected number of non-neutral alleles a priori. The |*γ*| coefficients are governed by a Normal prior with precision *τ* where *τ* > 0 is a fixed hyperparameter. Here |*γ*| ∈ {0, 1, …, *A*} denotes the total number of non-neutral alleles in a given model. Note that in the following we drop the *γ* subscript on *β*_{γ} to simplify the notation.

In addition to inducing sparsity, an attractive feature of the model in Eq (11) is that—because it is formulated as a model selection problem—it explicitly reasons about whether each allele is neutral or not. In particular this model allows us to compute the *Posterior Inclusion Probability* or PIP, an interpretable score that satisfies 0 ≤ PIP ≤ 1. The PIP is defined as PIP(a) ≡ p(*γ*_{a} = 1|**y**_{1:T−1}), i.e. PIP(*a*) is the posterior probability that allele *a* is included in the model. This quantity should be contrasted to *h* in Eq (11), which is the *a priori* inclusion probability. Alleles that have large PIPs are good candidates for being causally linked to viral fitness.

In Eq (11) we assume that *h* is known. An alternative is to place a prior on *h*, *h* ∼ Beta(*α*_{h}, *β*_{h}), and infer *h* from data. See Sec. S9 in S1 File for details.

### 2.4 MCMC inference

BVAS admits efficient MCMC inference via a recently introduced algorithm dubbed Tempered Gibbs Sampling [21]. This is quite remarkable: the underlying inference problem is very challenging, since i) it is a transdimensional inference problem defined on a mixed discrete/continuous latent space; and ii) the size of the model space, namely 2^{A}, is astronomically large. The feasibility of MCMC inference in this setting is enabled by the specific Gaussian form of the diffusion-based likelihood in Eq (9) and would be impractical for most other (non-conjugate) likelihoods. Thus BVAS is made possible by a pleasant synergy between the form of the prior and the likelihood.

As we explain in more detail in Sec. S4 in S1 File the resulting inference algorithm has computational cost per MCMC iteration and is thus quite fast on modern hardware. Here |*γ*| is the total number of non-neutral alleles, which by assumption satisfies |*γ*| ≪ *A*. Notably the computational complexity does not include terms that are quadratic or cubic in *A*, since the (strict) sparsity of Bayesian variable selection implies that the required linear algebra never involves *A* × *A* matrices. Importantly, the viability of MCMC inference means that we expect to achieve satisfactory uncertainty estimates, in particular ones that explicitly weigh differing hypotheses about which alleles are neutral and which are not. Indeed the BVAS posterior mean of ** β** can be viewed as an evidence-weighted linear combination of 2

^{A}MAP estimates.

### 2.5 Multiple spatial regions

In the above we have assumed a single spatial region. To apply either BVAS or MAP to multiple spatial regions we simply add a subscript where necessary and form a product of diffusion-based likelihoods for *N*_{R} regions indexed by *r*:
(12)
As discussed in Sec. S4 in S1 File, including multiple regions has negligible impact on the computational cost, since all summations over the region index *r* are performed once in pre-processing.

### 2.6 Estimating the effective population size

The likelihood in Eq (9) depends on the effective population size *ν*, a quantity that we do not know a priori and need to estimate from data. For a given region *r* Eq (9) implies
(13)
so that if we assume that the drift term is subdominant we obtain the approximation
(14)
We note that, since **d**_{r}(*t*)^{T} **d**_{r}(*t*) ≥ 0, we would expect to be an underestimate of *ν*_{r}, especially if the effective population size is large. This results in the following simple estimator
(15)
where we have averaged Eq (14) over *T* − 1 time steps.

To accommodate multiple regions we compute within each region using Eq (15) and then compute a single global effective population size by computing the median of . With this choice all regions contribute equally to the likelihood. See Sec. S6 in S1 File for additional details and discussion.

### 2.7 Sampling rate

As we show in Sec. S8 in S1 File an attractive property of the diffusion process in Eq (3) is that it behaves sensibly in the presence of sampling, i.e. the fact that not all viral sequences are observed in real world datasets. Indeed if sampling is i.i.d. and the sampling rate is *ρ* with 0 < *ρ* ≪ 1 then the effect of sampling is to renormalize the effective population size in Eq (6) as
(16)
This means that the covariance structure in Eq (3) remains intact, which is important because it is precisely this 2^{nd} order information that helps BVAS and MAP disentangle driver mutations from passenger mutations. This is reassuring because for SARS-CoV-2, where even the most ambitious surveillance programs satisfy *ρ* ≪ *k*, the effective population size is dominated by the effects of sampling and .

### 2.8 Vaccination-dependent effects

Suppose we know the vaccination rate 0 ≤ *ϕ*_{r}(*t*) ≤ 1 for a given region *r*. We would like to incorporate this information into our modeling by allowing for vaccination-dependent selection. To do so we write the drift in region *r* as
where is a second group of selection coefficients whose strength is modulated by the time- and region-local vaccination rate. In particular ** α** only has a non-negligible effect on infection dynamics when

*φ*

_{r}(

*t*) is itself non-negligible. Disentangling the effects of

**and**

*β***is difficult a priori. Our hope, however, is that a Bayesian variable selection approach with robust MCMC inference should be up to the task provided we have enough data. See Sec. S7 in S1 File for additional discussion.**

*α*### 2.9 Alternative model: Laplace

Finally we describe the simplest modification of MAP that can account for the expected sparsity of non-neutral alleles. For additional discussion please refer to Sec. S10 in S1 File, where we describe several alternative models that make use of the diffusion-based likelihood in Eq (3). In this approach we place a Laplace prior on ** β** (17)
where ||

**||**

*β*_{1}is the

*L*

^{1}norm of

**and**

*β**σ*

^{Laplace}> 0 is a hyperparameter that controls the expected level of sparsity. We then define the maximum a posteriori estimate under this Laplace prior: (18) For the sake of precision we should probably refer to MAP as MAP-Gaussian and the approach described here as MAP-Laplace. However, for brevity we instead refer to these methods as MAP and Laplace, respectively. The Laplace estimator cannot be computed in closed form but can be readily approximated with iterative optimization techniques. We will consider Laplace alongside BVAS, MAP, and PyR

_{0}in our simulations, which we turn to next.

## 3 Simulation results

To assess the performance of our method we conduct an extensive suite of simulation-based experiments, including experiments that rely solely on simulated data as well as a semi-synthetic experiment that relies on perturbed SARS-CoV-2 data.

### 3.1 Simulation details

Our simulator closely follows the structure of the discrete time process in Eq (1). The most salient details are as follows (see Sec. S13 in S1 File for details). We include exactly 10 non-neutral alleles of varying effect size, with typical reproduction numbers for variants *v* ranging between 0.9 and 1.1. In each simulation we consider a given number of *N*_{R} regions and *T* = 26 time steps. The initial number of infected individuals at time *t* = 1 within each region is drawn from a Negative Binomial distribution with mean 10^{4}. Case counts for *t* = 2, …, *T* are determined by the stochastic dynamics in Eq (1) with *k* = 0.1. This value of *k* is chosen since it is consistent with estimates of the SARS-CoV-2 dispersion parameter [22–25]. These raw counts are then subjected to Binomial sampling with mean *ρ* = 0.01, i.e. the viral sequences of 99% of cases are not observed. Thus our parameter choices result in simulated data that are highly stochastic and that constitute a regime in which we expect that recovering the true selection coefficients ** β*** is quite challenging. Unless noted otherwise, we generate 20 datasets per condition. We make these choices because they result in simulated data that exhibit some of the characteristics of our SARS-CoV-2 data. In particular, typical estimated effective population sizes range from about 25 to about 140 with a mean of about 75.

### 3.2 Method comparison

We compare four methods for inferring allele-level selection using simulated data, in particular three diffusion-based methods (MAP, BVAS, and Laplace) and PyR_{0}. For all methods except for BVAS we rank allele-level hits by the absolute effect size, whereas for BVAS we rank by the Posterior Inclusion Probability (PIP). See Sec. S13.3 in S1 File for the hyperparameter choices made.

In Fig 1 we report results on the hit rate, which we define as the fraction of the top 10 hits that are causal, where causal alleles are those for which the true effect is non-zero. This metric is convenient since it does not depend on any method-specific threshold for calling hits. As expected the hit rate generally increases as the number of regions increases and decreases as the number of alleles increases (since the number of possible spurious hits increases). Strikingly, BVAS exhibits the best hit rates across the board. Laplace and MAP are competitive with BVAS in some regimes, but their performance degrades in other regimes, particularly when the number of alleles is large.

We compare the hit rate for four different methods using simulated data, where the hit rate is defined as the fraction of the top 10 hits that are causal. Results are averaged across 20 independent simulations. See Sec. 3.2 for discussion.

We hypothesize that the main reason for the poor performance of MAP in some regimes is the fact that MAP does not enforce sparsity in the allele-level coefficients ** β**. This effect is particularly evident from the mean absolute error (MAE) results in Fig 2, where it can be seen that the MAP MAE is large across the board, since MAP assigns non-negligible effect sizes to a large number of alleles. As the number of regions and thus the total amount of data increases, MAP tends to identify ever more non-negligible effects, potentially leading to a large number of spurious hits.

We report the accuracy of inferred selection coefficients ** β** for three diffusion-based methods using the simulated data described in Sec. 3.1. We consider two metrics: root mean squared error (RMSE; top) and mean absolute error (MAE; bottom). In both cases the metric is normalized such that the value of the metric for

**=**

*β***0**is equal to unity. For example, the RMSE is normalized by ||

***||**

*β*_{2}, where

*** are the true effects. We do not include a comparison to PyR**

*β*_{0}, since it utilizes a somewhat different likelihood, making direct comparison subtle. See Sec. 3.2 for discussion.

In contrast to MAP, PyR_{0} and Laplace do impose sparsity on the allele-level coefficients ** β**. We hypothesize that one of the main reasons for the poor performance of PyR

_{0}and Laplace in some regimes is the fact that they rely on hyperparameters that are difficult to choose. This is especially the case for PyR

_{0}, which contains 7 model hyperparameters, the most important of which is a direct analog to

*σ*

^{Laplace}.

To make this broader point concrete we investigate the sensitivity to the Laplace regularization scale *σ*^{Laplace} in Fig 3. We find that moderate changes in *σ*^{Laplace} lead to significant degradation in performance. Since there is no principled method to choose *σ*^{Laplace} a priori, one must instead rely on simulation-based intuition. Since, however, any simulation cannot capture all the effects that characterize real data and since it is unclear a priori what simulation parameters should be used, it remains difficult to choose *σ*^{Laplace} and so the sensitivity in Fig 3 is troubling. In the next section we show that BVAS exhibits less sensivity to hyperparameter choices.

We explore the sensitivity of the Laplace method to the prior scale *σ*^{Laplace}. Changing *σ*^{Laplace} from the optimal value of *σ*^{Laplace} ≈ 0.01 results in significantly worse performance. We consider *A* = 3000 alleles and *N*_{R} = 30 regions and generate 40 simulated datasets.

### 3.3 BVAS sensitivity to hyperparameters and

BVAS is specified by two hyperparameters: the prior inclusion probability *h* and the prior precision *τ*. For an extended dicussion of *h* see Sec. S9 in S1 File (note that if a prior is placed on *h* the hyperparameters instead become {*τ*, *α*_{h}, *β*_{h}}). The quantity controls the expected scale of effect sizes *β*. For example, for *τ* = 100 the prior standard deviation of *β* is 0.1. This choice implies that ∼95% of prior probability mass concentrates on the range *β* ∈ [−0.2, 0.2]. In Fig 4 (top row) we depict the sensitivity of BVAS to changes in *τ*. We find that the sensitivity to *τ* is small over about 4 orders of magnitude. It is only for very large *τ* (*τ* = 10^{4}) that we see a large drop in performance.

We explore the extent to which BVAS performance is sensitive to its two hyperparameters, namely *τ* and *S* = *hA*, using the simulated data described in Sec. 3.1. We simulate data for *N*_{R} = 30 regions and *A* = 3000 alleles and generate 40 datasets. See Sec. 3.3 for discussion.

In Fig 4 (bottom row) we depict the sensitivity of BVAS to changes in *S* ≡ *hA*, which is the expected number of non-neutral alleles a priori. We find only moderate sensitivity as *S* ranges from *S* = 1 to *S* = 64. In other words it is not necessary for *S* to be an accurate estimate of the number of non-neutral alleles (10 in our simulations): the posterior is a compromise between the prior and the likelihood and for reasonable choices of *S* the likelihood can overwhelm the prior if there is sufficient evidence for non-neutral alleles. Importantly the precision remains high for all values of *S*. The effect of choosing small *S* is to be more conservative; in particular some weak effects at the threshold of discovery may be assigned small PIPs. This robustness to changes in *S* is reassuring because our a priori knowledge of the number of non-neutral alleles in real data is limited.

Next we explore the sensitivity of BVAS to accurate estimation of the effective population size *ν*. Note that unlike *S* or *τ*, which appear in the prior in Eq (11), *ν* appears in the likelihood. The value of *ν* evidently plays an important role because it controls the level of noise in the diffusion process. Large values of *ν* imply that allele frequency increments **y**(*t*) are largely determined by (deterministic) drift. Conversely, small values of *ν* imply that **y**(*t*) exhibits significant (stochastic) variability that dominates the drift. Thus, with all else equal, increasing *ν* places more emphasis on fitting the observed apparent drift with the result that BVAS will tend to identify more signal, i.e. more alleles with non-negligible PIPs. Conversely, decreasing *ν* places less emphasis on fitting the observed apparent drift with the result that BVAS will tend to identify less signal, i.e. fewer alleles with non-negligible PIPs.

We investigate this effect quantitatively in Fig 5, which confirms our intuition. At least for our simulated data the consequences of underestimating *ν* are more severe than the consequences of overestimating *ν*; for example if we underestimate *ν* by a factor of 4 the hit rate drops by a factor of one half. By contrast overestimating *ν* by a factor of 4 actually improves the hit rate in this simulation, since the tighter likelihood encourages BVAS to seek out less sparse solutions, which results in additional hits for alleles at the margin of discovery. Overall the behavior in Fig 5 is encouraging, since we can estimate the effective population size with moderate accuracy in simulation (see Sec. S13.2 in S1 File). In practice of course we expect worse performance in the context of real data because the noise structure of real data will not precisely follow the noise structure assumed by our diffusion-based likelihood. Nevertheless the fact that the results in Fig 5 exhibit a good degree of robustness for *ν* estimates that are off by a factor of ∼2 suggests that running BVAS on real data should be relatively robust to the estimation strategy used.

We explore the extent to which BVAS performance is sensitive to accurate estimation of the effective population size *ν* using the simulated data described in Sec. 3.1. To do so we modulate our estimate for *ν* by the indicated multiplier, e.g. . We simulate data for *N*_{R} = 30 regions and *A* = 3000 alleles and generate 40 datasets. See Sec. 3.3 for discussion.

### 3.4 The value of PIPs

In contrast to the other methods we consider BVAS provides a Posterior Inclusion Probability for each allele. In Fig C in S1 File we demonstrate the value of PIPs by exploring the allele-level precision and sensitivity that are obtained if we declare alleles with a PIP above a threshold of 0.1 as hits. We observe very high precision across the board. In other words, if an allele has a high PIP there is good reason to believe it is causally linked to viral fitness, at least if we believe the generative process that underlies our diffusion-based likelihood. It is worth emphasizing that an allele with a moderate effect size can still exhibit a large PIP, thus signifying strong evidence for being causal.

### 3.5 Variability due to sampling rate

As discussed in Sec. 2.7 our diffusion-based likelihood, Eq (9), naturally accommodates random sampling where only a fraction *ρ* of infected individuals have their viral genomes sequenced. To explore the effects of sampling we generate data with a sampling rate that ranges between 1% and 64%. We find that the results are remarkably robust (see Fig D in S1 File), even as the effective population size decreases by a factor of ∼15 as *ρ* decreases from 64% to 1% (see Eq (16)).

### 3.6 Including vaccination-dependent effects

We now incorporate vaccination rates *ϕ*_{r}(*t*) into our simulations, assuming that *ϕ*_{r}(*t*) starts at zero everywhere and increases linearly over time. We assume 20 non-zero effects, half of which are vaccination-dependent. Otherwise our simulation follows the specifications of Sec. 3.1. See Fig E in S1 File for results. As we would expect, robustly identifying causal mutations is harder in this setting, and the hit rate for vaccination-dependent effects is lower than for all effects. Nevertheless the precision is high in all cases, which gives us confidence that high PIP vaccination-dependent alleles identified in real data may be causally linked to vaccination-dependent differential fitness.

### 3.7 Spike-in experiment

We conduct a semi-synthetic experiment where we add 200 synthetic (and therefore spurious) alleles to 3000 SARS-CoV-2 lineages (we use data from January 20^{th} 2022 for a total of *A* = 2904+ 200 = 3104 alleles). Each lineage is assigned a Binomial number of non-wild-type spiked-in alleles with mean 2. Since these assignments are independent and identically distributed, the spiked-in alleles are not correlated with the pre-existing genotype in any way and thus any apparent selection effects due to these alleles are due to chance alone. See Fig 6 for results. We find that PyR_{0} and BVAS select the fewest number of spiked-in alleles, whereas MAP and Laplace select the most. Note that we expect some small number of spiked-in alleles to be selected due to random chance alone. Importantly, across 30 replications none of the methods identifies a single spiked-in allele in the top 20 scoring hits. This is encouraging, since it suggests that the top scoring hits from all four methods should be enriched with causal alleles.

We compare the robustness of four methods for inferring allele-level selection effects to the addition of spiked-in alleles. We depict the total number of spiked-in alleles that are among the top 100 scoring alleles, where the horizontal line within each violin plot denotes the median and for consistency we rank alleles by the absolute value of the selection coefficient *β*. The orange dotted line corresponds to the number of spiked-in alleles that would be expected among the top 100 alleles if alleles were ranked at random. We report results from 30 independent simulations.

## 4 SARS-CoV-2 analysis

### 4.1 Data

Our raw data consist of 8.6 million samples downloaded from GISAID [26] on April 18^{th}, 2022. In initial pre-processing we follow the procedure in Obemeyer et al. [11], which relies on a phylogenetic tree constructed by UShER [27, 28], and results in *L* = 3000 SARS-CoV-2 clusters that are finer than the 1662 PANGO lineages in the data [29]. In our main analysis we consider *A* = 2975 non-synonymous amino acid substitutions, excluding both insertions and deletions due to limitations of UShER, and taking Wuhan A as the reference genotype, i.e. *R*_{0} ≡ *R*_{A}. After filtering to well-sampled regions there remain 6.9 million samples from *N*_{R} = 128 regions. Allele frequencies for each region are computed in time bins of 14 days and the effective population size is estimated using the global strategy described in Sec. 4.6. Vaccination data for the analysis in Sec. 2.6 are obtained from OWID [30]. For additional details on data pre-processing see Sec. S14.1 in S1 File. For additional results using data obtained through August 10^{th} 2022 see Tables C and D in S1 File.

### 4.2 Fitness of SARS-CoV-2 lineages and mutations

We use BVAS to rank the relative fitness of all SARS-CoV-2 lineages. To do so, we fit our model to allele frequencies of 2975 alleles across 128 regions, with *τ* = 100 and *S* = 50 (so *h* = *S*/*A* ≈ 0.017) reflecting our prior assumptions that a relatively modest number of non-neutral alleles with (possibly) moderately large selection effects are driving evolution of SARS-CoV-2 fitness.

In Table 1, we report relative growth rate estimates *R*/*R*_{A} for the top 20 lineages. Fitness estimates are broadly concordant with the observed pandemic, with the fittest lineages all Omicron variants. BVAS accurately captures the hierarchy of replacement by fitter lineages with Omicron (BA.2) > Omicron (BA.1) > Delta > Alpha > wild-type virus (Table 2). Notably some PANGO lineages (e.g. B.1.1) exhibit very diverse genotypes and thus correspondingly diverse growth rates, see Fig G in S1 File.

Analysis of sublineage fitness reveals that Omicron has fractured into many sublineages whose fitness has increased modestly over time (Table 1). BA.2.12.1 appears to be the fittest lineage observed to date, although many BA.2 sublineages are comparably fit. BA.5, which appears to be descended from BA.2, has recently emerged in South Africa and is reported to have enhanced fitness [31] and additional immune escape [32, 33] relative to Omicron BA.1. Since BVAS regresses growth rate against genotype, it is able to infer that BA.5 is among the fittest lineages circulating despite the fact that very few BA.5 sequences are in our dataset. Like BA.2.12.1, BA.5 also possess mutations at Spike position 452 (L452R), underscoring the key role that this site plays in SARS-CoV-2 fitness.

We report the fitness of recombinant lineages in Table A in S1 File. In contrast to highly fit lineages that emerged in the BA.2 clade, several recombinants, including those that represent recombination between Delta and BA.1 and BA.1 and BA.2, have been the source of international concern [34–36]. The fittest recombinants are XN and XT, though their fitness is intermediate to that of BA.2 and BA.1. While the appearance of recombinant lineages is striking, the fitness of existing XA—XT recombinants suggests that these particular lineages are unlikely to play an important role in the future.

We report the fitness of top-scoring mutations in Table 3 and plotted along the length of the genome in Fig 7. The strongest signal of selection is in Spike (see Fig 8), with the greatest concentration of hits located in the receptor-binding domain (RBD). Strong signals of selection are also observed in the N-terminal domain (NTD) and furin cleavage sites. By effect size, S:L452R is the top-scoring hit. This mutation is found in BA.4/BA.5 and was also an important component of the ‘California’ variants, B.1.427 and B.1.429. Deng et al. [3] have shown that this mutation increases infectivity, while [37] and [38] have shown that it promotes antibody escape. The closely related mutation S:L452Q, one of two key Spike mutations in the fast-growing BA.2.12.1 variant, is also highly ranked, underscoring the importance of this site.

We depict BVAS hits as a Manhattan plot of the entire SARS-CoV-2 genome, with gene boundaries indicated on the horizontal axis. We do not include the small number of hits with negative selection coefficients (in particular ORF1b:K1383R, ORF1a:A2554V, ORF1b:S959P, and ORF1a:G697R).

View of the 1237 amino acids of the S protein, annotated by structure [47]; many top-scoring mutations are located in the N-terminal domain (NTD), receptor-binding domain (RBD), and furin cleavage (FC) site. Regions containing the fusion peptide (FP), heptad repeat (HR) 1 and 2, transmembrane domain (TM), and C-terminal domain (CTD) are also annotated.

The top-scoring mutations cluster together in various regions of Spike (Fig 9A), particularly the ACE2 binding interface of the RBD (Fig 9B). Three of the top five hits (S:L452R, S:E484K, and S:N501Y) are within the RBD, and a fourth, S:R346K, is just adjacent to it. The mechanisms driving positive selection at these sites likely include increased affinity to ACE2, as was shown for S:N501Y [4], as well as escape from neutralizing antibodies that bind in this region, e.g. for S:E484K, S:N440K and S:R346S [39, 40].

**A**. We depict the locations of the top 20 Spike hits, ranked by PIP, on the Cryo-EM structure of a Spike trimer bound to ACE2 (magenta) at 3.9 Angstrom resolution in the single RBD ‘up’ conformation from [48] **B**. Enlarged view of the RBD-ACE2 interface, showing the spatial proximity of S:R346, S:N339, S:N440, S:L452, S:S477, S:E484, and S:N501.

We characterized antibody escape of Spike mutations further by correlating BVAS RBD estimates to predictions made by the antibody-escape calculator in Greaney et al. [41] (Fig H in S1 File). This escape calculator is based on deep mutational scanning data for 33 neutralizing antibodies elicited by SARS-CoV-2 and thus represents an independent source of experimental data. As in [11], there is a strong correlation (*ρ*_{Spearman} = 0.89) between the two sets of predictions, lending support to BVAS results for the RBD. We assessed the temporal progression of selection effects in SARS-CoV-2 lineages by aggregating Δ*R* estimates due to S gene, RBD, and non-S-genes contributions (Fig 10). The elevated contribution of S-gene mutations (notably in the RBD) over non-S-gene mutations starting around November 2021 is apparent, in agreement with the results from [11]. Collectively these two results suggest that immune escape has become an increasingly prominent factor in SARS-CoV-2 evolution over time, likely a result of rising rates of convalescent and vaccine-induced immunity to Spike. The correlation between mutations that confer antibody escape and mutations associated with fitness supports the hypothesis that coronaviruses evolve through positive selection of receptor-binding domain mutations that escape neutralizing antibody responses [42].

We aggregate BVAS *β* estimates into S-gene (including the RBD), RBD (S gene receptor-binding domain), and non-S-gene components for *L* = 3000 SARS-CoV-2 clusters (blue points). The horizontal axis denotes the date at which each cluster first emerged, and magenta squares denote the median Δ*R* within each monthly bin. The elevated contribution of S-gene mutations (notably in the RBD) over non-S-gene mutations starting around November 2021 is conspicuous. Compare to Fig 2CDE in [11].

Other hotspots within Spike include the furin cleavage site, which features three of the top 20 mutations (S:P681R, S:P681H, and S:N679K). These substitutions add positive charge at the cleavage site, likely facilitating S1/S2 cleavage and promoting infectivity by enhancing cell-cell fusion, which has been demonstrated by several groups [43–45].

BVAS also identifies numerous residues under selection outside of Spike. These include P13L in the N-terminal region and P199L and R203M in the linker region between the N- and C-terminal domains (Fig I in S1 File). These latter two mutations have been demonstrated by [46] to enhance viral packaging. Within ORF1a, several amino acid changes in NSP4 score highly, including T3090I, L3201P, and T3255I (Fig J in S1 File). The importance of these effects can be quantified on a per-ORF basis using PIPs, which provide a convenient numerical measure of the total evidence for selection due to each allele. To assess the overall contribution of various regions of the SARS-CoV-2 genome to differential viral fitness we sum PIPs across different regions (see Fig K in S1 File), quantifying the relative importance of S, N, and several non-structural proteins in ORF1ab.

### 4.3 Sensitivity analysis

We perform an extensive sensitivity analysis to better understand the robustness of our results to hyperparameter and data pre-processing choices. In Figs M-Q in S1 File we explore the sensitivity of BVAS growth rate estimates for the fittest lineages, noting that we might expect these estimates to be sensitive to the strength of regularization. We find minimal sensitivity to *S* (the number of non-neutral alleles expected a priori), the total number of regions included in the analysis *N*_{R}, as well as , which controls which time bins enter the analysis, see Figs M-O in S1 File. The sensitivity to the prior precision *τ* is somewhat larger (see Fig P in S1 File), especially as we increase *τ* to *τ* = 400, although we would argue that this is an unreasonable prior choice, as it makes even relatively moderate selection effects like *β* ∼ 0.15 a priori unlikely. Not surprisingly, the sensitivity to the estimated effective population size, , is fairly large (see Fig Q in S1 File), roughly comparable to the scale of the underlying statistical uncertainty.

We adopt a more global view in Figs R-AA in S1 File, reporting sensitivity analyses for all PIPs and *β* estimates as well as for growth rates for all 1662 PANGO lineages. Globally, growth rate estimates are remarkably stable with Pearson correlation coefficients of 0.999 or larger. Selection coefficient estimates *β* are also quite stable, with Pearson correlation coefficients of 0.9 or greater, with the largest sensitivity being again to *τ* and . Results for allele-level PIPs are broadly comparable, although they tend to exhibit more variability and outliers, especially for alleles with smaller PIPs. Importantly concordance between independent MCMC chains is exceptionally high (*R* ≥ 0.9997), see Fig R in S1 File, suggesting that MCMC error is minimal.

### 4.4 Backtesting

We perform a backtesting analysis in which we run BVAS on subsets of the data defined by varying end dates. Doing so allows us to assess the possible benefits of running BVAS periodically as part of a real-time genomic surveillance program. We caution that these results need to be interpreted with care, since in practice it can take several weeks before individual genomic samples are deposited in GISAID. Additionally, the calling of new lineages is also associated with a time lag. See Fig BB in S1 File for results. We find that by May 13^{th} 2021 BVAS predicts that various Delta sublineages are fitter than B.1.1.7 (Alpha), which was the most prevalent lineage in England and elsewhere at the time. Similarly, we find that by December 8^{th} 2021 BVAS predicts that various Omicron sublineages are fitter than AY.4.2.1 (Delta). Notably, since BVAS regresses *R*_{v} against genotype, we also obtain plausible estimates for newly emergent lineages that have only been sequenced a small number of times. Finally by January 12^{th} 2022 BVAS predicts that BA.2 was fitter than BA.1, a prediction that has been borne out by the subsequent takeover of BA.2.

During the time periods considered in our backtesting analysis the number of samples of these newly emergent lineages was increading rapidly, with the result that BVAS growth rate estimates increase markedly as more data become available and it becomes increasingly clear that these lineages were the fittest lineages yet observed. Importantly, estimates stabilize after sufficient data have been collected, see Fig CC in S1 File.

### 4.5 Comparison to MAP and PyR_{0}

We perform detailed comparisons of BVAS to both MAP [12] and PyR_{0} [11]. Here we provide a brief summary; see Sec. S14.5–6 in S1 File and Fig DD-NN in S1 File for a detailed discussion. Despite differences in methodology, results from BVAS, MAP and PyR_{0} are in broad *qualitative* agreement, suggesting that all three methods are capable of identifying (at least some) leading driver mutations in SARS-CoV-2. Because it is only very recently that genomic surveillance data have become available at this scale and the corresponding analysis methods are still in their infancy, we believe this finding is encouraging for this emerging new field.

While it is difficult to definitively establish the superiority of one method over another without a larger corpus of experimental data to serve as ground truth, we believe that the advantages of BVAS become apparent upon closer comparison. First, inferred selection effects are much sparser in the case of BVAS, which aids interpretability and is arguably more plausible a priori. Selection coefficients inferred by MAP are very dense unless the regularization parameter *γ*_{reg} = *τ* / *ν* is made sufficiently large. However, in this limit growth rate estimates appear to be over-regularized towards Wuhan A. Second, uncertainty estimates for MAP and (especially) PyR_{0} are much narrower than for BVAS, especially for allele-level quantities like selection coefficients. Finally, BVAS exhibits much less sensitivity to hyperparameter choices than PyR_{0}, which—together with rigorous sparsity requirements—is one of the key factors contributing to the strong performance of BVAS in simulations.

### 4.6 Vaccination-dependent effects

We incorporate independent data on vaccination rates from 127 regions compiled by OWID (see Sec. S7 and Sec. S14.1.6 in S1 File for additional details). This has little impact on estimates for most vaccination-independent effects (Fig PP in S1 File), i.e. BVAS finds that selection effects in the data are largely explainable by vaccination-independent effects. Indeed we find only two alleles with large PIPs in the vaccination-dependent model (Table 4).

The strongest evidence for allele-level dependence on vaccination rates is in S:N501Y, which is also the only allele for which BVAS finds it important to include both a vaccination-dependent effect *α* *and* a vaccination-independent effect *β*. In our analysis the selection coefficient estimates for S:N501Y are *α* ≈ −0.15 and *β* ≈ 0.41. This means that the model predicts a selection effect due to the S:N501Y allele of *β* ≈ 0.41 in a completely unvaccinated population and a selection effect of *α* + *β* ≈ 0.26 in a completely vaccinated population. This can be interpreted as saying that vaccination appears to confer differential protection against the S:N501Y allele, i.e. on top of the protection that is conferred against typical SARS-CoV-2 variants that do not carry S:N501Y. Notably, S:N501Y also exhibits a large PIP in an analysis conducted with a different definition of vaccination rate (Table E in S1 File).

These results are also supported by a direct analysis of raw allele frequencies of S:N501Y. Indeed S:N501Y exhibited a rapid rise and fall in prevalence in Spring 2021 (Fig QQ in S1 File), at the same time as vaccination rates were ramping up in many of the regions in our dataset. Moreover, while the behavior of S:N501Y is partially explained by the rise and fall of B.1.1.7 (Alpha), S:N501Y came to prominence in some regions (notably Brazil) via P.1 (Gamma) where B.1.1.7 was never dominant. Finally, the change in frequency of S:N501Y is correlated to vaccination status (Fig RR in S1 File); this correlation is stronger than the correlation between changes in B.1.1.7 frequency and vaccination status.

S:N501Y thus appears to exhibit vaccination-dependent selection, which explains its relative disappearance from the variant landscape over time as vaccination rates have increased. The precise mechanism underlying this behavior is unclear, but several authors have recently shown that S:N501Y exerts epistatic effects on other mutations by altering their antibody escape properties [49–51]. We hypothesize that S:N501Y serves as a linchpin residue within the RBD that constrains the possibilities for vaccine escape when present.

### 4.7 Epistasis

Mounting experimental evidence for epistasis in SARS-CoV-2 [49, 52] raises the question whether these kinds of effects can be inferred from genomic surveillance data. Doing so is expected to be difficult a priori due to the combinatorially large space of possible interaction effects coupled with the fact that many combinations of mutations are unobserved in available data. In this context we expect that explicit sparsity assumptions and high-fidelity statistical inference—key selling points of BVAS—are likely to be crucial.

Here we report initial results of such an analysis, focusing on pairwise interaction effects between non-synonymous mutations in the Spike protein. In particular we consider pairwise interactions between the 421 S mutations in our main analysis—after excluding D614G because it became fixed early in the pandemic—which corresponds to 88410 pairwise interactions. We further exclude pairs of mutations that are not observed together in at least two SARS-CoV-2 clusters—yielding 1432 pairwise interactions—and use BVAS to jointly infer selection effects for the resulting 2975 linear and 1432 pairwise effects. See Sec. S14.9 in S1 File for details.

We summarize our results as a pair of ‘interaction networks’ in Fig 11 and report more detailed results in Table F in S1 File. We find that top-scoring interaction effects in our analysis are enriched for interactions between mutations that correspond to top-scoring (linear) effects in our default analysis (*P* ≤ 10^{−4}). Put differently, we find limited evidence for interaction effects between pairs of mutations where one or both mutations are inferred as approximately neutral on their own.

We depict the two interaction networks of pairwise selection effects in Spike inferred by BVAS. Edge widths are proportional to the posterior inclusion probability of the corresponding pairwise effect. Spatial orientation has no meaning.

The interaction effect with the largest PIP is between N501Y in the RBD and P681H adjacent to the furin cleavage site. These two mutations appear together in BA.1/B.1.1.7, while N501Y appears without P681H in P.1/B.1.351 and P681H appears without N501Y in B.1.1.159/B.1.243 (Table F in S1 File). As such there is a plausible evidential basis for inferring the interaction between N501Y and P681H, with each combination of these amino acids appearing in at least 60k SARS-CoV-2 sequences in our dataset. The importance of N501Y in epistatic interactions is concordant with data from [49], who found that N501Y caused the largest shifts in the effects of mutations at other sites using deep mutational scanning libraries of RBD and measuring the impact of every amino acid mutation on ACE2-binding affinity.

Among subleading hits, we find a less well resolved picture due to linkage between different pairs of amino acid mutations, which is reflected in the more moderate PIP scores. For example, all SARS-CoV-2 clusters in our data that carry the pair of amino acids (N501Y, A67V) also carry the pairs (N501Y, P681H) and (A67V, E484A)—but not vice versa. This motivates the graphical representation in Fig 11, which reflects the multiplicity of several amino acid mutations in top-scoring interaction effects—in particular N501Y, P681H, A67V, and L452R.

We expect that resolving these putative interactions in greater detail necessitates follow-up in the lab. Mechanisms of epistasis are likely pleiotropic, but some interactions likely arise from a need to a) incorporate mutations that confer immune escape against the current landscape of circulating variants while b) maintaining protein structure and function. This behavior has been demonstrated for the recurrent S:69–70 deletion [53]. Because the space of possible combinations of alleles is very large, most studies have characterized mutations either individually or in the combinations present in common lineages. By ranking combinations of mutations according to evidence of their selective effects, our model may help focus experimental design in this challenging combinatorial setting.

## 5 Discussion

Bayesian Viral Allele Selection unifies and improves upon the two available methods for inferring selection from large scale genomic surveillance data. One of its strengths is that it makes clear assumptions: i) most alleles are neutral; and ii) viral dynamics is governed by an intuitive discrete time branching process. Other advantages of BVAS include its robustness to hyperparameter choices, its satisfactory uncertainty estimates and the fact that it offers Posterior Inclusion Probabilities. Moreover, the diffusion-based likelihood in Eq (9) is robust to a number of sources of possible bias, including varying sampling rates across time and space and changes in fitness that affect all lineages equally (due to e.g. lockdown measures or variable temperature/humidity).

We highlight the following limitations and refer the reader to Sec. S12 in S1 File for additional discussion. Estimating the effective population size *ν* is challenging, especially since *ν* can exhibit significant variability across time. While we have argued that sensitivity to *ν* is fairly moderate, improved *ν* estimates should lead to improved statistical efficiency, especially if *ν* can be estimated with finer spatial and temporal granularity. Doing so would likely require incorporating additional sources of data (e.g. case counts) and represents an important direction for future work. Several of the simplifying assumptions that underly BVAS are expected to be violated at some level in real world data. Notably, our basic fitness model is unable to account for epistasis (see Eq (2)). While we have extended this linear model to include pairwise interactions, we have limited this analysis to the Spike protein. Because a genome wide epistasis analysis must contend with millions of possible interactions a priori, additional assumptions to reduce the space of selection effects considered are likely required to make this kind of analysis statistically and computationally tractable. For example, one might limit the analysis to pairs of mutations that are near each other in space.

In summary, BVAS provides a principled statistical and computational framework to identify selection under the constraint of sparsity. Applying BVAS to 6.9 million SARS-CoV-2 genomes provides a detailed picture of viral selection in action. We anticipate that BVAS will be widely applicable to SARS-CoV-2 and other viruses as large scale genomic surveillance data become increasingly available.

## Supporting information

### S1 File. Supplementary figures, tables, and methodological details.

This supplementary pdf contains Figures A-TT and Tables A-F as well as detailed descriptions of our data processing pipeline, modeling assumptions, and inference approach.

https://doi.org/10.1371/journal.pgen.1010540.s001

(PDF)

### S2 File. Allele-level inference results.

This csv file contains allele-level inference results for our main BVAS analysis utilizing data collected through April 18^{th}, 2022.

https://doi.org/10.1371/journal.pgen.1010540.s002

(CSV)

### S3 File. Lineage-level inference results.

This csv file contains lineage-level inference results for our main BVAS analysis utilizing data collected through April 18^{th}, 2022.

https://doi.org/10.1371/journal.pgen.1010540.s003

(CSV)

## Acknowledgments

We gratefully acknowledge colleagues from the originating laboratories responsible for obtaining SARS-CoV-2 specimens. Likewise we gratefully acknowledge colleagues from the submitting laboratories where genetic sequence data were generated and shared via the GISAID initiative. This research would not be possible without their collective efforts; see data availability statement for more information on the data used. We warmly thank Angie Hinrichs for providing the UShER tree that forms a key component of our data pre-processing pipeline. This work would not be possible without her gracious assistance. We also thank Nikolaos Barkas, Stephen F. Schaffner, Jesse D. Pyle, Lonya Yurkovetskiy, Matteo Bosso, Daniel J. Park, Mehrtash Babadi, Bronwyn L. MacInnis, Jeremy Luban, and Pardis C. Sabeti for discussions about SARS-CoV-2.

## References

- 1. MacLean OA, Orton RJ, Singer JB, Robertson DL. No evidence for distinct types in the evolution of SARS-CoV-2. Virus Evolution. 2020;6(1):veaa034. pmid:32817804
- 2. Yurkovetskiy L, Wang X, Pascal KE, Tomkins-Tinch C, Nyalile TP, Wang Y, et al. Structural and functional analysis of the D614G SARS-CoV-2 spike protein variant. Cell. 2020;183(3):739–751. pmid:32991842
- 3. Deng X, Garcia-Knight MA, Khalid MM, Servellita V, Wang C, Morris MK, et al. Transmission, infectivity, and neutralization of a spike L452R SARS-CoV-2 variant. Cell. 2021;184(13):3426–3437. pmid:33991487
- 4. Starr TN, Greaney AJ, Hilton SK, Ellis D, Crawford KH, Dingens AS, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell. 2020;182(5):1295–1310. pmid:32841599
- 5. Choi B, Choudhary MC, Regan J, Sparks JA, Padera RF, Qiu X, et al. Persistence and evolution of SARS-CoV-2 in an immunocompromised host. New England Journal of Medicine. 2020;383(23):2291–2293. pmid:33176080
- 6. Greaney AJ, Starr TN, Gilchuk P, Zost SJ, Binshtein E, Loes AN, et al. Complete mapping of mutations to the SARS-CoV-2 spike receptor-binding domain that escape antibody recognition. Cell host & microbe. 2021;29(1):44–57.
- 7. Pybus OG, Rambaut A. Evolutionary analysis of the dynamics of viral infectious disease. Nature Reviews Genetics. 2009;10(8):540–550. pmid:19564871
- 8. Morel B, Barbera P, Czech L, Bettisworth B, Hübner L, Lutteropp S, et al. Phylogenetic analysis of SARS-CoV-2 data is difficult. Molecular biology and evolution. 2021;38(5):1777–1791. pmid:33316067
- 9. Davies NG, Jarvis CI, Edmunds WJ, Jewell NP, Diaz-Ordaz K, Keogh RH. Increased mortality in community-tested cases of SARS-CoV-2 lineage B. 1.1. 7. Nature. 2021;593(7858):270–274. pmid:33723411
- 10. Volz E, Mishra S, Chand M, Barrett JC, Johnson R, Geidelberg L, et al. Assessing transmissibility of SARS-CoV-2 lineage B. 1.1. 7 in England. Nature. 2021;593(7858):266–269. pmid:33767447
- 11. Obermeyer F, Jankowiak M, Barkas N, Schaffner SF, Pyle JD, Yurkovetskiy L, et al. Analysis of 6.4 million SARS-CoV-2 genomes identifies mutations associated with fitness. Science. 2022. pmid:35608456
- 12. Lee B, Sohail MS, Finney E, Ahmed SF, Quadeer AA, McKay MR, et al. Inferring effects of mutations on SARS-CoV-2 transmission from genomic surveillance data. medRxiv. 2022; p. 2021–12.
- 13. Lloyd-Smith JO, Schreiber SJ, Kopp PE, Getz WM. Superspreading and the effect of individual variation on disease emergence. Nature. 2005;438(7066):355–359. pmid:16292310
- 14. Althouse BM, Wenger EA, Miller JC, Scarpino SV, Allard A, Hébert-Dufresne L, et al. Superspreading events in the transmission dynamics of SARS-CoV-2: Opportunities for interventions and control. PLoS biology. 2020;18(11):e3000897. pmid:33180773
- 15. Kimura M. Diffusion models in population genetics. Journal of Applied Probability. 1964;1(2):177–232.
- 16. Lacerda M, Seoighe C. Population genetics inference for longitudinally-sampled mutants under strong selection. Genetics. 2014;198(3):1237–1250. pmid:25213172
- 17. Terhorst J, Schlötterer C, Song YS. Multi-locus analysis of genomic time series data from experimental evolution. PLoS genetics. 2015;11(4):e1005069. pmid:25849855
- 18. Ferrer-Admetlla A, Leuenberger C, Jensen JD, Wegmann D. An approximate Markov model for the Wright–Fisher diffusion and its application to time series data. Genetics. 2016;203(2):831–846. pmid:27038112
- 19. Sohail MS, Louie RH, McKay MR, Barton JP. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Nature biotechnology. 2021;39(4):472–479. pmid:33257862
- 20. Chipman H, George EI, McCulloch RE, Clyde M, Foster DP, Stine RA. The practical implementation of Bayesian model selection. Lecture Notes-Monograph Series. 2001; p. 65–134.
- 21. Zanella G, Roberts G. Scalable importance tempering and Bayesian variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2019;81(3):489–517.
- 22. Lau MS, Grenfell B, Thomas M, Bryan M, Nelson K, Lopman B. Characterizing superspreading events and age-specific infectiousness of SARS-CoV-2 transmission in Georgia, USA. Proceedings of the National Academy of Sciences. 2020;117(36):22430–22435. pmid:32820074
- 23. Endo A, et al. Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China. Wellcome open research. 2020;5. pmid:32685698
- 24. Bi Q, Wu Y, Mei S, Ye C, Zou X, Zhang Z, et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study. The Lancet infectious diseases. 2020;20(8):911–919. pmid:32353347
- 25. Miller D, Martin MA, Harel N, Tirosh O, Kustin T, Meir M, et al. Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel. Nature communications. 2020;11(1):1–10. pmid:33139704
- 26. Elbe S, Buckland-Merrett G. Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global challenges. 2017;1(1):33–46. pmid:31565258
- 27. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics. 2021;53(6):809–816. pmid:33972780
- 28. McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Molecular biology and evolution. 2021;38(12):5819–5824. pmid:34469548
- 29. Rambaut A, Holmes EC, O’Toole Á, Hill V, McCrone JT, Ruis C, et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature microbiology. 2020;5(11):1403–1407. pmid:32669681
- 30. Ritchie H, Mathieu E, Rodés-Guirao L, Appel C, Giattino C, Ortiz-Ospina E, et al. Coronavirus Pandemic (COVID-19). Our World in Data. 2020;.
- 31. Tegally H, Moir M, Everatt J, Giovanetti M, Scheepers C, Wilkinson E, et al. Continued Emergence and Evolution of Omicron in South Africa: New BA.4 and BA.5 lineages. medRxiv. 2022.
- 32. Khan K, Karim F, Ganga Y, Bernstein M, Jule Z, Reedoy K, et al. Omicron sub-lineages BA.4/BA.5 escape BA.1 infection elicited neutralizing immunity. medRxiv. 2022. pmid:35948557
- 33. Cao Y, Yisimayi A, Jian F, Song W, Xiao T, Wang L, et al. BA.2.12.1, BA.4 and BA.5 escape antibodies elicited by Omicron infection. Nature. 2022;608(7923):593–602. pmid:35714668
- 34. Colson P, Fournier PE, Delerce J, Million M, Bedotto M, Houhamdi L, et al. Culture and identification of a “Deltamicron” SARS-CoV-2 in a three cases cluster in southern France. Journal of Medical Virology. 2022;.
- 35. Jackson B, Boni MF, Bull MJ, Colleran A, Colquhoun RM, Darby AC, et al. Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic. Cell. 2021;184(20):5179–5188. pmid:34499854
- 36. VanInsberghe D, Neish AS, Lowen AC, Koelle K. Recombinant SARS-CoV-2 genomes circulated at low levels over the first year of the pandemic. Virus Evolution. 2021;7(2):veab059.
- 37. Li Q, Wu J, Nie J, Zhang L, Hao H, Liu S, et al. The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity. Cell. 2020;182(5):1284–1294. pmid:32730807
- 38. Liu Z, VanBlargan LA, Bloyet LM, Rothlauf PW, Chen RE, Stumpf S, et al. Identification of SARS-CoV-2 spike mutations that attenuate monoclonal and serum antibody neutralization. Cell host & microbe. 2021;29(3):477–488. pmid:33535027
- 39. Weisblum Y, Schmidt F, Zhang F, DaSilva J, Poston D, Lorenzi JC, et al. Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants. Elife. 2020;9:e61312. pmid:33112236
- 40. Iketani S, Liu L, Guo Y, Liu L, Chan JFW, Huang Y, et al. Antibody evasion properties of SARS-CoV-2 Omicron sublineages. Nature. 2022; p. 1–4. pmid:35240676
- 41. Greaney AJ, Starr TN, Bloom JD. An Antibody-Escape Estimator for Mutations to the SARS-CoV-2 Receptor-Binding Domain. Virus Evolution. 2022;. pmid:35573973
- 42. Kistler KE, Bedford T. Evidence for adaptive evolution in the receptor-binding domain of seasonal coronaviruses OC43 and 229e. eLife. 2021;10:e64509. pmid:33463525
- 43. Saito A, Irie T, Suzuki R, Maemura T, Nasser H, Uriu K, et al. Enhanced fusogenicity and pathogenicity of SARS-CoV-2 Delta P681R mutation. Nature. 2022;602(7896):300–306. pmid:34823256
- 44. Mohammad A, Abubaker J, Al-Mulla F. Structural modelling of SARS-CoV-2 alpha variant (B. 1.1. 7) suggests enhanced furin binding and infectivity. Virus Research. 2021;303:198522. pmid:34314772
- 45. Lista MJ, Winstone H, Wilson HD, Dyer A, Pickering S, Galao RP, et al. The P681H mutation in the Spike glycoprotein confers Type I interferon resistance in the SARS-CoV-2 alpha (B. 1.1. 7) variant. bioRxiv. 2021;.
- 46. Syed AM, Taha TY, Tabata T, Chen IP, Ciling A, Khalid MM, et al. Rapid assessment of SARS-CoV-2–evolved variants using virus-like particles. Science. 2021;374(6575):1626–1632. pmid:34735219
- 47. Huang Y, Yang C, Xu Xf, Xu W, Liu Sw. Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19. Acta Pharmacologica Sinica. 2020;41(9):1141–1149. pmid:32747721
- 48. Zhou T, Tsybovsky Y, Gorman J, Rapp M, Cerutti G, Chuang GY, et al. Cryo-EM structures of SARS-CoV-2 spike without and with ACE2 reveal a pH-dependent switch to mediate endosomal positioning of receptor-binding domains. Cell host & microbe. 2020;28(6):867–879.
- 49. Starr TN, Greaney AJ, Hannon WW, Loes AN, Hauser K, Dillen JR, et al. Shifting mutational constraints in the SARS-CoV-2 receptor-binding domain during viral evolution. Science. 2022;377(6604):420–424. pmid:35762884
- 50. Zahradník J, Marciano S, Shemesh M, Zoler E, Harari D, Chiaravalli J, et al. SARS-CoV-2 variant prediction and antiviral drug design are enabled by RBD in vitro evolution. Nature microbiology. 2021;6(9):1188–1198. pmid:34400835
- 51. Bate N, Savva CG, Moody PC, Brown EA, Evans SE, Ball JK, et al. In vitro evolution predicts emerging SARS-CoV-2 mutations with high affinity for ACE2 and cross-species binding. PLoS pathogens. 2022;18(7):e1010733. pmid:35849637
- 52. Javanmardi K, Segall-Shapiro TH, Chou CW, Boutz DR, Olsen RJ, Xie X, et al. Antibody escape and cryptic cross-domain stabilization in the SARS-CoV-2 Omicron spike protein. Cell Host & Microbe. 2022;30(9):1242–1254.e6. pmid:35988543
- 53. Meng B, Kemp SA, Papa G, Datir R, Ferreira IA, Marelli S, et al. Recurrent emergence of SARS-CoV-2 spike deletion H69/V70 and its role in the Alpha variant B. 1.1. 7. Cell reports. 2021;35(13):109292. pmid:34166617