Inferring selection effects in SARS-CoV-2 with Bayesian Viral Allele Selection

The global effort to sequence millions of SARS-CoV-2 genomes has provided an unprecedented view of viral evolution. Characterizing how selection acts on SARS-CoV-2 is critical to developing effective, long-lasting vaccines and other treatments, but the scale and complexity of genomic surveillance data make rigorous analysis challenging. To meet this challenge, we develop Bayesian Viral Allele Selection (BVAS), a principled and scalable probabilistic method for inferring the genetic determinants of differential viral fitness and the relative growth rates of viral lineages, including newly emergent lineages. After demonstrating the accuracy and efficacy of our method through simulation, we apply BVAS to 6.9 million SARS-CoV-2 genomes. We identify numerous mutations that increase fitness, including previously identified mutations in the SARS-CoV-2 Spike and Nucleocapsid proteins, as well as mutations in non-structural proteins whose contribution to fitness is less well characterized. In addition, we extend our baseline model to identify mutations whose fitness exhibits strong dependence on vaccination status as well as pairwise interaction effects, i.e. epistasis. Strikingly, both these analyses point to the pivotal role played by the N501 residue in the Spike protein. Our method, which couples Bayesian variable selection with a diffusion approximation in allele frequency space, lays a foundation for identifying fitness-associated mutations under the assumption that most alleles are neutral.


S1 From discrete-time branching process to diffusion
We provide an overview of the population genetics based approach we take to formulating a likelihood that connects observed count data to unobserved selection coefficients. For additional details (in somewhat different notation) please refer to [1].

S1.1 Negative Binomial branching process
A natural model of viral infection is to suppose that an infected individual at time step t infects a stochastic number of individuals in the next time step, where the number of infected individuals is governed by a Negative Binomial distribution with mean R and variance R`R 2 {k, where k is the dispersion parameter. In the limit that k Ñ 8 we recover a Poisson distribution with mean and variance both equal to R. In the opposite 'super-spreading' limit, k Ñ 0, the Negative Binomial distribution has a (potentially very large) variance that scales like k´1. We now suppose that there are n infected individuals at time step t. Under this assumption the total number of infected individuals at time step t`1 is governed by a Negative Binomial distribution with mean nR and variance given by pnRq`pnRq 2 {pnkq " nR`nR 2 {k.
To make the model more realistic we suppose there are V distinct viral variants in circulation. Each viral variant v has an effective reproduction number given by R v " R 0 p1`∆R v q where R 0 corresponds to the wild type. At the current time step t there are n v ptq individuals infected with variant v. Given our assumptions, the number of infected individuals with variant v at the next time step is governed by the distribution where we have assumed that k is the same for all V variants and disregard the possibility of mutation.

S1.2 Diffusion in variant frequency space
Following [1] we now transform from case counts n v ptq to variant frequencies z v ptq: In the limit of large population size, i.e. the diffusion limit n " 1, and under the assumption that |∆R v | ! 1, [1] compute approximate first and second moments for the variant frequencies z v ptq under the dynamics in Eqn. 1. These moments can in turn be used to define an (approximately) equivalent diffusion process for continuous-valued variant frequencies z v ptq P r0, 1s. In particular one finds that zptq is approximately distributed like a Multivariate-Normal random variable as zpt`1q " N pzptq`dptq, ν´1Λptqq wheredptq P R V is the V -dimensional drift (4) andΛptq is the VˆV (unscaled) diffusion matrix given bỹ and ν is the effective population size given by ν "´1 R0`1 k¯´1 n " kR0 k`R0 n Eqn. 3 has a simple and intuitive form with the properties we would expect. For example, the driftdptq equals zero and thus the mean of zpt`1q is equal to zptq if all variants have the same reproduction number (i.e. ∆R v " 0 for all v). Similarly, the variant v with the largest ∆R v will tend to see its prevalence increase, i.e. Erz v pt`1q´z v ptqs ą 0. For fixed n the effective population size ν decreases as k decreases, i.e. as the dynamics become increasingly characterized by super-spreading and are thus increasingly stochastic. In particular large populations experience approximately deterministic dynamics dominated by drift, while small populations experience considerably more stochastic dynamics in which the drift is more difficult to discern.

S1.3 Diffusion in allele frequency space
Following [1], it is convenient to transform to a representation based on allele frequencies so that we can model allele-level selection. Suppose the virus has A alleles with time-dependent allele frequencies 0 ď x a ptq ď 1 for a " 1, ..., A, i.e. x a ptq is the fraction of infected individuals at time t infected by a virus that carries allele a. Denote pairwise allele frequencies as x ab ptq, i.e. x ab ptq is the fraction of infected individuals at time t infected by a virus that carries both allele a and allele b. For convenience we define x aa ptq " x a ptq. Further suppose that each variant v is given as a genotype vector g v P t0, 1u A , i.e. g v,a encodes whether variant v carries allele a (g v,a " 1) or not (g v,a " 0). We also assume that the variant-level differential reproduction number ∆R v is governed by a linear additive model in allele-space: where β a P R is an allele-level selection coefficient with positive (respectively, negative) coefficients corresponding to increased (decreased) viral fitness. With these ingredients in hand and under the linear assumption in Eqn. 7 we can now specify the dynamics in allele frequency space, where the dynamics in Eqn. 3 are transformed to xpt`1q " N pxptq`dptq, ν´1Λptqq where dptq P R A is the A-dimensional drift, 1 given by dptq " Λptqβ ðñ d a ptq " x a ptqp1´x a ptqqβ a`ÿ b‰a px ab ptq´x a ptqx b ptqq β b (9) and ν is as in Eqn. 6. Here Λptq, analog to the diffusion matrix in Eqn. 5, is given by the AˆA matrix Λ ab ptq " x ab ptq´x a ptqx b ptq (10) Note that whenever alleles a and b exhibit non-trivial correlation, i.e. whenever x ab ptq ‰ x a ptqx b ptq, d a ptq depends on β b (in addition to β a ). In other words, a neutral allele can be 'dragged along' (i.e. exhibit apparent drift) by a non-neutral allele via hitchhiking. This effect is sometimes called genetic draft.
It is convenient to introduce incremental allele frequency changes yptq " xpt`1q´xptq (11) so that Eqn. 8 becomes Eqn. 12, together with the definitions in Eqn. 9-11, forms the basis of everything that follows.

S1.4 Relationship to PyR 0
An alternative approach to inferring allele-level selection effects is the PyR 0 model described in [2]. At its core this hierarchical Bayesian model relies on the following multivariate logistic growth ansatz where zptq are variant frequences like in Eqn. 3 and ∆R v are variant-level differential growth rates.
In direct analogy to the approach adopted in this work as well as [1] PyR 0 proceeds to regress ∆R v against allele-level features (cf. Eqn. 7). How is the PyR 0 ansatz in Eqn. 13 related to the diffusion-based dynamics in Eqn. 3? Eqn. 3 can be viewed as a specification of a stochastic differential equation in which there is deterministic driftdptq as well as stochastic diffusion controlled byΛptq.
In the infinite population limit, ν Ñ 8, the dynamics are deterministic and reduce to an ordinary differential equation. It is easy to check that Eqn. 13 is a solution to precisely that differential equation, namely d dt zptq "dptq (14) It would thus appear that the PyR 0 likelihood can be viewed as the deterministic counterpart of the diffusion-based likelihood in Eqn. 3. However, this is not quite correct, since PyR 0 uses the ansatz in Eqn. 13 to parameterize the probabilities of a V -dimensional Multinomial distribution. That is, PyR 0 describes an infinite population undergoing deterministic dynamics, with only a finite number of viral sequences observed at each time point. As a consequence 2 the noise structure of the PyR 0 Multinomial likelihood mirrors that of Eqn. 3, with the difference that the Multinomial likelihood implicitly assumes a fixed population size of n, where n is the total number of observed viral sequences at a given time step. Importantly, it does not allow for the reduction in effective population size that results from super-spreading (small k, see Eqn. 6) or a finite sampling rate (see Sec. S8). As we will see in our analysis of SARS-CoV-2 data, the effective population size in real data is quite modest, even in relatively well-sampled regions. Thus the stochastic component of the dynamics is expected to play an important role and, we would argue, should be accounted for explicitly in the model using a data-driven estimate of the effective population size. The ability to do so is one of the main benefits of using a diffusion-based likelihood.

S2 Maximum a posteriori inference (MAP)
The simplest model that utilizes the diffusion-based likelihood in Eqn. 12 is formulated as follows (we refer the reader to [1] for additional discussion). First we place a Multivariate-Normal prior on the A-dimensional vector of selection coefficients β ppβ|τ q " N pβ|0, τ´11 A q (15) where τ ą 0 is the prior precision on the coefficients and 1 A is the AˆA identity matrix. For observed incremental allele frequency changes yptq for t " 1, ..., T´1 the likelihood is given by where we have assumed that ν is constant across time. Since β appears linearly in the drift dpt|βq (see Eqn. 9) and the prior is Multivariate-Normal, the corresponding posterior distribution, which is given by can be computed in closed form and is itself a Multivariate-Normal distribution. In particular the posterior mean, which in this case is also the maximum a posteriori (MAP) estimate, is given by " E ppβ|y 1:T´1 ,ν,τ q rβs We note that the effective regularization parameter for MAP is given by the ratio between the prior precision τ and the effective population size ν: An attractive property of this estimator is that it can be computed in OpA 3 q time and is thus quite fast on modern hardware, at least for A up to A " 10 4´1 0 5 . An unattractive property of this estimator is that it can perform poorly in the high-dimensional case, A " 1, since we expect most alleles to be neutral, but β MAP a will generally be non-zero for all a.

S3 Bayesian Viral Allele Selection
We expect most alleles to be neutral (β a " 0) or nearly neutral (β a « 0) and we would like to explicitly include this assumption in our model. To do so we utilize the modeling motif of Bayesian Variable Selection [3]. In more detail we consider the following space of models: rselection coefficientss β γ " N p0, τ´11 |γ| q rallele frequency changess yptq " N pdpt|β γ q, ν´1Λptqq for t " 1, ..., T´1 Here each Bernoulli latent variable γ a P t0, 1u controls whether the a th coefficient β a is included (γ a " 1) or excluded (γ a " 0) from the model; in other words it controls whether the a th allele is neutral (γ a " 0) or not (γ a " 1). In the following we use γ to refer to the full A-dimensional vector pγ 1 , ..., γ A q. The hyperparameter h P p0, 1q controls the overall level of sparsity; in particular hA is the expected number of non-neutral alleles a priori. 3 The |γ| coefficients β γ P R |γ| are governed by a Normal prior with precision τ where τ ą 0 is a fixed hyperparameter. Here |γ| P t0, 1, ..., Au denotes the total number of non-neutral alleles in a given model. Finally we note that the drift dpt|β γ q a is given by Λptq aγ β γ , i.e. all alleles b not included in the model (alleles with γ b " 0) can be understood as having null coefficients β b " 0. In the following we drop the γ subscript on β γ to simplify the notation.
In addition to inducing sparsity, an attractive feature of the model in Eqn. 23 is that-because it is formulated as a model selection problem-it explicitly reasons about whether each allele is neutral or not. In particular this model allows us to compute the Posterior Inclusion Probability or PIP, an interpretable score that satisfies 0 ď PIP ď 1. The PIP is defined as PIPpaq " ppγ a " 1|y 1:T´1 , νq i.e. PIPpaq is the posterior probability that allele a is included in the model. This quantity should be contrasted to h in Eqn. 23, which is the a priori inclusion probability. Alleles that have large PIPs are good candidates for being causally linked to viral fitness.

S4 MCMC Inference
In this section we describe our approach to inference for the model in Sec. S3. Before we do so it is worth emphasizing that this is a difficult inference problem. There are two basic reasons for this. First, this is a transdimensional inference problem defined on a mixed discrete/continuous latent space. In particular the dimension of β depends on |γ|. Second, the size of the model space is astronomically large whenever there are more than a few dozen alleles. Indeed for A alleles the total number of distinct models, namely 2 A , exceeds the estimated number of atoms in the known universe (" 10 80 ) for A Á 266. However, the model also exhibits conjugacy structure that we can exploit. In particular, as we discuss next, β can be integrated out analytically for any given γ.

S4.1 Integrating out β
Recall from Eqn. 9 that the drift is given by dptq " Λptqβ so that each term in the likelihood (see Eqn. 12) can be written as where we have ignored the pre-factor of´ν 2 and dropped terms that do not depend on β. After re-introducing´ν 2 , each term in our likelihood contains the following quadratic form: (27) With this formula in hand we can compute the log marginal likelihood in closed form for any particular γ; up to irrelevant constants we have: log ppy 1:T´1 |γq " 1 2 s y νT γ`s Λ ν γγ`τ 1 |γ|˘´1 s y ν γ´1 2 log det`s Λ ν γγ`τ 1 |γ|˘`1 2 log det`τ 1 |γ|˘ ( 28) where Eqn. 28 can be computed efficiently using basic linear algebra tricks described in e.g. [4] and [5]. It is also important to note that s y ν and s Λ ν only need to be computed once before we run MCMC inference. In particular this means that MCMC iteration cost does not depend on T , since the summation over t in Eqn. 29 can be done in pre-processing.
Crucially, since β can be integrated out analytically, our inference problem reduces to an inference problem over γ-space. This is convenient because γ P t0, 1u A has fixed dimensionality and so we no longer need to concern ourselves with transdimensional inference.

S4.2 MCMC
The resulting inference problem is still difficult, since the posterior over γ is defined on a space of size 2 A . Gibbs sampling offers a simple approach to inference for this model. However, Gibbs sampling mixes excruciatingly slowly unless A is very small. To obtain an efficient algorithm we utilize a recently introduced MCMC inference algorithm, Tempered Gibbs Sampling, described in [4]. 4 Here we give a very high-level description of this algorithm and refer the reader to [4] for more details.
We emphasize two main features of this algorithm. The first feature is that the MCMC algorithm targets an auxiliary posterior distribution that is tempered w.r.t. our original target distribution. Samples from the original target distribution are then obtained by reweighting, i.e. importance sampling. Crucially the algorithm is designed in such a way that tempering dramatically improves mixing while the variance of the importance weights remains moderate. Second, the algorithm makes extensive use of the fact that Eqn. 28 can be computed in closed form. In particular it makes use of the fact that, for any given value of γ, it is relatively cheap to compute the quantity ppγ a |γ´a, y 1:T´1 q for all a in parallel. Here γ´a denotes all the entries of γ except for a. This information is then used by the algorithm to make greedy moves in the immediate neighborhood of γ. Consequently the algorithm is able to quickly and efficiently explore regions of γ space with high posterior probability.

S4.3 Computational complexity
The computational complexity of a single MCMC iteration is given by Op|γ| where |γ| is the number of non-neutral alleles at a given iteration and A is the total number of alleles. If we assume that the sparsity assumption is valid so that |γ| ! A the computational complexity simplifies to Op|γ| 2 Aq and is dominated by the cost of computing ppγ a |γ´a, y 1:T´1 q for a " 1, ..., A. In practice we find that this is quite fast on a mid-grade GPU. For example, for A " 2904 and using a NVIDIA Tesla T4 GPU it takes about 45 minutes to generate half a million MCMC samples, which is about 200 samples per second.

S5 Multiple spatial regions
Above we have assumed that observed allele frequencies xptq are from a single spatial region. The generalization to multiple regions is straightforward. First we add a r " 1, ..., N R subscript to the likelihood: y r ptq " N pd r ptq, ν´1 r Λ r ptqq for r " 1, ..., N R and t " 1, ..., T´1 Depending on the strategy used to estimate the effective population size (see Sec. S6) each ν r in Eqn. 30 is either distinct or is instead a single global value. The equation for the marginal log likelihood, Eqn. 28, is unchanged except we now write Eqn. 29 as This equation makes it evident that our effective population size estimates play an important role, since they control the relative contribution of different regions to the region-summed totals in s y ν and s Λ ν . In particular using a per-regionν estimator has the consequence that regions with larger effective population sizes contribute more to the likelihood than regions with small effective population sizes. This is of course as expected, since regions with smaller effective population sizes are more stochastic and thus contribute less certain information about the drift. We discuss our strategy for estimating ν in the next section.

S6 Inferring the effective population size ν
The likelihood in Eqn. 12 depends on the effective population size ν, a quantity that we do not know a priori and need to estimate from data. By assumption the increment yptq is distributed as Under this assumption If we assume that the drift term is subdominant so that this leads to the estimatorν We note that, since dptq T dptq ě 0, we might expectν to be an underestimate of ν.
Thus for a particular region we can estimate the effective population size aŝ where we have averaged Eqn. 35 over T´1 time steps. We can also partially correct for the bias introduced by the (a priori unknown and non-zero) drift by instead using a centered estimator: denoting time-integrated allele increments. We find that both estimators in Eqn. 36 and Eqn. 37 work well in simulation, but for simplicity we use Eqn. 36 throughout this work. Indeed it is also possible to constructν estimators that are more time-local in nature and can thus accommodate population sizes that change over time, but for simplicity we prefer to use the hyperparameter-free estimator in Eqn. 36, as we expect this choice to be more robust. More complex schemes are certainly possible, but it is likely that they would require additional sources of data beyond (partially observed) genomic surveillance data if they are to be reliable (e.g. case count data). The discussion above assumes a single spatial region. How can we accommodate multiple regions? We adopt two possible schemes. In the global scheme we computeν within each region using Eqn. 36 and then compute a single global effective population sizeν via a mean or median over all regions. In the regional scheme we simply computeν within each region using Eqn. 36. Since we expect the latter scheme to lack robustness when confronted with the complexities of real-world SARS-CoV-2 data, we only make comparisons to the regional scheme in simulation.

S7 Incorporating time-dependent vaccination rates
Suppose we know the vaccination rate 0 ď ϕ r ptq ď 1 for a given region r. We would like to incorporate this information into our modeling. In particular we would like to allow for vaccination-dependent selection. To do this we write the drift in region r as d r,a ptq " x r,a ptqp1´x r,a ptqq pβ a`ϕr ptqα a q`ÿ b‰a px r,ab ptq´x r,a ptqx r,b ptqq pβ b`ϕr ptqα b q where α P R A is a second group of selection coefficients whose strength is modulated by the timeand region-local vaccination rate ϕ r ptq. In particular α only has a non-negligible effect on viral dynamics when the vaccination rate has attained a non-negligible level. Disentangling the effects of β and α is difficult a priori. However, our hope is that a Bayesian variable selection approach with robust MCMC inference should be up to the task provided we have enough data. The particular form of Eqn. 39 can be motivated as follows. Eqn. 39 implies that for any given allele there are two selection coefficients: one for unvaccinated populations (β) and another for vaccinated populations (α`β). These population-level effects can be connected back to the individual level. In particular take the k Ñ 8 limit of the discrete time process in Sec. S1 so that secondary infections are controlled by a Poisson distribution. Suppose we are considering a variant with a single non-wild-type allele and there are n infected individuals at the current time step. If we suppose that secondary infections result from contact between infected and non-infected individuals, and that an infected individual comes into contact with non-infected individuals who are vaccinationed/unvaccinated in the proportion ϕ to 1´ϕ, then the total number of secondary infections in the next time step is given by n unvac`nvac with n unvac " PoissonpR 0 p1`βqp1´ϕqnq and n vac " PoissonpR 0 p1`β`αqϕnq (40) which is governed by a PoissonpR 0 p1`β`ϕαqnq distribution. Here n unvac and n vac are the number of secondary infections among unvaccinated and vaccinated individuals, respectively. Thus the effective selection coefficient is given by β`ϕα, explaining the form of Eqn. 39.
The basic inference approach introduced in Sec. S4 is still applicable. All we need to do is suitably modify s y ν and s Λ ν in Eqn. 31. In particular we now have that s and s Λ ν P R 2Aˆ2A :

42)
This change approximately doubles the cost of MCMC inference and quadruples memory requirements, but this regime is still well within reach of a mid-grade GPU even for A " 10 4 .

S8 Sampling rate
What effect does sampling-the fact that not all viral sequences are observed in real world datasetshave on our diffusion-based likelihood? We first give a somewhat high-level analysis that ignores some of the subtleties involved. Suppose that each infected individual has their viral genome sequenced with probability 0 ă ρ ă 1. Due to sampling, a variant v that is circulating in n v individuals will be observedñ v " Binomialpn v , ρq times. This distribution has mean ρn v and variance ρp1´ρqn v . For ρ ! 1 and large n v we have thatñ v is very nearly governed by the Poisson distribution Poissonpρn v q, which has mean and variance both equal to ρn v . Note, however, that the Poisson distribution is just the k Ñ 8 limit of the Negative Binomial distribution, which in turn is the basic ingredient of the discrete time process that underlies our diffusion-based likelihood (see Eqn. 1). For this reason the combined action of one step of viral infection and sampling rate on the allele frequencies x a ptq can be described by the equation where is the effective population size corresponding to variability due to sampling and where ϵ and ϵ 1 encode variability in infection and sampling, respectively. Since, however, both sources of variability are zero mean normally distributed noise in the diffusion limit, they can easily be combined: or in other words the effect of ρ would appear to be to redefine the effective population size as As one would expect, for fixed R 0 and k decreasing ρ decreases the effective population size, since the variability due to sampling increases.

S8.1 Detailed analysis
While the simplified analysis above is already adequate for understanding the basic dynamics, it is necessary to do a more careful analysis to reveal some of the additional subtleties involved. In particular let us be careful about distinguishing between the unobserved allele frequenciesx a ptq (corresponding to all infected individuals) and the observed allele frequencies x a ptq (corresponding to infected individuals whose genomes are sequenced). We can write wheredptq andΛptq are defined in terms ofxptq andΛpt`1q is defined in terms ofxpt`1q, and where we have assumed that ρ and thus ν 1 is constant across time. Ultimately we are interested in how xpt`1q depends on xptq, since these are the quantities we observe. To make progress in analyzing Eqn. 49-51 we first make the replacementsdptq Ñ dptq,Λptq Ñ Λptq andΛpt`1q Ñ Λpt`1q, which we expect to be approximately valid in the diffusion limit n Ñ 8. 5 With this substitution we find that xpt`1q is distributed as If we now further assume that the drift is sufficiently small so that Λptq « Λpt`1q we find that so that Eqn. 48 was almost correct, except that-since we need to 'denoise' twice, i.e. once for time step t and once for time step t`1-the ρ´1 term appears with an additional factor of two. It is easy to check numerically that the approximation in Eqn. 53 is a reasonably good one provided that n is large, ρ is small, and the drift is subdominant. Eqn. 53 is reassuring because it means that sampling induces noise that is congruent with the noise induced by the Negative Binomial dynamics that underlies the discrete time process in Sec. S1. In particular this means that we do not need to separately estimate e.g. k and ρ, since it suffices to estimate the effective population size in Eqn. 54. This is a significant simplification and is a key component in the viability of our overall approach.

S9 Prior inclusion probability
An important hyperparameter in our modeling approach is the prior inclusion probability h in Eqn. 23. In an equivalent parameterization we can instead consider the quantity where S is the expected number of non-neutral alleles a priori and the expectation in Eqn. 55 is with respect to the prior in Eqn. 23. We consider two approaches to setting h (and thus S). The first is to set h by hand utilizing any a priori knowledge we may have of the expected number of non-neutral alleles. While a priori knowledge of h may be imprecise, this need not be particularly troubling: as we show in simulations in the main text results are not expected to be overly sensitive to the precise value of h. Indeed, it is worth emphasizing that h is the prior inclusion probability. Sufficient evidence in the likelihood can always overwhelm the prior and the posterior inclusion probability (the PIP) can differ significantly from h.
The second approach is to place a prior on h (and thus implicitly on S). A natural choice is to place a Beta prior on h h " Betapα h , β h q where α h ą 0 and β h ą 0 are hyperparameters that control the Beta prior. This prior assumption has the following properties. The prior mean of h is given by For β h " α h and β h " 1 this mean is approximately equal to α h {β h and we further have that the variance of h is approximately given by Consequently in this regime the ratio of the square root variance to the mean-which is a useful measure of the width of the prior distribution-is approximately given by the formula pV rhsq These formulae are useful for choosing α h and β h . For example, if we have A " 1000 alleles and we expect that S is about 10 and likely between 3 and 20 we might choose α h " 1 and β h " 100, which concentrates about 90% of the prior probability mass between S " 0.5 and S " 30.
Once we have chosen α h and β h we must still infer h. In the absence of tempering this is easy to do since the posterior conditional distribution over h is just another Beta distribution due to conjugacy. Unfortunately, the tempering introduced by the MCMC scheme we use spoils this relationship [4]. Nevertheless this conjugacy can easily be restored by introducing an additional auxiliary latent state in the MCMC algorithm as explained in detail in [5]. Thus in practice inferring h is straightforward.

S10 Alternative modeling and inference strategies
In Sec. S1 we described how a diffusion-based likelihood can be used to model genomic surveillance data. In Sec. S2 we described the simplest inference strategy that utilizes such a likelihood, namely MAP inference. In Sec. S3 we described our own diffusion-based approach: Bayesian Viral Allele Selection. This method differs from MAP in two notable ways: i) it uses a different prior; and ii) it uses a different inference algorithm (MCMC instead of maximum a posteriori inference). In this section we describe a few alternative diffusion-based modeling and inference strategies. 6

S10.1 Laplace prior
We first describe the simplest modification of MAP that can account for the expected sparsity of non-neutral alleles. In this approach we place a Laplace prior on β where ||β|| 1 is the L 1 norm of β and σ Laplace ą 0 is a hyperparameter that controls the expected level of sparsity. We then define the maximum a posteriori estimate under this Laplace prior: 7 In order to compute β Laplace we need to rely on optimization methods. In our experiments we use the Adam optimizer [6] and implement inference using the Pyro probabilistic programming framework [7]. In particular we use the Adam optimizer with an initial learning rate of 10´2 that exponentially decays to 10´4 over the course of 10 4 optimization steps. 8 Solving this optimization problem is OpA 3`A2 T opt q where T opt is the total number of optimization steps and the cubic cost in A results from computing the Cholesky factorization of Λ ν . Another possible approach would be to consider the exact same model as in Eqn. 61 but to perform variational inference, for example mean-field variational inference utilizing a product of Normal distributions as the variational distribution over β. We would expect this approach to suffer from one of the main issues that plagues the approach in Eqn. 61: the hyperparameter σ Laplace is difficult to set, but a well-chosen value of σ Laplace is crucial for good performance.

S10.2 Horseshoe prior
Another approach to inducing shrinkage in β would be to use a hierarchical shrinkage prior like the Horseshoe [8]. However, in our experience MCMC algorithms for Horseshoe-based regression can mix quite poorly in the high-dimensional setting. In addition, since the Horseshoe prior is a continuous shrinkage prior, it does not provide an interpretable allele-level score like the PIP provided by a Bayesian Variable Selection approach (see Eqn. 24). Moreover, MCMC inference in this context is likely to be expensive, since Λ ν does not have low-rank structure that can be exploited for computational speed-ups.

S10.3 Discussion
Having now discussed a few possible alternatives to BVAS, some of the favorable characteristics of BVAS come into sharper focus. First, a fully Bayesian approach that systematically integrates over different hypotheses about which alleles are neutral and which are not contributes to reduced sensitivity to prior hyperparameters. Second, the sparsity assumption has notable computational benefits with the consequence that BVAS admits efficient MCMC inference with Op|γ| 2 Aq cost per iteration (e.g. compare to the quadratic dependence on A in the Laplace method). Third, BVAS offers interpretable posterior inclusion probabilities. Fourth, the MCMC inference algorithm itself does not include any difficult-to-set hyperparameters, in contrast for example to the Laplace method above, which may require tweaking optimization hyperparameters for optimal performance.

S11 Note on Statistical Analysis
All statistical analysis in this work is done using standard Bayesian methods. In particular unless noted otherwise we report 95% credible intervals based on the (approximate) posterior distribution. Since all our models jointly consider all data-as opposed to univariate association tests-no corrections for multiple hypothesis testing are necessary.

S12 Limitations
We provide an extended discussion of some of the limitations of BVAS not touched upon in the main text. Due to limitations of our pre-processing pipeline that we inherit from UShER, we do not consider insertions and deletions. If there are insertions or deletions that have non-negligible selection effects, the exclusion of these alleles from our analysis may lead to inferences that incorrectly implicate other alleles that are linked to these insertions/deletions. These considerations are also applicable to our epistasis analysis, which may be particularly sensitive to any such missing allelic information. More broadly, the fidelity of the epistasis analysis may also be impacted by the exclusion of interaction effects outside the RBD.
Our method cannot correct for any issues in the raw input data (e.g. sequencing errors) and cannot correct for lineage-dependent sampling bias (e.g. sequencing only in case of S-gene target failure). Likewise biased inferences can result if, e.g., disease severity varies across variants and disease severity influences the likelihood that an individual's viral genome is sequenced. Our diffusion-based likelihood cannot account for differential fitness effects that depend on unobserved confounders like population age structure. In addition we assume that individuals are infected with a single SARS-CoV-2 variant, an assumption that can be violated [9].
Additional bias and/or reduced statistical power may result from our use of a global effective population size, which implicitly assumes that the sampling rate does not vary wildly from region to region. In particular by (very likely) underestimating the effective population size in well-sampled regions like the UK we lose statistical power, although we note that by using a smaller effective population size we expect our inferences to be less sensitive to noise in the data and thus more robust. Moreover by using a single effective population size we implicitly assume that the sampling rate is approximately uniform within each region. We do not expect a non-uniform sampling rate within a given region to bias results provided that the region is well mixed. Biased inferences may result, however, if a region is poorly mixed and exhibits non-uniform sampling, as this combination of violations of our modeling assumptions may overemphasize the importance of local fluctuations in allele frequencies.

S13.1 Simulation details (extended)
Unless otherwise noted, the parameters and distributional assumptions of all simulation-based experiments are as follows. We consider V " 100 viral variants and a true selection coefficient vector β˚with 10 non-zero coefficients: β˚" p0.01, 0.02, 0.04, 0.06, 0.08,´0.01,´0.02,´0.04,´0.06,´0.08,´0.10, 0, 0, 0, ..., 0q In other words there are exactly 10 non-neutral alleles, 5 of which have positive effects on viral fitness and 5 of which have negative effects on viral fitness. We sample each element of each genotype vector g v i.i.d. from a Bernoulli distribution with mean 1 5 , i.e. g v,a " Bernoullip 1 5 q for v " 1, ..., V and a " 1, ..., A where A is the total number of alleles. This choice implies that typical variants have about two non-neutral alleles. We set R 0 " 1 so that typical reproduction numbers for variants v range between R v " 0.9 and R v " 1.1. In each simulation we consider a given number of R regions and T " 26 time steps. This latter choice corresponds to about one year of data assuming time bins of 14 days.
The initial number of infected individuals at time t " 1 within each region is drawn from a Negative Binomial distribution with mean 10 4 and dispersion parameter 9 10. Within each region these infected individuals are then apportioned to each of the V variants using a Multinomial distribution with uniform probabilities V´1. Thus the mean case count of each variant v within each region at time t " 1 is given by 10 4 {V " 100. Case counts for t " 2, ..., T are then determined by the stochastic dynamics in Eqn. 1 with k " 0.1. This value of k is chosen since it consistent with estimates of the SARS-CoV-2 dispersion parameter [10,11,12,13]. These raw counts are then subjected to Binomial sampling with mean ρ " 0.01, representing a sampling rate of 1%, i.e. the viral sequences of 99% of cases are not observed. Note that this implies that the expected number of observed individuals with variant v within each region at the initial time step is 1. Thus our parameter choices result in simulated data that is highly stochastic and that constitute a regime in which we expect that recovering the true selection coefficients β˚is quite challenging. Unless noted otherwise, we generate 20 datasets per condition.
We make these choices because they result in simulated data that exhibit some of the characteristics of our SARS-CoV-2 data. In particular, typical estimated effective population sizesν range from about 25 to about 140 with a mean of about 75. Indeed from Eqn. 54 we would expect that which is indeed what we observe, see Figure A. Note the fact that most regions have an effective population size somewhat larger than 50 is due to the fact that in most regions the number of infected individuals increases throughout the course of the pandemic.

S13.2ν estimator performance
We investigate the accuracy of the effective population size estimators described in Sec. S6. Our simulated data are as in Sec. S13.1 except that we limit the duration of the pandemic to T " 10 time steps to limit the exponential growth/decay within any given region so that the true population size is approximately constant across time within any given region. We run the simulator for N R " 10 5 regions. We then use Eqn. 54 to compute the 'true' effective population size for each region, where we average Eqn. 54 across T time steps (since the true number of infected individuals fluctuates 9 The variance is thus " 3164 2 . from time step to time step). Finally we use our basicν estimator, Eqn. 36, to estimate the effective population size for each region.
In Figure B we depict histograms of ratios of the estimatedν to the true effective population size ν true for two scenarios. In the first scenario (left) we simulate data with β " β˚as in Eqn. 62, with the result that drift is relatively moderate. In the second scenario (right) we simulate data with β " 5β˚, with the result that drift is much more pronounced. We find reasonably good agreement with a relative accuracy of about 10-40% depending on the scenario.

S13.3 Method Comparison
We provide some additional information regarding the method comparison in the main text. We use the following hyperparameter settings. For MAP we test all values of τ ranging from 2´2 4 to 2 24 in powers of 2 and choose τ " 2 11 " 2048, since this value gives the best results across the board. In other words we are using the ground truth to choose τ , something that we obviously cannot do when analyzing real data. Since the effective regularization parameter for MAP is given by γ reg " τ {ν (see Eqn. 22) and ν « 75 for our simulated data this corresponds to a value of γ « 30, which is in the range of values suggested by [1]. 10 For BVAS we use τ " 100 and pα h , β h q " p1{16, A{32q except for experiments where we explicitly vary the prior inclusion probability h. 11 Unless specified otherwise these BVAS settings are used in all simulation experiments. For Laplace we use a regularization scale of σ Laplace " 0.01. For PyR 0 we use the default settings in [2]. Unless specified otherwise, we use the globalν estimator described in Sec. S6. 12 We note that we tried using z-scores to rank PyR 0 hits but found essentially identical results. We also note that the hit rate can be understood as the precision that results if the top 10 alleles are declared hits. We run BVAS for 5500 iterations, discarding the first 500 samples for burn-in. We use T opt " 10 4 optimization steps when running Laplace.

S13.4 Comparing global vs. regionalν strategies
In the main text we investigated the effect of modulating the overall scale of ν. It is also important to investigate whether we should prefer the global or regionalν estimators introduced in Sec. S6. As can be seen in Figure F, the two strategies perform nearly identically, at least on our simulated data. This suggests that the more 'ambitious' regional estimator could perform well in real data, but it also suggests that it may be entirely sufficient to use the (presumably) more robust global estimator. For this reason we only use the global estimator in our actual analysis. By virtue of this choice all regions in our analyses contribute equally to the likelihood.

S13.6 Including vaccination-dependent effects
Our simulation is as in Sec. S13.1 except we now assume 20 non-zero effects, 10 of which are vaccination-dependent and 10 of which are not (we assume these two sets of ten are disjoint and that each set has the same effect sizes as the non-zero effects in Eqn. 62). We assume that ϕ r ptq starts at zero everywhere (i.e. ϕ r pt " 1q " 0), increases linearly over time, and that the final value of ϕ r ptq in each region is given by ϕ r pT q " Uniformp 1 2 , 1q.

S14.1 Data Pre-processing
The initial steps of our data pre-processing pipeline follow the procedures described in [2] and make use of the open source code at https://github.com/broadinstitute/pyro-cov. We give a short summary of the relevant parts of this pipeline and refer the reader to [2] for additional details.
Our initial data ingest consists of 8,575,902 samples downloaded from GISAID [14] on April 18 th , 2022. Each sample record includes metadata for time of collection, geographic location, and PANGO lineage annotation [15] as well as genetic sequence data. We discard records without complete metadata. Sequences whose alignment quality is not reported as "good" are discarded. Time intervals are binned into 14-day segments, which results in a total of 62 time intervals, i.e. a bit more than two years of data. Note that we choose a multiple of 7 days to minimize weekly seasonality effects (e.g. closure of testing facilities on weekends).

S14.1.1 Lineage Clustering
Since the likelihood that underlies BVAS is defined in allele frequency space, it does not explicitly rely on notions of variant or lineage. However, the pre-processing pipeline in [2] makes use of a dynamic partitioning of genetic samples into clusters. One potential advantage of this approach is that it can remove sequencing artifacts. Like [2] our default analysis makes use of L " 3000 clusters derived from the 1694 PANGO lineages present in our dataset. That PANGO lineages are not sufficiently fine can be seen in Figure G, which shows that some PANGO lineages (notably B.1.1) consist of several sublineages with widely varying fitnesses. This recapitulates the findings in [2]. These L " 3000 clusters are created as follows. The starting point is a 9,293,486 node phylogeny of all GISAID samples maintained by Angie Hinrichs [16]. This phylogeny was created using UShER [17] and excludes private mutations, masks difficult-to-sequence regions, disregards deletions, and parsimoniously imputes missing sequence data. To coarse grain the 9.3 million node phylogenetic tree down to L " 3000 clusters we iteratively collapsed parent-child edges by greedily minimizing a notion of distance defined between pairs of mutation-annotated trees. Empirically [2] found that this heuristic clustering produces trees that are approximately balanced (both in terms of cluster size and cluster-cluster edit distance). Note that this clustering procedure removes some of the genetic diversity in the raw data. The total number of alleles retained at the end of clustering is A " 2975. In Figure T we investigate the sensitivity of our analysis to the choice of L and find only modest differences.

S14.1.2 Spatial Aggregation
To accommodate the fact some spatial regions are sparsely sampled while other spatial regions are densely sampled, we dynamically aggregate regions to encourage approximately balanced sample totals per region. For example, in well-sampled countries like the US and Germany our spatial regions are at the state/province level (e.g. California and Bavaria) while subregions of less densely sampled countries are aggregated up to the country level (e.g. Turkey, Portugal, and Slovenia).
The result of the above pre-processing steps is a collection of region-specific time series of lineage counts (encoded as a RˆTˆL count array) as well as lineage-level amino acid features (encoded as a LˆA binary array with A " 2975).

S14.1.3 Allele Frequency Space
Running BVAS MCMC requires two basic ingredients (see Eqn. 28): s y ν and s Λ ν . Since these allele frequency space quantities are unique to BVAS and the method in [1] the rest of our pre-processing pipeline diverges from [2].

S14.1.4 Diffusion Limit Filters
BVAS relies on the diffusion limit, i.e. it assumes that the number of infected individuals is relatively large. For this reason we would like to exclude regions that are poorly sampled from our analysis. 13 With that caveat in mind we would like to include as many regions as possible so that the results of our analysis are not unduly influenced by the idiosyncracies of any particular region. To manage this trade-off we introduce two hyperparameters: N tot min and N 14 min . Regions where the total number of samples is less than N tot min are not included in our analysis. By making N tot min relatively large we exclude regions that have only been observed for a handful of time intervals (note that it is particularly difficult to estimate the effective population size of such regions). We must also decide which time intervals to include in our analysis. Even if the total number of samples in a given region is large the number of samples collected early on in the pandemic may be small, with the result that dynamics in that early phase of the pandemic will tend to be dominated by noise (i.e. diffusion). For each region we thus only include those 14-day time intervals for which the number of samples is greater than or equal to N 14 min . Moreover, since our likelihood depends on allele frequency increments yptq (see Eqn. 12) we choose the longest subsequence of contiguous time intervals where each time interval contains at least N 14 min samples. 14 In our default analysis, we choose relatively conservative values of these two hyperparameters: N tot min " 10 4 and N 14 min " 50. In Sec. S14.2 we investigate the sensitivity of our estimates to N tot min and N 14 min . In Table B we list the 128 regions that enter into our default analysis with N tot min " 10 4 as well as the 67 regions that enter into an auxiliary analysis with N tot min " 20ˆ10 3 .

S14.1.5 Computing s y ν and s Λ ν
The next step is to compute s y ν and s Λ ν . First we estimate the effective population sizeν for each region using the estimator in Eqn. 36. In our default global analysis we then compute the medianν across all the regions included in the analysis (note that the number of regions included depends on N tot min ) and scale the median value ofν by a factor of 0.5. This choice is motivated by several concerns. The median effective population sizeν in our default analysis with 128 regions is 63.9. This value exceeds the estimated effective population size of 64 of the 128 regions, see Table B. Moreover, half that value, namely 32.0, exceeds the estimated effective population size of only 9 of the 128 regions. Thus by making this choice we ensure thatν is not grossly over estimated in any one region (the smallest estimatedν is 20.0 in Fukuoka, Japan). Consequently, with this choice BVAS is not expected to confuse the inherent stochasticity of the least well sampled regions with signal.
To put this choice in a different light, we note that Bayesian inference has a long history of grappling with model mis-specification and robustness. One class of methods achieves robustness by raising the likelihood to a power, which is precisely the result of scaling the effective population size. Thus our choice ofν can be understood as a way of achieving robustness against possible model mis-specification. In particular we do not expect our genomic surveillance data to exactly follow the discrete time branching process in Sec. S1 nor do we expect the reproduction number to be exactly linear in the genotype. For more discussion on these and related ideas see [18].
With this choice ofν we now compute s y ν and s Λ ν using the definitions in Eqn. 31 and using a shared value of the effective population size across all regions. Note that computing these quantities is OpN R T q, i.e. the computation scales with both the number of regions and the number of time intervals under consideration, but that s y ν and s Λ ν can be computed very efficiently in an online fashion. 15 Indeed computing s y ν and s Λ ν only takes a few seconds on a mid-grade GPU. Importantly, these OpN R T q computations only need to be done once in pre-processing, since the BVAS MCMC algorithm operates directly on s y ν and s Λ ν . We also describe an alternative strategy for computing the effective population size that we use in our analysis of SARS-CoV-2 surveillance data obtained through August 10 th 2022. As above we compute effective population size estimates for each region using Eqn. 36. We then partition the regions into two buckets based on the median effective population size: i) those whoseν exceeds the median; and ii) those whoseν is less than or equal to the median. That is we partition regions into lower and upper 50% quantiles based on their estimated effective population sizes. We then compute s y ν and s Λ ν using the medianν for the upper bucket and the minimumν (i.e. the minimum among all regions) for the lower bucket. We call this effective population size estimation strategy the 'two-bucket' strategy. It has the following nice properties: i) it is parameter free; ii) it weighs data from well-sampled regions like the UK more than poorly-sampled regions (though only by a moderate Op1q factor); and iii) it is still conservative in that we generally expect to underestimate the true effective population size in each region.

S14.1.6 Vaccination rates
Vaccination data are downloaded from OWID [19]: • https://github.com/owid/covid-19-data/raw/master/public/data/vaccinations/vaccinations.csv • https://github.com/owid/covid-19-data/raw/master/public/data/vaccinations/us_state_vaccinations.csv We use two definitions of vaccination status: people_vaccinated_per_hundred and people_fully_vaccinated_per_hundred. We use country-level data for all regions except for US states and England, Scotland, Wales. Any missing values are interpolated using linear interpolation. We consider the same 128 regions as in our default analysis, except we exclude Luxembourg, since it does not include complete vaccination information. Thus we consider 127 regions in total.

S14.1.7 MCMC
When running BVAS we do 51ˆ10 4 MCMC iterations. We discard the first 10 4 samples and retain the remaining 5ˆ10 5 samples to compute posterior quantitities of interest like PIPs. All computations are done in 64-bit precision.

S14.2 Sensitivity Analysis
See Figure M-AA and the main text for discussion.

S14.3 Backtesting
We follow the same data pre-processing pipeline as in Sec. S14.1, although we note that for any given date fewer than 128 regions enter our analysis, since we are considering subsets of the data and the N tot min " 10 4 requirement is still in place. In practice a real-time surveillance program might choose to use looser filters (i.e. smaller N tot min and N 14 min ).

S14.4 Analysis using data through August 10 th 2022
We report BVAS inference results obtained using SARS-CoV-2 data obtained through August 10 th 2022. This analysis considers 151 regions and 8497709 viral sequences. To estimate effective population sizes we use the 'two-bucket' strategy described in Sec. S14.1. Otherwise our analysis uses the same hyperparameters as in our default analysis using data through April 18 th , 2022. For results see Table C and Table D.

S14.5 Comparison to MAP
In Figure DD-II we compare BVAS results to results obtained with MAP [1]. Estimates for selection coefficients β are qualitatively similar to BVAS results, although BVAS results are much sparser, with many coefficients nearly zero ( Figure DD). Growth rate estimates are in good agreement if the MAP regularizer τ is set sufficiently low (τ " 2 8 or τ " 2 10 ), but otherwise MAP estimates appear to be over-regularized towards Wuhan A, see Figure EE-FF. Conversely, Manhattan plots for MAP estimates are dense unless τ ě 2 12 ( Figure HH). Thus MAP is unable to simultaneously yield large growth estimates for fit lineages like Omicron and assign negligible effect sizes to most alleles. If we believe that most alleles should be neutral, this is a limitation of MAP. More generally, MAP offers limited control over how much estimates should be regularized because it is controlled by a single hyperparameter τ . This is in contrast to BVAS where regularization is controlled by both S and τ . Indeed BVAS uses a value of γ reg " τ {ν that is an order of magnitude smaller than MAP, depending on S for additional regularization. The analysis in Figures EE-HH indicates that this additional regularization is crucial. We also note that MAP uncertainty estimates are quite to ii) antibody binding computed using the antibody-escape calculator in [20]. The escape calculator is based on deep mutational scanning data for 33 neutralizing antibodies elicited by SARS-CoV-2. BVAS predictions exhibit high (Spearman) correlation with predictions from [20]. Compare to Figure 2B in [2].  [21]. Many top-scoring mutations occur in the serine-arginine rich region (SR), reported to be immunogenic by [22]. narrow (see e.g. Figure EE) which can ultimately be traced to the fact that MAP considers a single hypothesis about allele neutrality (namely that no alleles are neutral).
Note that our MAP analysis follows the exact same data pre-processing pipeline as for BVAS, see Sec. S14.1. Thus BVAS and MAP comparisons utilize the exact same likelihood, although we do not scale theν estimate in the case of MAP and instead use the median effective population size directly. Also note that in Figure GG and Figure II we choose τ " 2 12 , which corresponds to a level of regularization that appears to achieve the best balance between over-and under-regularizationalthough as we have argued above this compromise appears to be suboptimal.

S14.6 Comparison to PyR 0
We also compare BVAS results to results obtained with PyR 0 , see Figure  hits are common to both methods ( Figure KK), the quantitative agreement between PyR 0 and BVAS is modest. For example, only 7 hits are common to the top 25 BVAS and top 25 PyR 0 hits, and the Pearson correlation for β estimates is R " 0.365, see Figure MM. Moreover, PyR 0 estimates for the fittest Omicron lineages are substantially higher than BVAS estimates, by as much as " 50%, see Figure NN. Additionally PyR 0 estimates, especially for allele-level quantities like selection coefficients, are implausibly small, orders of magnitude smaller than the corresponding BVAS uncertainties.
We note that, like our Laplace method (see Eqn. 60), PyR 0 has a hyperparameter σ 3 that controls the sparsity of the selection coefficients. The analysis in [2] uses σ 3 " 5ˆ10´4. However, our analysis includes approximately " 35% more data than the analysis in [2], suggesting that a smaller value of σ 3 may be required. Consequently, after comparing inference results for several values, we settle on  Figure K: As a measure of each gene's total contribution to selection we depict the summed PIP across all alleles in each gene in the SARS-CoV-2 genome (top). In the middle panel we break-up the two largest genes (ORF1a and ORF1b) into constituent NSPs (non-structural proteins). In the bottom panel we additionally normalize by the length of the gene (in units of 1000 amino acids).
one of the disadvantages of PyR 0 : it depends on several hyperparameters that can be difficult to set in practice.
PyR 0 comparisons follow the data pre-processing pipeline described in [2]. We use the default parameters listed in [2], with the difference that we decrease σ 3 (as discussed above). Note that unlike BVAS the PyR 0 analysis does not make use of filters like N tot min " 10 4 , and thus considers more data from significantly more regions (1694 instead of 128).

S14.7 Hitchhiking
As is well known, hitchhiking, i.e. apparent drift to due genetic linkage, complicates the inference of selection effects and makes it difficult to unambiguously disentangle driver and passenger mutations. Hitchhiking can involve both selective sweeps caused by the uptake of beneficial mutations [23] as well as the elimination of neutral variants linked to deleterious mutations, i.e. background selection [24]. One of the advantages of the diffusion-based likelihood in Eqn. 12 is that covariance information from Λptq can help distinguish driver mutations from passenger mutations. To make this more explicit, we identify the 3 alleles that exhibit large uptakes in allele fractions while simultaneously exhibiting small PIPs below 0.01: N:R203K, ORF14:G50N, and N:G204R. As can be seen in Figure OO these alleles have in fact peaked three times, having become very prevalent during the B.1.1, B.1.1.7 (Alpha), and BA.1 (Omicron) waves, while experiencing significant drops in interim periods when other lineages (e.g. B.1.177 and Delta) predominated. BVAS infers that these three alleles are passenger mutations whose rising and falling prevalences are best explained by selection effects due to other linked alleles. While it is difficult to validate whether this particular inference is correct, it is reassuring that all three alleles have similar PIPs, as they exhibit near perfect co-occurrence in the data.

S14.8 Vaccination-dependent effects
We follow the same data pre-processing pipeline described in Sec. S14.1, noting that the dimension of yptq doubles as we include vaccination-dependent effects. Note, however, that when we estimate the effective population size with Eqn. 36 we do so using allele frequency increments in the original A-dimensional space so that ourν estimate remains the same.

S14.9 Epistasis
We report additional details pertaining to our epistasis analysis. We first note that the variant-space dynamics in Eqn. 3 are related to the allele-space dynamics in Eqn. 8 via a linear transformation specified by a genotype matrix. This still holds true when there are higher order interactions in the fitness ansatz in Eqn. 7. Consequently all the relevant formulae in Sec. S1 remain valid, with the difference that allele-space quantities for pairwise features must be re-interpreted appropriately. For example, entries ppa 1 , a 2 q, pa 3 , a 4 qq in the diffusion matrix Λptq in Eqn. 10 corresponding to pairwise features pa 1 , a 2 q and pa 3 , a 4 q encode quartic moments w.r.t. the underlying amino-acid-level alleles. Our epistasis analysis uses the same data and pre-processing as our default analysis. We use twice as many MCMC iterations as in our default analysis. For pairwise interaction effects we use an a priori inclusion probability h quad that is one half of the value used for linear effects, i.e. h quad " 1 2 h linear « 0.0084. We verified that other values of h quad -in particular h quad " 1 4 h linear and h quad " h linear -identified the same leading pairwise interaction effects (although with a corresponding shift in the overall PIP level for quadratic effects), with (N501Y, P681H) the leading hit in all three analyses. This remains true for sensitivity analyses in whichν is modulated down by a factor of one half or the prior precision τ is halved.  As touched upon in the main text, we only consider quadratic interactions between pairs pa 1 , a 2 q of non-synonymous amino acid substitutions that co-occur in at least 2 of our L " 3000 SARS-CoV-2 clusters. Moreover for each pa 1 , a 2 q we require that there are at least two clusters that 'separate' the pair, i.e. clusters that carry the a 1 allele but not the a 2 allele or vice versa. After applying these filters there remain 1432 pairwise interaction effects for consideration, where the vast majority of pairwise interactions that do not pass our filters are excluded because of the co-occurrence requirement. Note that BVAS would not have any incentive to make use of these pairwise interactions if they were included in the analysis, since they would be unidentifiable in the context of their constituent linear effects (and as such they do not add any additional expressivity). See Figure SS for a sensitivity analysis for (linear) effects common to the epistasis analysis and the default analysis without epistasis. Top hits from our epistasis analysis are reported in Table F. The 'pairwise' posterior inclusion probabilities for these hits are depicted graphically in Figure TT.
In the main text we report that top-scoring interaction effects in our epistasis analysis are enriched for interactions between mutations that correspond to top-scoring (linear) effects in our default analysis. In particular 8 of the 10 amino acids in the top 8 quadratic hits in our epistasis analysis exhibit PIPs larger than 0.3 in our default analysis. Only A67V and E484A exhibit negligible PIPs in our default analysis but participate in a top 8 quadratic hit, namely the interaction between A67V and E484A.
The enrichment analysis is done as follows. For the 421 amino acid mutations a in Spike under consideration, we assign a score given by σpaq " 1{rank, where rank is the PIP rank of the corresponding linear effect in our default analysis (i.e. σpaq P t1{1, 1{2, 1{3, ...u). We then assign interaction effects between amino acids a 1 and a 2 a score given by the product σpa 1 qσpa 2 q. That is we assign high scores to interaction effects whose corresponding amino acids exhibit large PIPs in our default analysis. We then compute a Spearman correlation coefficient between the PIP ranks for interaction terms in our epistasis analysis with the scores σpa 1 qσpa 2 q. We find a moderate positive correlation given by 0.45. This substantiates our claim that top-scoring interaction effects in our epistasis analysis are enriched for interactions between mutations that correspond to top-scoring (linear) effects in our default analysis. To determine the statistical significance of this correlation we conduct a permutation test with 9999 permutations, resulting in a p-value less than 0.0001, i.e. none

S15 Bayesian variable selection
We briefly review the literature on Bayesian variable selection (BVS). Some of the earliest approaches to BVS were introduced by [25,26]. [3] provide an in-depth discussion of BVS for linear regression and CART models. [4] introduce Tempered Gibbs Sampling (TGS) and apply it to BVS for linear regresion. [27] and [28] review various methods for BVS. , selection coefficients β (middle), and growth rates R{R A (right) for an analysis with the default number of SARS-CoV-2 fine lineage clusters (L " 3000) versus L " 5000 (see Sec. S14.1). The latter analysis includes A " 3790 alleles; here we subset to the A " 2975 alleles that are common to both analyses. The top scoring hit that appears in the A " 3790 analysis but does not appear in the A " 2975 analysis is ORF1a:K1869R with a PIP of 0.068 and β " 0.013.    Figure BB: We investigate the ability of BVAS to estimate the fitness of lineages as they emerge. We consider several Delta (top two rows) and Omicron (bottom two rows) sublineages as they emerged in Spring 2021 and the end of 2021, respectively. In both cases we compare to the fittest lineage that was predominant (in England and elsewhere) before the rise of Delta and Omicron. We also depict time series of the number of SARS-CoV-2 sequences in our dataset for each sublineage.         Figure EE-      We compare BVAS estimates from vaccination-independent analyses to estimates derived from analyses with vaccination-dependent effects. We depict allele-level Posterior Inclusion Probabilities (left) and selection coefficients β (right); i.e. both quantitites correspond to vaccinationindependent effects. The vaccination status used in the top (respectively, bottom) panel is 'fully vaccinated' ('vaccinated'). In both panels N tot min " 10 4 and N R " 127 regions are included in the analysis. The strong concordance implies that the incorporation of vaccination status into the model has only moderate effects on the estimates of vaccination-independent effects and model fit as a whole. Table F: We report inference results for the top hits in our epistasis analysis, namely those with PIPs above 0.2. These are the same hits that are depicted in the interaction network diagram in the main text. The first column reports the PIP of the pairwise interaction term in the epistasis analysis with both linear and quadratic effects. The next two columns list the corresponding Spike amino acid mutations. The next two columns report the total number of SARS-CoV-2 sequences in our dataset that carry the first (respectively, second) but not the second (respectively, first) amino acid mutation. The next column reports the total number of SARS-CoV-2 sequences in our dataset that carry both amino acid mutations. The next three columns are like the counts in the previous three columns except they report the leading PANGO lineages corresponding to the counts in the previous three columns. Note that a lineage can appear in multiple columns because there is non-negligible genetic diversity in some PANGO lineages (e.g. BA.1). The next column reports the inferred interaction effect in the epistasis analysis with both linear and quadratic effects. The next two columns report the inferred linear effects in the epistasis analysis corresponding to amino acids AA1 and AA2. The last two columns report the inferred linear effects in the default analysis corresponding to amino acids AA1 and AA2. , selection coefficients β (middle), and growth rates R{R A (right) for an analysis with linear and quadratic effects to one with linear effects only. The former analysis includes 4407 selection effects; here we subset to the A " 2975 linear effects that are common to both analyses. The inferred linear effects are largely the same for both analyses, except for a relatively modest number of outliers. In particular there are 5 linear effects whose PIP is substantially smaller in the epistasis analysis (S:R346K, S:S371F, S:S704L, S:V213G, and S:R408S) and one linear effect (S:T95I) whose PIP is substantially larger in the epistasis analysis (where a substantial difference means a difference larger than 0.5). As expected growth rate estimates exhibit very high concordance.