Lineage frequency time series reveal elevated levels of genetic drift in SARS-CoV-2 transmission in England

Genetic drift in infectious disease transmission results from randomness of transmission and host recovery or death. The strength of genetic drift for SARS-CoV-2 transmission is expected to be high due to high levels of superspreading, and this is expected to substantially impact disease epidemiology and evolution. However, we don’t yet have an understanding of how genetic drift changes over time or across locations. Furthermore, noise that results from data collection can potentially confound estimates of genetic drift. To address this challenge, we develop and validate a method to jointly infer genetic drift and measurement noise from time-series lineage frequency data. Our method is highly scalable to increasingly large genomic datasets, which overcomes a limitation in commonly used phylogenetic methods. We apply this method to over 490,000 SARS-CoV-2 genomic sequences from England collected between March 2020 and December 2021 by the COVID-19 Genomics UK (COG-UK) consortium and separately infer the strength of genetic drift for pre-B.1.177, B.1.177, Alpha, and Delta. We find that even after correcting for measurement noise, the strength of genetic drift is consistently, throughout time, higher than that expected from the observed number of COVID-19 positive individuals in England by 1 to 3 orders of magnitude, which cannot be explained by literature values of superspreading. Our estimates of genetic drift suggest low and time-varying establishment probabilities for new mutations, inform the parametrization of SARS-CoV-2 evolutionary models, and motivate future studies of the potential mechanisms for increased stochasticity in this system.

Introduction throughout time that cannot be explained by literature values of superspreading. We discuss how community structure in the host contact network may partially explain these results. Additionally, we observe that 92 sampling of infected individuals from the population is mostly uniform for this dataset, and we also find 93 evidence of spatial structure in the transmission dynamics of B.1.177, Alpha, and Delta. In simulations with measurement noise, 100 sequences were sampled per week with the measurement noise overdispersion parameter c t = 5 (parameter defined in text). All simulations were initialized with 50 lineages at equal frequency. A lower effective population size leads to larger frequency fluctuations whose variances add over time, whereas measurement noise leads to increased frequency fluctuations whose variances do not add over time. (b) Schematic of Hidden Markov Model describing frequency trajectories. f t is the true frequency at time t (hidden states) and f obs t is the observed frequency at time t (observed states). The inferred parameters areÑ e (t) ≡ N e (t)τ (t), the effective population size scaled by the generation time, and c t , the overdispersion in measurement noise (c t = 1 corresponds to uniform sampling of sequences from the population). (c-f) Validation of method using Wright-Fisher simulations of frequency trajectories with time-varying effective population size and measurement noise. (c) Simulated number of sequences. (d) Simulated lineage frequency trajectories. (e) Inferred scaled effective population size (Ñ e (t)) on simulated data compared to true values. (f) Inferred measurement noise (c t ) on simulated data compared to true values. In (e) the shaded region shows the 95% confidence interval calculated using the posterior, and in (f) the shaded region shows the 95% confidence interval calculated using bootstrapping (see Methods).
We first summarize the statistical inference method that we developed to infer time-varying effective 98 population sizes from neutral lineage frequency time series that are affected by overdispersed measurement 99 noise (more variable than uniform sampling). We explain the method more extensively in the Methods. differences or deterministic changes in frequency: lineages pre-B. 1.177,B.1.177,Alpha,and Delta [28,17,153 32, 33]. We did not find any studies in the literature claiming detectable fitness differences between lineages 154 within each of these groups; thus, we assumed that our neutral model should be valid when analyzing lineages 155 only within a single group. We checked that this assumption is valid and describe the results below. 156 To obtain lineage frequency time series data for SARS-CoV-2 in England, we downloaded genomic meta-(see Methods). This yielded 486 lineages for pre-B.1.177, 4083 lineages for B.1.177, 6225 lineages for Alpha, 169 24867 lineages for Delta. 170 The inferred effective population size is shown in Figure 2c. The inferred effective population size was 171 lower than the number of positive individuals in the community by a factor of 16 to 1055 at different 172 points in time. The most notable differences between the changes over time in the number of positives 173 in the community and that of the effective population size were: the inferred effective population size of 174 lineages pre-B.1.177 peaked slightly before the number of pre-B.1.177 positives peaked, the inferred effective 175 population size of Alpha decreased slower than the number of positives decreased after January 2021, and the 176 shoulder for the inferred effective population size of Delta occurred earlier than in the number of positives. 177 We checked that the inferred effective population size is not sensitive to the depth at which the trees are cut 178 to create lineages ( Figure S9), the threshold counts for creating superlineages ( Figure S10), or the number 179 of weeks in the moving time window ( Figure S11).

180
The inferred measurement noise for each group of lineages is shown in Figure 2d. The inferred measure-181 ment noise overdispersion was mostly indistinguishable from 1 (uniform sampling), but at times was above 1 182 (sampling that is more variable than uniform sampling). There were also at times differences in the strength 183 of measurement noise between variants when they overlapped in time. In particular, measurement noise for 184 lineages pre-B.1.177 peaked in October 2020 despite measurement noise being low for B.1.177 at that time. 185 To better understand the observed levels of genetic drift, we compared the inferredÑ e (t) to that of an 186 SIR null model, which includes a susceptible, infectious, and recovered class. TheÑ e (t) for an SIR model 187 was derived in Ref. [35,36,37] and is given by where I(t) is number of infectious individuals, R t is the effective reproduction number, and γ I is the rate at which infectious individuals recover. For the number of infectious individuals, we used the number of positive individuals estimated from the UK Office for National Statistics' COVID-19 Infection Survey [31], which is a household surveillance study that reports positive PCR tests, regardless of symptom status. We used the measured effective reproduction number in England reported by the UK Health Security Agency [38]. We used γ −1 I = 5.5 days [39,40], and our results are robust to varying γ I within a realistic range of values ( Figure S12). We found thatÑ e SIR (t) is very similar to the number of positives because the effective reproduction number in England was very close to 1 across time and γ I is also very close to 1 in units of weeks −1 . To calculateÑ e SIR (t) for each variant or group of lineages, we rescaled the population-level I(t) and R t based on the fraction of each variant in the population and the relative differences in reproduction numbers between variants (see Methods). We then calculated the scaled true population size,Ñ (t) ≡ N (t)τ (t), for the SIR model by multiplying by the variance in offspring number, σ 2 , for the SIR model [41] N SIR (t) =Ñ e SIR (t){σ 2 } SIR (2) Overall, the inferredÑ e (t) is lower thanÑ SIR (t) by a time-dependent factor that varies between 16 and 189 589 (Figures 3c and S13), suggesting high levels of genetic drift in England across time. We find similar 190 results when using an SEIR rather than an SIR model which additionally includes an exposed class and 191 may be more realistic (Methods, Supplementary information, and Figure S14).  199 To detect any lineages that could possibly be non-neutral, we used the conservative method, and found that 200 about 20% of lineages in each group were significantly non-neutral at a significance level of 5%. Very likely, 201 some of these lineages are detected as non-neutral simply because the model does not correctly account for 202 strong genetic drift. Excluding these lineages from the analysis of the inferred effective population size leads to slightly different values of the effective population size, but by less than an order of magnitude and mostly 204 by less than the 95% confidence intervals ( Figure S15). This result shows that conservatively excluding 205 lineages that could be non-neutral does not change the result that the inferred effective population size is 206 one to two order of magnitudes lower than the SIR or SEIR model effective population size. Additionally, 207 when we used these new estimates of effective population size in the more accurate method for inferring 208 fitness coefficients that accounts for time-varying effective population size, we find that only about 1% of 209 lineages are non-neutral at a significance level of 5%, which further supports that lineages under constant 210 selection are unlikely to explain our results.

211
We also probed the spatial structure of transmission by inferring the scaled effective population size 212 separately for each region within England. We find that the scaled effective population size in the regions of

213
England is substantially smaller than that in England as a whole for B.1.177, Alpha, and Delta ( Figure S1),

214
suggesting that the transmission was not well-mixed at that time. Additionally, the discrepancy between 215 the inferred regional scaled effective population size and the observed number of positive individuals in a 216 region was comparable to that seen in England as a whole ( Figure S3), which is consistent with spatially 217 segregated dynamics with similar levels of genetic drift in each region. We further describe these results in 218 the Supplementary Information.

219
Potential mechanisms that can contribute to the high levels of genetic drift 220 Two potential mechanisms that can contribute to the observed high levels of genetic drift are: (1)

222
We investigate each of these mechanisms in turn and compare it to our results. While in reality, both 223 mechanisms (and others not explored here) are likely at play, it is challenging to tease them apart given our 224 limited data. Therefore, we consider the extreme situations where only one mechanism at a time is driving 225 the dynamics.

226
Superspreading occurs due to overdispersion in the number of secondary cases, which decreases the 227 effective population size. If superspreading were the only mechanism at play, then the variance in offspring 228 number that would explain our results would be the same as the ratio between the SIR null modelÑ SIR (t) 229 and the inferredÑ e (t) (16-589) ( Figure 3c). Current estimates of the variance in offspring number measured 230 by contact tracing and modeling across a wide range of times and locations are from around 0.7 in one 231 study to 65 in another (Table S1). We found two studies that apply to the UK, one which used a model  Figure S16). It is possible that contact tracing and modeling over-or under-estimates overdispersion due to 237 missed contacts [20]. However, on the other hand, it may be the case that superspreading is not the only 238 mechanism at play.

239
Host deme structure is another mechanism that can lead to decreased effective population size. In 240 such a model, individuals within a deme are very well-connected to one another (i.e. households or friend 241 groups, also known as "communities" in network science [44]), but there are few connections between demes 242 ( Figure 3c). It is possible for deme structure to occur without superspreading. For instance, in the schematic 243 in Figure 3c, Figure 3: Potential mechanisms that can generate a low effective population size. (a) Superspreading, where the distribution of the number of secondary cases (Z) from a single infected individual is broadly distributed (variance greater than mean). (b) Deme structure without superspreading, due to heterogeneity in the host network structure, where the distribution of the number of secondary cases is not broadly distributed (variance approximately equal to mean). (c) The ratio between theÑ SIR (t) (the scaled population size calculated from an SIR model using the number of observed positive individuals and the observed effective reproduction number) and the inferredÑ e (t) for each variant. Only data where the error in the SIR model N SIR (t) is less than 3 times the value are shown, because larger error bars make it challenging to interpret the results. The inferredÑ e (t) is lower than theÑ SIR (t) (which assumes well-mixed dynamics and no superspreading) by a factor of 16 to 589, indicating high levels of genetic drift. The variance in offspring number from the literature [42,43] does not entirely explain the discrepancy between the true and effective population sizes. (d) Simulations of deme structure without superspreading can generate high levels of genetic drift via jackpot events. SEIR dynamics are simulated within demes (with R t = 10, i.e. deterministic transmission) and Poisson transmission is simulated between demes (R t 1, i.e. stochastic transmission) such that the population R t ∼ 1 (see Methods). Simulation parameters are: mean transition rate from exposed to infected γ E = (2.5 days) −1 , mean transition rate from infected to recovered γ I = (6.5 days) −1 , total number of demes D total = 5.6 × 10 5 . The ratio between the number of infected individuals and the inferred effective population size is found to scale linearly with the deme size and not with the number of infected demes. This scaling results because of jackpot events where a lineage that happens to infect a susceptible deme grows rapidly until all susceptible individuals in the deme are infected.
chance [46], which would be consistent with this phenomenon.

255
To check our intuition that deme structure can decrease the effective population size and increase genetic  The levels of genetic drift that we observe are higher than literature values of superspreading; while this 281 may be partly due to the challenges of accurately estimating the extent of superspreading from data, it also 282 suggests that additional mechanisms may be leading to increased stochasticity. In particular, we explore a 283 simplified deme model with groups of individuals that are well-connected to each other (demes) but not to 284 individuals from other demes, and find that such a simple model can generate a low effective population 285 size even in the absence of superspreading, due to jackpot events. In reality, both superspreading and host 286 structure are likely at play. Additionally, they could interact with each other. For instance, there could be 287 superspreading within a deme. Future work can try to tease apart the contribution of these two mechanisms,

292
Accurately estimating the strength of genetic drift allows us to better understand disease spread and ex-293 tinction, as well as to better parameterize evolutionary models and understand how mutations will establish 294 in the population. We observed that with a few exceptions, the amount by which genetic drift was elevated 295 compared to the number of positives did not change much over time or across variants (Figure 3c), despite 296 changes in lockdowns and restrictions (which we may expect to decrease behavior that leads to superspread-297 ing). On the other hand, this may not be so surprising given the findings that restrictions affect the mobility 298 network structure in a complex way, decreasing some types of mobility while increasing others [51]. One 299 exception was that Alpha had significantly higher genetic drift compared to Delta and the strength of genetic 300 drift in Alpha first peaked then slowly decreased over time. This may be either due to differences in the 301 properties of the virus or differences in host behavior. For instance, it may suggest that the stochasticity 302 in the transmission of Alpha sharply increased then slowly decreased over time. Alternatively, this may 303 be driven by Alpha's expanding geographic range combined with reimported cases of Alpha into the UK 304 (observed from February 2021 onwards), which could both also decrease the effective population size [52]. 305 We observe that measurement noise of SARS-CoV-2 is mostly indistinguishable from uniform sampling, 306 but data from some variants at some times do exhibit more elevated measurement noise than uniform 307 sampling. Thus, we expect that assuming uniform sampling, as many methods do, or constant overdispersion 308 will lead to accurate estimates for this dataset [27,22,23,28]. The number of SARS-CoV-2 sequences from 309 England is extremely high and sampling biases are expected to be low, because of efforts to reduce sampling 310 biases by sampling somewhat uniformly from the population through the COVID-19 Infection Survey [31] 311 (from which a subset of positives are sequenced and included in the COG-UK surveillance sequencing data 312 that we use). On the other hand, other countries may have higher sampling biases, so jointly estimating 313 measurement noise and genetic drift may be more crucial in those settings. It may also be interesting to use this method to test whether genomics data taken from wastewater has lower levels of measurement noise as 315 compared to sequenced cases. 316 We find that constant selection is unlikely to explain our results, as excluding potentially non-neutral 317 lineages does not significantly change the inferred effective population size. Our method allows for the 318 inference of constant selection while accounting for time-varying effective population size, and thus may be 319 interesting in future applications of tests for selection. However, the model ignores other processes that 320 may lead to deterministic changes in frequency, such as migration and mutation (discussed below); thus, we 321 recommend combining it with complementary methods (for instance, phylogenetic methods) that test for 322 selection as somewhat independent tests for selection.

323
In summary, we find that the strength of genetic drift in SARS-CoV-2 transmission in England is higher 324 than expected based on the number of positive individuals and the levels of superspreading reported in the 325 literature. Our results suggest that additional models and methods will be needed to better understand the 326 mechanisms behind the observed elevated levels of genetic drift.

327
Limitations of the study and opportunities for future directions 328 First, in our analysis, we focused on standing variation that existed at a particular depth in the phylogenetic 329 tree and ignored de novo mutations subsequently arising during the time series. However, we don't think this  Second, there may be biases in the way that data are collected that are not captured in our model. While 339 our method does account for sampling biases that are uncorrelated in time, sampling biases that remain over 340 time cannot be identified as such (i.e. if one geographical region was dominated by a particular lineage and it 341 consistently had higher sequencing rates compared to another geographical region), and this can potentially 342 bias the inferred effective population size; although, this is also a problem in phylogenetic methods. One 343 approach to this problem that was utilized by some early methods during the pandemic is to develop sample 344 weights based on geography, time, and number of reported cases. Future work should study the effect of 345 different sampling intensities between regions on uncorrelated and correlated sampling noise. Additionally, 346 we assume that the measurement noise overdispersion is identical for all lineages within a variant; in reality, 347 there may be differences in sampling between lineages. Future work should explore whether this is the case 348 in the actual dataset, as well as the effect of lineage-specific measurement noise overdispersion on overall 349 method performance.

350
Third, the quantity of effective population size is a summary statistic that is influenced by many factors, 351 making its interpretation challenging. The effective population size describes the population size under a 352 well-mixed Wright-Fisher model, whereas in reality, this assumption is broken by many effects, including 353 host structure and broad offspring number distributions. Thus, in our study we are careful to interpret 354 effective population size only in the broadest terms of genetic drift, without being able to determine what 355 mechanisms lead to the inferred effective population size (although we do explore some possibilities above).

356
While within-host dynamics may in principle impact the lineage frequency trajectories, this effect is likely 357 small for our analysis because we focus on acute infections (infections in the community rather than in 358 hospitals and nursing homes). This is because acute infections of SARS-CoV-2 are thought to generate little 359 within-host diversity that is passed on due to the short infection duration and small bottleneck size between 360 hosts [55, 56]; while new mutations arising within acute hosts have been observed to be transmitted, these 361 events are rare [55].

362
Fourth, the use of a sliding window of 9 weeks on the lineage frequency data will lead to smoothing of 363 sharp changes in effective population size. It may be interesting in future work to develop a continuous 364 method that uses a prior to condition on changes in effective population size, similar to those that have been 365 developed for coalescence-based methods [1,57]. This would allow us to infer continuous changes in effective 366 population size without needing to use a sliding window.

367
While we have focused on SARS-CoV-2 in this study, the method developed here can be extended to 368 study genetic drift in other natural populations that are influenced by measurement noise and where genomic 369 frequency data are available, for instance other pathogens, field studies, and ancient DNA [58,59,60]. More 370 generally, ongoing methods development that integrates genomics, epidemiological, and other data sources 371 is crucial for being able to harness the large amounts of data that have been generated to better understand 372 and predict evolutionary dynamics.

Materials and Methods
Data sources and processing 375 We downloaded sequence data from the COVID-19 Genomics UK Consortium (COG-UK) [34]. We only 376 used surveillance data (labeled as "pillar 2"); this dataset is composed of a random sample of the positive 377 cases from the COVID-19 Infection Survey, which is a surveillance study of positive individuals in the 378 community administered by the Office for National Statistics (see below). For lineages that appeared before (Epiweek) to average out measurement noise that may arise due to variations in reporting within a week.

392
Then, the lineage frequency is calculated by dividing the number of sequences from that lineage in the 393 respective tree by the total number of sequences of that variant (or group of lineages) that were assigned to 394 any lineage in the respective tree.

395
Because our model describes birth-death processes when the central limit theorem can be applied, we need 396 the lineage frequencies to be sufficiently high. Thus, we randomly combine rare lineages into "superlineages" the community setting. The reported date of positive cases is the date that the sample was taken. The error 407 on the number of positive individuals from April 17, 2020 to July 5, 2020 is reported as the 95% confidence 408 interval, and after July 5, 2020 is reported as the 95% credible interval. The regional data reported the For the Delta variant, the sequences form two distinct groups along the depth direction, as seen from the 433 last panel of Figure S18. Therefore, to divide the Delta variant into lineages with small frequencies, we cut 434 the phylogenetic tree at two depths sequentially; we first cut the tree at d (1) cut , which resulted in lineages with 435 small frequencies plus a lineage with O(1) frequency. Then, to divide the latter lineage further, we took the 436 subtree associated with this lineage and cut the subtree at d  The transition probability between the true frequencies f t (the hidden states) due to genetic drift when 447 0 f 1 has been shown in [62] to be well-described by the following expression, which we use as our 448 transition probability, is a good approximation when the central limit theorem can be applied, which is the case when f 0.

455
By assuming that f t+1 ≈ f t , and defining φ t ≡ √ f t , Equation 4 can be approximated as a simple normal .
We describe the emission probability from the true frequency f t to the observed frequency f obs where M t is the number of input sequences. Again, this distribution is generically a good description when of positive individuals (i.e. can be described by a Poisson distribution in the limit of a large number of sequences), namely Var(n obs t |n t ) = n t or equivalently Var(f obs t |f t ) = ft Mt . This is the realistic minimum 467 amount of measurement noise. When c t > 1, it describes a situation where there is bias (that is uncorrelated 468 in time) in the way that sequences are chosen from the positive population. The case of 0 < c t < 1 describes 469 underdispersed measurement noise, or noise that is less random than uniform sampling. The case of c t = 0 470 describes no measurement noise (for instance, when all cases are sampled for sequencing). These last two 471 situations are unlikely in our data, and thus as we describe below, we constrain c t ≥ 1 in the inference 472 procedure. In addition to being a good description of measurement noise, defining the emission probability 473 in the same normal distribution form as the transmission probability allows us to easily derive an analytical 474 likelihood function, described below (Note: see Ref. [26] for a method to derive an analytical likelihood 475 function for arbitrary forms of the transition and emission probabilities). 476 We derive the likelihood function (up to a constant) for the the Hidden Markov Model using the forward algorithm, although it can alternatively be derived by marginalizing over all hidden states. We assume an (improper) uniform prior on φ 0 (i.e. no information about the initial true frequency of the lineage).

479
The forward algorithm has an analytical form for the simple case of Gaussian transition and emission probabilities. We use the identity for the product of two normal distributions N (x, µ, v), where µ is the mean and v is the variance: Solving the forward algorithm recursively, we have where .
Equation 17 can be substituted into Equation 13 to obtain the full analytical likelihood function.

480
Fitting the model to data 481 We split the time series data into overlapping periods of 9 Epiweeks, over which the effective population size is 482 assumed to be constant. We first use the moments of the probability distributions combined with least squares 483 minimization to get an initial guess for the parameters. Then, we perform maximum likelihood estimation 484 using the full likelihood function. To capture uncertainties that arise from the formation of superlineages 485 from lineages, we create superlineages randomly 100 times (except where indicated otherwise). We infer the 486 strength of measurement noise and the effective population size for each superlineage combination (described 487 below).

488
Determining the initial guess for the parameters using method of moments approach 489 Combining the transition and emission probabilities, and marginalizing over the hidden states we have The first two terms of κ i,j are the contribution to the variance from measurement noise at times i ad j, and 490 the third term is the contribution to the variance from genetic drift.

491
We calculate the maximum likelihood estimate of κ i,j ,κ i,j , which is simply the mean squared displacement The standard error is given by where Z is the number of superlineages.

494
By looking across all pairs of timepoints i and j, we get a system of linear equations in κ i,j that depend 495 on the parameters c t andÑ e (t). To determine the most likely values of the parameters, we minimize using scipy.optimize.minimize with the L-BFGS-B method and the bounds 1 ≤ c t ≤ 100 and 1 ≤Ñ e (t) ≤ 10 7 .
While underdispersed measurement noise (c t < 1) is in principle possible, we constrain c t ≥1 because 498 realistically, the lowest amount of measurement noise will be from uniform sampling of sequences. An example of inferred parameters using the methods of moments approach on simulated data is shown in 500 Figure S19.

Maximum likelihood estimation of the parameters 502
For each set of superlineages, we use the inferred measurement noise values (c t ) and inferred scaled effective population size from above (Ñ e (t)) as initial guesses in the maximization the likelihood function in Equation 13 over the parameters. For the optimization, we use scipy.optimize.minimize scalar with the Bounded method and the bounds 1 ≤ c t ≤ 100 and 1 ≤Ñ e (t) ≤ 10 11 . The time t in the inferredÑ e (t) is taken to be the midpoint of the 9 Epiweek period. The reportedÑ e (t) is the median inferredÑ e (t) across all superlineage combinations whereÑ e (t) < 10 5 (values above 10 5 likely indicate non-convergence of the optimization). The reported errors onÑ e (t) are the 95% confidence intervals (again taking the median across all superlineage combinations whereÑ e (t) < 10 5 ) which are calculated by using the likelihood ratio to get a p-value [63,64].
We replace the likelihood with the profile likelihood, which has the nuisance parameters c 0:T profiled out: where I is an indicator function that equals one when the argument is true and zero otherwise, An example of inferred parameters on simulated data using the maximum likelihood estimation approach, 512 compared to the initial guesses of the parameters from the methods of moments approach, is shown in 513 Figure S19.

514
Correcting for the number of sequences assigned to lineages 515 Because some sequences occur before the cut point in the tree that is used for creating lineages, they are 516 not included in any lineages. As a result, the number of sequences assigned to lineages is lower than the 517 number of sequences in the tree. To correct for the bias in inferred effective population size that results 518 from leaving out sequences from parts of the tree, we divide the inferred effective population size by the 519 fraction of sequences in the tree that are assigned to a lineage. We note that while the number of sequences 520 in the tree is less than the total number of sampled sequences, the sequences in the tree were chosen to be a 521 representative fraction of the total sampled sequences. Thus, we do not need to additionally correct for the 522 downsampling of sequences that were included in the tree. To test that randomly subsampling sequences for 523 the analysis does not affect the results, we randomly subsampled half of the Delta sequences, and reran the 524 analyses; the inferred effective population size was very similar to that from the full number of sequences 525 ( Figure S20).

Simulations for validating method
For the model validation, we perform simulations of the lineage trajectories using a discrete Wright-Fisher model. 500 lineages are seeded initially, and the initial frequency of lineages is taken to be the same across all lineages. In each subsequent Epiweek, the true number of counts for a lineage is drawn from a multinomial distribution where the probabilities of different outcomes are the true frequencies of the lineages in the previous Epiweek and the number of experiments is the effective population size. The true frequency is calculated by dividing the true number of counts by N . The observed counts are drawn from a negative binomial distribution, which has the same mean and variance as the emission probability in Equation 6. The total number of 528 observed sequences in each timepoint is calculated empirically after the simulation is completed, as it may 529 not be exactly M t . The simulation is run for 10 weeks of "burn-in" time before recording to allow for 530 equilibration. Superlineages are created in the same way as described above.

531
For long time series simulations, some lineages will go extinct due to genetic drift, making it challenging The variance in offspring number for an SIR model is approximately 2.

537
For an SEIR model, we calculatedÑ e (t) following the framework from Ref. [36]. Using this framework, 538 we were only able to consider a situation where the epidemic is in equilibrium. We test how well this 539 approximates the situation out of equilibrium using simulations (see Supplementary Information). 540 We first considered how the mean number of lineages, A, changes going backwards in time, s, which is where the limit assumes that the number of infectious individuals, N (s), is large. In the Kingman coalescent 546 we also have .
Combining Equations 39, 40, and 41, we have 548Ñ e (t) = Thus by determining the number of transmissions per unit time, f , and the number of infectious individuals, These quantities can be derived from the equations describing the number of susceptible (S), exposed (E), infectious (I), and recovered (R) individuals in an SEIR model where β is the number of transmissions per infectious individual per unit time (the number of contacts 551 made by an infectious individual per unit time multiplied by the probability that a contact results in a 552 transmission), N H is the total population size (N H = S + E + I + R), γ E is the rate that an exposed 553 individual becomes infectious, δ E is the rate of death for an exposed individual, γ I is the rate than an 554 infectious individual recovers, and δ I is the rate of death for an infectious individual.

555
The number of infectious individuals in a generation, N (s), is given by the instantaneous number of infec-556 tious individuals plus the number of exposed individuals that will become infectious in that generation [41].

557
Thus, The number of transmissions per unit time is given by We rewrite f in terms of the effective reproduction number (for which data are available) which is given by 560 the number of transmissions per unit time (f ) divided by the number of recoveries and deaths per unit time Putting everything together, we have thatÑ e (t) for an SEIR model in equilibrium is given by For SARS-CoV-2, the death rates are much lower than the rate at which exposed individuals become in-563 fectious and the rate at which infectious individuals recover (δ E , δ I γ E , γ I ). In this limit, Equation 50 564 simplifies to 565Ñ e SEIR,eq (t) = (E + I) 2 2R t γ I I .
To calculate theÑ e for an SIR or SEIR model, we use the estimated number of positives from the 566 COVID-19 Infection Survey for I(t). This number is an estimate of the number of positive individuals in 567 the community as measured by surveillance and includes both symptomatic and asymptomatic individuals.

568
While the estimated number of positives does not include cases from hospitals, care homes, and other com-569 munal establishments, community cases likely contribute the most to transmission. We used the measured 570 effective reproduction number from the UK Health Security Agency for R t .

571
To calculate the number of exposed individuals for the SEIR model, we solved for E in Equation 45 572 affect the results ( Figure S12). The error on E was calculated by taking the minimum and maximum possible 576 values from the combined error intervals of I(t + ∆t) and I(t − ∆t) (note that this does not correspond to 577 a specific confidence interval size).

578
The error onÑ e (t) for the SIR or SEIR model was calculated similarly by taking the minimum and 579 maximum possible values from the combined error intervals of E, I, and R t . Only time points where the 580 error interval ofÑ e (t) was less than 3 times the point estimate were kept.

581
Calculating the effective population size for an SIR or SEIR model by variant

582
To calculate the effective population size for an SIR or SEIR model by variant, we needed to determine 583 the variant-specific: number of infectious individuals I(t), number of exposed individuals E(t), effective 584 reproduction number R t , and rate than an infectious individual recovers γ I . We assumed that γ I is constant via surveillance. We calculated the number of variant-specific exposed individuals E(t) in the same way as 589 described above using the variant-specific number of infectious individuals. We assumed that the rate an 590 exposed individual becomes infectious γ E is constant between variants.

591
We calculated the variant-specific effective reproduction number by rescaling the measured effective 592 reproduction number for the whole population where R w 0 is the basic reproduction number of the variant w and f w is the fraction of the infectious population  Figure S21).

597
Inference of fitness from lineage frequency time series 598 We sought to infer the fitness effects of individual lineages, so that we could then determine if putatively 599 selected lineages are influencing the estimation of the time-varying effective population sizes. We first used 600 a deterministic method to estimate lineage fitness effects, similar to the method described in [66].

601
On average, when the frequency of lineage i is sufficiently small f t,i 1, the frequency dynamics will 602 exponentially grow/decay according to the lineage fitness effect, s i , The two sources of noise-genetic drift and measurement noise-both arise from counting processes, so the 604 combined noise will follow var (f t,i ) ∝ f t,i . To account for the inherent discreteness of the number of cases 605 in a lineage-especially important to accurately model lineages at low frequencies-we modeled the observed 606 counts at Epiweek t of lineage i, r t,i , as a negative binomial random variable, var (r t,i ) = ζ t r t,i Where M t is the total number of sequences, and ζ t is a dispersion parameter. We took ζ t as the total 608 marginal variance at a given time-point, i.e. ζ t = c t + M t /N e (t), where we computed estimates of c t and N e 609 as previously described (section "Maximum likelihood estimation of the parameters"). The final likelihood for the fitness, s i , of lineage i is obtained by combining the data from all the relevant the time-points, The point estimate of the lineage fitness,ŝ i , is then numerically computed as the maximum likelihood, For ease of computation and generality we compute the p-value as the posterior probability that the 613 likelihood ratio between null and alternative hypotheses is greater than 1, i.e. the probability that the 614 data more strongly support the null hypothesis (the lineage is neutral) over the alternative (the lineage is 615 non-neutral), We used the point estimates of N e and c t that were calculated after we had excluded the selected lineages 2. Transition of an exposed individual to being infectious with probability γ E E(t) 3. Recovery of an infectious individual with probability γ I I(t) where β ≡ R 0 γ I , R 0 is the basic reproduction number, γ E is the rate that exposed individuals become 640 infectious, and γ I is the rate that infectious individuals recover. As in the rest of this work, we assume that 641 the birth rate of susceptible individuals, background death rate, and the death rate due to disease are much 642 slower compared to the rates of the above processes and thus can be neglected from the dynamics.

643
The time until the next event is drawn from an exponential distribution with rate given by the inverse within the same week an individual becomes exposed and then becomes infectious, it will cause the number 651 of susceptible individuals to decrease by 1, no change in the number of exposed individuals, and the number 652 of infectious individuals to increase by 1.

659
To test the sensitivity of the results to whether the reported PCR positive individuals are infectious or 660 whether they can also be from the exposed class, we recorded the results in two ways. In the first case, only 661 the infectious individuals we recorded as positive ( Figure S22), and in the second case both the exposed and 662 infectious individuals were recorded as positive ( Figure S23). Inference ofÑ e (t) was subsequently done on 663 the lineage frequency trajectories of the recorded positive individuals. The SIR or SEIR modelÑ e (t) were 664 calculated analytically using the true numbers of infectious and exposed individuals and numerically using 665 the number of positive individuals as described above in "Calculating the effective population size for an 666 SIR or SEIR model".    used extensively in population genetics [22,70,53,23,26,25]  includes a susceptible, exposed, infectious, and recovered class. The SEIR model is a good representation of 931 the epidemiology of SARS-CoV-2 when PCR test positivity is closely associated with an infected host being 932 infectious; the literature suggests that this is a good assumption for SARS-CoV-2 [16], but we also test this 933 assumption below. The exposed class thus represents individuals before they are infectious and test positive.
where E(t) is the number of exposed individuals, I(t) is the number of infectious individuals, R t is the 937 effective reproduction number, and γ I is the rate at which infectious individuals stop being infectious.

938
While this equation is derived under equilibrium conditions, we show using simulations that this equation In reality, it may also be the case that some people test positive in a PCR test before they become 947 infectious. To test the impact of this possibility on our results, in our simulations we recorded both exposed 948 and infectious individuals as testing positive. We then calculated the SEIR modelÑ e (t) numerically as 949 described in "Calculating the effective population size for an SIR or SEIR model" assuming that I(t) includes 950 both infectious and exposed individuals ( Figure S23). We find that the numerical solutions give slightly higher 951Ñ e (t) as compared with the true analytical solutions; however, the numerical solutions to the SEIR and SIR 952 models bound the inferredÑ e (t). Thus we also expect that our main result that the literature values of 953 superspreading do not sufficiently explain our results should still hold in this scenario.

954
To calculate the SEIR modelÑ e (t) for the actual data, for the number of infectious individuals, we used the number of positive individuals estimated from the UK Office for National Statistics' , which is a household surveillance study that reports positive PCR tests, regardless of symptom status. We used the measured effective reproduction number in England reported by the UK Health Security Agency [38]. We found thatÑ e SEIR (t) is very similar to the number of positives because the effective reproduction number in England was very close to 1 across time. To calculateÑ e SEIR (t) for each variant or group of lineages, we rescaled the population-level I(t) and R t based on the fraction of each variant in the population and the relative differences in reproduction numbers between variants (see Methods). We then calculated the scaled true population size,Ñ (t) ≡ N (t)τ (t), for the SEIR model by multiplying by the variance in offspring number, σ 2 , for the SEIR model [41] Overall, the inferredÑ e (t) is lower thanÑ SEIR (t) by a time-dependent factor that varies between 70 and 955 2000 ( Figure S14), suggesting high levels of genetic drift in England across time, which is consistent with 956 what we find with an SIR model (Figures 2 and S13). Also similarly to in the case with an SIR model, the within the defined geographical areas, but not between them) and the dynamics were the same in each region, 966 we would expect that the ratioÑ e SIR (t)/Ñ e inf (t) to be the same across regions.     (c) Inferred effective population size (Ñ e (t)) on simulated data compared to true values when c t is jointly inferred and when c t is fixed at 1 (uniform sampling cut = 1.954 · 10 −3 }. The color of the lines for the parameters that were used in the main text are the same as those shown in Figure 2.      Inferred N e Delta Figure S15: Inferred scaled effective population size of all lineages compared to that when excluding nonneutral lineages detected by conservative method (assumes no genetic drift). , where R t is the effective reproduction number and σ 2 is the variance in offspring number. The circles show the inferred overdispersion parameter if we assume there is only superspreading and no deme structure. For the inferred overdispersion parameter, the estimated effective reproduction number in England by variant (see Methods) is used for R t , and the ratio between the SIR model population size and the inferred effective population size is used for σ 2 . The shaded area for the inferred overdispersion parameter k gives an estimate of the error and is calculated by combining minimum or maximum values of the individual parameters; note that this does not correspond to a particular confidence interval.   Simulated lineage frequency trajectories. (c) Inferred effective population size (Ñ e (t)) on simulated data using the method of moments (MSD, for mean squared displacement) and maximum likelhood (HMM, for Hidden Markov Model) estimation approaches compared to true values. The shaded region shows the 95% confidence interval of the inferred values. The confidence interval using the method of moments approach was calculated by taking the middle 95% of values when bootstrapping over the superlineages. The confidence interval using the maximum likelihood estimation approach was determined using the posterior (see Methods) and takes into account joint errors in c t andÑ e (t). (d) Inferred measurement noise (c t ) on simulated data using the method of moments and maximum likelihood estimation approaches compared to true values. The shaded region shows the 95% confidence interval calculated using bootstrapping (see Methods).  Inferred N e (t) I(t) I(t) + E(t) N SEIR, eq e (t) analytical N SIR e (t) analytical N SEIR, eq e (t) numerical N SIR e (t) numerical Figure S23: Simulations of stochastic SEIR dynamics without measurement noise, and comparison of the inferredÑ e (t) to Equations 1 and 51 when the reported positive individuals include both infectious and exposed individuals. (Top) Muller plot of simulated infectious and exposed individuals' lineage trajectories (simulations described in Methods). Infectious and exposed individuals are randomly assigned a lineage in week 11, and individuals that they transmit to are infected with the same lineage. The blue lineage before week 11 indicates the infectious and exposed individuals that existed before lineages were assigned. (Bottom) Comparison of the inferredÑ e (t) using the lineage trajectories shown in the top panel to the number of infectious individuals I(t), the sum of the number of infectious and exposed individuals I(t)+E(t), Equation 51 (SEIR modelÑ e (t)), and Equation 1 (SIR modelÑ e (t)) calculated analytically or numerically as described in the Methods. The numerical solutions give slightly higherÑ e (t) as compared with the analytical solutions; however, the numerical solutions to the SEIR and SIR models bound the inferredÑ e (t).