## Abstract

To effectively respond to an emerging infectious disease outbreak, policymakers need timely and accurate measures of disease prevalence in the general population. This paper presents a new methodology to estimate real-time population infection rates from non-random testing data. The approach compares how the observed positivity rate varies with the size of the tested population and applies this gradient to infer total population infections. Applying this methodology to daily testing data across U.S. states during the first wave of the COVID-19 pandemic, we estimated widespread undiagnosed COVID-19 infections. Nationwide, we found that for every identified case, there were 12 population infections. Our prevalence estimates align with results from seroprevalence surveys, alternate approaches to measuring COVID-19 infections, and total excess mortality during the first wave of the pandemic.

**Citation:** Benatia D, Godefroy R, Lewis J (2024) Estimating population infection rates from non-random testing data: Evidence from the COVID-19 pandemic. PLoS ONE 19(9): e0311001. https://doi.org/10.1371/journal.pone.0311001

**Editor:** Shrisha Rao, International Institute of Information Technology, INDIA

**Received:** December 5, 2023; **Accepted:** September 10, 2024; **Published:** September 26, 2024

**Copyright:** © 2024 Benatia et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** The data supporting the findings of this study are openly available from the COVID Tracking Project (https://covidtracking.com/).

**Funding:** The author(s) received no specific funding for this work.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Early and aggressive public health interventions can substantially reduce pandemic mortality. Evidence from both the Coronavirus Disease 2019 (COVID-19) outbreak and the 1918–1919 Influenza Pandemic suggests that tens of thousands of lives were lost due to minor delays in the initial adoption of preventative public health measures [1–3]. To effectively respond in the early stages of an infectious disease outbreak, policymakers need timely and accurate information on local disease prevalence.

Our understanding of population infection rates may be limited by constraints on testing capabilities in the early stages of an outbreak. During the first wave of COVID-19, severe constraints on the supply of PCR tests in the U.S. meant that testing was limited to a small number of high-risk individuals, and many mild or asymptomatic cases went undiagnosed [4–6]. Moreover, the absence of randomized population-based testing makes it impossible to infer the population infection rate from the share of positive cases among the tested sample, since the selection of high-risk individuals into testing will lead the sample positivity rate to overstate disease prevalence in the overall population. Finally, wide differences in testing capabilities across both countries and subnational jurisdictions can hamper our understanding of the geographic spread of the disease, since more cases will be identified in locations where testing is more widely available. Indeed, by early April 2020, South Korea had conducted three times more per capita COVID-19 tests than the U.S., while New York state had conducted nearly twice as many per capita tests as New Jersey [7, 8].

The main objective of this paper is to develop a methodology to estimate real-time population infection rates in the early stages of an infectious disease outbreak. The methodology corrects the observed positivity rates among tested individuals for non-random sampling to calculate overall population infection rates. The approach builds on insights from econometrics on the issue of sample selection bias [9–14], and can be used to estimate disease prevalence at various jurisdictional levels (national and subnational) based on widely available testing data. Further, the methodology does not require information on clinical or epidemiological characteristics of the disease, such as the case fatality rate, the asymptomatic proportion, or the reproductive number; factors over which there is often considerable uncertainty during the early phases of an outbreak [15–17].

The methodology is based on the insight that the relationship between the positivity rate and the size of the tested population can be used to assess the severity of selection bias. For example, a *negative* slope indicates *positive* selection bias, since individuals who are most frequently tested have the highest probability of infection. Once the functional form of this relationship is estimated, the population infection rate can be computed as a combination of the *observed* positivity rate and the *estimated* selection gradient, which corrects for non-random testing.

The second objective of this paper is to apply this methodology to daily data on COVID-19 testing rates and positivity rates across U.S. states from late March to early April 2020 to estimate population infection rates during the first wave of the pandemic. The key identification assumption for the analysis is that testing rates must be unrelated to underlying population disease prevalence. To ensure that this assumption is met, we focused on high frequency day-to-day variation in testing across states. Intuitively, because there is little scope for disease prevalence to evolve from one day to the next, daily changes in testing rates should be orthogonal to underlying changes in population infection rates. In addition, we estimated generalized versions of the model that relax this identification assumption.

Finally, we assessed the validity of the methodology by comparing the estimated population infection rates across states to alternate measures of pandemic severity during the first wave. These measures include 1) estimates of population prevalence for SARS-CoV-2 antibodies taken in a number of specific geographical sites in April and May 2020, 2) estimates of population COVID-19 infections by early April based on an alternative methodology that relies on retrospective COVID-19 deaths, and 3) total all-cause excess mortality during the first wave of the pandemic.

Our empirical framework complements existing methods used to estimate population infection rates in the United States and internationally [18–25]. One approach has been based on the Susceptible Infectious Recovered (SIR) epidemiological model, which calibrates parameters to the specific characteristics of the SARS-CoV-2 pandemic to estimate current and future infections. A challenge for this approach is the large uncertainty regarding the relevant parameter values for the virus, particularly in the early stages of an outbreak. Other research has relied on Bayesian modelling to infer past disease prevalence from observed COVID-19 deaths. These models require fewer assumptions regarding the underlying parameter values. However, given the extended delay between initial infection and death [26], estimates based on this approach will identify disease prevalence with a significant lag, and so cannot be used to provide information on real-time infection rates.

Most closely related to our paper is [27], who use data on the total number of tests and the positive test rate to estimate ranges for population COVID-19 infection rates for Illinois, New York, and Italy in early April. Their approach imposes only a weak monotonicity assumption for identification, but produces wide bounds on infection rates. Our approach produces much narrower intervals, but requires imposing some structure on the selection process. Policymakers should be aware that bounds that account for model uncertainty would be wider than ours, as is not uncommon in econometric analyses [28].

## Materials and methods

### Data

The main analysis was based on daily data on total test results (positive plus negative) and the positivity rate (positive tests divided by total tests) across U.S. states from March 31 to April 7. This period was selected to coincide with the sharp rise in reported cases in the U.S. during the first wave, and for ease of comparison to a number of seroprevalence studies conducted around the same time period. We excluded earlier observations to limit errors associated with changes in state reporting practices throughout March, although research suggests that community transmission in many states was already widespread by mid-March [25].

The data were obtained from the COVID Tracking Project, a site launched by journalists from The Atlantic that publishes high-quality data on the outbreak across U.S. states [7]. The data were compiled primarily from state public health authorities, occasionally supplemented with information from news reporting, official press releases, or messages from officials released on Facebook or Twitter. We supplemented these data with information on total state population [29].

### Methods

#### Theory.

A simple selection model for testing was developed to link the observed positivity rate among the sample of tested individuals to the overall population infection rate.

We consider a stable population, normalized to size one, and denote *A* and *B* as the number of sick and healthy individuals, respectively. Let *p*_{n} denote the probability that a sick person is tested and *q*_{n} the probability that a healthy person is tested, given a total number of tests, *n*. Thus, we have:

$$n = p_n A + q_n B,$$

and assuming the test is accurate, the number of positive tests is:

$$s = p_n A.$$

This framework highlights how non-random testing will bias estimates of the population disease prevalence. Using Bayes’ rule, we can write the relative probability of testing as the following:

$$\frac{p_n}{q_n} = \frac{\Pr(\text{tested} \mid \text{sick}, n)}{\Pr(\text{tested} \mid \text{healthy}, n)} = \frac{\Pr(\text{sick} \mid \text{tested}, n) / \Pr(\text{sick} \mid n)}{\Pr(\text{healthy} \mid \text{tested}, n) / \Pr(\text{healthy} \mid n)},$$

which is equal to one if tests are randomly allocated. When testing is targeted to individuals who are more likely to be sick, we have *Pr*(*sick*|*tested*, *n*) > *Pr*(*sick*|*n*) and *Pr*(*healthy*|*tested*, *n*) < *Pr*(*healthy*|*n*), so the ratio will be greater than one. In this scenario, the ratio of sick to healthy people in the sample, *p*_{n}*A*/*q*_{n}*B*, will exceed the ratio in the overall population, *A*/*B*.
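As a small numerical illustration of this bias, the following Python sketch computes the sample positivity rate under random and targeted testing. The prevalence and testing probabilities are hypothetical values chosen for illustration, not estimates from the paper.

```python
def sample_positivity(A, p_n, q_n):
    """Share of positive tests when a fraction A of a unit population is sick."""
    B = 1.0 - A                    # healthy share of the population
    positives = p_n * A            # sick individuals who are tested
    tests = p_n * A + q_n * B      # all individuals who are tested
    return positives / tests

A = 0.02  # hypothetical true prevalence (2 percent)

# Random testing: sick and healthy people are equally likely to be tested.
random_rate = sample_positivity(A, p_n=0.01, q_n=0.01)

# Targeted testing: sick people are ten times more likely to be tested.
targeted_rate = sample_positivity(A, p_n=0.10, q_n=0.01)

print(f"random testing:   {random_rate:.3f}")
print(f"targeted testing: {targeted_rate:.3f}")
```

Under random testing the sample positivity rate equals the population prevalence, while under targeted testing it overstates prevalence.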

We assume that the severity of selection bias can be expressed as a function of the number of tests:

$$\frac{p_n}{q_n} = f(n; \theta) \tag{1}$$

where *n* is the number of conducted tests and *θ* is a vector of parameters to be estimated.

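According to this setup, the fraction of positive tests, *s*/*n*, can be written as follows:

$$\frac{s}{n} = \frac{p_n A}{p_n A + q_n B} = \frac{1}{1 + \dfrac{q_n}{p_n}\dfrac{B}{A}} \tag{2}$$

In practice, the denominator of Eq (2) is much larger than one: during the sample period, the median ratio of negative to positive tests, $(n - s)/s$, across U.S. states was 7.3. Thus, taking logs of Eq (2), we can make the following approximation:

$$\ln\left(\frac{s}{n}\right) \approx \ln\left(\frac{p_n}{q_n}\right) + \ln\left(\frac{A}{B}\right) \tag{3}$$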

Eq (3) shows that the log share of positive tests in the sample can be approximated by the sum of the log ratio of the relative probability of testing, *p*_{n}/*q*_{n}, and the *unobserved* log ratio of sick to healthy people in the population, *A*/*B*.
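To get a feel for the size of the approximation in Eq (3), the following Python sketch compares the exact log positivity rate implied by the model with the approximate expression. The prevalence value is hypothetical; the negative-to-positive ratio is set to the median of 7.3 reported above.

```python
import math

def exact_log_positivity(ratio_pq, A):
    """ln(s/n) with s = p_n*A and n = p_n*A + q_n*B, where B = 1 - A."""
    B = 1.0 - A
    return -math.log(1.0 + B / (ratio_pq * A))

def approx_log_positivity(ratio_pq, A):
    """Approximation: ln(p_n/q_n) + ln(A/B)."""
    B = 1.0 - A
    return math.log(ratio_pq) + math.log(A / B)

A = 0.01                                   # hypothetical prevalence
neg_to_pos = 7.3                           # median ratio reported in the text
ratio_pq = (1.0 - A) / (A * neg_to_pos)    # p_n/q_n that yields that ratio

exact = exact_log_positivity(ratio_pq, A)
approx = approx_log_positivity(ratio_pq, A)
gap = approx - exact

print(f"exact:  {exact:.4f}")
print(f"approx: {approx:.4f}")
print(f"gap:    {gap:.4f}")
```

The gap equals ln(1 + 1/*r*), where *r* is the negative-to-positive ratio, so it shrinks as *r* grows. In a first-difference specification, only the day-to-day change in this gap enters the estimating equation.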

#### From theory to estimation.

To conduct the estimation, a first difference estimator was adopted, where the dependent variable is the difference in the log positivity rate between two consecutive days, *t* − 1 and *t*, in a given state *i*. Given the last equation, this first difference is equal to:

$$\Delta \ln\left(\frac{s_{i,t}}{n_{i,t}}\right) = \ln f(n_{i,t}; \theta) - \ln f(n_{i,t-1}; \theta) + u_{i,t} \tag{4}$$

where *u*_{i,t} is a mean zero error term that depends on the change in the log ratio of sick to healthy individuals in the population from *t* − 1 to *t* and an idiosyncratic component, *ϵ*_{i,t}.

Eq (4) forms the basis of the empirical analysis. The identifying assumption is strict exogeneity in the error term: *E*(*u*_{i,t}|*n*_{i,t}, *n*_{i,t−1}) = 0. This assumption ensures that the errors are uncorrelated with any function of changes in the number of tests, Δ*n*_{i,t}, and will be violated if changes in the population infection rate are systematically related to testing capacity. This assumption is supported by the short time interval in the daily first difference specification, which limits the scope for disease evolution. In some specifications, we add controls for state fixed effects to allow for jurisdiction-specific exponential growth in underlying disease prevalence from one day to the next.

By focusing on a daily first difference estimator, we are able to partial out the unobserved log ratio of sick to healthy people in the population, *A*_{i,t}/*B*_{i,t}. As a result, changes in the positivity rate depend on the number of tests *only* through a selection channel.

Using day-to-day changes in the positivity rate and day-to-day changes in the number of tests, we can recover $f(n; \hat{\theta})$ by estimating Eq (4). This term captures the predicted change in the positivity rate as a function of the number of tests, *n*. We can recover the estimated population infection rates, $\hat{s}_{i,t}/pop_{i}$, as the estimated positivity rate if the entire population in state *i* were tested on date *t*, i.e. *n*_{i,t} = *pop*_{i}.

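The estimated population infection rate can be obtained by rewriting Eq (4) as:

$$\ln\left(\frac{\hat{s}_{i,t}}{pop_{i}}\right) = \ln\left(\frac{s_{i,t}}{n_{i,t}}\right) + \left[\ln f(pop_{i}; \hat{\theta}) - \ln f(n_{i,t}; \hat{\theta})\right] \tag{5}$$

Eq (5) shows that the estimated log population infection rate is equal to the *observed* log sample positivity rate, $\ln(s_{i,t}/n_{i,t})$, plus an adjustment factor that corrects for non-random testing, $\ln f(pop_{i}; \hat{\theta}) - \ln f(n_{i,t}; \hat{\theta})$.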

One could also view this exercise as a reduced form estimation of the relationship between the fraction of individuals who test positive and the size of the tested population, holding constant the population share of sick. Once this relationship has been consistently estimated, we can calculate the estimated share of positive tests for any value of *n*, including when *n* = *pop*_{i}.
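The extrapolation to *n* = *pop*_{i} can be sketched in a few lines of Python. The sketch assumes, for illustration only, that the selection ratio takes the form 1 + exp(*γ* + *βn*) with *n* measured in tests per capita (so that testing the whole population corresponds to *n* = 1). The function names and parameter values are hypothetical and are not the estimates reported in the paper.

```python
import math

def selection_ratio(n, gamma, beta):
    """Assumed form of p_n/q_n: 1 + exp(gamma + beta * n), n in tests per capita."""
    return 1.0 + math.exp(gamma + beta * n)

def adjusted_infection_rate(positivity, n, gamma, beta, n_full=1.0):
    """Predicted positivity rate if testing were expanded from n to n_full."""
    log_adjustment = (math.log(selection_ratio(n_full, gamma, beta))
                      - math.log(selection_ratio(n, gamma, beta)))
    return positivity * math.exp(log_adjustment)

# Hypothetical parameter values and a hypothetical state-day observation.
gamma, beta = 3.0, -50.0
observed_positivity = 0.20   # 20 percent of tests are positive
tests_per_capita = 0.005     # 0.5 percent of the population tested

rate = adjusted_infection_rate(observed_positivity, tests_per_capita, gamma, beta)
print(f"observed positivity: {observed_positivity:.3f}")
print(f"adjusted estimate:   {rate:.4f}")
```

With a negative *β*, the adjustment factor is negative in logs, so the adjusted estimate lies below the observed positivity rate. If *β* were zero, the two would coincide.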

#### Empirical implementation.

To implement the procedure, we specify the following functional form for the selection process into testing, *f*(*n*; *θ*):

$$f(n; \theta) = 1 + e^{\gamma + \beta n}$$

The term *e*^{γ+βn} ≥ 0 reflects the fact that testing has been targeted towards higher risk populations, with the intercept, *γ*, capturing the severity of selection bias when testing is limited. Meanwhile, the coefficient *β* < 0 identifies how selection bias decreases with *n* as the ratio *p*_{n}/*q*_{n} approaches one. Intuitively, as testing expands, the sample will become more representative of the overall population, and the selection bias will diminish.

Substituting this function into the first difference regression model and taking a third order power series approximation of the log function yields the following estimating equation:

$$\Delta \ln\left(\frac{s_{i,t}}{n_{i,t}}\right) = \sum_{k=1}^{3} \frac{(-1)^{k+1}}{k}\, e^{k\gamma}\left(e^{k\beta n_{i,t}} - e^{k\beta n_{i,t-1}}\right) + \nu_{i,t} \tag{6}$$

where *ν*_{i,t} is the mean-zero Gaussian error term. Eq (6) forms the basis for the empirical analysis. The model was estimated by non-linear least squares, allowing for heteroskedastic errors.
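The least-squares idea can be illustrated with a minimal Python sketch on synthetic data. It assumes the selection ratio is 1 + exp(*γ* + *βn*) and uses the third-order power series of ln(1 + *x*). A coarse grid search stands in for a proper non-linear least squares routine, and all parameter values and data are simulated for illustration. This is not the authors' implementation.

```python
import math
import random

def series_log_ratio(n, gamma, beta, order=3):
    """Truncated power series of ln(1 + x) evaluated at x = exp(gamma + beta * n)."""
    x = math.exp(gamma + beta * n)
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, order + 1))

def predicted_change(n_now, n_prev, gamma, beta):
    """Model-implied change in the log positivity rate between two days."""
    return series_log_ratio(n_now, gamma, beta) - series_log_ratio(n_prev, gamma, beta)

def sum_squared_residuals(data, gamma, beta):
    """data holds tuples (change in log positivity, n today, n yesterday)."""
    return sum((dy - predicted_change(n1, n0, gamma, beta)) ** 2
               for dy, n1, n0 in data)

def fit_by_grid_search(data, gammas, betas):
    """Return the (gamma, beta) pair on the grid with the smallest SSR."""
    candidates = ((g, b) for g in gammas for b in betas)
    return min(candidates, key=lambda gb: sum_squared_residuals(data, gb[0], gb[1]))

# Synthetic data from hypothetical "true" parameters plus small Gaussian noise.
random.seed(0)
true_gamma, true_beta = -1.0, -100.0
data = []
for _ in range(200):
    n_prev = random.uniform(0.001, 0.02)   # tests per capita yesterday
    n_now = random.uniform(0.001, 0.02)    # tests per capita today
    noise = random.gauss(0.0, 0.001)
    dy = predicted_change(n_now, n_prev, true_gamma, true_beta) + noise
    data.append((dy, n_now, n_prev))

gammas = [-2.0 + 0.25 * i for i in range(9)]    # -2.0, -1.75, ..., 0.0
betas = [-200.0 + 25.0 * i for i in range(8)]   # -200, -175, ..., -25
gamma_hat, beta_hat = fit_by_grid_search(data, gammas, betas)
print(gamma_hat, beta_hat)
```

In practice one would replace the grid search with a gradient-based optimizer and compute heteroskedasticity-robust standard errors. The grid is used here only to keep the sketch free of external dependencies.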

Post-estimation, we derived predicted values for population infection rates based on Eq (5). The Delta method was used to approximate the standard errors of the estimated population infection rates. Specifically, the Delta method was used to calculate the standard errors of a first-order Taylor approximation of the function in Eq (5). The validity of this approach relies on the asymptotic normality and consistency of the parameter estimates, alongside the differentiability of the function of these parameters in Eq (5).
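A minimal Python sketch of the Delta method step follows. It assumes the selection ratio is 1 + exp(*γ* + *βn*), treats the observed positivity rate as fixed, and uses a numerical gradient. The parameter values and covariance matrix are hypothetical placeholders rather than estimates from the paper.

```python
import math

def log_infection_rate(theta, positivity, n, n_full=1.0):
    """Log predicted infection rate: log positivity plus the selection adjustment."""
    gamma, beta = theta

    def f(m):
        return 1.0 + math.exp(gamma + beta * m)   # assumed selection ratio

    return math.log(positivity) + math.log(f(n_full)) - math.log(f(n))

def numerical_gradient(func, theta, step=1e-6):
    """Central-difference gradient of func with respect to each parameter."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += step
        down[j] -= step
        grad.append((func(up) - func(down)) / (2 * step))
    return grad

def delta_method_se(func, theta, cov):
    """Standard error of func(theta) from a first-order Taylor approximation."""
    g = numerical_gradient(func, theta)
    k = len(g)
    variance = sum(g[i] * cov[i][j] * g[j] for i in range(k) for j in range(k))
    return math.sqrt(variance)

theta_hat = [3.0, -50.0]                 # hypothetical (gamma, beta)
cov_hat = [[0.04, 0.0], [0.0, 25.0]]     # hypothetical covariance matrix

def target(theta):
    return log_infection_rate(theta, positivity=0.20, n=0.005)

log_rate = target(theta_hat)
se = delta_method_se(target, theta_hat, cov_hat)
lower = math.exp(log_rate - 1.96 * se)
upper = math.exp(log_rate + 1.96 * se)
print(f"estimate: {math.exp(log_rate):.4f}, 95% CI: [{lower:.4f}, {upper:.4f}]")
```

Building the interval on the log scale and exponentiating keeps the bounds positive. An interval built directly on the rate would require multiplying the standard error by the estimated rate.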

## Results

### Population COVID-19 infection rates by state

S1 Table reports the coefficients from Eq (6) estimated across states for the period March 31 to April 7. Model 1 reports the baseline estimates. Model 2 includes additional controls for state fixed effects. Model 3 excludes observations for which the state positivity rate was greater than 0.5. The coefficient estimates are broadly similar across the three specifications.

Fig 1 depicts the relationship between daily changes in the positivity rate and per capita testing, based on the relationship implied by Eq (6). The linear empirical relationship indicates that the functional form of the model fits the data well. Because $\hat{\beta}$ is *negative*, the upward sloping pattern implies a negative relationship between daily changes in testing and the share of positive tests. A symptom of selection bias is that variables that have no structural relationship with the dependent variable may appear to be significant [10]. Thus, these patterns strongly suggest non-random testing, since daily changes in testing should be unrelated to population disease prevalence except through a selection channel.

*Notes*: This figure reports the relationship between daily changes in the exponential of per capita testing and daily changes in the log positivity rate, using the coefficient of *β* derived from the main estimates of Eq (6).

Table 1 reports the results that adjust observed COVID-19 positivity rates for non-random testing based on the procedure described in the previous section. For reference, column (1) reports the observed positivity rate on April 7, 2020. Columns (2) and (3) report the adjusted rates for April 7 along with the 95 percent confidence interval. Estimated population infection rates ranged from 0.3 percent in Wyoming to 7.6 percent in New Jersey. To put these estimates in perspective, in New York state, which had conducted the most extensive testing in the nation, 0.7 percent of the population had tested positive for COVID-19 by April 7. Our estimates imply that 34 states had population infection rates that were higher than the reported per capita cases in New York.

Table 1, col. (4) reports the average estimated population prevalence for the period March 31 to April 7. These averages mitigate sampling error in the daily prevalence estimates, which depend on the observed share of positive tests on any particular day. The average estimates are similar to the April 7 estimates, albeit generally smaller in magnitude, suggesting continued spread of the disease in many states.

Table 2 reports the results from several robustness exercises. First, we estimated modified versions of Eq (6) that include state fixed effects according to the following specification:

$$\Delta \ln\left(\frac{s_{i,t}}{n_{i,t}}\right) = \lambda_{s} + \sum_{k=1}^{3} \frac{(-1)^{k+1}}{k}\, e^{k\gamma}\left(e^{k\beta n_{i,t}} - e^{k\beta n_{i,t-1}}\right) + \nu_{i,t} \tag{7}$$

where the term λ_{s} denotes a vector of state fixed effects. These models allow for an exponential trend in infection rates, thereby addressing concerns that underlying disease prevalence may evolve from one day to the next. In these models, each state is allowed to have its own intercept to capture the fact that the trends may differ depending on local conditions.

The results from these models (reported in cols. 2 and 7) are virtually identical to the baseline estimates. Moreover, the augmented model tends to produce more precise confidence intervals.

We also explored the sensitivity of the results to excluding observations with particularly high positivity rates. This specification addresses concerns that the functional form approximation made in Eq (3) may not hold when the ratio of negative to positive tests, $(n - s)/s$, is small. We restricted the sample to observations with a positivity rate below 0.5 and re-estimated Eq (6). Table 2, cols. 5, 6, and 9 report the results. Although the sample size is reduced, the predicted infection rates are similar in magnitude to the baseline estimates and have similar confidence intervals.

### Comparison of BGL estimated population infection rates to alternative measures of COVID-19 prevalence

To assess the validity of the methodology, we compared our state-level estimates of population infection rates (BGL methodology) to alternative measures of COVID-19 severity during the first wave of the pandemic. These alternative measures include 1) SARS-CoV-2 antibody prevalence from around the same time period; 2) estimated state-level COVID-19 infections based on an alternative methodology; and 3) excess mortality rates across states during the first wave.

The first set of comparisons is based on SARS-CoV-2 prevalence from population-based serological testing that was conducted in a number of jurisdictions through the middle and end of the spring wave (see S2 Table). Given the rapid upsurge in COVID-19 cases in late March, our estimates of current population prevalence in early April should be comparable to seroprevalence rates later in the month. These seroprevalence estimates thus provide a way to externally validate our estimated population infection rates. To expand the set of comparison localities, we also report our estimates of population infection rates for Ontario and Quebec based on testing data from [30], and compare these estimates to province-wide seroprevalence rates based on the same methodology applied to Canadian data [31].

Fig 2 reports the estimates of population infection rates based on our methodology along with various estimates of the prevalence of SARS-CoV-2 antibodies across a number of geographical sites. There is broad similarity between the two prevalence estimates, and both approaches show evidence of widespread undetected infection during the first wave. The median difference in estimated prevalence is 23 percent, and the correlation between the sets of estimates is 0.88. The largest discrepancy between the two measures is in Minnesota, which experienced a sharp increase in COVID-19 cases between the time of our sample period (April 7) and the dates of specimen collection (April 30 to May 12).

*Notes*: This figure reports the estimates and 95% confidence intervals for population infection rates across states on April 7 based on BGL methodology, and estimates for SARS-CoV-2 seroprevalence from various sources (see S2 Table for details).

Next, we compared our results to estimated COVID-19 infections across U.S. states based on an alternative methodology: the Retrospective Methodology to Estimate Daily Infections from Deaths (REMEDID) [25]. The REMEDID approach reconstructs the time series of COVID-19 infections across U.S. states by combining information on the timing of state COVID-19 deaths with seroprevalence estimates taken later in the summer of 2020. Notably, this approach is based on an entirely different methodology and data sources than those from the BGL methodology. We compared the estimated population infection rates from the BGL methodology to the total number of COVID-19 infections across states through April 7 based on the REMEDID approach.

Fig 3(a) reports box plots of the distribution of estimated population COVID-19 infection rates across states based on the two different methodologies. Both approaches yield similar estimates of total population infections, albeit slightly higher based on the BGL methodology. Median infection rates are 1.0 based on the BGL methodology and 0.75 based on the REMEDID approach. The distributions of COVID-19 infections are also similar, with 25th to 75th percentile ranges of 0.6–1.5 and 0.4–1.2 for the two methodologies, respectively.

*Notes*: (a) This figure presents a box-plot of the distribution of estimated population COVID-19 infections across U.S. states based on the BGL and REMEDID methodologies [25]. Estimates for the REMEDID methodology are based on total estimated COVID-19 infections through April 7, 2020. (b) This figure reports the estimated population COVID-19 infections based on the BGL and REMEDID methodologies across states, along with the 95% confidence interval.

Fig 3(b) reports the estimated infection rates across states based on the BGL and REMEDID approaches. There is a close link between the two approaches. Indeed, the cross-state correlation in infection rates between the two estimation approaches is 0.75.

Finally, we compared the BGL estimated infection rates to estimates of excess all-cause mortality across states during the first wave (April 1 through June 30, 2020) [32]. Fig 4 reports the scatterplot of these two measures, along with the best-fit line. There is a strong positive relationship between our estimates of COVID-19 infection rates and excess mortality. The correlation between the two measures is 0.88. Notably, every state that experienced excess mortality above 1.5 percent fell within the top quartile of estimated infection rates based on the BGL methodology. In contrast, all but two states with excess mortality below 1.5 percent fell in the bottom three quartiles of BGL estimated infections.

*Notes*: This figure reports the relationship between estimated population COVID-19 infection rates on April 7, 2020 (based on the BGL methodology) and excess all-cause mortality from April 1 to June 30, 2020 [32]. The figure reports the bivariate scatterplot along with the best fit line.

Together, these findings demonstrate a broad alignment between estimated population infection rates based on the BGL methodology and measures of overall COVID-19 prevalence across states during the first wave based on alternative approaches.

### Population COVID-19 infection rates and state testing

Table 3 reports the relationship between the number of diagnosed cases and total population COVID-19 infections implied by our estimation procedure. We compared the average population infection rates from March 31 to April 7 to the total number of diagnosed cases by April 12. Because many individuals may not seek testing until the onset of symptoms, the latter date was chosen to correspond with the virus’s median incubation period [33, 34], although the delay between infection and symptom onset varies across individuals according to a lognormal distribution [35]. Column (1) reports the total diagnosed cases by April 12; column (2) reports the total number of COVID-19 cases implied by the estimates reported in Table 1 (col. 4); and column (3) presents the ratio of total cases to diagnosed cases.

The results reveal widespread undetected population infection. For every identified case nationwide, there were an estimated 12 total infections in the population. There were significant cross-state differences in these ratios. In New York, where more than two percent of the population had been tested, the ratio of total cases to positive diagnoses was 8.7, the lowest in the nation. Meanwhile, Oklahoma had the highest ratio in the country (19.4), and tested less than 0.6 percent of its population.

Fig 5(a) presents a bivariate scatter plot between the ratio of total COVID-19 cases per diagnosis and cumulative per capita testing by April 12. The negative relationship (corr = -0.51) indicates that relative differences in state testing do not simply reflect a response to geographic differences in pandemic severity. Instead, the patterns suggest that states that expanded testing capacity more broadly were better able to track population infections.

*Notes*: (a) This figure presents the bivariate relationship between per capita testing and the ratio of total COVID-19 cases per diagnosis. Tests per 1,000 population are based on the cumulative number of tests by April 12. The ratio is the total number of COVID-19 cases, derived from the average estimated population prevalence from March 31 to April 7, divided by the cumulative number of positive tests by April 12. (b) This figure presents the bivariate relationship between log positive tests per capita and log total COVID-19 cases per capita. Positive tests per 1,000 population are based on the cumulative number of positive tests by April 12. The total number of COVID-19 cases is derived from the average estimated population prevalence from March 31 to April 7.

Fig 5(b) documents a strong positive relationship between per capita COVID-19 diagnoses and the estimated population infection rate. Nevertheless, observed case counts do not perfectly predict overall population infections. For example, despite similar rates of reported COVID-19 cases, we find that Michigan had roughly twice as many per capita infections as Rhode Island. These differences can partly be explained by the fact that nearly two percent of the population in Rhode Island had been tested by April 12, whereas fewer than one percent had been tested in Michigan. Together, these findings suggest that differences in state-level policies towards COVID-19 testing may mask important differences in underlying disease prevalence.

## Conclusion

### Discussion and implications

This paper presents a new methodology to estimate population infection rates from non-random testing data. We applied this methodology to daily testing data for COVID-19 to estimate population infection rates across U.S. states during the first wave of the pandemic. We found widespread undocumented population infection. Our estimated infection rates are similar to findings from seroprevalence studies, retrospective estimates of total cases based on COVID-19 deaths, and all-cause excess mortality during the first wave. We also found that undetected infections were particularly high in states with lower testing capacity.

Pandemics pose an ongoing threat to public health. To effectively respond to these crises, policymakers need to have access to timely and accurate information on total population infections. Nevertheless, in the early phases of an infectious disease outbreak there is often considerable uncertainty regarding the extent of community transmission, as well as geographic spread of the disease.

In this paper, we have developed an approach that can be used to estimate real-time population infection rates from non-random test data. The estimation procedure is straightforward, has few data requirements, and can be used to estimate disease prevalence at various jurisdictional levels.

This methodology provides a useful tool for policymakers to track the scope of population infections during an emerging outbreak. Had the approach been applied to earlier testing data in March 2020, it would have revealed widespread undocumented community transmission that was only later confirmed by analyses of COVID-19 mortality and seroprevalence surveys. This information may have led policymakers to enact earlier and more aggressive public health interventions. Indeed, in an address to the House of Commons on June 10, 2020, Imperial College London epidemiologist Neil Ferguson stated that in March experts had “underestimated how far this country was into the epidemic,” and that “had we introduced lockdowns a week earlier we’d have reduced the final death toll by at least half. The measures, given what we knew about the virus then, were warranted. Certainly had we introduced them earlier we’d have seen many fewer deaths” [36].

The methodology could also be useful in guiding policymakers in how to allocate scarce health resources across jurisdictions. During the first wave of COVID-19, governments faced challenges in addressing shortages of health workers, personal protective equipment, and other health infrastructure [37]. At the same time, there was considerable uncertainty about the needs for these resources across localities. For example, by late February, community transmission was documented in Washington state, New York City, and Santa Clara County in California [38–40]. Nevertheless, of these three states in which COVID-19 was detected early, only New York experienced the dramatic surge in excess mortality during the first wave (Fig 4). Even within regions, idiosyncratic factors led to widespread variability in the severity of the first wave across jurisdictions. For example, the timing of the Mardi Gras festival in late February led to a surge of COVID-19 cases in Louisiana that was far higher than those in other southern states [41]. By providing timely cross-jurisdiction information on population infection rates, our methodology could have enabled federal policymakers to better allocate scarce medical resources during the early onset of the pandemic.

### Policy implementation

To apply the BGL methodology, policymakers should adopt the following procedure:

- Assemble high-frequency data on the total number of tests and the positivity rate across various geographic units.
- Use these data to estimate Eq (7) by non-linear least squares.
- Assess the functional form of the selection process by plotting the changes in the daily positivity rate against the daily change in exponential in per capita testing (as in Fig 1). If the specified function does not fit the data, consider alternative functional forms (linear, log linear) for the selection process, *f*(*n*,*θ*), and repeat step 2.
- Combine the estimates from step 2 in Eq (5) to construct predicted values for population infection rates across each geographic unit and date.
- Apply the Delta Method to Eq (5) to derive standard errors for the predicted population infection rates.
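The last two steps of the procedure can be sketched in code. The example below is a minimal illustration, not the authors' implementation: it assumes that step 2 has already produced a parameter vector and its covariance matrix, and it approximates the Delta Method standard error with a numerical gradient. The function `predicted_rate` is a hypothetical placeholder standing in for Eq (5), whose actual form is given earlier in the paper; the parameter values and covariance matrix are likewise made up for illustration.

```python
import math
from typing import Callable, List, Sequence


def numerical_gradient(g: Callable[[Sequence[float]], float],
                       theta: Sequence[float],
                       eps: float = 1e-6) -> List[float]:
    """Central-difference gradient of a scalar function g at theta."""
    grad = []
    for j in range(len(theta)):
        step = eps * max(1.0, abs(theta[j]))
        up = list(theta)
        down = list(theta)
        up[j] += step
        down[j] -= step
        grad.append((g(up) - g(down)) / (2.0 * step))
    return grad


def delta_method_se(g: Callable[[Sequence[float]], float],
                    theta_hat: Sequence[float],
                    cov: Sequence[Sequence[float]]) -> float:
    """Delta Method: SE of g(theta_hat) is sqrt(grad' * cov * grad)."""
    grad = numerical_gradient(g, theta_hat)
    k = len(theta_hat)
    variance = sum(grad[i] * cov[i][j] * grad[j]
                   for i in range(k) for j in range(k))
    return math.sqrt(variance)


def predicted_rate(theta: Sequence[float],
                   positivity: float = 0.20,
                   tests_per_capita: float = 0.01) -> float:
    """HYPOTHETICAL placeholder for Eq (5); replace with the paper's formula."""
    a, b = theta
    return positivity * b * math.exp(-a * (1.0 - tests_per_capita))


# Made-up estimates and covariance matrix, standing in for step 2 output.
theta_hat = [2.0, 1.5]
cov_hat = [[0.04, 0.00],
           [0.00, 0.01]]

estimate = predicted_rate(theta_hat)
se = delta_method_se(predicted_rate, theta_hat, cov_hat)
print(f"predicted rate: {estimate:.4f}, "
      f"95% CI: [{estimate - 1.96 * se:.4f}, {estimate + 1.96 * se:.4f}]")
```

In practice the same calculation would be repeated for each geographic unit and date, passing that observation's positivity rate and per capita testing to the prediction function. An analytic gradient of Eq (5) could replace the numerical one where it is easy to derive.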

The results from this procedure can provide policymakers with estimates of disease prevalence across different geographic units during an emerging outbreak. For established infectious diseases, this approach can also be used to track the evolution of population disease prevalence over an extended time horizon.

### Limitations and future directions

Our study has four main limitations that should be taken into account when interpreting the findings or applying the BGL methodology to track disease prevalence.

- Errors in diagnostic testing results will not affect the accuracy of the predicted population infection rates but may reduce their precision.
- Differing testing protocols across jurisdictions may reduce the accuracy of the predicted population infection rates.
- Extended periods of latency between initial infection and symptom onset limit the use of the methodology to track real-time infection.
- The accuracy of the predicted population infection rates depends on correctly specifying the functional relationship between the positive test rate and the size of the tested sample, over which there may be considerable uncertainty.

In what follows, we discuss each of these limitations in greater detail.

First, our analysis depends on the quality of diagnostic testing [42–44]. Because our analysis focused on day-to-day variation, however, systematic false negative or false positive test results *will not* affect the estimates of population disease prevalence. This is because these errors are eliminated in the first difference Eq (6), provided that the rates of systematic testing errors are similar from one day to the next. Instead, testing errors may reduce precision through classical measurement error [45], widening the confidence intervals and leading to greater uncertainty about the true population infection rate.
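A stylized illustration of this argument follows. It is not the paper's Eq (6), and all symbols are assumptions introduced here: suppose the observed positivity rate in jurisdiction *i* on day *t* equals the true rate plus a day-invariant systematic error and an idiosyncratic error.

```latex
\tilde{p}_{i,t} = p_{i,t} + c_i + e_{i,t}
\quad\Longrightarrow\quad
\tilde{p}_{i,t} - \tilde{p}_{i,t-1}
  = \left(p_{i,t} - p_{i,t-1}\right) + \left(e_{i,t} - e_{i,t-1}\right)
```

Under this additive assumption, the systematic term cancels in the first difference, so it does not bias the estimates. The idiosyncratic term remains and adds noise, which is why precision may fall.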

Second, our approach requires an assumption that the underlying sample selection process was similar across observations. In practice, this assumption requires that decisions regarding how to prioritize tests were similar across jurisdictions. In our context this assumption is reasonable, given the short time span of the analysis and the fact that U.S. states faced a common set of guidelines for testing prioritization laid out by the CDC [46]. However, some caution should be taken when applying this approach to other contexts, such as analyses across countries with widely differing testing policies or extended time-series studies across multiple testing regimes, where observational units are highly dissimilar in their test allocation decisions.

Third, the ability of the methodology to track real-time infections is limited by the delay between infection and diagnosis, given that pre-symptomatic individuals are unlikely to seek testing. In the context of COVID-19, the median delay between initial infection and symptom onset has been estimated to be roughly 5 days [25, 47]. Despite this lag, the BGL methodology still provides timelier information on population infections than alternative methodologies based on COVID-19 deaths, which can only provide information on infections with an additional 2 to 8 week lag, given the extended delay between reported COVID-19 cases and mortality [26, 47, 48].

Finally, the estimates of population infection rates depend on a correctly specified functional relationship between the positive test rate and the size of the tested sample. In our empirical implementation, we specify a flexible functional relationship that fits the data well. Nevertheless, an important assumption underlying our analysis is that the relationship observed in the tested sample would continue to hold if testing were expanded to the broader population. Future research might explore how to relax these functional form assumptions through either semi- or non-parametric approaches.

## Supporting information

### S2 Table. Date, location, and source for seroprevalence estimates.

https://doi.org/10.1371/journal.pone.0311001.s002

(PDF)

### S1 Data. Supplementary data.

This zip file contains the underlying and analyzed datasets used to construct the tables and figures in the manuscript.

https://doi.org/10.1371/journal.pone.0311001.s003

(ZIP)

## References

- 1. Pei S, Kandula S, Shaman J. Differential Effects of Intervention Timing on COVID-19 Spread in the United States. Science Advances. 2020;6(49):eabd6370. pmid:33158911
- 2. Markel H, Lipman H, Navarro J, Sloan A, Michalsen J, Stern A, et al. Nonpharmaceutical Interventions Implemented by US Cities During the 1918-1919 Influenza Pandemic. JAMA. 2007;298(6):644–654. pmid:17684187
- 3. Bootsma M, Ferguson N. The Effect of Public Health Measures on the 1918 Influenza Pandemic in U.S. Cities. Proceedings of the National Academy of Sciences. 2007;104(18):7588–7593. pmid:17416677
- 4. Dong Y, Mo X, Hu Y, Qi X, Jiang F, et al. Epidemiological Characteristics of 2143 Pediatric Patients with 2019 Coronavirus Disease in China. Pediatrics. 2020.
- 5. Pan X, Chen D, Xia Y, Wu X, Li T, et al. Asymptomatic Cases in a Family Cluster with SARS-CoV-2 Infection. The Lancet Infectious Diseases. 2020;20(4):410–411.
- 6. Bai Y, Yao L, Wei T, Tian F, Jin D, et al. Presumed Asymptomatic Carrier Transmission of COVID-19. JAMA. 2020. pmid:32083643
- 7. Meyer R, Kissane E, Madrigal A. The COVID Tracking Project. https://covidtracking.com/; 2020.
- 8. Korea CDC. Korea Centers for Disease Control and Prevention. https://www.cdc.go.kr/board/board.es?mid=&bid=0030; 2020.
- 9. Heckman J. The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models. Annals of Economic and Social Measurement. 1976;5(4):475–492.
- 10. Heckman J. Sample Selection Bias as a Specification Error. Econometrica. 1979;47(1):153–162.
- 11. Heckman J, Lalonde R, Smith J. The Economics and Econometrics of Active Labor Market Programs. In: Ashenfelter O, Card D, editors. Handbook of Labor Economics. Amsterdam: North-Holland; 1999. p. 1866–2097.
- 12. Blundell R, Costa Dias M. Evaluation Methods for Non-experimental Data. Fiscal Studies. 2000;21(4):427–468.
- 13. Das M, Newey W, Vella F. Nonparametric Estimation of Sample Selection Models. The Review of Economic Studies. 2003;70(1):33–58.
- 14. Newey W. Two-Step Series Estimation of Sample Selection Models. Econometrics Journal. 2009;12(S1):S217–S229.
- 15. Billah MA, Miah MM, Khan MN. Reproductive number of coronavirus: A systematic review and meta-analysis based on global level evidence. PLOS ONE. 2020;15(11):e0242128. pmid:33175914
- 16. Sorci G, Faivre B, Morand S. Explaining among-country variation in COVID-19 case fatality rate. Scientific Reports. 2020;10:18909. pmid:33144595
- 17. Alene M, Yismaw L, Assemie MA, Ketema DB, Mengist B, Kassie B, et al. Magnitude of asymptomatic COVID-19 cases through the course of infection: A systematic review and meta-analysis. PLOS ONE. 2021;16(3):e0249090. pmid:33755688
- 18. Ferguson N, Laydon D, Nedjati-Gilani G, Imai N, Ainslie K, Baguelin M, et al. Impacts of Non-pharmaceutical Interventions to Reduce COVID-19 Mortality and Healthcare Demand. London: Imperial College COVID-19 Response Team; 2020.
- 19. Perkins A, Cavany S, Moore S, Oidtman R, Lerch A, Poterek M. Estimating Unobserved SARS-CoV-2 Infections in the United States. Proceedings of the National Academy of Sciences. 2020;117(36):22597–22602. pmid:32826332
- 20. Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. Substantial Undocumented Infection Facilitates the Rapid Dissemination of Novel Coronavirus (SARS-CoV-2). Science. 2020.
- 21. Riou J, Hauser A, Counotte M, Althaus C. Adjusting Age-Specific Case Fatality Rates during the COVID-19 Epidemic in Hubei, China, January and February. medRxiv Working Paper; 2020.
- 22. Johndrow J, Lum K, Ball P. Estimating SARS-CoV-2 Positive Americans using Deaths-only Data. Working Paper; 2020.
- 23. Javan E, Fox S, Meyers L. Probability of Current COVID-19 Outbreaks in All US Counties. Working Paper; 2020.
- 24. Verity R, Okell L, Dorigatti I, Winskill P, Whittaker C, et al. Estimates of the Severity of Coronavirus Disease 2019: A Model-based Analysis. The Lancet Infectious Diseases. 2020. pmid:32240634
- 25. García-García D, Morales E, de la Fuente-Nunez C, Vigo I, Fonfría ES, Bordehore C. Identification of the first COVID-19 infections in the US using a retrospective analysis (REMEDID). Spatial and Spatio-temporal Epidemiology. 2022;42:100517. pmid:35934325
- 26. WHO. Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). https://www.who.int/docs/default-source/coronaviruse/who-china-joint-mission-on-covid-19-final-report.pdf; 2020.
- 27. Manski C, Molinari F. Estimating the COVID-19 Infection Rate: Anatomy of an Inference Problem. Journal of Econometrics. 2020;Forthcoming. pmid:32377030
- 28. Chatfield C. Model Uncertainty, Data Mining, and Statistical Inference. Journal of the Royal Statistical Society. 1995;3:419–444.
- 29. U.S. Census Bureau, Population Division. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2019. Washington, DC; 2019.
- 30. Berry I, Soucy J, Tuite A, Fisman D. Open Access Epidemiological Data and an Interactive Dashboard to Monitor the COVID-19 Outbreak in Canada. CMAJ. 2020. pmid:32392510
- 31. Benatia D, Godefroy R, Lewis J. Estimates of COVID-19 Cases across Four Canadian Provinces. Canadian Public Policy. 2020;46(S3):S203–S216. pmid:38630004
- 32. Foster T, Fernandez L, Porter S, Pharris-Ciurej N. Age, Sex, and Racial/Ethnic Disparities and Temporal-Spatial Variation in Excess All-Cause Mortality During the COVID-19 Pandemic: Evidence from Linked Administrative and Census Bureau Data. U.S. Census Bureau Discussion Paper; 2022.
- 33. Li Q, Guan X, Wu P, Wang X, Zhou L, Tong Y, et al. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia. New England Journal of Medicine. 2020. pmid:31995857
- 34. Lauer S, Grantz K, Bi Q, Jones F, Zheng Q, Meredith H, et al. The Incubation Period of Coronavirus Disease 2019 (COVID-19) from Publicly Reported Confirmed Cases: Estimation and Application. Annals of Internal Medicine. 2020. pmid:32150748
- 35. García-García D, Vigo M, Fonfría E, Herrador Z, Navarro M, Bordehore C. Retrospective Methodology to Estimate Daily Infections from Deaths (REMEDID) in COVID-19: the Spain Case Study. Scientific Reports. 2021;11:11274. pmid:34050198
- 36. Buchan L. Coronavirus: Lockdown one week earlier could have halved death toll, says Neil Ferguson. The Independent. https://www.independent.co.uk/news/uk/politics/uk-lockdown-coronavirus-death-toll-neil-ferguson-a9559051.html; 2020.
- 37. Unruh L, Allin S, Marchildon G, Burke S, Barry S, Siersbaek R, et al. A Comparison of 2020 Health Policy Response to the COVID-19 Pandemic in Canada, Ireland, the United Kingdom and the United States. Health Policy. 2021;126:427–437. pmid:34497031
- 38. Worobey M, Pekar J, Larsen B, Nelson M, Hill V, Joy J, et al. The Emergence of SARS-CoV-2 in Europe and North America. Science. 2020;370:564–570. pmid:32912998
- 39. Maurano M, Ramaswami S, Zappile P, Dimartino D, Boytard L, Ribeiro-dos-Santos A, et al. Sequencing Identifies Multiple Early Introductions of SARS-CoV-2 to the New York Region. Genome Research. 2020;30(12):1781–1788.
- 40. Deng X, Gu W, Federman S, du Plessis L, Pybus O, Faria N, et al. Genomic Surveillance Reveals Multiple Introductions of SARS-CoV-2 into Northern California. Science. 2020;369:582–587. pmid:32513865
- 41. Zeller M, Gangavarapu K, Anderson C, Smither A, Vanchiere J, Rose R, et al. Emergence of an Early SARS-CoV-2 Epidemic in the United States. medRxiv; 2021.
- 42. Liu J, Xie X, Zhong Z, Zhao W, Zheng C, Wang F. Chest CT for Typical 2019-nCoV Pneumonia: Relationship to Negative RT-PCR Testing. Radiology. 2020. pmid:32049601
- 43. Ai T, Yang Z, Hou H, Zhan C, Chen C, Lv W, et al. Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report on 1014 Cases. Radiology. 2020.
- 44. Yang Y, Yang M, Shen C, Wang F, Yuan J, Li J, et al. Evaluating the Accuracy of Different Respiratory Specimens in the Laboratory Diagnosis and Monitoring the Viral Shedding of 2019-nCoV Infections. medRxiv Working Paper; 2020.
- 45. Wooldridge J. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press; 2002.
- 46. CDC. Centers for Disease Control and Prevention: Coronavirus (COVID-19). https://www.cdc.gov/coronavirus/2019-ncov/index.html; 2020.
- 47. Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, et al. Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Data. Journal of Clinical Medicine. 2020;9(2):538. pmid:32079150
- 48. Testa C, Krieger N, Chen J, Hanage W. Visualizing the Lagged Connection Between COVID-19 Cases and Deaths in the United States: An Animation Using Per Capita State-Level Data (January 22, 2020–July 8, 2020). HCPDS Working Paper; 2020. 4.