Modeling post-holiday surge in COVID-19 cases in Pennsylvania counties

  • Benny Ren ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    bennyren@pennmedicine.upenn.edu

    Affiliation Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

  • Wei-Ting Hwang

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America

Abstract

COVID-19 arrived in the United States in early 2020, with cases quickly being reported in many states including Pennsylvania. Many statistical models have been proposed to understand the trends of the COVID-19 pandemic and factors associated with increasing cases. While Poisson regression is a natural choice to model case counts, this approach fails to account for correlation due to spatial locations. Because COVID-19 is contagious and often spreads through community infections, case counts are inevitably spatially correlated: counties neighboring a county with a high COVID-19 case count are more likely to have a high case count themselves. In this analysis, we combine generalized estimating equations (GEEs) for Poisson regression, a popular method for analyzing correlated data, with a semivariogram to model daily COVID-19 case counts in 67 Pennsylvania counties between March 20, 2020 and January 23, 2021 in order to study infection dynamics during the beginning of the pandemic. We use a semivariogram that describes the spatial correlation as a function of the distance between two counties as the working correlation. We further incorporate a zero-inflated model in our spatial GEE to accommodate excess zeros in reported cases due to logistical challenges associated with disease monitoring. By modeling time-varying holiday covariates, we estimated the effect of holiday timing on case count. Our analysis showed that the incidence rate ratio was significantly greater than one 6-8 days after a holiday, suggesting a surge in COVID-19 cases approximately one week after a holiday.

Introduction

COVID-19, a highly contagious respiratory disease, first appeared in China at the end of 2019 and quickly spread across the world [1]. Evidence suggests mask-wearing and social distancing are effective strategies in containing COVID-19 [2, 3]. During the beginning of the pandemic, local and state governments quickly moved to implement mask mandates, travel restrictions and community containment measures (e.g., shelter in place) to mitigate the spread of the disease [4–6]. However, many Americans still chose to travel and congregate during the pandemic, behavior that is heightened during a federal holiday. Due to lack of adherence to public health guidance during the holidays, one should expect to see a surge in COVID-19 cases after a holiday. While many reports reaffirm this hypothesis, they are based on anecdotal evidence such as summary statistics of case counts from moving time windows. There are only a handful of epidemiological studies that estimate the association between holiday timing and the number of reported COVID-19 cases [7, 8]. For the first year of the pandemic, we hypothesize that we should see a surge in COVID-19 cases within two weeks after a holiday, given that the incubation period for COVID-19 extends up to 14 days, with a median time of 4-5 days from exposure to symptom onset, plus up to an additional 3 days for a positive test, either a PCR test or a rapid antigen test, to be reported [1, 9–11]. We consider daily case counts between March 20, 2020 and January 23, 2021 to study early pandemic dynamics prior to widespread vaccine distribution, COVID-19 variants and at-home testing. In addition, rigorous disease surveillance and reporting procedures were in place during the beginning of the pandemic, resulting in comprehensive infection data.

Poisson count regression with the population size as an offset is a popular approach to model count data and incidence rates. Based on the reporting guidelines of the COVID-19 datasets, there are certain dates such as holidays and weekends that could impact whether cases are being reported [12]. These reporting practices have resulted in excess or structural zeros in the case count, also known as zero-inflation [13]. We also need to consider spatial correlation among county-level COVID-19 case counts because vector-borne and transmissible diseases such as COVID-19 exhibit non-negligible spatial correlation, as the movement of people can spread the virus to nearby counties [14]. Furthermore, processes that are confounded by spatially correlated variables are also suited for spatial models; a well-studied example is disease and pollution [15–17]. Thus, we expect to see similar case numbers or trends among neighboring counties [18].

Mixed models are powerful tools for spatial modeling due to their ability to handle a complex spatial correlation structure, usually represented as a semivariogram or kriging process [19, 20]. Mixed modeling problems have been addressed from a convex optimization and Bayesian computation perspective [14, 21]. Correlation in spatial epidemiology can also be captured using conditional autoregressive models [22–24]. Inferential summaries from these models assume a correctly specified spatial correlation; otherwise, a post-estimation robust covariance can be derived to address misspecified correlation. One such class of robust estimators is the heteroskedasticity-consistent or sandwich estimators, which can easily be derived from marginal models and generalized estimating equations (GEEs). Marginal models are flexible alternatives to mixed models when population-level effects are of interest [25, 26]. In addition, formulation of GEEs through the quasi-likelihood provides a convenient set of score equations with well-established optimization procedures. As a result, a GEE formulation can be derived to incorporate spatial-temporal relationships, as well as zero-inflation. For their generalizability and robust inference under misspecified spatial correlation, GEEs are an appealing but underutilized tool in spatial epidemiology.

We propose the use of a mixture of marginal models for zero-inflated over-dispersed Poisson regression to model the daily number of new cases in 67 Pennsylvania counties [27]. We combine the zero-inflated Poisson regression with the framework of a transition model to estimate present case counts as a function of previously reported cases [24, 25]. We account for spatial correlation by treating semivariograms as the working correlation under a generalized estimating equation framework and designate each date as a ‘cluster’ under the traditional longitudinal nomenclature [26, 28–30]. We propose an Expectation-Solution (ES) algorithm to fit the mixture of marginal models, which conveniently reduces to simpler problems of weighted GEEs and weighted semivariograms [31–35]. We include a combination of past case counts and time-lagged covariates, as well as county-specific and date-specific covariates, to model daily rates of new COVID-19 cases following the parameterization of semivariogram working correlations discussed in [30]. While there are other methods to model spatial-temporal data, we elect to use a GEE due to consistent inference under a misspecified working correlation, simplicity of formulation, and estimation under an imbalanced cluster study design [14, 36, 37]. Imbalanced clusters occur in our data because counties report their first case on different dates (staggered entry into the study), with populous counties reporting cases before rural counties.

The rest of this manuscript is organized as follows. In the Methods section, we outline the zero-inflated Poisson model and the proposed semivariogram model; we detail the zero-inflated GEE for the Poisson counts with excess zeros and describe the estimating procedures for the semivariogram through the ES algorithm and outline the robust sandwich standard error. In the Results section, we analyze the daily COVID-19 cases from 67 Pennsylvania counties.

Methods

Zero-inflated Poisson model

Let Yi,t denote the count of COVID-19 cases at county i and date t. Often, a case count of zero is due to logistical issues in reporting; such a count is known as an excess zero and is unrelated to the Poisson model. A true zero is unrealistic in many situations. For excess zeros, we define the zero-inflation process with Poisson count data as

$$P(Y_{i,t} = y) = \begin{cases} p_{i,t} + (1 - p_{i,t})\, e^{-\mu_{i,t}}, & y = 0, \\ (1 - p_{i,t})\, \dfrac{\mu_{i,t}^{y}\, e^{-\mu_{i,t}}}{y!}, & y = 1, 2, \ldots \end{cases}$$

where μi,t denotes the Poisson mean. That is, Yi,t follows a Poisson distribution with probability 1−pi,t; otherwise it is an excess zero with probability pi,t [13]. We denote the latent membership, the excess zero indicator, for county i at date t as Zi,t | Zi,t−1 ∼ Bernoulli(pi,t | Zi,t−1), following a transition model as a function of previous outcomes; Zi,t is unobservable and treated as missing data.
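
As an illustration of the zero-inflation process described above, the following minimal Python sketch evaluates the probability mass function of a zero-inflated Poisson variable. The function name `zip_pmf` and the values of `mu` and `p` are hypothetical and chosen only for illustration; they are not taken from the analysis.

```python
import math

def zip_pmf(y, mu, p):
    """P(Y = y) for a zero-inflated Poisson with excess-zero probability p."""
    poisson = math.exp(-mu) * mu ** y / math.factorial(y)
    if y == 0:
        # A zero is either an excess zero or a genuine Poisson zero.
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson

# Hypothetical values, for illustration only.
mu, p = 3.0, 0.2
total = sum(zip_pmf(y, mu, p) for y in range(60))
prob_zero = zip_pmf(0, mu, p)
```

Here `prob_zero` exceeds the plain Poisson probability of a zero, `exp(-mu)`, by the amount contributed by the excess-zero component, and `total` sums to one up to truncation error.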

To model case counts as a rate, we define the Poisson regression model as (1) and (1) can be rewritten as where mi is the county population offset term, representing the population at county i but fixed for all dates t, and we denote μi,t | Yi,t−1 as μi,t and pi,t | Zi,t−1 as pi,t for brevity. We include a transition term of a lagged case incidence Yi,t−1 to account for the temporal correlation of counts in the same county. Note that values with yi,t−1 = 0 must be corrected so that log(yi,t−1) is defined [24]. Our transition term addresses temporal trends and correlation, allowing us to treat the residuals from different dates as independent in the working correlation of the marginal model. Here, xi,t are county and date-specific covariates detailed in our real data analysis.

We can also construct a logistic regression to model the latent membership Zi,t as (2) We also incorporate a transition term, Zi,t−1, to account for temporal trends and ui,t are county and date-specific covariates specifically related to the zero-inflation process.

Theoretical semivariogram model

Standard zero-inflated Poisson models do not account for the clustered nature of the data, which is expressed here as spatial correlation among counties. Though mixed GLMs can be used to account for spatial correlation as Gaussian random effects, these frameworks involve difficult computation such as the Laplace approximation involving the covariance matrix. Spatial random effects are often modeled using semivariograms, which can be viewed as a kriging model in the residual space [14]. Alternatively, we propose a GEE procedure which iteratively updates a working correlation using residuals, allowing for a more straightforward estimation procedure.

Semivariograms model the correlation between two locations i and j as a decreasing function of distance. This phenomenon is commonly referred to as Tobler’s first law of geography, which states: “everything is related to everything else, but near things are more related than distant things.” In our study, distance is measured in meters and locations are the latitude and longitude coordinates of county centroids. In our GEE procedure, residuals are expected to be correlated based on distance and are used to calculate the semivariogram working correlation. At a given date, county residuals ri,t and rj,t, separated by distance di,j, are assumed to follow a theoretical semivariogram

$$\gamma(d_{i,j}, \tau) = \phi \left(1 - \exp(-d_{i,j}/\tau)\right)$$

where ϕ is an over-dispersion parameter, τ is a scale parameter, and the correlation does not depend on time. Thus, the working correlation between counties i and j at date t is given as

$$\mathrm{Corr}(r_{i,t}, r_{j,t}) = \exp(-d_{i,j}/\tau)$$

and is solely a function of distance di,j, with τ determining the rate at which correlation decays as distance increases. We make an isotropic covariance assumption, meaning the semivariogram is a function of the locations only through the Euclidean distance between them.
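
The distance-based working correlation can be sketched in a few lines of Python. This sketch assumes the exponential form exp(−d/τ), which is the form implied by the bound on τ discussed in the estimation section (where exp(−3) ≈ 0.05 at the maximum distance). The three-county distance matrix below is hypothetical; only the maximum distance of 592318.3 meters comes from the text.

```python
import math

def semivariogram(d, tau, phi=1.0):
    """Exponential semivariogram: rises from 0 toward the sill phi."""
    return phi * (1.0 - math.exp(-d / tau))

def working_correlation(dist, tau):
    """Correlation matrix whose (i, j) entry decays with distance."""
    n = len(dist)
    return [[math.exp(-dist[i][j] / tau) for j in range(n)] for i in range(n)]

# Hypothetical pairwise centroid distances in meters for three counties.
d_max = 592318.3
dist = [
    [0.0, 40000.0, d_max],
    [40000.0, 0.0, 560000.0],
    [d_max, 560000.0, 0.0],
]
tau = d_max / 3.0  # largest scale allowed by the bound on tau
R = working_correlation(dist, tau)
```

With τ at its upper bound, the entry for the two most distant counties equals exp(−3), roughly 0.05, while nearby counties retain a correlation close to one.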

While standard GEEs treat longitudinal data from individuals as independent clusters, our approach differs by treating dates as independent clusters, counties as our repeated measurements, and spatial correlation as the working correlation, as summarized in Table 1. After incorporating date-specific covariates and case counts from the previous day, we assume residuals from different dates to be uncorrelated in the working correlation. Fig 1 shows low autocorrelation of residuals, Corr(ri,t, ri,t−k), within each county from a naive standard Poisson generalized linear model described by Eq (1) using a lagged autoregressive predictor (yi,t−1) and study covariates while assuming no zero-inflation. Our implementation of zero-inflated modeling assumes some observations belong to a structural zero model, which further decreases autocorrelation in residuals, suggesting a single lag autoregressive predictor adequately addresses autocorrelation in the residuals. Furthermore, we expect the structural zeros to be related to holiday and weekend timings based on dataset guidelines, and case counts to be cumulatively reported after a holiday. As a result, we anticipate holiday and weekend timings to be the major driver of temporal case trajectories in our study. Next, we detail the estimation approach using GEEs and the estimation of the scale parameter τ based on procedures from the spatial statistics literature. In many practical settings there is an abundance of time-series data but a limited number of locations; our approach allows the asymptotic results to be driven by the number of dates in the analysis.

Fig 1. Box plot of the residual autocorrelations, Corr(ri,t, ri,t−k), for lag k (k = 1, …, 7) from 67 Pennsylvania counties.

The residuals ri,t were obtained from a Poisson generalized linear model with study covariates and a lagged autoregressive predictor (yi,t−1) as outlined in the COVID-19 data analysis section. Light blue lines connect the autocorrelation values of each county. The solid black line connects the autocorrelation values Corr(rt, rt−k) calculated with residuals from all counties combined.

https://doi.org/10.1371/journal.pone.0279371.g001

Table 1. Differences between standard GEE found in longitudinal studies and the proposed GEE.

https://doi.org/10.1371/journal.pone.0279371.t001

Estimation

GEE for over-dispersed Poisson counts.

The GEE for the Poisson model is given as (3) where with Xt = [x1,t, x2,t, x3,t, …] and β are the design matrix of covariates and vector of regression coefficients related to case counts. Here is the last count greater than zero for county i and yt = [y1,t, y2,t, y3,t, …] is a vector representing new COVID-19 cases with corresponding expected values μt = [μ1,t, μ2,t, μ3,t, …]. The scale parameter τP is specific to the Poisson model. We incorporate Poisson model membership probabilities, wi,t = 1−pi,t, through the matrix Wt = diag(w1,t, w2,t, w3,t, …), into the weighted GEE. When wi,t ≈ 0, the data point is effectively removed from the estimation of the Poisson regression parameters. The estimation of these weights is outlined in the Expectation-Solution algorithm section. We modify (1) to carry forward the highly correlated last non-zero count if Yi,t−1 = 0. Typically, when Yi,t−1 = 0, an additional parameter may be introduced in order to ensure that log(yi,t−1) is defined [24].

In accordance with GEEs, we use residuals to estimate the working correlation but our procedure involves using a parameterization described in the theoretical semivariogram. Over-dispersion is common in COVID-19 case counts due to high variability in reporting practices across counties but is conveniently accounted for with an additional parameter in the GEE. We calculate the dispersion parameter ϕ and standard residuals ri,t as where rank(X) is the number of linearly independent covariates and S(t) is the set of observed counties at time t. Through the quasi-likelihood, GEEs incorporate a dispersion parameter, resulting in an over-dispersed Poisson regression.
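
To make the role of the dispersion parameter concrete, the sketch below computes the usual moment estimate of ϕ from Pearson residuals and rescales the residuals by it. This is the textbook unweighted estimator, shown as an assumption; it omits the membership weights and the per-date sets of observed counties used in the analysis. The counts, fitted means and `n_params` value are hypothetical.

```python
import math

def pearson_dispersion(y, mu, n_params):
    """Moment estimate of the dispersion from Poisson Pearson residuals."""
    pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]
    phi = sum(r * r for r in pearson) / (len(y) - n_params)
    # Dividing by sqrt(phi) rescales the residuals to unit variance.
    standardized = [r / math.sqrt(phi) for r in pearson]
    return phi, standardized

# Hypothetical counts and fitted means, for illustration only.
y = [12, 3, 30, 7, 18]
mu = [8.0, 5.0, 20.0, 9.0, 15.0]
phi, resid = pearson_dispersion(y, mu, n_params=1)
```

An estimate of ϕ above one, as in this toy example, indicates more variability than a Poisson model allows, which is the over-dispersion the GEE absorbs.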

GEE for excess zeros.

To accommodate the latent membership Zi,t, we construct another GEE for estimating pi,t, the probability of an excess zero, using the logit link function. This GEE formulation is given as (4) where and τZ is the scale parameter for the excess zero model. Here Ut and λ are the covariates and regression coefficients for modeling the excess zero. Because we do not have the complete data, in order to account for serial correlation among excess zeros, we replace the transition term Zi,t−1 in (2) with an indicator of zero cases on the previous day, in order to capture temporal trends. In practice, Z is often an imbalanced binary outcome, where over-parameterization of covariates can cause numerical instability when fitting the model. Therefore, it has been suggested that it be fitted with a parsimonious set of covariates [27].

We calculate standard Pearson residuals as

$$e_{i,t} = \frac{z_{i,t} - p_{i,t}}{\sqrt{p_{i,t}\,(1 - p_{i,t})}}$$

which are used for estimating the working correlation using a semivariogram as described in the next section.

Estimation of τ through the empirical semivariogram.

After each Newton-Raphson update of the regression coefficients β and λ, we also update the working correlation by estimating the scale parameter τ using the empirical semivariogram. In this section we outline procedures for modeling semivariograms and their computational implementation. First, we calculate the empirical semivariogram by grouping pairs of residuals into U bins, based on equally spaced intervals of distance. Each bin of residuals is used to calculate an empirical estimate at the midpoint of the bin; the midpoints are denoted as d1, d2, d3, …, dU, with the corresponding intervals denoted as du±δ. After we obtain the empirical semivariogram, we estimate τ from the theoretical semivariogram model through a weighted least squares approach that minimizes the squared difference between the theoretical and empirical semivariograms. Using the procedure detailed in [38], the weighted least squares solution is given as (5) where weights wi,t and wj,t replace the sample size in our case. The weighted least squares approach is computationally fast and yields robust estimates. Placing the theoretical semivariogram in the denominator down-weights the influence of observed correlations separated by large distances.

We denote the solution of (5), as a function of the residuals ri,t, ei,t and weights, by τP = G(r, W) and τZ = G(e, I) for the Poisson and excess zero model, respectively. Recall that we use separate GEEs for the Poisson counts and the excess zeros to account for the spatial correlation through a mixture of marginal (zero-inflated) models. We follow some common practices regarding semivariogram models: we calculate the empirical semivariogram using pairwise distances that are less than half the maximum distance, and we bound τ ∈ (0, max_{i,j}(d_{i,j})/3] so that, since exp(−3) ≈ 0.05, the correlation associated with the maximum distance is upper bounded at 0.05. In our case, max_{i,j}(d_{i,j}) = 592318.3 meters. Next, we outline options for computing the empirical semivariogram.
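
The bounded weighted least squares fit of τ can be sketched as follows. The objective places the theoretical semivariogram in the denominator, as described above, and uses a pair count per bin as the weight; the exact weighting in (5) is not reproduced here. The sketch assumes an exponential semivariogram with unit sill and uses a golden-section search as a stand-in for a general-purpose optimizer. The bin midpoints, pair counts and "true" scale are hypothetical; only the maximum distance is taken from the text.

```python
import math

def gamma_exp(d, tau, phi=1.0):
    """Exponential semivariogram with sill phi and scale tau."""
    return phi * (1.0 - math.exp(-d / tau))

def wls_objective(tau, mids, gamma_hat, n_pairs):
    """Squared relative error between empirical and theoretical values."""
    return sum(n * (g / gamma_exp(d, tau) - 1.0) ** 2
               for d, g, n in zip(mids, gamma_hat, n_pairs))

def golden_section_min(f, lo, hi, tol=1e-3):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

d_max = 592318.3                                   # meters, from the text
mids = [25000.0 + 50000.0 * u for u in range(6)]   # bins below d_max / 2
true_tau = 60000.0                                 # hypothetical scale
gamma_hat = [gamma_exp(d, true_tau) for d in mids]
n_pairs = [120, 260, 310, 280, 200, 150]           # hypothetical bin sizes

tau_hat = golden_section_min(
    lambda tau: wls_objective(tau, mids, gamma_hat, n_pairs),
    1.0, d_max / 3.0)
```

Because the toy empirical values are generated from the model itself, the search should recover a value of `tau_hat` close to `true_tau`; with noisy residuals the minimizer would only approximate it.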

Empirical semivariograms.

The standard moment estimate of the empirical semivariogram is given as [39] presented an alternative estimate that is robust, and [40, 41] presented median-based approaches with a 50% breakdown point [39–41]. Because estimating τ is a sub-routine in the ES algorithm, we elect to use the approach from [39] as our empirical semivariogram: for robustness considerations. We denote the corresponding τ estimates as GCH(r, W) and GCH(e, I) in our analysis. The empirical semivariogram is calculated in parallel and τ is estimated numerically using weighted least squares and the optim function in R [42].
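
For a single distance bin, the two kinds of estimator can be compared with a short Python sketch. It shows the classical moment (Matheron) estimator and the robust estimator commonly attributed to Cressie and Hawkins, which the CH subscript above suggests; both are written in their standard unweighted textbook forms, which is an assumption, since the analysis replaces sample sizes with membership weights. The residual differences are hypothetical.

```python
import math

def matheron(diffs):
    """Classical moment estimate: half the mean squared difference."""
    return sum(d * d for d in diffs) / (2.0 * len(diffs))

def cressie_hawkins(diffs):
    """Robust estimate based on square-root absolute differences."""
    n = len(diffs)
    mean_root = sum(math.sqrt(abs(d)) for d in diffs) / n
    return mean_root ** 4 / (2.0 * (0.457 + 0.494 / n))

# Hypothetical residual differences r_i - r_j for pairs in one bin.
clean = [1.0, -1.0, 2.0]
contaminated = clean + [50.0]  # one outlying pair

m_clean = matheron(clean)
m_out = matheron(contaminated)
ch_out = cressie_hawkins(contaminated)
```

A single outlying pair inflates the moment estimate far more than the robust one, which is the motivation for using a robust estimator inside an iterative fitting routine.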

Expectation-Solution algorithm

Following [27] we construct an Expectation-Solution algorithm for our zero-inflated regression models as a mixture of marginal models [27, 31]. The ES algorithm is a modification to the well-known Expectation-Maximization (EM) algorithm, where the M-step is replaced by the solution from the GEE [43]. We further modify the working correlation to follow a spatial correlation structure which is estimated using a semivariogram [30].

E-step.

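In the E-step, we update the unobserved Z. The conditional expectation is given as:

$$z_{i,t}^{(s)} = E\left[Z_{i,t} \mid y_{i,t}\right] = \begin{cases} \dfrac{p_{i,t}^{(s)}}{p_{i,t}^{(s)} + \left(1 - p_{i,t}^{(s)}\right) \exp\left(-\mu_{i,t}^{(s)}\right)}, & y_{i,t} = 0, \\ 0, & y_{i,t} > 0, \end{cases} \qquad (6)$$

$$w_{i,t}^{(s)} = 1 - z_{i,t}^{(s)}, \qquad (7)$$

where the w(s)i,t are the updated Poisson model membership weights. We denote the iteration number of the ES algorithm as s.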
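
The E-step can be sketched with the standard zero-inflated Poisson posterior: a positive count cannot be an excess zero, while an observed zero is assigned the posterior probability of belonging to the excess-zero component. The function name `e_step` and the inputs are hypothetical, and using the plain Poisson zero probability exp(−μ) is an assumption of this sketch.

```python
import math

def e_step(y, mu, p):
    """Return expected excess-zero indicators z and Poisson weights w."""
    z = []
    for yi, mi, pi in zip(y, mu, p):
        if yi > 0:
            z.append(0.0)  # a positive count is never an excess zero
        else:
            z.append(pi / (pi + (1.0 - pi) * math.exp(-mi)))
    w = [1.0 - zi for zi in z]
    return z, w

# Hypothetical counts, Poisson means and excess-zero probabilities.
y = [0, 4, 0]
mu = [0.5, 3.0, 20.0]
p = [0.2, 0.2, 0.2]
z, w = e_step(y, mu, p)
```

A zero observed where the Poisson mean is large (the third entry) is almost certainly an excess zero, so its weight in the Poisson model is close to zero; a zero where the mean is small (the first entry) remains plausible under the Poisson model.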

S-step.

The S-step replaces the maximization step in the EM algorithm. In the S-step, we iteratively estimate the Poisson model parameters β, τP and the excess zero model parameters λ, τZ. The iterative updates for β using Newton-Raphson are given as: (8) and is updated with (5) using . We denote the update as (9) where The iteration number of the S-step is denoted by k. We repeat (8) and (9) until convergence to obtain β(s + 1), a new iteration in the greater ES algorithm.

We apply the same procedure in the S-step for excess zero model parameters: λ and τZ (10) (11) where and the weight matrices are replaced by the identity matrix. Eqs (10) and (11) are repeated until convergence to obtain λ(s + 1). For the ES algorithm, we iteratively update the E-step: z(s), W(s), and the S-step: β(s), λ(s) until convergence.
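
The alternation between the E-step and the S-step can be illustrated with a deliberately simplified case: an intercept-only zero-inflated Poisson model with an independence working correlation. Under those assumptions, which are made only for this sketch, the weighted score equations have closed-form solutions (a weighted mean for μ and the average of z for p), so no Newton-Raphson or semivariogram update is needed. The data are hypothetical.

```python
import math

def fit_zip_intercept_only(y, iters=200):
    """Alternate E- and S-steps for an intercept-only ZIP model."""
    n = len(y)
    mu, p = sum(y) / n, 0.5  # crude starting values
    for _ in range(iters):
        # E-step: posterior probability that each zero is an excess zero.
        z = [p / (p + (1.0 - p) * math.exp(-mu)) if yi == 0 else 0.0
             for yi in y]
        w = [1.0 - zi for zi in z]
        # S-step: solve the weighted score equations (closed form here).
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        p = sum(z) / n
    return mu, p

# Hypothetical data: 40 zeros and 60 positive counts.
y = [0] * 40 + [2, 3, 3, 4, 4, 4, 5, 5, 6, 7] * 6
mu_hat, p_hat = fit_zip_intercept_only(y)
```

The fitted mean is well above the raw average of the counts because most zeros are attributed to the excess-zero component rather than to the Poisson model.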

Inference on coefficients

Using the heteroskedasticity-consistent or robust sandwich estimator for standard errors, we derive confidence intervals for β; we have consistent estimates even under misspecified working correlations which are spatial correlations in our case [31]. We proposed a reasonable spatial correlation model but retain reliable inference even in situations when the correlation model is incorrect. Analogously, we can also calculate the covariance for λ with the equivalent formulation without the weight matrices, Wt, using the score Eq (4). We use the asymptotic distribution, for inference, substituting the final estimates from the ES algorithm.

Results

Data

Poisson model covariates.

Daily reported cases for 67 counties in Pennsylvania from March 20, 2020 to January 23, 2021 were obtained from the New York Times GitHub repository, https://github.com/nytimes/covid-19-data [12]. We restrict our study dates to be prior to widespread vaccine distribution, COVID-19 variants and at-home testing. The New York Times dataset includes confirmed cases from PCR tests as well as probable cases. Probable cases are derived from a set of testing, symptom and exposure criteria recommended by the Council of State and Territorial Epidemiologists, which were also adopted by the Centers for Disease Control and Prevention on April 14, 2020 [44]. Data collection for a county starts after the first recorded case, and we exclude the first 14 days of documented cases per county as reporting is often inconsistent during the start of data collection [45]. This also ensures carried forward imputation values: and are all defined.

Our primary goal is to understand the relationship between the timing of a holiday and daily new case counts. We consider federal holidays: New Year's Day, Martin Luther King Jr. Day, George Washington's Birthday, Good Friday, Memorial Day, Independence Day, Labor Day, Columbus Day, Veterans Day, Thanksgiving, Christmas and election day (November 3rd, 2020) [46]. Federal holidays are dates on which a large proportion of the population is not working, enabling congregation and subsequent transmission of COVID-19. For the Poisson regression model, we create a vector of binary {1, 0} holiday indicators, one for each day from today back to 14 days prior, marking whether that day was a holiday. For example, for the count recorded on January 1st, the associated 15 holiday lag indicators are (1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0) for dates: New Year's Day, Dec 31, Dec 30, Dec 29, Dec 28, Dec 27, Dec 26, Christmas, …, Dec 18. We abbreviate the indicator for a holiday being k days in the past as Holiday Lag k: HLk. As another example, for the count recorded on January 2nd, the indicators are (0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0). Therefore, the regression coefficients of our holiday indicators can inform whether there are increased cases (i.e., a surge), with incidence rate ratios greater than 1, in relation to the presence and timing of a holiday within the prior 2 weeks. If the hypothesis of a post-holiday surge in COVID-19 cases is true, we expect to see several incidence rate ratios (IRRs) associated with HLk to be significantly greater than 1 after a holiday.
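
The construction of the 15 holiday lag indicators can be reproduced with a short Python sketch. The function name `holiday_lags` is hypothetical, and the holiday set contains only the two dates needed to reproduce the examples in the text rather than the full holiday calendar.

```python
from datetime import date, timedelta

# Only the two holidays needed for the examples in the text.
HOLIDAYS = {date(2020, 12, 25), date(2021, 1, 1)}

def holiday_lags(day, holidays, max_lag=14):
    """Indicators HL0..HL14: was the date k days before `day` a holiday?"""
    return tuple(int((day - timedelta(days=k)) in holidays)
                 for k in range(max_lag + 1))

jan1 = holiday_lags(date(2021, 1, 1), HOLIDAYS)
jan2 = holiday_lags(date(2021, 1, 2), HOLIDAYS)
```

For January 1st the indicators at lags 0 and 7 are set (New Year's Day and Christmas), and for January 2nd both shift by one position, matching the two vectors given in the text.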

To account for the association with the day of the week, we include an indicator for the day of the week with Wednesday as the baseline. Our regression analysis also controls for other factors or covariates that may be associated with the incidence of COVID-19 cases as well as excess zeros in reporting. For covariates associated with case incidence, there is growing evidence suggesting that warm and wet climate conditions seem to reduce the spread of COVID-19 [47]. Precipitation and maximum temperature data from daily summaries, obtained from the National Oceanic and Atmospheric Administration (NOAA) web service, were included as additional covariates [48]. Maximum daily temperature and daily precipitation were downloaded for weather stations across Pennsylvania. Using latitude and longitude for each county centroid and weather station, county-specific weather data were interpolated using weather data from the same date. Interpolation was carried out using the VIM R package, taking the median of the K nearest neighbors (K-nn) with K = 5 [49]. Precipitation was further binarized into a {0, 1} indicator based on whether or not it rained that day. Similar to the holiday covariates, we also construct 15 lag covariate indicators for precipitation in the prior 2 weeks. We also construct 15 continuous covariates of Fahrenheit temperature values for the prior 2 weeks. Following the convention of abbreviating Holiday Lag as HLk, we abbreviate the covariates for temperature as TLk and precipitation as PLk.
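
The idea behind the median K-nearest-neighbor interpolation can be sketched in Python. This is not the VIM implementation: it ranks stations by great-circle (haversine) distance to the county centroid, which is an assumption of this sketch, and takes the median of the K = 5 closest readings. The coordinates and temperatures are hypothetical.

```python
import math
from statistics import median

def haversine_m(lat1, lon1, lat2, lon2, radius=6371000.0):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2.0 * radius * math.asin(math.sqrt(a))

def knn_median(centroid, stations, k=5):
    """Median reading of the k stations nearest to the centroid."""
    ranked = sorted(
        stations,
        key=lambda s: haversine_m(centroid[0], centroid[1], s[0], s[1]))
    return median([value for _, _, value in ranked[:k]])

# Hypothetical same-day readings: (latitude, longitude, max temperature F).
stations = [
    (40.0, -77.1, 50.0), (40.1, -77.0, 52.0), (39.9, -77.0, 54.0),
    (40.0, -76.9, 56.0), (40.2, -77.0, 58.0), (45.0, -70.0, 100.0),
]
county_temp = knn_median((40.0, -77.0), stations, k=5)
```

The distant station with an extreme reading is excluded from the five nearest neighbors, and using the median further limits the influence of any single unusual station.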

Other covariates included for the Poisson regression consisted of well-known demographic factors that affect health outcomes [50, 51]. Demographic information from the U.S. census and 2016 American Community Survey were also included in the model as county-specific covariates. For each county, quintiles for population density, percentage of the population living in poverty, log median household income, log median house value, percentage of Black residents, percentage of Hispanic residents, percentage of the adult population with less than high school (HS) education, and percentage of owner-occupied housing were included as county-specific covariates. The data was obtained using the same source outlined in [51].

Zero-inflation covariates.

The presence of excess zeros can bias coefficient estimates in the negative direction if ignored. A non-negligible 21% of the reported case counts are zeros. Therefore, we propose a parsimonious model for Z by considering only a subset of covariates from the main Poisson regression model, such that zero-inflation is accounted for based on well-known relationships associated with reporting zero cases. Based on the COVID-19 case reporting guidelines, covariates associated with excess zeros include holidays and weekends. For example, there tends to be a gap in reporting during weekends and holidays, suggesting case count reporting is more likely to occur on business days. This pattern does not reflect the reality of the pandemic, and backlogged cases tend to be lumped into the next business day’s case count.

Indicators for tomorrow, today and yesterday being a holiday are included in the model for Z. As an example, for outcome data recorded on December 31st, the covariates will be (1, 0, 0). An indicator for weekend is also included. In addition, two county-specific covariates, percentage of population in poverty and percentage of owner-occupied housing, are also included. Owner-occupied housing rate is an indicator of the urban-rural dichotomy, with rural counties having a high rate of owner-occupied housing. Poverty rate and owner-occupied housing rate, as socioeconomic markers, are potential confounders for zero-inflation since the economy of a county may determine the resources available for municipal services such as case reporting. We also include an indicator for whether the previous day was a zero as a covariate to account for temporal trends.

COVID-19 data analysis

In addition to the Poisson model, we fitted a negative binomial hurdle model to capture the severity of a COVID-19 outbreak conditional on case counts being greater than zero [52, 53]. The hurdle model fits a negative binomial regression truncated to the positive case counts, {Yi,t: Yi,t > 0}, using the same covariates as (3) and models zero versus positive outcomes with a logistic regression using the same covariates as (4). Robust sandwich estimators for the hurdle model were obtained using the sandwich R package [54, 55].

We fitted the following four models: 1.) a Poisson generalized linear model without spatial correlation, without over-dispersion and without zero-inflation, which we denote as GLM; 2.) a negative binomial hurdle model without spatial correlation, which we denote as Hurdle-NB; 3.) an over-dispersed Poisson marginal model with spatial correlation but without zero-inflation, which we denote as GEE-OP; 4.) a zero-inflated over-dispersed Poisson marginal model as described by the Expectation-Solution algorithm, which we denote as GEE-ZIOP.

The GEE-ZIOP results in Fig 2 and Table 2 show estimated incidence rate ratios (and 95% confidence intervals) of 1.41 (1.07, 1.85), 1.37 (1.09, 1.72) and 1.30 (1.08, 1.57) at 6, 7 and 8 days after a holiday, respectively. At 6 to 8 days after a holiday, we expect a surge with 1.3 to 1.41 times as many cases as there would be without a holiday. Intervals for IRR estimates were plotted for all models in Fig 2 and showed that the IRR estimates from the GEE-ZIOP model at the surge were greater than those from the GLM, Hurdle-NB and GEE-OP models. All four models yield maximum IRR estimates at 6-8 days, suggesting that a surge in case counts occurs about 6-8 days after a holiday.
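
The link between a log-scale regression coefficient and the reported IRR can be illustrated by working backwards from the day-6 estimate. The sketch assumes a Wald-type interval that is symmetric on the log scale with a normal critical value of 1.96; the coefficient and standard error it recovers are approximations implied by the rounded figures, not values taken from Table 2.

```python
import math

# Reported day-6 IRR and 95% confidence interval from the text.
irr, lower, upper = 1.41, 1.07, 1.85

beta = math.log(irr)                                    # log-scale coefficient
se = (math.log(upper) - math.log(lower)) / (2 * 1.96)   # implied robust SE

# Exponentiating beta and beta +/- 1.96 * se returns to the IRR scale.
recovered = (
    math.exp(beta),
    math.exp(beta - 1.96 * se),
    math.exp(beta + 1.96 * se),
)
```

Rounded to two decimals, `recovered` matches the reported estimate and interval, which is consistent with intervals computed on the log scale and then exponentiated.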

Fig 2. Incidence rate ratio with robust 95% confidence interval for holiday effects.

Estimated IRRs corresponding to the 15 holiday covariates, for each model. 95% confidence intervals were calculated using robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.g002

Table 2. Intercept and holiday coefficients with 95% confidence intervals based on robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.t002

We did not find similar time-varying effects of temperature or precipitation (results can be found in Tables 3 and 4), and most of the IRRs associated with TLk and PLk covariates were not significant. From the results of the GEE models in Table 5, we observed a significant increase in case incidence on Mondays, which aligns with the dataset guidelines. Cases that were not reported during the weekend, possibly due to logistical issues, were reported on Monday instead. Additional county demographic effects can be found in Table 5. Notably, counties with a high proportion of African-American residents were associated with increased COVID-19 incidence in all four models, with significant effects in the Hurdle-NB and GEE-ZIOP models. In addition, counties with high median household income were associated with decreased COVID-19 incidence in three of the four models, with significant effects in the GEE-OP and GEE-ZIOP models. Counties with a low rate of high school education attainment were also associated with increased COVID-19 incidence in all four models, albeit not always at a significant level. Although many county demographic effects varied across the four models, a number of social and demographic effects from our analysis aligned with results from the literature [56, 57].

Table 3. Temperature coefficients and 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0279371.t003

Table 4. Precipitation coefficients and 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0279371.t004

Table 5. Day of week and county demographic coefficients and 95% confidence intervals based on robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.t005

Our results for modeling zero counts can be found in Table 6. The zero-inflated model differs from a hurdle model by treating zero counts probabilistically through a mixture model, while the hurdle model treats every zero count as a success in a Bernoulli process. For our data, we believe zeros may originate from the Poisson model or be artificially generated by logistical issues in reporting, so a mixture model is more appropriate. However, a hurdle model is equivalent to a zero-inflated model when zero counts from the underlying Poisson model are rare, and it serves as a useful sensitivity analysis under different modeling assumptions. Counties with a high rate of owner-occupied housing were significantly associated with an increased probability of reporting zero cases in both models. Owner-occupied housing may reflect the urban-rural dichotomy of counties, with rural counties having a high rate of owner-occupied housing; by extension, we conclude that rural counties have an increased probability of reporting zero cases. This confounding may be due to logistical challenges in reporting cases at a daily frequency and to less community transmission in sparsely populated rural areas. In addition, a high poverty rate was significantly associated with an increased probability of reporting zero cases in both models. By testing the effects of socioeconomic factors on case reporting, we found that zero reported cases could be confounded by the disease surveillance resources of local municipalities. Rural and poor counties may lack the resources necessary to reliably report cases, leading to excess zeros, which can ultimately hinder the modeling of COVID-19 incidence rates if not addressed.
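
The distinction between the two treatments of zeros can be made concrete with a small sketch. It uses Poisson count components for both models purely for illustration (the hurdle model fitted in this analysis is Hurdle-NB), and the function names and parameter values are hypothetical. The zero-inflated model adds a point mass at zero on top of the Poisson zeros, whereas the hurdle model assigns all zeros to the Bernoulli part and uses a zero-truncated count distribution for positive values.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability mass at count y with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)


def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson: excess-zero mass pi mixed with Poisson(mu)."""
    base = (1 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base


def hurdle_pmf(y, mu, p0):
    """Poisson hurdle: zero with probability p0, else zero-truncated Poisson(mu)."""
    if y == 0:
        return p0
    return (1 - p0) * poisson_pmf(y, mu) / (1 - math.exp(-mu))


def max_abs_diff(mu, p, y_max=60):
    """Largest pointwise gap between the two pmfs when pi = p0 = p."""
    return max(abs(zip_pmf(y, mu, p) - hurdle_pmf(y, mu, p)) for y in range(y_max + 1))


# Large mean: Poisson zeros are rare, so the two models nearly coincide.
print(max_abs_diff(mu=20.0, p=0.3))
# Small mean: Poisson zeros are common, so the models differ noticeably.
print(max_abs_diff(mu=0.5, p=0.3))
```

When the Poisson mean is large, exp(-mu) is negligible and the two probability mass functions are almost identical, which is the sense in which the hurdle model can act as a sensitivity analysis.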

Table 6. Coefficients and 95% confidence intervals based on robust standard errors for the excess zero model.

https://doi.org/10.1371/journal.pone.0279371.t006

Discussion

One of the goals of this analysis was to understand the relationship between holiday timing and surges in COVID-19 cases while accounting for zero-inflation, temporal effects, and spatial correlations in case counts from neighboring areas. We implemented an ES algorithm to estimate our zero-inflated Poisson model as a mixture of marginal models. We used a semivariogram for the spatial working correlation and updated it in the S-step, after each Newton-Raphson iteration, taking advantage of fast and robust semivariogram estimation as a sub-routine in the greater ES algorithm. Considering the zero-inflated nature of the data, carried-forward imputation and time-lagged covariates were used to account for the temporal pattern of case reporting. After the estimates converged, we used the robust covariance estimator, which is consistent under misspecified spatial correlation, to compute standard errors.
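
To give a sense of what a semivariogram sub-routine computes, the following sketch implements Matheron's classical binned estimator on a toy set of locations. This is an illustrative stand-in, not the estimator used in this analysis: the choice of estimator, the function name, the binning scheme, and the toy coordinates and residuals are all assumptions made for the example.

```python
import math
from itertools import combinations


def empirical_semivariogram(coords, resid, bin_width, n_bins):
    """Classical (Matheron) estimator: for each distance bin, half the mean
    squared difference between residuals of location pairs in that bin.
    Bins with no pairs are returned as None."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(coords)), 2):
        b = int(math.dist(coords[i], coords[j]) // bin_width)
        if b < n_bins:
            sums[b] += (resid[i] - resid[j]) ** 2
            counts[b] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Toy example: four hypothetical county centroids on a line, with residuals.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
resid = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, resid, bin_width=1.0, n_bins=3)
print(gamma)
```

In an iterative scheme of the kind described above, a routine like this would be re-run on updated residuals after each coefficient update, and a fitted semivariogram curve would then supply the working correlation as a function of distance.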

We analyzed data from March 20, 2020 to January 23, 2021, before widespread vaccine distribution, COVID-19 variants, and at-home testing. Our analysis suggests a statistically significant surge in reported cases, with an IRR of 1.3 to 1.41, 6-8 days after a holiday, followed by a gradual tapering off. This is in line with the anticipated timing of 4–5 days to symptom onset and up to 3 days for PCR testing results. Although the models estimated the surge in reported cases to be 6-8 days after a holiday, transmission may have occurred days prior to the holiday. It is important to note that a calendar holiday date is not the exact date of COVID-19 transmission. For example, travel may commence the Friday before a Monday holiday, or people may extend their holidays and travel and congregate days after a holiday, e.g., Thanksgiving and Friday holidays. In reality, there is a window of days surrounding a holiday during which transmission likely occurred. In contrast, we observed negative effects 2–3 days after a holiday, which could be explained by the testing lag associated with holidays. If one contracts COVID-19 when traveling near the calendar holiday date, it takes on average 4–5 days for symptoms to appear. We believe that the timing of the negative effect could be explained by the asymptomatic incubation period right after contracting COVID-19, when people have yet to be tested.

For all but the hurdle model, the estimates in Fig 2 confirm a drop in case counts on a holiday, with many of the backlogged cases being reported the next day, as evidenced by a large positive effect one day after a holiday. In our GEE-ZIOP model, the IRR is estimated to be 0.86 when the current day is a holiday and 1.31 one day after a holiday. This trend is aligned with case reporting guidelines. Note that all four models (GLM, Hurdle-NB, GEE-OP and GEE-ZIOP) share similar conclusions regarding the holiday effect, indicating that the surge in reported cases occurs 6-8 days after a holiday and suggesting that these results are robust under different assumptions.
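
The time-lagged holiday covariates behind these estimates can be pictured as a set of indicator variables, one per lag. The sketch below is a hypothetical construction, assuming 15 indicators for lags 0 through 14 (consistent with the 15 holiday covariates in Fig 2) and using Thanksgiving 2020 as the only holiday; the function name and data layout are illustrative and are not taken from the analysis code.

```python
from datetime import date, timedelta


def holiday_lag_indicators(day, holidays, max_lag=14):
    """Return [HL_0, ..., HL_max_lag], where HL_k is 1 if `day` falls exactly
    k days after any date in `holidays`, and 0 otherwise."""
    return [int((day - timedelta(days=k)) in holidays) for k in range(max_lag + 1)]


# Hypothetical single-holiday calendar: Thanksgiving 2020.
holidays = {date(2020, 11, 26)}

on_holiday = holiday_lag_indicators(date(2020, 11, 26), holidays)
week_after = holiday_lag_indicators(date(2020, 12, 3), holidays)
print(on_holiday.index(1), week_after.index(1))
```

Each reporting day then contributes one row of indicators, and the exponentiated coefficient of the k-th indicator is the IRR at k days after a holiday, such as the 0.86 (lag 0) and 1.31 (lag 1) quoted above.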

Comparing the different models, the GEE-ZIOP model was associated with an increase in the coefficient estimates, but at the expense of efficiency. When using GEE-ZIOP, weighting each data point by the zero-inflation membership probability decreases the effective sample size. However, even with this trade-off, meaningful results can still be obtained: our GEE-ZIOP estimates for the coefficients of HL6, HL7, and HL8 are significantly greater than zero. Furthermore, the GEE-ZIOP model detected an IRR significantly greater than 1 at 6 days after a holiday, while GEE-OP did not. Several holidays (Memorial, Labor, Columbus) fall on a Monday, so 6 days after these holidays is a Sunday, and Sundays are likely dates on which zero cases are reported. GEE-ZIOP allows us to down-weight these excess zeros, revealing that surges in reported cases can occur as early as 6 days after a holiday.
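
The down-weighting of zeros can be sketched with the standard zero-inflated Poisson membership probability: an observed zero belongs to the Poisson component with probability (1 - pi)·exp(-mu) / (pi + (1 - pi)·exp(-mu)), while any positive count belongs to it with probability one. The code below is a minimal illustration with hypothetical counts and parameter values, and it treats the sum of weights as a rough proxy for effective sample size, which is a simplifying assumption rather than a quantity reported in this analysis.

```python
import math


def poisson_weight(y, mu, pi):
    """Posterior probability that count y came from the Poisson component
    of a zero-inflated Poisson model with mean mu and excess-zero mass pi."""
    if y > 0:
        return 1.0  # positive counts can only come from the Poisson part
    p_pois_zero = (1 - pi) * math.exp(-mu)
    return p_pois_zero / (pi + p_pois_zero)


# Hypothetical daily counts for one county, with assumed mu and pi.
counts = [0, 0, 3, 5, 0, 8, 2]
weights = [poisson_weight(y, mu=4.0, pi=0.2) for y in counts]
effective_n = sum(weights)  # rough proxy: zeros contribute only fractionally
print([round(w, 3) for w in weights], round(effective_n, 2))
```

With a fitted mean well above zero, an observed zero is far more likely to be an excess zero than a Poisson zero, so it receives a small weight in the count model; this is why zero-heavy days such as Sundays have limited influence on the GEE-ZIOP holiday coefficients.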

In summary, we expect a surge in county-level reported cases one week after a holiday in the state of Pennsylvania. While our results reaffirm the common intuition regarding the early COVID-19 pandemic, our regression model elucidates the temporal trends of post-holiday case counts over a two-week period. Our results illustrate the timing of a post-holiday surge while considering federal holidays and election day. Understanding post-holiday surges can inform public policy and can be used to improve public health programs. We believe that our modeling approach and conclusions, based on the rich data gathered through rigorous surveillance during the beginning of the pandemic, can guide future disease modeling analyses and are applicable to a wide range of epidemiological settings.

References

  1. Guan Wj, Ni Zy, Hu Y, Liang Wh, Ou Cq, He Jx, et al. Clinical characteristics of coronavirus disease 2019 in China. New England journal of medicine. 2020;382(18):1708–1720.
  2. Lai S, Ruktanonchai NW, Zhou L, Prosper O, Luo W, Floyd JR, et al. Effect of non-pharmaceutical interventions to contain COVID-19 in China. Nature. 2020;585(7825):410–413. pmid:32365354
  3. Brauner JM, Mindermann S, Sharma M, Johnston D, Salvatier J, Gavenčiak T, et al. Inferring the effectiveness of government interventions against COVID-19. Science. 2020;. pmid:33323424
  4. Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. Jama. 2020;323(13):1239–1242. pmid:32091533
  5. Chen S, Yang J, Yang W, Wang C, Bärnighausen T. COVID-19 control in China during mass population movements at New Year. The Lancet. 2020;395(10226):764–766. pmid:32105609
  6. Honein MA, Christie A, Rose DA, Brooks JT, Meaney-Delman D, Cohn A, et al. Summary of Guidance for Public Health Strategies to Address High Levels of Community Transmission of SARS-CoV-2 and Related Deaths, December 2020. Morbidity and mortality weekly report. 2020;69(49):1860. pmid:33301434
  7. Mehta SH, Clipman SJ, Wesolowski A, Solomon SS. Holiday gatherings, mobility and SARS-CoV-2 transmission: results from 10 US states following Thanksgiving. Scientific reports. 2021;11(1):1–9. pmid:34462499
  8. Klausner Z, Fattal E, Hirsch E, Shapira SC. A single holiday was the turning point of the COVID-19 policy of Israel. International journal of infectious diseases. 2020;101:368–373. pmid:33045425
  9. Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, et al. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine. 2020;172(9):577–582. pmid:32150748
  10. Wajnberg A, Mansour M, Leven E, Bouvier NM, Patel G, Firpo-Betancourt A, et al. Humoral response and PCR positivity in patients with COVID-19 in the New York City region, USA: an observational study. The Lancet microbe. 2020;1(7):e283–e289. pmid:33015652
  11. Albert E, Torres I, Bueno F, Huntley D, Molla E, Fernández-Fuentes MÁ, et al. Field evaluation of a rapid antigen test (Panbio™ COVID-19 Ag Rapid Test Device) for COVID-19 diagnosis in primary healthcare centres. Clinical microbiology and infection. 2020;. pmid:33189872
  12. The New York Times. Coronavirus (Covid-19) Data in the United States; 2021. https://github.com/nytimes/covid-19-data.
  13. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34(1):1–14.
  14. Xie Y, Xu L, Li J, Deng X, Hong Y, Kolivras K, et al. Spatial Variable Selection and An Application to Virginia Lyme Disease Emergence. Journal of the American statistical association. 2019;114(528):1466–1480.
  15. Elliott P, Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environmental health perspectives. 2004;112(9):998–1006. pmid:15198920
  16. Schnell PM, Papadogeorgou G, et al. Mitigating unobserved spatial confounding when estimating the effect of supermarket access on cardiovascular disease deaths. Annals of applied statistics. 2020;14(4):2069–2095.
  17. Hanna-Attisha M, LaChance J, Sadler RC, Champney Schnepp A. Elevated blood lead levels in children associated with the Flint drinking water crisis: a spatial analysis of risk and public health response. American journal of public health. 2016;106(2):283–290. pmid:26691115
  18. Kang D, Choi H, Kim JH, Choi J. Spatial epidemic dynamics of the COVID-19 outbreak in China. International journal of infectious diseases. 2020;94:96–102. pmid:32251789
  19. Banerjee S., Carlin B. & Gelfand A. Hierarchical modeling and analysis for spatial data. (Chapman,2003)
  20. Moraga P. Geospatial health data: modeling and visualization with R-INLA and shiny. (CRC Press,2019)
  21. Lawson A. Bayesian disease mapping: hierarchical modeling in spatial epidemiology. (Chapman,2018)
  22. Zhang L., Baladandayuthapani V., Zhu H., Baggerly K., Majewski T., Czerniak B. et al. Functional CAR models for large spatially correlated functional datasets. Journal of the American statistical association. 2016;111(514), 772–786 pmid:28018013
  23. Blangiardo M. & Cameletti M. Spatial and spatio-temporal Bayesian models with R-INLA. (John Wiley & Sons,2015)
  24. Zeger SL, Qaqish B. Markov regression models for time series: a quasi-likelihood approach. Biometrics. 1988; p. 1019–1031. pmid:3148334
  25. Diggle P, Diggle PJ, Heagerty P, Liang KY, Heagerty PJ, Zeger S, et al. Analysis of longitudinal data. (Oxford University Press,2002)
  26. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
  27. Hall DB, Zhang Z. Marginal models for zero inflated clustered data. Statistical modelling. 2004;4(3):161–180.
  28. Gelfand AE, Diggle P, Guttorp P, Fuentes M. Handbook of spatial statistics. (CRC press,2010)
  29. Lark R. A comparison of some robust estimators of the variogram for use in soil survey. European journal of soil science. 2000;51(1):137–157.
  30. Albert PS, McShane LM. A generalized estimating equations approach for spatially correlated binary data: applications to the analysis of neuroimaging data. Biometrics. 1995; p. 627–638. pmid:7662850
  31. Rosen O, Jiang W, Tanner MA. Mixtures of marginal models. Biometrika. 2000;87(2):391–404.
  32. Preisser JS, Galecki AT, Lohman KK, Wagenknecht LE. Analysis of smoking trends with incomplete longitudinal binary responses. Journal of the American statistical association. 2000;95(452):1021–1031.
  33. Preisser JS, Lohman KK, Rathouz PJ. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in medicine. 2002;21(20):3035–3054. pmid:12369080
  34. Reilly C, Gelman A. Weighted classical variogram estimation for data with clustering. Technometrics. 2007;49(2):184–194.
  35. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. vol. 998. (John Wiley & Sons,2012)
  36. Hall DB. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics. 2000;56(4):1030–1039. pmid:11129458
  37. Torabi M, Rosychuk RJ. Spatio-temporal modelling of disease mapping of rates. Canadian journal of statistics. 2010;38(4):698–715.
  38. Cressie N. Fitting variogram models by weighted least squares. Journal of the international association for mathematical geology. 1985;17(5):563–586.
  39. Cressie N, Hawkins DM. Robust estimation of the variogram: I. Journal of the international association for mathematical geology. 1980;12(2):115–125.
  40. Genton MG. Highly robust variogram estimation. Mathematical geology. 1998;30(2):213–221.
  41. Dowd P. The variogram and kriging: robust and resistant estimators. In: Geostatistics for natural resources characterization. Springer; 1984. p. 91–106. https://doi.org/10.1007/978-94-009-3699-7_6
  42. Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM journal on scientific computing. 1995;16(5):1190–1208.
  43. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological). 1977;39(1):1–22.
  44. Council of State and Territorial Epidemiologists. Standardized surveillance case definition and national notification for 2019 novel coronavirus disease (COVID-19); 2021. https://cdn.ymaws.com/www.cste.org/resource/resmgr/2020ps/interim-20-id-01_covid-19.pdf.
  45. Centers for Disease Control and Prevention. Geographic Differences in COVID-19 Cases, Deaths, and Incidence-United States, February 12-April 7, 2020.; 2020. https://www.cdc.gov/mmwr/volumes/69/wr/pdfs/mm6915e4-H.pdf.
  46. Hallman J. tis: Time indexes and time indexed series. R package version. 2010;1.
  47. Mecenas P, Bastos RTdRM, Vallinoto ACR, Normando D. Effects of temperature and humidity on the spread of COVID-19: A systematic review. PLoS one. 2020;15(9):e0238339. pmid:32946453
  48. Menne MJ, Durre I, Vose RS, Gleason BE, Houston TG. An overview of the global historical climatology network-daily database. Journal of atmospheric and oceanic technology. 2012;29(7):897–910.
  49. Templ M, Alfons A, Kowarik A, Prantner B. VIM: visualization and imputation of missing values. R package version. 2011;2(3).
  50. Kontis V, Bennett JE, Rashid T, Parks RM, Pearson-Stuttard J, Guillot M, et al. Magnitude, demographics and dynamics of the effect of the first wave of the COVID-19 pandemic on all-cause mortality in 21 industrialized countries. Nature medicine. 2020; p. 1–10. pmid:33057181
  51. Wu X, Nethery R, Sabath M, Braun D, Dominici F. Air pollution and COVID-19 mortality in the United States: Strengths and limitations of an ecological regression analysis. Science advances. 2020;6(45):eabd4049. pmid:33148655
  52. Jackman S, Tahk A, Zeileis A, Maimone C, Fearon J, Meers Z, et al. Package ‘pscl’. Political Science Computational Laboratory. 2015;18(04.2017).
  53. Jackman S. pscl: Classes and methods for R. Developed in the Political Science Computational Laboratory, Stanford University. Department of Political Science, Stanford University, Stanford, CA. R package version 1.03. 5. http://www.psclstanford.edu/. 2010;.
  54. Zeileis A, Lumley T, Berger S, Graham N, Zeileis MA. Package ‘sandwich’. 3-0.03; 2021.
  55. Zeileis A. Object-oriented computation of sandwich estimators. Journal of statistical software. 2006;16(1):1–16.
  56. Mahajan UV, Larkins-Pettigrew M. Racial demographics and COVID-19 confirmed cases and deaths: a correlational analysis of 2886 US counties. Journal of public health. 2020;42(3):445–447. pmid:32435809
  57. Karmakar M, Lantz PM, Tipirneni R. Association of social and demographic factors with COVID-19 incidence and death rates in the US. Jama network open. 2021;4(1):e2036462–e2036462. pmid:33512520