Modeling post-holiday surge in COVID-19 cases in Pennsylvania counties

Benny Ren; Wei-Ting Hwang

doi:10.1371/journal.pone.0279371

Abstract

COVID-19 arrived in the United States in early 2020, with cases quickly being reported in many states including Pennsylvania. Many statistical models have been proposed to understand the trends of the COVID-19 pandemic and factors associated with increasing cases. While Poisson regression is a natural choice to model case counts, this approach fails to account for correlation due to spatial locations. Because COVID-19 is a contagious disease that often spreads through community infections, the number of COVID-19 cases is inevitably spatially correlated: locations neighboring counties with a high COVID-19 case count are more likely to have a high case count. In this analysis, we combine generalized estimating equations (GEEs) for Poisson regression, a popular method for analyzing correlated data, with a semivariogram to model daily COVID-19 case counts in 67 Pennsylvania counties from March 20, 2020 to January 23, 2021 in order to study infection dynamics during the beginning of the pandemic. We use a semivariogram that describes the spatial correlation as a function of the distance between two counties as the working correlation. We further incorporate a zero-inflated model in our spatial GEE to accommodate excess zeros in reported cases due to logistical challenges associated with disease monitoring. By modeling time-varying holiday covariates, we estimated the effect of holiday timing on case count. Our analysis showed that the incidence rate ratio was significantly greater than one 6-8 days after a holiday, suggesting a surge in COVID-19 cases approximately one week after a holiday.

Citation: Ren B, Hwang W-T (2022) Modeling post-holiday surge in COVID-19 cases in Pennsylvania counties. PLoS ONE 17(12): e0279371. https://doi.org/10.1371/journal.pone.0279371

Editor: Chong Wang, Iowa State University, UNITED STATES

Received: April 25, 2022; Accepted: December 6, 2022; Published: December 19, 2022

Copyright: © 2022 Ren, Hwang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data can be freely accessed from the New York Times GitHub repository (https://github.com/nytimes/covid-19-data).

Funding: WH is supported by National Institute of Environmental Health Sciences grant: P30-ES013508. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional external funding received for this study.

Competing interests: The authors have declared that no competing interests exist.

Introduction

COVID-19, a highly contagious respiratory disease, first appeared in China at the end of 2019 and quickly spread across the world [1]. Evidence suggests mask-wearing and social distancing are effective strategies in containing COVID-19 [2, 3]. During the beginning of the pandemic, local and state governments quickly moved to implement mask mandates, travel restrictions and community containment measures (e.g., shelter in place) to mitigate the spread of the disease [4–6]. However, many Americans still chose to travel and congregate during the pandemic, a tendency that is heightened during federal holidays. Due to lack of adherence to public health guidance during the holidays, one should expect to see a surge in COVID-19 cases after a holiday. While many reports reaffirm this hypothesis, they are based on anecdotal evidence such as summary statistics of case counts from moving time windows. There are only a handful of epidemiological studies that estimate the association between holiday timing and the number of reported COVID-19 cases [7, 8]. For the first year of the pandemic, we hypothesize that we should see a surge in COVID-19 cases within two weeks after a holiday, given that the incubation period for COVID-19 extends up to 14 days, with a median time of 4-5 days from exposure to symptom onset, and that it can take up to an additional 3 days, either by the PCR test or the instant rapid antigen test, for a positive test to be reported [1, 9–11]. We consider daily case counts from March 20, 2020 to January 23, 2021 to study early pandemic dynamics prior to widespread vaccine distribution, COVID-19 variants and at-home testing. In addition, rigorous disease surveillance and reporting procedures were in place during the beginning of the pandemic, resulting in comprehensive infection data.

Poisson count regression with the population size as an offset is a popular approach to model count data and incidence rate. Based on the reporting guidelines of the COVID-19 datasets, there are certain dates such as holidays and weekends that could impact whether cases are being reported [12]. These reporting practices have resulted in excess or structural zeros in the case count, also known as zero-inflation [13]. We also need to consider spatial correlation among county-level COVID-19 case counts because vector-borne and transmissible diseases such as COVID-19, exhibit non-negligible spatial correlation as the movement of people can spread the virus to nearby counties [14]. Furthermore, processes that are confounded by spatially correlated variables are also suited for spatial models; a well-studied example is disease and pollution [15–17]. Thus, we expect to see similar case numbers or trends among neighboring counties [18].

Mixed models are powerful tools for spatial modeling due to their ability to handle a complex spatial correlation structure, usually represented as a semivariogram or kriging process [19, 20]. Mixed modeling problems have been addressed from a convex optimization and Bayesian computation perspective [14, 21]. Correlation in spatial epidemiology can also be captured using conditional autoregressive models [22–24]. Inferential summaries from these models assume a correctly specified spatial correlation; otherwise, a post-estimation robust covariance can be derived to address misspecified correlation. One such class of robust estimators is the heteroskedasticity-consistent or sandwich estimators, which can easily be derived from marginal models and generalized estimating equations (GEEs). Marginal models are flexible alternatives to mixed models when population-level effects are of interest [25, 26]. In addition, formulation of GEEs through the quasi-likelihood provides a convenient set of score equations with well-established optimization procedures. As a result, a GEE formulation can be derived to incorporate spatial-temporal relationships, as well as zero-inflation. Because of their generalizability and robust inference under misspecified spatial correlation, GEEs are an appealing but underutilized tool in spatial epidemiology.

We propose the use of a mixture of marginal models for zero-inflated over-dispersed Poisson regression to model the daily number of new cases in 67 Pennsylvania counties [27]. We combine the zero-inflated Poisson regression with the framework of a transition model to estimate present case counts as a function of previously reported cases [24, 25]. We account for spatial correlation by treating semivariograms as the working correlation under a generalized estimating equation framework and designate each date as a ‘cluster’ under the traditional longitudinal nomenclature [26, 28–30]. We propose an Expectation-Solution (ES) algorithm to fit the mixture of marginal models, which conveniently reduces to simpler problems of weighted GEEs and weighted semivariograms [31–35]. We include a combination of past case counts and time-lagged covariates, as well as county-specific and date-specific covariates, to model daily rates of new COVID-19 cases following the parameterization of semivariogram working correlations discussed in [30]. While there are other methods to model spatial-temporal data, we elect to use a GEE due to consistent inference under misspecified working correlation, simplicity of formulation, and estimation under an imbalanced-cluster study design [14, 36, 37]. Imbalanced clusters occur in our data because counties report their first case at different dates (staggered entry into the study), with populous counties reporting cases before rural counties.

The rest of this manuscript is organized as follows. In the Methods section, we outline the zero-inflated Poisson model and the proposed semivariogram model; we detail the zero-inflated GEE for the Poisson counts with excess zeros and describe the estimating procedures for the semivariogram through the ES algorithm and outline the robust sandwich standard error. In the Results section, we analyze the daily COVID-19 cases from 67 Pennsylvania counties.

Methods

Zero-inflated Poisson model

Let Y_i,t denote the count of COVID-19 cases at county i and date t. Oftentimes, a case count of zero is due to logistical issues in reporting, which is known as an excess zero and is unrelated to the Poisson model. A true zero is unrealistic in many situations. For excess zeros, we define the zero-inflation process with Poisson count data as Y_i,t = 0 with probability p_i,t and Y_i,t ∼ Poisson(μ_i,t) with probability 1−p_i,t. That is, if Y_i,t is a zero-inflated Poisson random variable, it belongs to a Poisson distribution with probability 1−p_i,t; otherwise it is an excess zero with probability p_i,t [13]. We denote the latent membership, the excess zero indicator, for county i at date t as Z_i,t∣Z_i,t−1 ∼ Bernoulli(p_i,t∣ Z_i,t−1), following a transition model as a function of previous outcomes; Z_i,t is unobservable and treated as missing data.

To model case counts as a rate, we define the Poisson regression model as (1) and (1) can be rewritten as where m_i is the county population offset term, representing the population at county i but fixed for all dates t and we denote μ_i,t∣Y_i,t−1 as μ_i,t and p_i,t∣Z_i,t−1 as p_i,t for brevity. We include a transition term of a lagged case incidence Y_i,t−1 to account for the temporal correlation of counts in the same county. Note that for y_i,t−1 = 0 values must be corrected such that log(y_i,t−1) are defined [24]. Our transition term addresses temporal trends and correlation allowing us to treat the residuals from different dates as independent in the working correlation of the marginal model. Here, x_i,t are county and date-specific covariates detailed in our real data analysis.
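The role of the population offset and the lagged transition term can be illustrated with a minimal sketch. It assumes a log link, a single coefficient `rho` on the lagged log count, and a carry-forward of the last non-zero count when the previous count is zero; the function name, argument names and `rho` are hypothetical and are not taken from the paper's Eq (1).

```python
import math

def expected_count(population, covariates, beta, prev_count, last_nonzero_count, rho):
    """Expected daily count under a log-link rate model with a population offset.

    `rho` (coefficient on the lagged log count) is a hypothetical name used
    only for illustration.
    """
    # Carry the last non-zero count forward so that the logarithm is defined.
    lagged = prev_count if prev_count > 0 else last_nonzero_count
    linear_predictor = sum(x * b for x, b in zip(covariates, beta))
    linear_predictor += rho * math.log(lagged)
    # log(mu) = log(m) + linear predictor  <=>  mu = m * exp(linear predictor)
    return population * math.exp(linear_predictor)
```

Because the offset enters with a fixed coefficient of one, exponentiated regression coefficients are interpreted as incidence rate ratios rather than ratios of raw counts.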

We can also construct a logistic regression to model the latent membership Z_i,t as (2) We also incorporate a transition term, Z_i,t−1, to account for temporal trends and u_i,t are county and date-specific covariates specifically related to the zero-inflation process.

Theoretical semivariogram model

Standard zero-inflated Poisson models do not account for the clustered nature of the data, which is expressed here as spatial correlation among counties. Though mixed GLMs can be used to account for spatial correlation as Gaussian random effects, these frameworks involve difficult computation, such as the Laplace approximation involving the covariance matrix. Spatial random effects are often modeled using semivariograms, which can be viewed as a kriging model in the residual space [14]. Alternatively, we propose a GEE procedure which iteratively updates a working correlation using residuals, allowing for a more straightforward estimation procedure.

Semivariograms model the correlation between two locations i and j as a decreasing function of distance. This phenomenon is commonly referred to as Tobler’s first law of geography, which states: “everything is related to everything else, but near things are more related than distant things.” In our study, distance is measured in meters and locations are latitude and longitude coordinates of county centroids. In our GEE procedure, residuals are expected to be correlated based on distance and are used to calculate the semivariogram working correlation. At a given date, county residuals r_i,t and r_j,t, separated by distance d_i,j, are assumed to follow a theoretical semivariogram γ(d_i,j, τ), where ϕ is an over-dispersion parameter, τ is a scale parameter, and correlation does not depend on time. Thus, the working correlation between counties i and j at date t is given as exp(−d_i,j/τ) and is solely a function of distance d_i,j, with τ determining the rate at which correlation decays as distance increases. We make an isotropic covariance assumption, meaning the semivariogram is a function of the locations only through the Euclidean distance between them.
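As an illustration of a distance-based working correlation, the sketch below builds a correlation matrix from pairwise distances. It assumes an exponential decay exp(−d/τ), which is consistent with the exp(−3) ≈ 0.05 bound quoted later in the text; the function name is hypothetical.

```python
import math

def exponential_correlation(distances, tau):
    """Working correlation matrix from a symmetric matrix of pairwise distances.

    Assumes correlation decays as exp(-d / tau); tau is in the same units
    as the distances (meters in the study).
    """
    n = len(distances)
    return [[math.exp(-distances[i][j] / tau) for j in range(n)] for i in range(n)]
```

With τ set to one third of the maximum distance, the correlation between the two most distant counties is exp(−3), roughly 0.05.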

While standard GEEs treat longitudinal data from individuals as independent clusters, our approach differs by treating dates as independent clusters, counties as our repeated measurements, and spatial correlation as the working correlation, as summarized in Table 1. After incorporating date-specific covariates and case counts from the previous day, we assume residuals from different dates to be uncorrelated in the working correlation. Fig 1 shows low autocorrelation of residuals, Corr(r_i,t, r_i,t−k), within each county from a naive standard Poisson generalized linear model described by Eq (1) using a lagged autoregressive predictor (y_i,t−1) and study covariates while assuming no zero-inflation. Our implementation of zero-inflated modeling assumes some observations belong to a structural zero model, which further decreases autocorrelation in residuals, suggesting that a single-lag autoregressive predictor adequately addresses autocorrelation in the residuals. Furthermore, we expect the structural zeros to be related to holiday and weekend timings based on dataset guidelines, and case counts to be cumulatively reported after a holiday. As a result, we anticipate holiday and weekend timings to be the major driver of temporal case trajectories in our study. Next, we detail the estimation approach using GEEs and the estimation of the scale parameter τ based on procedures from the spatial statistics literature. In many practical settings there is an abundance of time-series data but a limited number of locations; our approach allows the asymptotic results to be driven by the number of dates in the analysis.

Fig 1. Box plot of the residual autocorrelations, Corr(r_i,t, r_i,t−k), for lag k (k = 1, …, 7) from 67 Pennsylvania counties.

The residuals r_i,t were obtained from a Poisson generalized linear model with study covariates and a lagged autoregressive predictor (y_i,t−1) as outlined in the COVID-19 data analysis section. Light blue lines connect the autocorrelation values of each county. Solid black line connects the autocorrelation values Corr(r_t, r_t−k) calculated with residuals from all counties combined.

https://doi.org/10.1371/journal.pone.0279371.g001

Table 1. Differences between standard GEE found in longitudinal studies and the proposed GEE.

https://doi.org/10.1371/journal.pone.0279371.t001

Estimation

GEE for over-dispersed Poisson counts.

The GEE for the Poisson model is given as (3), where X_t = [x_1,t, x_2,t, x_3,t, …]^⊤ and β are the design matrix of covariates and vector of regression coefficients related to case counts. The lagged count used here is the last count greater than zero for county i, and y_t = [y_1,t, y_2,t, y_3,t, …]^⊤ is a vector representing new COVID-19 cases with corresponding expected values μ_t = [μ_1,t, μ_2,t, μ_3,t, …]^⊤. Scale parameter τ_P is specific to the Poisson model. We incorporated Poisson model membership probabilities, w_i,t = 1−p_i,t, through matrix W_t = diag(w_1,t, w_2,t, w_3,t, …), into the weighted GEE. When w_i,t ≈ 0, the data point is effectively removed from the estimation of Poisson regression parameters. The estimation of these weights is outlined in the Expectation-Solution algorithm in the next section. We modify (1) to carry forward the last non-zero count, which is highly correlated with the current count, if Y_i,t−1 = 0. Typically, when Y_i,t−1 = 0 an additional parameter may be introduced in order to ensure log(y_i,t−1) are defined [24].

In accordance with GEEs, we use residuals to estimate the working correlation but our procedure involves using a parameterization described in the theoretical semivariogram. Over-dispersion is common in COVID-19 case counts due to high variability in reporting practices across counties but is conveniently accounted for with an additional parameter in the GEE. We calculate the dispersion parameter ϕ and standard residuals r_i,t as where rank(X) is the number of linearly independent covariates and S(t) is the set of observed counties at time t. Through the quasi-likelihood, GEEs incorporate a dispersion parameter, resulting in an over-dispersed Poisson regression.
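A minimal sketch of the standard quasi-Poisson quantities follows. It uses the textbook Pearson residual (y − μ)/√μ and a dispersion estimate equal to the weighted sum of squared residuals divided by the weight total minus the number of regression parameters. The optional `weights` argument and the exact denominator are assumptions for illustration; the paper's own expressions may differ in how the membership weights and the sets S(t) enter.

```python
import math

def pearson_residuals(y, mu):
    """Pearson residuals for a Poisson mean model: (y - mu) / sqrt(mu)."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]

def dispersion(y, mu, n_params, weights=None):
    """Moment estimate of the over-dispersion parameter.

    `weights` (hypothetical) lets observations that likely belong to the
    excess-zero component contribute less; defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(y)
    residuals = pearson_residuals(y, mu)
    numerator = sum(w * r * r for w, r in zip(weights, residuals))
    return numerator / (sum(weights) - n_params)
```

A value well above one indicates more variability than a Poisson model allows, which inflates standard errors accordingly.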

GEE for excess zeros.

To accommodate latent membership Z_i,t, we construct another GEE for estimating p_i,t, the probability of an excess zero, using the logit link function. This GEE formulation is given as (4), where τ_Z is the scale parameter for the excess zero model. Here U_t and λ are the covariates and regression coefficients for modeling the excess zero. Because we do not have the complete data, in order to account for serial correlation among excess zeros, we replace the transition term Z_i,t−1 in (2) with an indicator of zero cases on the previous day, in order to capture temporal trends. In practice, Z is often an imbalanced binary outcome, where over-parameterization of covariates can cause numerical instability when fitting the model. Therefore, it has been suggested that it be fitted with a parsimonious set of covariates [27].

We calculate standard Pearson residuals as which are used for estimating the working correlation using a semivariogram as described in the next section.

Estimation of τ through the empirical semivariogram.

After each Newton-Raphson update of the regression coefficients β and λ, we also update the working correlation by estimating the scale parameter τ using the empirical semivariogram. In this section we outline procedures for modeling semivariograms and their computational implementation. First, we calculate the empirical semivariogram by grouping pairs of residuals into U bins, based on equally spaced intervals of distance. Each bin of residuals is used to calculate an empirical estimate for the midpoint of the bin, denoted as d₁, d₂, d₃, …, d_U, with the corresponding intervals denoted as d_u±δ. After we obtain the empirical semivariogram, we estimate τ from the theoretical semivariogram model through a weighted least squares approach that minimizes the squared difference between the theoretical and empirical semivariograms. Using the procedure detailed in [38], the weighted least squares solution is given as (5), where weights w_i,t and w_j,t replace the sample size in our case. The weighted least squares approach is computationally fast and yields robust estimates. Placing the theoretical semivariogram in the denominator down-weights the influence of observed correlations separated by large distances.

We denote the solution of (5), as a function of the residuals r_i,t, e_i,t and weights, by τ_P = G(r, W) and τ_Z = G(e, I) for the Poisson and excess zero models, respectively. Recall that we use separate GEEs for the Poisson counts and excess zeros to account for the spatial correlation through a mixture of marginal (zero-inflated) models. We follow some common practices regarding semivariogram models: we calculate the empirical semivariogram using pairwise distances that are less than half the maximum distance, and we bound τ ∈ (0, max_i,j(d_i,j)/3] so that the correlation associated with the maximum distance is upper bounded at exp(−3)≈0.05. In our case, max_i,j(d_i,j) = 592318.3 meters. Next, we outline options for computing the empirical semivariogram.
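The bounded weighted least squares fit can be sketched as follows. The sketch assumes an exponential theoretical semivariogram 1 − exp(−d/τ) for standardized residuals and a criterion that places the theoretical semivariogram in the denominator, as the text describes; the per-bin weights, the grid search and all function names are illustrative assumptions rather than the paper's Eq (5).

```python
import math

def theoretical_semivariogram(d, tau):
    """Assumed exponential semivariogram for unit-variance residuals."""
    return 1.0 - math.exp(-d / tau)

def wls_objective(tau, midpoints, empirical, bin_weights):
    """Weighted squared relative error between empirical and theoretical values."""
    total = 0.0
    for d, g_hat, w in zip(midpoints, empirical, bin_weights):
        g = theoretical_semivariogram(d, tau)
        total += w * (g_hat / g - 1.0) ** 2
    return total

def fit_tau(midpoints, empirical, bin_weights, max_distance, grid_size=2000):
    """Grid search for tau over the bounded interval (0, max_distance / 3]."""
    upper = max_distance / 3.0
    candidates = [upper * (k + 1) / grid_size for k in range(grid_size)]
    return min(candidates,
               key=lambda t: wls_objective(t, midpoints, empirical, bin_weights))
```

A grid search is used here only to keep the example dependency-free; a one-dimensional numerical optimizer over the same bounded interval would serve the same purpose.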

Empirical semivariograms.

The empirical semivariogram is commonly computed with the standard moment estimate. An alternative estimate that is robust was presented in [39], and median-based approaches with a 50% breakdown point were presented in [40, 41]. Because estimating τ is a sub-routine in the ES algorithm, we elect to use the approach from [39] as our empirical semivariogram for robustness considerations. We denote the corresponding τ estimates as G_CH(r, W) and G_CH(e, I) in our analysis. The empirical semivariogram is calculated in parallel and τ is estimated numerically using weighted least squares and the optim function in R [42].
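For concreteness, the sketch below bins pairwise residual differences by distance and computes two per-bin estimates: the classical method-of-moments (Matheron) estimate and the Cressie-Hawkins robust estimate. That [39] corresponds to the Cressie-Hawkins form is an assumption suggested by the G_CH subscript; the unweighted versions are shown, whereas the paper replaces pair counts with membership weights. Function names are hypothetical.

```python
def bin_differences(residuals, distances, n_bins, max_lag):
    """Group pairwise residual differences into equally spaced distance bins."""
    width = max_lag / n_bins
    bins = [[] for _ in range(n_bins)]
    n = len(residuals)
    for i in range(n):
        for j in range(i + 1, n):
            d = distances[i][j]
            if d < max_lag:  # pairs beyond the maximum lag are discarded
                index = min(int(d // width), n_bins - 1)
                bins[index].append(residuals[i] - residuals[j])
    return bins

def matheron(diffs):
    """Classical moment estimate: average squared difference divided by two."""
    return sum(d * d for d in diffs) / (2.0 * len(diffs))

def cressie_hawkins(diffs):
    """Robust estimate based on the fourth power of the mean root absolute difference."""
    n = len(diffs)
    mean_root = sum(abs(d) ** 0.5 for d in diffs) / n
    return mean_root ** 4 / (2.0 * (0.457 + 0.494 / n))
```

The square-root transform makes the robust estimate far less sensitive to a few large residual differences than the squared differences used by the moment estimate.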

Expectation-Solution algorithm

Following [27] we construct an Expectation-Solution algorithm for our zero-inflated regression models as a mixture of marginal models [27, 31]. The ES algorithm is a modification to the well-known Expectation-Maximization (EM) algorithm, where the M-step is replaced by the solution from the GEE [43]. We further modify the working correlation to follow a spatial correlation structure which is estimated using a semivariogram [30].

E-step.

In the E-step, we update the unobserved Z. The conditional expectation is given as: (6) (7) where are updated Poisson model membership weights. We denote the iteration number of the ES algorithm as s.
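The E-step can be illustrated with the standard zero-inflated Poisson posterior: a positive count cannot be an excess zero, while a zero count is an excess zero with probability p / (p + (1 − p)·exp(−μ)). This is a sketch of the textbook form under a plain Poisson likelihood; the paper's Eqs (6) and (7) may differ, for example through the over-dispersion parameter. The function name is hypothetical.

```python
import math

def e_step(y, mu, p):
    """Posterior excess-zero probabilities z and Poisson membership weights 1 - z."""
    z = []
    for yi, mi, pi in zip(y, mu, p):
        if yi > 0:
            # A positive count must come from the Poisson component.
            z.append(0.0)
        else:
            poisson_zero = (1.0 - pi) * math.exp(-mi)
            z.append(pi / (pi + poisson_zero))
    weights = [1.0 - zi for zi in z]
    return z, weights
```

The returned weights play the role of w_i,t in the weighted Poisson GEE, so zeros that are very likely structural contribute little to the estimation of β.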

S-step.

The S-step replaces the maximization step in the EM algorithm. In the S-step, we iteratively estimate the Poisson model parameters β, τ_P and the excess zero model parameters λ, τ_Z. The iterative updates for β using Newton-Raphson are given as (8), and τ_P is updated with (5) using the current residuals and weights. We denote the update as (9). The iteration number of the S-step is denoted by k. We repeat (8) and (9) until convergence to obtain β^{(s+ 1)}, a new iteration in the greater ES algorithm.

We apply the same procedure in the S-step for the excess zero model parameters λ and τ_Z, given as (10) and (11), where the weight matrices are replaced by the identity matrix. Eqs (10) and (11) are repeated until convergence to obtain λ^{(s + 1)}. For the ES algorithm, we iteratively update the E-step: z^(s), W^(s), and the S-step: β^(s), λ^(s) until convergence.

Inference on coefficients

Using the heteroskedasticity-consistent or robust sandwich estimator for standard errors, we derive confidence intervals for β; we have consistent estimates even under misspecified working correlations which are spatial correlations in our case [31]. We proposed a reasonable spatial correlation model but retain reliable inference even in situations when the correlation model is incorrect. Analogously, we can also calculate the covariance for λ with the equivalent formulation without the weight matrices, W_t, using the score Eq (4). We use the asymptotic distribution, for inference, substituting the final estimates from the ES algorithm.

Results

Data

Poisson model covariates.

Daily reported cases for 67 counties in Pennsylvania from March 20, 2020 to January 23, 2021 were obtained from the New York Times GitHub repository, https://github.com/nytimes/covid-19-data [12]. We restrict our study dates to be prior to widespread vaccine distribution, COVID-19 variants and at-home testing. The New York Times dataset includes confirmed cases from PCR tests as well as probable cases. Probable cases are derived from a set of testing, symptoms and exposure criteria recommended by the Council of State and Territorial Epidemiologists, which were also adopted by the Centers for Disease Control and Prevention on April 14, 2020 [44]. Data collection for a county starts after the first recorded case, and we exclude the first 14 days of documented cases per county, as reporting is often inconsistent during the start of data collection [45]. This also ensures that the carried-forward imputation values are all defined.

Our primary goal is to understand the relationship between the timing of a holiday and daily new case counts. We consider federal holidays: New Year's Day, Martin Luther King Jr. Day, Washington's Birthday, Good Friday, Memorial Day, Independence Day, Labor Day, Columbus Day, Veterans Day, Thanksgiving, Christmas and Election Day (November 3rd, 2020) [46]. Federal holidays are dates on which a large proportion of the population is not working, enabling congregation and subsequent transmission of COVID-19. For the Poisson regression model, we create a vector of binary {1, 0} holiday indicators, one for each day from today back to 14 days prior, marking whether that day was a holiday. For example, for the count recorded on January 1st, the associated 15 holiday lag indicators are (1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0) for dates: New Year's Day, Dec 31, Dec 30, Dec 29, Dec 28, Dec 27, Dec 26, Christmas, …, Dec 18. We abbreviate the indicator for a holiday being k days in the past as Holiday Lag k: HLk. As another example, for the count recorded on January 2nd, the indicators are (0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0). Therefore, the regression coefficients of our holiday indicators can inform whether there are increased cases (i.e., a surge), with incidence rate ratios greater than 1, in relation to the presence and timing of a holiday within the prior 2 weeks. If the hypothesis of a post-holiday surge in COVID-19 cases is true, we expect to see several incidence rate ratios (IRRs) associated with HLk to be significantly greater than 1 after a holiday.
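The construction of the 15 holiday lag indicators (HL0 through HL14) can be sketched in a few lines. The function name and the representation of holidays as a set of dates are illustrative choices, not part of the original analysis code.

```python
from datetime import date, timedelta

def holiday_lags(day, holidays, max_lag=14):
    """Return [HL0, HL1, ..., HL14]: 1 if the date k days before `day` is a holiday."""
    return [1 if day - timedelta(days=k) in holidays else 0
            for k in range(max_lag + 1)]

# Example using the two holidays from the text.
holidays = {date(2020, 12, 25), date(2021, 1, 1)}
print(holiday_lags(date(2021, 1, 1), holidays))
# [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

Each position k of the returned list corresponds to the covariate HLk, so the January 1st example has HL0 = 1 (New Year's Day) and HL7 = 1 (Christmas).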

To account for the association with the day of the week, we include an indicator for the day of the week with Wednesday as the baseline. Our regression analysis also controls for other factors or covariates that may be associated with the incidence of COVID-19 cases as well as excess zeros in reporting. For covariates associated with case incidence, there is growing evidence suggesting that warm and wet climate conditions reduce the spread of COVID-19 [47]. Precipitation and maximum temperature data from daily summaries, obtained from the National Oceanic and Atmospheric Administration (NOAA) web service, were included as additional covariates [48]. Maximum daily temperature and daily precipitation from weather stations across Pennsylvania were downloaded from the NOAA web service. Using latitude and longitude for each county centroid and weather station, county-specific weather data were interpolated using weather data from the same date. Interpolation was carried out using the VIM R package, taking the median of the K nearest neighbors (K-nn), with K = 5 [49]. Precipitation was further binarized into a {0, 1} indicator based on whether or not it rained that day. Similar to the holiday covariates, we also construct 15 lag covariate indicators for precipitation in the prior 2 weeks. We also construct 15 continuous covariates of Fahrenheit temperature values for the prior 2 weeks. Following the convention of abbreviating Holiday Lag as HLk, we abbreviate the covariates for temperature as TLk and precipitation as PLk.
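The interpolation idea can be sketched as a median over the K nearest weather stations. This is a simplified stand-in that ranks stations by plain Euclidean distance in latitude and longitude; it does not reproduce the internals of the VIM package, and the function names and the tuple layout of `stations` are hypothetical.

```python
import math
from statistics import median

def knn_median(centroid, stations, k=5):
    """Median value among the k stations closest to a county centroid.

    centroid: (lat, lon); stations: list of (lat, lon, value) for one date.
    """
    ranked = sorted(
        stations,
        key=lambda s: math.hypot(s[0] - centroid[0], s[1] - centroid[1]),
    )
    return median(value for _, _, value in ranked[:k])

def rained(precipitation):
    """Binarize daily precipitation into a {0, 1} indicator."""
    return 1 if precipitation > 0 else 0
```

Using the median rather than the mean limits the influence of a single station with an unusual reading on the county-level value.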

Other covariates included for the Poisson regression consisted of well-known demographic factors that affect health outcomes [50, 51]. Demographic information from the U.S. census and 2016 American Community Survey were also included in the model as county-specific covariates. For each county, quintiles for population density, percentage of the population living in poverty, log median household income, log median house value, percentage of Black residents, percentage of Hispanic residents, percentage of the adult population with less than high school (HS) education, and percentage of owner-occupied housing were included as county-specific covariates. The data was obtained using the same source outlined in [51].

Zero-inflation covariates.

The presence of excess zeros can bias coefficient estimates in the negative direction if ignored. A non-negligible 21% of the reported case counts are zeros. Therefore, we propose a parsimonious model for Z by considering only a subset of covariates from the main Poisson regression model such that zero-inflation is accounted for based on well-known relationships associated with reporting zero cases. Based on the COVID-19 case reporting guidelines, covariates associated with excess zeros include holidays and weekends. For example, there tends to be a gap in reporting during weekends and holidays, suggesting case count reporting is more likely to occur on business days which does not reflect the reality of the pandemic and back logged cases tend to be lumped into the next business day’s case count.

Indicators for tomorrow, today and yesterday being a holiday are included in the model for Z. As an example, for outcome data recorded on December 31st, the covariates will be (1, 0, 0). An indicator for weekend is also included. In addition, two county-specific covariates, percentage of population in poverty and percentage of owner-occupied housing, are included. Owner-occupied housing rate is an indicator of the urban-rural dichotomy, with rural counties having a high rate of owner-occupied housing. Poverty rate and owner-occupied housing rate, as socioeconomic markers, are potential confounders for zero-inflation since the economy of a county may determine the resources available for municipal services such as case reporting. We also include an indicator for whether the previous day was a zero as a covariate to account for temporal trends.
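The date-specific covariates of the excess zero model can be built in the same spirit. The sketch below returns the (tomorrow, today, yesterday) holiday indicators together with a weekend indicator; the function name and return layout are illustrative assumptions.

```python
from datetime import date, timedelta

def reporting_indicators(day, holidays):
    """Holiday indicators for (tomorrow, today, yesterday) and a weekend flag."""
    one_day = timedelta(days=1)
    tomorrow = int(day + one_day in holidays)
    today = int(day in holidays)
    yesterday = int(day - one_day in holidays)
    weekend = int(day.weekday() >= 5)  # Monday is 0, so 5 and 6 are Sat/Sun
    return (tomorrow, today, yesterday), weekend
```

For December 31st, 2020, with New Year's Day in the holiday set, the holiday indicators are (1, 0, 0), matching the example in the text.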

COVID-19 data analysis

In addition to the Poisson model, we fitted a negative binomial hurdle model to capture the severity of a COVID-19 outbreak conditional on case counts being greater than zero [52, 53]. The hurdle model fits a negative binomial regression truncated to the positive case counts, {Y_i,t: Y_i,t > 0}, using the same covariates as (3), and models zero outcomes with a logistic regression using the same covariates as (4). Robust sandwich estimators for the hurdle model were obtained using the sandwich R package [54, 55].

We fitted the following four models: (1) a Poisson generalized linear model without spatial correlation, without over-dispersion and without zero-inflation, denoted as GLM; (2) a negative binomial hurdle model without spatial correlation, denoted as Hurdle-NB; (3) an over-dispersed Poisson marginal model with spatial correlation but without zero-inflation, denoted as GEE-OP; and (4) a zero-inflated over-dispersed Poisson marginal model as described by the Expectation-Solution algorithm, denoted as GEE-ZIOP.

The GEE-ZIOP results in Fig 2 and Table 2 show estimated incidence rate ratios (with 95% confidence intervals) of 1.41 (1.07, 1.85), 1.37 (1.09, 1.72) and 1.30 (1.08, 1.57) at 6, 7 and 8 days after a holiday, respectively. At 6 to 8 days after a holiday, we expect there to be a surge with 1.30 to 1.41 times as many cases as there would be without a holiday. Intervals for IRR estimates were plotted for all models in Fig 2 and showed that the IRR estimates associated with the GEE-ZIOP model at the surge were greater than those of the GLM, Hurdle-NB and GEE-OP models. All four models yield maximum IRR estimates at 6-8 days, suggesting that a surge in case counts occurs about 6-8 days after a holiday.
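An incidence rate ratio and its Wald interval follow from a log-link coefficient and its robust standard error by exponentiation. The sketch below shows the usual computation; the function name is hypothetical and the 1.96 multiplier assumes a normal approximation at the 95% level.

```python
import math

def irr_with_ci(beta, robust_se, z=1.96):
    """Incidence rate ratio exp(beta) with a Wald confidence interval."""
    lower = math.exp(beta - z * robust_se)
    upper = math.exp(beta + z * robust_se)
    return math.exp(beta), lower, upper
```

An interval that lies entirely above 1, as reported for HL6 to HL8, corresponds to a coefficient whose interval on the log scale excludes zero.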

Fig 2. Incidence rate ratio with robust 95% confidence interval for holiday effects.

Estimated IRRs corresponding to the 15 holiday covariates, for each model. 95% confidence intervals were calculated using robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.g002

Table 2. Intercept and holiday coefficients with 95% confidence intervals based on robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.t002

We did not find similar time-varying effects of temperature or precipitation (results can be found in Tables 3 and 4), and most of the IRRs associated with TLk and PLk covariates were not significant. From the results of the GEE models in Table 5, we observed a significant increase in case incidence on Mondays, which aligns with the dataset guidelines. Cases that were not reported during the weekend, possibly due to logistical issues, were reported on Monday instead. Additional county demographic effects can be found in Table 5. Notably, counties with a high proportion of African-American residents were associated with increased COVID-19 incidence in all four models, with significant effects in the Hurdle-NB and GEE-ZIOP models. In addition, counties with high median household income were associated with decreased COVID-19 incidence in three of the four models, with significant effects in the GEE-OP and GEE-ZIOP models. Counties with a low rate of high school education attainment were also associated with increased COVID-19 incidence in all four models, albeit not always at a significant level. Although many county demographic effects varied across the four models, a number of social and demographic effects from our analysis aligned with results from the literature [56, 57].

Table 3. Temperature coefficients and 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0279371.t003

Table 4. Precipitation coefficients and 95% confidence intervals.

https://doi.org/10.1371/journal.pone.0279371.t004

Table 5. Day of week and county demographic coefficients and 95% confidence intervals based on robust standard errors.

https://doi.org/10.1371/journal.pone.0279371.t005

Our results for modeling zero counts can be found in Table 6. The zero-inflated model differs from a hurdle model by treating zero counts probabilistically through a mixture model, while the hurdle model treats every zero count as a success in a Bernoulli process. For our data, we believe zeros may originate from the Poisson model or be artificially generated by logistical issues in reporting, so a mixture model is more appropriate. However, a hurdle model is equivalent to a zero-inflated model when zero counts from the underlying Poisson model are rare, and it serves as a useful sensitivity analysis under different modeling assumptions. Counties with a high rate of owner-occupied housing were significantly associated with an increased probability of reporting zero cases in both models. Owner-occupied housing may reflect the urban-rural dichotomy of counties, with rural counties having a high rate of owner-occupied housing; by extension, we conclude that rural counties have an increased probability of reporting zero cases. This confounding may be due to logistical challenges in reporting cases at a daily frequency and to less community transmission because of sparse population density in rural areas. In addition, a high poverty rate was significantly associated with an increased probability of reporting zero cases in both models. By testing the effects of socioeconomic factors on case reporting, we found that zero reported cases could be confounded by the disease surveillance resources of local municipalities. Rural and poor counties may lack the resources necessary to reliably report cases, leading to excess zeros that can ultimately hinder the modeling of COVID-19 incidence rates if not addressed.
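The distinction between the two zero models can be illustrated with their probability mass functions. The sketch below uses a plain Poisson count component for simplicity (the fitted models are over-dispersed Poisson and negative binomial, so this is a simplification), and all function names and parameter values are hypothetical. It shows that the two formulations nearly coincide when the Poisson mean is large, so that Poisson zeros are rare, and differ substantially when the mean is small.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability of observing count y with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson: an excess zero with probability pi, otherwise Poisson(mu)."""
    base = (1.0 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base


def hurdle_pmf(y, mu, p_zero):
    """Hurdle Poisson: zero with probability p_zero, otherwise zero-truncated Poisson(mu)."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(y, mu) / (1.0 - math.exp(-mu))


def max_pmf_gap(mu, pi, y_max=80):
    """Largest pointwise difference between the two pmfs for a shared zero parameter."""
    return max(abs(zip_pmf(y, mu, pi) - hurdle_pmf(y, mu, pi)) for y in range(y_max + 1))


# Illustrative values only: pi = 0.3 with a large and a small Poisson mean.
gap_large_mean = max_pmf_gap(mu=20.0, pi=0.3)  # Poisson zeros are rare
gap_small_mean = max_pmf_gap(mu=0.5, pi=0.3)   # Poisson zeros are common
print(gap_large_mean, gap_small_mean)
```

In the small-mean case the gap is driven by the zero cell: the zero-inflated model adds Poisson zeros on top of the excess zeros, whereas the hurdle model attributes every zero to the Bernoulli part.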

Table 6. Coefficients and 95% confidence intervals based on robust standard errors for the excess zero model.

https://doi.org/10.1371/journal.pone.0279371.t006

Discussion

One of the goals of this analysis was to understand the relationship between holiday timing and surges in COVID-19 cases while accounting for zero-inflation, temporal effects, and spatial correlations in case counts from neighboring areas. We implemented an ES algorithm for the estimation of our zero-inflated Poisson model as a mixture of marginal models. We used a semivariogram for the spatial working correlation and updated it in the S-step, after each Newton-Raphson iteration. This takes advantage of fast and robust semivariogram estimation as a sub-routine in the larger ES algorithm. Considering the zero-inflated nature of the data, carry-forward imputation and time-lagged covariates were used to account for the temporal pattern of case reporting. After the estimates converged, we used the robust covariance estimator, which is consistent under a misspecified spatial correlation, to compute standard errors.
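As a rough illustration of the membership probabilities that an expectation step produces in a zero-inflated model, the sketch below computes the posterior probability that each observation belongs to the excess-zero state. It uses the textbook zero-inflated Poisson form with a plain Poisson zero probability, which is an assumption of this example and may differ in detail from the over-dispersed formulation used in the analysis; the function name and the input values are hypothetical.

```python
import math


def excess_zero_probability(y, mu, pi):
    """Posterior probability that an observed count came from the excess-zero state.

    y  : observed count
    mu : Poisson mean for the observation
    pi : prior probability of the excess-zero state
    """
    if y > 0:
        return 0.0  # a positive count can only come from the count component
    return pi / (pi + (1.0 - pi) * math.exp(-mu))


# Hypothetical county-day observations: counts, fitted means, zero-inflation probabilities.
counts = [0, 0, 3, 12]
means = [0.2, 8.0, 2.5, 10.0]
pis = [0.3, 0.3, 0.3, 0.3]

u = [excess_zero_probability(y, m, p) for y, m, p in zip(counts, means, pis)]

# Weights given to each observation when updating the count-model coefficients.
poisson_weights = [1.0 - ui for ui in u]
print(u, poisson_weights, sum(poisson_weights))
```

A zero observed where the fitted mean is large is attributed almost entirely to the excess-zero state and so contributes little to the count model, while a zero where the fitted mean is small remains plausible as a genuine Poisson zero. The sum of the count-model weights is smaller than the number of observations, which is one way to see the reduction in effective sample size.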

We analyzed data from March 20, 2020 to January 23, 2021, before widespread vaccine distribution, COVID-19 variants, and at-home testing. Our analysis suggests a statistically significant surge in reported cases, with an IRR of 1.3 to 1.41, 6-8 days after a holiday, and the surge gradually tapers off afterward. This is in line with the anticipated timing of 4–5 days to symptom onset and up to 3 days for PCR testing results. Although the models estimated the surge in reported cases to be 6-8 days after a holiday, transmission may have occurred days prior to the holiday. It is important to note that a calendar holiday date is not the exact date of COVID-19 transmission. For example, travel may commence the Friday before a Monday holiday, and many people extend their holidays, so they travel and congregate in the days after a holiday, e.g., Thanksgiving and Friday holidays. In reality, there is a window of days surrounding a holiday during which transmission likely occurred. In contrast, we observed negative effects 2–3 days after a holiday, which could be explained by the testing lag associated with holidays. If one contracts COVID-19 while traveling near the calendar holiday date, it takes on average 4–5 days for symptoms to appear. We believe that the timing of the negative effect could be explained by the asymptomatic incubation period right after contracting COVID-19, when people have yet to be tested.

For all but the hurdle model, the estimates in Fig 2 confirm a drop in case counts on a holiday, with many of the backlogged cases being reported the next day, as evidenced by a large positive effect one day after a holiday. In our GEE-ZIOP model, the IRR is estimated to be 0.86 when the current day is a holiday and 1.31 one day after a holiday. This trend is aligned with case reporting guidelines. Note that all four models (GLM, Hurdle-NB, GEE-OP and GEE-ZIOP) share similar conclusions regarding the holiday effect, indicating that the surge in reported cases occurs 6-8 days after a holiday and suggesting that these results are robust under different assumptions.

Comparing different models, the GEE-ZIOP model was associated with an increase in the coefficient estimates, but at the expense of efficiency. When using GEE-ZIOP, weighting each data point by the zero-inflation membership probability decreases the effective sample size. However, even when considering this trade-off, there are often still meaningful results; our GEE-ZIOP estimates for the coefficients of HL6, HL7, and HL8 are significantly greater than zero. Furthermore, the GEE-ZIOP model detected an IRR significantly greater than 1 at 6 days after a holiday, while GEE-OP did not. Several holidays (Memorial Day, Labor Day, Columbus Day) fall on a Monday; 6 days after a Monday is a Sunday, and Sundays are likely dates on which zero cases are reported. GEE-ZIOP allows us to down-weight these excess zeros, revealing that surges in reported cases can occur as early as 6 days after a holiday.
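The weekday argument can be checked directly with calendar arithmetic. The short sketch below uses the 2020 dates of the three Monday holidays named above, which fall within the study period; the dates are ordinary calendar facts supplied for illustration, and the variable names are hypothetical.

```python
from datetime import date, timedelta

# 2020 dates of the Monday holidays named in the text (calendar facts, not model output).
monday_holidays_2020 = {
    "Memorial Day": date(2020, 5, 25),
    "Labor Day": date(2020, 9, 7),
    "Columbus Day": date(2020, 10, 12),
}

lag6_dates = {}
for name, day in monday_holidays_2020.items():
    lag6 = day + timedelta(days=6)  # the date on which the 6-day lag covariate is active
    lag6_dates[name] = lag6
    # weekday(): Monday is 0 and Sunday is 6
    print(f"{name}: holiday weekday={day.weekday()}, lag-6 date={lag6.isoformat()}, weekday={lag6.weekday()}")
```

Each 6-day lag lands on a Sunday, the day of the week on which weekend under-reporting is most likely to produce a zero count.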

In summary, we expect a surge in county-level reported cases one week after a holiday in the state of Pennsylvania. While our results reaffirm the common intuition regarding the early COVID-19 pandemic, our regression model elucidates the temporal trends of post-holiday case counts over a two-week period. We present results that illustrate the timing of a post-holiday surge while considering federal holidays and election day. Understanding post-holiday surges can inform public policy and can be used to improve public health programs. We believe that our modeling approach and conclusions, based on the rich data gathered through rigorous surveillance during the beginning of the pandemic, can guide future disease modeling analyses and are applicable to a wide range of epidemiological settings.

References

1. Guan Wj, Ni Zy, Hu Y, Liang Wh, Ou Cq, He Jx, et al. Clinical characteristics of coronavirus disease 2019 in China. New England journal of medicine. 2020;382(18):1708–1720.
2. Lai S, Ruktanonchai NW, Zhou L, Prosper O, Luo W, Floyd JR, et al. Effect of non-pharmaceutical interventions to contain COVID-19 in China. Nature. 2020;585(7825):410–413. pmid:32365354
3. Brauner JM, Mindermann S, Sharma M, Johnston D, Salvatier J, Gavenčiak T, et al. Inferring the effectiveness of government interventions against COVID-19. Science. 2020;. pmid:33323424
4. Wu Z, McGoogan JM. Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention. Jama. 2020;323(13):1239–1242. pmid:32091533
5. Chen S, Yang J, Yang W, Wang C, Bärnighausen T. COVID-19 control in China during mass population movements at New Year. The Lancet. 2020;395(10226):764–766. pmid:32105609
6. Honein MA, Christie A, Rose DA, Brooks JT, Meaney-Delman D, Cohn A, et al. Summary of Guidance for Public Health Strategies to Address High Levels of Community Transmission of SARS-CoV-2 and Related Deaths, December 2020. Morbidity and mortality weekly report. 2020;69(49):1860. pmid:33301434
7. Mehta SH, Clipman SJ, Wesolowski A, Solomon SS. Holiday gatherings, mobility and SARS-CoV-2 transmission: results from 10 US states following Thanksgiving. Scientific reports. 2021;11(1):1–9. pmid:34462499
8. Klausner Z, Fattal E, Hirsch E, Shapira SC. A single holiday was the turning point of the COVID-19 policy of Israel. International journal of infectious diseases. 2020;101:368–373. pmid:33045425
9. Lauer SA, Grantz KH, Bi Q, Jones FK, Zheng Q, Meredith HR, et al. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. Annals of internal medicine. 2020;172(9):577–582. pmid:32150748
10. Wajnberg A, Mansour M, Leven E, Bouvier NM, Patel G, Firpo-Betancourt A, et al. Humoral response and PCR positivity in patients with COVID-19 in the New York City region, USA: an observational study. The Lancet microbe. 2020;1(7):e283–e289. pmid:33015652
11. Albert E, Torres I, Bueno F, Huntley D, Molla E, Fernández-Fuentes MÁ, et al. Field evaluation of a rapid antigen test (Panbio™ COVID-19 Ag Rapid Test Device) for COVID-19 diagnosis in primary healthcare centres. Clinical microbiology and infection. 2020;. pmid:33189872
12. The New York Times. Coronavirus (Covid-19) Data in the United States; 2021. https://github.com/nytimes/covid-19-data.
13. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34(1):1–14.
14. Xie Y, Xu L, Li J, Deng X, Hong Y, Kolivras K, et al. Spatial Variable Selection and An Application to Virginia Lyme Disease Emergence. Journal of the American statistical association. 2019;114(528):1466–1480.
15. Elliott P, Wartenberg D. Spatial epidemiology: current approaches and future challenges. Environmental health perspectives. 2004;112(9):998–1006. pmid:15198920
16. Schnell PM, Papadogeorgou G, et al. Mitigating unobserved spatial confounding when estimating the effect of supermarket access on cardiovascular disease deaths. Annals of applied statistics. 2020;14(4):2069–2095.
17. Hanna-Attisha M, LaChance J, Sadler RC, Champney Schnepp A. Elevated blood lead levels in children associated with the Flint drinking water crisis: a spatial analysis of risk and public health response. American journal of public health. 2016;106(2):283–290. pmid:26691115
18. Kang D, Choi H, Kim JH, Choi J. Spatial epidemic dynamics of the COVID-19 outbreak in China. International journal of infectious diseases. 2020;94:96–102. pmid:32251789
19. Banerjee S., Carlin B. & Gelfand A. Hierarchical modeling and analysis for spatial data. (Chapman,2003)
20. Moraga P. Geospatial health data: modeling and visualization with R-INLA and shiny. (CRC Press,2019)
21. Lawson A. Bayesian disease mapping: hierarchical modeling in spatial epidemiology. (Chapman,2018)
22. Zhang L., Baladandayuthapani V., Zhu H., Baggerly K., Majewski T., Czerniak B. et al. Functional CAR models for large spatially correlated functional datasets. Journal of the American statistical association. 2016;111(514), 772–786 pmid:28018013
23. Blangiardo M. & Cameletti M. Spatial and spatio-temporal Bayesian models with R-INLA. (John Wiley & Sons,2015)
24. Zeger SL, Qaqish B. Markov regression models for time series: a quasi-likelihood approach. Biometrics. 1988; p. 1019–1031. pmid:3148334
25. Diggle P, Diggle PJ, Heagerty P, Liang KY, Heagerty PJ, Zeger S, et al. Analysis of longitudinal data. (Oxford University Press,2002)
26. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
27. Hall DB, Zhang Z. Marginal models for zero inflated clustered data. Statistical modelling. 2004;4(3):161–180.
28. Gelfand AE, Diggle P, Guttorp P, Fuentes M. Handbook of spatial statistics. (CRC press,2010)
29. Lark R. A comparison of some robust estimators of the variogram for use in soil survey. European journal of soil science. 2000;51(1):137–157.
30. Albert PS, McShane LM. A generalized estimating equations approach for spatially correlated binary data: applications to the analysis of neuroimaging data. Biometrics. 1995; p. 627–638. pmid:7662850
31. Rosen O, Jiang W, Tanner MA. Mixtures of marginal models. Biometrika. 2000;87(2):391–404.
32. Preisser JS, Galecki AT, Lohman KK, Wagenknecht LE. Analysis of smoking trends with incomplete longitudinal binary responses. Journal of the American statistical association. 2000;95(452):1021–1031.
33. Preisser JS, Lohman KK, Rathouz PJ. Performance of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Statistics in medicine. 2002;21(20):3035–3054. pmid:12369080
34. Reilly C, Gelman A. Weighted classical variogram estimation for data with clustering. Technometrics. 2007;49(2):184–194.
35. Fitzmaurice GM, Laird NM, Ware JH. Applied longitudinal analysis. vol. 998. (John Wiley & Sons,2012)
36. Hall DB. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics. 2000;56(4):1030–1039. pmid:11129458
37. Torabi M, Rosychuk RJ. Spatio-temporal modelling of disease mapping of rates. Canadian journal of statistics. 2010;38(4):698–715.
38. Cressie N. Fitting variogram models by weighted least squares. Journal of the international association for mathematical geology. 1985;17(5):563–586.
39. Cressie N, Hawkins DM. Robust estimation of the variogram: I. Journal of the international association for mathematical geology. 1980;12(2):115–125.
40. Genton MG. Highly robust variogram estimation. Mathematical geology. 1998;30(2):213–221.
41. Dowd P. The variogram and kriging: robust and resistant estimators. In: Geostatistics for natural resources characterization. Springer; 1984. p. 91–106. https://doi.org/10.1007/978-94-009-3699-7_6
42. Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM journal on scientific computing. 1995;16(5):1190–1208.
43. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological). 1977;39(1):1–22.
44. Council of State and Territorial Epidemiologists. Standardized surveillance case definition and national notification for 2019 novel coronavirus disease (COVID-19); 2021. https://cdn.ymaws.com/www.cste.org/resource/resmgr/2020ps/interim-20-id-01_covid-19.pdf.
45. Centers for Disease Control and Prevention. Geographic Differences in COVID-19 Cases, Deaths, and Incidence-United States, February 12-April 7, 2020.; 2020. https://www.cdc.gov/mmwr/volumes/69/wr/pdfs/mm6915e4-H.pdf.
46. Hallman J. tis: Time indexes and time indexed series. R package version. 2010;1.
47. Mecenas P, Bastos RTdRM, Vallinoto ACR, Normando D. Effects of temperature and humidity on the spread of COVID-19: A systematic review. PLoS one. 2020;15(9):e0238339. pmid:32946453
48. Menne MJ, Durre I, Vose RS, Gleason BE, Houston TG. An overview of the global historical climatology network-daily database. Journal of atmospheric and oceanic technology. 2012;29(7):897–910.
49. Templ M, Alfons A, Kowarik A, Prantner B. VIM: visualization and imputation of missing values. R package version. 2011;2(3).
50. Kontis V, Bennett JE, Rashid T, Parks RM, Pearson-Stuttard J, Guillot M, et al. Magnitude, demographics and dynamics of the effect of the first wave of the COVID-19 pandemic on all-cause mortality in 21 industrialized countries. Nature medicine. 2020; p. 1–10. pmid:33057181
51. Wu X, Nethery R, Sabath M, Braun D, Dominici F. Air pollution and COVID-19 mortality in the United States: Strengths and limitations of an ecological regression analysis. Science advances. 2020;6(45):eabd4049. pmid:33148655
52. Jackman S, Tahk A, Zeileis A, Maimone C, Fearon J, Meers Z, et al. Package ‘pscl’. Political Science Computational Laboratory. 2015;18(04.2017).
53. Jackman S. pscl: Classes and methods for R. Developed in the Political Science Computational Laboratory, Stanford University. Department of Political Science, Stanford University, Stanford, CA. R package version 1.03. 5. http://www.psclstanford.edu/. 2010;.
54. Zeileis A, Lumley T, Berger S, Graham N, Zeileis MA. Package ‘sandwich’. 3-0.03; 2021.
55. Zeileis A. Object-oriented computation of sandwich estimators. Journal of statistical software. 2006;16(1):1–16.
56. Mahajan UV, Larkins-Pettigrew M. Racial demographics and COVID-19 confirmed cases and deaths: a correlational analysis of 2886 US counties. Journal of public health. 2020;42(3):445–447. pmid:32435809
57. Karmakar M, Lantz PM, Tipirneni R. Association of social and demographic factors with COVID-19 incidence and death rates in the US. Jama network open. 2021;4(1):e2036462–e2036462. pmid:33512520

