Improved log-Gaussian approximation for over-dispersed Poisson regression: Application to spatial analysis of COVID-19

Daisuke Murakami; Tomoko Matsui

doi:10.1371/journal.pone.0260836

Abstract

In the era of open data, Poisson and other count regression models are increasingly important. Still, conventional Poisson regression has remaining issues in terms of identifiability and computational efficiency. In particular, because of an identification problem, Poisson regression can be unstable for small samples with many zeros. To address this, we develop a closed-form inference for an over-dispersed Poisson regression, including Poisson additive mixed models. The approach is derived via a mode-based log-Gaussian approximation. The resulting method is fast, practical, and free from the identification problem. Monte Carlo experiments demonstrate that the estimation error of the proposed method is considerably smaller than that of the closed-form alternatives and as small as that of the usual Poisson regressions. For counts with many zeros, our approximation has better estimation accuracy than conventional Poisson regression. We obtained similar results in the case of Poisson additive mixed modeling considering spatial or group effects. The developed method was applied to the analysis of COVID-19 data in Japan. The result suggests that the influences of pedestrian density, age, and other factors on the number of cases change over periods.

Citation: Murakami D, Matsui T (2022) Improved log-Gaussian approximation for over-dispersed Poisson regression: Application to spatial analysis of COVID-19. PLoS ONE 17(1): e0260836. https://doi.org/10.1371/journal.pone.0260836

Editor: Luca Citi, University of Essex, UNITED KINGDOM

Received: April 19, 2021; Accepted: November 17, 2021; Published: January 7, 2022

Copyright: © 2022 Murakami, Matsui. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The R codes used in the “Results: Monte Carlo experiments” and “Results: COVID-19 analysis” sections are available from https://github.com/dmuraka/SimplePoissonApprox_MCsim. The COVID-19 data used in the “Results: COVID-19 analysis” section are owned by JX Press Corporation (https://jxpress.net/; the website is available in Japanese only) and cannot be shared publicly because they are proprietary. Anyone can purchase the data from the company, and the authors of this study had no special access privileges to the data.

Funding: This work was supported by JSPS KAKENHI Grant Numbers 17H02046, JP18H01556, and 18H03628, and JST-Mirai Program Grant Number JP1124793, Japan; all grants were awarded to DM.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Currently, a wide variety of count data are collected through sensors and used for smart urban and regional management (see [1]). For example, in 2020–2021 when the coronavirus disease (COVID-19) spread globally, the daily number of people infected with coronavirus was monitored worldwide, and countermeasures were considered based on the observations [2].

Poisson and other regression models for count data have been used for analyzing the number of COVID-19 cases (e.g., [3, 4]) or other diseases (e.g., [5, 6]). These regression models have also been used in ecology (e.g., [7, 8]), criminology (e.g., [9, 10]), and other fields. Recently, Bayesian Poisson regression, which assumes a Poisson distribution for the count data and Gaussian priors for latent variables describing spatial, group, and other effects, has been widely used in applied studies.

Still, Poisson regression has remaining issues in terms of (a) computational efficiency and (b) identifiability. Regarding (a), owing to the lack of conjugacy between the Poisson and Gaussian distributions, an approximate inference is necessary for the estimation. Unfortunately, the Markov chain Monte Carlo method can be slow for large samples. Faster approximations have been developed for count data regression in the context of additive modeling (e.g., [11, 12]), mixed effects modeling [13], and Gaussian processes (e.g., [14, 15]).

Regarding (b), the maximum likelihood estimates of the conventional Poisson regression are unidentifiable or identifiable only weakly for certain data configurations [16, 17], typically, for small samples with many zeros. As we will illustrate later, this property considerably worsens the accuracy of Poisson regression estimates in some cases.

Gaussian approximation is useful for improving (a) the computational efficiency and avoiding (b) the identification problem, which is attributed to the Poisson likelihood [18]. [19–22] proposed closed-form Gaussian approximations for Poisson regression. These approaches are easy to implement, computationally efficient, and free from the identification problem. Given the current situation wherein a wide range of researchers and practitioners use count data, these practical approaches will become increasingly important. Unfortunately, these approximations have the following disadvantages:

(i) They have poor approximation accuracy for counts with many zeros, as we will demonstrate later. A closed-form approach accurately describing such data is needed.
(ii) An arbitrary parameter, which is used to avoid taking the logarithm of zero, must be determined a priori. The value is known to have a substantial impact on the modeling result [23]. A closed-form approach without such an arbitrary parameter is needed.

Given that, we develop a log-Gaussian approximation for the over-dispersed Poisson regression that is fast, practical, avoids the identification problem, and overcomes (i)–(ii).

Methods

Improved log-Gaussian approximation

Over-dispersed Poisson regression.

This study considers the following over-dispersed Poisson model for count variables Y_i (i∈{1,…,N}): Y_i ~ odPoisson(λ_i, σ²), λ_i = z_i exp(μ_i), (1) where E[Y_i] = λ_i and Var[Y_i] = σ²λ_i. λ_i is a mean parameter, μ_i is the mean function on the log scale, σ² is an over-dispersion parameter, and z_i is a given offset variable.

Suppose that μ_i = x_i′β where x_i is a column vector of K explanatory variables and β is a vector of regression coefficients. The coefficient estimator and the variance-covariance matrix are given as (2) (3) z = [z_1,…,z_N]′ with , and Λ is a diagonal matrix whose i-th element equals λ_i. The coefficients are estimated by an iteratively re-weighted least squares (IRLS) method alternately updating and until convergence. Given the dispersion parameter is estimated as follows: (4)

The resulting mean estimate for the over-dispersed Poisson model is known to be the same as that of the conventional Poisson regression assuming σ² = 1: (5) This estimator is the Poisson maximum likelihood estimator, which suffers from the identification problem as detailed in [17]. Note that the λ_i parameter explains not only the mean but also the mode of Y_i; for integer-valued λ_i, Y_i has two modes {λ_i−1, λ_i}. Later, we will use the center of the two modes, Mode_c[Y_i] = λ_i−0.5.
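The two-mode property stated above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `poisson_pmf` and `mode_center` and the value λ = 4 are illustrative choices, not part of the original study.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(Y = k) for mean lam, evaluated on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mode_center(lam: float) -> float:
    """Center of the two Poisson modes {lam - 1, lam} for integer-valued lam."""
    return lam - 0.5


lam = 4  # integer-valued mean, chosen only for illustration
probs = [poisson_pmf(k, lam) for k in range(15)]
top = max(probs)
# Collect every k whose probability ties with the maximum (up to rounding error).
modes = [k for k, p in enumerate(probs) if math.isclose(p, top, rel_tol=1e-12)]
print(modes, mode_center(lam))
```

For λ = 4 the probabilities at k = 3 and k = 4 coincide, so the mode center is 3.5.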

Log-Gaussian approximation for the Poisson regression.

To overcome the identification problem, we consider approximating the mean estimator by using an estimator obtained from a log-Gaussian model, which is unaffected by the identification problem. The estimated is used to estimate , and .

For the approximation, we need to identify a log-Gaussian model that accurately approximates the Poisson model Eq (5) around λ_i. Although mean-based log-Gaussian approximations for Poisson regression have been developed (e.g., [20]), the mean and mode of the two distributions behave somewhat differently: the mean and mode of a Poisson distribution are linearly proportional and grow in the same order (and, thus, λ_i explains both the mean and the mode), while those of a log-Gaussian distribution are not linearly proportional, and the mean grows faster than the mode. Therefore, a mean-based approximation can have poor accuracy around the mode, which is the distribution center. Considering the success of the Laplace and other mode-based approximations in previous studies, it is reasonable to approximate the Poisson distribution accurately around the mode.

This study first develops a mode-based closed-form approximation. We will use the Poisson mode center Mode_c[Y_i]. Because the mode center is available only when λ_i = E[Y_i]≥0.5 to assure non-negativity, we develop a mode-based approximation for λ_i≥0.5 and another approximation for λ_i<0.5. After that, we combine the two approximations for estimating the parameter.

Approximation for λ_i≥0.5

We approximate Eq (5) by using the log-Gaussian variable y_i defined as: (6) where μ_i(G) represents the mean (logscale), and c is a constant required to avoid taking the logarithm of zero. is an approximate variance for a log-transformed Poisson random deviate.

We perform the approximation so that the mode Mode[y_i] of the log-Gaussian model equals the mode center Mode_c[Y_i] of the Poisson model. The following condition is obtained from the mode matching Mode_c[Y_i] = Mode[y_i]: (7) Eq (7) suggests that μ_i and μ_i(G) do not generally have a linear relationship. As an exception, they have the following linear relationship if c = 0.5: (8) While existing studies have determined c somewhat arbitrarily, c = 0.5 is found to be necessary for applying the linear approximation under our assumption.

Let us substitute c = 0.5 and Eq (8) into Eq (6). Then, we obtain the following log-Gaussian model approximating the Poisson model: (9) By organizing Eq (9), we have the following model: (10) where . The log-Gaussian distribution approximates the Poisson distribution around the mode center.
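The mode-matching idea can be illustrated numerically. The sketch below assumes that the log-Gaussian model has the form log(y + c) ~ N(m, s²), which is how the text describes Eq (6), and uses the standard log-normal facts that the mode of y + c is exp(m − s²) and its mean is exp(m + s²/2). Under that assumed form, matching the mode to λ − 0.5 with c = 0.5 gives m = μ + s², where μ = log λ; this is our own derivation for illustration and is not a reproduction of Eqs (7)–(10). The values λ = 3 and s² = 0.2 are arbitrary.

```python
import math


def lognormal_pdf(w: float, m: float, s2: float) -> float:
    """Density of W at w, where log(W) ~ N(m, s2)."""
    return math.exp(-(math.log(w) - m) ** 2 / (2.0 * s2)) / (
        w * math.sqrt(2.0 * math.pi * s2)
    )


def numeric_mode_of_y(m: float, s2: float, c: float,
                      upper: float, step: float = 1e-3) -> float:
    """Grid-search the mode of y = W - c, where log(W) ~ N(m, s2)."""
    n = int(upper / step)
    best_w = max((step * (j + 1) for j in range(n)),
                 key=lambda w: lognormal_pdf(w, m, s2))
    return best_w - c


lam = 3.0           # illustrative Poisson mean (assumed value)
mu = math.log(lam)  # Poisson mean on the log scale
s2 = 0.2            # illustrative log-scale variance (assumed value)
c = 0.5
m = mu + s2         # log-scale mean implied by mode matching when c = 0.5

mode_y = numeric_mode_of_y(m, s2, c, upper=20.0)
mean_y = math.exp(m + s2 / 2.0) - c
print(round(mode_y, 2), lam - 0.5, round(mean_y, 3))
```

The numerically located mode agrees with the Poisson mode center λ − 0.5, while the log-Gaussian mean is larger than its mode, which is the discrepancy that motivates matching modes rather than means.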

Approximation for λ_i<0.5

If λ_i<0.5, the modes of the Poisson variable Y_i and the log-Gaussian variable behave somewhat differently: the Poisson mode is always zero, while the mode of the log-Gaussian distribution gradually converges to zero as λ_i (or μ_i) declines. A mode-based approximation is not suitable in this case. By contrast, the means of the two distributions both converge to zero as λ_i approaches zero. For λ_i = E[Y_i]<0.5, we therefore rely on a mean-based approximation.

By taking the expectation of using Eq (10), we have the following relationship: (11) Eq (11) implies that, when approximating the Poisson mean function μ_i (logscale) using , it should be rescaled by multiplying . By applying the rescaling to , Eq (10) is modified to approximate the mean of the Poisson distribution as follows: (12) where .

Proposed approximation.

Considering the advantage of the mode-based approximation explained in the “Log-Gaussian approximation for the Poisson regression” section, we use Eq (10) as long as λ_i≥0.5 while Eq (12) is used otherwise. Still, λ_i = E[Y_i] is unknown a priori. Considering the property of count data that P(Y_i<0.5) = P(Y_i = 0), we approximate P(E[Y_i]<0.5) using the ratio r of zero counts in {Y₁,…,Y_N}. Given the approximation, Eqs (10) and (12) are applied with probabilities 1−r and r, respectively. By combining these equations using r, our proposed approximation is formulated as: (13) where , which yields Eq (10) if r = 0 and Eq (12) if r = 1. If all counts are non-zero, the mode-based approximation is applied for all the samples. As the share of zero counts increases, the mean-based approximation is emphasized.
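The zero ratio r that mixes the two approximations is simple to compute. The following sketch shows only that step; the function names are hypothetical, and the combined model of Eq (13) itself is not reproduced here.

```python
from typing import Sequence, Tuple


def zero_ratio(counts: Sequence[int]) -> float:
    """Share r of zero counts, used as a proxy for P(E[Y_i] < 0.5)."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(1 for y in counts if y == 0) / len(counts)


def mixing_weights(counts: Sequence[int]) -> Tuple[float, float]:
    """Return (1 - r, r): the weights on the mode-based and mean-based forms."""
    r = zero_ratio(counts)
    return 1.0 - r, r


example_counts = [0, 0, 3, 1]  # hypothetical data
print(zero_ratio(example_counts), mixing_weights(example_counts))
```

With no zeros the weights are (1, 0), so only the mode-based form is used; as zeros accumulate, the weight shifts toward the mean-based form.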

For the unknown λ_i in the variance term, we rely on a plug-in estimator . The resulting approximation equation yields (14) This plug-in method ignores the uncertainty in the variance term. Consideration of the uncertainty will be an important future task.

Given Eq (14), the estimator for μ_i = x_i′β yields where , and Λ_y is a diagonal matrix whose i-th element equals approximates the μ_i parameter but is free from the identification problem. Thus, we use as an estimate of the Poisson mean λ_i. In other words, the estimated is substituted into Eqs (2)—(4) to estimate , and .

Proposed approximation for Poisson mixed effects model.

Our approximation is readily extended for (over-dispersed) Poisson mixed effects model (MEM) (e.g., [21]) which is formulated as (15) where Σ_β is the variance-covariance matrix for β. We consider the following estimators for Eq (15): (16) (17) (18) where is the effective degrees of freedom, with , and Λ⁺ is a diagonal matrix whose i-th element equals . These estimators are identical to the conventional Poisson MEM if z⁺ is replaced with z.

As before, we approximate the Poisson mean in Eqs (16)–(18) using the following model approximating Eq (15) around λ_i: (19) Once the Gaussian mixed effects model (Eq 19) is estimated, the approximate Poisson mean (logscale) is obtained where .

In short, an over-dispersed Poisson regression with/without random coefficients is approximated by the following steps: (I) Estimate using a log-Gaussian model whose explained variable and sample weight y_i+0.5; (II) Substitute the estimated into Eqs (2)–(4) for models without random coefficients or Eqs (16)—(18) for models with random effects. Later, we examine approximation accuracy of our approach through Monte Carlo experiments.

Property of the proposed approximation

Table 1 summarizes closed-form approximations for the Poisson regression models. These methods perform approximations through the estimation of a log-Gaussian model using the explained variables and the weights shown in the table. These practical methods will help not only researchers but also practitioners avoid the identification problem. However, existing methods are accurate only for a moderate to large μ_i [22] and should not be used for counts with many zeros. Besides, the c parameter, which has a considerable impact on the analysis result, must be determined a priori (see the Introduction section). These drawbacks inhibit the practical use of these approximations.

Table 1. Closed-form approximations for Poisson regression.

c is a tuning parameter that must be determined a priori. z_i = 1 is assumed. All the approximations employ Gaussian linear regression whose explained variables and weights are as shown in the table.

https://doi.org/10.1371/journal.pone.0260836.t001

Major advantages of our approximation relative to these existing methods are as follows: (A) it does not have the tuning parameter (c); (B) because of the mode matching, the proposed method accurately approximates the mode of the Poisson distribution irrespective of μ_i; (C) Gaussian approximation is used only for estimating the Poisson mean while alternative methods use it for estimating both the Poisson mean and the regression coefficients. As we will show later, these advantages considerably improve the approximation accuracy for count data with many zeros.

Our mode-matching method is akin to the Laplace approximation, which is based on matching the mode of a Gaussian distribution to that of the target distribution. Considering studies demonstrating the accuracy of the Laplace approximation, our mode-based approach is expected to be accurate as well. Still, the Laplace approximation can have poor accuracy if the target distribution is far from a Gaussian distribution. Extension based on other approximation methods, such as numerical quadrature, is an important remaining task.

Results: Monte Carlo experiments

Case 1: Basic over-dispersed Poisson regression model

This section compares the estimation accuracy of the proposed approximation (Proposed) with standard Poisson regression (Poisson), over-dispersed Poisson regression (odPoisson), and negative binomial regression alternatives (NB). We also compare ours with the log-linear approximation of [19] (LogLinear) and the Taylor approximation of [20] (Taylor).

The simulated count data y_i are generated from the over-dispersed Poisson regression with mean λ_i and overdispersion parameter σ²: y_i ~ odPoisson(λ_i, σ²), λ_i = exp(β₀+x_i,1β₁+x_i,2β₂), (20) where x_i,1 and x_i,2 are generated from standard normal distributions N(0, 1), and {β₁, β₂} = {2.0, 0.5}. We refer to β₁ as a strong and β₂ as a weak coefficient. σ² = 1 implies the standard Poisson regression without over-dispersion, while σ²>1 means over-dispersion. The β₀ parameter implicitly controls the ratio of zero counts; a smaller β₀ value yields more zero counts.

The over-dispersed Poisson distribution does not have a probability mass function [24]. The simulation data are therefore sampled to satisfy E[y_i] = λ_i and Var[y_i] = σ²λ_i, which y_i~odPoisson(λ_i, σ²) assumes, as follows:

1. Calculate λ_i = exp(β₀+x_i,1β₁+x_i,2β₂)
2. Calculate v_i = (σ²−1)/λ_i
3. Sample y_i~NB(λ_i, v_i) where NB(λ_i, v_i) is a negative binomial distribution with expectation λ_i and variance λ_i+v_iλ_i² = σ²λ_i

The sampled y_i has the expectation E[y_i] = λ_i and the variance Var[y_i] = σ²λ_i that odPoisson(λ_i, σ²) assumes. Thus, the sampled y_i fulfills the assumption of the over-dispersed Poisson distribution.
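The three sampling steps above can be sketched as follows. This is an illustrative Python version using only the standard library, not the authors' R code: the negative binomial draw is generated as a gamma-mixed Poisson (an equivalent construction that we assume here), the Poisson sampler is Knuth's multiplication method, and the function names, seed, and the values λ = 3 and σ² = 5 are hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(rng: random.Random, rate: float) -> int:
    """Knuth's multiplication method; only suitable for modest rates (< ~700)."""
    limit = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_od_count(rng: random.Random, lam: float, sigma2: float) -> int:
    """Draw a count with mean lam and variance sigma2 * lam (sigma2 >= 1)."""
    if sigma2 < 1.0:
        raise ValueError("sigma2 must be at least 1")
    if sigma2 == 1.0:
        return sample_poisson(rng, lam)       # no over-dispersion
    v = (sigma2 - 1.0) / lam                  # step 2: v_i = (sigma2 - 1) / lam_i
    # Gamma rate with mean lam and variance v * lam**2, then a Poisson draw:
    # the mixture is negative binomial with variance lam + v * lam**2.
    rate = rng.gammavariate(1.0 / v, lam * v)
    return sample_poisson(rng, rate)


rng = random.Random(2022)                     # fixed seed for reproducibility
lam, sigma2 = 3.0, 5.0                        # illustrative values
draws = [sample_od_count(rng, lam, sigma2) for _ in range(20000)]
m_hat, v_hat = mean(draws), pvariance(draws)
print(round(m_hat, 2), round(v_hat / m_hat, 2))
```

The sample mean should be close to λ and the variance-to-mean ratio close to σ², which is the property the simulation design relies on.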

The coefficient estimation accuracy is compared across models while varying β₀∈{−2, −1, 0, 1, 2}, σ²∈{1, 5}, and N∈{50, 200}. In each case, the simulations were iterated 1000 times, and the root mean squared error (RMSE) and the mean bias were evaluated: RMSE[β̂_k] = sqrt((1/1000) Σ_iter (β̂_k^(iter) − β_k)²), Bias[β̂_k] = (1/1000) Σ_iter (β̂_k^(iter) − β_k), (21) where β̂_k^(iter) is the estimated β_k in the iter-th iteration.
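The two accuracy measures can be computed as in the following sketch. The function names and the four coefficient estimates are hypothetical and serve only to show the calculation.

```python
import math
from typing import Sequence


def rmse(estimates: Sequence[float], truth: float) -> float:
    """Root mean squared error of repeated estimates around the true value."""
    return math.sqrt(sum((b - truth) ** 2 for b in estimates) / len(estimates))


def mean_bias(estimates: Sequence[float], truth: float) -> float:
    """Average signed deviation of the estimates from the true value."""
    return sum(b - truth for b in estimates) / len(estimates)


beta1_true = 2.0
beta1_hats = [2.1, 1.9, 2.2, 1.8]  # hypothetical estimates from four iterations
rmse_value = rmse(beta1_hats, beta1_true)
bias_value = mean_bias(beta1_hats, beta1_true)
print(round(rmse_value, 4), round(bias_value, 4))
```

An estimator can have a bias near zero and still a large RMSE, which is why both measures are reported.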

The evaluated RMSE and bias values are plotted in Figs 1 and 2 for the case without overdispersion (σ² = 1.0) and in Figs 3 and 4 for the case with overdispersion (σ² = 5.0). LogLinear and Taylor tend to have large RMSEs and biases across cases, and the errors inflate if y_i has many zero values (i.e., small β₀). In contrast, the RMSE values for Proposed are as small as those of the Poisson and odPoisson specifications across cases. Poisson, odPoisson, and NB have large RMSE values for small over-dispersed samples (σ² = 5.0) with many zero values (β₀ = −2); this is attributable to the identification problem explained in the “Introduction” section. Proposed does not suffer from this problem and is therefore advantageous in terms of stability. The bias of the proposed method is also small across cases. These results suggest that the proposed method estimates regression coefficients with reasonable accuracy.

Fig 1. RMSE of the regression coefficients in cases without overdispersion (σ² = 1.0) (x−axis: β₀, y-axis: RMSE).

https://doi.org/10.1371/journal.pone.0260836.g001

Fig 2. Bias of the regression coefficients in cases without overdispersion (σ² = 1.0) (x−axis: β₀, y-axis: Bias).

https://doi.org/10.1371/journal.pone.0260836.g002

Fig 3. RMSE of the regression coefficients in cases with overdispersion (σ² = 5.0) (x−axis: β₀, y-axis: RMSE).

https://doi.org/10.1371/journal.pone.0260836.g003

Fig 4. Bias of the regression coefficients in cases with overdispersion (σ² = 5.0) (x−axis: β₀, y-axis: Bias).

https://doi.org/10.1371/journal.pone.0260836.g004

Fig 5 shows the coefficient standard error (SE) estimates. While the SEs estimated from Proposed are similar to those from odPoisson, the former tend to be smaller than the latter when σ² = 5.0. To examine whether our SE accurately estimates the uncertainty in the coefficient estimates, Fig 6 plots (SE)/(standard deviation of the estimated coefficient values). The value is close to 1.0 if the SEs accurately evaluate the uncertainty. Based on the figure, all the methods tend to underestimate the SE value. Still, the bias of Proposed is smaller than that of Poisson, NB, LogLinear, and Taylor, although larger than that of odPoisson. Reducing the bias in the SE estimates is an important task.
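The SE diagnostic described above can be sketched as follows. We assume that the numerator is the mean of the reported SEs over iterations and that the denominator is the sample standard deviation of the coefficient estimates; the function name and the numbers are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def se_calibration_ratio(reported_ses: Sequence[float],
                         estimates: Sequence[float]) -> float:
    """Mean reported SE divided by the empirical SD of the estimates.

    A value near 1.0 means the SEs reflect the actual sampling variability;
    a value below 1.0 means the SEs are underestimated.
    """
    return mean(reported_ses) / stdev(estimates)


beta_hats = [2.1, 1.9, 2.2, 1.8]      # hypothetical coefficient estimates
ses = [0.15, 0.16, 0.14, 0.15]        # hypothetical SEs reported in each run
ratio = se_calibration_ratio(ses, beta_hats)
print(round(ratio, 3))
```

In this toy example the ratio is below 1.0, which corresponds to the underestimation pattern discussed for Fig 6.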

Fig 5. Means of the coefficient standard errors (N = 200) (x−axis: β₀, y-axis: mean standard error).

https://doi.org/10.1371/journal.pone.0260836.g005

Fig 6. Means of (estimated standard error)/(standard deviation of the estimated coefficient values) (N = 200; x−axis: β₀, y-axis: mean standard error).

https://doi.org/10.1371/journal.pone.0260836.g006

In S1 Appendix in S1 File, we perform another Monte Carlo experiment assuming six explanatory variables. The result is consistent with the results obtained in this section.

Case 2: Model with spatial effects

To verify the expandability of the proposed model, this section applies the proposed method to estimate a spatial regression model, which has been widely used to analyze spatial phenomena in environmental science, economics, and epidemiology. We consider the following model: (22) where {β₁, β₂} = {2, 0.5} and σ² = 5, which means an overdispersion with variance Var[y_i] = 5λ_i. s_i is a process capturing a spatially dependent pattern of the data. It is modeled by a low rank Gaussian process whose spatial dependence decays exponentially with the Euclidean distance between the geometric centers of two zones. Eq (22) is an over-dispersed Poisson mixed-effects model (MEM) that considers spatial dependence. The model is estimated by applying the maximum likelihood (ML) estimation for the Poisson MEM (Poisson), an over-dispersed Poisson MEM (odPoisson), the Taylor approximate Poisson MEM (Taylor), and our specification (Proposed). Taylor and Proposed fitted linear MEMs using the transformed explained variables and weight variables (see Table 1). All models were estimated using a restricted maximum likelihood method implemented in the R package mgcv [11].
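The exponentially decaying spatial dependence mentioned above can be illustrated with a plain exponential kernel. This sketch shows only the distance-decay idea; it does not reproduce the low rank construction used in the study or in mgcv, and the kernel parameterization exp(−d/range), the function name, the `range_param` value, and the coordinates are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def exponential_kernel(coords: Sequence[Tuple[float, float]],
                       range_param: float) -> List[List[float]]:
    """Correlation matrix with entries exp(-d_ij / range_param).

    d_ij is the Euclidean distance between zone centers i and j.
    """
    n = len(coords)
    return [
        [math.exp(-math.dist(coords[i], coords[j]) / range_param)
         for j in range(n)]
        for i in range(n)
    ]


centers = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]  # hypothetical zone centers
K = exponential_kernel(centers, range_param=2.0)
for row in K:
    print([round(value, 3) for value in row])
```

Nearby zones receive a correlation close to one and distant zones a correlation close to zero, with `range_param` controlling how quickly the dependence fades.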

We assumed β₀∈{−2, −1, 0, 1, 2} and N∈{50, 200}. In each case, the models were estimated 500 times, and the estimation accuracies were compared. Figs 7 and 8 display the estimated RMSEs and biases, respectively. When N = 50, odPoisson had extremely large RMSEs because of the identification problem. Poisson and Taylor also had large RMSEs. In contrast, the proposed method tends to have smaller RMSE values and may therefore be a better choice for small samples. Even for N = 200, the RMSEs and biases of Proposed were as small as those of Poisson and odPoisson. The estimation accuracy of the proposed method was thus verified in the case of spatial regression.

Fig 7. RMSE of the regression coefficients (model with spatial effects) (x-axis: β₀, y-axis: RMSE).

https://doi.org/10.1371/journal.pone.0260836.g007

Fig 8. Bias of the regression coefficients (model with spatial effects) (x-axis: β₀, y-axis: Bias).

https://doi.org/10.1371/journal.pone.0260836.g008

Fig 9 compares the coefficient standard errors. The SEs obtained from Proposed are similar to those from odPoisson for a large β₀, while Proposed has smaller SEs for a small β₀. Fig 10 plots (SE)/(standard deviation of the estimated coefficient values) when N = 200. Unlike odPoisson, whose SEs are severely underestimated for small β₀, Proposed estimates SEs reasonably accurately across cases.

Fig 9. Means of the coefficient standard errors (N = 200) (x−axis: β₀, y-axis: mean standard error).

https://doi.org/10.1371/journal.pone.0260836.g009

Fig 10. Means of (estimated standard error)/(standard deviation of the estimated coefficient values) (N = 200; x−axis: β₀, y-axis: RMSE).

https://doi.org/10.1371/journal.pone.0260836.g010

Finally, Fig 11 compares the estimation accuracy for the spatially dependent process s_i. The RMSE values of Proposed are almost identical to those of odPoisson, suggesting the accuracy of our approximation.

Fig 11. RMSE of the estimated spatial effects (x-axis: β₀, y-axis: RMSE).

https://doi.org/10.1371/journal.pone.0260836.g011

We performed another Monte Carlo experiment assuming group effects, which capture heterogeneity across groups, instead of the spatially dependent effects. As summarized in S2 Appendix in S1 File, the RMSEs and biases of Proposed are as small as those of Poisson and odPoisson for N = 200 and smaller than those of the two methods for N = 50.

In short, the proposed method provides an accurate and stable approximation for an over-dispersed Poisson MEM.

Computation time comparison

Finally, computation time is compared while varying N∈{1,000, 10,000, 100,000, 300,000} under cases 1 and 2. β₀ = 0 and σ² = 1 are assumed in this section. We use R version 4.0.2 (https://cran.r-project.org/) installed in a Mac Pro (3.5 GHz, 6-Core Intel Xeon E5 processor with 64 GB memory). The gam function in the mgcv package is used for the model estimation.

Under case 1 (basic model), Poisson, odPoisson, and Proposed took 20.1, 116.0, and 1.34 seconds on average, respectively. In case 2, which estimates spatial effects, Proposed is again considerably faster than Poisson and odPoisson, especially for large samples, as plotted in Fig 12. The computational efficiency of Proposed is thus confirmed.

Fig 12. Computation time comparison under case 2 (model with spatial effects).

https://doi.org/10.1371/journal.pone.0260836.g012

Results: COVID-19 analysis

Outline

This section applies the developed approximation to an analysis of the COVID-19 (coronavirus disease 2019) pandemic. Since the first case was detected in Wuhan, China, in December 2019, the coronavirus has spread globally. As of February 1, 2021, the cumulative number of confirmed cases was 103.41 million, while the confirmed death toll was 2.25 million. To achieve effective infection control for not only COVID-19 but also future pandemics and endemics, it is important to investigate the factors influencing the spread. The data on daily new cases analyzed in this section were provided by JX Press Corporation (https://jxpress.net/).

Fig 13 plots the number of daily cases in Japan between February 1, 2020, and January 29, 2021. The number peaked around April 2020, August 2020, and January 2021. Based on the time trend, we refer to February 1–May 31, 2020, as the first wave, June 1–September 30, 2020, as the second wave, and October 1, 2020–January 29, 2021, as the third wave. Fig 14 displays the spatial plots of the daily new cases by prefecture. The figure shows that the number of infected people tends to be large near Tokyo and Osaka, which are major urban areas.

Fig 13. Daily number of cases across Japan.

https://doi.org/10.1371/journal.pone.0260836.g013

Fig 14. Number of cases by prefecture.

https://doi.org/10.1371/journal.pone.0260836.g014

We performed a regression analysis exploring the factors influencing the increase/decrease in each wave. The explained variables were the numbers of daily cases by prefecture and by 10-year age group (−19, 20–29, …, 70–79, 80−). The sample sizes were 51,336, 50,508, and 50,094 for the three waves, respectively, of which 89.0% (45,696 samples), 77.1% (38,954 samples), and 49.7% (24,873 samples) were zeros.

For the COVID-19 data, we approximate the following over-dispersed Poisson additive mixed model using our approach: (23) where y_i is the number of daily new cases. β₀ and β₁ are regression coefficients. The model is fitted for each of the three datasets completely separately. To scale the mean function according to the population, the offset variable z_i is given by the prefectural population. The explanatory variable x_i is the prefectural pedestrian density by day, which is relative to January 13, 2020 (source: Apple Mobility Trends: https://covid19.apple.com/mobility). The density is estimated based on the number of route searches by Apple map users. For further detail, see the source page. g_i,l represents the l-th group-wise random effect. We consider the effects by week (g_i,1), days of the week (g_i,2), generation (g_i,3), and prefecture (g_i,4). In considering countermeasures, it is important to reveal not only patterns by prefectures but also across prefectures. To estimate this, we include a low rank Gaussian process s_i which smoothly varies depending on geographic coordinates; we use the geographic center of each prefecture for the modeling. The model was estimated using the mgcv package.

Results

Table 2 summarizes the estimated parameters. The estimated coefficients of pedestrian density are positive and significant in the first and second waves. Self-restraint was estimated to reduce the number of cases in the early periods. Based on the estimated dispersion parameter (σ²), the number of cases was over-dispersed, and the tendency became stronger over time.

Table 2. Parameter estimates.

See Fig 15 for the fitting on the number of cases.

https://doi.org/10.1371/journal.pone.0260836.t002

Fig 15. Comparison of the observed and predicted number of cases.

https://doi.org/10.1371/journal.pone.0260836.g015

Fig 16 plots the estimated group effects by week, day of the week, and generation. The estimated week-wise effects show that the increase in cases lasts longer in the third wave. Control of the infection spread might be getting more difficult over waves. Regarding the days of the week, Monday has the lowest value, while Thursday, Friday, and Saturday have higher values. The difference is attributable to operational reasons such as the closing of hospitals and PCR test sites. The estimated generation effects differ considerably across waves. In the first wave, people in the working generation (the 20s–50s) tend to be infected. Commuting and/or meeting in the office might trigger the infection. In the second wave, the 20s group has a strong tendency to be infected compared with older groups, which suggests that more self-restraint is needed in this group. In the third wave, not only the 20s but also the 30s–50s have high chances of being infected. Infection might have spread again across the working generation.

Fig 16. Estimated group effects (week, days of the week, generation).

https://doi.org/10.1371/journal.pone.0260836.g016

Fig 17 plots the estimated prefecture-wise independent effects and spatially dependent effects. The former estimates local hotspots while the latter, global hotspots. The estimated prefecture-wise effects suggest that prefectures including major cities (Tokyo, Osaka, Fukuoka) and Hokkaido are local hotspots. More countermeasures might be required in these prefectures. On the other hand, based on the estimated spatially dependent effects, there is a global hotspot around Tokyo, and the influences grow over waves. Control of the infection spread from Tokyo might have been important to mitigate the third wave.

Fig 17. Estimated group effects by prefecture (top) and spatially dependent effects (bottom).

https://doi.org/10.1371/journal.pone.0260836.g017

Discussion

This study develops a practical log-Gaussian approximation for Poisson regression models. Considering its simplicity, stability, and computational efficiency, it will be useful for researchers as well as practitioners.

Exploring the expandability of our approach is an important future task. For example, our approach might be useful for spatial and spatiotemporal interpolation of count data by combining it with Gaussian process models without additional computation and implementation costs. Our approach might also be useful for fast count data assimilation by combining it with a state-space model. Exploring such extensions will be an interesting research endeavor.

Supporting information

S1 File.

https://doi.org/10.1371/journal.pone.0260836.s001

(DOCX)

References

1. Soomro K, Bhutta MNM, Khan Z, Tahir MA. Smart city big data analytics: An advanced review. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(5):e1319.
2. Viner RM, Russell S, Croker H, Packer J, Ward J, Stansfield C, et al. School closure and management practices during coronavirus outbreaks including COVID-19: a rapid systematic review. Lancet Child Adolesc Health. 2020;4(5):397–404. pmid:32272089
3. Oztig LI, Askin OE. Human mobility and coronavirus disease 2019 (COVID-19): a negative binomial regression analysis. Public Health. 2020;185:364–7. pmid:32739776
4. Vokó Z, Pitter JG. The effect of social distance measures on COVID-19 epidemics in Europe: an interrupted time series analysis. GeroScience. 2020;42(4):1075–82. pmid:32529592
5. Wakefield J. Disease mapping and spatial regression with count data. Biostatistics. 2007;8:158–83. pmid:16809429
6. Lee D, Neocleous T. Bayesian quantile regression for count data with application to environmental epidemiology. J Roy Stat Soc C. 2010;59(5):905–20.
7. Ver Hoef JM, Boveng PL. Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology. 2007;88(11):2766–72. pmid:18051645
8. Lindén A, Mäntyniemi S. Using the negative binomial distribution to model overdispersion in ecological count data. Ecology. 2011;92(7):1414–21. pmid:21870615
9. Osgood DW. Poisson-based regression analysis of aggregate crime rates. J Quant Criminol. 2000;16(1):21–43.
10. Piza EL. Using Poisson and negative binomial regression models to measure the influence of risk on crime incident counts. Rutgers Center on Public Security. 2012.
11. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J Roy Stat Soc B. 2011;73(1):3–36.
12. Rodríguez-Álvarez MX, Lee DJ, Kneib T, Durbán M, Eilers P. Fast smoothing parameter separation in multidimensional generalized P-splines: the SAP algorithm. Stat Comput. 2015;25(5):941–57.
13. Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. New York: Springer; 2000.
14. Diggle PJ, Tawn JA, Moyeed RA. Model-based geostatistics. J Roy Stat Soc C. 1998;47(3):299–350.
15. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J Roy Stat Soc B. 2009;71(2):319–92.
16. Silva JS, Tenreyro S. On the existence of the maximum likelihood estimates in Poisson regression. Econ Lett. 2010;107(2):310–2.
17. Correia S, Guimarães P, Zylkin T. Verifying the existence of maximum likelihood estimates for generalized linear models. ArXiv. 2019;1903.01633.
18. Breslow NE. Extra-Poisson variation in log-linear models. J Roy Stat Soc C. 1984;33(1):38–44.
19. El-Sayyad GM. Bayesian and classical analysis of Poisson regression. J Roy Stat Soc B. 1973;35(3):445–51.
20. Chan AB, Dong D. Generalized Gaussian process models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2011;5995688:2681–8.
21. Wood SN. Generalized Additive Models: An Introduction with R. Boca Raton: CRC Press; 2017.
22. Chan AB, Vasconcelos N. Counting people with low-level features and Bayesian regression. IEEE Trans Image Process. 2011;21(4):2160–77. pmid:22020684
23. Bellego C, Pape LD. Dealing with logs and zeros in regression models. SSRN: 3444996 [Preprint]. 2019; Available from: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3444996.
24. Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika. 1974;61(3):439–47.

