Multivariate epidemic count time series model

  • Shinsuke Koyama

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    skoyama@ism.ac.jp

    Affiliations Department of Statistical Modeling, The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan, Department of Statistical Science, Graduate University for Advanced Studies (SOKENDAI), Tachikawa, Tokyo, Japan

Abstract

An infectious disease spreads not only over a single population or community but also across multiple and heterogeneous communities. Moreover, its transmissibility varies over time because of various factors such as seasonality and epidemic control, which results in strongly nonstationary behavior. In conventional methods for assessing transmissibility trends or changes, univariate time-varying reproduction numbers are calculated without taking into account transmission across multiple communities. In this paper, we propose a multivariate-count time series model for epidemics. We also propose a statistical method for estimating the transmission of infections across multiple communities and the time-varying reproduction numbers of each community simultaneously from a multivariate time series of case counts. We apply our method to incidence data for the novel coronavirus disease 2019 (COVID-19) pandemic to reveal the spatiotemporal heterogeneity of the epidemic process.

Introduction

Mathematical and statistical modeling of epidemics is crucial in epidemiology because it provides a theoretical basis for analyzing the spread of infectious diseases and the building blocks of statistical methodologies for data analysis [1, 2]. Epidemic modeling has been used to describe a wide variety of phenomena. For instance, the spread of information, opinions, and social behaviors can be modeled as a contagion process [3]. Epidemic modeling lies at the core of a research field that crosses different disciplines. In this study, we developed a multivariate time-series epidemic analysis model.

During an ongoing pandemic, the most commonly available type of data is the daily (or weekly) number of newly reported cases. The time series of case counts provides information on the epidemic size and transmissibility trends. The time-varying effective reproduction number, defined as the expected number of secondary cases arising from a single primary case, is widely used to monitor these trends. Recent developments in epidemiological data analysis have utilized statistical methodologies to improve the efficiency of estimating the effective reproduction number [4, 5]. However, most existing methods are limited to univariate time-series analyses. Although epidemics spread across multiple communities, populations, or regions, these methods can only be used to calculate the univariate effective reproduction numbers of each community separately. To incorporate infection transmission over multiple communities, it is necessary to extend these methods to perform multivariate analysis.

In this study, we propose a multivariate count time series model that describes an epidemic process in multiple nodes. Central to our approach is introducing latent variables to represent successive infections in transmission chains. These latent variables enable us to make posterior inferences regarding the transmission of infections across multiple nodes and develop an EM-type algorithm for estimating the model parameters. In particular, the proposed algorithm can be used to simultaneously estimate the entries in the adjacency matrix and time-varying effective reproduction numbers of each node from a multivariate incidence time series. An application of our method is demonstrated using synthetic and actual data from the COVID-19 pandemic.

Related works

The effective reproduction number can be defined in two ways: as an instantaneous or as a cohort reproduction number. The former measures transmission at a specific point in time, while the latter measures transmission by a specific cohort of individuals [6, 7]. In the methods described below, one of these reproduction numbers is estimated from a time series of case counts.

Wallinga and Teunis’s method is based on the likelihood of a renewal model and is commonly used to estimate the cohort reproduction number [8]. In the methods proposed by Cori et al. [9] (EpiEstim) and Bettencourt and Ribeiro [10], the instantaneous reproduction number is estimated using a Bayesian framework. The main difference between these two methods lies in the model assumption. A renewal model is assumed in EpiEstim, as in the Wallinga and Teunis method, whereas the Bettencourt and Ribeiro method is based on the linearized growth rate of an SIR model. Compared to the SIR model, the renewal model involves simpler parametric assumptions regarding the epidemic process and requires only the generation interval distribution. An advantage of methods based on the renewal model is their simplicity: parsimony reduces the risk of model misspecification when there are many unknowns in the underlying process. A comprehensive comparison of these three methods was presented by Gostic et al. [7].

In a recently proposed method [4], the cohort reproduction number is estimated by combining a renewal model with a state-space model, and the estimation problem is solved using a recursive Bayesian smoothing procedure. Parag [5] independently proposed a method to estimate the instantaneous reproduction number along the same lines. To the best of our knowledge, these are the state-of-the-art methods for estimating the effective reproduction number from an incidence curve.

However, note that all these methods apply only to univariate time-series data.

Methods

Multivariate-count time series model

First, we consider a univariate epidemic model. Let $n_t$ be the number of cases in which symptoms begin at time $t$. Given an initial case at time $t = 1$, the rate of new cases at time $t\ (\geq 2)$ can be described as

$$\lambda_t = \sum_{\tau=1}^{t-1} R_{t-\tau}\,\phi_\tau\, n_{t-\tau}, \tag{1}$$

where $\{\phi_\tau\}$ is the serial interval distribution (i.e., the distribution of time from symptom onset in the primary case to symptom onset in the secondary case [11]) and $R_{t-\tau}$ is the effective reproduction number at time $t-\tau$.

Before extending it to perform multivariate analysis, we contrast Eq (1) with the conventional renewal model that describes epidemic processes [5, 6, 9, 12]:

$$\lambda_t = R_t \sum_{\tau=1}^{t-1} \phi_\tau\, n_{t-\tau}. \tag{2}$$

The most significant difference between these two models lies in their treatment of the effective reproduction number. Here, $R_t$ in Eq (2) represents the instantaneous reproduction number, while $R_{t-\tau}$ in Eq (1) represents the cohort reproduction number, as shown in S1 File. Additionally, the count $n_{t-\tau}$ in Eq (2) represents the number of cases infected at time $t-\tau$; consequently, $\{\phi_\tau\}$ represents the generation time distribution. Therefore, to apply this model to reported cases, it is necessary to adjust for the time lag between infection and symptom onset [6]. By contrast, because the case definition in Eq (1) is based on symptoms, there is no need to adjust for the time lag. Hence, we employ Eq (1) as the basis of our multivariate model.

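We consider an infection process that spreads over $D$ nodes, where each node represents a group of individuals. Let $n_{it}$ be the number of newly reported cases at $(i, t)$, where $(i, t)$ denotes node $i$ at time $t$. By extending Eq (1) to a multivariate setting, the rate of new cases at $(i, t)$ is given by

$$\lambda_{it} = \sum_{j=1}^{D} a_{ij} \sum_{\tau=1}^{t-1} R_{j,t-\tau}\,\phi_\tau\, n_{j,t-\tau}, \tag{3}$$

where $R_{j,t-\tau}$ denotes the effective reproduction number at $(j, t-\tau)$ and $a_{ij}\ (\geq 0)$ represents the transmission ratio from node $j$ to node $i$, satisfying $\sum_{i=1}^{D} a_{ij} = 1$. Based on the history of previously reported cases,

$$N_{1:t-1} = \{ n_{js} : j = 1, \ldots, D;\ s = 1, \ldots, t-1 \}, \tag{4}$$

the count $n_{it}$ is assumed to follow a Poisson distribution with the rate given by Eq (3):

$$P(n_{it} \mid N_{1:t-1}) = \frac{\lambda_{it}^{\,n_{it}}}{n_{it}!}\, e^{-\lambda_{it}}. \tag{5}$$

The multivariate count time series model comprises the two components in Eqs (3) and (5).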
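As an illustration of how Eqs (3) and (5) generate data, the following is a minimal simulation sketch. The function name `simulate_counts`, the 0-based time indexing, and the array layout (`A[i, j]` for the transmission ratio from node j to node i, `phi[tau]` for the serial interval probability at lag tau) are assumptions made here for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_counts(A, R, phi, n_init, rng):
    """Draw n[i, t] ~ Poisson(lambda[i, t]) with lambda as in Eq (3).

    A      : (D, D) array, A[i, j] = transmission ratio from node j to node i
             (columns are assumed to sum to 1)
    R      : (D, T) array of reproduction numbers R[j, s]
    phi    : array of length >= T, phi[tau] = serial interval pmf at lag tau
             (phi[0] is never used)
    n_init : (D,) initial case counts at the first time step
    rng    : numpy.random.Generator
    """
    D, T = R.shape
    n = np.zeros((D, T), dtype=np.int64)
    n[:, 0] = n_init
    for t in range(1, T):
        lags = np.arange(1, t + 1)  # tau = 1, ..., t (time index is 0-based)
        # offspring[j] = sum_tau R[j, t - tau] * phi[tau] * n[j, t - tau]
        offspring = (R[:, t - lags] * phi[lags] * n[:, t - lags]).sum(axis=1)
        lam = A @ offspring         # lambda[i, t] = sum_j A[i, j] * offspring[j]
        n[:, t] = rng.poisson(lam)
    return n
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the simulated series reproducible.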

Latent variable representing secondary infection in transmission chain

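Now, let us introduce the latent variable $y_{it}^{js}$, which represents the number of secondary cases at $(i, t)$ infected by the primary cases at $(j, s)$ ($s < t$). As the total number of new cases at $(i, t)$ is given by $n_{it}$, the following equality holds:

$$n_{it} = \sum_{j=1}^{D} \sum_{s=1}^{t-1} y_{it}^{js}. \tag{6}$$

Assuming conditional independence between the transmission events $y_{it}^{js}$ and $y_{it}^{j's'}$ for $(j, s) \neq (j', s')$ given $N_{1:t-1}$, the superposition principle can be applied to the Poisson distribution (5), leading to a Poisson distribution for the counts $y_{it}^{js}$ with the rate given by $a_{ij} R_{js} \phi_{t-s} n_{js}$. Thus, given the sum $n_{it}$ of these independent Poisson random variables, the conditional distribution of the Poisson vector is multinomial, with probabilities given by the individual rates scaled by their sum:

$$P\big(\{y_{it}^{js}\}_{j,s} \mid n_{it}, N_{1:t-1}\big) = \frac{n_{it}!}{\prod_{j,s} y_{it}^{js}!} \prod_{j=1}^{D} \prod_{s=1}^{t-1} \left( \frac{a_{ij} R_{js} \phi_{t-s} n_{js}}{\lambda_{it}} \right)^{y_{it}^{js}}. \tag{7}$$

In particular, the conditional expectation of $y_{it}^{js}$, given $n_{it}$ and $N_{1:t-1}$, is

$$E\big[y_{it}^{js} \mid n_{it}, N_{1:t-1}\big] = n_{it}\, \frac{a_{ij} R_{js} \phi_{t-s} n_{js}}{\lambda_{it}}, \tag{8}$$

from which posterior inferences can be made regarding secondary infections from the reported incidences.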
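The conditional expectation of the latent counts can be sketched in code as follows. This is a minimal illustration under assumptions: the rate of the latent count from (j, s) to (i, t) is taken to be the (j, s) term of the sum in Eq (3), time is 0-based, and the function name `expected_secondary` and the dense four-dimensional output are choices made here, not the author's implementation. The dense array is practical only for small D and T.

```python
import numpy as np

def expected_secondary(n, A, R, phi):
    """Return Y[i, t, j, s]: expected secondary cases at (i, t) caused by (j, s).

    n   : (D, T) observed counts
    A   : (D, D) transmission ratios, A[i, j] from node j to node i
    R   : (D, T) reproduction numbers
    phi : array of length >= T, serial interval pmf indexed by lag
    """
    n = np.asarray(n, dtype=float)
    D, T = n.shape
    Y = np.zeros((D, T, D, T))
    for t in range(1, T):
        s = np.arange(t)                          # primary times s < t
        w = R[:, :t] * phi[t - s] * n[:, :t]      # w[j, s] = R_js * phi_{t-s} * n_js
        rate = A[:, :, None] * w[None, :, :]      # rate[i, j, s] = a_ij * w[j, s]
        lam = rate.sum(axis=(1, 2))               # lambda_it, the sum of all rates
        ok = lam > 0                              # skip nodes with zero total rate
        share = rate[ok] / lam[ok][:, None, None] # multinomial probabilities
        Y[ok, t, :, :t] = n[ok, t][:, None, None] * share
    return Y
```

Summing `Y` over its last two axes recovers the observed counts wherever the rate is positive, which is the constraint that the latent counts add up to the observed number of new cases.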

Parameter estimation

Using latent variables, we develop an expectation-maximization (EM)-type algorithm to estimate the model parameters. In this study, we focus on estimating the weighted adjacency matrix A = (aij) and the time-varying reproduction numbers R = {Rjs} of each node from an observed multivariate incidence time series N1:T. To estimate these parameters, we consider the following penalized log-likelihood function: (9) where the choice of the exponent p ∈ {1, 2} and the hyperparameter γ ≥ 0 depends on the sparsity or smoothness of the variation in the time-varying reproduction numbers.

Rather than maximizing Eq (9) with respect to A and R directly, we iteratively update the estimates to ensure that the objective function increases monotonically. To update the parameters, we construct a tight lower bound Q(A, R|A(k), R(k)) at the current parameter estimates {A(k), R(k)} such that (10) (11) Maximizing a function that satisfies these properties ensures that the objective function increases monotonically. As in the EM algorithm, the lower bound is obtained as follows: (12) where the expectations of the latent variables are the conditional expectations computed from Eq (8) with the current parameter estimates {A(k), R(k)}. (See S1 File for the derivation.)
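The monotonicity claim follows from a short chain of inequalities. In the sketch below, $\ell(A, R)$ is notation assumed here for the penalized log-likelihood of Eq (9), and the two properties of a tight lower bound are taken to be $\ell(A, R) \geq Q(A, R \mid A^{(k)}, R^{(k)})$ for all $(A, R)$, with equality at $(A, R) = (A^{(k)}, R^{(k)})$.

```latex
\begin{align*}
\ell\big(A^{(k+1)}, R^{(k+1)}\big)
  &\geq Q\big(A^{(k+1)}, R^{(k+1)} \mid A^{(k)}, R^{(k)}\big)
     && \text{(lower bound)} \\
  &\geq Q\big(A^{(k)}, R^{(k)} \mid A^{(k)}, R^{(k)}\big)
     && \text{($(A^{(k+1)}, R^{(k+1)})$ maximizes $Q$)} \\
  &= \ell\big(A^{(k)}, R^{(k)}\big)
     && \text{(tightness at the current estimate)}
\end{align*}
```

Each update therefore cannot decrease the objective, even though the objective itself is never maximized directly.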

Updating of A. The update for aij is obtained by maximizing Eq (12) with respect to aij under the constraint $\sum_{i=1}^{D} a_{ij} = 1$. Using the Lagrange multiplier method, we obtain (13) Eq (13) leads to a natural interpretation of aij as the fraction of the expected secondary cases caused by the primary cases in node j that occur in node i.
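Read this way, the update normalizes the expected secondary cases per source node. The sketch below assumes that form (the displayed Eq (13) is not reproduced here) and takes as input a four-dimensional array `Y[i, t, j, s]` of expected secondary cases, a layout chosen for illustration. The uniform fallback for a source node with no expected secondary cases is also an assumption, not part of the paper.

```python
import numpy as np

def update_A(Y):
    """Return A[i, j] = (expected cases in node i caused by node j) / (all cases caused by node j).

    Y : (D, T, D, T) array, Y[i, t, j, s] = expected secondary cases at (i, t)
        infected by primary cases at (j, s).
    """
    S = Y.sum(axis=(1, 3))                   # S[i, j]: summed over times t and s
    col = S.sum(axis=0, keepdims=True)       # total secondary cases caused by node j
    D = S.shape[0]
    safe = np.where(col > 0, col, 1.0)       # avoid division by zero
    # Assumed fallback: a uniform column when node j caused no expected cases.
    return np.where(col > 0, S / safe, 1.0 / D)
```

By construction every column of the returned matrix sums to one, so the constraint on the transmission ratios is satisfied after each update.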

Updating of R The update of Rjs satisfies the following equation: (14) where denotes the total number of expected secondary infections caused by primary cases at (j, s), and . As shown in S1 File, the solution of the system of the above equations (s = 1, …, T) corresponds to the maximum a posteriori (MAP) estimates of a univariate state-space model with a Poisson observation model, (15) and the state-transition density, (16)

The transition density corresponding to the penalty function is given by a Laplace (p = 1) or Gaussian distribution (p = 2). The update of the time-varying reproduction number of node j, , is then computed using state smoothing for the equivalent state-space model. The details of the smoothing algorithm are provided in S1 File.

The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Algorithm for estimating A and R

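Input: Time series of case counts $N_{1:T}$, penalty exponent $p = 1$ or $2$, and hyperparameter $\gamma > 0$.

1: Initialize $a_{ij}^{(0)}$ for all $i, j$, and $R_{js}^{(0)}$ for all $j, s$.

2: for $k = 0, 1, \ldots$ until convergence do

3:  Compute the conditional expectations of the latent variables using Eq (8) with $A^{(k)}$ and $R^{(k)}$ for all $i, t, j, s$.

4:  Update $a_{ij}^{(k+1)}$ using Eq (13) for all $i, j$.

5:  for $j = 1$ to $D$ do

6:   Update $\{R_{js}^{(k+1)}\}_{s=1}^{T}$ by state smoothing.

7:  end for

8: end for

Output: $A^{(k+1)}$, $R^{(k+1)}$.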
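The loop structure of Algorithm 1 can be sketched as follows, with one important simplification: the state-smoothing update of R is replaced here by the closed-form update obtained when the penalty is switched off (γ = 0), so this is not the paper's estimator and yields unsmoothed reproduction numbers. The function name `em_unpenalized`, the 0-based time indexing, and the matrix formulation are assumptions made for illustration. The initial matrix `A0` is assumed to have columns summing to one.

```python
import numpy as np

def em_unpenalized(n, phi, A0, R0, n_iter=50):
    """EM-type iteration for A and R with the smoothness penalty omitted (gamma = 0).

    n   : (D, T) case counts; phi : serial interval pmf of length >= T (phi[0] unused)
    A0  : (D, D) initial transmission ratios (columns sum to 1); R0 : (D, T) initial R
    Returns the final A, the final R, and the log-likelihood (up to a constant)
    evaluated before each update.
    """
    n = np.asarray(n, dtype=float)
    D, T = n.shape
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]       # lag[t, s] = t - s
    K = np.where(lag > 0, phi[np.clip(lag, 0, T - 1)], 0.0)   # K[t, s] = phi_{t-s} for t > s
    Phi = K.sum(axis=0)                                       # mass of phi observable after time s
    A, R = A0.copy(), R0.copy()
    loglik = []
    for _ in range(n_iter):
        m = (R * n) @ K.T                  # m[j, t] = sum_s R_js n_js phi_{t-s}
        lam = A @ m                        # Poisson rate lambda[i, t]
        lam1, n1 = lam[:, 1:], n[:, 1:]
        pos = lam1 > 0                     # terms with zero rate are skipped
        loglik.append(np.sum(n1[pos] * np.log(lam1[pos])) - lam1.sum())
        # E-step, summarized by r[i, t] = n_it / lambda_it
        r = np.divide(n, lam, out=np.zeros_like(n), where=lam > 0)
        S = A * (r @ m.T)                  # S[i, j]: expected cases in i caused by j
        Y = (R * n) * ((A.T @ r) @ K)      # Y[j, s]: expected cases caused by (j, s)
        # M-step for A: normalize each column (unchanged if the column is all zero)
        col = S.sum(axis=0, keepdims=True)
        A = np.where(col > 0, S / np.where(col > 0, col, 1.0), A)
        # M-step for R with gamma = 0 (stands in for state smoothing)
        denom = n * Phi
        R = np.where(denom > 0, Y / np.where(denom > 0, denom, 1.0), R)
    return A, R, np.array(loglik)
```

Because both updates maximize the lower bound exactly, the recorded log-likelihood should be non-decreasing across iterations, which is a convenient check on any implementation. Without the penalty, R has as many free values as there are observations, so the estimates follow the noise closely; the smoothing step in Algorithm 1 is what regularizes them.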

The free energy (negative log-marginal likelihood) of the state-space model can be computed through state smoothing (see S1 File for details). We use the total free energy, that is, the sum of the free energies of all the nodes, as the criterion for selecting the penalty function (p = 1 or 2) and the value of the hyperparameter γ.

Results

Analysis of synthetic data

We first demonstrate our method using a toy example in which the infection process is simulated on two nodes (D = 2). To mimic the COVID-19 pandemic, we employed a log-normal distribution for the serial interval distribution {ϕτ} with a mean and standard deviation of 4.7 and 2.9 days, respectively [13]. The transmission ratios were set as a11 = a22 = 0.9 and a12 = a21 = 0.1. The time-varying reproduction numbers of each node are shown in Fig 1a. Using these parameters and a given initial infection at t = 1, the model given by Eqs (3) and (5) is used to generate N1:T. In Fig 1b, we plot the simulated case counts N1:T for each node (black lines). Because of the profile of the effective reproduction numbers, the incidence curves for both nodes exhibit two waves: the first and second are caused by nodes 1 and 2, respectively.
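A log-normal distribution is parameterized by the mean and standard deviation of the logarithm, so the reported mean and standard deviation (4.7 and 2.9 days) have to be converted before use. The sketch below does this with the standard moment relations and then discretizes the distribution to daily lags. The discretization rule (probability mass on the interval from τ − 0.5 to τ + 0.5, renormalized over lags 1 to `max_lag`) is an assumption made for illustration; the text does not state which rule was used.

```python
import math
import numpy as np

def lognormal_params(mean, sd):
    """Convert the mean and SD of a log-normal variable to (mu, sigma) of its logarithm."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

def discretized_serial_interval(mean, sd, max_lag):
    """Return phi[0..max_lag] with phi[0] = 0 and phi[1:] summing to 1."""
    mu, sigma = lognormal_params(mean, sd)

    def cdf(x):
        # Log-normal cumulative distribution function
        if x <= 0:
            return 0.0
        return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

    phi = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        # Assumed rule: mass of the interval [tau - 0.5, tau + 0.5]
        phi[tau] = cdf(tau + 0.5) - cdf(tau - 0.5)
    return phi / phi.sum()
```

For example, `discretized_serial_interval(4.7, 2.9, 30)` gives a probability vector over lags of 1 to 30 days that can serve as the serial interval weights in a simulation.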

Fig 1. Simulation result.

(a) Time-varying reproduction numbers used in the simulation. (b) Simulated infection counts (black line) and estimated infection counts transmitted from the other node (gray area). (c-f) True (yellow line) and estimated reproduction numbers (blue line) with 95% credible interval (shaded area) obtained using our method (c), OnlyR (d), WT (e), and Cauchy (f). Our method successfully estimated the amplitude of the effective reproduction number while the other three methods failed.

https://doi.org/10.1371/journal.pone.0287389.g001

From the simulated case counts N1:T, we set the penalty function (p = 1 or 2) and the value of the hyperparameter γ based on free energy minimization using a grid search and estimated the model parameters . The estimated transmission ratios are , , and , which are in good agreement with the true values. Fig 1c shows the estimated time-varying reproduction numbers (blue line) of each node with 95% credible intervals (shaded area), along with the true values (yellow line), from which we confirmed that the amplitudes of the reproduction numbers were properly estimated.

Using the estimated parameters and Eq (8), the conditional expectation of the secondary infections, , was computed, from which we estimated the number of infections that are transmitted from the other node, (Fig 1b, gray area). As expected, we observed that the second wave in node 1 was dominated by infections transmitted from node 2, whereas the first wave in node 2 was initiated by the outbreak of node 1.

To examine the effect of overlooking inter-node infections on the estimates of time-varying reproduction numbers, we estimated R with A fixed to the identity matrix, that is, separately for each node (Fig 1d; ‘OnlyR’). The estimated reproduction numbers are substantially biased. The reproduction number of node 1 during the first wave was underestimated because the infections that node 1 transmitted to node 2 were ignored, whereas the reproduction number during the second wave was overestimated because the infections transmitted from node 2 were attributed to node 1. (The same holds for the estimated reproduction number of node 2.)

For comparison, we applied two other methods to estimate the time-varying reproduction numbers. The first is the method proposed by Wallinga and Teunis (‘WT’), which is widely used to estimate effective reproduction numbers [8]. In the second method [4], the effective reproduction number is estimated using the state-space method equipped with a Cauchy transition density (‘Cauchy’). As these methods are applicable only to univariate time series, the time-varying reproduction numbers of each node are estimated separately. The estimation results obtained using these two methods are plotted in Fig 1e and 1f. As was the case with OnlyR, upward and downward biases were observed in these two methods because the inter-node infections were ignored. The differences in the estimated reproduction numbers among these three methods are attributed to the smoothness assumptions made in them. In particular, the WT method is easily influenced by data fluctuations in the initial phase of the epidemic, and its estimated reproduction numbers decrease at the end of the recorded interval owing to right truncation.

We performed an additional numerical study, in which the number of nodes (dimensions) was varied from D = 10 to 50. The transmission ratios were randomly chosen and normalized such that $\sum_{i=1}^{D} a_{ij} = 1$, and the time-varying reproduction number of each node was randomly chosen from the two profiles shown in Fig 1a. To quantify the estimation performance, we compute the average relative error as follows: (17) for time-varying reproduction numbers. We performed the numerical study 10 times with different samples and report the performance metrics averaged over the 10 runs. The results are summarized in Table 1. Our method outperformed the other three methods over the range of dimensions examined. This confirms that the estimation of time-varying reproduction numbers can be significantly improved by considering the transmission of infections across multiple nodes.
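One plausible form of such a metric is the mean, over all nodes and time points, of the absolute error divided by the true value. The sketch below implements that form; it is an assumption about what an average relative error looks like and may differ from the exact definition in Eq (17).

```python
import numpy as np

def average_relative_error(R_true, R_est):
    """Mean of |R_est - R_true| / R_true over all nodes and time points.

    Both arguments are arrays of the same shape; R_true must be strictly positive.
    """
    R_true = np.asarray(R_true, dtype=float)
    R_est = np.asarray(R_est, dtype=float)
    return float(np.mean(np.abs(R_est - R_true) / R_true))
```

A relative rather than absolute error keeps periods of low and high transmissibility on a comparable scale when results are averaged across nodes.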

Table 1. Average relative error between the true R and its estimate obtained using our method, OnlyR, WT, and Cauchy.

https://doi.org/10.1371/journal.pone.0287389.t001

Analysis of actual data

We applied our method to actual data from the COVID-19 pandemic in Japan (https://www.mhlw.go.jp/stf/covid-19/open-data.html). The data consist of newly confirmed cases in D = 47 prefectures between January 16, 2020, and November 24, 2021, in which 1,720,441 cases were reported. The data for each week were aggregated to reduce the influence of daily noise on the reported cases. We applied our estimation method to new weekly cases to estimate the parameters . We empirically confirmed that the algorithms with different initializations converged to the same estimate. Using the estimated parameters and Eq (8), the conditional expectation of secondary infections, , was computed, from which we made posterior inferences regarding the infections transmitted across the prefectures. In particular, we computed the expected total number of secondary infections in prefecture i that were transmitted from prefecture j: .

Overall, it was estimated that 82% of the cases were infected within their own prefecture (“intra-prefectural infections”) and that 18% were infected across prefectures (“inter-prefectural infections”). Fig 2 shows a matrix visualization of the estimated secondary infections (Fig 2a, heat map) and bar graphs of the intra-prefectural infections, in-degree, and out-degree of each prefecture (Fig 2b). The prefectures are arranged in geographical order from northeast to southwest. Inter-prefectural infections exhibit a community structure that is correlated with the demographic, economic, and industrial characteristics of the prefectures. The largest community was centered around Tokyo (the capital of Japan), the second largest around Osaka (the largest prefecture in the west), and the third largest around Aichi (the largest industrial area).

Fig 2. Result of the actual data analysis.

(a) Matrix visualization of estimated secondary infections plotted in log scale. Diagonal and off-diagonal elements represent intra-prefectural and inter-prefectural infections, respectively. (b) Bar graph of intra-prefectural infections (top), in-degree 〈yi〉 (middle), and out-degree 〈yj〉 (bottom) of each prefecture.

https://doi.org/10.1371/journal.pone.0287389.g002

Fig 3 shows the weekly incidence curves (black lines) along with the estimated infection counts transmitted from the other nodes (gray areas) and the estimated time-varying reproduction numbers (blue line) for Tokyo, Osaka, and Aichi. The estimated reproduction numbers exhibited rises and falls that are correlated with the periods during which a state of emergency was in effect (purple range). In the same figure, we plot the time-varying reproduction numbers estimated using OnlyR (green line), WT (yellow line), and Cauchy (red line). These three estimates deviate systematically from that obtained by our method, which indicates a substantial number of inter-prefectural infections. The estimated time-varying reproduction numbers for all 47 prefectures are shown in S1 Fig.

Fig 3. Result of the actual data analysis.

Weekly incidence curve (black line) along with estimated infection counts transmitted from outside (gray area) and estimated time-varying reproduction number (blue line) with 95% credible interval (shaded sky blue area) for Tokyo (top), Osaka (middle), and Aichi (bottom) prefectures. Green, yellow, and red lines represent time-varying reproduction numbers estimated by OnlyR, WT, and Cauchy, respectively. The purple range represents the period during which a state of emergency was in effect.

https://doi.org/10.1371/journal.pone.0287389.g003

Conclusion

In this paper, we proposed a multivariate-count time-series model for epidemics spreading across multiple nodes. The central concept of our approach is to introduce latent variables that represent secondary infections in transmission chains. This enabled us to infer infection transmission across multiple nodes and develop an EM-type algorithm for estimating model parameters. The proposed algorithm simultaneously estimates the weighted adjacency matrix and time-varying reproduction numbers for each node from a multivariate time series of incidence. In addition, we formulated a state-smoothing algorithm to estimate time-varying reproduction numbers. This enabled us to use a tool developed for state-space models.

Because the serial interval distribution is fixed, our estimation method is limited to incidence data for which the statistical properties of the serial interval remain unchanged. While we employed the log-normal distribution estimated in the early phase of COVID-19 [13], the Omicron variant, which became prevalent in January 2022, spreads more easily than the earlier variants of SARS-CoV-2 [14]. Therefore, it is necessary to revise the estimate of the serial interval distribution to assess the transmissibility during distinct phases of the pandemic.

Although we assume that the entire outbreak is driven by transmission within the network, the proposed model (3) can include case imports from outside the network as

$$\lambda_{it} = \mu_{it} + \sum_{j=1}^{D} a_{ij} \sum_{\tau=1}^{t-1} R_{j,t-\tau}\,\phi_\tau\, n_{j,t-\tau}, \tag{18}$$

where $\mu_{it}$ is the import rate at $(i, t)$. Accordingly, the number of cases imported from outside the network enters the right-hand side of Eq (6) as an additional term (Eq (19)), and posterior inference based on the multinomial distribution (7) can be applied to differentiate between cases arising within the network and those imported from outside. The rate of case importation, $\mu_{it}$, may be pre-estimated from medical inspections during airport quarantine; otherwise, $\mu_{it}$ must be estimated simultaneously with the adjacency matrix and time-varying reproduction numbers, which is left for future work.

The concept of introducing latent variables to represent transmission chains is similar to that of the branching representation of the Hawkes process [15–17]. A Hawkes-type point process is obtained in the continuous-time limit of the count time-series model. From this perspective, our model can be regarded as its discrete-time counterpart and provides a basis for its application in the analysis of multivariate count time series.

Supporting information

S1 File. Supplementary material to the manuscript.

https://doi.org/10.1371/journal.pone.0287389.s001

(PDF)

S1 Fig. Estimated effective reproduction numbers for 47 prefectures.

https://doi.org/10.1371/journal.pone.0287389.s002

(PDF)

Acknowledgments

The author would like to thank the participants of “Workshop on Mathematical Modeling,” held in Karuizawa in July 2022, for their valuable comments.

References

  1. Andersson H, Britton T. Stochastic Epidemic Models and Their Statistical Analysis (Lecture Notes in Statistics). New York: Springer; 2000.
  2. Yan P, Chowell G. Quantitative methods for investigating infectious disease outbreaks. Springer; 2019.
  3. Hill AL, Rand DG, Nowak MA, Christakis NA. Infectious Disease Modeling of Social Contagion in Networks. PLoS Computational Biology. 2010;6:1–15. pmid:21079667
  4. Koyama S, Horie T, Shinomoto S. Estimating the time-varying reproduction number of COVID-19 with a state-space method. PLoS Computational Biology. 2021;17:e1008679. pmid:33513137
  5. Parag KV. Improved estimation of time-varying reproduction numbers at low case incidence and between epidemic waves. PLoS Computational Biology. 2021;17:e1009347. pmid:34492011
  6. Fraser C. Estimating individual and household reproduction numbers in an emerging epidemic. PLoS ONE. 2007;2:e758. pmid:17712406
  7. Gostic KM, McGough L, Baskerville EB, Abbott S, Joshi K, Tedijanto C, et al. Practical considerations for measuring the effective reproductive number, Rt. PLoS Computational Biology. 2020;16:e1008409. pmid:33301457
  8. Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology. 2004;160:509–516. pmid:15353409
  9. Cori A, Ferguson NM, Fraser C, Cauchemez S. A new framework and software to estimate time-varying reproduction numbers during epidemics. American Journal of Epidemiology. 2013;178:1505–1512. pmid:24043437
  10. Bettencourt LMA, Ribeiro RM. Real time Bayesian estimation of the epidemic potential of emerging infectious diseases. PLoS ONE. 2008;3:e2185. pmid:18478118
  11. Fine PEM. The interval between successive cases of an infectious disease. American Journal of Epidemiology. 2003;158:1039–1047. pmid:14630599
  12. Nishiura H, Chowell G. The Effective Reproduction Number as a Prelude to Statistical Estimation of Time-Dependent Epidemic Trends. Mathematical and Statistical Estimation Approaches in Epidemiology. 2009; p. 103–121.
  13. Nishiura H, Linton NM, Akhmetzhanov AR. Serial interval of novel coronavirus (COVID-19) infections. International Journal of Infectious Diseases. 2020;93:284–286. pmid:32145466
  14. Backer JA, Eggink D, Andeweg SP, Veldhuijzen IK, van Maarseveen N, Vermaas K, et al. Shorter serial intervals in SARS-CoV-2 cases with Omicron BA.1 variant compared with Delta variant, the Netherlands, 13 to 26 December 2021. Eurosurveillance. 2022;27:2200042. pmid:35144721
  15. Hawkes AG. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society Series B (Methodological). 1971;33:438–443.
  16. Hawkes AG. Spectra of some self-exciting and mutually exciting point processes. Biometrika. 1971;58:83–90.
  17. Hawkes AG, Oakes D. A cluster process representation of a self-exciting process. Journal of Applied Probability. 1974;11:493–503.