Abstract
Given a sequence of epidemic events, can a single epidemic model capture its dynamics during the entire period? How should we divide the sequence into segments to better capture the dynamics? Throughout human history, infectious diseases (e.g., the Black Death and COVID-19) have been serious threats. Consequently, understanding and forecasting the evolving patterns of epidemic events are critical for prevention and decision making. To this end, epidemic models based on ordinary differential equations (ODEs), which effectively describe dynamic systems in many fields, have been employed. However, a single epidemic model is not enough to capture long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors (e.g., lockdown and the capability to perform tests). In this work, we demonstrate that properly dividing the event sequence regarding COVID-19 (specifically, the numbers of active cases, recoveries, and deaths) into multiple segments and fitting a simple epidemic model to each segment leads to a better fit with fewer parameters than fitting a complex model to the entire sequence. Moreover, we propose a methodology for balancing the number of segments and the complexity of epidemic models, based on the Minimum Description Length principle. Our methodology is (a) Automatic: not requiring any user-defined parameters, (b) Model-agnostic: applicable to any ODE-based epidemic models, and (c) Effective: effectively describing and forecasting the spread of COVID-19 in 70 countries.
Citation: Lee G, Yoon S-e, Shin K (2022) Simple epidemic models with segmentation can be better than complex ones. PLoS ONE 17(1): e0262244. https://doi.org/10.1371/journal.pone.0262244
Editor: Siew Ann Cheong, Nanyang Technological University, SINGAPORE
Received: July 14, 2021; Accepted: December 20, 2021; Published: January 12, 2022
Copyright: © 2022 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data are available from https://github.com/geonlee0325/covid_segmentation.
Funding: This research was supported by Disaster-Safety Platform Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2019M3D7A1094364), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Infectious diseases have been serious threats to global public health. They not only change the lifestyles of millions of people worldwide but also bring about dramatic changes in many areas, including economies, cultures, and ecologies. Unfortunately, the war against infectious diseases has continued throughout human history. The Black Death killed roughly a third of Europe’s population in the 1340s, and the Spanish flu of 1918 is estimated to have infected about 500 million people and killed tens of millions. Recent epidemic outbreaks of SARS, Ebola, Zika, and COVID-19 show that the war is not over yet.
Consequently, understanding and predicting epidemic spreads are important for prevention and effective decision making. How many people will be infected within a week? How will lockdowns affect the spread? To answer these questions, we require a method that is simple enough to be comprehensible but expressive enough to accurately model and predict the spread of infectious diseases.
Ordinary differential equations (ODEs) have successfully described dynamic systems in various fields, including ecology, economics, physics, and biology. ODEs have also been utilized in epidemics. Some of the earliest epidemic models, such as SIS, SIR, and SEIR, are compartment models [1]. These models divide the population into several compartments and capture patterns of dynamic changes in the sizes of the compartments over time. The dynamics are expressed as predefined ODEs, which are based on human knowledge, with tunable parameters. While these models are intuitive and simple, they often have limited expressiveness, failing to capture epidemic dynamics accurately. On the other hand, data-driven models [2, 3] aim to model and forecast co-evolving time-series data using ODEs, without relying on human knowledge. They employ latent variables and non-linear differential equations to capture complicated temporal dynamics.
Despite the development of epidemic models, describing long-term dynamics of epidemics using a single epidemic model often faces limitations due to the unpredictability and abruptness of real-world events. Indeed, various external factors may substantially change the dynamics of epidemic events. For example, policies reducing contacts between individuals (e.g., lockdown) and the capability to perform tests can significantly affect the dynamics.
In this work, we demonstrate that properly dividing an epidemic event sequence into multiple segments and fitting a simple epidemic model to each segment greatly helps describe and predict the epidemic propagation concisely and accurately. For example, in Fig 1(a) and 1(b), the entire sequence of events regarding COVID-19 in Italy is fitted to two epidemic models with different numbers of parameters. On the other hand, in Fig 1(c), the sequence is split into multiple segments, and then a simple model is fitted to each segment. As seen in Fig 1(d), the segmentation leads to 8.09× smaller fitting error with fewer parameters than using a single model for the entire sequence.
Fig 1. Dividing the event sequence (i.e., the numbers of active cases, recoveries, and deaths) properly into multiple segments and fitting a simple epidemic model to each segment leads to a more concise model with a better fit than fitting a complex model to the entire period. See the experiment section for details.
Then the following questions naturally arise: Given a sequence of epidemic events, where should we divide it? How many segments should we divide it into? We propose a segmentation scheme that greedily decides where to split. It also decides the number of segments by balancing the fitting error and the sizes of the models for all segments, based on the Minimum Description Length (MDL) principle.
We validate our approach using event sequences regarding the recent coronavirus disease 2019 (COVID-19), specifically the numbers of active cases, recoveries, and deaths in 70 countries. COVID-19 was recognized as a pandemic by the World Health Organization. By early April 2021, 129 million confirmed cases and 2.8 million deaths were reported worldwide. Our experiments reveal that our segmentation scheme enhances three epidemic models in explaining and predicting the propagation of COVID-19.
The strengths of our approach are summarized as follows:
- Automatic: It does not require any user-defined parameters, such as the number of segments.
- Model-agnostic: It is applicable to any ODE-based epidemic models without being restricted to certain models.
- Effective: Applied to the COVID-19 datasets, it significantly reduces the fitting error (up to 14.29× with fewer parameters) and forecasting error (up to 31.54×) of three epidemic models.
Using the proposed segmentation methodology, we expect that real-world surveillance services for COVID-19 can be assisted in the following ways:
- Policy verification: The point where the segmentation occurs indicates the rapid changes in dynamics, which policymakers should be aware of. Thus, segmentation of the sequence assists examining the impact of policies (e.g., lockdowns or mandatory mask-wearing) after they are deployed.
- Future prediction: As shown in the experiments section, future epidemics can be estimated more accurately using our segmentation scheme. Accurate prediction can improve social policy decisions.
Reproducibility: The code and datasets used in the paper are available at https://github.com/geonlee0325/covid_segmentation.
2 Related work
We briefly review previous work on two related topics: epidemic models and time-series analysis models.
2.1 Epidemic models
A variety of epidemic models have been proposed to understand and predict the spread of infectious diseases [4]. In the SI model, the population is divided into two different groups: susceptible and infectious; and the size of each group changes based on predefined differential equations. Taking realistic conditions, such as reinfection, recovery, immunity, population change, and exposure, into consideration, the SI model has been extended to SIS, SIR [5], SIRS [6], SIRD [7], SEIR [8], and many more. The spread of COVID-19 has been analyzed using modified SIRs: Li et al. [9] take human mobility into account, and Dandekar et al. [10] consider quarantine controls. These models are intuitive, explainable, and simple since they are based on human knowledge. However, they show weakness in capturing long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors.
2.2 Time-series analysis models
Mining and modeling time-series data is a building block of many analytical and predictive tasks, such as pattern discovery [11, 12], disaggregation [13], and forecasting [2, 3, 14, 15], in a variety of fields, including social media [16, 17], web [14], and medical science [18]. In particular, ordinary differential equations (ODEs) have attracted much attention due to their simplicity and expressiveness, and several studies focus on learning ODEs from data [19–22]. Recently, Chen et al. [19] introduced a generative model that solves ODEs using neural networks.
There have been several studies on learning to segment temporal data. Most of them [2, 3, 15, 23] focus on detecting repetitive patterns in activities (e.g., sensor data and motion events), while we focus on segmenting epidemic data, where dynamics suddenly change due to external factors, eventually better modeling and forecasting the spread of COVID-19.
Recently, Jiang et al. [24, 25] proposed piecewise linear quantile models that detect multiple change-points, where an SN-based test statistic is above a properly chosen threshold, for capturing the ever-changing growth rate of daily new cases of COVID-19. Note that our segmentation scheme has two distinct advantages over those used in these models: (a) automatic: it does not require any prior hyperparameters and (b) model-agnostic: it is applicable to any ODE-based epidemic model, including non-linear fitting models. Our segmentation scheme belongs to the class of binary segmentation [26]. While existing binary segmentation schemes are known to cause loss when detecting non-monotonic changes [27, 28], we demonstrate that our MDL-based segmentation scheme accurately divides the sequences and fits a model to each segment. Specifically, as shown in the experiment section, our segmentation scheme detects splitting points 3.59× more accurately and leads to 3.23× smaller fitting error (with the same number of parameters) than the non-binary segmentation method inspired by [2].
3 Preliminaries
In this section, we introduce some notations and three main epidemic models that are used in the paper. Refer to Table 1 for the frequently-used notations. We first review the Susceptible-Infectious-Recovered (SIR) model, which is one of the most classical compartment models. Then, we introduce two latent dynamics models that are based on linear and non-linear dynamics of latent variables.
3.1 Susceptible-Infectious-Recovered (SIR) model
The SIR model is one of the most classical epidemic models. Given a closed population of P individuals, each individual is assigned to one of three states: S (susceptible), I (infectious), and R (recovered). Here, we use S(t), I(t), and R(t) to denote the numbers of individuals in the three states, respectively, at timestamp t. The model assumes that each individual goes through two types of transitions: infection and recovery. That is, the state to which an individual belongs changes from S to I and then from I to R. Additionally, the model assumes that the probability of a susceptible individual becoming infected at each time t is proportional to the number of infected individuals with a coefficient β, and that the probability of an infected individual recovering at each time t is γ. These dynamics can be expressed as the following three differential equations, where β and γ are model parameters:
$$\frac{dS(t)}{dt} = -\beta S(t) I(t), \qquad \frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t), \qquad \frac{dR(t)}{dt} = \gamma I(t).$$
Note that these equations imply S(t) + I(t) + R(t) = P, since the three derivatives sum to zero.
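As an illustration of these dynamics, the following minimal sketch integrates the SIR model with forward-Euler steps, assuming that a susceptible individual is infected at rate β·I(t) and an infected individual recovers at rate γ, as described above. It is not the authors' implementation; the function names, step size, and parameter values are hypothetical and chosen only for illustration.

```python
def sir_step(s, i, r, beta, gamma, dt):
    """Advance the SIR state by one forward-Euler step of size dt."""
    new_infections = beta * s * i * dt   # flow S -> I
    new_recoveries = gamma * i * dt      # flow I -> R
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)


def simulate_sir(s0, i0, r0, beta, gamma, days, steps_per_day=100):
    """Return one (S, I, R) tuple per day, including day 0."""
    dt = 1.0 / steps_per_day
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            s, i, r = sir_step(s, i, r, beta, gamma, dt)
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical values: population of one million, ten initial infections.
P = 1_000_000
trajectory = simulate_sir(s0=P - 10, i0=10, r0=0,
                          beta=0.3 / P, gamma=0.1, days=60)
```

Because every step moves individuals between compartments without creating or removing any, the sum S(t) + I(t) + R(t) stays equal to P up to floating-point rounding.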
3.2 Non-Linear Latent Dynamics (NLLD) model
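This model [2] consists of two multi-dimensional event sequences: a k-dimensional latent (i.e., unobservable) event sequence w(t) and a d-dimensional observable event sequence v(t). The observed events v(t) are assumed to be determined by the following non-linear dynamical system of the latent factors w(t):
$$\frac{d\mathbf{w}(t)}{dt} = \mathbf{p} + \mathbf{Q}\,\mathbf{w}(t) + \mathbf{A} \odot \mathbf{w}(t) \odot \mathbf{w}(t) \tag{1}$$
$$\mathbf{v}(t) = \mathbf{u} + \mathbf{V}\,\mathbf{w}(t) \tag{2}$$
where ⊙ denotes the Hadamard product (i.e., the elementwise product); and $\mathbf{p} \in \mathbb{R}^{k}$, $\mathbf{Q} \in \mathbb{R}^{k \times k}$, and $\mathbf{A} \in \mathbb{R}^{k}$ describe the linear, exponential, and non-linear dynamics between latent factors. In addition, $\mathbf{u} \in \mathbb{R}^{d}$ and $\mathbf{V} \in \mathbb{R}^{d \times k}$ are used to project the latent factors to the observed events. The model parameters are p, Q, A, u, V, and the initial condition w(0) = w0 of the latent factors.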
3.3 Linear Latent Dynamics (LLD) model
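We also consider a special case of the NLLD model, where the d-dimensional observed event sequence v(t) is assumed to be determined by the following linear dynamical system of k-dimensional latent factors w(t):
$$\frac{d\mathbf{w}(t)}{dt} = \mathbf{p} + \mathbf{Q}\,\mathbf{w}(t), \qquad \mathbf{v}(t) = \mathbf{u} + \mathbf{V}\,\mathbf{w}(t).$$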
The NLLD and LLD models can naturally be used as epidemic models if we regard I(t) and R(t) (i.e., the numbers of infected and recovered individuals) in the SIR model as the 2-dimensional observed event sequence v(t). Unlike the SIR model, the latent dynamics models are fully data driven, and thus they capture the temporal patterns in the event sequences without any prior knowledge of epidemics. Moreover, they describe the dynamics of the observed events using latent factors, which are not directly observed. Many real-world events are known to be largely affected by latent factors, and as shown in the experiment section, the latent dynamics models predict the spread of COVID-19 significantly more accurately than the SIR model.
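To make the latent dynamics models concrete, the sketch below rolls out k latent factors with forward-Euler steps and projects them onto d observed series through u and V. It is only an illustration under stated assumptions: the linear part (p, Q, u, V) follows the description of the LLD model, while the optional argument `a` is an assumed elementwise quadratic term with k entries standing in for the non-linear parameter A. All function names and parameter values are hypothetical.

```python
def latent_rhs(w, p, Q, a=None):
    """dw/dt = p + Q w, plus a[i] * w[i]**2 per factor if `a` is given."""
    k = len(w)
    dw = [p[i] + sum(Q[i][j] * w[j] for j in range(k)) for i in range(k)]
    if a is not None:  # assumed elementwise quadratic (NLLD-style) term
        dw = [dw[i] + a[i] * w[i] * w[i] for i in range(k)]
    return dw


def observe(w, u, V):
    """v = u + V w projects the k latent factors onto d observed values."""
    return [u[r] + sum(V[r][j] * w[j] for j in range(len(w)))
            for r in range(len(u))]


def simulate_latent(w0, p, Q, u, V, a=None, steps=100, dt=0.01):
    """Forward-Euler rollout; returns the observed sequence v(t)."""
    w = list(w0)
    observed = [observe(w, u, V)]
    for _ in range(steps):
        dw = latent_rhs(w, p, Q, a)
        w = [w[i] + dt * dw[i] for i in range(len(w))]
        observed.append(observe(w, u, V))
    return observed


# Hypothetical parameters: k = 2 latent factors, d = 2 observed series.
lld = simulate_latent(w0=[1.0, 0.0], p=[0.0, 0.0],
                      Q=[[-0.5, 0.0], [0.5, 0.0]],
                      u=[0.0, 0.0], V=[[1.0, 0.0], [0.0, 1.0]])
```

In this toy setting the first factor decays while the second grows by the same amount, loosely mimicking a flow from one compartment to another without any epidemiological knowledge being built into the model.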
4 Proposed method
In this section, we present our approach for deciding the number of segments and their locations automatically without user-defined parameters. We first define the description length of an event sequence. Then, based on the definition, we describe how we adapt the Minimum Description Length (MDL) principle to evaluate segmentation. Then, we propose a search algorithm for finding the best segmentation.
4.1 Description length
Given a sequence X and a model M, the description length (in bits) of X, denoted by Cost(X), is defined as:
$$\mathrm{Cost}(X) = \mathrm{Cost}(M) + \mathrm{Cost}(X \mid M)$$
where the model cost Cost(M) is the number of bits required to describe the model M, and the data cost Cost(X|M) is the number of bits to encode X given M. The model cost and the data cost are described below.
4.1.1 Model cost.
To measure the model cost Cost(M), we examine the parameters of the model M and their sizes in bits. Below, we consider the three aforementioned epidemic models. Note that the model cost of any other models can be measured in a similar way.
- SIR Model: The infection rate β and the recovery rate γ are two real numbers, and encoding each requires C_F bits (we set C_F to 8 by convention). We ignore the cost required to encode the population P since it is required only once regardless of the number of segments. Thus, the model cost in bits required to describe the SIR model is:
$$\mathrm{Cost}(M_{\mathrm{SIR}}) = 2\,C_F$$
- Non-linear Latent Dynamics (NLLD) Model: This model is described by a set of six parameters: w0, p, Q, A, u, and V (see Eqs (1) and (2)). They contain k, k, k², k, d, and kd real-valued parameters, respectively. Thus, the model cost in bits required to describe the NLLD model is:
$$\mathrm{Cost}(M_{\mathrm{NLLD}}) = C_F \cdot (k + k + k^2 + k + d + kd) \tag{3}$$
- Linear Latent Dynamics (LLD) Model: The model cost required by the LLD model is:
$$\mathrm{Cost}(M_{\mathrm{LLD}}) = C_F \cdot (k + k + k^2 + d + kd)$$
Note that this is Eq (3) minus the cost in bits required to encode A.
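The parameter counts above translate directly into a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the per-parameter cost of 8 bits is the value stated in the text.

```python
C_F = 8  # bits per real-valued parameter, as stated in the text


def sir_cost(c_f=C_F):
    """Two real parameters: beta and gamma."""
    return 2 * c_f


def nlld_cost(k, d, c_f=C_F):
    """w0, p, Q, A, u, V hold k, k, k^2, k, d, and k*d reals, respectively."""
    return c_f * (k + k + k * k + k + d + k * d)


def lld_cost(k, d, c_f=C_F):
    """Same as NLLD but without the k reals of A."""
    return nlld_cost(k, d, c_f) - c_f * k


print(sir_cost(), lld_cost(k=2, d=2), nlld_cost(k=2, d=2))  # 16 112 128
```

With k = 2 latent factors and d = 2 observed series, a single SIR segment costs 16 bits, whereas a single LLD or NLLD segment costs 112 or 128 bits, so several SIR segments can be encoded for the price of one latent dynamics model.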
Algorithm 1: Segment: MDL-based Greedy Segmentation Search
Input: (1) epidemic event stream X1:n
(2) epidemic model solver f
Output: segmented event stream
1 if n ≤ 2 then return X1:n ⊳ Base Case
2 C ← Cost(f(X1:n)) + Cost(X1:n|f(X1:n))
3 i* ← argmin over i ∈ {2, ⋯, n − 2} of Cost(X1:i ⊕ Xi+1:n) ⊳ Eq (4)
4 C* ← Cost(X1:i* ⊕ Xi*+1:n)
5 if C* ≥ C then return X1:n
6 else return Segment(X1:i*, f) ⊕ Segment(Xi*+1:n, f) ⊳ Recursive Calls
4.1.2 Data cost.
The data cost Cost(X|M) is the number of bits required to describe X given M. We assume Huffman coding [29] for encoding the difference between the observed event sequence X and the event sequence V estimated by the model M. Then, the number of bits required is the negative log-likelihood under a Gaussian distribution with standard deviation σ, as follows:
$$\mathrm{Cost}(X \mid M) = -\sum_{t}\sum_{i=1}^{d} \log_2 p_{\mathrm{Gauss}(\sigma)}\big(x_i(t) - v_i(t)\big)$$
where xi(t) and vi(t) are the i-th dimension of actual and estimated events at time t. We fix σ to the standard deviation of the elements of X − V during the period of each segment.
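The data cost can be sketched in a few lines. The version below is a hedged illustration rather than the authors' code: it assumes a zero-mean Gaussian over the residuals X − V, sets σ to the standard deviation of all residuals in the segment as described above, and adds a small floor on σ (an assumption of this sketch) to avoid taking the logarithm of zero for a perfect fit.

```python
import math


def data_cost_bits(actual, estimated):
    """Negative log2-likelihood of the residuals under a zero-mean Gaussian.

    `actual` and `estimated` are equal-length lists of d-dimensional points.
    sigma is the standard deviation of all residuals in the segment.
    """
    residuals = [x - v
                 for xs, vs in zip(actual, estimated)
                 for x, v in zip(xs, vs)]
    mean = sum(residuals) / len(residuals)
    sigma = math.sqrt(sum((e - mean) ** 2 for e in residuals) / len(residuals))
    sigma = max(sigma, 1e-12)  # assumption: guard against a perfect fit
    bits = 0.0
    for e in residuals:
        log_density = (-0.5 * math.log(2 * math.pi * sigma ** 2)
                       - e ** 2 / (2 * sigma ** 2))
        bits -= log_density / math.log(2)  # convert nats to bits
    return bits


# Hypothetical two-dimensional observations and two candidate estimates.
actual = [[1.0, 2.0], [3.0, 4.0]]
good = data_cost_bits(actual, [[1.1, 1.9], [2.9, 4.1]])
bad = data_cost_bits(actual, [[2.0, 1.0], [4.0, 3.0]])
```

Because a density rather than a discretized probability is used, the value can be negative for very small residuals; in this sketch only the comparison between candidate fits matters, and the closer estimate yields the smaller cost.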
4.1.2.1 Optimization. In order to fit M to X, we use the Levenberg-Marquardt (LM) algorithm to minimize the mean squared error between the given data sequence and the estimated sequence. Specifically, the LM algorithm uses a damping parameter to adaptively interpolate its parameter updates between the Gauss-Newton update and the gradient descent update. The lmfit library used in our implementation takes two arguments, xtol and ftol, which are the desired relative errors in the approximate solution and in the sum of squares, respectively. That is, termination occurs (a) when the relative error between two consecutive iterates is at most xtol or (b) when both the actual and predicted relative reductions in the sum of squares are at most ftol. However, as discussed in Section 5.5.1, our segmentation scheme is insensitive to these arguments, and thus we use the same values throughout the experiments. For the NLLD model, we split the parameters into the linear parameter set (p, Q, u, and V) and the non-linear parameter set (A) and optimize them separately using the expectation-maximization (EM) algorithm, as suggested in [2]. In practice, this accelerates convergence compared to optimizing all parameters simultaneously.
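For reference, the interpolation performed by the LM algorithm can be written in its textbook form. The symbols below are generic and not taken from the paper: θ denotes the model parameters, r(θ) the vector of residuals between the data and the model estimates, J the Jacobian of r with respect to θ, and λ ≥ 0 the damping parameter.

```latex
\left(J^{\top} J + \lambda I\right)\,\delta = -J^{\top} r(\theta),
\qquad
\theta \leftarrow \theta + \delta
```

As λ approaches 0, the step δ reduces to the Gauss-Newton update, whereas for large λ it approaches −(1/λ) Jᵀ r(θ), a short gradient-descent step on the sum of squared residuals.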
4.2 Segmentation evaluation
We adapt the Minimum Description Length (MDL) principle [30] for segmentation evaluation. Consider an event sequence X (= X1:n) and a solver f of an epidemic model. We denote the division of X into r segments, where the i-th segment starts at time si and ends at time ei, by
$$X_{s_1:e_1} \oplus X_{s_2:e_2} \oplus \cdots \oplus X_{s_r:e_r},$$
where s1 = 1, er = n, and ei + 1 = si+1 for each i ∈ {1, ⋯, r − 1}. Let f(Xi:j) be the epidemic model fitted to the segment Xi:j. Then, the description length in bits of this segmentation is:
$$\mathrm{Cost}(X_{s_1:e_1} \oplus \cdots \oplus X_{s_r:e_r}) = \sum_{i=1}^{r} \Big[\mathrm{Cost}\big(f(X_{s_i:e_i})\big) + \mathrm{Cost}\big(X_{s_i:e_i} \mid f(X_{s_i:e_i})\big)\Big] + (r-1)\cdot\log_2(n) \tag{4}$$
where (r − 1) ⋅ log2(n) is the cost in bits required to encode the r − 1 splitting points (i.e., s2, ⋯, sr). Since each splitting point is a positive integer smaller than n, the number of bits required to encode it is log2(n). The description length (i.e., Eq (4)) balances the fitting error and the size of the parameters required to encode the epidemic models for all segments, and we use it to evaluate segmentation. Specifically, based on the MDL principle, we prefer the segmentation that minimizes Eq (4), and in the following subsection, we discuss how we search for such a segmentation.
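The trade-off expressed by the description length can be illustrated with a small calculation. The sketch below assumes that the model cost and data cost of each segment have already been computed; the function name and the numbers in the usage example are hypothetical.

```python
import math


def description_length(segment_costs, n):
    """Total bits for a segmentation of a length-n sequence.

    `segment_costs` holds one (model_cost, data_cost) pair per segment.
    Each of the r - 1 splitting points costs log2(n) bits.
    """
    r = len(segment_costs)
    fit_bits = sum(model + data for model, data in segment_costs)
    return fit_bits + (r - 1) * math.log2(n)


# Hypothetical costs for a sequence of length n = 256.
one_segment = description_length([(16, 900)], n=256)              # 916.0
two_segments = description_length([(16, 300), (16, 350)], n=256)  # 690.0
```

In this made-up example, the second model and the splitting point add 16 + 8 bits, but the data cost drops by 250 bits, so the two-segment description is preferred.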
4.3 Segmentation search
Given an event sequence X, how can we find the segmentation that minimizes the description length (i.e., Eq (4))? Since there are $2^{n-1}$ ways to segment a sequence of length n, naïvely trying all possible segmentations is computationally prohibitive. Thus, we propose to greedily segment the sequence, as described in Algorithm 1, throughout which we make the length of each segment at least two. Given an event sequence X1:n, we find a splitting point i* ∈ {2, ⋯, n − 2} where the description length (i.e., Eq (4)) of the corresponding segmentation is minimized (Line 3). If splitting X1:n at time i* strictly decreases the description length, we divide X1:n into X1:i* and Xi*+1:n, and then recursively divide each segment (Line 6). Otherwise, we stop segmentation (Line 5).
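The greedy search can be sketched as follows. This is a minimal illustration, not the authors' implementation: in place of an epidemic model solver f, it uses a hypothetical one-parameter stand-in (`toy_cost`, which fits the mean of a one-dimensional segment and charges 8 bits for it), and it charges log2 of the current subsequence length for a split, following the recursive form of Algorithm 1. The sketch also returns a sequence unsplit whenever it is shorter than four points, since two segments of length two would not fit.

```python
import math


def gaussian_bits(residuals):
    """Bits to encode residuals under a Gaussian with their own std."""
    n = len(residuals)
    mean = sum(residuals) / n
    sigma2 = max(sum((e - mean) ** 2 for e in residuals) / n, 1e-12)
    nats = sum(0.5 * math.log(2 * math.pi * sigma2) + e * e / (2 * sigma2)
               for e in residuals)
    return nats / math.log(2)


def toy_cost(seg, c_f=8):
    """Stand-in for Cost(f(X)) + Cost(X | f(X)): a one-parameter mean model."""
    mean = sum(seg) / len(seg)
    return c_f + gaussian_bits([x - mean for x in seg])


def greedy_segment(xs, cost=toy_cost, min_len=2):
    """Greedy MDL segmentation; returns a list of segments (lists)."""
    n = len(xs)
    if n < 2 * min_len:              # too short for two valid segments
        return [xs]
    unsplit = cost(xs)
    # Best single split: the left part is xs[:i], the right part is xs[i:].
    best_i = min(range(min_len, n - min_len + 1),
                 key=lambda i: cost(xs[:i]) + cost(xs[i:]))
    split = cost(xs[:best_i]) + cost(xs[best_i:]) + math.log2(n)
    if split >= unsplit:             # no strict decrease: stop here
        return [xs]
    return (greedy_segment(xs[:best_i], cost, min_len)
            + greedy_segment(xs[best_i:], cost, min_len))


# Toy sequence with one abrupt level change after 20 points.
xs = [10 + (-1) ** t for t in range(20)] + [50 + (-1) ** t for t in range(20)]
parts = greedy_segment(xs)
```

On this toy sequence the search splits once, at the level change, and then stops: further splits would save less than a bit of data cost while adding a second model and a splitting point. Any function that returns the combined model and data cost of a segment can be passed as `cost`, which reflects the model-agnostic nature of the scheme.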
5 Experiments
In this section, we review our experiments designed to answer the following questions:
- Q1. Effectiveness of Segmentation: Does segmentation help understand the spread of COVID-19? Does it give a better trade-off between model complexity and fitness?
- Q2. Effectiveness of our Segmentation Scheme: How well does our greedy segmentation algorithm based on the MDL principle work? Does it yield a smaller fitting error than the baseline with the same number of segments?
- Q3. Accuracy of Forecasting: Is segmentation beneficial for accurately predicting the spread of COVID-19? Is it beneficial regardless of epidemic models used?
5.1 Experimental settings
- Machines: We conducted all the experiments on a machine with AMD Ryzen 9 3900X CPU and 128GB RAM.
- Datasets: We considered the 70 countries with the most confirmed cases as of the end of March, 2021. We used the number of active cases as I(t) and the number of recoveries and deaths as R(t) in each of the 70 countries from March 1, 2020 to March 30, 2021. The dataset is publicly available at [31]. Since the number of recoveries in the US is not available, we used the number of deaths as R(t).
- Implementations: We implemented the SIR model, the LLD model, and the NLLD model in Python. We used the lmfit library for the optimization (see https://lmfit.github.io/lmfit-py/ for details).
- How to choose k: For the LLD and NLLD models, we chose the number of latent factors k between 1 and 6 so that the description length (i.e., Eq (4)) is minimized.
5.2 Q1. Effectiveness of segmentation
We measure how segmentation by Algorithm 1 affects the model complexity and fitting error of the three considered epidemic models. As seen in Fig 2, segmentation leads to significantly better trade-offs between the model cost (in bits) and the fitting error (in terms of RMSE), regardless of the epidemic model used. For example, in the India dataset, the NLLD model with segmentation yields 11.54× smaller fitting error with smaller model cost than the same model without segmentation. Fig 3 shows the input and estimated event sequences when the description length is minimized. The description length is minimized when a simple epidemic model with few latent factors is used with a sufficient number of segments. That is, simple epidemic models with segmentation provide a more concise and accurate description of the spread of COVID-19 than complex models without segmentation. The results for the other countries can be found in the supplement.
Fig 2. For the LLD and NLLD models without segmentation, k varies from 1 to 10.
Fig 3. The true and estimated event sequences when the description length in bits is minimized.
We further qualitatively analyze the splitting points detected by our segmentation scheme in the dataset collected in Japan. Specifically, in the dataset our segmentation scheme detects three splitting points: (1) May 14, 2020, (2) August 25, 2020, and (3) January 13, 2021. As shown in Fig 4, these dates coincide with the periods when the state of emergency (SOE) was declared or lifted by the Japanese Government. The result indicates that there is a close correspondence between the segmentation derived by the proposed scheme and the deployed policies.
Fig 4. Splitting points detected by our segmentation scheme coincide with the periods when the state of emergency (SOE) was declared or lifted by the Japanese Government. Note that such events happened 12 times in total during the considered period, and all of them are marked in the figure.
5.3 Q2. Effectiveness of our segmentation scheme
We demonstrate the effectiveness of our greedy segmentation scheme based on the MDL principle by comparing it with the incremental method inspired by [2]. The incremental method goes through the sequence from the start and initiates a new segment whenever the fitting error within the current segment exceeds a given threshold ϵ. As in [2], we set the threshold proportional to the L2 norm of the current segment Xc with a coefficient α. That is, ϵ = α ⋅ ||Xc||2. Note that smaller α is expected to yield more segments. As seen in Fig 5, where we fix k to 2 and vary α from 0.05 to 0.5, our proposed segmentation scheme significantly outperforms the incremental method. Specifically, our scheme gives up to 3.23× smaller fitting error with the same model cost, which is proportional to the number of segments, than the incremental segmentation. The results in the other countries can be found in the supplement.
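For comparison, the incremental baseline can be sketched as a single left-to-right scan. The sketch below is a hedged illustration, not the method of [2] itself: it uses a hypothetical constant (mean) fit on a one-dimensional sequence in place of an epidemic model, and it assumes that the fitting error is measured as the L2 norm of the residuals within the current segment.

```python
import math


def l2(values):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def mean_fit_error(seg):
    """Assumed error measure: L2 norm of residuals of a constant (mean) fit."""
    mean = sum(seg) / len(seg)
    return l2([x - mean for x in seg])


def incremental_segment(xs, alpha, fit_error=mean_fit_error):
    """Scan left to right; open a new segment once error > alpha * ||X_c||_2."""
    segments, current = [], []
    for x in xs:
        candidate = current + [x]
        if current and fit_error(candidate) > alpha * l2(candidate):
            segments.append(current)   # close the segment before x
            current = [x]
        else:
            current = candidate
    segments.append(current)
    return segments


xs = [10.0] * 5 + [50.0] * 5
coarse = incremental_segment(xs, alpha=0.9)    # one segment
fine = incremental_segment(xs, alpha=0.05)     # two segments
```

Unlike the MDL-based search, the outcome here depends on the user-chosen coefficient α: the same sequence is kept whole with a loose threshold and split at the level change with a tight one.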
Furthermore, to numerically evaluate the accuracy of the segmentation, we generate synthetic sequences with randomly selected splitting points, where each segment is generated by a different set of random parameters of the NLLD model. We carefully sample parameters based on the model parameters fitted to real-world sequences. Specifically, we sample −0.1 < p < 0.1, −0.1 < Q < 0.1, −0.001 < A < 0.001, −0.1 < u < 0.1, −1.0 < V < 1.0, and −1 < w0 < 1 uniformly at random. Then, we compare the detected splitting points (i.e., timestamps where the segmentation occurs) with the ground-truth ones by measuring F1 scores. For robust evaluation, we consider a detected splitting point correct if it is within δ time units of a ground-truth one. As shown in Table 2, splitting points detected by our segmentation scheme closely match the ground-truth splitting points, and in particular, our segmentation scheme is more accurate than the incremental method.
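One way to compute such a tolerance-based F1 score is sketched below. The matching rule is an assumption of this sketch: each ground-truth point can be matched by at most one detection, detections are processed in sorted order, and "within δ" is taken to include a distance of exactly δ. The function name and the example values are hypothetical.

```python
def split_f1(detected, truth, delta):
    """F1 score where a detection counts if within `delta` of an unmatched truth.

    Assumption: greedy one-to-one matching in sorted order.
    """
    unmatched = sorted(truth)
    hits = 0
    for point in sorted(detected):
        match = next((t for t in unmatched if abs(t - point) <= delta), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


# Two of three hypothetical detections fall within 3 time units of the truth.
score = split_f1(detected=[71, 198, 324], truth=[70, 200, 300], delta=3)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.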
5.4 Q3. Accuracy of forecasting
We examine the effect of segmentation on the accuracy of future prediction using the three considered epidemic models. To this end, we divide each sequence into a training sequence and a test sequence, which span 327 days and 37 days, respectively. Then, we fit the epidemic models to each training sequence with and without segmentation and predict the event sequence during the test period. When segmentation is applied, we ensure that the last segment is at least as long as the test period, and we use the model fitted to the last segment of the training sequence for prediction. We ensure this by modifying Algorithm 1 so that it never splits the training sequence during its last 37 days. That is, it searches for splitting points only during the first 290 days. This constraint is helpful for forecasting, as shown experimentally in Section 5.5.2. For the LLD and NLLD models without segmentation, we vary the number of latent factors k from 1 to 6.
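The constraint on the last segment amounts to restricting the set of candidate splitting points. A minimal sketch, with a hypothetical function name and the convention that a split index i means the left part consists of the first i points:

```python
def candidate_splits(n_train, protected_tail, min_len=2):
    """Split indices i (left part = first i points) allowed during training.

    No split may fall inside the last `protected_tail` points, so the final
    segment is at least that long.
    """
    last_allowed = n_train - max(protected_tail, min_len)
    return range(min_len, last_allowed + 1)


# Training period of 327 days with the last 37 days protected.
splits = candidate_splits(n_train=327, protected_tail=37)
```

With these values the candidates run from 2 to 290, so the last segment always covers at least 37 days.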
In Table 3, we compare the prediction error (in terms of RMSE) of the three epidemic models with and without segmentation. When the LLD model or the NLLD model is used, among 7 different settings, our segmentation scheme leads to the most accurate prediction in 32 and 33 (out of 70) countries, respectively. The second best one, which is the LLD model with k = 2 and no segmentation, is most accurate only in 9 countries. When the SIR model is used, segmentation increases the prediction accuracy in 70 (out of 70) countries. Moreover, prediction without segmentation is unstable with unreasonably large RMSE in some countries, while it is stable with segmentation in all countries. To sum up, segmentation tends to improve the prediction accuracy of all three considered epidemic models.
Note that with segmentation, only the last segment, not the entire sequence, is used for prediction. Despite this, segmentation increases the accuracy of prediction by letting epidemic models focus on the part that represents the current epidemic dynamics while ignoring the part before inherent changes in the dynamics.
5.5 Additional experimental results
Below, we present the results of additional experiments.
5.5.1 Insensitivity to two arguments: xtol and ftol.
For optimization, we used the lmfit library for Python, which performs non-linear least-squares minimization. The leastsq function, which we used, takes two arguments, xtol and ftol, which are the desired relative errors in the approximate solution and in the sum of squares, respectively (see https://lmfit.github.io/lmfit-py/fitting.html#lmfit.minimizer.Minimizer.leastsq for details). We tested the NLLD model on the Japan dataset using eight different values of xtol and ftol ($10^{-1}$ to $10^{-8}$) and five different numbers of latent factors k (2 to 6). In the 40 considered settings, the splitting points of the segmentation were exactly the same (the 71st, 198th, and 324th days), which implies that the proposed scheme is insensitive to these arguments. Thus, in this work, we do not tune xtol and ftol but fix them to $10^{-8}$ in all experiments in the main paper.
5.5.2 The effect of the constraint on the last segment.
One might be concerned that avoiding segmentation within the last 37 days before the test set may degrade the flexibility of the model and thus the accuracy of forecasting. Empirically, however, this constraint is helpful for accurate prediction because it prevents overfitting. Note that if the last segment is too short, overfitting easily occurs, resulting in a large generalization (i.e., prediction) error. In order to demonstrate the effect of the constraint, we compared the forecasting errors of the NLLD model with (our original setting) and without the constraint in 70 countries. As shown in Fig 6, without the constraint, NLLD greatly overestimated the numbers of infected and recovered individuals in some countries (specifically, Lebanon and Lithuania). It should be noted that the estimates were even larger than the populations of these countries. On the other hand, the constraint helped prevent such absurd predictions; specifically, NLLD with the constraint always made predictions within the populations of the countries. In addition, NLLD with the constraint outperformed NLLD without it in 39 of the 70 countries. The average forecasting error (in terms of RMSE) was also smaller when adopting the constraint. Specifically, it was 94.3 with the constraint and 116.3 without the constraint (averaged over only the 68 countries with reasonable results).
Fig 6. NLLD without the constraint sometimes greatly overestimates the numbers of infected and recovered individuals, and the constraint helps prevent such absurd predictions. Moreover, the constraint was beneficial in 39 countries out of the 70 countries. The countries are indexed in the order of the forecasting error of NLLD with the constraint.
6 Conclusions
In this work, we propose to divide epidemic event sequences into multiple segments and fit a simple model to each segment. To this end, we propose a greedy algorithm based on the MDL principle to decide where to split the sequences. Through extensive experiments using the COVID-19 event sequences from 70 countries, we demonstrate that our methodology has the following advantages:
- Automatic: All parameters are tuned automatically based on the MDL principle without relying on users.
- Model-agnostic: Any ODE-based epidemic models can be used with our segmentation scheme.
- Effective: The fitting error and prediction error of three epidemic models decrease up to 14.29× and 31.54×, respectively, with our segmentation scheme.
Reproducibility: The code and datasets used in the paper are available at https://github.com/geonlee0325/covid_segmentation.
References
- 1. Hethcote HW, Stech HW, van den Driessche P. “Periodicity and stability in epidemic models: a survey,” In: Differential equations and applications in ecology, epidemics, and population problems. Elsevier; 1981. p. 65–82.
- 2. Matsubara Y, Sakurai Y. “Regime shifts in streams: Real-time forecasting of co-evolving time sequences,” In: KDD; 2016.
- 3. Matsubara Y, Sakurai Y. “Dynamic modeling and forecasting of time-evolving data streams,” In: KDD; 2019.
- 4.
Anderson RM, Anderson B, May RM. “Infectious diseases of humans: dynamics and control,” Oxford university press; 1992.
- 5. Antulov-Fantulin Nino and Lančić Alen and Štefančić Hrvoje and Šikić Mile. “FastSIR algorithm: A fast algorithm for the simulation of the epidemic spread in large networks by using the susceptible–infected–recovered compartment model,” Information Sciences. 2013;239:226–240.
- 6. Guo W, Zhang Q, Rong L. “A stochastic epidemic model with nonmonotone incidence rate: Sufficient and necessary conditions for near-optimality,” Information Sciences. 2018;467:670–684.
- 7. Osemwinyen AC, Diakhaby A. “Mathematical modelling of the transmission dynamics of ebola virus,” Applied and Computational Mathematics. 2015;4(4):313–320.
- 8. Fang H, Chen J, Hu J. “Modelling the SARS epidemic by a lattice-based Monte-Carlo simulation,” In: EMB; 2006.
- 9. Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. “Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus(SARS-CoV-2),” Science. 2020;368(6490):489–493. pmid:32179701
- 10. Dandekar R, Barbastathis G. “Quantifying the effect of quarantine control in Covid-19 infectious spread using machine learning,” medRxiv. 2020.
- 11. Papadimitriou S, Sun J, Faloutsos C. “Streaming pattern discovery in multiple time-series,” In: VLDB; 2005.
- 12. Papadimitriou S, Yu P. “Optimal multi-scale patterns in time series streams,” In: SIGMOD; 2006.
- 13. Yang F, Song HA, Liu Z, Faloutsos C, Zadorozhny V, Sidiropoulos N. “Ares: automatic disaggregation of historical data,” In: ICDE; 2018.
- 14. Matsubara Y, Sakurai Y, Faloutsos C. “The web as a jungle: Non-linear dynamical systems for co-evolving online activities,” In: WWW; 2015.
- 15. Hooi B, Liu S, Smailagic A, Faloutsos C. “BeatLex: Summarizing and Forecasting Time Series with Patterns,” In: ECML-PKDD; 2017.
- 16. Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C. “Rise and fall patterns of information diffusion: model and implications,” In: KDD; 2012.
- 17. Mathioudakis M, Koudas N, Marbach P. “Early online identification of attention gathering items in social media,” In: WSDM; 2010.
- 18. Davidson I, Gilpin S, Carmichael O, Walker P. “Network discovery via constrained tensor analysis of fmri data,” In: KDD; 2013.
- 19. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK. “Neural ordinary differential equations,” In: NeurIPS; 2018.
- 20. Raissi M, Karniadakis GE. “Hidden physics models: Machine learning of nonlinear partial differential equations,” Journal of Computational Physics. 2018;357:125–141.
- 21. Schober M, Duvenaud DK, Hennig P. “Probabilistic ODE solvers with Runge-Kutta means,” In: NIPS; 2014.
- 22.
Raissi M, Perdikaris P, Karniadakis GE. “Numerical Gaussian processes for time-dependent and non-linear partial differential equations,” arXiv preprintarXiv:170310230. 2017.
- 23. Matsubara Y, Sakurai Y, Faloutsos C. “Autoplait: Automatic mining of co-evolving time sequences,” In: SIGMOD; 2014.
- 24. Jiang F, Zhao Z, Shao X. “Time series analysis of COVID-19 infection curve: A change-point perspective,” Journal of Econometrics. 2020. pmid:32836681
- 25. Jiang F, Zhao Z, Shao X. “Modelling the COVID-19 infection trajectory: A piecewise linear quantile trend model,” Journal of the Royal Statistical Society:Series B (Statistical Methodology). 2021.
- 26. Scott AJ, Knott M. “A cluster analysis method for grouping means in the analysis of variance,” Biometrics. 1974; p. 507–512.
- 27. Baranowski R, Chen Y, Fryzlewicz P. “Narrowest-over-threshold detection of multiple change points and change-point-like features,” Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2019;81(3):649–672.
- 28. Olshen AB, Venkatraman E, Lucito R, Wigler M. “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics.2004;5(4):557–572. pmid:15475419
- 29. Böhm Christian and Faloutsos Christos and Pan Jia-Yu and Plant Claudia. “Ric: Parameter-free noise-robust clustering,” TKDD. 2007;1(3):10–es.
- 30. Rissanen J. “Modeling by shortest data description,” Automatica. 1978;14(5):465–471
- 31. Rajkumar S. “Novel Corona Virus 2019 Dataset. Day level information on covid-19 affected cases,” 2020. Available from: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.