Simple epidemic models with segmentation can be better than complex ones

Geon Lee; Se-eun Yoon; Kijung Shin

doi:10.1371/journal.pone.0262244

Abstract

Given a sequence of epidemic events, can a single epidemic model capture its dynamics during the entire period? How should we divide the sequence into segments to better capture the dynamics? Throughout human history, infectious diseases (e.g., the Black Death and COVID-19) have been serious threats. Consequently, understanding and forecasting the evolving patterns of epidemic events are critical for prevention and decision making. To this end, epidemic models based on ordinary differential equations (ODEs), which effectively describe dynamic systems in many fields, have been employed. However, a single epidemic model is not enough to capture long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors (e.g., lockdown and the capability to perform tests). In this work, we demonstrate that properly dividing the event sequence regarding COVID-19 (specifically, the numbers of active cases, recoveries, and deaths) into multiple segments and fitting a simple epidemic model to each segment leads to a better fit with fewer parameters than fitting a complex model to the entire sequence. Moreover, we propose a methodology for balancing the number of segments and the complexity of epidemic models, based on the Minimum Description Length principle. Our methodology is (a) Automatic: not requiring any user-defined parameters, (b) Model-agnostic: applicable to any ODE-based epidemic models, and (c) Effective: effectively describing and forecasting the spread of COVID-19 in 70 countries.

Citation: Lee G, Yoon S-e, Shin K (2022) Simple epidemic models with segmentation can be better than complex ones. PLoS ONE 17(1): e0262244. https://doi.org/10.1371/journal.pone.0262244

Editor: Siew Ann Cheong, Nanyang Technological University, SINGAPORE

Received: July 14, 2021; Accepted: December 20, 2021; Published: January 12, 2022

Copyright: © 2022 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data are available from https://github.com/geonlee0325/covid_segmentation.

Funding: This research was supported by Disaster-Safety Platform Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. 2019M3D7A1094364), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Infectious diseases have been serious threats to global public health. They not only change the lifestyles of millions of people worldwide but also bring about dramatic changes in many areas, including economies, cultures, and ecologies. Unfortunately, the war against infectious diseases has continued throughout human history. The Black Death killed roughly a third of Europe’s population in the 1340s, and the Spanish flu of 1918 is estimated to have infected about 500 million people and killed tens of millions. Recent epidemic outbreaks of SARS, Ebola, Zika, and COVID-19 show that the war is not over yet.

Consequently, understanding and predicting epidemic spreads are important for prevention and effective decision making. How many people will be infected within a week? How will lockdowns affect the spread? To answer these questions, we require a method that is simple enough to be comprehensible but expressive enough to accurately model and predict the spread of infectious diseases.

Ordinary differential equations (ODEs) have successfully described dynamic systems in various fields, including ecology, economics, physics, and biology. ODEs have also been utilized in epidemics. Some of the earliest epidemic models, such as SIS, SIR, and SEIR, are compartment models [1]. These models divide the population into several compartments and capture patterns of dynamic changes in the sizes of the compartments over time. The dynamics are expressed as predefined ODEs, which are based on human knowledge, with tunable parameters. While these models are intuitive and simple, they often have limited expressiveness, failing to capture epidemic dynamics accurately. On the other hand, data-driven models [2, 3] aim to model and forecast co-evolving time-series data using ODEs, without relying on human knowledge. They employ latent variables and non-linear differential equations to capture complicated temporal dynamics.

Despite the development of epidemic models, describing long-term dynamics of epidemics using a single epidemic model often faces limitations due to the unpredictability and abruptness of real-world events. Indeed, various external factors may substantially change the dynamics of epidemic events. For example, policies reducing contacts between individuals (e.g., lockdown) and the capability to perform tests can significantly affect the dynamics.

In this work, we demonstrate that properly dividing an epidemic event sequence into multiple segments and fitting a simple epidemic model to each segment greatly helps describe and predict the epidemic propagation concisely and accurately. For example, in Fig 1(a) and 1(b), the entire sequence of events regarding COVID-19 in Italy is fitted to two epidemic models with different numbers of parameters. On the other hand, in Fig 1(c), the sequence is split into multiple segments, and then a simple model is fitted to each segment. As seen in Fig 1(d), the segmentation leads to 8.09× smaller fitting error with fewer parameters than using a single model for the entire sequence.

Fig 1. Proper segmentation helps concisely and accurately describe the spread of COVID-19 in Italy.

Dividing the event sequence (i.e., the numbers of active cases, recoveries, and deaths) properly into multiple segments and fitting a simple epidemic model to each segment leads to a more concise model with a better fit than fitting a complex model to the entire period. See the experiment section for details.

https://doi.org/10.1371/journal.pone.0262244.g001

Then the following questions naturally arise: Given a sequence of epidemic events, where should we divide it? How many segments should we divide it into? We propose a segmentation scheme that greedily decides where to split. It also decides the number of segments by balancing the fitting error and the sizes of the models for all segments, based on the Minimum Description Length (MDL) principle.

We validate our approach using event sequences regarding recent Coronavirus Disease-19 (COVID-19), specifically the numbers of active cases, recoveries, and deaths in 70 countries. COVID-19 was recognized as a pandemic by the World Health Organization. By early April 2021, 129 million confirmed cases and 2.8 million deaths were reported worldwide. Our experiments reveal that our segmentation scheme enhances three epidemic models in explaining and predicting the propagation of COVID-19.

The strengths of our approach are summarized as follows:

Automatic: It does not require any user-defined parameters, such as the number of segments.
Model-agnostic: It is applicable to any ODE-based epidemic models without being restricted to certain models.
Effective: Applied to the COVID-19 datasets, it significantly reduces the fitting error (up to 14.29× with fewer parameters) and forecasting error (up to 31.54×) of three epidemic models.

Using the proposed segmentation methodology, we expect that real-world surveillance services for COVID-19 can be assisted in the following ways:

Policy verification: The points where segmentation occurs indicate rapid changes in the dynamics, which policymakers should be aware of. Thus, segmentation of the sequence assists in examining the impact of policies (e.g., lockdowns or mandatory mask-wearing) after they are deployed.
Future prediction: As shown in the experiments section, future epidemics can be estimated more accurately using our segmentation scheme. Accurate prediction can improve social policy decisions.

Reproducibility: The code and datasets used in the paper are available at https://github.com/geonlee0325/covid_segmentation.

2 Related work

We briefly review previous work on two related topics: epidemic models and time-series analysis models.

2.1 Epidemic models

A variety of epidemic models have been proposed to understand and predict the spread of infectious diseases [4]. In the SI model, the population is divided into two different groups: susceptible and infectious; and the size of each group changes based on predefined differential equations. Taking realistic conditions, such as reinfection, recovery, immunity, population change, and exposure, into consideration, the SI model has been extended to SIS, SIR [5], SIRS [6], SIRD [7], SEIR [8], and many more. The spread of COVID-19 has been analyzed using modified SIRs: Li et al. [9] take human mobility into account, and Dandekar et al. [10] consider quarantine controls. These models are intuitive, explainable, and simple since they are based on human knowledge. However, they show weakness in capturing long-term dynamics of epidemic events especially when the dynamics heavily depend on external factors.

2.2 Time-series analysis models

Mining and modeling time-series data is a building block of many analytical and predictive tasks, such as pattern discovery [11, 12], disaggregation [13], and forecasting [2, 3, 14, 15], in a variety of fields, including social media [16, 17], web [14], and medical science [18]. In particular, ordinary differential equations (ODEs) have attracted much attention due to their simplicity and expressiveness, and several studies focus on learning ODEs from data [19–22]. Recently, Chen et al. [19] introduced a generative model to solve ODEs using neural networks.

There have been several studies on learning to segment temporal data. Most of them [2, 3, 15, 23] focus on detecting repetitive patterns in activities (e.g., sensor data and motion events), while we focus on segmenting epidemic data, where dynamics suddenly change due to external factors, eventually better modeling and forecasting the spread of COVID-19.

Recently, Jiang et al. [24, 25] proposed piecewise linear quantile models that detect multiple change-points, where an SN-based test statistic is above a properly chosen threshold, to capture the ever-changing growth rate of daily new cases of COVID-19. Note that our segmentation scheme has two distinct advantages over those used in these models: (a) automatic: it does not require any prior hyperparameters and (b) model-agnostic: it is applicable to any ODE-based epidemic model, including non-linear fitting models. Our segmentation scheme belongs to the class of binary segmentation [26]. While existing binary segmentation schemes are known to cause loss when detecting non-monotonic changes [27, 28], we demonstrate that our MDL-based segmentation scheme accurately divides the sequences and fits a model to each segment. Specifically, as shown in the experiment section, our segmentation scheme detects splitting points 3.59× more accurately and leads to 3.23× smaller fitting error (with the same number of parameters) than the non-binary segmentation method inspired by [2].

3 Preliminaries

In this section, we introduce some notations and three main epidemic models that are used in the paper. Refer to Table 1 for the frequently-used notations. We first review the Susceptible-Infectious-Recovered (SIR) model, which is one of the most classical compartment models. Then, we introduce two latent dynamics models that are based on linear and non-linear dynamics of latent variables.

Table 1. Frequently-used notations and symbols.

https://doi.org/10.1371/journal.pone.0262244.t001

3.1 Susceptible-Infectious-Recovered (SIR) model

The SIR model is one of the most classical epidemic models. Given a closed population of P individuals, each individual is assigned to one of three states: S (susceptible), I (infectious), and R (recovered). We use S(t), I(t), and R(t) to denote the numbers of individuals in the three states, respectively, at timestamp t. The model assumes that each individual goes through two types of transitions: infection and recovery. That is, the state of an individual changes from S to I and then from I to R. Additionally, the model assumes that the probability that a susceptible individual gets infected at each time t is proportional to the number of infected individuals with a coefficient β, and that the probability that an infected individual recovers at each time t is γ. These dynamics can be expressed as the following three differential equations, where β and γ are model parameters:

$$\frac{dS(t)}{dt} = -\beta S(t) I(t), \qquad \frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t), \qquad \frac{dR(t)}{dt} = \gamma I(t).$$

Note that these equations imply that S(t) + I(t) + R(t) is constant over time and equal to P.
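As a quick illustration of the SIR dynamics, the following sketch integrates dS/dt = −βSI, dI/dt = βSI − γI, and dR/dt = γI with a fixed-step fourth-order Runge-Kutta scheme. The function names, the step size, and the parameter values in the usage note are illustrative assumptions and are not taken from the paper's implementation.

```python
def sir_derivatives(s, i, r, beta, gamma):
    """Right-hand side of the SIR equations: (dS/dt, dI/dt, dR/dt)."""
    new_infections = beta * s * i
    new_recoveries = gamma * i
    return (-new_infections, new_infections - new_recoveries, new_recoveries)


def simulate_sir(s0, i0, r0, beta, gamma, days, steps_per_day=10):
    """Integrate SIR with classical RK4; returns one (S, I, R) tuple per day."""
    h = 1.0 / steps_per_day
    state = (float(s0), float(i0), float(r0))
    trajectory = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            k1 = sir_derivatives(*state, beta, gamma)
            mid1 = tuple(x + 0.5 * h * k for x, k in zip(state, k1))
            k2 = sir_derivatives(*mid1, beta, gamma)
            mid2 = tuple(x + 0.5 * h * k for x, k in zip(state, k2))
            k3 = sir_derivatives(*mid2, beta, gamma)
            end = tuple(x + h * k for x, k in zip(state, k3))
            k4 = sir_derivatives(*end, beta, gamma)
            state = tuple(
                x + h / 6.0 * (a + 2 * b + 2 * c + d)
                for x, a, b, c, d in zip(state, k1, k2, k3, k4)
            )
        trajectory.append(state)
    return trajectory
```

For example, `simulate_sir(999, 1, 0, beta=0.0005, gamma=0.1, days=60)` returns 61 daily states whose components always sum to the population of 1000, which mirrors the conservation property noted above.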

3.2 Non-Linear Latent Dynamics (NLLD) model

This model [2] consists of two multi-dimensional event sequences: a k-dimensional latent (i.e., unobservable) event sequence w(t) and a d-dimensional observable event sequence v(t). The observed events v(t) are assumed to be determined by the following non-linear dynamical system of the latent factors w(t):

$$\frac{d\mathbf{w}(t)}{dt} = \mathbf{p} + \mathbf{Q}\,\mathbf{w}(t) + \mathbf{A} \odot \mathbf{w}(t) \odot \mathbf{w}(t) \tag{1}$$

$$\mathbf{v}(t) = \mathbf{u} + \mathbf{V}\,\mathbf{w}(t) \tag{2}$$

where ⊙ denotes the Hadamard product (i.e., the elementwise product); and p ∈ ℝ^k, Q ∈ ℝ^(k×k), and A ∈ ℝ^k describe the linear, exponential, and non-linear dynamics between latent factors. In addition, u ∈ ℝ^d and V ∈ ℝ^(d×k) are used to project the latent factors to the observed events. The model parameters are p, Q, A, u, V, and the initial condition w(0) = w₀ of the latent factors.
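To make the roles of the parameters concrete, here is a minimal simulation sketch. It assumes that the latent state evolves as dw/dt = p + Qw + A ⊙ w ⊙ w with one quadratic coefficient per latent factor, that the observations are v = u + Vw, and that a forward-Euler step of one time unit is accurate enough for illustration. The function names and the integrator are assumptions, not the solver used in the paper.

```python
def nlld_step(w, p, Q, A, h=1.0):
    """One forward-Euler step of dw/dt = p + Q w + A * w * w (elementwise)."""
    k = len(w)
    dw = [
        p[i] + sum(Q[i][j] * w[j] for j in range(k)) + A[i] * w[i] * w[i]
        for i in range(k)
    ]
    return [w[i] + h * dw[i] for i in range(k)]


def nlld_simulate(w0, p, Q, A, u, V, n_steps, h=1.0):
    """Return the observed sequence v(t) = u + V w(t) for t = 0, ..., n_steps."""
    w = list(w0)
    observed = []
    for _ in range(n_steps + 1):
        v = [
            u[r] + sum(V[r][j] * w[j] for j in range(len(w)))
            for r in range(len(u))
        ]
        observed.append(v)
        w = nlld_step(w, p, Q, A, h)
    return observed
```

Setting every entry of `A` to zero removes the quadratic term, so the same function then simulates a purely linear latent system.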

3.3 Linear Latent Dynamics (LLD) model

We also consider a special case of the NLLD model, where the d-dimensional observed event sequence v(t) is assumed to be determined by the following linear dynamical system of k-dimensional latent factors w(t):

$$\frac{d\mathbf{w}(t)}{dt} = \mathbf{p} + \mathbf{Q}\,\mathbf{w}(t), \qquad \mathbf{v}(t) = \mathbf{u} + \mathbf{V}\,\mathbf{w}(t).$$

The NLLD and LLD models can naturally be used as epidemic models if we regard I(t) and R(t) (i.e., the numbers of infected and recovered individuals) in the SIR model as the 2-dimensional observed event sequence v(t). Unlike the SIR model, the latent dynamics models are fully data driven, and thus they capture the temporal patterns in the event sequences without any prior knowledge of epidemics. Moreover, they describe the dynamics of the observed events using latent factors, which are not directly observed. Many real-world events are known to be largely affected by latent factors, and as shown in the experiment section, the latent dynamics models predict the spread of COVID-19 significantly more accurately than the SIR model.

3.3.1 Remarks.

Our segmentation scheme described in the following section is model agnostic. That is, it can be applied to any epidemic or time-series analysis models, including but not limited to the three considered ones.

4 Proposed method

In this section, we present our approach for deciding the number of segments and their locations automatically without user-defined parameters. We first define the description length of an event sequence. Then, based on the definition, we describe how we adapt the Minimum Description Length (MDL) principle to evaluate segmentation. Then, we propose a search algorithm for finding the best segmentation.

4.1 Description length

Given a sequence X and a model M, the description length (in bits) of X, denoted by Cost(X), is defined as:

$$\text{Cost}(X) = \text{Cost}(M) + \text{Cost}(X \mid M),$$

where the model cost Cost(M) is the number of bits required to describe the model M, and the data cost Cost(X|M) is the number of bits required to encode X given M. The model cost and the data cost are described below.

4.1.1 Model cost.

To measure the model cost Cost(M), we examine the parameters of the model M and their sizes in bits. Below, we consider the three aforementioned epidemic models. Note that the model cost of any other models can be measured in a similar way.

SIR Model: The infection rate β and the recovery rate γ are two real numbers, and encoding each requires C_F bits (we set C_F to 8 by convention). We ignore the cost required to encode the population P since it is required only once regardless of the number of segments. Thus, the model cost in bits required to describe the SIR model is: $\text{Cost}(M) = 2 \cdot C_F$
Non-linear Latent Dynamics (NLLD) Model: This model is described by a set of six parameters: w₀, p, Q, A, u, and V (see Eqs (1) and (2)). They contain k, k, k², k, d, and kd real-valued parameters, respectively. Thus, the model cost in bits required to describe the NLLD model is: $\text{Cost}(M) = C_F \cdot (k + k + k^2 + k + d + kd)$ (3)
Linear Latent Dynamics (LLD) Model: The model cost in bits required to describe the LLD model is: $\text{Cost}(M) = C_F \cdot (k + k + k^2 + d + kd)$. Note that this is Eq (3) minus the cost in bits required to encode A.
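The parameter counts listed above translate directly into code. The sketch below computes the three model costs in bits, assuming each real-valued parameter costs C_F = 8 bits as stated in the text; the function names are illustrative.

```python
C_F = 8  # bits per real-valued parameter, the value stated in the text


def sir_model_cost(c_f=C_F):
    """Two real-valued parameters: beta and gamma."""
    return 2 * c_f


def nlld_model_cost(k, d, c_f=C_F):
    """Parameter counts: w0 (k), p (k), Q (k*k), A (k), u (d), V (k*d)."""
    return c_f * (k + k + k * k + k + d + k * d)


def lld_model_cost(k, d, c_f=C_F):
    """Same as NLLD without the k parameters of A."""
    return nlld_model_cost(k, d, c_f) - c_f * k
```

With k = 2 latent factors and d = 2 observed dimensions, these give 16 bits for SIR, 128 bits for NLLD, and 112 bits for LLD.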

Algorithm 1: Segment: MDL-based Greedy Segmentation Search

Input: (1) epidemic event stream X_1:n

(2) epidemic model solver f

Output: segmented event stream

1 if n ≤ 2 then return X_1:n ⊳ Base Case

2 C ← Cost(f(X_1:n)) + Cost(X_1:n|f(X_1:n))

3 i* ← argmin_{i ∈ {2, ⋯, n − 2}} Cost(X_1:i ⊕ X_i+1:n) ⊳ Eq (4)

4 C* ← Cost(X_1:i* ⊕ X_i*+1:n)

5 if C* ≥ C then return X_1:n

6 else return Segment(X_1:i*, f) ⊕ Segment(X_i*+1:n, f) ⊳ Recursive Calls

4.1.2 Data cost.

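The data cost Cost(X|M) is the number of bits required to describe X given M. We assume Huffman coding [29] to encode the difference between the observed event sequence X and the event sequence V estimated by the model M. Then, the number of bits required is the negative log-likelihood under a Gaussian distribution as follows:

$$\text{Cost}(X \mid M) = -\sum_{t} \sum_{i=1}^{d} \log_2 p_{\mathcal{N}(0, \sigma^2)}\big(x_i(t) - v_i(t)\big),$$

where x_i(t) and v_i(t) are the i-th dimensions of the actual and estimated events at time t, the outer sum runs over the timestamps covered by X, and p_N(0, σ²) is the density of a zero-mean Gaussian distribution with standard deviation σ. We fix σ to the standard deviation of the elements of X − V during the period of each segment.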
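A minimal sketch of this data cost follows. It assumes a zero-mean Gaussian density, uses the population standard deviation of the residuals as σ, and adds a small floor on σ (`min_sigma`, an assumption introduced here) so that a perfect fit does not cause a division by zero. Because the cost is a negative log-density, it can become negative when σ is very small.

```python
import math
import statistics


def gaussian_data_cost(actual, estimated, min_sigma=1e-9):
    """Bits needed to encode the residuals X - V under N(0, sigma^2).

    `actual` and `estimated` are lists of d-dimensional rows, one per timestamp.
    """
    errors = [
        x - v
        for row_x, row_v in zip(actual, estimated)
        for x, v in zip(row_x, row_v)
    ]
    sigma = max(statistics.pstdev(errors), min_sigma)
    log_norm = 0.5 * math.log2(2 * math.pi * sigma ** 2)
    # -log2 of the Gaussian density, summed over all residuals.
    return sum(log_norm + e * e / (2 * sigma ** 2 * math.log(2)) for e in errors)
```

Larger residuals lead to a larger σ and therefore to more bits, which is what allows the description length to penalize poorly fitted segments.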

4.1.2.1 Optimization. In order to fit M to X, we use the Levenberg-Marquardt (LM) algorithm to minimize the mean squared error between the given data sequence and the estimated sequence. Specifically, the LM algorithm uses a damping parameter to adaptively interpolate each parameter update between the Gauss-Newton update and the gradient descent update. The lmfit library used in our implementation takes two arguments, xtol and ftol, which are the relative errors desired in the approximate solution and in the sum of squares, respectively. That is, termination occurs (a) when the relative error between two consecutive iterates is at most xtol or (b) when both the actual and predicted relative reductions in the sum of squares are at most ftol. However, as discussed in Section 5.5.1, our segmentation scheme is insensitive to these parameters, and thus we consistently use the same values throughout the experiments. For the NLLD model, we split the parameters into the linear parameter set (p, Q, u, and V) and the non-linear parameter set (A) and separately optimize them using the expectation-maximization (EM) algorithm, as suggested in [2]. In practice, this accelerates convergence compared to optimizing all parameters simultaneously.
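The damping idea can be illustrated with a textbook Levenberg-Marquardt loop written from scratch for a two-parameter toy model, y ≈ a·exp(−b·t). This is not lmfit's implementation and not the paper's code: the toy model, the function name, and the simplified stopping tests (which only loosely mirror the roles of xtol and ftol) are assumptions made for illustration.

```python
import math


def fit_exponential_lm(times, values, a, b, xtol=1e-8, ftol=1e-8, max_iter=200):
    """Fit values ~ a * exp(-b * t) with a basic Levenberg-Marquardt loop."""

    def residuals(a_, b_):
        return [a_ * math.exp(-b_ * t) - y for t, y in zip(times, values)]

    damping = 1e-3  # small: close to Gauss-Newton; large: short gradient-like steps
    res = residuals(a, b)
    sse = sum(r * r for r in res)
    for _ in range(max_iter):
        # Jacobian columns of the residuals with respect to a and b.
        ja = [math.exp(-b * t) for t in times]
        jb = [-a * t * math.exp(-b * t) for t in times]
        h11 = sum(x * x for x in ja)
        h22 = sum(x * x for x in jb)
        h12 = sum(x * y for x, y in zip(ja, jb))
        g1 = sum(x * r for x, r in zip(ja, res))
        g2 = sum(x * r for x, r in zip(jb, res))
        # Solve (J^T J + damping * diag(J^T J)) step = -J^T r for the 2x2 case.
        d11, d22 = h11 * (1 + damping), h22 * (1 + damping)
        det = d11 * d22 - h12 * h12
        if det == 0:
            break
        step_a = (-g1 * d22 + g2 * h12) / det
        step_b = (-g2 * d11 + g1 * h12) / det
        new_res = residuals(a + step_a, b + step_b)
        new_sse = sum(r * r for r in new_res)
        if new_sse < sse:
            small_step = max(abs(step_a), abs(step_b)) <= xtol * max(abs(a), abs(b))
            small_gain = sse - new_sse <= ftol * sse
            a, b, res, sse = a + step_a, b + step_b, new_res, new_sse
            damping /= 10  # accepted: trust the Gauss-Newton direction more
            if small_step or small_gain:
                break
        else:
            damping *= 10  # rejected: fall back toward gradient descent
    return a, b
```

A step is accepted only if it lowers the sum of squared residuals, so the loop never makes the fit worse. Rejected steps raise the damping, which shortens the next step and turns it toward the gradient direction.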

4.2 Segmentation evaluation

We adapt the Minimum Description Length (MDL) principle [30] for segmentation evaluation. Consider an event sequence X (= X_1:n) and a solver f of an epidemic model. We denote the division of X into r segments, where the i-th segment starts at time s_i and ends at time e_i, by X_s₁:e₁ ⊕ ⋯ ⊕ X_s_r:e_r, where s₁ = 1, e_r = n, and e_i + 1 = s_i+1 for each i ∈ {1, ⋯, r − 1}. Let f(X_i:j) be the epidemic model fitted to the segment X_i:j. Then, the description length in bits of X_s₁:e₁ ⊕ ⋯ ⊕ X_s_r:e_r is:

$$\text{Cost}(X_{s_1:e_1} \oplus \cdots \oplus X_{s_r:e_r}) = \sum_{i=1}^{r} \Big[ \text{Cost}\big(f(X_{s_i:e_i})\big) + \text{Cost}\big(X_{s_i:e_i} \mid f(X_{s_i:e_i})\big) \Big] + (r - 1) \cdot \log_2(n) \tag{4}$$

where (r − 1) ⋅ log₂(n) is the cost in bits required to encode r − 1 splitting points (i.e., s₂, ⋯, s_r). Since each splitting point is a positive integer smaller than n, the number of bits required to encode it is log₂(n). The description length (i.e., Eq (4)) balances the fitting error and the size of the parameters required to encode the epidemic models for all segments, and we use it to evaluate segmentation. Specifically, based on the MDL principle, we prefer the segmentation that minimizes Eq (4), and in the following subsection, we discuss how we search for such a segmentation.
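As a small worked example of this trade-off, the sketch below computes the description length of a segmentation as the sum over segments of model cost plus data cost, plus (r − 1)·log₂(n) bits for the splitting points. The function name and the input format, a list of per-segment (model bits, data bits) pairs, are assumptions made for illustration.

```python
import math


def segmentation_cost(segment_costs, n):
    """Total description length in bits of a segmentation.

    segment_costs: list of (model_cost_bits, data_cost_bits), one pair per segment.
    n: length of the sequence being segmented.
    """
    r = len(segment_costs)
    split_bits = (r - 1) * math.log2(n)  # cost of encoding r - 1 splitting points
    return sum(model + data for model, data in segment_costs) + split_bits
```

For instance, with n = 64, a single segment costing (16, 100.0) has a description length of 116 bits, whereas two segments costing (16, 30.0) and (16, 40.0) total 102 + log₂(64) = 108 bits. The split is therefore preferred even though it doubles the model cost.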

4.3 Segmentation search

Given an event sequence X, how can we find the segmentation that minimizes the description length (i.e., Eq (4))? Since there are 2ⁿ⁻¹ ways to segment a sequence of length n, naïvely trying all possible segmentations is computationally prohibitive. Thus, we propose to greedily segment the sequence, as described in Algorithm 1, throughout which we make the length of each segment at least two. Given an event sequence X_1:n, we find a splitting point i* ∈ {2, ⋯, n − 2} where the description length (i.e., Eq (4)) of the corresponding segmentation is minimized (Line 3). If splitting X_1:n at time i* strictly decreases the description length, we divide X_1:n into X_1:i* and X_i*+1:n, and then recursively divide each segment (Line 6). Otherwise, we stop segmentation (Line 5).
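The greedy search can be sketched in a few lines of Python. To keep the example self-contained, the epidemic model solver is replaced by a hypothetical stand-in, `fit_constant`, which fits a single constant to a one-dimensional segment at a cost of one parameter. The data cost uses a zero-mean Gaussian with a small floor on σ, and sequences shorter than four points are never split because both halves must have length at least two. All of these choices are assumptions made for illustration.

```python
import math
import statistics

C_F = 8  # bits per real-valued parameter (value stated in the text)
MIN_SIGMA = 1e-6  # assumed floor that avoids log(0) for perfect fits


def fit_constant(segment):
    """Hypothetical stand-in for the solver f: fit one constant (the mean)."""
    mean = sum(segment) / len(segment)
    return [mean] * len(segment)


def description_length(segment, solver):
    """Model cost plus Gaussian data cost (in bits) of one fitted segment."""
    errors = [x - v for x, v in zip(segment, solver(segment))]
    sigma = max(statistics.pstdev(errors), MIN_SIGMA)
    log_norm = 0.5 * math.log2(2 * math.pi * sigma ** 2)
    data_bits = sum(log_norm + e * e / (2 * sigma ** 2 * math.log(2)) for e in errors)
    return C_F + data_bits  # the stand-in model has a single parameter


def segment(x, solver, offset=0):
    """Greedy MDL segmentation; returns (start, end) pairs with end exclusive."""
    n = len(x)
    whole = (offset, offset + n)
    if n < 4:  # a split needs two parts of length >= 2
        return [whole]
    no_split = description_length(x, solver)
    split_bits = math.log2(n)  # cost of encoding one splitting point
    best_i, best_cost = None, float("inf")
    for i in range(2, n - 1):  # left part x[:i], right part x[i:]
        cost = (
            description_length(x[:i], solver)
            + description_length(x[i:], solver)
            + split_bits
        )
        if cost < best_cost:
            best_i, best_cost = i, cost
    if best_cost >= no_split:  # splitting must strictly decrease the cost
        return [whole]
    left = segment(x[:best_i], solver, offset)
    right = segment(x[best_i:], solver, offset + best_i)
    return left + right
```

On a toy sequence whose level jumps from about 0 to about 100 halfway through, `segment([1.0, -1.0] * 10 + [101.0, 99.0] * 10, fit_constant)` returns `[(0, 20), (20, 40)]`. The single split pays for itself, while further splits would cost more bits than they save.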

5 Experiments

In this section, we review our experiments designed to answer the following questions:

Q1. Effectiveness of Segmentation: Does segmentation help understand the spread of COVID-19? Does it give a better trade-off between model complexity and fitness?
Q2. Effectiveness of our Segmentation Scheme: How well does our greedy segmentation algorithm based on the MDL principle work? Does it yield a smaller fitting error than the baseline with the same number of segments?
Q3. Accuracy of Forecasting: Is segmentation beneficial for accurately predicting the spread of COVID-19? Is it beneficial regardless of epidemic models used?

5.1 Experimental settings

Machines: We conducted all the experiments on a machine with AMD Ryzen 9 3900X CPU and 128GB RAM.
Datasets: We considered the 70 countries with the most confirmed cases as of the end of March, 2021. We used the number of active cases as I(t) and the number of recoveries and deaths as R(t) in each of the 70 countries from March 1, 2020 to March 30, 2021. The dataset is publicly available at [31]. Since the number of recoveries in the US is not available, we used the number of deaths as R(t).
Implementations: We implemented the SIR model, the LLD model, and the NLLD model in Python. We used the lmfit library for the optimization (see https://lmfit.github.io/lmfit-py/ for details).
How to choose k: For the LLD and NLLD models, we chose the number of latent factors k between 1 and 6 so that the description length (i.e., Eq (4)) is minimized.

5.2 Q1. Effectiveness of segmentation

We measure how segmentation by Algorithm 1 affects the model complexity and fitting error of the three considered epidemic models. As seen in Fig 2, segmentation leads to significantly better trade-offs between the model cost (in bits) and the fitting error (in terms of RMSE), regardless of the epidemic models used. For example, in the India dataset, the NLLD model with segmentation yields 11.54× smaller fitting error with smaller model cost than the same model without segmentation. Fig 3 shows the input and estimated event sequences when the description length is minimized. The description length is minimized when a simple epidemic model with few latent factors is used with a sufficient number of segments. Simple epidemic models with segmentation provide a more concise and accurate description of the spread of COVID-19 than complex models without segmentation. The results in the other countries can be found in the supplement.

Fig 2. Segmentation leads to better trade-offs between model complexity and fitting error.

For the LLD and NLLD models without segmentation, k varies from 1 to 10.

https://doi.org/10.1371/journal.pone.0262244.g002

Fig 3. Simple models with multiple segments are preferred over complex models without segments.

The true and estimated event sequences when the description length in bits is minimized.

https://doi.org/10.1371/journal.pone.0262244.g003

We further qualitatively analyze the splitting points detected by our segmentation scheme in the dataset collected in Japan. Specifically, in the dataset our segmentation scheme detects three splitting points: (1) May 14, 2020, (2) August 25, 2020, and (3) January 13, 2021. As shown in Fig 4, these dates coincide with the periods when the state of emergency (SOE) was declared or lifted by the Japanese Government. The result indicates that there is a close correspondence between the segmentation derived by the proposed scheme and the deployed policies.

Fig 4. Our proposed segmentation scheme captures policy changes.

Splitting points detected by our segmentation scheme coincide with the periods when the state of emergency (SOE) was declared or lifted by the Japanese Government. Note that such events happened 12 times in total during the considered period, and all of them are marked in the figure.

https://doi.org/10.1371/journal.pone.0262244.g004

5.3 Q2. Effectiveness of our segmentation scheme

We demonstrate the effectiveness of our greedy segmentation scheme based on the MDL principle by comparing it with the incremental method inspired by [2]. The incremental method goes through the sequence from the start and initiates a new segment whenever the fitting error within the current segment exceeds a given threshold ϵ. As in [2], we set the threshold proportional to the L₂ norm of the current segment X_c with a coefficient α. That is, ϵ = α ⋅ ||X_c||₂. Note that smaller α is expected to yield more segments. As seen in Fig 5, where we fix k to 2 and vary α from 0.05 to 0.5, our proposed segmentation scheme significantly outperforms the incremental method. Specifically, our scheme gives up to 3.23× smaller fitting error with the same model cost, which is proportional to the number of segments, than the incremental segmentation. The results in the other countries can be found in the supplement.
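The incremental baseline can be sketched as follows. The text does not specify how the fitting error within the current segment is measured, so this sketch assumes the L₂ norm of the residuals. It also uses a hypothetical stand-in solver, `fit_constant`, that fits a constant to a one-dimensional segment. Both are illustrative assumptions.

```python
import math


def fit_constant(segment):
    """Hypothetical stand-in solver: fit one constant (the mean)."""
    mean = sum(segment) / len(segment)
    return [mean] * len(segment)


def incremental_segmentation(x, solver, alpha):
    """Scan left to right; start a new segment when error > alpha * ||X_c||_2.

    Returns (start, end) index pairs with end exclusive.
    """
    segments = []
    start = 0
    for end in range(1, len(x) + 1):
        current = x[start:end]
        fitted = solver(current)
        error = math.sqrt(sum((a - b) ** 2 for a, b in zip(current, fitted)))
        threshold = alpha * math.sqrt(sum(a * a for a in current))
        if error > threshold and end - start > 1:
            # Close the segment just before the point that broke the fit.
            segments.append((start, end - 1))
            start = end - 1
    segments.append((start, len(x)))
    return segments
```

For the sequence `[10.0] * 5 + [50.0] * 5`, a threshold coefficient of α = 0.2 yields the two segments `[(0, 5), (5, 10)]`, while α = 1.0 keeps the whole sequence as one segment. This is consistent with smaller α producing more segments.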

Fig 5. Our proposed greedy segmentation scheme based on the MDL principle yields better segmentation than the incremental method.

https://doi.org/10.1371/journal.pone.0262244.g005

Furthermore, to numerically evaluate the accuracy of the segmentation, we generate synthetic sequences with randomly selected splitting points, where each segment is generated by a different set of random parameters of the NLLD model. We carefully sample parameters based on the model parameters fitted to real-world sequences. Specifically, we sample −0.1 < p < 0.1, −0.1 < Q < 0.1, −0.001 < A < 0.001, −0.1 < u < 0.1, −1.0 < V < 1.0, and −1 < w₀ < 1 uniformly at random. Then, we compare the detected splitting points (i.e., the timestamps where the segmentation occurs) with the ground-truth ones by measuring F1 scores. When measuring F1 scores, for robust evaluation, we consider a detected splitting point correct if it is within δ time units from a ground-truth one. As shown in Table 2, the splitting points detected by our segmentation scheme closely match the ground-truth splitting points, and our segmentation scheme is more accurate than the incremental method.
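The tolerance-based F1 score can be sketched as below. The text does not say how multiple detections near the same ground-truth point are handled, so this sketch assumes a one-to-one matching in which each ground-truth point can validate at most one detection. The function name is illustrative.

```python
def split_point_f1(detected, truth, delta):
    """F1 score of detected splitting points with a tolerance of delta time units."""
    unmatched = sorted(truth)
    true_positives = 0
    for point in sorted(detected):
        # Match the detection to the first still-unmatched true point in range.
        match = next((t for t in unmatched if abs(t - point) <= delta), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

With detections at days 71, 200, and 330, ground-truth points at days 70, 198, and 300, and δ = 3, two detections are counted as correct, so precision, recall, and F1 all equal 2/3.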

Table 2. Our segmentation scheme accurately (in terms of F1 score) detects ground-truth splitting points in synthetic sequences.

https://doi.org/10.1371/journal.pone.0262244.t002

5.4 Q3. Accuracy of forecasting

We examine the effect of segmentation on the accuracy of future prediction using the three considered epidemic models. To this end, we divide each sequence into the training sequence and the test sequence, which span 327 days and 37 days, respectively. Then, we fit the epidemic models to each training sequence with and without segmentation and predict the event sequence during the test period. When segmentation is applied, we ensure that the last segment is at least as long as the test period, and we use the model fitted to the last segment of the training sequence for prediction. We ensure this by modifying Algorithm 1 so that it never splits the training sequence during its last 37 days. That is, it searches for splitting points only during the first 290 days. This constraint is helpful for forecasting, as shown experimentally in Section 5.5.2. For the LLD and NLLD models without segmentation, we vary the number of latent factors k from 1 to 6.
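The constraint on the last segment and the error metric can be sketched as follows. The helper names are hypothetical, and indices are zero-based with exclusive ends, so a split at index i separates the first i days from the rest. The minimum segment length of two follows the description of Algorithm 1.

```python
import math


def allowed_split_points(start, end, train_len, test_len, min_len=2):
    """Candidate split indices for the segment [start, end) of the training data.

    A split at index i is allowed only if both parts keep at least `min_len`
    points and the last segment of the training data keeps at least `test_len`.
    """
    last_allowed = train_len - test_len  # e.g., 327 - 37 = 290
    upper = min(end - min_len, last_allowed)
    return list(range(start + min_len, upper + 1))


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

For a 327-day training sequence and a 37-day test period, `allowed_split_points(0, 327, 327, 37)` returns the indices 2 through 290. For the final 37 days it returns an empty list, so that part is never split.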

In Table 3, we compare the prediction error (in terms of RMSE) of the three epidemic models with and without segmentation. When the LLD model or the NLLD model is used, among 7 different settings, our segmentation scheme leads to the most accurate prediction in 32 and 33 (out of 70) countries, respectively. The second best one, which is the LLD model with k = 2 and no segmentation, is the most accurate in only 9 countries. When the SIR model is used, segmentation increases the prediction accuracy in 70 (out of 70) countries. Moreover, prediction without segmentation is unstable, with unreasonably large RMSE in some countries, while it is stable with segmentation in all countries. To sum up, segmentation tends to improve the prediction accuracy of all three considered epidemic models.

Table 3. Segmentation helps predict the spread of COVID-19 accurately.

https://doi.org/10.1371/journal.pone.0262244.t003

Note that with segmentation, only the last segment, not the entire sequence, is used for prediction. Despite this, segmentation increases the accuracy of prediction by letting epidemic models focus on the part that represents the current epidemic dynamics while ignoring the part before inherent changes in the dynamics.

5.5 Additional experimental results

Below, we present the results of additional experiments.

5.5.1 Insensitivity to two arguments: xtol and ftol.

For optimization, we used the lmfit library for Python, which solves non-linear least-squares problems. The leastsq function, which we used, takes two arguments, xtol and ftol, which are the desired relative errors in the approximate solution and in the sum of squares, respectively (see https://lmfit.github.io/lmfit-py/fitting.html#lmfit.minimizer.Minimizer.leastsq for details). We tested the NLLD model in the Japan dataset using eight different xtol and ftol values (10⁻¹ to 10⁻⁸) and five different numbers of latent factors k (2 to 6). In the 40 considered settings, the splitting points of the segmentation were exactly the same (the 71st, 198th, and 324th days), which implies that the proposed scheme is insensitive to these parameters. Thus, in this work, we do not tune xtol and ftol but fix them to 10⁻⁸ in all experiments in the main paper.

5.5.2 The effect of the constraint on the last segment.

One might be concerned that avoiding segmentation within the last 37 days before the test set may reduce the flexibility of the model and thus the accuracy of forecasting. Empirically, however, this constraint is helpful for accurate prediction because it prevents overfitting. Note that if the last segment is too short, overfitting easily occurs, resulting in a large generalization (i.e., prediction) error. In order to demonstrate the effect of the constraint, we compared the forecasting errors of the NLLD model with (our original setting) and without the constraint in 70 countries. As shown in Fig 6, without the constraint, NLLD greatly overestimated the numbers of infected and recovered individuals in some countries (specifically, Lebanon and Lithuania). It should be noted that the estimates were even larger than the populations of the countries. On the other hand, the constraint helped prevent such absurd predictions, and specifically, NLLD with the constraint always made predictions within the populations of the countries. In addition, out of the 70 countries, NLLD with the constraint outperformed NLLD without the constraint in 39 countries. The average forecasting error (in terms of RMSE) was also smaller when adopting the constraint. Specifically, it was 94.3 with the constraint and 116.3 without the constraint (averaged over only the 68 countries with reasonable results).


Fig 6. Ensuring that the last segment in the training set spans at least 37 days is helpful for accurate prediction.

NLLD without the constraint sometimes greatly overestimates the numbers of infected and recovered individuals, and the constraint helps prevent such absurd predictions. Moreover, the constraint was beneficial in 39 of the 70 countries. The countries are indexed in the order of the forecasting error of NLLD with the constraint.

https://doi.org/10.1371/journal.pone.0262244.g006

6 Conclusions

In this work, we propose to divide epidemic event sequences into multiple segments and fit a simple model to each segment. To this end, we propose a greedy algorithm based on the MDL principle to decide where to split the sequences. Through extensive experiments using the COVID-19 event sequences from 70 countries, we demonstrate that our methodology has the following advantages:

Automatic: All parameters are tuned automatically based on the MDL principle without relying on users.
Model-agnostic: Any ODE-based epidemic models can be used with our segmentation scheme.
Effective: The fitting error and prediction error of three epidemic models decrease by up to 14.29× and 31.54×, respectively, with our segmentation scheme.
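To illustrate the general idea of a greedy, cost-driven search for split points, the following toy sketch may help. It is not the algorithm of this paper: the cost used here (squared error around the segment mean plus a fixed per-segment penalty) is a hypothetical stand-in for the MDL-based description length of Section 4, and no ODE model is fitted. Split points are 0-indexed slice boundaries, which is also an assumption of this sketch.

```python
def segment_cost(values, penalty):
    """Toy cost of one segment: squared error around its mean plus a fixed penalty."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) + penalty


def total_cost(series, splits, penalty):
    """Sum of segment costs for the segmentation induced by the split indices."""
    bounds = [0, *sorted(splits), len(series)]
    return sum(segment_cost(series[a:b], penalty) for a, b in zip(bounds, bounds[1:]))


def greedy_segmentation(series, penalty=1.0, min_last=1):
    """Greedily add the split that lowers the total cost most; stop when none helps.

    min_last (>= 1) is the minimum length of the last segment.
    """
    splits = []
    best = total_cost(series, splits, penalty)
    while True:
        candidates = [t for t in range(1, len(series) - min_last + 1)
                      if t not in splits]
        if not candidates:
            break
        cost, t = min((total_cost(series, splits + [t], penalty), t)
                      for t in candidates)
        if cost >= best:  # no candidate reduces the total cost any further
            break
        splits.append(t)
        best = cost
    return sorted(splits)
```

The fixed penalty plays the role of the model cost: each extra segment must reduce the data cost by more than the penalty to be accepted, which is what keeps the number of segments from growing without bound.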

Reproducibility: The code and datasets used in the paper are available at https://github.com/geonlee0325/covid_segmentation.

Supporting information

S1 Appendix.

https://doi.org/10.1371/journal.pone.0262244.s001

(PDF)

References

1. Hethcote HW, Stech HW, van den Driessche P. “Periodicity and stability in epidemic models: a survey. In: Differential equations and applications in ecology,epidemics, and population problems.” Elsevier; 1981. p. 65–82.
2. Matsubara Y, Sakurai Y. “Regime shifts in streams: Real-time forecasting of co-evolving time sequences,” In: KDD; 2016.
- View Article
- Google Scholar
3. Matsubara Y, Sakurai Y. “Dynamic modeling and forecasting of time-evolving data streams,” In: KDD; 2019.
- View Article
- Google Scholar
4. Anderson RM, Anderson B, May RM. “Infectious diseases of humans: dynamics and control,” Oxford university press; 1992.
5. Antulov-Fantulin Nino and Lančić Alen and Štefančić Hrvoje and Šikić Mile. “FastSIR algorithm: A fast algorithm for the simulation of the epidemic spread in large networks by using the susceptible–infected–recovered compartment model,” Information Sciences. 2013;239:226–240.
- View Article
- Google Scholar
6. Guo W, Zhang Q, Rong L. “A stochastic epidemic model with nonmonotone incidence rate: Sufficient and necessary conditions for near-optimality,” Information Sciences. 2018;467:670–684.
- View Article
- Google Scholar
7. Osemwinyen AC, Diakhaby A. “Mathematical modelling of the transmission dynamics of ebola virus,” Applied and Computational Mathematics. 2015;4(4):313–320.
- View Article
- Google Scholar
8. Fang H, Chen J, Hu J. “Modelling the SARS epidemic by a lattice-based Monte-Carlo simulation,” In: EMB; 2006.
- View Article
- Google Scholar
9. Li R, Pei S, Chen B, Song Y, Zhang T, Yang W, et al. “Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus(SARS-CoV-2),” Science. 2020;368(6490):489–493. pmid:32179701
- View Article
- PubMed/NCBI
- Google Scholar
10. Dandekar R, Barbastathis G. “Quantifying the effect of quarantine control in Covid-19 infectious spread using machine learning,” medRxiv. 2020.
- View Article
- Google Scholar
11. Papadimitriou S, Sun J, Faloutsos C. “Streaming pattern discovery in multiple time-series,” In: VLDB; 2005.
- View Article
- Google Scholar
12. Papadimitriou S, Yu P. “Optimal multi-scale patterns in time series streams,” In: SIGMOD; 2006.
- View Article
- Google Scholar
13. Yang F, Song HA, Liu Z, Faloutsos C, Zadorozhny V, Sidiropoulos N. “Ares: automatic disaggregation of historical data,” In: ICDE; 2018.
- View Article
- Google Scholar
14. Matsubara Y, Sakurai Y, Faloutsos C. “The web as a jungle: Non-linear dynamical systems for co-evolving online activities,” In: WWW; 2015.
- View Article
- Google Scholar
15. Hooi B, Liu S, Smailagic A, Faloutsos C. “BeatLex: Summarizing and Forecasting Time Series with Patterns,” In: ECML-PKDD; 2017.
- View Article
- Google Scholar
16. Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C. “Rise and fall patterns of information diffusion: model and implications,” In: KDD; 2012.
- View Article
- Google Scholar
17. Mathioudakis M, Koudas N, Marbach P. “Early online identification of attention gathering items in social media,” In: WSDM; 2010.
- View Article
- Google Scholar
18. Davidson I, Gilpin S, Carmichael O, Walker P. “Network discovery via constrained tensor analysis of fmri data,” In: KDD; 2013.
- View Article
- Google Scholar
19. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK. “Neural ordinary differential equations,” In: NeurIPS; 2018.
- View Article
- Google Scholar
20. Raissi M, Karniadakis GE. “Hidden physics models: Machine learning of nonlinear partial differential equations,” Journal of Computational Physics. 2018;357:125–141.
- View Article
- Google Scholar
21. Schober M, Duvenaud DK, Hennig P. “Probabilistic ODE solvers with Runge-Kutta means,” In: NIPS; 2014.
- View Article
- Google Scholar
22. Raissi M, Perdikaris P, Karniadakis GE. “Numerical Gaussian processes for time-dependent and non-linear partial differential equations,” arXiv preprintarXiv:170310230. 2017.
23. Matsubara Y, Sakurai Y, Faloutsos C. “Autoplait: Automatic mining of co-evolving time sequences,” In: SIGMOD; 2014.
- View Article
- Google Scholar
24. Jiang F, Zhao Z, Shao X. “Time series analysis of COVID-19 infection curve: A change-point perspective,” Journal of Econometrics. 2020. pmid:32836681
- View Article
- PubMed/NCBI
- Google Scholar
25. Jiang F, Zhao Z, Shao X. “Modelling the COVID-19 infection trajectory: A piecewise linear quantile trend model,” Journal of the Royal Statistical Society:Series B (Statistical Methodology). 2021.
- View Article
- Google Scholar
26. Scott AJ, Knott M. “A cluster analysis method for grouping means in the analysis of variance,” Biometrics. 1974; p. 507–512.
- View Article
- Google Scholar
27. Baranowski R, Chen Y, Fryzlewicz P. “Narrowest-over-threshold detection of multiple change points and change-point-like features,” Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2019;81(3):649–672.
- View Article
- Google Scholar
28. Olshen AB, Venkatraman E, Lucito R, Wigler M. “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics.2004;5(4):557–572. pmid:15475419
- View Article
- PubMed/NCBI
- Google Scholar
29. Böhm Christian and Faloutsos Christos and Pan Jia-Yu and Plant Claudia. “Ric: Parameter-free noise-robust clustering,” TKDD. 2007;1(3):10–es.
- View Article
- Google Scholar
30. Rissanen J. “Modeling by shortest data description,” Automatica. 1978;14(5):465–471
- View Article
- Google Scholar
31. Rajkumar S. “Novel Corona Virus 2019 Dataset. Day level information on covid-19 affected cases,” 2020. Available from: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.
- View Article
- Google Scholar

