Fig 1.
Estimating initial conditions for the SEIR model.
(a) Schematic of the SEIR mathematical model, which describes epidemic dynamics through susceptible (S), exposed (E), infectious (I), and removed (R) individuals. Fitting the SEIR model to an observed time series (white dots) enables the estimation of crucial epidemic parameters such as the reproduction number. This estimation depends strongly on the initial condition (e.g., the initial value of E) at the starting point of the model simulation (see Supplementary Information for more details). (b) The initial condition of E (red) can be determined by summing the daily change of E from the beginning of the disease (0) up to the starting point (red arrow). However, this requires data on the daily incidence of exposure and the daily incidence of becoming infectious before the starting point, which are often unknown. This highlights the need for a method that estimates the initial condition of E using only the available daily data on infectious individuals from the starting point onward (green arrow). (c) To address this limitation, previous studies estimated the initial condition of E (red) by multiplying the daily incidence of becoming infectious at the starting point by the mean latent period (green arrow). (d) However, while this History-Independent estimation (Hist-I) method provides an accurate estimate (red dots) if the latent period follows an exponential distribution (left), it becomes less reliable for the gamma distribution (right) observed in many infectious diseases, under which an individual is more likely to transition from exposed to infectious the longer the time since their exposure.
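The contrast in panels (c) and (d) can be illustrated with a minimal sketch. It is not the authors' simulation: it assumes a prescribed, exponentially growing exposure incidence, uses the instantaneous rate of becoming infectious in place of the daily incidence, and models a gamma latent period with integer shape through sub-compartments of E. The names `simulate_exposed` and `hist_i` and all parameter values are hypothetical.

```python
import math


def simulate_exposed(n_stages, mean_latent=5.0, growth=0.1, days=60, steps_per_day=100):
    """Return (true E, rate of becoming infectious) at the final time.

    The latent period is gamma-distributed with integer shape n_stages and
    mean mean_latent, implemented as n_stages sub-compartments of E.
    n_stages=1 gives the exponential case.
    """
    dt = 1.0 / steps_per_day
    rate = n_stages / mean_latent            # exit rate of each sub-compartment
    stages = [0.0] * n_stages
    for step in range(days * steps_per_day):
        inflow = math.exp(growth * step * dt)    # assumed exposure incidence
        updated = stages[:]
        for k in range(n_stages):
            outflow = rate * stages[k]
            updated[k] += dt * (inflow - outflow)    # forward Euler step
            inflow = outflow                         # feeds the next stage
        stages = updated
    return sum(stages), rate * stages[-1]


def hist_i(incidence_at_start, mean_latent):
    """History-independent estimate: incidence times the mean latent period."""
    return incidence_at_start * mean_latent


for n_stages in (1, 4):
    true_e, incidence = simulate_exposed(n_stages)
    # Ratio of the Hist-I estimate to the true E.
    print(n_stages, round(hist_i(incidence, 5.0) / true_e, 3))
```

With one stage the ratio is 1 up to rounding, because the outflow from E is exactly E divided by the mean latent period. With four stages the ratio falls below 1 during growth, since recently exposed individuals have not yet reached the last stage.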
Fig 2.
Schematic figure for deriving the loss function to estimate the initial condition.
(a) To address the limitation of the history-independent method (left), we developed a novel history-dependent method (right). (b) (i) We established the connection between the known data, the daily incidence of becoming infectious, and the unknown daily incidence of exposure by treating the former as the convolution of the latter with the probability density function of the latent period. (ii) By discretizing this relationship and (iii) assuming that the daily incidence of exposure remains consistent before the starting point, we can express the known daily incidence of becoming infectious as a linear combination of two sets of unknowns with two sets of known coefficients. The first set of coefficients represents the probability of an individual having a latent period of exactly a given number of days, while the second represents the probability of the latent period being longer than or equal to a given number of days. Both sets of coefficients can be obtained by integrating the convolution of the latent period density with the characteristic function supported on [0,1] (see Methods for more details). (c) Extending this linear-combination expression to the whole data set, we can construct a matrix equation that describes the relationship between the known data and the unknown parameters. (d) We used this matrix equation, which the unknown parameters must satisfy, to establish the data loss function, and then sought to minimize this data loss by finding optimal values of the unknown parameters. However, because the number of unknown parameters exceeds the number of equations, the parameters cannot be determined from the data loss alone. We therefore incorporated a regularization loss, which smooths these parameters by minimizing their second-order derivatives. Consequently, by finding the parameters that minimize the total loss function, which includes both the data loss and the regularization loss, we can estimate the daily incidence of exposure. By summing the differences between the daily incidence of exposure and the daily incidence of becoming infectious, we finally obtain the initial condition of E.
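The construction in panels (b) to (d) can be sketched as a regularized least-squares problem. This is a simplified illustration, not the authors' implementation, and several choices are assumptions: the daily coefficients are approximated by integrating the gamma density over each day, the pre-start exposure incidence is a single constant `c`, the second-difference penalty is applied to the full unknown vector, and the weight `lam` is arbitrary. The gamma parameters are those quoted in Fig 3a. All function names are hypothetical.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    if x <= 0.0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def daily_latent_probs(n_days, shape=4.06, scale=1.35, steps=400):
    """p[k] ~ P(k <= latent period < k + 1), by the midpoint rule (a simplification)."""
    h = 1.0 / steps
    return [h * sum(gamma_pdf(k + (j + 0.5) * h, shape, scale) for j in range(steps))
            for k in range(n_days)]


def design_matrix(p):
    """Row i encodes y[i] = q[i+1] * c + sum_{k<=i} p[k] * x[i-k].

    Unknowns are ordered u = [c, x_0, ..., x_{n-1}], where c is the assumed
    constant exposure incidence before the starting point.
    """
    n = len(p)
    rows = []
    for i in range(n):
        tail = 1.0 - sum(p[: i + 1])      # q[i+1] = P(latent period >= i + 1 days)
        rows.append([tail] + [p[i - j] if j <= i else 0.0 for j in range(n)])
    return rows


def second_differences(m):
    """Rows (1, -2, 1) of the discrete second-derivative operator on m values."""
    rows = []
    for r in range(m - 2):
        row = [0.0] * m
        row[r], row[r + 1], row[r + 2] = 1.0, -2.0, 1.0
        rows.append(row)
    return rows


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def fit(y, p, lam=1.0):
    """Minimize ||A u - y||^2 + lam * ||D u||^2 through the normal equations."""
    A = design_matrix(p)
    m = len(p) + 1
    D = second_differences(m)
    normal = [[sum(row[a] * row[b] for row in A) + lam * sum(row[a] * row[b] for row in D)
               for b in range(m)] for a in range(m)]
    rhs = [sum(row[a] * y[i] for i, row in enumerate(A)) for a in range(m)]
    return solve(normal, rhs)


# Synthetic check with a hypothetical, linearly growing exposure incidence.
p = daily_latent_probs(10)
u_true = [20.0 + 2.0 * j for j in range(11)]      # linear, so D u_true = 0
y = [sum(a * u for a, u in zip(row, u_true)) for row in design_matrix(p)]
u_hat = fit(y, p)
print([round(v, 3) for v in u_hat])
```

The system has ten equations and eleven unknowns, so the data loss alone does not pin down a solution. The smoothness penalty makes the minimizer unique. In this synthetic case the true incidence is linear and noise-free, so both loss terms vanish at the true values and the fit recovers them up to rounding.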
Fig 3.
Hist-D outperforms Hist-I, regardless of the phase transition of epidemic dynamics and noise.
(a) The trajectory of E and the daily incidence of becoming infectious were simulated with an SEIR model whose latent period follows a gamma distribution with shape 4.06 and scale 1.35 (see Methods for more details). (b) The simulated daily incidence of becoming infectious was then used to estimate the initial condition of E and to compare History-Independent estimation (Hist-I) with History-Dependent estimation (Hist-D). Hist-I uses data from only a single day, the starting point, while Hist-D uses data from a number of consecutive days after the starting point that is defined in terms of the mean latent period. (c) Comparison of the true initial condition of E (light gray bars) with the estimated initial condition of E. (d) Scatter plot of the estimation error across different levels of the true initial condition of E. Estimates from Hist-D (green squares) have a much lower error than those from Hist-I (red triangles). (e) Root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of Hist-I and Hist-D. With Hist-D, RMSE and MAPE were reduced by 86% and 85%, respectively, compared with Hist-I. (f) To better reflect real-world situations with observation noise in the data, we applied multiplicative noise, drawn from a uniform distribution, to the simulated daily incidence data used in (c-e) and compared the accuracy of Hist-I and Hist-D. (g) Scatter plots of the estimation error at a single noise level. The error of both Hist-I and Hist-D increased in proportion to the true initial condition of E, and this was especially pronounced for Hist-I (top). In addition, relative to the zero-noise case (i.e., the case in (c-e)), the increase in error was smaller for Hist-D than for Hist-I (bottom). (h) RMSE (bars) and MAPE (line) of Hist-I and Hist-D across different noise levels. Hist-D achieved a lower RMSE and MAPE than Hist-I at all noise levels. (i) To emulate a transition in epidemic dynamics, we abruptly changed the transmission rate from one value to another at a single time point (top) and simulated the daily incidence of becoming infectious (middle), which was then used to investigate the accuracy of Hist-I and Hist-D. (j) Scatter plot of the error of Hist-I and Hist-D when the transmission rate was doubled. Hist-D outperformed Hist-I. (k) RMSE (bars) and MAPE (line) of Hist-I and Hist-D across different fold changes in the transmission rate (1/3, 1/2, 1, 2, 3). Hist-D consistently outperformed Hist-I across all fold changes. In particular, when the transmission rate was reduced to 1/3, the absolute increase in RMSE and MAPE for Hist-D was 22% and 19% that of Hist-I, respectively, demonstrating the robustness of Hist-D to sudden changes in the transmission rate.
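The two error metrics in panels (e), (h), and (k) follow their standard definitions, sketched below together with one possible form of multiplicative uniform noise. The noise form (each value multiplied by 1 + eps, with eps uniform on a symmetric interval) is an assumption, not necessarily the paper's exact definition, and the numbers in the example are hypothetical.

```python
import math
import random


def rmse(true_values, estimates):
    """Root mean squared error."""
    squared = [(e - t) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def mape(true_values, estimates):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    ratios = [abs(e - t) / abs(t) for t, e in zip(true_values, estimates)]
    return 100.0 * sum(ratios) / len(ratios)


def add_multiplicative_noise(series, level, rng):
    """Multiply each value by 1 + eps, eps ~ Uniform(-level, level) (assumed form)."""
    return [v * (1.0 + rng.uniform(-level, level)) for v in series]


true_e0 = [100.0, 200.0]        # hypothetical true initial conditions of E
estimates = [110.0, 180.0]      # hypothetical estimates
print(round(rmse(true_e0, estimates), 2))   # 15.81
print(round(mape(true_e0, estimates), 2))   # 10.0

# Hypothetical incidence series perturbed at noise level 0.2.
noisy = add_multiplicative_noise([50.0, 60.0, 70.0], level=0.2, rng=random.Random(0))
```

RMSE weights large absolute errors heavily, so it is dominated by periods with many exposed individuals. MAPE is relative to the true value, which is why the two metrics are reported together.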
Fig 4.
Hist-D provides more accurate estimates of the initial condition of E compared to Hist-I for real COVID-19 data in Seoul, Republic of Korea.
(a) We compared Hist-I and Hist-D in estimating the initial condition of E for COVID-19 data from Seoul, Republic of Korea, from August 13 to November 25, 2020. From these data, the daily incidence data and the distribution of the incubation period (light blue histogram) were extracted (see Methods for more details) and then used to estimate the initial condition of E with Hist-I and Hist-D. (b) Comparison of the true initial condition of E (light gray bars) with the estimated initial condition of E (Hist-I: red triangles, Hist-D: green squares). While both methods capture the long-term trend, Hist-I exhibits more pronounced fluctuations. (c) Scatter plot comparing the true and estimated initial condition of E. Estimates from Hist-D lie closer to perfect estimation (i.e., the black line along which the estimate equals the true value) than those from Hist-I. (d) Scatter plot of the estimation error across different levels of the true initial condition of E. The error of Hist-I increased in proportion to the true value, while no such pattern appeared for Hist-D. (e) Root mean squared error (RMSE; bars) and mean absolute percentage error (MAPE; line) of Hist-I and Hist-D. Hist-D achieved a 55% lower RMSE (8.44) and a 55% lower MAPE (18.9%) than Hist-I (RMSE: 18.76, MAPE: 42.2%), demonstrating the superior performance of Hist-D on real-world epidemic data. (f) 95% credible intervals and empirical coverage of the Hist-D estimates. The upper and lower horizontal lines of each box represent the upper and lower bounds of the credible interval, corresponding to the 97.5% and 2.5% quantiles, respectively. Of the true values, 91.3% fell within the 95% credible interval of Hist-D.
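The percentages in panel (e) and the coverage in panel (f) can be checked or reproduced with a few lines. The sketch below recomputes the relative reductions from the values quoted in the caption, and shows one way to form a quantile-based interval and its empirical coverage. How the posterior samples are generated is not specified here, so `credible_interval` takes them as given. The helper names and the example values in the final lines are hypothetical.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def quantile(sorted_samples, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_samples) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_samples) - 1)
    return sorted_samples[lo] + (pos - lo) * (sorted_samples[hi] - sorted_samples[lo])


def credible_interval(samples, level=0.95):
    """Equal-tailed interval, e.g. the 2.5% and 97.5% quantiles for level=0.95."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)


def empirical_coverage(true_values, intervals):
    """Fraction of true values lying inside their (lower, upper) interval."""
    hits = sum(lo <= t <= hi for t, (lo, hi) in zip(true_values, intervals))
    return hits / len(true_values)


# Values quoted in panel (e): Hist-I versus Hist-D.
print(round(percent_reduction(18.76, 8.44)))   # 55 (RMSE)
print(round(percent_reduction(42.2, 18.9)))    # 55 (MAPE)

# Hypothetical true values and intervals: two of the three are covered.
coverage = empirical_coverage([1.0, 5.0, 10.0], [(0.0, 2.0), (6.0, 7.0), (9.0, 11.0)])
```

An empirical coverage close to the nominal level, such as the 91.3% reported against 95%, suggests that the intervals are reasonably calibrated.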