A history-dependent approach for accurate initial condition estimation in epidemic models

doi:10.1371/journal.pcbi.1013438

Fig 1.

Estimating initial conditions for the SEIR model.

(a) Schematic of the SEIR mathematical model, which comprises susceptible (S), exposed (E), infectious (I), and removed (R) individuals and effectively explains epidemic dynamics. Fitting the SEIR model to an observed time series (white dots) from the starting time onward enables the estimation of crucial epidemic parameters such as the reproduction number. This estimation strongly depends on the initial condition at the starting point of the model simulation (see Supplementary Information for more details). (b) The initial condition of E (in red) can be determined by summing the daily change of E from the beginning of the disease (0) up to the starting time (red arrow). However, this requires data on the daily incidence of exposure and the daily incidence of becoming infectious before the starting time, which are often unknown. This highlights the need for a method that estimates the initial condition of E using only the available daily data on infectious individuals from the starting time onward (green arrow). (c) To address this limitation, previous studies estimated the initial condition of E (in red) by multiplying the daily incidence of becoming infectious at the starting time by the mean latent period (green arrow). (d) However, while this History-Independent estimation (Hist-I) method provides an accurate estimate (red dots) if the latent period follows the exponential distribution (left), it becomes less reliable for the gamma distribution (right) observed in many infectious diseases, under which an individual is more likely to transition from exposed to infectious the longer the time since their exposure.
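
The two routes to the initial condition of E described in panels (b) and (c) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy numbers are hypothetical. In an SEIR model with an exponentially distributed latent period, the outflow from E equals E divided by the mean latent period, which is why the single-day product in the second function recovers E in that case.

```python
def initial_E_from_history(new_exposed, new_infectious):
    """Panel (b): sum the daily change of E over the full history.

    Both arguments are daily counts from the beginning of the disease
    up to the starting time (usually unavailable in practice).
    """
    return float(sum(e - i for e, i in zip(new_exposed, new_infectious)))


def initial_E_hist_i(new_infectious_at_start, mean_latent_period):
    """Panel (c), Hist-I: one day's incidence times the mean latent period."""
    return float(new_infectious_at_start * mean_latent_period)


# Toy history (hypothetical numbers, three days before the starting time).
exposed_history = [10, 12, 15]
infectious_history = [2, 5, 8]
E_from_history = initial_E_from_history(exposed_history, infectious_history)
E_hist_i = initial_E_hist_i(infectious_history[-1], mean_latent_period=5.0)
```

The two values differ in this toy case because the history is short and growing, which is the kind of situation in which a single-day estimate cannot reflect the past.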

Fig 2.

Schematic figure for deriving the loss function to estimate the initial condition.

(a) To address the limitation of the history-independent method (left), we developed a novel history-dependent method (right). (b) (i) We established the connection between the known data, the daily incidence of becoming infectious, and the unknown daily incidence of exposure by treating the former as the convolution of the latter with the probability density function of the latent period. (ii) By discretizing this relationship and (iii) assuming that the daily incidence of exposure remains consistent before the starting time, we can express the known data as a linear combination of the unknown quantities with two sets of known coefficients. The first coefficient represents the probability of an individual having a latent period of exactly a given number of days, while the second represents the probability of the latent period being at least that many days. Both coefficients can be obtained by integrating the convolution of the latent period density with the characteristic function supported on [0,1] (see Methods for more details). (c) Extending this linear combination to the whole data set, we can construct a matrix that describes the relationship between the known data and the unknown parameters. (d) We used this matrix equation, which the unknown parameters must satisfy, to establish the data loss function, and then sought to minimize this data loss by finding optimal values for the unknown parameters. However, because the number of unknown parameters exceeds the number of equations, the parameters cannot be determined from the data loss alone. We therefore incorporate a regularization loss, which smooths the parameters by minimizing their second-order derivatives. Consequently, by finding the parameters that minimize the total loss function, which includes both the data loss and the regularization loss, we can estimate the daily incidence of exposure. By summing the difference between the daily incidence of exposure and the daily incidence of becoming infectious, we finally obtain the initial condition of E.
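
The structure of the total loss in panel (d) can be illustrated with a small regularized least-squares problem. The sketch below rests on several assumptions that are not taken from the paper: the latent period is given as a discrete distribution `pmf`, exposure before the window is a single constant unknown, the regularization weight `lam` is arbitrary, and all function names are hypothetical. It shows only how a data loss and a second-difference smoothness penalty combine into one solvable system.

```python
import numpy as np


def build_system(pmf, n_days):
    """Matrix mapping unknown exposure to observed infectious incidence.

    Column 0 is the (assumed constant) exposure level before the window;
    columns 1..n_days are the exposures on days 0..n_days-1.
    pmf[k] is the probability that the latent period lasts exactly k days.
    """
    pmf = np.asarray(pmf, dtype=float)
    A = np.zeros((n_days, n_days + 1))
    for t in range(n_days):
        for k, p in enumerate(pmf):
            if t - k >= 0:
                A[t, 1 + t - k] += p  # exposure inside the window
            else:
                A[t, 0] += p          # exposure before the window
    return A


def second_difference(m):
    """Discrete second-derivative operator of shape (m - 2, m)."""
    D = np.zeros((m - 2, m))
    for i in range(m - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D


def fit_exposure(y, pmf, lam=1.0):
    """Minimize ||A x - y||^2 + lam * ||D x||^2 by stacking both terms."""
    y = np.asarray(y, dtype=float)
    A = build_system(pmf, len(y))
    D = second_difference(len(y) + 1)
    stacked_A = np.vstack([A, np.sqrt(lam) * D])
    stacked_y = np.concatenate([y, np.zeros(D.shape[0])])
    x, *_ = np.linalg.lstsq(stacked_A, stacked_y, rcond=None)
    return x


# Sanity check: constant exposure of 10 per day yields 10 new infectious per day.
toy_pmf = [0.0, 0.2, 0.5, 0.3]
estimate = fit_exposure([10.0] * 8, toy_pmf)
```

Here there are eight equations and nine unknowns, so the data term alone would not pin down a solution; the smoothness term is what makes the estimate unique.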

Fig 3.

Hist-D outperforms Hist-I, regardless of the phase transition of epidemic dynamics and noise.

(a) The trajectory of E and the daily incidence of becoming infectious were simulated through the SEIR model whose latent period follows the gamma distribution with shape 4.06 and scale 1.35 (see Methods for more details). (b) The simulated daily incidence was then used to estimate the initial condition of E and to compare History-Independent estimation (Hist-I) with History-Dependent estimation (Hist-D). Hist-I uses data from only a single day, the starting time, while Hist-D uses data from a number of consecutive days after the starting time that is set by the mean latent period. (c) The graph comparing the true (light gray bars) and the estimated initial condition of E. (d) The scatter plot displaying the estimation error across different levels of the true value. Estimates from Hist-D (green squares) have a much lower error than those from Hist-I (red triangles). (e) The graph showing the root mean squared error (RMSE) (bars) and the mean absolute percentage error (MAPE) (line) of Hist-I and Hist-D. With Hist-D, RMSE and MAPE were reduced by 86% and 85%, respectively, compared to Hist-I. (f) To better reflect real-world situations with observation noise in the data, we applied multiplicative noise drawn from a uniform distribution to the simulated data used in (c-e) and compared the accuracy of Hist-I and Hist-D. (g) The scatter plots displaying the estimation error at a fixed noise level. The error of both Hist-I and Hist-D increased in proportion to the true value, and this was especially pronounced for Hist-I (top). In addition, compared to the zero-noise case (i.e., the case in (c-e)), the error increment of Hist-D was lower than that of Hist-I (bottom). (h) The graph showing the RMSE (bars) and MAPE (line) of Hist-I and Hist-D across the different noise levels. Hist-D achieved a lower RMSE and MAPE than Hist-I at all noise levels.
(i) To emulate a phase transition of epidemic dynamics, we abruptly changed the transmission rate from one value to another at a single time point (top) and simulated data (middle), which were then used to investigate the accuracy of Hist-I and Hist-D. (j) The scatter plot showing the error of Hist-I and Hist-D when the transmission rate was doubled. Hist-D outperformed Hist-I. (k) The graph showing the RMSE (bars) and MAPE (line) of Hist-I and Hist-D across the different fold changes in the transmission rate (1/3, 1/2, 1, 2, 3). Hist-D consistently outperformed Hist-I across all fold changes. In particular, when the transmission rate was reduced to 1/3, the absolute increase in RMSE and MAPE for Hist-D was 22% and 19% of that of Hist-I, respectively, demonstrating the robustness of Hist-D to sudden changes in the transmission rate.
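
The noise experiment and the two error metrics used in these panels can be sketched as follows. The exact form of the noise in the paper is not recoverable from this caption, so the sketch assumes each value is scaled by one plus a uniform draw on a symmetric interval; the function names and the `level` parameter are hypothetical.

```python
import math
import random


def add_multiplicative_noise(series, level, rng):
    """Scale each value by (1 + u), with u uniform on [-level, level] (assumed form)."""
    return [y * (1.0 + rng.uniform(-level, level)) for y in series]


def rmse(true_values, estimates):
    """Root mean squared error."""
    squared = [(e - t) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def mape(true_values, estimates):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    ratios = [abs(e - t) / abs(t) for t, e in zip(true_values, estimates)]
    return 100.0 * sum(ratios) / len(ratios)


rng = random.Random(0)  # fixed seed so the run is reproducible
noisy = add_multiplicative_noise([100.0] * 50, level=0.2, rng=rng)
```

RMSE is in the units of E and is dominated by large outbreaks, while MAPE is relative, which is presumably why the panels report both.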

Fig 4.

Hist-D provides more accurate estimates of the initial condition of E compared to Hist-I for real COVID-19 data in Seoul, Republic of Korea.

(a) We compared Hist-I and Hist-D in estimating the initial condition of E for COVID-19 data in Seoul, Republic of Korea, from August 13 to November 25, 2020. From these data, the daily time series and the distribution of the incubation period (light blue histogram) were extracted (see Methods for more details) and then used to estimate the initial condition of E with Hist-I and Hist-D. (b) The graph comparing the true (light gray bars) and estimated values (Hist-I: red triangles, Hist-D: green squares). While both methods capture the long-term trend, Hist-I exhibits more pronounced fluctuations. (c) The scatter plot comparing the true and estimated values. Estimates from Hist-D are closer to perfect estimation (i.e., the black cross line, where the estimated value equals the true value) than those from Hist-I. (d) The scatter plot displaying the estimation error across different levels of the true value. The error of Hist-I increased in proportion to the true value, while no such pattern appeared for Hist-D. (e) The graph showing the RMSE (bars) and the mean absolute percentage error (MAPE) (line) of Hist-I and Hist-D. Hist-D achieved a 55% lower RMSE (8.44) and a 55% lower MAPE (18.9%) than Hist-I (RMSE: 18.76, MAPE: 42.2%), demonstrating the superior performance of Hist-D on real-world epidemic data. (f) 95% credible interval and empirical coverage of our estimated values. The upper and lower horizontal lines of each box represent the upper and lower bounds of the credible interval, corresponding to the 97.5% and 2.5% quantiles, respectively. 91.3% of the true values were included in the 95% credible interval of Hist-D.
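
The empirical coverage reported in panel (f) can be computed along the following lines. This sketch assumes that a set of posterior samples is available for each estimated value; how those samples are produced is not described in this caption, and the function names are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def credible_interval(samples, level=0.95):
    """Equal-tailed interval, e.g. the 2.5% and 97.5% quantiles for level=0.95."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)


def empirical_coverage(true_values, sample_sets, level=0.95):
    """Fraction of true values that fall inside their own credible interval."""
    hits = 0
    for truth, samples in zip(true_values, sample_sets):
        low, high = credible_interval(samples, level)
        hits += low <= truth <= high
    return hits / len(true_values)


# Toy check (hypothetical samples): the same sample set for four time points.
samples = list(range(101))
coverage = empirical_coverage([50, 99, 1, 10], [samples] * 4)
```

A coverage close to the nominal level, such as the 91.3% reported against a 95% interval, suggests that the intervals are roughly calibrated, if slightly narrow.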
