Gaussian process emulation for exploring complex infectious disease models

doi:10.1371/journal.pcbi.1013849

Table 1.

Parameters of the individual-based disease transmission model.


Fig 1.

Gaussian Process training & emulation workflow.

(A) Gaussian Process (GP) training loop [22]. Training begins with an initial training dataset consisting of a Latin hypercube sample (LHS) of 5,000 data points generated from the input domain (Table 1) using the individual-based simulation model (IBM). During training, the GP is evaluated against a validation dataset of 10,000 data points to determine the optimal number of training iterations and prevent overfitting. After each training cycle, 10⁷ potential new data points are scored based on a policy that considers their predicted value and 95% confidence interval. In each iteration of the training loop, 1,000 additional data points are sampled from these 10⁷ candidate points, with sampling probability proportional to their policy scores. The newly selected data points are then simulated using the IBM and added to the training dataset, and the next training round begins. (B) Use of the trained GP. After training, the GP is tested using an independent dataset of 10,000 LHS data points to evaluate its performance. The trained GP can then be used for rapid predictions, enabling large-scale global sensitivity analyses.
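The two sampling steps in panel (A) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `latin_hypercube` and `sample_by_score` are hypothetical, the policy that produces the scores is not specified in the caption and is therefore taken as a given list of numbers, and drawing distinct candidates with Efraimidis-Spirakis keys is an assumed stand-in for "sampling probability proportional to policy scores".

```python
import heapq
import math
import random


def latin_hypercube(n_points, bounds, rng):
    """Draw n_points from a box so that, in every dimension,
    each of the n_points equal-width strata holds exactly one point."""
    columns = []
    for low, high in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)  # independent stratum order per dimension
        columns.append(
            [low + (s + rng.random()) / n_points * (high - low) for s in strata]
        )
    return [list(point) for point in zip(*columns)]


def sample_by_score(scores, k, rng):
    """Pick k distinct candidate indices, favouring high scores.

    Assumption: weighted sampling without replacement via
    Efraimidis-Spirakis keys (key = log(u) / weight, keep the k largest).
    Candidates with a non-positive score are never preferred.
    """
    def key(weight):
        if weight <= 0:
            return -math.inf
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        return math.log(1.0 - rng.random()) / weight

    keyed = ((key(w), i) for i, w in enumerate(scores))
    return [i for _, i in heapq.nlargest(k, keyed)]
```

In a loop of the kind described above, `latin_hypercube` would supply the initial design and `sample_by_score` would choose which scored candidates to simulate next; the sample sizes (5,000 initial points, 1,000 additions per round) would be passed as `n_points` and `k`.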


Fig 2.

Gaussian Process performance evaluation.

Comparison of observed versus predicted values for 500 randomly sampled test data points, shown for (A) outbreak probability, (B) maximum incidence (i_max), and (C) log₁₀-transformed duration. The yellow line is the identity line (x = y).
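Deviation from the identity line can also be summarised numerically. The sketch below assumes two common summary metrics, root mean square error and the coefficient of determination; the caption does not state which metrics accompany these panels, and the function names are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root mean square distance of the points from the identity line."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination of the predictions against the observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def log10_all(values):
    """Apply the log10 transform used for the duration output in panel (C)."""
    return [math.log10(v) for v in values]
```

For the duration output, both vectors would be passed through `log10_all` before calling `rmse`, so that the error is measured on the same scale as the plot.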


Fig 3.

Sobol sensitivity analysis, maximum incidence (i_max).

(A) First-order and total effects across the entire input domain (Table 1). The first-order effect describes the impact of a single parameter on the model output (i_max), while the total effect of a parameter accounts for both its first-order effect and all interactions with other parameters. Error bars represent the 95% confidence intervals of the sensitivity index estimates. We evaluated a total of 9,437,184 points for the sensitivity analysis. (B) Second-order effects across the entire input domain (Table 1). A second-order effect captures the pairwise interaction between two parameters. Sobol indices whose 95% confidence interval excludes zero are highlighted with a pink border. The largest second-order effect is emphasized with a bold pink border. (C) i_max predictions with varying seasonality strength and first case timing parameters (i.e., the two parameters with the largest second-order effect, see panel B). Other parameters were fixed at default values (Table 1). Corresponding Sobol sensitivity analysis plots for outbreak probability and outbreak duration can be found in S3 and S4 Figs.
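The distinction between first-order and total effects can be made concrete with a small Monte Carlo sketch. It assumes the widely used Saltelli (2010) estimator for first-order indices and the Jansen estimator for total indices, with plain pseudo-random inputs on the unit cube; the caption does not state which estimators or sampling sequence were used. `toy_model` is a hypothetical stand-in for the emulator, chosen so that two inputs act only through their interaction.

```python
import random


def toy_model(x):
    """Hypothetical test function: x[0] acts alone, x[1] and x[2] only interact."""
    return x[0] + 4.0 * (x[1] - 0.5) * (x[2] - 0.5)


def sobol_indices(f, dim, n, seed=0):
    """Estimate first-order and total Sobol indices with n * (dim + 2) model runs."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [f(x) for x in A]
    fB = [f(x) for x in B]

    # Centre the outputs; this leaves the indices unchanged but reduces noise.
    mean = (sum(fA) + sum(fB)) / (2 * n)
    fA = [y - mean for y in fA]
    fB = [y - mean for y in fB]
    var = (sum(y * y for y in fA) + sum(y * y for y in fB)) / (2 * n)

    first, total = [], []
    for i in range(dim):
        # Rows of A with column i taken from B.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) - mean for a, b in zip(A, B)]
        # Saltelli (2010) first-order estimator.
        first.append(
            sum(yb * (yab - ya) for ya, yb, yab in zip(fA, fB, fAB)) / (n * var)
        )
        # Jansen total-effect estimator.
        total.append(
            sum((ya - yab) ** 2 for ya, yab in zip(fA, fAB)) / (2 * n * var)
        )
    return first, total
```

For `toy_model` the analytical values are a first-order index of 3/7 for the first input and 0 for the other two, while the total index of each interacting input is 4/7: a parameter can matter only through interactions, which is what the gap between first-order and total effects reveals.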


Fig 4.

Summary of model outcomes related to outbreak probability.

(A) First-order sensitivity index estimates for the first case timing parameter across varying average infectivity and average mobility values. For each parameter combination, we evaluated a total of 294,912 points. We varied all other parameters across their full ranges (Table 1). The first-order effect measures the influence of a single parameter on the model output (outbreak probability). Yellow stars mark parameter combinations associated with specific model outcomes shown in (B). (B) Predicted outbreak probabilities using the Gaussian Process surrogate model with varying seasonality strength and first case timing values. Panels represent different average infectivities. All other parameters were fixed at default values (Table 1), except for average mobility, which was set to 1.5. (C) Outbreak probabilities inferred from the individual-based model, with varying seasonality strength and first case timing values. Panels represent different average infectivities. As in (B), the remaining parameters were fixed at default values (Table 1), except for average mobility, which was set to 1.5. (B) and (C) thus show outcomes for the same model parameters, obtained with the Gaussian Process surrogate model (B) and with the original individual-based model (C), allowing a direct comparison between the two.


Fig 5.

Municipality-level average infectivity estimates and Gross Cell Products.

(A) Distribution of municipality-specific average infectivity estimates for the 250 parameter combinations with the lowest root mean square errors. The top 5% of municipalities, ranked by median average infectivity estimate, are highlighted in yellow. (B) Distributions of average log₁₀-transformed Gross Cell Product (GCP), a measure of economic activity [56] in which higher values represent greater economic activity, as reported by Siraj et al. (2018), for the municipalities depicted in (A). The municipalities with the largest average infectivity estimates are grouped separately.
