Dynamical footprints enable detection of disease emergence

Developing methods for anticipating the emergence or reemergence of infectious diseases is both important and timely; however, traditional model-based approaches are stymied by uncertainty surrounding the underlying drivers. Here, we demonstrate an operational, mechanism-agnostic detection algorithm for disease (re-)emergence based on early warning signals (EWSs) derived from the theory of critical slowing down. Specifically, we used computer simulations to train a supervised learning algorithm to detect the dynamical footprints of (re-)emergence present in epidemiological data. Our algorithm was then challenged to forecast the slowly manifesting, spatially replicated reemergence of mumps in England in the mid-2000s and pertussis post-1980 in the United States. Our method successfully anticipated mumps reemergence 4 years in advance, during which time mitigation efforts could have been implemented. From 1980 onwards, our model identified resurgent states with increasing accuracy, leading to reliable classification starting in 1992. Additionally, we successfully applied the detection algorithm to 2 vector-transmitted case studies, namely, outbreaks of dengue serotypes in Puerto Rico and a rapidly unfolding outbreak of plague in 2017 in Madagascar. Taken together, these findings illustrate the power of theoretically informed machine learning techniques to develop early warning systems for the (re-)emergence of infectious diseases.

which we will use throughout. The decay rate is specified by the half-life $t_{1/2} = \ln(2)/\lambda$. The full list of EWSs considered is given in Table S1. Each individual EWS can in principle be used in isolation to detect emergence. Alternatively, we here propose that performance might be improved by considering a weighted sum of the different EWSs, $w_0 + \sum_{i=1}^{n} w_i \Theta_{i,t}$, where the index $i = 1, \ldots, n$ labels the different EWSs and $\Theta_{i,t}$ denotes the value of the $i$-th EWS at time $t$. We picked the logistic function,
$$D_t = \frac{1}{1 + \exp\left(-w_0 - \sum_{i=1}^{n} w_i \Theta_{i,t}\right)},$$
to give a measure of epidemic risk ranging from 0 to 1. Emergence is detected when $D_t > c$, where $c$ is the detection threshold. For this approach to be viable, various a priori unknown parameters need to be specified: i) the weights $\{w_i\}_{i=1}^{n}$ and intercept $w_0$, ii) the detection threshold $c$, and iii) the optimal half-life $t_{1/2}$. To do this we use a learning-algorithm-based approach.
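As a minimal sketch of this risk measure, with hypothetical weights and EWS values standing in for the fitted quantities (the actual fitted weights are listed in Table S5):

```python
import numpy as np

def emergence_risk(ews, weights, intercept):
    """Combine n EWS values into a single risk score in (0, 1)
    via the logistic function D_t described above."""
    z = intercept + np.dot(weights, ews)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for three EWSs (e.g. variance, autocorrelation, mean).
w = np.array([1.2, 0.8, 0.5])
risk = emergence_risk(np.array([0.4, 0.3, 0.1]), w, intercept=-1.0)
assert 0.0 < risk < 1.0  # risk always lies strictly between 0 and 1
```

When all EWSs are zero and the intercept is zero, the risk is exactly 0.5, i.e. maximally uninformative.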

S2.1 Transfer learning approach
We want to find a measure of emergence risk that is robust to uncertainty in epidemiological parameters. To do this we fit the logistic model $D_t$ to a set of 10,000 simulated time series, half emerging and half not, using logistic regression. The simulated time series are generated using the canonical Susceptible-Exposed-Infectious-Recovered (SEIR) model, with uncertain parameters drawn using Latin hypercube sampling (LHS).

S2.2 Simulation algorithm
Simulations of epidemiological dynamics were performed using a stochastic SEIR compartmental model [6]. The model incorporates demographic stochasticity, stochasticity in the transmission rate, and reporting error. Simulated data were generated using the Gibson-Bruck next reaction method [7], which simulates a sequence of transition events (e.g. infection, recovery, birth and death), and returns the number of individuals in each model compartment through time. We assumed the mean population size $N_0$ is stable, and that the per capita birth and death rates are both $\nu = 1/75$ years$^{-1}$. Latent and infectious periods were exponentially distributed. We used a parameterisation appropriate for mumps, with mean latent period $1/\sigma = 13$ days and infectious period $1/\gamma = 6$ days [8]. Infection is imported to the population from external sources at rate $\zeta$. Infectious individuals spread the infection to others with transmission rate $\beta(t)$. Table S2 lists the SEIR model parameters, and Table S3 lists the rates and effects of the transitions in the model. The reproductive number of the SEIR model is
$$R_0(t) = \frac{\beta(t)\,\sigma}{(\sigma + \nu)(\gamma + \nu)}. \qquad (4)$$
Since the parameters $\sigma$, $\gamma$ and $\nu$ are all constant, changes in $R_0(t)$ reflect changes in the pathogen's transmissibility. We therefore parameterised $\beta(t)$ in terms of the other parameters via Eq. 4.
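Eq. 4 can be inverted to obtain the transmission rate from a target reproduction number. A sketch using the mumps-like parameter values quoted above:

```python
# Mumps-like parameterisation from the text (rates in 1/days).
sigma = 1 / 13.0        # exposed-to-infectious rate
gamma = 1 / 6.0         # recovery rate
nu = 1 / (75 * 365.0)   # per capita birth and death rate

def beta_from_R0(R0, sigma=sigma, gamma=gamma, nu=nu):
    """Invert Eq. 4, R0 = beta * sigma / ((sigma + nu)(gamma + nu)),
    to recover the transmission rate beta for a given R0."""
    return R0 * (sigma + nu) * (gamma + nu) / sigma
```

Because $\nu$ is tiny relative to $\sigma$ and $\gamma$, $\beta$ is close to $R_0 \gamma$ here.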
Simulations were initialised at the disease-free equilibrium $S = N_0$ at time $t = -10$ years with initial reproduction number $R_0^{(i)}$. After a 10 year equilibration period, $R_0(t)$ followed a stochastic trajectory of duration $T = 10$ years starting from $R_0(0) = R_0^{(i)}$. For emergence, $R_0(t)$ was a realisation of a Brownian bridge ending at the emergence threshold $R_0(T) = 1$,
$$R_0(t) = R_0^{(i)} + \left(1 - R_0^{(i)}\right)\left(\frac{t}{T}\right)^{\kappa} + W(t) - \frac{t}{T}\,W(T),$$
where $W(t)$ is a Wiener process with $W(0) = 0$ and variance $\sigma^2_{bb}\,t$. For time in units of days, we set $\sigma^2_{bb} = 10^{-5}$. The parameter $\kappa$ controls the curvature of the path: for $0 < \kappa < 1$ the path is concave, for $\kappa = 1$ it is linear, and for $\kappa > 1$ it is convex (see Fig. S1). Both the initial value $R_0^{(i)}$ and the curvature $\kappa$ are selected by LHS. For the non-emerging scenario, $R_0(t)$ was a realisation of an Ornstein-Uhlenbeck process with fixed mean, solving the stochastic differential equation
$$\frac{dR_0}{dt} = -k\left(R_0 - R_0^{(i)}\right) + \eta(t),$$
where the white noise term is defined by $\langle \eta(t) \rangle = 0$ and $\langle \eta(t)\,\eta(t') \rangle = \sigma^2_{ou}\,\delta(t - t')$. We fix $k = 0.01$ days$^{-1}$ and $\sigma^2_{ou} = \sigma^2_{bb} = 10^{-5}$. The initial value $R_0^{(i)}$ was drawn from the LHS with the same range as for the emerging scenario, ensuring that both sets of time series (emerging and non-emerging) are initially statistically indistinguishable.
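A sketch of the two $R_0(t)$ scenarios. The trend-plus-pinned-bridge construction below is one plausible way to realise a Brownian bridge with curvature $\kappa$, and the OU process is integrated with Euler-Maruyama; the step size and random seed are arbitrary choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10 * 365       # trajectory duration in days
dt = 1.0           # integration step (days)
t = np.arange(0, T + dt, dt)
sigma2 = 1e-5      # sigma^2_bb = sigma^2_ou, per day

def emerging_R0(R0_init, kappa):
    """Brownian bridge from R0_init to 1, with curvature kappa."""
    dW = rng.normal(0.0, np.sqrt(sigma2 * dt), size=len(t) - 1)
    W = np.concatenate(([0.0], np.cumsum(dW)))  # Wiener process, W(0) = 0
    bridge = W - (t / T) * W[-1]                # pinned to 0 at both ends
    trend = R0_init + (1.0 - R0_init) * (t / T) ** kappa
    return trend + bridge

def non_emerging_R0(R0_init, k=0.01):
    """Ornstein-Uhlenbeck process with fixed mean R0_init."""
    R = np.empty(len(t))
    R[0] = R0_init
    for i in range(1, len(t)):
        noise = rng.normal(0.0, np.sqrt(sigma2 * dt))
        R[i] = R[i - 1] - k * (R[i - 1] - R0_init) * dt + noise
    return R
```

By construction the emerging trajectory starts at $R_0^{(i)}$ and ends exactly at the threshold $R_0(T) = 1$, while the non-emerging trajectory fluctuates around $R_0^{(i)}$.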
At the end of each aggregation period (set to 1 week, to match the mumps dataset), reporting error was applied to the case count by sampling a negative binomial distribution with reporting probability $\rho$ and dispersion parameter $\phi$ [9,10]. Given $C_t$ cases, the mean number reported is $\mu_t = \rho C_t$. The variance is specified by the dispersion parameter via the relation $\sigma^2_t = \mu_t + \mu_t^2/\phi$. Increasing $\phi$ reduces the overdispersion of the data, so that for large $\phi$ the distribution of reports is approximately Poisson. We set $\phi = 100$, and selected $\rho$ by LHS.
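This reporting step can be sketched by converting the $(\mu_t, \phi)$ parameterisation into numpy's $(n, p)$ parameterisation of the negative binomial, with $n = \phi$ and $p = \phi/(\phi + \mu_t)$, which yields mean $\mu_t$ and variance $\mu_t + \mu_t^2/\phi$:

```python
import numpy as np

rng = np.random.default_rng(1)

def report_cases(true_cases, rho, phi=100):
    """Apply reporting error: negative binomial with mean mu = rho * C_t
    and variance mu + mu^2/phi, via numpy's (n, p) parameterisation."""
    mu = rho * true_cases
    p = phi / (phi + mu)
    return rng.negative_binomial(phi, p)
```

With $C_t = 0$ the mean is zero and the sampler deterministically reports zero cases.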
The five parameters with uncertainty ($N_0$, $\rho$, $\zeta$, $R_0^{(i)}$ and $\kappa$) were selected for each realisation of the model by LHS, with ranges listed in Table S4. The Latin hypercube generated $M = 10000$ samples in $(0,1)^5$. For $\rho$ and $R_0^{(i)}$ we linearly transformed the samples to obtain the appropriate distribution. To cover a wide range of population sizes we assumed $\ln N_0$ has a uniform distribution over the parameter range. For similar reasons we chose $1/\zeta$ to be uniformly distributed. To ensure an equal balance between convex and concave functions for $R_0(t)$, we chose $1/\kappa$ to be uniformly distributed for $\kappa < 1$ and $\kappa$ to be uniformly distributed for $\kappa \geq 1$, with equal probability of $\kappa < 1$ and $\kappa > 1$. Half of the samples were then assigned to be non-emerging, and the other half emerging (see above).

Table S2: SEIR model parameters.
$\beta(t)$: transmission rate
$\sigma$: exposed-to-infectious rate
$\gamma$: recovery rate
$\nu$: per capita birth and death rate
$N_0$: average population size
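The sampling scheme can be sketched with SciPy's quasi-Monte Carlo module. The parameter ranges below are placeholders for illustration only, not the values of Table S4:

```python
import numpy as np
from scipy.stats import qmc

M = 10000
u = qmc.LatinHypercube(d=5, seed=0).random(M)  # M samples in (0, 1)^5

# Placeholder ranges standing in for Table S4.
N0 = np.exp(np.log(1e5) + u[:, 0] * (np.log(1e7) - np.log(1e5)))  # ln N0 uniform
rho = 0.1 + u[:, 1] * (1.0 - 0.1)                                  # rho uniform
zeta = 1.0 / (10.0 + u[:, 2] * (1000.0 - 10.0))                    # 1/zeta uniform
R0_init = 0.1 + u[:, 3] * (0.9 - 0.1)                              # R0^(i) uniform

# kappa: half the samples concave (1/kappa uniform on [1, 10)),
# half convex (kappa uniform on [1, 10)).
kappa = np.where(u[:, 4] < 0.5,
                 1.0 / (1.0 + 2 * u[:, 4] * 9.0),
                 1.0 + (2 * u[:, 4] - 1.0) * 9.0)
```

The `np.where` split gives equal probability to concave ($\kappa < 1$) and convex ($\kappa > 1$) trajectories, as described above.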

S2.3 Lasso regression
In order to prevent over-fitting to the training data, we found values for the weights $\{w_i\}_{i=1}^{n}$ and intercept $w_0$ by minimising the negative log-likelihood of the logistic model over the space of weights, subject to an $l_1$-penalisation with strength $p$. This is also known as lasso regression [11]. The complete cost function we minimised is
$$C(w) = -\sum_{j=1}^{M} \left[ y^{(j)} \ln D_t^{(j)} + \left(1 - y^{(j)}\right) \ln\left(1 - D_t^{(j)}\right) \right] + p \sum_{i=1}^{n} |w_i|,$$
where the superscript $j$ labels the replicate and $y^{(j)} = 1$ if the sample was emerging and 0 if not. We performed the lasso regression using the Python package scikit-learn [12], which uses the liblinear solver [13].
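A sketch of this fitting step with scikit-learn. Note that sklearn's `C` argument is the inverse of the penalty strength $p$, and the training data below are synthetic stand-ins for the simulated EWS matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Stand-in training data: rows are simulated time series, columns the 8 EWSs.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)) > 0  # emerging labels

# l1-penalised (lasso) logistic regression with the liblinear solver;
# sklearn parameterises the penalty as C = 1/p.
p = 2.0
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / p).fit(X, y)
weights, intercept = clf.coef_[0], clf.intercept_[0]
```

The $l_1$ penalty shrinks uninformative weights toward zero, so the fit doubles as a feature selection over the EWSs.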

S2.4 AUC test statistic
After fitting, $D_t(\Theta_t, w)$ provides a measure of emergence risk. A data point is labelled as emerging if $D_t$ exceeds the detection threshold.
In order to determine the optimum threshold, we calculated the Receiver Operating Characteristic (ROC) curve, a parametric plot of the true positive rate against the false positive rate as a function of the detection threshold. The Area Under the ROC Curve (AUC) statistic provides a measure of how distinct the two classes are under a given classifier [10]. For example, AUC = 1 means that $D_t$ is always larger for emerging time series than for non-emerging time series; there is then a range of intermediate threshold values for which the classifier performs perfectly. Likewise, AUC = 0 means $D_t$ is always smaller for emerging time series than non-emerging time series, and the classifier again performs perfectly (with the classification rule inverted). Performance is worst if AUC = 0.5, as then it is equiprobable that an observed $D_t$ belongs to an emerging or a non-emerging time series. AUC values further from 0.5 therefore imply better performance.
We calculate: (i) the AUC for the full ten years of data, using the overall true and false positive rates, and (ii) the AUC through time, i.e. for each time point individually, using the true and false positive rates for that 4-weekly period. The detection threshold $c$ is chosen to minimise the sum of type I and type II errors for the full dataset.
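A sketch of the threshold selection on synthetic risk scores. Minimising the sum of type I and type II error rates is equivalent to maximising the Youden index, tpr − fpr:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)

# Stand-in risk scores: emerging series (label 1) shifted upwards.
y_true = np.repeat([0, 1], 500)
D = np.concatenate([rng.normal(0.3, 0.1, 500), rng.normal(0.6, 0.1, 500)])

auc = roc_auc_score(y_true, D)
fpr, tpr, thresholds = roc_curve(y_true, D)

# Minimising type I + type II error rates = maximising tpr - fpr.
c = thresholds[np.argmax(tpr - fpr)]
```

For these two overlapping score distributions the chosen threshold lands near their crossover point, about midway between the two means.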

S2.5 Hyperparameter selection
The optimum set of weights and detection threshold depend on two hyperparameters: the half-life $t_{1/2}$ of the moving window average and the penalty strength $p$ in the lasso regression. These were selected by 10-fold cross-validation [11]. The time series were randomly split into 10 chunks; for each chunk, the other 9 chunks were used for training, with the remaining chunk used as an out-of-fit test. For each out-of-fit test we calculated the AUC test statistic, and across folds we calculated the mean $\mu_{AUC}$ and the standard deviation $\sigma_{AUC}$. We performed a grid search over pairs of $t_{1/2}$ and $p$ to find
$$(t^*_{1/2}, p^*) = \arg\max_{(t_{1/2},\, p)} \mu_{AUC}(t_{1/2}, p).$$
This parameter combination corresponds to over-fitting the training data [11]. The best hyperparameter combination instead corresponds to the maximum penalty strength (to reduce over-fitting) and the minimum half-life (to improve responsiveness of $D_t$ to changes) that provide satisfactory performance. As the criterion for satisfactory performance we required that the mean AUC fall within one standard deviation of the maximum mean AUC [11], i.e.
$$\mu_{AUC}(t_{1/2}, p) \geq \mu_{AUC}(t^*_{1/2}, p^*) - \sigma_{AUC}(t^*_{1/2}, p^*). \qquad (10)$$
We first found the value for $p$, subject to the constraint in Eq. 10. The best half-life was then found, subject to the same constraint, but with $p$ fixed to $p^{best}$. Note that the resulting $(t^{best}_{1/2}, p^{best})$ depends on the order in which the maximisation and minimisation are performed. We determined in preliminary investigations that maximising $p$ was more important to performance than minimising $t_{1/2}$; hence it is performed first. Results from the cross-validation are shown in Fig. S2.
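The two-stage selection can be sketched as follows, with invented cross-validation results standing in for the grid of Fig. S2:

```python
import numpy as np

# mu_auc[i, j], sd_auc[i, j]: cross-validated mean and s.d. of the AUC for
# half-life t_grid[i] and penalty p_grid[j] (hypothetical values).
t_grid = np.array([10.0, 30.0, 90.0, 180.0])
p_grid = np.array([0.01, 0.1, 1.0, 10.0])
rng = np.random.default_rng(4)
mu_auc = 0.8 + 0.1 * rng.random((4, 4))
sd_auc = 0.02 * np.ones((4, 4))

# "Satisfactory" = within one s.d. of the best mean AUC.
floor = mu_auc.max() - sd_auc.flat[mu_auc.argmax()]
ok = mu_auc >= floor

# Stage 1: largest penalty with any satisfactory half-life.
j_best = max(j for j in range(len(p_grid)) if ok[:, j].any())
# Stage 2: smallest half-life that is satisfactory at that penalty.
i_best = min(i for i in range(len(t_grid)) if ok[i, j_best])
t_best, p_best = t_grid[i_best], p_grid[j_best]
```

Running the two stages in the opposite order can select a different pair, which is why the text fixes the penalty first.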
The resulting pair of hyperparameters, $(t^{best}_{1/2}, p^{best})$, were then used to i) calculate the EWSs and ii) perform lasso regression on the full dataset to give a final set of weights and intercept. The weights, intercept and half-life can be applied to calculate EWSs and $D_t$ from epidemiological data without further fitting. Fitted values for the weights and hyperparameters are listed in Table S5.

S2.6 Summary of the EWS-based detection method
In summary, our approach to detecting disease emergence is as follows:

1. Each of the 8 EWSs (the autocorrelation, variance, etc.) is a non-linear transformation of the input time series data that theory predicts will vary as the transition is approached.

2. We use a logistic model to transform the 8 EWSs into one measure of emergence risk. This transformation is chosen for pragmatic reasons, as it allows us to fit the weights using logistic regression.

3. To find the optimal weight to assign each EWS, we fit the logistic model to a training data set using lasso regression. The training data were generated using an SEIR transmission model over a range of parameters to ensure robustness to parametric uncertainty. Because we are interested in robustness, we don't need to be as concerned with whether the model is a good model of the dynamics.
Our approach is therefore a statistical approach (a logistic model), using a set of non-linear data transforms as predictors, with the aim of classifying emerging and non-emerging time series.

Figure S3: AUC through time for the complete simulated dataset. In panel a) case counts are converted to incidence data before EWSs are calculated; in panel b) raw case counts are used. Our measure of emergence risk, $D_t$, with weights fitted to the simulated data (light green line), outperforms any individual EWS at distinguishing between emerging and non-emerging time series. Performance is only affected by data type for the mean, variance and index of dispersion. Exclusion of the skewness is seen to have the most detrimental impact on performance, followed by the kurtosis and coefficient of variation. c-d) Based on the ranking in b), EWSs are sequentially added, with colour indicating the rightmost EWS included in the fit. Including the skewness, kurtosis and coefficient of variation is sufficient to get close-to-optimal performance. Calculated using incidence data.

S2.7 Impact of emergence mechanism and the presence of additional time-varying parameters on algorithm performance
We examined whether the performance of our algorithm was sensitive to (i) the mechanism underlying (re-)emergence and (ii) the presence of time-varying parameters (in addition to $R_{eff}$). Specifically, if (re-)emergence is driven by waning of vaccinal or natural immunity [14], then increases in $R_{eff}$ would be concave ($d^2 R_{eff}/dt^2 < 0$). In contrast, if (re-)emergence is the result of the evolution of immune evasion following sequential mutation [15,16], then the trajectory of $R_{eff}$ through time would be convex ($d^2 R_{eff}/dt^2 > 0$) [17].
The primary dataset we used for training assumed $R_0(t)$ is the only time-varying parameter. To test the robustness of the learning algorithm fit, we also trained on three additional simulated datasets: i) a dataset with only convex ($\kappa > 1$) trajectories for $R_0$, ii) a dataset with only concave ($\kappa \leq 1$) trajectories for $R_0$, and iii) a dataset where $N_0$, $\rho$ and $\zeta$ all varied linearly in time, with start and end points included separately in the Latin hypercube with the identical ranges given in Table S4.
We retrained the learning algorithm on simulated datasets consisting of only (i) concave and (ii) convex trends. Surprisingly, the EWS weights obtained by fitting to simulated data comprising both mechanisms of increase performed as well as or better than fits trained on concave or convex data alone (Figs. S5 & S6). We interpret this as support for our claim that $D_t$ is a mechanism-independent measure of emergence risk. We performed a similar comparison using data simulated from a model with multiple time-varying parameters, in addition to $R_0$ (which drives the transition), to assess their confounding effects (Fig. S7). We again found that the EWS weights obtained by fitting to simulated data without such covariates performed comparably to the optimal fit with covariates.

Figure S5: AUC through time for the simulated dataset with only convex trends in $R_0$. Case counts are converted to incidence data before EWSs are calculated. The fit to the simulated dataset with both concave and convex trends (light green) has comparable performance to the fit to just convex simulated data (dark purple).

Figure S6: AUC through time for the simulated dataset with only concave trends in $R_0$. Case counts are converted to incidence data before EWSs are calculated. The fit to the simulated dataset with both concave and convex trends (light green) has comparable performance to the fit to just concave simulated data (dark purple).

Figure S7: AUC through time for the simulated dataset with the inclusion of covariates. Covariates included are the population size, the reporting probability and the importation rate. Covariates vary linearly through time; initial and final values are included as independent variables in the Latin hypercube with the identical ranges given in Table S4. Case counts are converted to incidence data before EWSs are calculated. The fit to the simulated dataset with covariates (light green) has comparable performance to the fit without their inclusion (dark purple).

S3 Motivation for using the General Mixture Model
We chose to use a General Mixture Model (GMM) to model the distribution of 2004-2005 mumps outbreak sizes. This model is motivated by basic epidemiological theory and by observations of patterns in the data.
From epidemiological theory, the threshold $R_{eff} = 1$ divides the outcome of the introduction of an infectious individual into two phases:

• If $R_{eff} < 1$, the outbreak is small and its size is determined by stochasticity in the transmission dynamics, with larger outbreaks being increasingly rare (see for example [18]).

• Conversely, if $R_{eff} > 1$, the outbreak size is much larger and is an approximately deterministic function of the locality's epidemiological properties (e.g. the initial susceptible population size). In this case, most of the variability in outbreak size stems from heterogeneity in the epidemiological parameters.
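The authors' GMM itself is not specified in this excerpt. As an illustration of the two-phase structure only, the sketch below scores outbreak sizes under a hypothetical two-component mixture: a geometric component for subcritical ($R_{eff} < 1$) outbreaks and a normal component for supercritical ones. All parameter values are invented:

```python
from scipy import stats

# Hypothetical mixture parameters (illustration only).
w_small, p_geom = 0.8, 0.05    # mixing weight and geometric parameter
mu_big, sd_big = 400.0, 80.0   # large-outbreak mean and s.d.

def posterior_large(size):
    """Probability that an outbreak of the given size came from the
    supercritical (R_eff > 1) component, by Bayes' rule."""
    f_small = w_small * stats.geom.pmf(size, p_geom)
    f_big = (1 - w_small) * stats.norm.pdf(size, mu_big, sd_big)
    return f_big / (f_small + f_big)
```

Small outbreaks are assigned almost entirely to the subcritical component and large ones to the supercritical component, mirroring the sharp split described above.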