OutbreakFlow: Model-based Bayesian inference of disease outbreak dynamics with invertible neural networks and its application to the COVID-19 pandemics in Germany

Mathematical models in epidemiology are an indispensable tool to determine the dynamics and important characteristics of infectious diseases. Apart from their scientific merit, these models are often used to inform political decisions and interventional measures during an ongoing outbreak. However, reliably inferring the epidemical dynamics by connecting complex models to real data is still hard and requires either laborious manual parameter fitting or expensive optimization methods which have to be repeated from scratch for every application of a given model. In this work, we address this problem with a novel combination of epidemiological modeling with specialized neural networks. Our approach entails two computational phases: In an initial training phase, a mathematical model describing the epidemic is used as a coach for a neural network, which acquires global knowledge about the full range of possible disease dynamics. In the subsequent inference phase, the trained neural network processes the observed data of an actual outbreak and infers the parameters of the model in order to realistically reproduce the observed dynamics and reliably predict future progression. With its flexible framework, our simulation-based approach is applicable to a variety of epidemiological models. Moreover, since our method is fully Bayesian, it is designed to incorporate all available prior knowledge about plausible parameter values and returns complete joint posterior distributions over these parameters. Application of our method to the early Covid-19 outbreak phase in Germany demonstrates that we are able to obtain reliable probabilistic estimates for important disease characteristics, such as generation time, fraction of undetected infections, likelihood of transmission before symptom onset, and reporting delays using a very moderate amount of real-world observations.


Introduction
Assessing important disease characteristics and transmission dynamics is of utmost importance in the case of new epidemic outbreaks in order to forecast their progression and guide effective public health measurements. Mathematical models that provide a reliable representation of the processes driving the dynamics of an epidemic are an essential tool for this task (see for example [1]). In the case of communicable diseases, these models typically take the form of systems of ordinary differential equations governing the transitions between different population compartments, such as, "Susceptible", "Infected", and "Recovered" (SIR) [1]. Provided that intrinsic properties of the disease (e.g., transmission rates and recovery periods) are already known, SIR models and their extensions are successfully used to simulate outcomes of possible public health interventions.
However, for newly emerging infectious diseases, such as Covid-19, most of these properties are initially unknown and must be estimated before realistic predictions can be made. The task of determining these properties is additionally hampered by limited data availability and integrity within the early outbreak phase. During the initial outbreak of the Covid-19 pandemic, model-based inference was used to provide rapid estimates of key epidemiological parameters, which otherwise can be difficult to infer directly from primary clinical tracing data. For instance, an earlier study [2] incorporated domestic and international travel from and to Wuhan city in a SEIR model and used reported cases outside of Wuhan to infer the reproduction number R 0 and epidemic doubling time. Similar approaches were used to estimate the reproduction number of Covid-19 in various other settings [3][4][5]. Other studies aimed at identifying critical epidemiological characteristics, such as age-specific mortality rates [6], the impact of implemented control measures on disease transmission [7,8], or the fraction of undocumented infections [9].
Importantly, since such SIR-type models are employed to forecast the dynamics of an epidemic dependent on public interventions or seasonal effects, reliable inference of such key epidemiological parameters and trustworthy uncertainty quantification is paramount to support decision making. The estimation of hidden model parameters from observations of model outcomes (inverse inference) is also referred to as model calibration in the medical decision and health policy modeling literature [10]. Traditionally, model calibration has been considered as an optimization problem seeking the best possible parameter configuration explaining the data (e.g., by performing non-linear least squares minimization or relying on maximum likelihood estimation [11]). However, the resulting optimization and maximum likelihood methods focus on point estimates for the individual parameters and usually lack appropriate approaches to assess their accuracy. This is a severe disadvantage, because reliable uncertainty quantification is crucial when the estimates shall be used to model and predict future outcomes.
In contrast to optimization and maximum likelihood approaches, Bayesian methods provide a principled way to quantify the uncertainty of inferred model parameters, as they return full posterior distributions for the unknown parameters rather than single point estimates. Markov chain Monte Carlo (MCMC) sampling represents one of the classical approaches to Bayesian model calibration [12], and it has been extensively used in Covid-19 studies to infer model parameters describing the dynamics of the disease [3,4,8,13]. However, Bayesian model calibration is computationally expensive and depends explicitly on the evaluation of the likelihood function of model parameters. When the likelihood function is intractable or unknown, approximate Bayesian computation (ABC) can be used to approximate the posterior distribution of parameters [14,15]. However, standard ABC methods notoriously suffer from poor scalability (i.e., the efficient application to large data sets and complex models), which confines their utility to relatively simple models containing only a few parameters [16].
In order to overcome the limitations of individual parameter estimation methods, our approach aims to combine the advantages of optimization-based and Bayesian methods by using specialized invertible neural networks. In particular, we develop a novel methodological framework based on a neural network architecture called BayesFlow [17] to facilitate modelbased inference with a primary, but not exclusive, focus on epidemiology. Our method can incorporate an arbitrary number of epidemiological time series (or other type of temporal information) and can, in principle, be applied to any dynamic model described by (stochastic) ordinary differential equations. Moreover, since the length of the time series is not pre-determined during training and inference, we can consider additional information in our analyzes without having to re-train the networks. By using extensive simulation-based training, our method circumvents the general necessity for large training sets that are lacking during emerging epidemics. Additionally, our method returns posterior distributions which are fully compatible with a Bayesian interpretation and can thus be used to assess the uncertainty associated with any estimation and prediction quantity.
We demonstrate the feasibility of our method by analyzing public Covid-19 data for Germany and the individual German federal states based on the reported daily number of infected, recovered and deceased cases during the first months of the pandemic. Our neural network is trained using simulations from a customized SEIR-model variant [18], in combination with an observation model accounting for the differences between true and reported case numbers, and an intervention model describing the intervention measures for prevention and control imposed by German authorities [8]. Despite the limited number of measurements and the considerable complexity of the model with 34 unknown parameters in total, our network manages to extract information for more than half of the parameters. Credibility intervals of our parameter estimates are well in line with independently published results, and re-simulations starting at our estimated parameters reproduce the observed time series very well. In particular, our inference suggests that approximately four fifths of all infectious individuals remained undetected across all German federal states.

Data
The model was applied to epidemiological data on the number of reported Covid-19 cases (infected, recovered and deceased) in Germany and the individual federal states (only infected and diseased) from March 01, 2020 until June 11, 2020. Data were collected from publicly available sources during the same time period and were not subsequently cleaned or corrected in the aftermath. Therefore, all sources of uncertainty remain in the data as they would have in the early days of an ongoing pandemic. Code and scripts for reproducing all results and figures as well as the general framework of OutbreakFlow for training new networks on new models are available at https://github.com/stefanradev93/AIAgainstCorona.

Neural Bayesian parameter estimation
Following a Bayesian approach for parameter estimation requires prior knowledge about reasonable parameter ranges [12]. Combining this prior knowledge with information extracted from the observed data leads to a posterior distribution which is generally narrower than the prior and thus expresses our updated state of knowledge and associated uncertainty for the individual parameters. More formally, let θ be the vector of all unknown model parameters and X ≔ x 1:T = [x 1 , . . ., x T ] a multivariate epidemiological time series with T individual time points indicating, for instance, the number of infected, recovered and diseased individuals. Then the well-known analytical formula for the posterior according to Bayes' rule is where p(X|θ) represents the likelihood of observing data X when the true parameters are θ, p (θ) is the prior distribution encoding our knowledge about plausible parameter configurations for θ, and the denominator represents a normalizing constant (usually called the evidence). Despite being conceptually simple, this formula poses two major challenges in the present setting: (i) Efficient and accurate approximation of the intractable posterior p(θ|X) is challenging; (ii) The likelihood is only implicitly defined via realizations X * p(X|θ) generated by repeatedly running simulations of the underlying epidemiological model with different θ.
We solve both problems with our recently proposed neural Bayesian inference architecture called BayesFlow, which is explained in full mathematical details in the corresponding methodological work [17]. The core component of BayesFlow is an invertible neural network which enables a bidirectional flow of information. During the training phase, the invertible network is run in forward direction to learn an accurate model q(θ|X) � p(θ|X) for the posterior distribution of parameters θ given observations X, using a large number of simulated pairs (X i , θ i ) * p(X|θ) p(θ) as training data. During the inference phase, the network makes use of its invertible architecture and operates in the inverse direction to estimate the posterior q(θ|X = x obs ) � p(θ|X = x obs ) for the actually observed data x obs .
Validation experiments reported in [17] have demonstrated for various model systems, that the BayesFlow method (i) can estimate complex stochastic models of widely varying data types and sizes (e.g., population time series, predator-pray population time series, human responsetime data); (ii) outperforms variational or dropout methods for uncertainty quantification; (iii) learns data representations which are more informative than manually selected summary statistics; and (iv) outperforms case-based methods such as ABC and MCMC, whose computations must be re-run from scratch for every observed dataset. The latter advantage is called "amortized inference": a BayesFlow network learns to generalize the training knowledge and can efficiently apply it to multiple real observations without retraining. The network's training effort thus quickly amortizes over a sequence of inference queries (e.g., time series), in contrast to sampling methods (e.g., MCMC), which cannot leverage experience and require the same large computational effort for every query. In addition, fast amortized inference facilitates model validation by enabling efficient probabilistic calibration and posterior predictive checks on large validation datasets [17].

OutbreakFlow-The BayesFlow approach to epidemiological inference
We propose OutbreakFlow, an instantiation of our BayesFlow architecture that utilizes a novel combination of three jointly trained neural modules to analyze noisy multivariate time series with potentially long-term temporal dependencies, as are typical in the context of epidemiology. It can process both short and long time series and can thus perform efficient Bayesian updating as new data become available (e.g., on a daily basis in case of . Moreover, our method can incorporate additional prior knowledge in the formulation of the underlying generative model. Our neural architecture comprises three sub-networks: (i) a convolutional filtering network performing noise reduction and feature extraction on the raw observational data; (ii) a recurrent summary network reducing filtered time series of arbitrary length to statistical summaries of fixed size; and (iii) an invertible inference network performing Bayesian parameter inference, given the learned summaries of the observations. Fig 1 depicts the training and inference phase with our inference architecture and its essential elements.
The design of the convolutional network is inspired by the well-known Inception network, which has shown tremendous success in computer vision tasks [19]. In particular, our network is implemented as a deep fully convolutional network which applies adjustable one-dimensional filters of different size at each level (cf. Fig 1). The intuition behind this design is that filters of different size might capture patterns at different temporal scales (e.g., a filter of size one will capture daily fluctuations whereas a filter of size seven will capture weekly dynamics). This, in turn, should ease the task of extracting informative temporal features for parameter estimation.
The output of the convolutional network is a multivariate sequence containing the filtered epidemiological time series. In order to reduce the filtered sequence to a fixed-size vector, we pass it through a long-short term memory (LSTM) recurrent network [20]. In contrast to standard feed-forward neural networks, LSTM networks incorporate feedback connections which make them ideally suited for processing sequences of observations such as time series.
A standard LSTM network consists of a cell and three gates. The cell stores the internal memory of the network over arbitrary temporal intervals. The three gates, comprising an input gate, an output gate, and a forget gate, interact in determining the importance of old and new information. Importantly, LSTM networks can deal with sequences of varying length, which enables them to process data whose duration is dictated by data availability and to perform online inference, that is, process new data instantly as they become available. In contrast to predefined pooling operations (e.g., mean, max, or variance), our recurrent networks learn pooling operations that are adapted to the data and can thus extract potentially richer representations. In this way, our inference architecture learns to filter and extract the most informative features from the noisy observations in an end-to-end manner. Thus, the user is freed from the difficult (and usually suboptimal) task to hand-engineer suitable data features (summary statistics). Finally, the inference network has the task of inverting the forward model given the information extracted by the convolutional and recurrent networks (see also [17] for more details on the design of invertible networks for Bayesian inference).
The invertible network has two modes of operation. During training, the network is only evaluated in the forward direction and encouraged via a suitable optimization criterion to transform the posterior into a simple base distribution (e.g., Gaussian) from which samples can be easily obtained. Thus, the inference network integrates information from both the prior and the data-generating mechanism (i.e., the implicit likelihood).
During inference, the inference network is only evaluated in the inverse direction using conditional information from real observed data passed through the filtering and summary networks. The posterior is approximated by repeatedly sampling from the simple base distribution and applying the inverse of the forward transformation learned during the training phase. Importantly, this method recovers the true posterior under perfect convergence of the optimization method [17].
More formally, let us denote the functions represented by the three networks as g, h, and f. Then the filtering network determines a filtered time seriesx 1:T ¼ gðx 1:T Þ from observed data x 1:T , where the number of time steps T depends on data availability. The summary network , the assumed epidemiological model is used to simulate time series resembling the observed epidemiological data, based on prior distributions of the unknown parameters. The synthetic time series are used to train the composite neural network consisting of (i) a convolutional filtering network, which extracts relevant features while preserving the temporal structure of the data, (ii) a summary network, which reduces the transformed time series to a fixed-sized vector of maximally informative representations, and (iii) an inference network, which estimates the joint parameter posterior from these data representations. During the inference phase (blue frame on the right), the real epidemiological data x obs 1:T are passed to the trained network to infer the posterior distribution of the unknown disease parameters. A full description of the architecture and the methodology is provided within Materials and methods.
https://doi.org/10.1371/journal.pcbi.1009472.g001 turns the output of the filtering network into a fixed-size representation y ¼ hðx 1:T Þ by keeping only the final output vector of the LSTM network, which encodes the accumulated information over all observed time steps. Finally, the inference network generates samplesŷ � qðy j x 1:T Þ from the parameter posterior by computingŷ ¼ f À 1 ðy; zÞ with normally distributed random vectors z � N ð0; IÞ. The parameters of all three networks are optimized jointly during the training phase. Denoting the vector of all trainable network parameters as ϕ, the three networks solve the following optimization criterion where KLðp jj qÞ denotes the Kullback-Leibler divergence [21] between probability density functions p and q. We approximate the latter expectation via its empirical mean over realizations (X, θ) * p(θ, X) obtained via simulations from a dynamic model. As previously mentioned, one of the most important advantages of our method is amortized inference, owing to the fact that we approximate the posterior globally via a single set of net- �. This is especially advantageous in epidemiological contexts, where the same model is applied in multiple populations (countries, cultures) or at different scales (states, regions), since the same pre-trained network architecture can be repeatedly utilized for different populations and scales. Indeed, in the following real-world application, we demonstrate efficient amortized inference and excellent predictive performance with a single architecture applied simultaneously to epidemiological data from all German federal states.

The epidemiological model
In order to account for the specific nature of the current Covid-19 outbreak, our epidemiological model consists of three submodels: (i) a disease model describing the true dynamics of relevant population compartments; (ii) an intervention model describing the strengthening and relaxation of non-pharmaceutical intervention measures; and (iii) an observation model describing the deviations of publicly reported data from the true values. These models build upon the previous work of Khailaie et al. [18] and Dehning et al. [8], who adapted the general SIR approach to the specifics of the Covid-19 pandemic and the situation in Germany. Parameter priors are based on our current state of knowledge about disease characteristics and government measures, but are chosen very wide to prevent them from dominating the information extracted from the actual observations. Disease model. The disease model is a system of non-linear ordinary differential equations (ODEs) distinguishing between six compartments: susceptible (S), exposed (E-infected individuals who do not show symptoms and are not yet infectious), infected (I-symptomatic cases that are infectious), carrier (C-infectious individuals who recover without being detected), recovered (R), and dead (D), see Fig 2. Note that direct recovery from the carrier state C covers all reasons why an infection might go undetected, including, among others, truly asymptomatic cases, lack of follow-up on pre-symptomatic cases, limited testing capacity under minor symptoms-our model does not differentiate between these posibilities. Observations with limited accuracy (as described by the observation model below) are available for the compartments I, R, and D. The true time series of all compartments are therefore considered latent and need to be estimated.
The ODEs for our epidemiological model are defined by: The meaning of the model parameters and their priors are detailed in S1 Table in S1 Text. Prior ranges are based on considerations in [8] and [22]. All rate parameters are considered to be constant, except for the transmission rate λ(t) which is considered to be time-dependent as it accounts for possible behavioral changes implied by non-pharmaceutical interventions.
Intervention model. The intervention model accounts for changes in the transmission rate λ(t) due to non-pharmaceutical interventions and mitigation measures. Corresponding to the approach followed by [8], we define three change points for λ(t) encoding an assumed transmission rate reduction in response to intervention measures imposed by the German authorities. Each change point is represented by a piece-wise linear function with three degrees The model is of SEIR-type with six compartments: susceptible (S), exposed (E, i.e. infected but non-infectious), carrier (C, i.e. infectious but undetected), infected (I, i.e. infectious and diagnosed), recovered (R) and dead (D) individuals. In addition, the blue boxes indicate model extensions accounting for external factors, namely intervention measures affecting the transmission rate λ(t) and imperfect case reporting due to noise or delays. Note, that data is only reported for the observable compartments I, R, and D. For a detailed description of the model and the different components see Materials and methods.
https://doi.org/10.1371/journal.pcbi.1009472.g002 of freedom: the effect strength and the boundaries defining the time interval for the effect to fully manifest itself. The chosen priors express the expected effect of each measure to reduce the transmission rate roughly by half after the date when it comes into force, but we allow for wide uncertainty margins to facilitate learning of the actual behavior. In addition to the previous approach in [8], our model also includes a fourth change point expressing the assumption that an eventual withdrawal of effective intervention measures (officially or due to non-compliance) will lead to a slight increase of the transmission rate. Note that we assume that interventions do not affect the risk of infection upon contact with a detected infectious individual (β). Prior distributions and descriptions for the intervention model's parameters are given in S2 Table in S1 Text.
Observation model. The observation model accounts for the fact that officially reported cases might not represent the true case numbers of the epidemics. It represents three error sources: a delay between actual infection and reporting, the weekly modulation of reporting rates (since testing and reporting activities are considerably reduced on weekends), and a noise term describing random fluctuations. Separate parameter sets are learned for each of the three publicly reported time series I (obs) , R (obs) , and D (obs) -the remaining compartments are considered unobservable. The relationship between the reported counts and their true values is described by the following set of discrete-time difference equations with time steps t measured in days.
where L I , L R , and L D denote the reporting delays (lags), and denote σ I , σ R and σ D the scales of multiplicative reporting noise for the respective compartments. The noise variables ξ t follow a Student-t distribution with 4 degrees of freedom. The weekly modulation of reporting coverage f C ðtÞ for each of the compartments C 2 fI; R; Dg is computed as follows: This equation yields three additional unknown parameters for the weekly modulation amplitudes A I , A R , A D , and phases F I , F R , F D , each. The prior distributions and descriptions for the observation model's parameters are listed in S3 Table in S1 Text.

Neural network training
An OutbreakFlow is trained with simulated data by minimizing the negative log posterior according to Eq 3. The training phase can be realized in different ways, depending on the modeling scenario and the modelers' computational budget. First of all, when only a single time series has to be analyzed, non-amortized methods like [23,24] may outperform Out-breakFlow, because they constrain the simulation scope to the vicinity of the observed data. On the other hand, when the model has to be applied to multiple observed time series (e.g., to different federal states in Germany or even countries), our upfront training effort quickly amortizes, since a trained OutbreakFlow executes inference orders of magnitude faster than a case-based (non-amortized) method. We now outline three training modes for OutbreakFlow.
Simulation-based offline learning. Traditional simulation-based approaches utilize a pre-computed reference table D ðSÞ , which is a large data structure containing S pairs (θ, X) of simulation parameters θ and corresponding synthetic observations X [25,26]. This strategy has also been used in machine learning approaches to simulation-based inference [27,28], where the problem of inverse inference resembles a supervised learning task. In the context of OutbreakFlow, the resulting offline learning method is outlined in S1 Algorithm in S1 Text. It is particularly useful when calls to the simulator are computationally expensive: working with recorded synthetic data is then faster at the expense of higher memory demands during training.
Simulation-based online learning. Instead of pre-computing synthetic data, we can generate a potentially limitless number of training pairs (θ, X) on-the-fly. Since each simulation result is discarded after the corresponding backpropagation update, the network never encounters the same inputs twice and overfitting is impossible. Moreover, the training phase can continue as long as the network keeps improving, as measured by continuous performance monitoring. Online training is outlined in S2 Algorithm in S1 Text and is used for all experiments in this work. The present application lends itself to this approach, because the computational cost of running our epidemiological model is negligible, whereas more expensive simulations might become a bottleneck for this strategy.
Simulation-based hybrid learning. Offline and online learning represent two extremes on a continuum of training strategies. Hybrid learning methods combine these two strategies and allow for a more fine-grained allocation of the available simulation budget. For instance, [24] propose a round-based strategy, where each round incorporates its own simulation phase. Thus, the reference table is filled incrementally, and each round can reuse simulations from all previous rounds. Such a round-based training strategy is outlined in S3 Algorithm in S1 Text.

Uncertainty calibration and computational faithfulness
Computational faithfulness refers to the ability of a Bayesian method to recover the correct target posterior in a given modeling scenario. It is an essential precondition for carrying out model-based predictions and interpreting the parameters of a model within a reference theoretical framework. We can estimate the computational faithfulness of any BayesFlow application using simulation-based calibration [29,SBC]. SBC is a diagnostic method which considers the performance of a sampling algorithm over the entire Bayesian joint model p(θ, X), regardless of the structure of the particular model. It leverages the fact that most Bayesian models are generative by construction as well as the self-consistency of the Bayesian joint model. Accordingly, the average posterior distribution over random draws from the Bayesian joint distribution (θ, X) * p(θ, X) should always recover the prior distribution p(θ). In other words, for any given parameter combination θ � , the following should hold: Random draws from p(θ, X) are generated by first sampling a configuration θ from the prior p(θ) and then running the (stochastic) simulator with the sampled parameter configuration to obtain a synthetic outbreak trajectory. This process can be repeated multiple times and does not require an analytically tractable likelihood function. Importantly, if a Bayesian sampling method generates samples from the exact posterior, the equality implied by Eq 14 should hold regardless of the particular form of the posterior. Thus, any violation of this equality indicates some error incurred by the sampling method. The reasons for such an error can be either inaccurate computation of the posterior or an erroneous implementation of the model itself [29].
In practice, we approximate this integral by an ensemble of samples from many posterior distributions estimated from simulated time series with known generating parameters. SBC uses a rank statistic (i.e., the number of posterior draws larger than the prior draw for each simulated time series) to compare the average posterior with the prior. If Eq 14 holds, then the rank statistic of each parameter will be uniformly distributed, allowing us to visually examine the equivalence using univariate histograms. An inspection of the rank histograms thus provides a way to validate the computational faithfulness of OutbreakFlow within the scope of the modeling assumptions [29].
A major disadvantage of SBC is that it can be extremely time-consuming, since it requires inverse inference on potentially thousands of simulated time series. In addition, the obtained posterior draws should exhibit no autocorrelation for SBC to yield reliable results. The latter requirement makes it even more expensive when using MCMC or other non-amortized Bayesian methods yielding dependent samples. Fortunately, amortized inference with Outbreak-Flow alleviates these issues, since inference on multiple time series is extremely efficient and posterior draws are independent given perfectly converged networks [17]. Thus, validating the computational faithfulness of any OutbreakFlow application using SBC becomes a matter of seconds.

Outbreak prediction on the basis of estimated posteriors
The posteriors estimated by a trained OutbreakFlow network can be further used to make forecasts about the future dynamics of the pandemic, provided the parameters remain stationary. Due to changing intervention measures, population behavior, testing policies, and possible treatment advances, this is only true for a relatively short period beyond the observed time series, limiting the prediction horizon to a few weeks. Given observed time series X ≔ x 1:T , the posterior predictive distribution for upcoming data X 0 ≔ x T+1:T 0 is given by: Although this quantity is hard to compute exactly, we can approximate it by running simulations with parameters sampled from the posterior: fy ðmÞ � pðy j XÞg M m¼1 . Since the θ (m) are drawn from the joint posterior, statistical dependencies and correlations between parameters are properly taken into account. The resulting ensemble of M simulated time series fX ðmÞ g M m¼1 can now be used to obtain point predictions (e.g., by computing their mean or median at each future time point) and to quantify the uncertainty of future scenarios (e.g., by computing quantiles or standard deviations at each time point).
In addition to future predictions, posterior predictive re-simulations are also crucial for further model validation. If the estimated posteriors describe the real situation well, re-simulations should replicate the data that served to fit the model in the first place. Mismatches between original and re-simulated time series indicate model misspecification or a simulation gap, that is, errors arising because important aspects of the disease dynamics or unknown corruption of the observed data are not properly represented by the model. These errors are invisible to simulation-based calibration, because it solely assesses whether the posteriors conform to model specifications. Importantly, our OutbreakFlow experiments demonstrate very good agreement between original and re-simulated time series.

OutbreakFlow as a research tool for inferring dynamics of emerging epidemics
In contrast to mainstream neural network applications, such as image or text analysis, analyzing the dynamics of emerging epidemics poses two major challenges: (i) data are usually sparse, with no large sets of training data available; and (ii) a reliable quantification of the uncertainty associated with neural network outputs, such as estimated parameters, is mandatory to allow for reliable subsequent evaluation of possible scenarios.
Standard neural network architectures do not live up to these challenges. Therefore, we developed a novel method on the basis of a neural network architecture called BayesFlow [17] that addresses the aforementioned challenges in two ways: (i) We leverage the epidemiological insight represented by SIR-type models by means of an alternative training procedure using simulated data-simulation-based training; and (ii) we use networks that are specifically designed to perform Bayesian uncertainty quantification over their outputs.
In our framework, a large number of plausible hypothetical scenarios simulating the assumed epidemiological dynamics is processed by the neural network until it becomes an expert in the interpretation of epidemiological observations. After completion of the training phase, the available real-world observations are passed to the network, which then estimates full Bayesian posterior distributions for the real-world parameters of interest. The design of our network architecture is depicted in Fig 1. The details of the method, as well as the model architecture are given in Materials and methods.
The ultimate goal of our approach is comparable to that of traditional simulation-based Bayesian inference methods, such as ABC. However, our method operates much faster and generalizes, without retraining, to any real-world dataset within the scope of its training expertise [17].

Testing and validation of OutbreakFlow
To validate our approach and test its performance in inferring parameter values in epidemiological models, we applied our architecture to a standard SIR-model describing the dynamics of an epidemic. This greatly simplified model is suitable for the initial two weeks of the pandemic and highlights essential properties of our approach. It distinguishes between susceptible, S, infected, I, and recovered, R, individuals with infection and recovery occurring at a constant transmission rate λ and recovery rate μ, respectively. The model is defined by the following system of ODEs: with N = S + I + R denoting the total population number. In addition to the ODE parameters λ and μ, we consider a reporting delay parameter L and a dispersion parameter ψ, which affect the number of reported infected individuals via where I ðnewÞ t ¼ l ðS t I t =NÞ and we assume that the number of newly observed cases arises from a negative binomial distribution [30]. In addition, we estimate the initial number of infected individuals I 0 , so the full parameter vector of interest becomes θ = (λ, μ, L, ψ, I 0 ). Priors over the five parameters are given in S4 Table in S1 Text.
As a first step, we trained our network on simulations from the simple SIR-type model formulated above above and then applied the network to the number of reported cases from the first 14 days of the Covid-19 pandemic in Germany. The results from this initial study are depicted in Fig 3. First, we observe that our posterior estimates are in line with those reported in a previous modeling study [8], which utilized the same data and a similar model. Second, we note that the SBC plots indicate no systematic biases in the approximate posteriors and thus suggest that the posterior samples are trustworthy (assuming no simulation gap). Finally, the posterior predictive check indicates that the model can accurately reproduce the observed data (Fig 3).
Since Outbreak-Flow is especially designed to tackle complex stochastic models whose estimation usually necessitates simulation-based approaches, we compared its performance to an

PLOS COMPUTATIONAL BIOLOGY
OutbreakFlow: Model-based Bayesian inference of disease outbreak dynamics analysis performed by ABC-SMC, a popular approximate Bayesian computation algorithm based on sequential Monte Carlo sampling [31]. Our benchmark comparison reveals converging results. However, OutbreakFlow implies a much sharper marginal posterior of ψ than ABC-SMC, indicating more uncertainty reduction with regard to the dispersion parameter. Since the SBC plots indicate no overconfidence (i.e., no overdispersion) of the OutbreakFlow posteriors, it is likely that the ABC-SMC algorithm yields an underconfident (i.e., underdispersed) marginal posterior with respect to this one parameter. As for wall-clock running times, the ABC-SMC algorithm converged in 4.1 hours, whereas OutbreakFlow trained with 50, 000 iterations using online learning required 4.3 hours on the same laptop machine without parallel simulations. This similar performance speaks in favor of amortized inference, as the training effort already amortizes after as few as two data sets.
To further test if OutbreakFlow is able to provide a reliable quantification of model parameters, we analyzed if the additional consideration of non-identifiable parameters within the model would affect the calibration or predictive performance of our method. To this end, we included 5 dummy variables u j * Uniform(0, 1) within our model that were not used for the generation of the simulated data, but were later included in the unknown parameter vector θ during training and inference.
Performing the same training and inference phase with these 5 additional dummy parameters neither hurts the calibration of OutbreakFlow, nor does it impact inference on the observed time series in a noticeable way, since the posterior estimates for the relevant parameters appear to be unaffected by the dummy parameters (see S1 Text for full results). The same is true for model-based posterior predictions, which underlines the ability of OutbreakFlow to reliably characterize parameter identifiability in case of insufficient data or over-parameterized models.

Inferring epidemiological characteristics from the early Covid-19 pandemic in Germany
After validation of the general applicability of our novel approach, we applied OutbreakFlow to data from the Covid-19 pandemic in Germany, analyzing reported cases (infected, recovered and deceased) in the time period from March 01, 2020 until June 11, 2020. These data captured the early dynamics of the emerging epidemic associated with considerable uncertainty and stochasticity with regard to the number of detected cases, as well as the effect of subsequent public interventions. For our analysis, we used an extended SEIR-type model that had been developed recently and distinguishes between detected and undetected carriers of the disease comprising a total of 34 unknown model parameters (see Fig 2 and Materials and methods for a detailed description of the model) [18]. The model was trained on a time-period from March 01 until May 21 using wide prior distributions across plausible parameter ranges from previous literature [8,22]. The remaining data (three weeks from May 22 until June 11) are then used to assess the predictive value of the model.
The observed and predicted dynamics, as well as the marginal posterior distributions of the individual model parameters are depicted in Fig 4 (see S6 Fig in S1 Text Text for simulationbased calibration). Our model was able to recover the observed dynamics and yields good predictions for the future period, with its forecasts having well-calibrated uncertainty bounds for the newly infected, recovered, and diseased cases (see Fig 4). Despite the large number of unknown model parameters and limited data, our analysis indicated considerable reduction in uncertainty in relation to the prior knowledge for most of the model parameters (Fig 4). Standard point estimates (median, mean, MAP mode) and credibility intervals (95%-CI between the 2.5% and 97.5% quantiles of the corresponding posterior) for all 34 model parameters are given in Table 1.
Interestingly, our parameter estimates (cf. Fig 4) are consistent with previous findings about central disease parameters [8]. With the number of undetected cases being one of the most important estimates to assess epidemic-related dynamics, our network estimates a median probability of remaining undetected (parameter α) of 0.63 with the maximum a posteriori (MAP) estimate at 0.79 (95%-CI [0.07-0.91]). Notwithstanding the large uncertainty surrounding the number of undetected cases, the posterior of α is clearly far from uniform (our prior assumption), and peaks well beyond 0.5 (see Fig 4). This estimate is consistent with the results of representative seroprevalence studies in Germany [32,33], which find that 75% of

PLOS COMPUTATIONAL BIOLOGY
the sero-positive individuals at the end of the first wave had not been diagnosed with the disease before. An even higher fraction of undetected cases around 80% was reported by seroprevalence studies focusing on the hotspot regions of Gangelt, Kupferzell, and Tischenreuth [34][35][36]. Together with our estimated case fatality rate (parameter δ) of 4.1% (median) resp. 3% (MAP), this results in an infection fatality rate of about 0.63% (MAP estimates) or 1.5% (median estimates).
Additionally, our informative priors for the parameters η and γ are not updated by the observed data. Accordingly, around 3.2 days will typically pass before the infection is detected  [37] (around 4 days) and the World Health Organization [38] (5-6 days). Combined with the latent period (i.e., the time in compartment E: median 6.7 days, 95%-CI [3.33-11.1], 1/γ), this leads to a median incubation period until symptom onset or a positive test outcome of around 9.9 days, as originally implied by our priors. If the disease remains undiagnosed throughout (asymptomatic, weakly symptomatic etc.), recovery takes a median number of 4.6 days (95%-CI [2.99-11.11], 1/θ), a time period in which pre-symptomatic and undiagnosed individuals can be responsible for a considerable fraction of infections. Furthermore, if we assume that most infections occur near the end of the carrier stage, that is, after 2.5 days in compartment C, we arrive at a generation time of around 9 days. In conjunction with the delay for reporting infections I of 5.5 days (parameter L I ), this is consistent with the generally acknowledged fact that the success of intervention measures only becomes apparent after around two weeks.
For diagnosed individuals, the median recovery period is estimated at 8.1 days (95%-CI [6.13-10.20], 1/μ). Thus, manifestly ill cases have a more severe disease progression than undiagnosed individuals and typically require 1/η + 1/μ = 3.2 + 8.1 = 11.3 days until recovery [39]. The time between diagnosis and death (6.7 days, 95%-CI [3.12-14.3], parameter 1/d) is shorter than in clinical reports, with parameter identification possibly impaired by the estimated reporting delay for disease-associated deaths D of 11.3 days (parameter L D ), which is probably much longer than in reality. From the available time series alone, the model is not able to distinguish a long critical phase with short reporting delay from rapid death with long reporting delay. Nevertheless, it is remarkable how much information about 34 free parameters our networks can extract from seeing only about 70 time steps of real data (see Fig 4).
Finally, our results corroborate the timing of intervention measures and the gradual reduction in transmission rate as observed in [8]. According to our estimates, the lifting of measures around May 6 would have led to an approximately 40% increase in the transmission rate, as assumed by our prior. However, since the spreading rate at t 4 has already decreased to a median of 0.09 (95%-CI: [0.05-0.15]), the increase to a median of 0.13 (95%-CI [0.05-0.28]) does not lead to an exponential growth of infections.

Testing the robustness of model analysis and inferred dynamics
To test the validity of our results, we performed a series of ablation studies which independently reduce either the architecture of our neural network or the considered epidemiological model. Thereby, we are able to test (i) the importance of specific technical components for parameter inference, as well as (ii) the importance of specific aspects in the model structure when trying to explain the observed dynamics. Accordingly, the implemented changes comprise: 1) removing the convolutional filtering network; 2) removing the recurrent summary network; 3) removing the observation model; 4) removing the intervention model; 5) removing the carrier (C) compartment from the latent disease model. For each of these ablation studies, we used the same number of simulations and training settings as before. We then compared the predictive performance, calibration, and reliability in parameter estimation of each of the modified analyzes to our previous analysis.
The results from all ablation studies are available in S1 Text. Indeed, our experiments indicate that all network and model components are crucial for the performance of OutbreakFlow and for the fidelity of model predictions. Interestingly, no ablation scenario leads to dramatic miscalibration across all marginal posteriors. However, predictive performance markedly deteriorates in all ablation studies. For instance, removing either the intervention or the observation model leads to the rejection of all time series drawn from the posterior predictive due to divergences. Further, removing the latent carrier compartment leads to poor fits and notable underestimation of the reported cases. On the other hand, removing either the filtering or the summary network does not lead to dramatic misfits but prevents the architecture from fully capturing the structure of the observed time series (e.g., weekly modulation is poorly captured, see S1 Text).

Predicting the epidemiological dynamics within the individual German federal states
In the previous section, we demonstrated the ability of our method to recover observed epidemiological dynamics and to infer reliable parameter estimates that determine disease characteristics based on data covering the early Covid-19 epidemic in whole Germany. Arguably, the importance of uncertainty concerning the number of reported cases, as well as stochastic effects increase with lower case numbers. To this end, we also applied our method to each German federal state separately, as individual states are characterized by different population sizes, as well as different onsets and progression of the epidemic. Because data for recovered cases per federal state was lacking (in contrast to recovered cases for the entire Germany), the network was trained solely on the simulated reported infected cases and deaths. Note that we only trained a single OutbreakFlow, which we then applied unchanged to each German federal state. In this way, the training effort amortized over the repeated applications of the same neural estimator (see S13 Fig in S1 Text for simulation-based calibration of the trained network).
Posterior predictions and forecasts for cumulative infections (derived from estimated new cases, see S14 Fig in S1 Text) in each federal state are depicted in Fig 5 (see also S15 Fig in S1 Text for predictions of cumulative deaths). As for Germany as a whole, we observe that median predictions follow very closely the reported cumulative number of cases across all federal states. Furthermore, the reported cases are very well represented by the uncertainty bounds derived from the parameter posteriors, with prediction uncertainty increasing towards the future (i.e., predictions after the dotted vertical lines in Fig 5). However, median predictions can become unreliable when only a few cases are available for model training (see predictions for cumulative deaths for the state Mecklenburg-Western Pomerania, S15 Fig in S1 Text). Therefore, well-calibrated uncertainty estimates are particularly important and need to be taken into account when reporting point predictions.
The posterior distributions of individual model parameters for each of the 16 German federal states are depicted in S1 Text. For comparison of individual estimates between different states, we focused on four latent parameters that are essential for assessing early epidemical dynamics: (i) the probability of infecteds to remain undetected (α), (ii) the recovery rate of undetected (θ), (iii) the number of initially exposed individuals (E 0 ), and (iv) the initial transmission rate (λ 0 ).
We observe that posterior estimates of α across states tend to peak well above 0.5 (see Fig  6a), suggesting once again that many infections have remained undetected/undiagnosed during the initial months of the Covid-19 pandemic in Germany. Furthermore, we observe that some states have smaller probabilities of undiagnosed infections, especially Bavaria, Berlin, and North Rhine-Westphalia (possibly indicating a more successful testing policy), although these estimates are associated with substantial uncertainty. Interestingly, the α posteriors of each state are considerably sharper than the posterior of α for the entire Germany (see S1 Text for full marginal posterior plots for each state). This may be a consequence of the high variability of α across states, but could also be an estimation artifact of the simplified model (only two epidemiological time series, new infections and deaths, are available for each state).
Less variability between states is observed in the estimates for the recovery rate of undetected, θ, suggesting an overall fast recovery of undiagnosed individuals (Fig 6b). However, some differences between states are evident in the estimates of initially exposed individuals, E 0 (cf. Fig 6c), with the states of Bavaria, Berlin, and North Rhine-Westphalia having significantly more exposed individuals at onset than other states, which may correspond to observed outbreaks in these states after skiing holidays and festivals. Finally, Fig 6c depicts a comparison of initial transmission rates λ 0 between states. We observe that estimates vary around a median value of 2.27 across states, with the state Mecklenburg-Western Pomerania having the lowest and the state Baden-Württemberg having the highest median transmission rate at onset. Thus, our analysis is able to reveal differences in the epidemiological dynamics between individual German federal states and proves reliable for different population sizes.

Discussion
In this work, we presented a novel simulation-based Bayesian inference framework for complex epidemiological models. We directly demonstrated the utility of our method by applying it to publicly available data on the early reported infected, recovered, and deceased individuals during the first phase of the Covid-19 pandemic in Germany.

PLOS COMPUTATIONAL BIOLOGY
Given the general uncertainty in reported numbers for emerging infectious diseases, estimation methods need to account for this uncertainty when providing parameter estimates and be able to efficiently incorporate incoming data. Our method works reliably well, providing well calibrated uncertainty bounds for individual parameter estimates even in case of small sample sizes and a limited amount of observations. Given the limited amount of data available for model calibration (i.e., number of reported infections, defined recoveries and deaths) and the uncertainty concerning these values, our analysis is able to reduce the uncertainty about quantities that could not be obtained during the time course of the epidemics.
Our estimates suggest that a large fraction of the infected individuals (60-80%) might have gone undetected through the course of the early Covid-19 outbreak in Germany. This finding has been confirmed by subsequent seroprevalence studies from Germany [32][33][34][35][36] and other countries [40][41][42][43] and is also in line with estimates on reporting rates [44]. However, our posteriors also suggest that there is non-negligible uncertainty surrounding this estimate when derived in a purely model-based manner. Moreover, different summary statistics (e.g., means, medians, MAPs) derived from non-symmetric posteriors offer slightly different conclusions. This observation highlights the need to consider the full posteriors and corresponding credibility intervals when aiming to draw substantive conclusions about epidemiological parameters and possible forecasts for the progression of the epidemic or the effect of specific public health interventions. When interpreting the results of parameter estimates, one should also be aware that mechanistic models like the one used here only describe the average behavior of entire compartments. Accordingly, the given confidence intervals quantify our uncertainty about the inferred parameter averages and can not serve as a measure for the variability between individual cases.
Our approach has two key advantages over standard Bayesian and likelihood-based point estimation methods. First, it can flexibly deal with arbitrarily complex models and data structures, requiring no closed-form likelihoods or ad hoc distributional restrictions regarding the shape of the joint prior or posterior. As standard SIR-models based on (stochastic) ordinary differential equations generally provide a coarse-grained view on the epidemic dynamics [1], more complex models accounting for heterogeneous social interactions, age-dependent effects, and/or spatial and temporal heterogeneity become more and more important to predict the progression of an epidemic or guide intervention measures [8,[45][46][47][48][49]. Such agent-based and stochastic models can be easily incorporated within our neural Bayesian framework.
As a second advantage, the amortized inference property of our method, that is, training the network only once on simulated data, allows efficient posterior sampling and simultaneous application to multiple data sets. In addition, it allows for efficient online-learning and validation (i.e., the continuous integration of upcoming data), once the networks have been trained with sufficient amounts of simulated data. These advantages are important, since they enable researchers to concentrate on formulating, testing, and validating complex model systems without worrying about estimation efficiency or analytical tractability.
Future developments will include Bayesian model comparison, multilevel modeling with hierarchical priors and a systematic comparison between different neural inference architectures. Hierarchical modeling allows us to better distinguish between disease-specific and region-specific parameters by representing the natural cluster structure of epidemiological data. In addition, model comparison is an especially important research avenue to compare different possible disease transmission and progression schemes that could explain the observed dynamics. Currently, in order to compare multiple competing dynamic models, separate neural networks would need to be trained and stored-one network corresponding to each model. Even though such a training can be carried out in parallel, it would be much more efficient to embed all competing models within a single neural architecture, which can perform both prior predictive and posterior predictive model comparison. This idea has already been explored in [50] and future research should investigate its utility for comparing complex epidemiological models.
In addition, tackling non-stationary dynamic processes is an important open problem of our framework that we plan to address in future work. This limitation is one reason why we focus on the early outbreak dynamics, where the assumption of stationarity appears plausible (e.g., absence of major virus mutations, largely fixed testing policy and implied fraction of undetected infections, no significant progress on treatment options and subsequent recovery times), with the exception of government interventions and behavioral changes, which we explicitly model. Moreover, our Bayesian treatment ensures that deviations from stationarity and shortcomings of the reported data do not result in catastrophic failure of our method, but are reflected in wider uncertainty regions than would otherwise have been achievable.
When applying OutbreakFlow, one should be aware of three potential error sources that can distort the outcomes and interpretations of amortized Bayesian workflows. First, model misspecification and data contamination can result in a simulation gap. A simulation gap occurs when the model cannot represent the actual disease dynamics or when data collection is biased or contaminated in ways not accounted for by the model. We addressed these issues by suitable model extensions, which are motivated by theoretical considerations and ablation studies. Remaining misspecifications can often be detected via standard Bayesian model checking methods, for instance, insufficient posterior predictive accuracy, divergent re-simulations, or very low posterior probability under the prior. However, theoretical guarantees on upper bounds for the residual errors remain an important open problem of our approach.
The second source is the Monte Carlo error introduced by approximating the expectation in Eq 3 with only a finite number of simulations. It is also referred to as approximation error and widely acknowledged throughout all Monte Carlo methods. This error can be mitigated relatively easily in our online learning setting, because we can generate a potentially endless stream of synthetic training data until the continuously monitored prediction accuracy is satisfactory. In this respect, simulation-based inference is better positioned to fully utilize the capacity of deep neural networks than traditional supervised learning methods, which rely on a limited supply of annotated real data.
The third source of error is an amortization gap, which refers to potential deficiencies of the network for atypical real dynamics. Atypical dynamics are, by definition, underrepresented in any training set, and an amortized inference scheme may be less accurate in such cases. An amortization gap can be detected via the probabilistic calibration methods advocated in this work, that is, simulation-based calibration. When the amortization gap is too large, resorting to non-amortized methods such as [23,24] might be a viable option. Alternatively, the accuracy of an OutbreakFlow architecture for more rare situations can by improved by increasing the expressiveness of the filtering, summary, and inference networks (e.g., more layers and more units, along with more training iterations). The choice of optimal neural network architectures and training algorithms is a topic of ongoing research in the deep learning community and an important target for future work on simulation-based inference.
In summary, our OutbreakFlow architecture provides a general inference framework for complex epidemiological scenarios and enables efficient simulation-based inference for key epidemiological parameters. We therefore believe that our proposed architecture can facilitate uncertainty-aware inference with complex and realistic epidemiological models, especially during the early phase of epidemic outbreaks when information is scarce and data reliability low. Rapid trustworthy inference can timely reveal crucial dynamic aspects of a spreading disease and inform effective public health interventions.