Efficient Bayesian inference for stochastic agent-based models

The modelling of many real-world problems relies on computationally heavy simulations of randomly interacting individuals or agents. However, the values of the parameters that underlie the interactions between agents are typically poorly known, and hence they need to be inferred from macroscopic observations of the system. Since statistical inference rests on repeated simulations to sample the parameter space, the high computational expense of these simulations can become a stumbling block. In this paper, we compare two ways to mitigate this issue in a Bayesian setting through the use of machine learning methods: one approach is to construct lightweight surrogate models to replace the simulations used in inference. Alternatively, one might altogether circumvent the need for Bayesian sampling schemes and directly estimate the posterior distribution. We focus on stochastic simulations that track autonomous agents and present two case studies: tumour growth and the spread of infectious diseases. We demonstrate that good accuracy in inference can be achieved with a relatively small number of simulations, making our machine learning approaches orders of magnitude faster than classical simulation-based methods that rely on sampling the parameter space. However, we find that while some methods generally produce more robust results than others, no algorithm offers a one-size-fits-all solution when attempting to infer model parameters from observations. Instead, one must choose the inference technique with the specific real-world application in mind. The stochastic nature of the considered real-world phenomena poses an additional challenge that can become insurmountable for some approaches. Overall, we find machine learning approaches that create direct inference machines to be promising for real-world applications. We present our findings as general guidelines for modelling practitioners.

with probability 1 − p, or into a GSC and a GPP with probability p, where p denotes the GSC differentiation probability that we vary as an input parameter below. A GPP, on the other hand, might divide into two GPPs or produce one GDS with equal probability. GDSs do not divide.
The cells inhabit an environment that contains nutrients. Initially, these nutrients are evenly distributed across the lattice. However, we have placed a nutrient source in the lower left-hand corner of the lattice, and the nutrients are locally consumed by the cells as the tumour grows. Gradients in the nutrient concentration (n) thus arise over time. The resulting diffusion of nutrients across the lattice is modelled using a reaction-diffusion equation of the form [15][16][17]

dn/dt = D∇²n − λ_C ρ + S.    (1)
Here, D denotes the diffusion coefficient for the nutrient flow, λ_C scales the nutrient consumption of the cells, and ρ denotes the tumour cell concentration, i.e. we assume that all cell types consume nutrients at the same rate. Finally, S signifies any nutrient sources, i.e. the aforementioned inflow of nutrients in the lower left-hand corner. At each time step, we solve Equation (1) using the Python package FiPy [18]. Note that Equation (1) is solved on a Cartesian grid, so at each time step, interpolation is required between this grid and the irregular lattice on which the cells live. We illustrate the described set-up in S2 Fig. We assume that a sufficiently low nutrient concentration inflicts immediate cell death [19]. Such low concentrations might occur at the centre of the tumour, leading to a necrotic core, a morphology that is also observed in real tumours.
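To make the nutrient dynamics concrete, the update can be sketched with a simple explicit finite-difference scheme. This is a toy stand-in for the FiPy solve used in the paper; the grid size, coefficient values, and helper name below are illustrative assumptions, not the authors' implementation.

```python
# Forward-Euler sketch of the nutrient equation
# dn/dt = D * laplacian(n) - lambda_C * rho + S on a small Cartesian grid.

def step_nutrients(n, rho, S, D=5.0, lam_C=0.5, dt=0.01, h=1.0):
    """One explicit update of the nutrient field n (list of lists);
    indices are clamped at the edges (approximate zero-flux boundary)."""
    N = len(n)
    new = [row[:] for row in n]
    for i in range(N):
        for j in range(N):
            up, down = n[max(i - 1, 0)][j], n[min(i + 1, N - 1)][j]
            left, right = n[i][max(j - 1, 0)], n[i][min(j + 1, N - 1)]
            lap = (up + down + left + right - 4.0 * n[i][j]) / h ** 2
            new[i][j] = n[i][j] + dt * (D * lap - lam_C * rho[i][j] + S[i][j])
    return new

# Toy 5x5 set-up: uniform nutrients, one consuming cell at the centre,
# and a nutrient source in the lower left-hand corner.
N = 5
n = [[1.0] * N for _ in range(N)]
rho = [[0.0] * N for _ in range(N)]; rho[2][2] = 1.0
S = [[0.0] * N for _ in range(N)]; S[4][0] = 500.0
for _ in range(10):
    n = step_nutrients(n, rho, S)
```

After a few steps, nutrients deplete around the consuming cell while a gradient spreads from the corner source, mirroring the behaviour described above.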
For glioblastoma multiforme, the migration of tumour cells is known to occur primarily along white matter tracts and blood vessels [20][21][22]. One might take this behaviour into account by letting the stochastic rules that govern cell actions vary across the lattice. In addition, one could allow for mutations that alter these rules. However, for our purposes, simple rules for the cell dynamics suffice. Our primary focus lies in extracting general guidelines for stochastic models rather than capturing all aspects of this specific brain cancer type.

Parameters and summary statistics
We initialise each simulation with a single GSC near the centre of the lattice and vary four input parameters (Θ):
• Our first parameter is the probability, P_div, for GSCs with a free neighbouring lattice site to divide during each time step. GPPs are assumed to divide at 0.6 times this rate, while GDSs do not divide. Note that these rules are chosen for their simplicity rather than for any biological justification.
• The second parameter is the probability, P_cd, for GPPs to undergo programmed cell death during each time step. GDSs are assumed to be three times more likely to die at each time step, while GSCs do not die spontaneously. Again, these rules are chosen for their simplicity. Note that cells die irrespective of their cell type if the nutrient concentration falls below a critical level.
• Our third parameter is λ_C in Equation (1).
• Finally, we include the probability for GSCs to differentiate into GPPs during cell division.
Note that λ_C is the only parameter that is not a probability. However, we choose the values of the remaining parameters (D and S) in Equation (1) such that interesting scenarios arise when λ_C is varied between 0.0 and 1.0. While this choice is not biologically motivated, it means that all parameters can be treated in the same way, which provides an interesting contrast to the epidemiological SIRDS model presented in the section Epidemic, SIRDS model. We note that our cancer CA model has many more parameters than the ones we vary in our study. Among others, these include D, S, and the critical nutrient level below which immediate cell death due to starvation occurs. We set these to 5 units of area per time step, 500 per time step at the lower-left corner, and 0.1 per unit area, respectively. We also do not vary the migration probability of the different cell types: for GSCs and GDSs, we set the probability for cell migration to 0.02, while GPPs migrate with a probability of 0.5. In line with the "Go or Grow" hypothesis, however, the migration probability of any individual cell is temporarily set to zero if the cell has undergone cell division within the same time step.
To decide on the summary statistics, we must consider the type of data that one might encounter in a cancer study. In an animal-based experiment, we might sacrifice the animal to study its brain. Alternatively, we might deal with magnetic resonance imaging (MRI) or in vivo imaging system (IVIS) data. In any case, we would know the time since the injection of the tumour cells into the animal. We hence assume that our data will comprise a detailed snapshot of the tumour rather than a time series for tumour growth. In the best-case scenario, we might have access to spatial transcriptomics (ST) data that enables us to distinguish different cell types and resolve single cells.
The raw cell count already provides extensive information about Θ. The reason for this is twofold. First, since we do not vary the initial conditions, variations in Θ do not imply significant spatial variation that might affect tumour growth. Secondly, we do not intend to infer spatial properties, such as varying initial conditions or the impact of cell migration. Based on these considerations, we settle for the following five summary statistics:
• The number of alive glioblastoma stem-like cells (GSC) at t = 100.
• The number of alive propagating glioblastoma cells (GPP) at t = 100.
• The number of alive differentiating glioblastoma cells (GDS) at t = 100.
• The number of dead tumour cells at t = 100.
• The total number of tumour cells in the proliferating rim at t = 100.
As discussed above, we define the proliferating rim as those cells that lie within 8 lattice sites of a free lattice site. Note that the maximal width of the proliferating rim (d_rim), i.e. here 8 lattice sites, is yet another parameter that we keep fixed in our analysis. Also, note that the probability for cell division is set to decrease for cells that are embedded within the proliferating rim, reflecting the work needed to push other tumour cells aside. For a tumour cell at distance d from the nearest free lattice site, the probability for cell division is thus multiplied by a factor of exp(−d/(2 d_rim)) for d < d_rim.
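In code, this rim suppression might look as follows. The function name is ours, and the treatment of cells deeper than d_rim (no division at all) is our assumption, since such cells lie outside the proliferating rim as defined above.

```python
import math

def suppressed_division_probability(p_div, d, d_rim=8):
    """Scale a cell's division probability by exp(-d / (2 * d_rim)),
    where d is the distance to the nearest free lattice site.
    Cells outside the rim (d >= d_rim) are assumed not to divide."""
    if d >= d_rim:
        return 0.0
    return p_div * math.exp(-d / (2 * d_rim))
```

A cell adjacent to a free site (d = 1) thus divides at roughly 94% of the nominal rate, while a cell at d = 7 divides at about 65% of it.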

Epidemic, SIRDS model
Agent-based models are in widespread use within the field of mathematical epidemiology. They have thus been applied to a broad variety of data, including studies of the SARS-CoV-2 pandemic [23,24]. The considered agent-based models for infectious diseases fall into the category of compartmental epidemiological models [25]. Such models, at the very least, distinguish between the infected (I) and susceptible (S) individuals or subpopulations. In addition, many diseases lead to immunity, and some infections might be fatal, making it essential to track recovered (R) individuals as well as the death toll (D). To indicate that our model takes these four categories into account, we refer to it as a SIRDS model. By repeating the reference to the susceptible individuals (S), we express that recovered individuals might become susceptible to the disease again after a certain period of immunity.
One might model the agents as a set of Brownian particles that move in a two-dimensional plane. Much can already be learned from such an approach, and the correct macroscopic behaviour can be reproduced [26]. This being said, human interactions are governed by rules that reflect socio-demographic structures. In addition to random encounters, people thus tend to have predictable encounters within a certain social group. Regular social interactions with the same individuals take place within households, at educational facilities or in workplaces. Agent-based models present an opportunity to include such social networks [24,[27][28][29][30].
In this paper, we consider agents on a square lattice (see [31][32][33][34][35] for other examples of lattice-based models). In addition to having N_re random encounters, each agent interacts with the agents at the eight neighbouring lattice points on a daily basis. Leaving random encounters aside, the agents do not move throughout the simulation, i.e. the social networks are rigid. The distance between agents thus expresses social relationships rather than geographical proximity. In total, there are 12,100 (110 × 110) available lattice sites, of which only 10,000 randomly selected sites are occupied, to account for the variability of social groups. A susceptible agent that encounters an infected agent will be infected with a probability P_i (for each day of exposure). Initially, three infected individuals are randomly scattered across the lattice. An infected individual will recover or die after a number of days drawn from an exponential distribution with mean t_i; the agent dies with probability P_d. If the agent recovers, it will be protected from infection for a number of days drawn from an exponential distribution with mean t_p, after which it re-enters the population of susceptible individuals (see also S3 Fig).
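The daily contact rules can be sketched as below. This is a minimal illustration, not the authors' implementation: the names and state encoding are ours, N_re is treated as an integer count, and infection timers as well as the dt = 0.1 day sub-steps are omitted.

```python
import random

# States: 'S', 'I', 'R', 'D'; `grid` maps (row, col) -> state for
# occupied sites only.
NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]

def one_day(grid, P_i, N_re, rng=random):
    """Infect susceptible agents via their eight lattice neighbours plus
    N_re random encounters, each infected contact infecting with
    probability P_i; returns the number of new infections."""
    agents = list(grid)
    newly_infected = set()
    for (r, c) in agents:
        if grid[(r, c)] != 'S':
            continue
        # contacts: occupied neighbouring sites plus random encounters
        contacts = [(r + dr, c + dc) for dr, dc in NEIGHBOURS
                    if (r + dr, c + dc) in grid]
        contacts += rng.choices(agents, k=N_re)
        if any(grid[x] == 'I' and rng.random() < P_i for x in contacts):
            newly_infected.add((r, c))
    for a in newly_infected:
        grid[a] = 'I'
    return len(newly_infected)
```

With P_i = 1.0, every susceptible agent exposed to an infected neighbour is infected, which makes the sketch easy to sanity-check.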
We stress that our epidemiological SIRDS model is rather simplistic. For instance, factors such as an individual's socioeconomic group, sex, or age might play a role in mortality and infection rates. Many mathematical models thus build on census data to draw a more realistic picture [36,37]. We do not take such information into account. However, the considered level of complexity is sufficient for our purposes. Rather than modelling a specific disease, our focus lies on the stochasticity that all epidemiological agent-based models share, irrespective of their level of detail.

Parameters and summary statistics
In the case of our SIRDS model, we vary five input parameters (Θ):
• The first parameter is the infection probability, P_i, per day of contact with an infected individual. Since P_i denotes a probability, it can take values between 0.0 and 1.0.
• Our second parameter is the mortality rate, P_d, of the disease. Again, this parameter can take values between 0.0 and 1.0.
• As a third parameter, we include the time, t_i, during which an infected person can transmit the disease. We evolve the system in time steps, dt, of 0.1 days, with an agent becoming contagious from the time step that follows the infection. For the model grids presented in the section Model grid, we vary this parameter on a linear scale from 1.0 to 30.0 days.
• Our fourth parameter is the time, t_p, during which a recovered individual is protected from getting the disease. We vary this parameter on a logarithmic scale between -2.0 and 4.0 when constructing the grids presented in the section Model grid. At low values of t_p, the individual does not build up any immunity. Such models are appropriate for some bacterial infections, including meningitis. At high values of t_p, the disease effectively confers lifelong immunity, as in the case of some viral diseases, such as measles [38].
• Finally, we vary the number, N_re, of random encounters with the general population. For the grids in the section Model grid, we use a logarithmic scale from -2.0 to 2.0. The model is hereby able to account for, for instance, social distancing and lockdown measures.
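To make the mixed scales concrete, drawing one parameter set Θ might look like the sketch below. Uniform and log-uniform sampling is our assumption for illustration; the text only specifies the ranges and scales.

```python
import random

def draw_theta(rng=random):
    """Draw one SIRDS parameter set: probabilities on [0, 1], t_i linear
    in [1, 30] days, t_p and N_re log-uniform with the exponent ranges
    quoted above ([-2, 4] and [-2, 2])."""
    return {
        "P_i":  rng.uniform(0.0, 1.0),
        "P_d":  rng.uniform(0.0, 1.0),
        "t_i":  rng.uniform(1.0, 30.0),
        "t_p":  10.0 ** rng.uniform(-2.0, 4.0),
        "N_re": 10.0 ** rng.uniform(-2.0, 2.0),
    }
```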
In contrast to the brain cancer model, we hence include parameters on different scales. Rather than being interested in the spatial properties of our specific model, we aim to gain insights into the general properties of agent-based models for infectious diseases. When choosing our summary statistics, we therefore turn to the emergent properties of the simulations. These properties comprise the time evolution of the different subpopulations, i.e. S(t), I(t), R(t), and D(t). To facilitate an easy comparison with our cancer model, we distil these time series into five summary statistics:
• Our first measure is the duration of the outbreak. Note that the computation is interrupted if the duration exceeds 1825 days.
• Secondly, we include the total death toll at the end of the outbreak.
• The third measure is the largest number of infected individuals during the outbreak. Note that we only record the largest peak, although the epidemic might lead to several waves.
• Fourthly, we include the time at which the largest peak in infection numbers occurs. To motivate this choice, let us assume that the first peak in infection numbers is the largest. The ratio between the time at which it occurs and the total duration of the epidemic would then hint at the existence of multiple waves and help to constrain t_p.
• Finally, we include the largest number of recovered agents during the epidemic. In contrast, the number of recovered people at the end of the epidemic is not a suitable measure because we include scenarios where the disease does not confer immunity.
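Distilling a simulated outbreak into the five statistics above can be sketched as follows; the helper and the per-day lists I, R, D are hypothetical names of ours, with time measured in recorded days.

```python
def summarise_outbreak(I, R, D):
    """Map daily counts of infected (I), recovered (R), and dead (D)
    agents onto the five summary statistics listed above."""
    duration = len(I)                # outbreak length in recorded days
    death_toll = D[-1]               # cumulative deaths at the end
    peak_I = max(I)                  # height of the largest infection peak
    t_peak = I.index(peak_I)         # day of the (first) largest peak
    peak_R = max(R)                  # largest number of recovered agents
    return duration, death_toll, peak_I, t_peak, peak_R
```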

The Schlögl Model
The Schlögl model [39][40][41] describes a chemical reaction network containing three species, X, A and B, which are coupled by the following reactions:

A + 2X → 3X (rate constant k_1),
3X → A + 2X (rate constant k_2),
B → X (rate constant k_3),
X → B (rate constant k_4).

We are dealing with a (discrete-state, continuous-time) stochastic process. In contrast to our brain tumour and epidemic models, the Schlögl model is neither spatially resolved nor does it keep track of individual agents to describe the chemical reaction system. The parameters k_1 to k_4 enter the propensity functions, i.e. the probability per unit time that each of the four reactions takes place, given the current numbers of molecules of the species. We can compute the time evolution of the system using the Gillespie algorithm. In short, this algorithm uses a Monte Carlo method to compute the time until the next reaction. It then updates the system by randomly choosing one of the reactions based on the propensity functions. For more details, we refer the reader to Ilie et al. (2019) [41]. The advantage of stochastic over deterministic modelling of reaction networks lies in the fact that it correctly captures the probability distributions of the different species whenever the number of participating molecules is finite. For instance, for some combinations of k_1 to k_4, the Schlögl model famously exhibits bi-stability. As discussed below, however, bi-stabilities did not occur in any of our synthetic data.
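The Gillespie procedure outlined above can be sketched for this network as follows. The function name, the toy rate constants, and the reduced reaction cap are ours for illustration; constant combinatorial prefactors in the mass-action propensities are assumed to be absorbed into k_1 to k_4.

```python
import random

# Minimal Gillespie (direct-method) sketch for the Schloegl network:
#   A + 2X -> 3X (k1),  3X -> A + 2X (k2),  B -> X (k3),  X -> B (k4).
# A and B are tracked explicitly rather than held constant.

def gillespie_schlogl(k, X=100, A=5000, B=5000, max_reactions=1000,
                      rng=random):
    k1, k2, k3, k4 = k
    t = 0.0
    for _ in range(max_reactions):
        # propensity of each reaction at the current state
        a = [k1 * A * X * (X - 1),
             k2 * X * (X - 1) * (X - 2),
             k3 * B,
             k4 * X]
        a0 = sum(a)
        if a0 == 0.0:
            break                          # no reaction can fire
        t += rng.expovariate(a0)           # waiting time to next reaction
        r = rng.random() * a0              # choose a reaction ~ propensity
        if r < a[0]:
            A -= 1; X += 1                 # A + 2X -> 3X
        elif r < a[0] + a[1]:
            A += 1; X -= 1                 # 3X -> A + 2X
        elif r < a[0] + a[1] + a[2]:
            B -= 1; X += 1                 # B -> X
        else:
            B += 1; X -= 1                 # X -> B
    return X, A, B, t

X, A, B, T = gillespie_schlogl((1e-7, 1e-7, 1e-4, 1e-4),
                               rng=random.Random(1))
```

Because every reaction moves one molecule between X and either A or B, the total molecule count is conserved, which is a convenient sanity check on the update rules.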
To study the Schlögl model, we constructed a grid containing 10,000 models, varying k_1 to k_4 between 10^−8 and 10^−1 and sampling the parameter space in a quasi-random manner. We initialised all systems with X, A and B at 100, 5,000 and 5,000 molecules, respectively. The computation of each reaction network was terminated after 50,000 reactions had taken place, which means that the final state of the system varies in four variables across the grid: the three species (X, A, B) and the time T at which the last computed reaction took place. We hence do not assume that the populations of A and B are constant, as is commonly done.
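One way to realise such quasi-random sampling is a low-discrepancy sequence. The sketch below uses a Halton sequence with log-uniform scaling over the quoted range; both the choice of sequence and the log-uniform scaling are our assumptions, as the text only states the sampling was quasi-random.

```python
def halton(i, base):
    """i-th element (i >= 1) of the Halton low-discrepancy sequence."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def draw_rate_constants(i):
    """Quasi-random draw of (k1, k2, k3, k4), log-uniform in [1e-8, 1e-1],
    using one coprime Halton base per parameter."""
    return tuple(10.0 ** (-8.0 + 7.0 * halton(i, b)) for b in (2, 3, 5, 7))
```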
We used the grid to train two emulators: a neural network (NN) and Gaussian processes (GP). We then tested the performance of these emulators by using them to predict the output for 250 previously unseen sets of parameter values. For comparison, we ran the simulations 250 times for each of these 250 parameter sets to compute the real distributions of X, A, B, and T against which to compare. The results are summarised in S4 Fig. We note that bi-stabilities do not occur in our synthetic data sets, i.e. the distributions of the different chemical species are seemingly uni-modal for all the 250 investigated quasi-randomly drawn parameter sets that constitute our synthetic data.
We find that both emulators predict the correct median of the variables (X, A, B, and T) with high accuracy. For nearly all the 250 test cases, the median is predicted with an accuracy that is better than an order of magnitude. However, in this case study, both emulators overestimate the stochastic nature of the chemical networks, which would affect any subsequent inference scheme. Once again, the NN outperforms the simple GP, even though both emulators predict the median similarly well and would thus score similarly well on many classical performance metrics, such as the coefficient of determination (R²). This underlines one of our main statements: it is essential to explore, and potentially discard, several machine learning methods and metrics to tailor the emulation to the individual scientific case study.
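For reference, the coefficient of determination compares squared residuals against the spread of the data; a minimal implementation:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because R² only scores how well central tendencies are matched, two emulators that predict the medians equally well can score alike even if one badly misjudges the spread, which is precisely the failure mode described above.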
In addition, we have trained two direct inference machines on the grid of simulations: a neural network and Gaussian processes. We find that both direct inference machines can recover k_1 and k_2 with high accuracy. The NN yields much narrower posterior distributions than the GP, whereby it outperforms the GP in terms of −log10(q). We summarise these results in S5 Fig. However, as regards k_3 and k_4, both the NN and the GP systematically yield parameter estimates that lie close to the centre of the corresponding parameter intervals, irrespective of the synthetic data. Picking a value close to the centre of the interval is a good choice if one were to guess blindly. At first glance, neither approach hence does a good job of pinning down k_3 and k_4, which is reflected in our performance metrics in S5 Fig. To understand why the direct inference machines perform better for k_1 and k_2 than for k_3 and k_4, it is useful to take a look at how the different variables behave across the parameter space. We thus summarise the behaviour of X and T in S6 Fig. As can be seen from the figure, clear patterns occur in the plane spanned by k_1 and k_2, suggesting that the corresponding reactions dominate the network. However, there seems to be little information about X and T in the plane spanned by k_3 and k_4. Likewise, the contour plots of A and B are noisy in the (k_3, k_4)-plane. Correspondingly, measurements of X, A, B and T reveal little about these two parameters, and it is thus expected that k_3 and k_4 cannot be inferred with high accuracy and precision. Taking this into account, we note that the direct inference machines actually do very well for all four parameters at a low computational cost.