A wall-time minimizing parallelization strategy for approximate Bayesian computation

Approximate Bayesian Computation (ABC) is a widely applicable and popular approach to estimating unknown parameters of mechanistic models. As ABC analyses are computationally expensive, parallelization on high-performance infrastructure is often necessary. However, the existing parallelization strategies leave computing resources unused at times and thus do not optimally leverage them yet. We present look-ahead scheduling, a wall-time minimizing parallelization strategy for ABC Sequential Monte Carlo algorithms, which avoids idle times of computing units by preemptive sampling of subsequent generations. This allows to utilize all available resources. The strategy can be integrated with e.g. adaptive distance function and summary statistic selection schemes, which is essential in practice. Our key contribution is the theoretical assessment of the strategy of preemptive sampling and the proof of unbiasedness. Complementary, we provide an implementation and evaluate the strategy on different problems and numbers of parallel cores, showing speed-ups of typically 10-20% and up to 50% compared to the best established approach, with some variability. Thus, the proposed strategy allows to improve the cost and run-time efficiency of ABC methods on high-performance infrastructure.


Introduction
The ultimate goal of systems biology is a holistic understanding of biological systems [1,2].Mechanistic models are important tools towards this aim, allowing to describe and understand underlying mechanisms [3].Usually, such models have unknown parameters that need to be estimated by comparing model outputs to observed data [4].For complex stochastic models, in particular multi-scale models used to describe the complex dynamics of multi-cellular systems, evaluating the likelihood of data given parameters however becomes quickly computationally infeasible [5,6].For this reason, simulation-based methods that circumvent likelihood evaluation have been developed, such as approximate Bayesian computation (ABC), popular for its simplicity and wide applicability [7,8].
ABC generates samples from an approximation to the true Bayesian posterior distribution.While asymptotically exact, a known disadvantage of ABC is its reliance on repeated simulation, often hundred thousands to millions of times.Therefore, methods to efficiently explore the search space have been developed [9].In particular, ABC is frequently combined with a Sequential Monte Carlo scheme (ABC-SMC), which over several generations successively refines the posterior approximation via importance sampling while maintaining high acceptance rates [10,11].Furthermore, in ABC-SMC the sampling for each generation can be parallelized, enabling the use of high-performance computing (HPC) infrastructure.This has in recent years enabled tackling increasingly complex problems via ABC [12,13,14,15].
It would be desirable if available computational resources were perfectly exploited at all times, to minimize both the wall-time until results become available to the researcher, and the cost associated with allocated resources.However, the problem is that established parallelization strategies to distribute ABC-SMC work over a set of workers leave resources idle at times and thus fall short of this aim.The parallelization strategy used in most established HPC-ready ABC implementations is static scheduling (STAT), which defines exactly as many parallel tasks as accepted particles are required [16,17].While it minimizes the active compute time and consumed energy, typically a substantial amount of workers become idle towards the end of each generation.Dynamic scheduling (DYN) mitigates this problem and reduces the overall wall-time by continuing sampling on all workers until sufficiently many particles have been accepted [18].It was shown to reduce the walltime substantially.However, also in this strategy at the end of each generation workers become idle, waiting until all simulations have finished.
In this manuscript, we present a novel ABC-SMC parallelization strategy for multi-core and distributed systems, called look-ahead scheduling (LA).The strategy builds upon dynamic scheduling, but in addition, as workers get idle at the end of each generation, preemptively formulates tentative sampling tasks for the next generation.By this, it makes use of all available workers at almost all times.We show that by appropriate sample reweighting we obtain an unbiased Monte Carlo sample.We provide an HPC-ready implementation and test the method on various problems, demonstrating both efficiency and accuracy.Moreover, we show that the strategy can be integrated with adaptive algorithms for e.g.summary statistics, distance functions, or acceptance thresholds.

ABC
We consider a mechanistic model described via a generative process of simulating data y ∼ π(y|θ) ∈ R ny for parameters θ ∈ R n θ .Given observed data y obs , in Bayesian inference the likelihood π(y obs |θ) is combined with prior information π(θ) to the posterior distribution π(θ|y obs ) ∝ π(y obs |θ) • π(θ).We assume that evaluating the likelihood is computationally infeasible, but that it is possible to simulate data y ∼ π(y|θ) from the model.Then, classical ABC consists in the 3 steps of first sampling parameters θ ∼ π(θ), second simulating data y ∼ π(y|θ), and third accepting θ if d(y, y obs ) ≤ ε, for a distance metric d : R ny × R ny → R ≥0 and acceptance threshold ε > 0. This is repeated until sufficiently many particles, say N , are accepted.The population of accepted particles constitutes a sample from an approximation of the posterior distribution, π ABC,ε (θ|y obs ) ∝ I[d(y, y obs ) ≤ ε]π(y|θ) dy •π(θ). ( for t = 1, . . ., n t do set g t and ε t while less than N acceptances do sample parameter θ ∼ g t (θ) simulate data y ∼ p(y|θ) accept θ if d(y, y obs ) ≤ ε t compute weights w i t = π(θ i t ) gt(θ i t ) , for accepted parameters {θ i t } i≤N Algorithm 1: ABC-SMC algorithm.
Under mild assumptions, π ABC,ε (θ|y obs ) converges to the actual posterior π(θ|y obs ) as ε → 0 [19,20].Commonly, ABC operates not directly on the measured data, but summary statistics thereof, capturing relevant information in a low-dimensional representation [21].Here, for notational simplicity we assume that y already incorporates summary statistics, if applicable.

ABC-SMC
The vanilla ABC formulation exhibits a trade-off between reducing the approximation error induced by ε, and high acceptance rates.Thus, ABC is frequently combined with a Sequential Monte Carlo scheme (ABC-SMC) [22,23].In ABC-SMC, a series of particle populations P t = {(θ i t , w i t )} i≤N is generated, constituting samples of successively better approximations π ABC,εt (θ|y obs ) of the posterior, for generations t = 1, . . ., n t , with acceptance thresholds ε t > ε t+1 .In the first generation (t = 1), particles are sampled directly from the prior, g 1 (θ) = π(θ).In later generations (t > 1), particles are sampled from proposal distributions g t (θ) π(θ) based on the last generation's accepted weighted population P t−1 , e.g.via a kernel density estimate.The importance weights w i t are the Radon-Nikodym derivatives w t (θ) = π(θ)/g t (θ).This is precisely such that the weighted parameters are samples from the distribution w t (θ)I[d(y, y obs ) ≤ ε t ]π(y|θ) dy •g t (θ) = I[d(y, y obs ) ≤ ε t ]π(y|θ) dy •π(θ), i.e. the target distribution (1) for ε = ε t .Common proposal distributions first select an accepted parameter from the last generation and then perturb it, in which case g t takes the form g t (θ) = N i=1 w i t−1 K(θ|θ i t−1 )/ N i=1 w i t−1 , with e.g.K(θ|θ i t−1 ) = N (θ|θ i t−1 , Σ t−1 ) a normal distribution with mean θ i t−1 and covariance matrix Σ t−1 .The performance of ABC-SMC algorithms relies heavily on the quality of the proposal distribution, on its ability to efficiently explore the parameter space.Methods that adapt to the problem structure, e.g.basing Σ t−1 on the previous generation's weighted sample covariance matrix and potentially localizing around θ i , have shown superior [24,25,26].The steps of ABC-SMC are summarized in Algorithm 1.
The output of ABC-SMC is a population of weighted parameters For a statistic f : R n θ → R, the expected value under the posterior is then approximated via the self-normalized importance estimator which is asymptotically unbiased.Here, W i t := w i t / N j=1 w j t are self-normalized weights.This is necessary because the weights w t (θ) = π(θ)/g t (θ) are not normalized in the joint sample space (θ, y), therefore effectively another Monte Carlo estimator is employed for the normalization constant (for details see the Supplementary Information, Section 1.1).
In importance sampling, samples are assigned different weights, such that some impact estimates more than others.This can be quantified e.g.via the effective sample size (ESS) [27,9]:

Established parallelization strategies
In ABC, often hundred thousands to millions of model simulations need to be performed, which is typically the computationally critical part.To speed up inference, parallelization strategies have been developed that exploit the independence of the N particles constituting the t-th population.Suppose we have W parallel workers, each worker being a computational processor unit e.g. in an HPC environment.There are two established techniques to parallelize execution over the workers: In static scheduling (STAT), given a population size N , N tasks are defined and distributed over the workers.Each task consists in sampling until one particle gets accepted (Figure 1A).The tasks are queued if N ≥ W . STAT minimizes the active computation time and number of simulations and is easy to implement, only requiring basic pooling routines available in most distributed computation frameworks.However, even for W > N only N workers are employed, although the number of required simulations is usually substantially larger than N .In addition, at the end of  As soon as no more simulations are required for generation t (green), a preliminary simulation task for generation t + 1 is formulated either based on generation t (dark purple) or t+1 (light purple).Resulting simulations are considered when evaluating the next generation (t = t + 1), and suitable weight normalization is applied to all samples (top right).Over time, the number of workers dedicated to generation t decreases, while that for generation t + 1 increases (bottom).every generation the number of active workers decreases successively, most workers idly waiting for a few to finish their tasks.STAT is available in most established ABC-SMC implementations [17,16].
In dynamic scheduling (DYN), sampling is performed continuously on all available workers until N particles have been accepted (Figure 1B).However, simply taking those first N particles as the final population would bias the population towards parameters with short-running simulations.Instead, in DYN all workers are waited for to finish, and out of the then Ñ ≥ N accepted particles, only the N that started earliest are considered accepted [18].This ensures that acceptance probability of a particle is in accordance with the target distribution, independent of later events and thus its run-time.

Parallelization using look-ahead dynamic scheduling
DYN allows to exploit the available parallel infrastructure to a higher degree than STAT and therefore already substantially decreases the wall-time (by a factor of between 1.4 and 5.3 in test scenarios, see [18]).Nonetheless, some workers remain idle at the end of each generation while waiting for others to complete.This fraction increases as the number of workers increases relatively to the population size.Additionally, the idle time increases if simulation times are heterogeneous, which is often the case, e.g. with estimated reaction rates determining the number of simulated events (Supplementary Information, Section 3.6).In case of fast model simulations, also the time between generations, e.g. to post-process and store results, may be relatively long.

Proposed algorithm
We propose to extend dynamic scheduling by using the free workers at the end of each generation to proactively sample for the next generation (Figure 2): As soon as N acceptances have been reached in generation t − 1 and workers thus start to get idle, we construct a preliminary proposal gt , based on which particles for generation t are generated and simulations performed on the free workers.gt can be based on a preliminary population of accepted particles Pt−1 = {( θi t−1 , ŵi t−1 )} i≤N based on these first N acceptances.However, Pt−1 may introduce a practical bias (in a finite sample sense) towards particles with faster simulations times.This can in particular occur when computation time is highly parameter-dependent.To address this issue, the preliminary proposal can alternatively be based on P t−2 (such that gt = g t−1 ), giving inductively practically unbiased proposals.If a particle θt ∼ gt gets accepted according to the acceptance criteria of generation t, its non-normalized weight is calculated as wt ( θt ) = π( θt) gt ( θt) .As soon as all simulations for generation t−1 have finished and thus the actual P t−1 is available, all workers are updated to continue working with the actual sampling task based on proposal g t .As the time-critical part of typical ABC applications is the model simulation, the cost of generating the preliminary sampling task is usually negligible.
The assessment of acceptance of preliminary samples depends on whether everything is predefined: If the acceptance components, including distance function d and acceptance threshold ε t for generation t are defined a-priori, then acceptance can be checked directly on the workers without knowledge of the complete previous population P t−1 .If however any component of the algorithm is adaptive and hence based on P t−1 (e.g. the acceptance threshold is commonly chosen as a quantile of {d(y i t−1 , y obs )} i≤N ), the acceptance check must be delayed until the actual P t−1 is available.This serves to use one common acceptance criterion across all particles within a generation, so that all particles target the same distribution.
The population of generation t is then, corrected for run-time bias as in DYN by only considering the N accepted particles that were started first, given as with 0 ≤ Ñ ≤ N particles based on the preliminary proposal gt , and N − Ñ on the final g t .The weights need to be normalized appropriately, as explained in the following section.We call this parallelization strategy, which during generation t − 1 already looks ahead to generation t, Lookahead (dynamic) scheduling (LA).

Weights and unbiasedness
A key property of ABC methods is that they provide an asymptotically unbiased Monte Carlo sample from π ABC,εn t (θ|y obs ), with π ABC,εn t (θ|y obs ) → π(θ|y obs ) for ε → 0. The sample (4) obtained via LA conserves this property: The point is that each subpopulation on its own gives an asymptotically unbiased estimator, since the weights wt ( θ) = π( θ)/g t ( θ), w t (θ) = π(θ)/g t (θ) are exactly the Radon-Nikodym derivatives w.r.t. the respective proposal distributions.These estimates are then combined, which decreases the Monte Carlo error due to the larger sample size.Instead of simply tossing all samples together, it is preferable to first normalize the weights relative to their subpopulation, 3).This is because both weight functions are non-normalized, with generally different normalization constants, which renders them not directly comparable.A joint estimate based on the full population can then be given as with β ∈ [0, 1] a free parameter.A straightforward choice is β = Ñ /N , rendering the contribution of each subpopulation proportional to the respective number of samples.Instead, we propose to  [12] C++ 7 M2 Liver tissue regeneration [29] Morpheus 14 choose it to maximize the overall effective sample size (3), rendering the Monte Carlo estimate more robust.This is a simple constrained optimization problem with solution i.e. the contribution of each subpopulation is proportional to its effective sample size (Supplementary Information, Section 1.4).Supposing that for N → ∞, Ñ /N → α ∈ [0, 1], (5) converges to the left-hand side, as required.A more detailed derivation and extension to more than two proposal distributions is given in the Supplementary Information, Section 1.

Implementation and availability
We implemented LA in the open-source Python tool pyABC [28], which already provided STAT and DYN.We employ a Redis low-latency server to handle the task distribution.If all components are pre-defined, we perform evaluation of the "look-ahead" samples ( θ, ỹ) directly on the workers.If there are adaptive components, the delayed evaluation is performed on the main process.To avoid generating unnecessarily many preliminary samples in the presence of some very long-running simulations, we limited the number of preliminary samples to a default value of 10 times the number of samples in the current iteration.To not start preliminary sampling unnecessarily, we employed schemes predicting whether any termination criterion will be hit after the current generation.The code underlying this study can be found at https://github.com/EmadAlamoudi/Lookahead_study.A snapshot of code and data can be found at https://doi.org/10.5281/zenodo.7875905.

Results
Wall-time superiority of DYN over STAT has already been established in prior work [18].To study the performance of LA and compare it to DYN, we applied both to several parameter estimation problems and in various scenarios of population size N and workers W .We distinguish between "LA Pre" using the preliminary Pt−1 to generate gt , and "LA Cur" using P t−2 instead.

Test problems
We considered four problems (   the ODE simulations.For this model, run-times are fast, permitting repeated analysis to check correctness of the method, quantify stochastic effects and assess average behavior.Problem M1 describes the growth of a tumor spheroid using a hybrid discrete-continuous approach, modeling single cells as stochastically interacting agents and extracellular substances via PDEs [12].Problem M2 describes the metabolic status of mechano-sensing during liver generation, describing the reaction network dynamics by a set of ODEs [29].

Biased proposal can induce practical bias in accepted population
The analysis of test model T1 revealed that for small population sizes N relative to the number of workers W , together with high acceptance rates, LA Pre can indeed lead to a bias towards shortrunning simulations (Figure 3 right).This can happen when Pt−1 is only based on short-running simulations, and only proposes particles from that regime, enough of which are then accepted to form P t .For N ≥ W , this effect no longer arose, likely because given large population sizes and sampling from other modes with associated high importance weights eventually happened, and because low acceptance rates make it improbably that acceptances in P t only stem from Pt−1 .

Sampling from unbiased proposal solves bias
When replacing LA Pre by LA Cur, i.e. sampling from P t−2 instead of Pt−1 , the bias no longer occurred (Figure 3 right).This is as expected, because gt−2 is has no run-time bias.In practice, we did not encounter any such problems on the considered application examples, where results from DYN, LA Pre and LA Cur were highly consistent.Yet, LA Pre may fail in some situations, which also demonstrates that ABC-SMC algorithms are sensitive to potential bias in the proposal distribution.Thus, in the following, we focus on the stable LA Cur, showing pendants for LA Pre in the Supplementary Information.

Look-ahead sampling gives accurate results
We used problem T2 to analyze different scenarios, with population sizes N from 20 to 1280, worker numbers W from 32 to 256, and log-normally distributed simulation times of variances σ 2 from 1.0 to 4.0.We ran each scenario 13 times to obtain stable average statistics.We considered means and standard deviations as point and uncertainty measures.
Point estimates for DYN and LA converged to the same values across population sizes (Figure 4  A+B).The proportion of accepted LA samples in the final population originating from the preliminary distribution ranged from nearly 100% to 50% (LA Pre) and 20% (LA Cur), as expected decreasing for larger population sizes (Figure 4 E+F).The more pronounced decrease for LA Cur than LA Pre is reasonable because void of bias, Pt−1 provides a better sampling distribution than P t−2 .Effective sample sizes were stable across DYN and LA (Figure 4 D).A higher run-time variance lead to an increase in accepted samples originating from the preliminary proposal distribution (Supplementary Information, Figure S1), which is expected, because greater heterogeneity in runtimes raises the chance of encountering exceptionally long-running simulations, which DYN has to wait for, while LA proceeds already.

Considerable speed-up towards high worker numbers
To analyze the effect of scheduling strategy on the overall wall-time, we ran model T2 systematically for different population sizes and numbers of workers .We considered population sizes 32 ≤ N ≤ 1280 and numbers of parallel workers 32 ≤ W ≤ 256, which covers typical ranges used in practice.Each scenario was repeated between 13 times to assess average behavior, here we report mean values.
As a general tendency, the wall-time speed-up of LA over DYN became larger with increasing ratio of the number of workers by the population size.For a model sleep time variance of σ 2 = 1 (Figure 5), e.g. for N = 20 and W = 256, the average wall-time got reduced by a factor of almost 1.8.In most scenarios, a wall-time reduction by a factor of between 1.11 and 1.8 was observed.Only when the population size was large compared to the number of workers, was the speed-up comparably small.
For a sleep time variance of σ 2 = 2 (Supplementary Information, Figure S2), we observed similar behavior.There, the acceleration was generally more pronounced with up to a factor of roughly 1.9 and many factors in the range 1.2 to 1.9.This indicates that indeed the advantage of LA over DYN is more pronounced in the presence of highly heterogeneous model simulation times.
Also on T1, the comparison of run-times (Figure 3 left) revealed a speed-up of LA over DYN.Further, we could confirm on both T1 and T2 (Figure 3 and Supplementary Information, Figure S3)the substantial speed-up DYN already provides over STAT, as reported in [18], on which we here improved further.

Scales to realistic application problems
Given the high simulation cost of the application problems M1-2, we only performed selected analyses to compare LA and DYN.A reliable comparison of run-times in real-life application examples is challenging, because the total wall-time varies strongly due to stochastic effects, and computations are too expensive to perform inference many times.
For the two models, the parameter estimates obtained using LA (both LA Per and LA Cur) and DYN are consistent, up to expectable stochastic effects .Together with the previous analyses, this indicates that for practical applications, the multi-proposal approach of LA allows for stable and accurate inference, similar to the single proposal used by DYN.In early generations, a considerable part of the accepted particles was based on the preliminary proposal distribution (near 100%), which then decreased in later generations (Supplementary Information, Figures S6 and S12).This is consistent with the decrease in acceptance rate and thus the relative time during which the preliminary and not the final proposal distribution is used.For the tumor model M1, we used an adaptive quantile-based epsilon threshold schedule [30], with DYN, LA Pre and LA Cur, population sizes N ∈ {250, 500, 1000}, and W ∈ {128, 256} workers.For each considered configuration we performed 2 replicates (in total 8) to assess average behavior.Reported run-times are until a common final threshold was hit by all runs.The speed-up of LA over DYN varied depending on the ratio of population size and number of workers, similar to what we observed for T1+2.For high ratios, LA was consistently faster up to 35%.However, for low ratios, less improvement was observed.In some runs, LA was slightly slower than DYN (Figure 6).Over the 8 runs, we observed a mean speed-up of 21% (13%) and a median of 23% (16%) for LA Cur (LA Pre).This indicates expected speed-ups of 13-23%, however it should be remarked that large run-time differences and volatility could be traced back to single generations taking vast amounts of time (Supplementary Information, Figure S7).Those long generations occurred in all scheduling variants and exist most likely because the epsilon for that generation was chosen too optimistically, indicating a weakness of the used epsilon scheme, rather than the parallelization strategy.
For the liver regeneration model M2, we performed similar analyses, with adaptive quantile-based epsilon threshold schedules, population sizes N ∈ {250, 500, 1000} and W ∈ {128, 256} workers, with 2 replicates per configuration.Similar to model M1, we observed a faster performance of up to 35%.However, with a smaller ratio between population size and the number of workers, a slightly lower performance improvement was achieved (Figure 7).Similarly to M1, the acceleration varied quite strongly.For LA Pre we observed a mean speed-up over all 8 runs of 36% (median 31%) over DYN.However, for LA Cur we observed contrarily a mean slow-down of 39% (median 43%) over DYN.It is not clear what caused this stark difference, which is again subject to high fluctuations.Further tests would be needed to assess the reasons for this specific model.

Discussion
Simulation-based ABC methods have made parameter inference increasingly accessible even for complex stochastic models, are however limited by computational costs.Here, we presented "look-ahead" sampling, a parallelization strategy to minimize wall-time by using all available high-performance computing resources at near-all times.On various test and application examples, we verified the accuracy and robustness of the novel approach in typical settings.Depending on model simulation run-time heterogeneity, and the relation of population size and the number of cores, we observed a speed-up of up to 45% compared to the previously most efficient strategy, dynamical scheduling.On typical application examples, we observed a speed-up of roughly 20-30%, however with some variability, assessing which would require further tests with considerable computational resources.We would also like to remark that ABC-SMC is sensitive to the choice of proposal distribution.Finite samples can induce a practical bias, as we observed here for parameter-dependent run-times of models -a problem that occurred in extreme cases but could only be solved by using look-ahead sampling with the previous, and not the preliminary, proposal distribution.
Conceptually and aside implementation details, the presented strategy provides the minimal wall-time among all parallelization strategies, as all cores are made use of at practically all times.We observed that look-ahead sampling using preliminary results (LA Pre) provided a performance speed-up over re-using the previous generation (LA Cur), however at the cost of practical bias.Were it possible to construct an unbiased proposal using those preliminary results, e.g.via reweighting or imbalance detection, we could thus increase the speed-up with robust performance.When using delayed evaluation, it would be possible to parallelize the evaluation as well, which we have not done here.If evaluation times are long relative to simulation times, e.g. if (adaptive) summary statistics involve complex operations, this would be beneficial.In order to reduce a potential bias in the preliminary proposal distribution towards fast-running simulations, it may be beneficial to  update it as soon as more particles finish.This would imply the use of more than two importance distributions, the theory of which we have already provided in the Supplementary Information.
In conclusion, we showed how we can minimize wall-time and associated computing cost of ABC samplers with substantial performance gains over established methods.This concept is generally applicable for sequential importance sampling methods, thus we are certain that it will find widespread use.can be obtained by simple projection to the first component.π ABC,ε (θ|y obs ) is an approximation of the actual parameter posterior, with π(θ|y obs ) = lim ε 0 π ABC,ε (θ|y obs ) under mild assumptions.

Standard importance sampling
We employ importance sampling (ABC-IS) embedded in a sequential Monte-Carlo scheme (ABC-SMC).For a proposal distribution g(θ) π(θ) and a target population size N , we use an ABC-IS scheme as shown in Algorithm 1.In ABC-SMC, Algorithm 1 is then iterated multiple times for successively refined thresholds ε and proposals g.Here we focus on a single such iteration.This form of ABC-IS generates samples from the distribution Here, the g(θ) is because we sample θ ∼ g(θ), further we simulate data y from the likelihood and discard particles not satisfying the distance condition, such that while less than N acceptances do sample parameter θ ∼ g(θ) simulate data y ∼ π(y|θ) accept θ if d(y, y obs ) ≤ ε compute weights w i = π(θ i ) g(θ i ) , for accepted parameters {θ i } i≤N output: weighted samples {(θ i , w i )} i≤N Algorithm 1: Importance ABC algorithm.
Therefore, the importance weights, Radon-Nikodym derivatives, of the proposal (2) against the target (1) are given by with normalization constants C, C G .If all normalizations were known exactly, we could calculate the unbiased importance estimator of a test function f over the ABC posterior as In general, we however do not know C and C G and thus v only up to normalization, and therefore need to employ the self-normalizing estimate with self-normalized weights W i := w i j w j , which uses another Monte-Carlo approximation for the normalization constant.It is only asymptotically unbiased, converging almost surely as N → ∞.
Remark: Were we instead to accept all particles irrespective of the distance, as some implementations do, we would need to regard I[(y, y obs ) ≤ ε] as part of the weighting, as in that case . In general, this interchangeability of importance and rejection sampling occurs in various places in ABC algorithms.Further note that the here presented scenario based on a uniform acceptance kernel I[d(•, y obs ) ≤ ε], which is the most widely used one, naturally generalizes to arbitrary acceptance kernels K ε (•, y obs ).

Effective sample size
Importance sampling allows to e.g.better explore high-density regions by tailored proposals.However, as the particles need to be weighted, some contribute more to estimates than others, which can be interpreted as having a lower effective sample size (ESS) than a sample of the same size directly from the target distribution.A common definition of the ESS can be motivated as follows: Consider a linear combination S N = i≤N w i X i i≤N w i of i.i.d.random variables X i of variance σ 2 > 0, with weights w i ≥ 0. The unweighted mean 1 N e i≤Ne X i of N e variables has variance σ 2 /N e .Equating, we obtain as a scale-invariant quantification of the ESS of the weighted sum.

Importance sampling with multiple proposal distributions
Now assume we have multiple proposal distributions {g l (θ)} l≤L instead of just one, and have generated N l > 0 accepted particles {(θ l i , w l i )} i≤N l from each proposal, with N = l N l , giving a total population P = {{(θ l i , w l i )} i≤N l } l≤L of size N .This generalizes the scenario presented in the main manuscript, where we have two proposal distributions, a preliminary one and a final one.
The goal is to for a test function f define a robust Monte-Carlo estimate making use of all samples.One possible estimator could be obtained by just ignoring the fact that multiple proposals were used and just throwing all samples and weights together: Equation ( 4) shows that we obtain an asymptotically unbiased estimate, however the subpopulations are weighted by the normalization constants C l , which is in general not meaningful, unless these carry an appropriate interpretation.Further, this joint estimator does not account for the difference in importance estimator quality of the various G l .
Instead of (4), we suggest to first consider each subpopulation separately, and obtain a joint estimate as a weighted sum of estimators with coefficients β l s.t. and subpopulation-wise self-normalized weights.

Subpopulation weighting
The β l are free parameters and should be chosen to yield a robust estimator.A straightforward choice is i.e. to normalize the contribution of each proposal by the number of samples generated from that proposal.
As the proposal quality can vary, this weighting may not be ideal.Instead of ( 7), we suggest to choose the β l to e.g.maximize the overall ESS, in order to obtain a robust estimator.Here, (3) takes the shape Denote for short Q l := i≤N l (W l i ) 2 , the diagonal matrix Evaluating ∇ β,λ [β T Qβ + λ(1 T β − 1)], using a Lagrange multiplier λ ∈ R, readily shows that the unique solution of ( 9) is given by the positive-definite linear system 2Q 1 1 T 0 Written out, we obtain i.e. the overall ESS is maximized by setting the contribution of sub-population l proportional to its ESS, which intuitively makes sense.

Algorithm implementation in pyABC
The ABC-SMC algorithm and all samplers used and developed in this work have been implemented and made available as part of the python-based open-source package pyABC (https://github.com/icb-dcm/pyabc).The sampler implementation uses the Redis package (https://redis.io)as broker between main process and workers on a distributed high-performance computing (HPC) architecture.
Before the analysis, a Redis server is set up.To this server the desired number of workers is then connected.Workers can also be added interactively during the analysis.In addition, there is a process for the main analysis, the ABC-SMC algorithm, which submits in each generation simulation tasks to the server, which broadcasts them to the workers and collects results, to be read in again by the main process.The whole process of environment setup and analysis has been fully automated for HPC infrastructure.
Simulation tasks are broadcast by the server via dedicated port channels, which workers listen on if they are idle.Workers return simulation results via a queue, which is continually collected from by the main process.Start times are tracked via shared variables to ensure the main process is aware which tasks need to be waited for.
Analyses of model T2, M1, and M2 were performed on the Bonna cluster at the University of Bonn.
A standard node has the specification of 2 × 16 cores, 6 GB RAM per core, and 2x480GB SSD.
For the different models, we used 1-9 nodes for T1 and 1-8 nodes for T2, M1 and M3.

Test models
To test the correctness and the performance of the new concept, we used 4 models as described in the Main Manuscript, Table 1, which we describe in further detail in this section.First, we used LA on some basic toy models (T1 and T2) to examine the general behavior and some specific properties.Later, we also performed tests on more complex application examples (M1 and M2) to test the performance in realistic settings.

(T1) Unbalanced Modes
Usually, the run-time of a single simulation will not be independent of the parameter candidate but instead, be somehow correlated to the region of the parameter space.High variance of the run-time paired with correlation between the simulation time and the candidate could possibly lead to a strong imbalance in the preliminary proposal.This raises the question of what happens when it is likely that the preliminary population is strongly biased.
To analyze this phenomenon, we constructed a two-mode scenario by considering a model that simply squares its only parameter, i.e. y = θ 2 and then adds additive noise to that.Assuming the observed data consist of one observation with value 1 and that the prior is uniformly distributed on [−2, 4], we get two parameters that could equally be true, −1 and +1.We added a shorter idle time sampled from a log-normal distribution with mean 0.05 and σ 2 =0.025 to the negative mode and a larger one (mean=1 and σ 2 =0.5) to the positive mode.Given a large number of parallel workers and a comparably small population sizes, it is thus possible that the preliminary population Pt−1 consists entirely of samples from the negative side.Depending on the shape of the thus created preliminary proposal gt , it is then possible that it only proposes samples from the negative side.It can happen that enough of these get accepted, such that the subsequent accepted population P t consists only of samples from the negative side.

(T2) ODE Model
While ODE based models are not the main application area of ABC, it is fairly simple to create an example for which we can easily verify the correct result.Parameter inference for such a model is also not as time intensive as real life application examples.Therefore, we can perform it several hundred times with the new concept and compare the results to the ones of established approaches, making this a good test model to perform sanity checks on the new algorithm.
Our basic ODE-model with two parameters describes the inter-conversion between two species x 1 , x 2 , with rates θ 1 , θ 2 .It is given by Using this ODE, we can immediately verify that the results converge towards the same value as the original version for which the asymptotical correctness is known.In order to not have a fully deterministic system and to consider that in reality measurements are not perfectly accurate, we added multiplicative normal noise ∼ N(1, 0.03) to each model evaluation.
To analyse the run-times and the effect of heterogeneous simulation times, we used the same ODE and added some idle time to each model call.This idle time was chosen in a way that imitates some single simulations taking vastly longer than the average value.As distribution for t idle we employed a log-normal one t idle ∼ lognormal(µ n , σ 2 n ), where µ n , σ 2 n are the mean and variance of the underlying normal distribution.µ n , σ 2 n were chosen such that the real variance σ 2 takes values between 0.25 and 4 and the real mean µ is constantly equal to 1 for the different values of σ 2 , i.e.

Results -Correctness
Using (T2) we generated 13 replicas of different configurations of both using DYN and LA scheduling for each population size on 32 to 256 workers.
Both scheduling approaches seem to return equal results (see Figure 4) when taking into account that any two runs will always slightly differ due to the innate random nature of the sampling, the noise and the run-time.
(A) demonstrates how the mean for both DYN and LA converges to the same value as we increase the population size.At the same time, (B) shows that also the standard deviation within individual distributions decreases, meaning that, as N increases, both posteriors peak more sharply at the same value.(C) makes it clear that a stronger variances for t idle does not affect the mean, even though there is a larger fraction of preliminary particles expected.In (D), one can see that LA does not seem to substantially affect the effective sample size either even though we here used N = 20, which, on 256 workers, makes it likely that several generations are sampled completely from the preliminary.In E and F, we can see the fraction Ñ /N of accepted samples in the final population at t = n t that were generated using the preliminary proposal gnt (θ).In G, we can see a comparison of the posterior distribution of the different scheduling.One can appreciate the nice agreement in the shape of the posterior from different scheduling strategies.By showing that the preliminary and the final proposal both considerably contribute to the posterior distribution, we get a strong indication that everything works as intended.And indeed, in many of the relevant scenarios the amount of particles that were based on the preliminary in the last generation ends up making up roughly half of the population, which is returned as output by the ABC-SMC algorithm (see Figure S1).  .
In Figure S1, one can also observe how the fraction of particles sampled from the preliminary decreases as the population size increases.This makes sense, since keeping the number of workers constant results in a more or less constant amount of resources used for the preliminary sampling between the N -th acceptance in a generation and the last worker to finish working on that generation.So, there should be a roughly constant number of preliminary acceptances opposite to an increasing size of the total population.
Similarly, a higher run-time variance increases the fraction of preliminary particles, as this results in more time between the N -th acceptance and the last simulation of a generation.

Results -Run-Time
To analyze how much effect the new scheduling can have on the wall time, we ran the same model on different population size to worker ratios.First, we used an idle time with variance σ 2 = 1, the same population sizes as above, i.e. between 20 and 1280 and ran each of those on 1, 2, 4, and 8 nodes with 32 workers each, so on 32 to 256 workers.For a run-time variance of σ 2 = 1, we started with 13 repetitions of each scheduling approach (Except for STAT, where we used only 3 repetitions for time-preserving purposes) for each population size on 256 workers (see Figure 5).For σ 2 = 2, run-times were generally higher (see Figure S2).
We can observe a substantial speed-up when using LA instead of DYN scheduling (see Figure 5 and S2), whenever the population size is about as large or slightly larger than the amount of workers.In the best case, the LA approach decreased the wall time by nearly a factor of two when compared to the dynamic scheduling runs.On the other hand, the difference in efficiency is much less apparent, or even almost trivial, in case the two factors, population and worker size, are vastly different.
In the case where the amount of workers is substantially larger than our population size, only few simulations remain for each worker.This can go as far as even having enough preliminary acceptances to complete the next generation even before the previous one is finished leaving all further simulations retrospectively useless, which results in LA scheduling also having a poor parallel efficiency.Nonetheless, in that case there is still a substantial acceleration when compared to established approaches as LA can almost complete two generations in the time it takes to complete one using DYN.
When the population size is multiple times larger than the amount of available workers, dynamic scheduling already performed very well, leaving little room for improvement.However, even in that scenario we do not have any changes to the negative, so any additional computation necessary to enable the sampling from the preliminary population appears always worth the effort.
Further, using the same setup, we examined the effect of the run-time variance of the single evaluations on the speed-up.For that we ran the same tests as before with an idle time variance of σ 2 = 2; and indeed results (see Figure S2) seem to indicate that the achieved acceleration is even slightly higher than for the less varying run-time with σ 2 = 1.Finally, both DYN and LA scheduling scale better than STAT when using more computational resources (See Figure S3)

(M1) Tumor Growth Model
The tumor growth model (M1) is our first test instance based on a real life example.The model was developed mainly in [2] and aims to describe the growth of a tumor spheroid while taking into account its spacial structure.For example, the proliferating cells are almost exclusively the ones in the outer rim of the spheroid whereas the ones contained in the core are mostly necrotic.
As hybrid discrete-continuous model it uses different mathematical tools trying to accurately depict a real life biological system.

Time Spheroid radius [μm]
An agent-based approach with stochastic interactions between separate particles in a system, is used to model the relations between the individual cells, while a system of PDEs describes the extracellular matrix.At the same time, mechanisms like cell division and cell death are modeled using a continuous time Markov process.For further details about the model see [2,3] and for some biological background see e.g.[4,5].
The version we used as test instance is a two dimensional implementation of the tumor growth model with seven parameters.It is available in the python package tumor2d (https://github.com/ICB-DCM/tumor2d).Even using several hundred workers and a population size of e.g.1000, the parameter inference for the tumor growth model takes hours to days, making it close to impossible to run the simulation on anything but large-scale parallelized infrastructure and strictly necessary to employ an efficient workload distribution scheme.Yet, the run-time is not yet as exorbitantly high as for some other models, such that it was possible to perform several runs to obtain at least some reliability in the results.This was particularly important as the trajectories of the tumor growth model underlie stochastic fluctuations, especially in the early phases, as there the number of cells is still very limited.Towards the end, the cell numbers are far higher and the fluctuations are much less severe due to the averaging effect of the Law of Large Numbers.

Results -Correctness
We ran model M1 with population sizes ranging from 250 to 1000, employing two different amounts of workers 128 and 256.The acceptance criteria for candidates of a generation are adaptively chosen after the previous one finishes as the 50% quantile of the previous accepted distances.The run finishes after a generation t in which the threshold ε t falls below a certain value, in our runs ε nt ≤ 700.Every time one scenario was run, one instance with dynamic scheduling, one with LA Pre, and one with LA Cur scheduling was performed with the same setup to directly compare the results.
During the runs, no substantial deviations in the quality of results were observed (see Figure S4 and Figure S5; detailed results of the other runs can be found in the supplementary code).While the fraction of preliminary particles tends to decrease over the course of the inference, it varies strongly from generation to generation (see Figure S6).As the time between the N -th acceptance in generation t and the last worker to finish is expected to be more or less constant, we have a similar amount of evaluations from the preliminary proposal in each generation.Generally, the acceptance rate decreases with the epsilon threshold however, and this also holds for the preliminary proposal based candidates.So, as more total evaluations are necessary to reach the desired number of accepted particles N , a smaller fraction of those should end up being sampled from the preliminary as the generations progress and the acceptance rate decreases.Results -Run-Time For more complex models, it is usually far more effective to use the adaptive epsilon schedule mentioned in the previous section.That however, leads to a different final acceptance threshold in every run, which strongly affects the run-time.Together with the very stochastic start of the tumor growth model, this yields a high variance of the wall times, making a direct comparison more difficult.Nonetheless, we can for example observe the time point after which each epsilon value is achieved.
In the more extreme cases, it also occurred that the LA scheduling took slightly longer than the corresponding DYN run, but similarly also that the LA run decreased the wall time by a factor of more than 2 when compared to the DYN scheduling based one (see Figure S7).These stronger differences in wall time usually seem to be traceable to one single generation taking vast amounts of time.Those long generations occur in all scheduling variants and exist most likely because the epsilon for that generation was chosen too optimistically.On average, it seems that acceleration varies based on the population size and the number of workers to be used.Over the 8 times we have executed the inference of the tumor model with an adaptive epsilon schedule, we observed a mean acceleration of 21%, with the median value being 23% for LA Cur.

(M2) Liver regeneration model
M2 is a model of YAP regulation by mechanical stimulation through expansion of the bile canaliculi (BC) [6].A single realisation of a model varies greatly based on parameter values ranging from 3 to 3,000 seconds.This model has 14 unknown parameters and two observables, namely nuclear YAP and total YAP intensities which were quantified from image tiles covering an entire portal and central vein.The yes-associated protein (YAP) is an activator that activates the Hippo pathway that plays an important role in liver regeneration.
Two sub-models were used to describe the changes of osmotic pressure and the concomitant activation of YAP after PH.The first sub-model is a biophysics-based model to predict the local mechanical stress and apical membrane strain that result from the alteration of osmolyte (bile acid) load in the BC network after partial hepatectomy.It considers the spatial geometry of the BC within the portal and central vein axis of the lobule.Sub-model 2 is a biochemistry-based model that predicts the cellular response of YAP to the local mechanical stress.

Results -Correctness
Several fitting configurations were done to model (M2) with population sizes ranging from 250 to 1000, with two different amounts of workers, 128 and 256.The acceptance criteria for particles were adaptively set to be the 30% quantile of the previously accepted distances.The run set to be finished after the discrepancy threshold ε t falls under some values.The value was set differently based on different configurations ranging from 1.4e+03 to 2.0e+03.Similar to model (M1), the three different samplers, DYN, LA Pre, and LA Cur, were executed on similar configurations to ensure a fair comparison.
Assessing the quality of the result, there were no substantial differences of the results during the run (see Figure S9 and S10 and Figure S11; detailed results of the other runs can be found in the supplementary code).The fluctuations of this fraction exist because the acceptance threshold, and thus the acceptance rate, decreases less consistently than when using a static epsilon schedule, as can at least partially be observed in Figure S15.
As a large fraction of the population is based on the preliminary proposal for a substantial part of the generations, it shows that these preliminaries have a large effect on the LA version of the ABC-SMC algorithm.Nonetheless, the posteriors are consistently similar, indicating that everything works as intended and no bias was introduced.

Results -Run-Time
To save computational resources, we employed an early rejection strategy to reject particles based on a maximum run-time for individual simulations not 386 matching the data.However, this will decrease run-time heterogeneity.To assess the affect of that on the speed up of the look-ahead sampler, we fit the model using two different maximum run-time, 15, and 30 minutes.The result of the 30min maximum run-time can be seen in Figure 6.In addition, the 15min maximum run-time is presented in figure S14.Result shows clearly that if heterogeneity increases, one would expect higher speed-up from both LA Pre and LA Cur compared to DYN sampler.Figure S15 shows the development of the epsilon threshold for the 8 different runs.In the more extreme cases, it also occurred that the LA scheduling took slightly longer than the corresponding DYN run, which was observed more in the less heterogeneous runs (e.g. when using 30min as maximum run-time).On average, it seems that acceleration varies based on the population size and the number of workers to be used.Over the 8 times we have executed the inference of the liver regeneration model with an adaptive epsilon schedule, we observed a mean acceleration of 36%, with the median value being 31% for the LA Pre.

Simulation time variability
The simulation time for a model often relies on the parameter space, as exemplified by model M2.The forward simulation of this model is heavily influenced by the chosen set of parameters.For instance, parameters such as the inactivation rate of YAP (k4) or the activation rate of YAP (k5) both greatly affect the computation time, as demonstrated in Figure S16.

Figure 1 :
Figure1: Illustration of core usage over run-time for static (STA), dynamic (DYN) and look-ahead (LA) scheduling for a population size N = 5 on W = 8 workers, over 3 generations (colors).The shading indicates whether a sample satisfies the acceptance criterion and is included in the final population (dark), satisfies the acceptance criterion but is discarded because enough earlier-started accepted samples exist (medium, for DYN+LA), or does not satisfy the acceptance criterion and is rejected (light).Solid lines indicate the end of a generation, dashed lines indicate the (preliminary) beginning of a generation (different from solid lines only for LA).

Figure 2 :
Figure2: Concept visualization of look-ahead scheduling (LA).As soon as no more simulations are required for generation t (green), a preliminary simulation task for generation t + 1 is formulated either based on generation t (dark purple) or t+1 (light purple).Resulting simulations are considered when evaluating the next generation (t = t + 1), and suitable weight normalization is applied to all samples (top right).Over time, the number of workers dedicated to generation t decreases, while that for generation t + 1 increases (bottom).

Figure 3 :
Figure 3: Run-time and posterior approximation for 5 different runs of model T1 with STAT, DYN, LA Pre and LA Cur, with population size N = 100 on W = 144, 240, 432 workers.

Figure 4 :
Figure 4: Results for problem T2 for different population sizes N , worker numbers W , and sleep time variances σ 2 .Unless otherwise specified, we used N = 1280, W = 256, and a log-normally distributed sleep time t sleep of variance σ 2 = 1.To increase comparability, the ε t values over n t = 8 generations were pre-defined.(A) and (B): Mean and variance of the posterior approximation π ABC,εn t (θ|y obs ).Box-plot over 13 repetitions.(C): Posterior mean for different sleep time variances, for W = 20.(D): Effective sample size across different sleep time variances, for N = 256 and W = 20, in which case it is likely that several generations are sampled completely from the preliminary proposal.(E): Fraction Ñ /N of accepted samples in the final population t = n t that originate from the preliminary proposal gnt (θ) for LA Pre and LA Cur.(F): Exemplary visualization of 1d posterior approximation marginals for single runs.

Figure 5 :
Figure 5: Speed-up (1 − {Wall-time LA}/{Wall-time DYN}) of LA Pre (left) and LA Cur (right) over DYN for various population sizes and numbers of workers, for a model sleep time variance of σ 2 = 1.

Figure 6 :
Figure 6: The run-time and posterior distributions for 2 different runs of the model (M1) with population size 1000, 500, 250 on 128 and 256 workers.

Figure 7 :
Figure 7: The run-time and posterior distribution of 2 different runs of the model (M2) with population size 1000, 500, 250 on 128 and 256 workers.

Figure S1 :
Figure S1: Fraction of particles sampled from the preliminary proposal in the last generation (for σ 2 = 1 (left) and N = 1280 (right)) for LA Pre (upper row) and LA Cur (second row)..

Figure S2 :
Figure S2: The speed-up for LA Pre (left) and LA Cur (right) with σ 2 = 2

Figure S3 :
Figure S3: The time taken by different scheduling methods on different population sizes and numbers of workers.

Figure S4 :
Figure S4: The credible interval of the model M1 with population size 1000, 500, 250 on 128 and 256 workers.

Figure S5 :
Figure S5: Direct comparison of the posterior distributions for all seven parameters of (M1) for a run with N = 1000 on W = 256 workers using DYN, LA Pre, and LA Cur scheduling.The parameters with sharp peaks are visibly at the same location and if the inference returned a broader distribution, it did so in all cases.

Figure S6 :
Figure S6: Total evaluations and fraction of accepted particles of M1 based on the preliminary population in each generation in the LA Pre (top) and LA Cur (bottom) runs of the same run as in Figure S5

Figure S7 :
Figure S7: Development of the acceptance threshold over time for the different runs for model M1.These are all the 8 runs we performed.

Figure S8 :
Figure S8: Comparison of best parameter fit for M1 using DYN (left), LA Pre (middle), and LA Cur (right) for a population size of N= 1000

Figure S9 :
Figure S9: The credible interval of the model (M2) with population size 1000, 500, 250 on 128 and 256 workers.A similar maximum simulation time was used here as in Figure S14 (900 seconds as maximum simulation time)

Figure S10 :
Figure S10: The credible interval of the model (M2) with population size 1000, 500, 250 on 128 and 256 workers.A maximum simulation time was used here as in Figure 6 (1800 seconds as maximum simulation time).

Figure S11 :
Figure S11: Direct comparison of the posterior distributions for all 14th parameters of (M2) for a run with N = 1000 on W = 256 workers using DYN, LA Pre, and LA Cur scheduling.The sharp peaks of the parameters are visibly at the same location, and if the inference yielded a more broader distribution, it does.It did so in all cases..

Figure S12 :
Figure S12: Total evaluations and fraction of accepted particles for (M2) based on the preliminary population in each generation in the LA Pre (top) and LA Cur (bottom) runs of the same run as in Figure S11

Figure S13 :
Figure S13: Comparison of best parameter fit for (M2) using DYN (top), LA Pre (middle), and LA Cur (bottom) for a population size of N= 1000

Figure S14 :
FigureS14: The run-time and posterior distributions for 2 different runs of the model M2 with population size 1000, 500, 250 on 128 and 256 workers.In this run, we use a maximum simulation time of 900 seconds, as opposed to 1900 seconds in Figure5.We can see that decrease in heterogeneity level on simulation time drastically affected the effectiveness of the LA samplers. 0

Figure S15 :
Figure S15: Development of the acceptance threshold over time for the different runs for model M2.These are all the 8 runs we performed.

Figure S16 :
Figure S16: Simulating the liver regeneration model M2 using different values for the parameters k4 and k5 that range from 0 to 4.5 in log10 space

Table 1
ODE) model with 2 parameters describing a conversion reaction x 1 ↔ x 2 , with observables obscured by random multiplicative noise.To analyze sampler behavior under simulation run-time heterogeneity, we added random log-normally distributed delay times t sleep of various variances atop