Scalable and flexible inference framework for stochastic dynamic single-cell models

Understanding how biological processes change dynamically over time, and why they exhibit intra- and inter-individual variability due to differing responses to environmental stimuli and interactions with other processes, has been a major focus of systems biology. The rise of single-cell fluorescent microscopy has enabled the study of these phenomena. Analysing single-cell data with mechanistic models offers an invaluable tool for describing dynamic cellular processes and for rationalising cell-to-cell variability within a population. However, extracting mechanistic information from single-cell data has proven difficult. It requires statistical methods that infer unknown model parameters from dynamic, multi-individual data while accounting for heterogeneity caused by both intrinsic noise (e.g. variation in chemical reactions) and extrinsic noise (e.g. variability in protein concentrations). Although several inference methods exist, efficient, general and accessible methods that facilitate modelling of single-cell data remain lacking. Here we present a scalable and flexible framework for Bayesian inference in state-space mixed-effects single-cell models with stochastic dynamics. Our approach infers model parameters when intrinsic noise is modelled by either exact or approximate stochastic simulators, and when extrinsic noise is modelled by either time-varying or time-constant parameters that vary between cells. We demonstrate the relevance of our approach by studying how cell-to-cell variation in carbon-source utilisation affects heterogeneity in the SNF1 nutrient-sensing pathway of the budding yeast Saccharomyces cerevisiae. We identify hexokinase activity as a source of extrinsic noise and deduce that sugar availability dictates cell-to-cell variability.

Here, it is assumed that the dynamics $X^{(i)}$ are governed by known cell-specific covariates $D^{(i)} \in \mathbb{R}^m$ and by unknown model quantities. The unknown quantities are the individual rate constants $c = (c^{(1)}, \ldots, c^{(M)})$, where $c^{(i)} \in \mathbb{R}^q$, the rate constants with no assumed variability between individuals $\kappa \in \mathbb{R}^p$, the parameters of the measurement error $\xi \in \mathbb{R}^s$, and the population parameters $\eta \in \mathbb{R}^r$, which parameterise the distribution of the individual parameters: $c^{(i)} \sim \pi(c^{(i)} \mid \eta)$. It is always assumed that the data are noise-corrupted, where $\epsilon_l$ represents i.i.d. measurement errors with distribution $\pi_\epsilon(\cdot)$ parameterised by the vector $\xi$; $x^{(i)}_l$ denotes the values of the model states at time $t_l$, and $g(\cdot)$ is a (possibly non-linear) function of its inputs. For the moment we place no further restrictions on $g(\cdot)$; however, in one specific case we will need it to be a linear function of $x^{(i)}_l$ and $\epsilon_l$, see Eq S15. Lastly, both $\kappa$ and $c^{(i)}$ can be augmented to include the initial system state at time zero. As $D^{(i)}$ equals the empty set throughout the paper, we hereafter discard $D^{(i)}$ for simplicity of notation.
The unknown model quantities are inferred by sampling from the posterior

$$\pi(c, \kappa, \eta, \xi \mid y) \propto \pi(c^{(1)}, \ldots, c^{(M)}, \kappa, \eta, \xi) \prod_{i=1}^{M} \pi(y^{(i)} \mid c^{(i)}, \kappa, \xi),$$

where $y = (y^{(1)}, \ldots, y^{(M)})$ contains all observations across all $M$ individuals, the likelihood term for the $i$-th individual is obtained by marginalising over the latent states (Eq S10), and the notation $z_{1:l}$ means $(z_1, \ldots, z_l)$ for a generic variable $z$. For the posterior sampling we employ a Gibbs sampler. Some of the Gibbs steps have an intractable likelihood; for these we employ a pseudo-marginal approach as in [1]. The tractable steps are sampled via efficient Hamiltonian Monte Carlo [2,3,4]. Overall, this approach produces exact Bayesian inference; however, the run time can be substantial. We thus have two options for running PEPSDI. The first option is the Gibbs approach in Alg. 2. The second (and the PEPSDI default) option is to run a Gibbs sampler where cell-constant parameters are allowed to vary weakly between cells (Alg. 3). The latter considerably speeds up the run time (S3 Fig), with the potential caveat that the credibility intervals of the cell-constant parameters can be slightly wider (however, in S1 Fig all parameters but one show remarkable agreement between the inferences from the two Gibbs approaches).

In this section we describe the inference framework in detail. We first present the different stochastic simulators that are currently implemented for simulating intrinsic noise. Then, the pseudo-marginal approach on which the Gibbs-samplers depend is described. Finally, both our Gibbs-samplers are outlined.

A.1 Stochastic simulators
For a well-mixed system, the dynamics of a biochemical stochastic reaction network of $d$ species and $r$ reactions with associated stoichiometry matrix $S \in \mathbb{R}^{r \times d}$ are described by the chemical master equation [5]. The key player here is the propensity vector $h(x_t, c^{(i)}, \kappa, t)$, whose $j$-th element $h_j(x_t, c^{(i)}, \kappa, t)\,dt$ describes the probability that reaction $j$ occurs in the interval $[t, t + dt)$. Typically, we cannot solve the master equation, so PEPSDI relies on simulations of its exact or approximate solutions. To accommodate a wide range of scenarios, our framework currently implements four different simulators.
The first simulator in our framework is Gillespie's "direct method" [6]. This method produces exact stochastic simulations. However, the direct method assumes time-invariant rate constants $c$. To allow for time-varying rate constants, PEPSDI also accommodates the Extrande simulator [7]. Albeit exact, both the Extrande and direct methods simulate each reaction event. Consequently, for large molecule numbers and/or large reaction rates, they can be slow [5]. To reduce run time, our framework also allows for approximate simulators.
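For concreteness, the direct method can be sketched in a few lines. The following Python sketch, including the birth-death example network, is purely illustrative (PEPSDI itself is implemented in Julia, and none of the names below are part of its interface):

```python
import random

def gillespie_direct(x0, stoich, propensities, c, t_end, seed=1):
    """Gillespie's direct method (exact SSA) for a well-mixed reaction network.

    x0:           initial copy-number vector
    stoich:       list of state-change vectors, one per reaction
    propensities: function (x, c, t) -> list of hazards h_j(x, c, t)
    """
    rng = random.Random(seed)
    t, x = 0.0, list(x0)
    while True:
        h = propensities(x, c, t)
        h0 = sum(h)
        if h0 <= 0.0:                    # no reaction can fire any more
            return t, x
        t += rng.expovariate(h0)         # waiting time to next event ~ Exp(h0)
        if t >= t_end:
            return t_end, x
        u, acc, j = rng.random() * h0, 0.0, 0
        for j, hj in enumerate(h):       # pick reaction j with probability h_j / h0
            acc += hj
            if u <= acc:
                break
        x = [xi + s for xi, s in zip(x, stoich[j])]

# illustrative birth-death process: 0 -> X (rate c1), X -> 0 (rate c2 * x)
t_final, x_final = gillespie_direct(
    x0=[10], stoich=[[1], [-1]],
    propensities=lambda x, c, t: [c[0], c[1] * x[0]],
    c=[5.0, 0.5], t_end=50.0)
```

Because every reaction event is simulated individually, the run time grows with the total number of firings, which is what motivates the approximate simulators.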
The third simulator in our framework, and the first approximate one, is the "tau-leaping" method with fixed step-length [8,9]. Assume that the propensities do not change noticeably within $[t, t + \Delta\tau)$, that is, assume the leap condition

$$h_j(x^{(i)}_{t+s}, c^{(i)}, \kappa, t+s) \approx h_j(x^{(i)}_t, c^{(i)}, \kappa, t), \quad s \in [0, \Delta\tau), \quad j = 1, \ldots, r. \quad (S4)$$

The reaction channels will then fire close to independently of each other, and the number of firings $R_j$ of reaction channel $j$ is Poisson distributed as

$$R_j \sim \mathcal{P}\big(h_j(x^{(i)}_t, c^{(i)}, \kappa, t)\,\Delta\tau\big). \quad (S5)$$

Overall, the state vector can be updated via $x^{(i)}_{t+\Delta\tau} = x^{(i)}_t + S^T R$, where $R = (R_1, \ldots, R_r)$.

The fourth simulator, and the second approximate one, is based on the chemical Langevin equation. If, in addition to leap condition 1 (Eq S4), it is assumed that the propensities are sufficiently large, that is,

$$h_j(x^{(i)}_t, c^{(i)}, \kappa, t)\,\Delta\tau \gg 1, \quad j = 1, \ldots, r, \quad (S6)$$

then the Poisson variable in Eq S5 can be approximated by a normal distribution. Combining this with leap condition 1, the state vector can be described by a stochastic differential equation known as the chemical Langevin equation [8]:

$$dX^{(i)}_t = S^T h(X^{(i)}_t, c^{(i)}, \kappa, t)\,dt + \big(S^T \operatorname{diag}(h(X^{(i)}_t, c^{(i)}, \kappa, t))\,S\big)^{1/2}\,dW_t. \quad (S7)$$

Typically, we cannot solve Eq S7 analytically, so we use the Euler-Maruyama method to approximate the solution from time $t$ to $t + \Delta\tau$ as

$$x^{(i)}_{t+\Delta\tau} = x^{(i)}_t + S^T h\,\Delta\tau + \sqrt{\Delta\tau}\,\big(S^T \operatorname{diag}(h)\,S\big)^{1/2}\,\epsilon, \quad \epsilon \sim N(0, I_{d\times d}), \quad (S8)$$

where $I_{d\times d}$ is the identity matrix. As for the tau-leaping simulator, an appropriate step-length $\Delta\tau$ must be chosen.
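Under these assumptions, one leap of the tau-leaping method amounts to drawing Poisson counts and applying the stoichiometry. Below is a minimal Python sketch; the birth-death network used to exercise it is an illustrative assumption, not part of PEPSDI:

```python
import numpy as np

def tau_leap(x0, S, propensities, c, t_end, dt, seed=1):
    """Fixed step-length tau-leaping: within each leap the hazards are frozen
    at the leap start, each channel j fires R_j ~ Poisson(h_j * dt) times,
    and the state is updated via x <- x + S^T R (S is r-by-d, as in the text)."""
    rng = np.random.default_rng(seed)
    x, t = np.asarray(x0, dtype=float), 0.0
    while t < t_end:
        h = np.maximum(propensities(x, c, t), 0.0)  # hazards at the leap start
        R = rng.poisson(h * dt)                     # number of firings per channel
        x = np.maximum(x + S.T @ R, 0.0)            # crude guard against negative copy numbers
        t += dt
    return x

# illustrative birth-death network: 0 -> X (rate c1), X -> 0 (rate c2 * x)
S = np.array([[1], [-1]])                           # r = 2 reactions, d = 1 species
prop = lambda x, c, t: np.array([c[0], c[1] * x[0]])
x_end = tau_leap([0.0], S, prop, c=[50.0, 0.5], t_end=20.0, dt=0.05)
```

In this toy network the stationary mean is $c_1/c_2 = 100$ molecules, large enough that the leap conditions are plausible; for small copy numbers an exact simulator should be preferred.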
Overall, we implemented four simulators to allow the framework to handle a wide range of scenarios. When using the tau-leaping or the Langevin-simulators an appropriate step-length ∆τ must be chosen. Approaches for doing this are given in B. Owing to the modular nature of our code, it is straightforward to implement additional simulators such as hybrid-stochastic simulators [10], or tau-leaping with an adaptive step-length [11]. However, as highlighted in A.5, it is beneficial for performance to use simulators where the number of pseudorandom numbers required for model simulation is known prior to starting the inference.
We thus have from (S9) that the parameter posterior is given by (S2), where $\pi(y^{(i)} \mid c^{(i)}, \kappa, \xi)$ is the data likelihood for individual $i$. The data likelihood is obtained via the following marginalisation over the latent states:

$$\pi(y^{(i)} \mid c^{(i)}, \kappa, \xi) = \int \pi(y^{(i)}_{1:L} \mid x^{(i)}_{1:L}, \xi)\, \pi(x^{(i)}_{1:L} \mid c^{(i)}, \kappa)\, dx^{(i)}_{1:L}. \quad (S10)$$

The integral in (S10) is generally intractable (except for simple cases where the observation equation (S1) and the state dynamics are both linear in the states and both have Gaussian noise, in which case the Kalman filter can be applied). However, we can use a pseudo-marginal approach, replacing the intractable integral with an unbiased estimate obtained via sequential Monte Carlo. Specifically, the pseudo-marginal approach allows for exact Bayesian inference when the observed-data likelihood is intractable but a non-negative and unbiased estimate of it is available. The pseudo-marginal Metropolis-Hastings scheme samples from the desired posterior via the marginal of an extended posterior [12]. We sketched some notions in the main paper, which we reprise here for convenience. In the pseudo-marginal approach we consider the extended posterior

$$\pi(c, \kappa, \eta, \xi, u \mid y) \propto \pi(\kappa, \eta, \xi) \prod_{i=1}^{M} \hat{\pi}_{u^{(i)}}(y^{(i)} \mid c^{(i)}, \kappa, \xi)\, \pi(c^{(i)} \mid \eta)\, \pi(u^{(i)}), \quad (S11)$$

where often (though not necessarily) the parameters are a priori independent, $\pi(\kappa, \eta, \xi) = \pi(\kappa)\pi(\eta)\pi(\xi)$, and the $u^{(i)} \sim \pi(u^{(i)})$ are auxiliary variables used to obtain an unbiased, non-negative estimate

$$\hat{Z}^{(i)} = \hat{\pi}_{u^{(i)}}(y^{(i)} \mid c^{(i)}, \kappa, \xi). \quad (S12)$$

In our case, the $u^{(i)}$ are the pseudo-random draws produced to implement sequential Monte Carlo (particle filter) procedures, such as forward model simulation and particle resampling.
The assumed unbiasedness of the approximate likelihood ensures that the marginal of the targeted posterior (Eq S11) is the desired (exact) posterior,

$$\int \pi(c^{(i)}, \kappa, \xi, \eta, u^{(i)} \mid y^{(i)})\, du^{(i)} \propto \pi(c^{(i)}, \kappa, \xi, \eta \mid y^{(i)}). \quad (S13)$$

To obtain an unbiased [13,14] likelihood estimate, we use a sequential Monte Carlo procedure known as a "particle filter" (Alg. 1); see [15] for a recent review. Briefly, a particle filter recursively estimates a sequence of filtering densities $\pi(x^{(i)}_t \mid y^{(i)}_{1:t})$, where $y^{(i)}_{1:t}$ are the observations from time-step 1 to $t$ for the $i$-th individual. This is achieved by a series of move and resampling steps. To move the particles, a model-dependent proposal distribution is used. This proposal distribution could correspond to an SSA simulator, or an approximate simulator such as the tau-leaping simulator or the Euler-Maruyama method for SDEs, but the list could continue. Following a move, the particles are resampled (with replacement) to avoid so-called "particle impoverishment". A side product of approximating the filtering distribution is that an unbiased, non-negative estimate of the observed-data likelihood is obtained via the product over time of the averaged particle weights, where $N^{(i)}$ is the number of particles used in the particle filter for the $i$-th individual. Less formally, the particle filter propagates forward in time multiple traces (particles) of the latent system's state from the assumed model. At each time step, the traces (particles) that are closer to the observed data are favoured in the resampling step. Only the resampled particles are then propagated forward, and the less promising ones therefore "die".

Algorithm 1: General particle filter
Input: Parameters $c^{(i)}, \kappa, \xi$, data $y^{(i)}$, auxiliary variables $u^{(i)}$ and number of particles $N^{(i)}$.
Output: Observed likelihood estimate $\hat{Z}^{(i)}$ for individual $i$.
Notation: We use the convention that $\forall k$ means "for all $k = 1, \ldots, N^{(i)}$".
1: Initialisation ($t = 1$)
   a: Initialise the starting state: set the value of $x^{(i)}_{1,k}$ or sample it from some initial distribution $x^{(i)}_{1,k} \sim q_1(\cdot)$.
   b: Compute the weights.
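The move-weigh-resample recursion and the resulting likelihood estimate can be sketched as a bootstrap particle filter. The Python below, including the Gaussian random-walk example, is an illustrative sketch under assumed function names, not PEPSDI's interface:

```python
import numpy as np

def bootstrap_loglik(y, propagate, loglik_obs, sample_x1, N, seed=1):
    """Bootstrap particle filter: propagate N particles through the model's
    transition density, weight them by the error model, resample, and return
    the log of the unbiased observed-data likelihood estimate, log(Z-hat)."""
    rng = np.random.default_rng(seed)
    X = sample_x1(N)                       # step 1: initialise the particles
    logZ = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            X = propagate(X)               # move via the transition density
        logw = loglik_obs(yt, X)           # unnormalised log-weights
        m = logw.max()
        w = np.exp(logw - m)
        logZ += m + np.log(w.mean())       # log of the average weight at time t
        cdf = np.cumsum(w / w.sum())       # systematic resampling (linear cost)
        cdf[-1] = 1.0
        X = X[np.searchsorted(cdf, (rng.random() + np.arange(N)) / N)]
    return logZ

# illustrative Gaussian random-walk model with observation noise sd = 0.5
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=20)) + rng.normal(scale=0.5, size=20)
logZ = bootstrap_loglik(
    y,
    propagate=lambda X: X + rng.normal(size=X.shape),
    loglik_obs=lambda yt, X: -0.5 * ((yt - X) / 0.5) ** 2 - np.log(0.5 * np.sqrt(2 * np.pi)),
    sample_x1=lambda N: rng.normal(size=N),
    N=200)
```

For this linear-Gaussian toy model the exact likelihood is also available via the Kalman filter, which is a useful sanity check when implementing a filter.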
There are multiple factors to consider when executing the particle filter. Any number of particles can be used; however, using too few increases the variance of the estimated likelihood, which has negative consequences for the mixing of the Markov chain when this high-variance approximate likelihood is used in a Metropolis-Hastings step. An approach for selecting the number of particles is given in A.5. The particles can be moved using the SSA simulator, the tau-leaping simulator, the Langevin simulator, or via guided proposals; which propagator to use depends on the situation. The propagators currently implemented in our framework are discussed in A.2.1. The particles can be resampled in multiple ways. Following [1], we use systematic resampling, which has linear complexity. Notice that Alg. 1 contains an optional "sorting" step, which is only executed if a procedure aimed at correlating the particles is employed, as motivated in A.5.1.

A.2.1 Propagators for particle filters

There are several ways to propagate particles forward in time; here we consider two approaches. The simplest is to propagate the particles forward in time using the latent state's transition density as proposal distribution, that is, setting $q(\cdot) = \pi(x^{(i)}_t \mid x^{(i)}_{t-1}, c^{(i)}, \kappa)$. This approach defines the so-called "bootstrap filter" [16,17], and it has evident appeal, as it is often not clear how to construct a proposal distribution $q(\cdot)$ that depends on data, while it is always possible to propagate forward the system's state (or an approximation thereof), which essentially means simulating from the transition density (or an approximation thereof). Clearly, in the bootstrap filter the particle weights (step 2d in Alg. 1) simplify to $\tilde{w}(u_{t,k}, \xi)$. Also, the bootstrap filter allows any error distribution as long as the observation density can be evaluated pointwise. However, propagating according to the transition density is often inefficient if the model is strongly stochastic, such as the Schlögl model (Fig 3B), because a large number of particles may be needed to capture stochastic events. In these scenarios, making the proposal distribution $q(\cdot)$ depend on data is necessary for efficient simulation. In this sense, "guided proposals" that at time $t-1$ propagate the particles for states $x_{t-1}$ conditionally on the next data point $y_t$ can be much more efficient [18,19,20]. However, guided proposals are typically limited to specific error models.
For propagators using the transition density (i.e. the bootstrap filter), in our framework the particles can be propagated using the SSA direct method [6], the Extrande method [21], the tau-leaping method (Eq S5) and the Langevin method (Eq S7). The method to choose depends on which leap conditions are fulfilled (Eq S4 and S6).
Regarding guided proposals, our framework currently includes the modified diffusion bridge for the Langevin method [22]. First, assume a linear error model

$$y^{(i)}_l = P x^{(i)}_l + \epsilon_l, \quad (S15)$$

where $P$ is a constant matrix (as an example, if $P$ is diagonal with diagonal entries in $\{0, 1\}$, then a 0 on the diagonal denotes a corresponding unobserved coordinate of $x$). Here, also assume that the model dynamics are described by the Langevin equation (Eq S7), and take an appropriate time discretisation $t_l = \tau_0 < \cdots < \tau_m = t_{l+1}$ with $\Delta\tau = \tau_{k+1} - \tau_k$. Then a linear Gaussian approximation of the conditional distribution $q(x^{(i)}_{\tau_{k+1}} \mid x^{(i)}_{\tau_k}, y^{(i)}_{l+1})$ can be derived [22], and overall this allows the particles to be propagated conditionally on the next observation. A guided proposal, such as the modified diffusion bridge, produces filters that are often statistically more efficient than transition-density-based propagators for the purpose of approximating integrals such as those defining likelihood terms [22,23]. However, the guided proposal outlined above is limited to the stated linear error model in Eq S15 and to models having Gaussian transition densities (for example, the Euler-Maruyama discretisation induces a Gaussian approximation to the true transition density).
Moreover, there are additional propagators beyond those mentioned here. The modular nature of our code makes it straightforward to implement additional ones, such as guided tau-leaping proposals [24].

A.3 Gibbs-sampler
We consider two Gibbs samplers: the first one is described in this section, while the second (and particularly computationally efficient) one is described in section A.4. The Gibbs sampler described in this section targets the posterior in Eq S2 by sampling from the conditionals in Eq S18 (a "blocked Gibbs" scheme; notice that a similar sequence of steps is presented in [23] as "CWPM"). The pseudocode is presented in Algorithm 2. Briefly, the individual parameters $c^{(i)}$ and the cell-common parameters $(\kappa, \xi)$ are updated via a pseudo-marginal approach, since their conditionals depend on the observed-data likelihood (notice that in step 1 we perform $M$ samplings, one for each $c^{(i)}$).
Here, $u^{(i)}$ are the auxiliary variables used to obtain an unbiased estimate of the likelihood via a particle filter (Alg. 1). For the Langevin and Poisson (tau-leaping) methods, we can induce a high correlation in these auxiliary variables to reduce the number of particles needed. Lastly, to propose parameters efficiently, adaptive kernels $q(\cdot)$ are employed for each individual [25,26,27]. Notice that new pseudo-random variates $u$ are created in step 2a of Alg. 2; however, the $u$ used in step 3b are the ones returned from step 2d, that is, we recycle the pseudo-random variates in step 3b when evaluating the approximate likelihood $\hat{\pi}_{u^{(j)}}(y \mid c^{(j)}, \kappa^{(*)}, \xi^{(*)})$ at the newly proposed $(\kappa^{(*)}, \xi^{(*)})$.
The population parameters $\eta$ are proposed using an efficient Hamiltonian Monte Carlo (HMC) sampler [2,3,4]. This allows for a wide range of prior distributions for the individual parameters, $c^{(i)} \sim \pi(c^{(i)} \mid \eta)$.
Currently, a log-normal distribution is implemented: $c^{(i)} \sim LN(\mu, \Omega)$, where $\mu$ and $\Omega$ are the mean and covariance matrix of the associated Gaussian distribution.
Here $\Omega$ can be diagonal, or non-diagonal of the form $\Omega = D(\tau)\Phi D(\tau)$, where $D(\tau)$ is a diagonal matrix with the scale vector $\tau$ on the diagonal and $\Phi$ is a non-diagonal correlation matrix. Typically, this separation approach performs well for covariance-matrix inference [28]. Additionally, the modular structure of the code allows the user to add further distributions for $c^{(i)}$. Moreover, for models where the HMC sampler might be too computationally demanding (non-diagonal $\Omega$ with many individuals), a conjugate-prior update scheme can be implemented [28].
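As an illustration of the separation parameterisation, the following Python sketch (with made-up values for $\mu$, $\tau$ and $\Phi$; PEPSDI itself is written in Julia) draws individual parameters $c^{(i)} \sim LN(\mu, \Omega)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical population parameters for q = 3 individual rate constants
mu  = np.log(np.array([1.0, 0.5, 2.0]))   # means of the underlying Gaussian
tau = np.array([0.2, 0.3, 0.1])           # scale vector
Phi = np.array([[1.0, 0.4, 0.0],
                [0.4, 1.0, 0.0],
                [0.0, 0.0, 1.0]])         # non-diagonal correlation matrix

D = np.diag(tau)
Omega = D @ Phi @ D                       # separation: Omega = D(tau) Phi D(tau)

# draw M individual parameter vectors c^(i) ~ LN(mu, Omega)
M = 1000
c = np.exp(rng.multivariate_normal(mu, Omega, size=M))
```

Because the draws are exponentiated Gaussians, every $c^{(i)}$ is positive, and the correlation structure of $\Phi$ is inherited on the log scale.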

A.4 Modified Gibbs-sampler (default option in PEPSDI)
The Gibbs sampler (Alg. 2) produces exact Bayesian inference; however, the run time can be substantial. A contributing factor is that sampling $(\kappa, \xi)$ requires a stochastic approximation of the product of the individual likelihoods (Alg. 2, step 3). As we approximate this product by computing a stochastic approximation of the likelihood for each individual, the variance of the approximation is typically large. As discussed in section A.5, a "large" variance may considerably reduce the chain mixing and can even make the chain get stuck. To avoid this, one possibility is to use a larger number of particles for each individual ($N^{(i)}$ in Alg. 1), which would however result in longer run times.
To reduce the number of particles, and thus the run time, we introduce a "modified sampler" (Alg. 3). Here, inspired by how cell-constant parameters can be handled in the software Monolix [29], we perturb the original SSMEM slightly by treating the cell-common parameters $(\kappa, \xi)$ as parameters that vary between cells; however, we fix their variance at a small value. For example, we may assume that $(\kappa^{(i)}, \xi^{(i)}) \sim N((\kappa_{pop}, \xi_{pop}), \delta^2 \cdot I)$. Here, $\delta > 0$ is a small tuning parameter specified by the user. We set $\delta$ somewhat arbitrarily, but we found that for parameters of magnitude $10^0$-$10^1$ a value of $\delta = 0.01$ worked well (S1 Fig). In scenarios where the magnitude of $(\kappa, \xi)$ is unknown, a pilot run can be performed to find the scale of these parameters, and $\delta$ adjusted accordingly (pilot runs typically need to be performed in any case, e.g. to check that adaptive MCMC works adequately well). Ultimately, the "modified Gibbs sampler" is a standard Gibbs sampler performed on a perturbed SSMEM, where the perturbation is induced by the use of $\delta$. Given this perturbation, the focus of the inference becomes inferring $(\kappa_{pop}, \xi_{pop})$ alongside the population parameters $\eta$. As we update $(\kappa_{pop}, \xi_{pop})$ using an HMC sampler, the product of the individual likelihoods is not required for any parameter update; rather, only the individual likelihoods are used in the pseudo-marginal approach. Thus, only the variance of each individual likelihood needs to be controlled, and this requires fewer particles than controlling the variance of the population likelihood (i.e. the product of the individual likelihoods). Moreover, as described in A.5, this allows us to tune the number of particles per individual. In summary, the perturbation induced by $\delta$ allows $(\kappa_{pop}, \xi_{pop})$ to be inferred alongside the population parameters $\eta$ in step 3 of the Gibbs sampler, and $(\kappa^{(i)}, \xi^{(i)})$ to be inferred alongside $c^{(i)}$ in step 1.
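The perturbation itself is simple; a Python sketch with made-up parameter values (all names here are illustrative, not PEPSDI's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

kappa_pop = np.array([2.0, 0.7])   # hypothetical cell-common rate constants
xi_pop    = np.array([0.3])        # hypothetical error-model parameter
delta     = 0.01                   # small user-chosen perturbation scale

# in the perturbed SSMEM every cell i receives a weakly varying copy:
# (kappa^(i), xi^(i)) ~ N((kappa_pop, xi_pop), delta^2 * I)
theta_pop = np.concatenate([kappa_pop, xi_pop])
M = 500
theta = theta_pop + delta * rng.standard_normal((M, theta_pop.size))
```

With $\delta = 0.01$ and parameters of magnitude $10^0$, each cell's copy deviates from the population value by only a few per mille, so the perturbed model stays close to the original cell-constant one.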
Step 2 is thus avoided.
As noted for the Schlögl model (S3 Fig), the modified sampler is substantially faster. However, as we can see for the Ornstein-Uhlenbeck model, this comes at the price of $(\kappa_{pop}, \xi_{pop})$ having slightly wider credibility intervals (S1 Fig). We consider this a worthwhile compromise. Moreover, if the user is interested in exact inference, the modified sampler can be used for pilot runs, and the standard sampler (Alg. 2) can then be run from the posterior mode obtained with the modified sampler. Lastly, as no parameters are likely to be truly constant between individuals [30], the modified sampler in a sense infers a more realistic model; how much more realistic is hard to quantify, since $\delta$ is still determined quite arbitrarily.

A.5 The number of particles in the Gibbs-samplers
Choosing an appropriate number of particles for the pseudo-marginal steps of the Gibbs samplers is crucial for performance. If too few particles are used, the Gibbs sampler is likely to get stuck, because the estimated likelihood $\hat{Z}$ will have a large variance. More precisely, imagine that a parameter $\theta_1$ is accepted with an unusually high likelihood value (a value that is effectively stored, since it is a function of the accepted $u$ and accepted parameters) owing to the variability of the estimated likelihood. It is then likely that, for a newly proposed $\theta^*$, the corresponding estimated likelihood will be smaller than the previously accepted one, and a long series of rejections may be started this way. Moreover, the parameter region targeted by adaptive MCMC proposals may shrink drastically, as many adaptive schemes reduce the size of their proposal region following a rejection [26,27]. Although in theory any number of particles lets the Gibbs sampler return exact Bayesian inference (in view of the properties of pseudo-marginal methods) as the number of iterations goes to infinity, in practice a "large enough" number of particles is crucial [31,14,32,33]. However, the run time of the particle filter increases with the number of particles, so a trade-off is needed. Here, principled approaches for setting the number of particles are covered.
A.5.1 Correlated particles

One approach to reduce the number of particles is to correlate the auxiliary variables between successive iterations. Rather than drawing fresh auxiliary variables, the proposal

$$u^{(i),*} = \rho\, u^{(i)} + \sqrt{1 - \rho^2}\, \epsilon, \quad \epsilon \sim N(0, I_d), \quad (S19)$$

is used, where $I_d$ is the identity matrix and $\rho \in [0.95, 0.999]$, which induces a strong positive correlation between successive approximations of the likelihood, $\hat{Z}^*$ and $\hat{Z}$. By inducing a strong correlation, the variance of the acceptance ratio (step 2d and step 3c in Alg. 2) is reduced. While this does not reduce the variance of the approximated likelihoods themselves, the intuition is that what matters is reducing the variability of the likelihood ratio. Consequently, fewer particles can be used without the risk of the sampler getting stuck.
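The update in Eq S19 is cheap to implement. The following Python sketch (with an arbitrary dimension $d$) illustrates that the proposal preserves the standard normal distribution while strongly correlating successive auxiliary variables:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, d = 0.99, 100_000

u   = rng.standard_normal(d)                    # current auxiliary variables, u ~ N(0, I_d)
eps = rng.standard_normal(d)                    # fresh innovation
u_new = rho * u + np.sqrt(1.0 - rho**2) * eps   # Eq S19

# u_new is again marginally N(0, 1) in each coordinate, and corr(u, u_new) = rho,
# so likelihood estimates computed from u and u_new are strongly correlated
```

Because the marginal distribution of $u$ is unchanged, the extended posterior in Eq S11 is still targeted exactly; only the mixing properties of the chain are affected.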
Three factors must be accounted for when employing correlated particles. (i) To correlate the particles, the number of auxiliary variables $u^{(i)}$ must be known a priori. For the tau-leaping method and the Langevin method, the number of random numbers (auxiliary variables) used when running the particle filter is known given a simulation discretisation step size $\Delta\tau$; however, for Gillespie's method the number of auxiliary variables is itself a random variable, and thus a correlated particle filter cannot be employed. (ii) The auxiliary variables generated by Eq S19 are normally distributed. However, the resampling steps in the particle filter require uniform random numbers, and the tau-leaping method requires Poisson random numbers. In these cases, uniform numbers can be obtained via the CDF of the normal distribution, $\Phi(u^{(i)})$. Moreover, for a given uniform draw we have implemented an efficient Poisson generator (https://github.com/cvijoviclab/PEPSDI/blob/main/Code/Stochastic_solvers/Poisson.jl). (iii) The resampling step can break the correlation between particles. This can be alleviated by using a Euclidean sorting step [35]. However, this sorting procedure can fail for larger measurement errors, as a high noise level can cause the correlation to deteriorate [35].
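Point (ii) can be sketched in Python; `norm_cdf` and `poisson_from_uniform` below are illustrative stand-ins for the corresponding PEPSDI routines (e.g. Poisson.jl), not its actual code:

```python
import math

def norm_cdf(u):
    """Standard normal CDF: maps a correlated N(0, 1) auxiliary variable
    to a U(0, 1) number usable in the resampling step."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def poisson_from_uniform(p, lam):
    """Inverse-CDF Poisson generator: the smallest k with F(k) >= p.
    Being a deterministic function of p, it maps correlated uniforms
    to correlated Poisson draws, as needed for tau-leaping."""
    k = 0
    pmf = math.exp(-lam)   # P(X = 0)
    cdf = pmf
    while cdf < p:
        k += 1
        pmf *= lam / k     # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k
```

The key design point is that both maps are monotone and deterministic in the auxiliary variable, so the correlation induced by Eq S19 carries over to the uniform and Poisson draws consumed by the filter.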

A.5.2 Tuning the particles
In the Gibbs samplers, the number of particles to use can be tuned. Here, it is important to perform the tuning at a central posterior location for the parameters to be inferred. In fact, the estimated likelihood typically has a larger variance when computed far away from a posterior mode than when computed close to one. This is because, conditionally on implausible parameter values, the particles will often end up far away from the data, thus providing a poor approximation of the likelihood (this is informally discussed in https://darrenjw.wordpress.com/2014/06/08/tuning-particle-mcmc-algorithms/). Since the goal is to perform sampling around a central posterior location, it is desirable to tune there. To find such a location, a pilot run is required.
Overall, we tune the particles by first running a pilot run (with a high number of particles, to prevent the sampler from getting stuck) for 3000-5000 iterations. Then, at the mode (central posterior location) we fix $c^{(i)}, \kappa, \xi$ and choose the number of particles according to the correlation level $\rho$. When the particles are not correlated, we choose $N^{(i)}$ such that the variance of the log-likelihood ($\sigma^2_{N^{(i)}}$) is smaller than 2. When the particles are correlated ($\rho \neq 0$), we follow [1] and choose $N^{(i)}$ such that $\sigma^2_{N^{(i)}} < 2.16^2 / (1 - \rho_l^2)$, where $\rho_l$ is the estimated correlation between $\hat{\pi}(y^{(i)}, u^{(i)} \mid \kappa, \xi, c^{(i)})$ and $\hat{\pi}(y^{(i)}, u^{(i,*)} \mid \kappa, \xi, c^{(i)})$. For the standard Gibbs sampler, we use the same $N^{(i)}$ for all individuals and tune according to the full-data likelihood (Alg. 2, step 3c). As we sidestep the full-data likelihood in the modified sampler (the default option in PEPSDI), we use the same number of particles for each individual in the pilot run, but then tune the particles based on the variance of the individual likelihoods. This allows the number of particles to vary between individuals in the main inference run, and typically results in fewer particles being required per individual. Naturally, for the modified sampler the number of particles can also, if necessary for computational efficiency, vary between individuals in the pilot run.
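The tuning loop for uncorrelated particles can be sketched as follows; `tune_particles` and the synthetic log-likelihood estimator are illustrative assumptions, not PEPSDI's implementation:

```python
import numpy as np

def tune_particles(loglik_hat, N0=50, target_var=2.0, reps=30, N_max=5000):
    """Keep doubling the particle number N until the sample variance of the
    estimated log-likelihood, evaluated repeatedly at a fixed (central
    posterior) parameter value, drops below target_var."""
    N = N0
    while N < N_max:
        ll = np.array([loglik_hat(N) for _ in range(reps)])
        if ll.var() < target_var:
            return N
        N *= 2
    return N_max

# synthetic stand-in for a particle-filter estimator whose log-likelihood
# variance decays like 400 / N around the posterior mode
rng = np.random.default_rng(1)
N_tuned = tune_particles(lambda N: rng.normal(-120.0, np.sqrt(400.0 / N)))
```

With the assumed 400/N variance decay the loop settles at a few hundred particles, illustrating why fixing the parameters at a central posterior location before estimating the variance matters: further out, the variance (and hence the tuned $N$) would be inflated.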

B Guidelines for running our inference framework
Here we provide guidelines for how to run our inference framework for a user-provided state-space model. The guidelines cover both the default perturbed sampler (Alg. 3, where $\kappa, \xi$ are perturbed to vary weakly between cells) and the non-perturbed sampler (Alg. 2, where $\kappa, \xi$ are cell-constant). Where appropriate, remarks are provided for the different steps.
• Any distribution in the Julia distributions package can be used.
• The mean of the priors can be used if no suitable starting points are known.
• A large particle number is necessary to prevent the sampler from getting stuck during the pilot run.
• If the aim is to do inference using Gillespie's method, an approximate simulation algorithm can be used for the pilot run.
4. Tune the particles at the end location of the pilot run according to the tuning criteria.
• Our implementation automatically tunes the particles at the end of the pilot run.
• Run time strongly depends on the number of particles, making this a crucial step.
• The default perturbed sampler does not need to compute the full-data likelihood (the product of the individual likelihoods). Consequently, it tunes the particles on an individual basis following the pilot run (all individuals have the same number of particles in the pilot). The non-perturbed sampler uses the same number of particles for each individual, and the particles are tuned to control the variance of the full-data likelihood, often resulting in a large number of particles.
5. Run a further main inference by initializing the Gibbs sampler at the posterior mean obtained from the pilot run, and using the number of particles as obtained from step 4. Further, to propose c, κ and ξ use the tuned covariance matrices obtained in the pilot run, as returned by an adaptive-MCMC scheme (Fig 4).
• Our implementation automatically loads the tuned covariance matrices, posterior mean values and number of particles.

6. Judge the quality of the inference via diagnostic plots, such as trace-plots and posterior visual checks.
• If trace-plots show that the sampler got stuck, increase the number of particles and re-run step 5. The tuning criteria, albeit typically working well, can fail.
Remarks for steps 2 and 5: The tau-leaping and Langevin simulators require a step-length to be set. This can be done by simulating the model at the starting values using different step-lengths, and then choosing the largest step-length for which the results do not appear to change. Use correlated particles and guided proposals to reduce the number of particles required. Our implementation employs multi-threading to update the individual parameters in parallel; the user can set the number of threads via the Julia threads variable.
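The step-length selection described in the remark can be sketched in Python; `pick_step_length` and the toy "simulator" are illustrative assumptions, not part of our implementation:

```python
import numpy as np

def pick_step_length(simulate, candidates, tol=0.05):
    """Simulate with successively smaller step-lengths and return the largest
    candidate whose summary statistic agrees (within a relative tolerance)
    with the finest step-length, taken as the reference."""
    ref = simulate(min(candidates))
    for dt in sorted(candidates, reverse=True):
        diff = np.max(np.abs(simulate(dt) - ref)) / (np.max(np.abs(ref)) + 1e-12)
        if diff < tol:
            return dt
    return min(candidates)

# toy deterministic "simulator" whose output bias shrinks with the step-length
dt_star = pick_step_length(lambda dt: np.array([1.0 + dt]), [0.4, 0.2, 0.1, 0.01])
```

In practice `simulate` would return a summary of repeated stochastic simulations (e.g. the mean trajectory over many runs), so that Monte Carlo noise does not mask the discretisation bias.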

C A tutorial on constructing a single-cell dynamic model
Single-cell dynamic models can elucidate cellular reaction dynamics and sources of cell-to-cell variability. To help both modellers and experimentalists utilise such models, we provide a brief tutorial on model construction. For the gene regulatory network in Fig 2A, we describe how the reaction dynamics are formulated and how intrinsic and extrinsic noise can be modelled; a complementary notebook (S1) shows how PEPSDI infers the unknown model quantities.
As a starting point, we assume that single-cell protein time-lapse data for the gene regulatory network in Fig 2B have been collected. From here, the first modelling step is to formulate a literature-informed hypothesis of the reaction dynamics and subsequently draw a reaction map (Fig 2A). This reaction map (model) should be sufficiently complex to capture the core features; for example, when modelling a gene regulatory network this can correspond to excluding the finer details of transcription and translation (as in Fig 2A).
Given a reaction map, the next step is to formulate the propensity for each cellular process (reaction) in the reaction map (often via the law of mass action). Given the propensities, the model can be simulated via a stochastic simulator to capture intrinsic noise. For small molecule numbers (e.g. 2-10 molecules) exact simulators should be used [6,7], while faster approximate simulators are often accurate for larger molecule numbers [5,8]. If the number of molecules is unknown, an approximate simulator can be used initially, and if some model components turn out to have few molecules one can switch to an exact simulator.
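As an illustration of the propensity-formulation step, here are hypothetical mass-action propensities for a two-species mRNA/Protein network with rate constants $c_1, \ldots, c_4$ (the exact reactions of Fig 2A may differ; this Python snippet is not PEPSDI's Julia model syntax):

```python
# hypothetical mass-action propensities for a two-species network:
#   R1: G -> G + mRNA          (transcription, rate constant c1)
#   R2: mRNA -> 0              (mRNA degradation, rate constant c2)
#   R3: mRNA -> mRNA + Protein (translation, rate constant c3)
#   R4: Protein -> 0           (protein degradation, rate constant c4)
def propensities(x, c, t):
    mrna, protein = x
    return [c[0], c[1] * mrna, c[2] * mrna, c[3] * protein]

# example evaluation at state (mRNA, Protein) = (2, 10)
h = propensities([2, 10], [1.0, 0.5, 2.0, 0.1], 0.0)
```

Each entry $h_j$ is the hazard of reaction $j$ at the current state, exactly the quantity the stochastic simulators of section A.1 consume.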
Extrinsic noise can be modelled by letting the initial values (mRNA, Protein) and/or the rate constants vary between cells. To motivate the latter, consider the translation propensity. Assuming that ribosome abundance varies between cells due to extrinsic noise, $c_3$ should also vary. Similarly, $(c_1, c_2, c_4)$ might vary between cells, but not arbitrarily. Typically, the rate parameters can be constrained by assuming that they follow probability distributions with an appropriate support; a log-normal distribution [36,37] is the most common assumption. In the example here, this implies that the rate constants $c^{(i)}$ follow a log-normal distribution. In summary, to construct a single-cell model: (i) draw a reaction network; (ii) formulate the propensity for each reaction; (iii) capture intrinsic noise via exact stochastic simulators [6,7] for small molecule numbers, or approximate simulators for larger molecule numbers [8,5]; and (iv) capture extrinsic noise by letting rate constants and/or initial values vary between cells according to a probability distribution. However, model construction is only the first step. For example, we typically also want to investigate whether a model can accurately describe the observed data, characterise its behaviour, and use it to predict new behaviour. To this end, we need to know the rate constants $c^{(i)}$ and how they vary between cells. Thus, to fully utilise a derived model, we can use PEPSDI to infer, in a Bayesian fashion, the rate parameters ($c^{(i)}$ for individual $i$), the strength of the measurement error ($\xi$), and the population parameters ($\mu$, $\Omega$). Furthermore, although the rate parameters are unknown, we sometimes have a prior idea of, for example, their magnitude. Since PEPSDI follows a Bayesian framework, the user can incorporate such prior information into the inference via parameter priors.

S1: https://github.com/cvijoviclab/PEPSDI/blob/main/Code/Examples/Multiple_individual/Tutorial_PEPSDI.ipynb
In a complementary notebook, we provide code for the model derived here and show how to use PEPSDI to perform the inference.

D Simulation examples
Here, the models used in the simulation examples (the circadian clock regulated gene network, the Ornstein-Uhlenbeck model and the Schlögl model) are presented. For each model, the model equations, simulation options, and inference details are given.

D.1 Circadian clock regulated gene network model
The circadian clock regulated gene network model (Fig 2) consists of four reactions, with ω = 2π/24 to emulate a circadian-clock cycle of 24 hours. The associated stoichiometry matrix and propensity (hazard) vector are given below.

D.1.1 Simulation of data-set
Using the Extrande method [7], 40 individuals were simulated from the circadian clock model. For each individual, we simulated data at 64 equidistant time-points from 0.5 to 48 hours. Regarding the observations Y t , we assumed that only the protein was observed, and that the observations were corrupted by additive normal noise: Y t = Protein t + ε t where ε t ∼ N (0, ξ 2 ) = N (0, 2 2 ). The individual parameters c = (c 1 , . . . , c 4 ) were generated by applying the exponential function to random draws from a multivariate normal N (µ, Ω), where Ω = D(τ )ΦD(τ ) and D(·) denotes a diagonal matrix with τ on the diagonal. Here τ is the scale-vector and Φ is a non-diagonal correlation matrix. As the exponential of a normal random variable is a log-normal random variable, c becomes multivariate log-normal. The actual population parameters used are given below; note that Φ is a symmetric positive-definite matrix.
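The parameter-generation step above can be sketched as follows; the numerical values of µ, τ and Φ are illustrative stand-ins, not the population parameters used for the data set.

```python
# Sketch: individual parameters c^(i) = exp(z), z ~ N(mu, Omega), with the
# covariance parameterised by the separation strategy Omega = D(tau) Phi D(tau).
import numpy as np

rng = np.random.default_rng(1)
mu = np.log([2.0, 1.0, 0.1, 0.05])            # illustrative population log-means
tau = np.array([0.3, 0.2, 0.25, 0.2])         # scale-vector (log-scale spreads)
Phi = np.array([[1.0, 0.4, 0.0, 0.0],         # symmetric positive-definite
                [0.4, 1.0, 0.0, 0.0],         # correlation matrix
                [0.0, 0.0, 1.0, 0.3],
                [0.0, 0.0, 0.3, 1.0]])
Omega = np.diag(tau) @ Phi @ np.diag(tau)     # separation parameterisation
c = np.exp(rng.multivariate_normal(mu, Omega, size=40))   # 40 cells, 4 rates
```

Separating Ω into scales τ and correlations Φ is what later allows independent priors on τ (half-Cauchy) and Φ (LKJ), as used in Section E.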

D.2 Inference details
The parameters inferred were the individual parameters c (i) , the population mean vector µ, the scale-vector τ of the covariance matrix, the correlation matrix Φ, and the measurement-error parameter ξ. Note that the covariance matrix was parameterised using the separation strategy as D(τ )ΦD(τ ); this approach typically performs better than using conjugate priors [28]. Furthermore, the individual parameters were inferred on the log-scale, as they are positive.
We ran the inference using the default options in PEPSDI, where the cell-constant parameters are slightly perturbed. As starting values we used a randomly sampled vector from the prior. The inference was run according to the guidelines in B. Briefly, for the pilot run 5,000 iterations were performed using 2,000 particles. The high number of particles was employed to control the variance of the particle-filter estimate of the likelihood. At the end location of the pilot run, we tuned the number of particles per individual according to the criteria for non-correlated particle filters; the minimum, median and maximum number of particles required for an individual was 90, 245 and 1,150, respectively. Note that we could not correlate the particles, since when performing inference with the Extrande simulator we do not know a priori the number of random numbers required by the particle filter. Finally, starting from where the pilot run ended, 50,000 further iterations of the perturbed Gibbs sampler were run to produce the final inference.

D.3 The Schlögl model
The Schlögl model [39] consists of four reactions; note the constants c̃ 1 and c̃ 3 in R 1 and R 3 . It is often (as here) assumed that A and B are available in such large amounts that they are practically constant. Consequently, the associated stoichiometry matrix and propensity (hazard) vector are as given below, where, since A and B are assumed constant, c 1 = c̃ 1 A and c 3 = c̃ 3 B. For the Langevin approximation, the drift vector and diffusion matrix are given below. The Schlögl model is a toy model; however, it exhibits stochastic bi-stability and is hence a suitable benchmark for investigating whether our framework can infer parameters for challenging stochastic biological systems.
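The Langevin approximation mentioned above can be sketched with an Euler-Maruyama discretisation of the chemical Langevin equation. The hazard forms below are standard mass-action assumptions for the Schlögl system (with A, B absorbed into c 1 , c 3 ); the parameter values follow Section D.5 and the initial state and seed are illustrative.

```python
# Euler-Maruyama sketch of the chemical Langevin (Langevin) approximation for
# the Schlögl model; hazard forms are assumed standard mass-action.
import math
import random

def schlogl_cle(x0=100.0, dt=5e-2, t_end=50.0, seed=1):
    c1, c2, c3, c4 = 1.8e-1, 2.5e-4, 2.2e3, 37.5     # values from Section D.5
    rng = random.Random(seed)
    x, xs = x0, [x0]
    for _ in range(round(t_end / dt)):
        h = [c1 * x * (x - 1), c2 * x * (x - 1) * (x - 2), c3, c4 * x]
        drift = h[0] - h[1] + h[2] - h[3]            # net stoichiometry +1,-1,+1,-1
        diff = math.sqrt(max(sum(h), 0.0) * dt)      # CLE diffusion term
        x = max(x + drift * dt + diff * rng.gauss(0.0, 1.0), 0.0)
        xs.append(x)
    return xs

xs = schlogl_cle()
```

With these parameters the deterministic drift has stable roots near 100 and 400 molecules, so trajectories can hop between the two modes, which is the bi-stability that makes this model a hard benchmark.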

D.3.1 Simulation details
Using Gillespie's direct method, 40 individuals were simulated from the Schlögl model. For each individual, we simulated data at 99 equidistant time-points from 1 to 50 minutes. Regarding observations, we assumed that X t was directly observed and noise-corrupted by a Gaussian additive error, Y t = X t + t where t ∼ N (0, 2 2 ). The parameters c = (c 1 , c 2 , c 3 , c 4 ) were divided into parameters that vary between individuals, here c 1 , and those constant between individuals, here κ = (c 2 , c 3 ). When simulating data, c 4 was set at c 4 = 37.
We ran the inference using the modified Gibbs sampler (κ, ξ cell-varying). As the mean of the priors yielded a non-finite likelihood, we used a random sample from the prior as starting value. The inference was run according to the guidelines in B. Briefly, for the pilot run 10,000 iterations were performed using 1,000 particles. The high number of particles was employed to control the variability of the likelihood estimate. To further control this variability we used a guided particle filter (modified diffusion bridge) with step length ∆t = 5 × 10⁻², and we correlated the particles with a correlation level ρ = 0.999. At the end location of the pilot run, we tuned the number of particles according to the criteria for correlated particle filters; the minimum, median and maximum number of particles required for an individual was 10, 10 and 20, respectively. Finally, starting from where the pilot run ended, 500,000 further iterations of the sampler were produced.

D.4 The Ornstein-Uhlenbeck model

The Ornstein-Uhlenbeck (OU) model is defined by the SDE dX (i) t = c (i) 1 (c (i) 2 − X (i) t )dt + c (i) 3 dW (i) t (Eq. S29), where the dW (i) t are increments of standard Brownian motions. The OU model is a suitable benchmark for the state-space mixed-effects model (SSMEM) framework; for the case where SSMEMs have latent dynamics {X (i) t } driven by SDEs, these models have been named SDEMEMs. A list of resources on inference for SDEMEMs is available at https://umbertopicchini.github.io/sdemem/. The solution process of Eq. S29 has a known Gaussian transition density. Since Eq. S29 is linear in the latent states, if we assume that the observed data follow a linear error model with additive Gaussian measurement noise, Y t = X t + ε t where ε t ∼ N (0, σ 2 ), then the observed-data likelihood (Eq. S3) can be evaluated exactly using a Kalman filter [42]. Consequently, exact Bayesian inference for the OU model can be performed by plugging the Kalman-provided observed likelihood into our Gibbs sampler. As the Kalman approach does not rely on stochastic approximations of the likelihood, it is a gold-standard benchmark. Here, we use the observed data and inference results from [1], where the Kalman filter was embedded in the Gibbs sampler; the inference in [1] was produced using the stochastic differential mixed-effects framework on which our framework is based.
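The exact likelihood evaluation described above can be sketched with a scalar Kalman filter built on the exact Gaussian transition of the OU process. The drift and diffusion parameters follow Section D.5; the noise level σ, the observation interval and the simulated data are illustrative assumptions.

```python
# Scalar Kalman filter sketch for the OU state-space model: exact transition
# mean/variance, then standard predict/update recursions for the log-likelihood.
import math
import random

def ou_kalman_loglik(y, dt, c1, c2, c3, sigma, m0, P0):
    a = math.exp(-c1 * dt)                        # exact OU transition factor
    q = c3 ** 2 / (2.0 * c1) * (1.0 - a * a)      # exact transition variance
    m, P, ll = m0, P0, 0.0
    for obs in y:
        m = c2 + (m - c2) * a                     # predict
        P = a * a * P + q
        S = P + sigma ** 2                        # innovation variance
        ll += -0.5 * (math.log(2.0 * math.pi * S) + (obs - m) ** 2 / S)
        K = P / S                                 # Kalman gain; update
        m += K * (obs - m)
        P *= 1.0 - K
    return ll

rng = random.Random(2)
c1, c2, c3, sigma, dt = 0.496, 9.974, 0.407, 0.3, 0.5
x, y = c2, []
for _ in range(100):                              # simulate data from the model
    a = math.exp(-c1 * dt)
    x = c2 + (x - c2) * a + math.sqrt(c3**2 / (2*c1) * (1 - a*a)) * rng.gauss(0, 1)
    y.append(x + sigma * rng.gauss(0, 1))
ll = ou_kalman_loglik(y, dt, c1, c2, c3, sigma, m0=c2, P0=1.0)
```

Plugging `ou_kalman_loglik` in place of a particle-filter likelihood estimate is what turns the Gibbs sampler into an exact, gold-standard benchmark for this model.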

D.4.2 Validation of the inference framework
To test the correctness of our implementation, we ran PEPSDI with the non-perturbed option (κ, ξ constant between cells) and the default perturbed option (κ, ξ vary weakly between cells) for the OU model, and compared with the exact inference obtained by using the Kalman filter to compute the likelihood function and embedding the latter in a Gibbs sampler. PEPSDI performs well (S1 Fig): there is some bias for τ 2 , but all other parameters are recovered accurately.
Overall, PEPSDI with the default perturbed option correctly infers the population parameters (µ, τ ). The only drawback is a slightly wider credibility interval for the one parameter that is constant between individuals (σ); however, the differences in this case are minimal, to the second decimal digit.

D.5 Performance evaluation of adaptive MCMC proposals

We considered three adaptive proposal samplers: AM, GenAM and RAM [27]. To compare the performance of these three samplers, we benchmarked them on the Schlögl model (Section D.3 and Fig 3) and the Ornstein-Uhlenbeck (OU) model (Section D.4 and S6 Fig). For computational reasons, we considered single time series inference for these two models, that is, inference for a single individual.

For the benchmark we simulated three data sets for each model. For each dataset we ran five pilot runs, each of 5,000 iterations using 3,000 particles, with initial parameter values randomly sampled from the prior; at the end of each pilot run we tuned the number of particles according to the tuning criteria in Section B. For both models, we correlated the particles with a level ρ = 0.99. Starting from the last drawn parameter value in each pilot run, we launched 10 further inference runs of 60,000 iterations each. After discarding the first 20% of samples as burn-in, the performance of the different samplers was compared using the multiESS criterion [40]. For the OU model we also computed the first-order Wasserstein distance between each resulting posterior and the exact posterior (up to Monte Carlo error due to using a finite number of samples), using the last 15,000 posterior samples via the transport package in R [41]. For the OU model it is possible to obtain the "exact" posterior by using the Kalman filter to compute the exact likelihood [42].
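For one-dimensional samples of equal size, the first-order Wasserstein distance used above reduces to the mean absolute difference between the sorted samples. The paper used the transport package in R; the pure-Python version below is an illustrative equivalent.

```python
# First-order (1-D) Wasserstein distance between two equally sized empirical
# samples: mean absolute difference of the order statistics.
def wasserstein_1d(u, v):
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Shifting a sample by a constant c gives distance exactly c:
d = wasserstein_1d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # -> 1.0
```

Applied to posterior draws, a small distance to the Kalman-based "exact" posterior indicates that the particle-based sampler targets the right distribution.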
For the Schlögl model we simulated the data using the parameters (c 1 , c 2 , c 3 ) = (1.8 × 10 −1 , 2.5 × 10 −4 , 2.2 × 10 3 ), and we fixed c 4 = 37.5. As in the main text, we simulated intrinsic noise using the Langevin approximation and employed a guided particle filter with step length ∆t = 0.05. For the OU model we simulated the data using the parameters (c 1 , c 2 , c 3 ) = (0.496, 9.974, 0.407); we used a bootstrap particle filter, and when solving the SDE we used a step length ∆t = 0.01. To ensure a fair comparison we used the same tuning parameters for each sampler. Briefly, all three samplers propose new parameters via a Gaussian random-walk proposal, where the samplers update both the step length β and the covariance matrix Σ (GenAM), or only Σ (AM, RAM), during the inference run. The same starting covariance matrix Σ (diagonal with 0.1 on the diagonal for both models) and step length β = 1 were used. Except for the first 100 iterations, we updated the samplers at each iteration with a diminishing adaptation parameter γ i ; for all three samplers, at iteration i we used γ i = γ 0 /i α , with α = 0.7 and γ 0 = 1 as in [1]. The GenAM and RAM samplers further have a target acceptance rate ᾱ, which we set at ᾱ = 0.07 similarly to [1].
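The diminishing-adaptation covariance update shared by the AM-type samplers can be sketched as follows. The schedule γ i = γ 0 /i^α with α = 0.7, γ 0 = 1 and the starting Σ follow the text; the target draws standing in for accepted chain states are illustrative.

```python
# Sketch of AM-style covariance adaptation with a diminishing schedule:
# Sigma tracks the empirical covariance of the chain with weight gamma_i.
import numpy as np

rng = np.random.default_rng(3)
alpha, gamma0 = 0.7, 1.0
mu = np.zeros(2)                       # running mean of the chain
Sigma = np.eye(2) * 0.1                # starting proposal covariance (as in text)
for i in range(101, 1000):             # adaptation starts after 100 iterations
    theta = rng.standard_normal(2)     # stand-in for the i-th accepted draw
    gamma_i = gamma0 / i ** alpha      # diminishing adaptation weight
    diff = theta - mu
    mu = mu + gamma_i * diff                                   # update mean
    Sigma = Sigma + gamma_i * (np.outer(diff, diff) - Sigma)   # update covariance
```

Because each update is a convex combination of the old Σ and a rank-one outer product, the adapted Σ stays symmetric positive-definite, and the decaying γ i ensures the adaptation vanishes so the chain's stationary distribution is preserved.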

E Model of Mig1 nuclear dynamics
The Mig1 model aims to describe the dynamics of the ratio between nuclear and cytosolic Mig1 (Mig1n/Mig1c) upon fructose addition to carbon-starved cells. To calibrate the model we used two single-cell time lapse datasets (Fig 5) where the external fructose concentration was elevated from 0% to 0.05% and 2%, respectively, at t = 1.5 min after the start of image acquisition. This results in a rapid increase in the ratio, followed by a decrease, corresponding to Mig1 nuclear entry followed by relocation to the cytosol. The extent of the cytosolic fraction of Mig1 depends on the fructose concentration.
To deduce the mechanisms governing the Mig1 dynamics we considered two pathway structures (Fig 6A and 6B). A biological explanation behind each model is provided in the main text. Since model-structure 1 (Fig 6A) is a sub-structure of model-structure 2 (Fig 6B), we here focus on model-structure 2.
Overall, model-structure 2 consists of eight reactions. Here, M refers to the metabolic component. Notably, Reg1 and M are not consumed in (R 3 , R 5 ). This is because Reg1 is a phosphatase, which likely acts on Mig1 via dephosphorylation [43]; M , in turn, thus likely has kinase (phosphorylation) activity.
Overall, there are eight reactions with associated rate-constants (c 1 , . . . , c 8 ). Below, each reaction is outlined in detail and the choice of prior for each parameter is motivated. The reaction characteristics are further outlined in Tab. S1. Following this, the initial values used for each state are described.
To infer the unknown parameters, we use the fact that the observed Mig1 data represent a ratio of nuclear to cytosolic intensity, and we match our model to the data via an additive Gaussian error model on this ratio. We assumed this error model, as it has been shown to work well in other modelling studies working with single-cell fluorescence data [44,45,46].

E.1 Motivation of model reactions
R 1 describes the activation of Reg1 via an external carbon source. Although we know that Reg1 is regulated by carbon source availability [47], intermediate steps are likely encapsulated by c 1 . These include the activity of hexose transporters and the catalytic ability of hexokinases. As these factors should vary between cells, we allowed c 1 to vary between cells. Moreover, c 1 should also depend on the external fructose level. The simplest assumption is a proportional dependence, for a typical cell c cell 1 = c 1 · [Fructose]; however, such a proportional scaling fails to properly describe the observed data. Thus, we allowed the log-mean value of c 1 , denoted µ 1 , to differ depending on fructose availability. For c 1 and c 8 we use informative priors. Typically, the amount of Reg1 per cell is of the magnitude of several thousand molecules [48]. Moreover, when adding glucose to starved cells it takes around 3.5 minutes for the Mig1 ratio to peak and plateau [49,45], and we expect a similar response time when adding fructose. As we expect M to be a delayed signal, due to the observed long-term nuclear export, it follows that Reg1 should, in 2% fructose, reach a peak in around 3.5 minutes. If we now assume that the peak amount of Reg1 per cell is around 2000 (the amount of Reg1 per cell is of O(10 3 )), it follows that c 1 ≈ 1500 and c 8 ≈ 0.75 (see below). Here, c 1 applies for [Fru] = 2.0%. Since we let c 1 vary between cells, we employ the priors log µ (F ru=2.0) 1 ∼ N (7.3, 2 2 ) and log µ (F ru=0.05) 1 ∼ N (6.9, 2 2 ), where N refers to the normal distribution. The latter prior reflects that we expect a weaker response, that is, weaker activation of Reg1, for lower fructose concentrations [49,50]. R 6 , R 7 describe the carbon source independent shuttling of Mig1 [49]. As the cells at time zero, when no carbon source is available in the environment, are at a steady state [49], we have that c 7 = c 6 M ig1n(t 0 )/M ig1c(t 0 ).
With the same argument as for c 8 , we assume that c 6 is cell-constant (c 7 is then almost constant, as M ig1n(t 0 ) ≪ M ig1c(t 0 )). Moreover, since the Mig1 shuttling is, on our timescale, fast, with a FRAP recovery time of around 5 seconds [49], we fixed c 6 = 40.0 (see below). The "true" value is likely not exactly 40; however, this value encapsulates, on our timescale, fast nuclear shuttling. R 2 , R 4 describe the activation and degradation of M . As with c 1 , we expect multiple processes to be encapsulated by c 2 ; consequently, we let c 2 vary between cells. Moreover, similarly to c 1 , we assume that the log-mean value of c 2 , denoted µ 2 , differs depending on fructose availability. As with c 8 , we assume that c 4 is constant between cells. Our prior knowledge regarding (c 2 , c 4 ) is limited. However, we know that the growth rate of M should be relatively slow, as the feedback is slow due to the delayed signal (see above for Reg1). Consequently, we put the prior log µ 2 ∼ N (2, 3 2 ) for both [Fru] = 2.0 and [Fru] = 0.05. For the scale-parameter of c 2 we employed half-Cauchy priors τ ∼ C(0, 2.5), τ > 0. Further, we put log c 4 ∼ N (−6, 3 2 ). The latter ensures that M does not plateau and that the amount of M stays relatively small (around 100), so that we are able to investigate whether intrinsic noise might play a large role in the cellular dynamics. To compensate for potentially wrong assumptions regarding M , we employ relatively weak priors.
R 5 describes the M -dependent shuttling of Mig1 into the cytosol. As we assume M is around 100, we know that c 5 should lie in a range of about 0.05 to 20 (otherwise all Mig1 would leave the nucleus); thus we used a log c 5 ∼ N (0, 3 2 ) prior.
Overall, we have that c 1 , c 2 are cell-varying. The remaining rate-constants c 3:8 = (c 3 , . . . , c 8 ) likely also vary between cells. However, since we also have variability in the model components (e.g. varying Reg1 amounts between cells), and since c 3:8 describe fewer cellular processes than (c 1 , c 2 ) and should hence vary less between cells, we here assumed negligible extrinsic variability. The initial number of active Reg1 molecules is not zero; however, since Reg1 does not interact with Glc7 in low carbon-source conditions, we expect the amount of active Reg1 molecules to be low [51].

E.2 Initial values
The initial amounts of Mig1c and Mig1n molecules are not zero. Based on available numbers [50], we know that M ig1c(t 0 ) is around 1500 and M ig1n(t 0 ) around 100 molecules per cell when glucose or fructose is not available. Moreover, the Mig1 amount should vary between cells [49,45]. Consequently, for the initial values we employed the priors log M ig1c(t 0 ) ∼ N (7.3, 0.2) and log M ig1n(t 0 ) ∼ N (4.6, 0.2). Furthermore, we know from previous experiments that the variance of log M ig1c(t 0 ) is around 0.2 [52]. Hence, on the scale components of the Mig1 initial values we employed G(0.4, 0.4) priors.
Overall, (c 1 , c 2 , M ig1n(t 0 ), M ig1c(t 0 )) vary between cells. As we suspected relatively strong correlations (e.g. between Mig1 in the cytosol and in the nucleus), we employed an LKJ(0.1) prior on the correlation matrix Φ.

E.3 Derivation of priors and quantity values
To fix the value of c 6 to 40, we simulated a fluorescence recovery after photobleaching (FRAP) experiment. More specifically, to emulate the setting in [49], we consider a scenario where Mig1 is GFP-tagged, no external carbon source is available, and the nucleus is bleached. Specifically, we consider the nuclear shuttling reactions with initial values M ig1c(t 0 ) = 1500 and M ig1n(t 0 ) = 100 and c 7 = c 6 M ig1n(t 0 )/M ig1c(t 0 ); the latter captures that the cells are in a steady state [49]. To model the FRAP experiment, each molecule was given a GFP tag that equals 1 if the molecule has an active fluorophore and 0 if the fluorophore is non-active (i.e. bleached). At time zero the tag of all molecules in the nucleus was set to zero (to mimic the bleaching), and the system in Eq. S33 was simulated with different values of c 6 , namely c 6 = 0.4, 1.4, 2.4, . . . , 200. To account for the stochastic dynamics, each value was simulated 50 times. By inspecting the ratio M ig1n(GF P = 1)/M ig1n tot , we noticed that the simulated FRAP recovery times (the time until M ig1n(GF P = 1)/M ig1n tot ≈ 1) decrease with increasing c 6 . The observed recovery time is around 5 seconds [49] (when the FRAP curve flattens out), and at 1 second the recovery is around 0.8. When simulating the reactions in Eq. S33 with different c 6 values, this corresponds to a value of around c 6 ≈ 40.
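The simulated FRAP experiment above can be sketched deterministically: keep only the shuttling reactions Mig1n → Mig1c (rate c 6 ) and Mig1c → Mig1n (rate c 7 ) and track the tagged (unbleached) nuclear fraction after bleaching. The text used 50 stochastic repeats per c 6 value; the single forward-Euler ODE below is an illustrative simplification, and the step sizes are assumptions.

```python
# Deterministic sketch of the FRAP simulation: tagged nuclear fraction after
# bleaching the nucleus at t = 0, under first-order nuclear shuttling.
def frap_recovery(c6, mig1c0=1500.0, mig1n0=100.0, t_end=0.5, dt=1e-4):
    c7 = c6 * mig1n0 / mig1c0            # steady-state constraint on c7
    n_tag, c_tag = 0.0, mig1c0           # bleach: no tagged nuclear Mig1 left
    n_tot = mig1n0                       # total nuclear pool stays at steady state
    for _ in range(round(t_end / dt)):   # time in minutes
        net = (c7 * c_tag - c6 * n_tag) * dt   # tagged import minus export
        n_tag += net
        c_tag -= net
    return n_tag / n_tot                 # recovered fluorescent fraction

f = frap_recovery(c6=40.0)
```

In this simplified model the tagged nuclear fraction recovers with rate c 6 + c 7 towards 1500/1600 ≈ 0.94 (the tagged share of the total Mig1 pool), illustrating why larger c 6 shortens the recovery time.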
To obtain priors for c 1 and c 8 , we considered the reaction network in Eq. S34. Our premise here is that Reg1 should start to plateau around t = 3.5 minutes, and that the steady-state value should be around 2000 in high fructose conditions ([Fru] = 2%). Given this, the steady state implies c 8 = c 1 /2000. By simulating the reactions in Eq. S34 for different values of c 1 , we found that a value of c 1 = 1500 fulfils these criteria.
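The derivation above can be checked analytically: for constant activation at rate c 1 and first-order deactivation at rate c 8 , the solution is Reg1(t) = (c 1 /c 8 )(1 − exp(−c 8 t)). The values below use c 1 = 1500 and c 8 = c 1 /2000 = 0.75 from the text.

```python
# Worked check of the (c1, c8) prior derivation for the Reg1 activation
# network: steady state c1/c8 and fraction of steady state reached at 3.5 min.
import math

c1, c8 = 1500.0, 0.75
steady_state = c1 / c8                        # = 2000 molecules
frac_at_peak = 1.0 - math.exp(-c8 * 3.5)      # ~0.93 of steady state at 3.5 min
```

At t = 3.5 min the system has reached roughly 93% of its steady state, consistent with the premise that Reg1 should start to plateau around that time.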

E.4 Inference details
The parameters inferred were the individual parameters c (i) together with the population and measurement-error parameters. Since we did not expect the molecule numbers of the states (Reg1, M ig1c, M ig1n, M ) to be low (e.g. 2-10 molecules), we simulated the model dynamics using the tau-leaping simulator with a fixed step length. Due to stiffness, we had to employ a small step length of ∆t = 5 × 10 −3 . To achieve efficient inference we correlated the particles with correlation ρ = 0.999. Since we strongly correlated the particles, we could use relatively few particles to estimate the likelihood (see below).
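The fixed-step tau-leaping simulator referred to above can be sketched as follows: over each leap of length τ, the number of firings of reaction j is drawn as Poisson(h j (x)·τ). The toy two-state birth-death network and rates below are illustrative, not the Mig1 model.

```python
# Sketch of fixed-step tau-leaping (an approximate stochastic simulator).
import numpy as np

def tau_leap(x, c, t_end, tau=5e-3, rng=np.random.default_rng(4)):
    """x = [A, B]; reactions: 0 -> A, A -> B, B -> 0 (mass-action)."""
    x = np.array(x, dtype=int)
    stoich = np.array([[1, 0], [-1, 1], [0, -1]])   # state change per reaction
    for _ in range(round(t_end / tau)):
        h = np.array([c[0], c[1] * x[0], c[2] * x[1]])   # propensities
        k = rng.poisson(h * tau)                # reaction firings in this leap
        x = np.maximum(x + stoich.T @ k, 0)     # guard against negative counts
    return x

x_final = tau_leap([0, 0], c=[50.0, 1.0, 0.5], t_end=10.0)
```

The small fixed step length mirrors the ∆t = 5 × 10⁻³ used for the (stiff) Mig1 model; stiffer systems need smaller leaps for the Poisson approximation to remain accurate.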
We ran the inference using the default options in PEPSDI, where the cell-constant parameters are slightly perturbed. We used the means of the priors as starting values. The inference was run according to the guidelines in S3. Briefly, for each model (Models 1A/B and 2A/B, Fig 6) we ran multiple pilot runs using 200-300 particles and 1,500-3,000 samples. At the end location of each pilot run, we tuned the number of particles per individual according to the criteria for non-correlated particle filters. Lastly, starting from where the pilot run ended, 40,000 further iterations of the perturbed Gibbs sampler (the default option in PEPSDI) were run to produce the final inference.