Inference for Stochastic Chemical Kinetics Using Moment Equations and System Size Expansion

Quantitative mechanistic models are valuable tools for disentangling biochemical pathways and for achieving a comprehensive understanding of biological systems. However, to be quantitative the parameters of these models have to be estimated from experimental data. In the presence of significant stochastic fluctuations this is a challenging task as stochastic simulations are usually too time-consuming and a macroscopic description using reaction rate equations (RREs) is no longer accurate. In this manuscript, we therefore consider moment-closure approximation (MA) and the system size expansion (SSE), which approximate the statistical moments of stochastic processes and tend to be more precise than macroscopic descriptions. We introduce gradient-based parameter optimization methods and uncertainty analysis methods for MA and SSE. Efficiency and reliability of the methods are assessed using simulation examples as well as by an application to data for Epo-induced JAK/STAT signaling. The application revealed that even if merely population-average data are available, MA and SSE improve parameter identifiability in comparison to RRE. Furthermore, the simulation examples revealed that the resulting estimates are more reliable for an intermediate volume regime. In this regime the estimation error is reduced and we propose methods to determine the regime boundaries. These results illustrate that inference using MA and SSE is feasible and possesses a high sensitivity.


Numerical simulation
MA and SSE yield ordinary differential equation models. For these models the forward sensitivity equations are derived. These equations describe the derivative of the time-dependent state of the ODE with respect to the parameter. To solve the ODE system for states and forward sensitivities simultaneously the SUNDIALS library CVODES [1,2] is used. CVODES allows for highly efficient numerical integration and is usually significantly faster than the corresponding MATLAB implementation. For the simulation we used the BDF integrator with a Newton dense non-linear solver. For relative and absolute tolerances we used 10 −8 for the state equations and estimated tolerances for the sensitivity equations via the CVodeSensEEtolerances command. To prevent numerical problems for systems that exhibit weakly damped oscillations, we activated stability limit detection via the CVodeSetStabLimDet command. We compared the resulting simulation time to those of the SSA method used for data generation. The results are shown in Figure 1. Even for the considered small-scale examples, the simulation of IOS and 3MA is on average faster than computing an emsemble of 100 SSA runs. Typically thousands of SSA runs are necessary to obtain reliable statistics. All simulations were carried out at the true parameter which were also used for data generation.

Model Definitions
In the following we specify the reactions in the trimerization model, the model of enzymatic protein degradation and the JAK/STAT signaling pathway. We do not deem a definition of the individual equations reasonable, as the systems of equations become quite large for higher order expansions. This renders a manual implementation of the equations error-prone and an automatic generation of differential equations via e.g. the ACME toolbox much more tractable. Where applicable the models were reduced by their conservation laws.
Trimerization Model The trimerization model was formulated using mass-action kinetics leading to the reactions provided in Table 1 with the parameters provided in Table 2. The three species were initialized with zero molecules at time t = 0.

Enzymatic Degradation Model
The reactions of the reduced system are given in Table 2 while parameters are provided in Table 4. The model was then reduced by the enzyme-protein complex exploiting the conservation law where we assumed that all species have zero molecules initially except the enzyme species whose initial concentration is [Enzyme] 0 . JAK/STAT model The JAK/STAT model consists of nine species of nuclear and cytosol compartments. The respective compartment volumes of BaF3 cells are Ω nuc = 450 µm 3 and Ω cyt = 1400 µm 3 [3]. We assume that initially only ST AT is present in the cytosol at a concentration of [STAT] 0 . The reactions in this pathway are given in Table 5. Note that the overall concentration of ST AT is constant and hence the model can be reduced by the concentration of the nuclear complex which is given by Note that pEpoR(t) in Table 5 denotes a time-dependent function describing phosphorylation of STAT which is parametrized by a cubic spline between the time points 0, 5, 10, 20 and 60. Its five parameters are estimated along with scaling parameters and the biologically relevant parameters p 1 , p 2 ,p 3 , p 4 and [STAT] 0 . Since the SSE is commonly formulated for a single compartment, we performed the multi-compartment analysis by rescaling the bimolecular reaction rate constant p 2 by the cytosolic volume and performing the  Table 2. Parameter values of the trimerization model from which reference dataset was obtained.
Parameters value unit lower bound upper bound Table 3. Reactions of the enzymatic degradation model.

Parameter Estimation for the Simulation Examples
We carried out parameter estimation for all generated datasets. In the following we will outline the results found for one of the datasets for the trimerization model at Ω = 100µm 3 . These results are representative for both models and other sample size and volume scenarios. Figure 4 (a) shows the simulation of the different description at the respective estimated parameters and the data which was used to estimate parameters for the trimerization model. We find that there is a good agreement between simulation and data. To demonstrate the efficiency of optimization using sensitivity (SE) based gradients over finite difference (FD) based gradients we compared convergence rate and computation time for both approaches (c.f. Figure 4 (b-c)). Both approaches yield comparable convergence rates to the lowest found objective function values across all descriptions. In contrast, we find pronounced differences in computation time. FD optimization takes almost 10 times longer than SE optimization. This means that SE optimization yield 10 times more converged starts than FD optimization in the same amount of time. Comparing the optimization time between different descriptions, we find that on average the required time is of the same order of magnitude for all descriptions. Hence for problems where RRE based parameter estimation is feasible, MA and SSE based parameter estimation should be feasible as well.
For the MA a large fraction of optimizer runs ended in parameter domains where numerical integration was not possible. Figure 4(b) therefore only includes runs where no such difficulties occurred. For the SSE all runs are shown, which indicates no problems with integrability. This means that for this systems MA will in general require more starts to obtain the same number of convergent runs. Similar problems were 4/10

Further Quantification of Estimation Error and Model Selection Criteria
In the following we provide a comparison of the estimation error and uncertainty analysis for all model parameters between RRE and EMRE as well as LNA and IOS. Moreover we provide a comparison of estimation errors and model selection between EMRE and 2MA as well as IOS and 3MA.    . Model selection results with AIC for the trimerization and enzymatic degradation model. Median AIC weight for 2MA and 3MA at respective estimated parameters. A green color indicates that the 2MA and 3MA description is more probable and a blue color indicates the RRE and LNA description is more probable. The median was computed from 100 datasets for every volume/sample size scenario. Figure 9. Model selection results with BIC for the trimerization and enzymatic degradation model. Median BIC weight for 2MA and 3MA at respective estimated parameters. A green color indicates that the 2MA and 3MA description is more probable and a blue color indicates the RRE and LNA description is more probable. The median was computed from 100 datasets for every volume/sample size scenario.