The authors have declared that no competing interests exist.
A common challenge in systems biology is quantifying the effects of unknown parameters and estimating parameter values from data. For many systems, this task is computationally intractable due to expensive model evaluations and large numbers of parameters. In this work, we investigate a new method for performing sensitivity analysis and parameter estimation of complex biological models using techniques from uncertainty quantification. The primary advance is a significant improvement in computational efficiency from the replacement of model simulation by evaluation of a polynomial surrogate model. We demonstrate the method on two models of mating in budding yeast: a smaller ODE model of the heterotrimeric G-protein cycle, and a larger spatial model of pheromone-induced cell polarization. A small number of model simulations are used to fit the polynomial surrogates, which are then used to calculate global parameter sensitivities. The surrogate models also allow rapid Bayesian inference of the parameters via Markov chain Monte Carlo (MCMC) by eliminating model simulations at each step. Application to the ODE model shows results consistent with published single-point estimates for the model and data, with the added benefit of calculating the correlations between pairs of parameters. On the larger PDE model, the surrogate models allowed convergence for the distribution of 15 parameters, which otherwise would have been computationally prohibitive using simulations at each MCMC step. We inferred parameter distributions that in certain cases peaked at values different from published values, and showed that a wide range of parameters would permit polarization in the model. Strikingly, our results suggested different diffusion constants for active versus inactive Cdc42 to achieve good polarization, which is consistent with experimental observations in another yeast species.
Mathematical models in systems biology often have many parameters, such as biochemical reaction rates, whose true values are unknown. When the number of parameters is large, it becomes computationally difficult to analyze their effects and to estimate parameter values from experimental data. This is especially challenging when the model is expensive to evaluate, which is the case for large spatial models. In this paper, we introduce a methodology for using surrogate models to drastically reduce the cost of parameter analysis in such models. By using a polynomial approximation to the full mathematical model, parameter sensitivity analysis and parameter estimation can be performed without the need for a large number of model evaluations. We explore the application of this methodology to two models for yeast mating polarization. A simpler non-spatial model is used to demonstrate the techniques and compare with published results, and a larger spatial model is used to demonstrate the computational savings offered by this method.
Mathematical models provide a more quantitative description of biological systems compared to qualitative arrow diagrams. A major tool of mathematical modeling is differential equations representing the dynamics of various components of the system which may be a cell, organism, or ecosystem [
One of the challenges in modeling is identifying the parameters from data [
For parameter estimation, two major approaches are Bayesian and maximum likelihood [
In general, global sensitivity analysis and parameter estimation both require sampling of the parameter space. For systems with large parameter counts, this can become very challenging due to the curse of dimensionality. Too many parameters can make sampling of the parameter space computationally intractable, especially for partial differential equation models that are expensive to solve. Many advances have been made in reducing computational cost in the field of uncertainty quantification (UQ), which is concerned with the characterization and reduction of uncertainty in mathematical models [
In this paper, we apply a method for parameter sensitivity analysis and parameter estimation that uses polynomial approximation to significantly reduce the computational cost for large problems. A key step in the proposed method is the construction of a polynomial surrogate model. This surrogate model allows for sampling methods to be applied without the need to solve the full system for each sample. The use of surrogate models (e.g. support vector machines) for biological systems has been explored previously in [
To demonstrate the capability of the proposed method, we apply it to models of yeast cell polarization. Cell polarization is the process by which intracellular species (e.g. proteins) become asymmetrically localized, which is fundamental to cellular processes such as cell division, differentiation, and movement [
We consider two models: an ODE model for only one module of the system (the heterotrimeric G-protein cycle), and a spatial model that incorporates a larger signaling pathway as well as membrane diffusion of the proteins. We will refer to these models as Model 1 and Model 2, respectively. Model 1 was proposed in [
It should be noted that the results of parameter sensitivity and parameter estimation are dependent on the assumed model structure. In systems biology there is often significant uncertainty in the model structure itself. Some work has been done on quantifying the structural uncertainty in models of biological networks and reconstructing networks from data [
The structure of this paper is as follows. We first present the mathematical methods for surrogate model construction and how to perform parameter sensitivity analysis and parameter estimation using a polynomial surrogate. We then demonstrate the methods on Model 1, performing sensitivity analysis and estimation in two cases: first, varying only the two free parameters, and second, varying all eight parameters. We then present Model 2 and use sensitivity analysis to significantly reduce the parameter count. Bayesian parameter estimation is then performed in the reduced parameter space. We discuss the computational savings afforded by the use of a polynomial surrogate for parameter estimation in Model 2. Finally, we discuss biological implications of the results and future applications of the polynomial surrogates in Bayesian model analysis.
Biological systems often possess many parameters whose true values are unknown. In order to gain an understanding of the effects of each parameter, we need to sample the parameter space. However, sampling a high-dimensional space is a difficult task. For example, in the next section we consider a large PDE model with 35 parameters. In this case, even with only two sample points in each dimension we would need 2^35 ∼
If multiple response functions are of interest (for example, different time points or different values of some input), there are two options—one can either increase the number of variables in the polynomial or use multiple polynomials. For example, if measurements are taken at several time points
To perform the polynomial fitting, we use an orthogonal polynomial basis from the generalized polynomial chaos (gPC) approach [
Recall that the number of basis functions for the set of polynomials of degree up to
The samples can be chosen in a variety of ways (e.g. uniform random sampling, sparse grids, Latin hypercube sampling, etc.). A quasi-optimal sampling scheme for least squares polynomial fitting has been explored in [
1. Determine the desired polynomial degree and how many samples can reasonably be obtained.
2. Sample the parameter space using the sampling method of your choice. The sampling method may depend on whether you are undersampling or oversampling (e.g. for oversampling, you may want to use quasi-optimal points for least squares [
3. Using the samples from step 2, set up a linear system
4. Solve for the coefficients. If undersampling, perform compressed sensing with
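The four steps above can be sketched for a small toy problem. This is a minimal illustration in Python with NumPy, not the authors' implementation: `toy_model`, the two-parameter setup, the sample count, and all helper names are hypothetical, and only the oversampled least-squares branch of step 4 is shown (the compressed sensing branch for undersampling is omitted).

```python
import itertools

import numpy as np
from numpy.polynomial import legendre


def total_degree_indices(n_params, degree):
    """All multi-indices (k_1, ..., k_n) with k_1 + ... + k_n <= degree."""
    return [k for k in itertools.product(range(degree + 1), repeat=n_params)
            if sum(k) <= degree]


def design_matrix(samples, indices):
    """Evaluate every tensor-product Legendre basis function at every sample."""
    n_samples, n_params = samples.shape
    max_deg = max(max(k) for k in indices)
    # uni[j][:, d] holds P_d evaluated at coordinate j of each sample
    uni = [legendre.legvander(samples[:, j], max_deg) for j in range(n_params)]
    A = np.ones((n_samples, len(indices)))
    for col, k in enumerate(indices):
        for j, d in enumerate(k):
            A[:, col] *= uni[j][:, d]
    return A


def toy_model(x):
    """Hypothetical stand-in for an expensive simulation (assumption)."""
    return np.exp(0.5 * x[:, 0]) * np.cos(x[:, 1])


rng = np.random.default_rng(0)
indices = total_degree_indices(n_params=2, degree=5)   # step 1: choose degree
train = rng.uniform(-1.0, 1.0, size=(200, 2))          # step 2: sample [-1, 1]^2
A = design_matrix(train, indices)                      # step 3: linear system A c = f
coeffs, *_ = np.linalg.lstsq(A, toy_model(train), rcond=None)  # step 4: least squares
```

Enumerating multi-indices with `itertools.product` is only practical for a handful of parameters; a high-dimensional problem would need a smarter enumeration of the total-degree index set.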
The accuracy of the polynomial can be estimated by cross-validation. In cross-validation, the model is evaluated at additional sample points that were not used in the polynomial fitting. The model output can then be compared with the polynomial value at those points to determine the error. One may also perform
Once the polynomial surrogate model is constructed, it can be used to perform parameter sensitivity analysis and parameter estimation (
The polynomial fitting procedure is described in Algorithm 1. The parameter space samples are generated by model simulation. The sensitivity analysis and parameter estimation use the fitted surrogate polynomial.
We define the sensitivity of a response function
We can then assess the importance of each parameter based on its sensitivity. If the response is not sensitive to a parameter
For parameter estimation, we use a Markov chain Monte Carlo (MCMC) method with the Metropolis-Hastings algorithm [
MCMC methods have become a popular choice for parameter estimation in biological systems [
A key question is knowing when the MCMC has converged, meaning that the distribution of the Markov chain samples has converged to the posterior distribution. Several convergence diagnostics for MCMC have been proposed [
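To make the role of the surrogate in MCMC concrete, the following is a minimal random-walk Metropolis-Hastings sketch in which each step evaluates only a polynomial. Everything specific here is an assumption for illustration: the two-parameter `surrogate`, the uniform prior on [-1, 1]^2, the Gaussian noise model, the synthetic data, and the step size are hypothetical and are not taken from the models in this paper.

```python
import numpy as np


def surrogate(theta):
    """Hypothetical cheap polynomial surrogate in two parameters (assumption)."""
    return 1.0 + 2.0 * theta[0] - 1.5 * theta[1] + 0.5 * theta[0] * theta[1]


def log_posterior(theta, data, sigma):
    # Assumed uniform prior on [-1, 1]^2 and Gaussian measurement noise
    if np.any(np.abs(theta) > 1.0):
        return -np.inf
    resid = data - surrogate(theta)
    return -0.5 * np.sum(resid ** 2) / sigma ** 2


def metropolis_hastings(data, sigma, n_steps, step_size, rng):
    theta = np.zeros(2)
    logp = log_posterior(theta, data, sigma)
    chain = np.empty((n_steps, 2))
    accepted = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal, so the acceptance ratio is the posterior ratio
        proposal = theta + step_size * rng.standard_normal(2)
        logp_new = log_posterior(proposal, data, sigma)
        if np.log(rng.uniform()) < logp_new - logp:
            theta, logp = proposal, logp_new
            accepted += 1
        chain[i] = theta
    return chain, accepted / n_steps


rng = np.random.default_rng(0)
sigma = 0.2
true_theta = np.array([0.3, -0.2])           # used only to generate synthetic data
data = surrogate(true_theta) + sigma * rng.standard_normal(5)
chain, acceptance = metropolis_hastings(data, sigma, n_steps=20000,
                                        step_size=0.1, rng=rng)
posterior = chain[5000:]                      # discard an assumed burn-in period
```

With a single scalar observable and two parameters, the posterior concentrates along a curve rather than a point, which is a small-scale analogue of parameters that are not individually identifiable from the data.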
All codes have been made publicly available on GitHub in the repository
The yeast strain CGY-021 is a derivative of W303-1A and contains the
Cells were cultured in YPD (yeast extract-peptone-dextrose) media supplemented with adenine. Cells were treated for 60 minutes with 10 nM
We apply the proposed method to two models of the yeast mating response. Haploid budding yeast cells assume two mating types,
First, the pheromone
Two key features of this process are the positive and negative feedback loops. In the positive feedback loop, membrane-bound Bem1 binds and activates Cdc24, which catalyzes the formation of active Cdc42, which in turn binds more Bem1. In the negative feedback loop, active Cdc42 activates Cla4, which inhibits the membrane-bound Cdc24, leading to a lower activation rate of Cdc42. Cdc42 is of particular interest since it plays a key role in establishing polarity and is highly conserved from yeasts to humans [
To demonstrate our methods, we first consider a simple model: an ODE model of the heterotrimeric G-protein cycle taken from [
Since the ultimate goal is parameter estimation, the response functions of interest are those outputs for which we have experimental data. Using the data from [
We first construct a polynomial surrogate model that approximates the ODE model, which allows us to sample the parameter space at a much lower computational cost. In this example, we construct a set of polynomials in two variables (
The degree of the polynomial as well as the number of points used for least squares fitting can be adjusted depending on the error of the resulting polynomial. The error can be determined by calculating the difference between the polynomial and the simulated full model at randomly sampled points using cross-validation. Since the number of samples may need to be adjusted, it is best to use a sampling technique that allows for the sequential addition of points, such as simple random sampling or Sobol sampling.
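The degree selection described above can be illustrated with a one-dimensional toy problem: fit Legendre polynomials of increasing degree and measure the error at held-out random points. This is only a sketch under stated assumptions; `toy_model` is a hypothetical stand-in for the simulated model, and the sample sizes and degrees are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import legendre


def toy_model(x):
    """Hypothetical smooth model output on [-1, 1] (assumption)."""
    return np.exp(x)


rng = np.random.default_rng(2)
x_fit = rng.uniform(-1.0, 1.0, 200)   # points used for least squares fitting
x_val = rng.uniform(-1.0, 1.0, 100)   # held-out points used only to measure error

errors = {}
for degree in (2, 4, 6, 8):
    coeffs = legendre.legfit(x_fit, toy_model(x_fit), degree)
    abs_err = np.abs(legendre.legval(x_val, coeffs) - toy_model(x_val))
    # Mean and standard deviation of the absolute error over held-out points
    errors[degree] = (abs_err.mean(), abs_err.std())
```

Because the fitting and validation points are drawn by simple random sampling, more of either can be appended later without discarding the existing ones.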
In
Error mean and standard deviation (measured using 100 random samples by cross-validation) for different polynomial fits (top), and the cost to compute the polynomials (bottom). (A) 5th order polynomials fit using different numbers of sample points. (B) Polynomials of varying degree using least squares fitting with 1000 points. Polynomial error is the average difference between the polynomial and the model output, and the error bars indicate the standard deviation of the error over the 100 sample points.
We plot the computational cost of the polynomial fitting as a function of number of samples or polynomial degree at the bottom of
For the sensitivity analysis and parameter estimation, we use the 10th degree polynomial fit from 1000 sample points. Since each data point in
Sensitivity coefficients are given for different time points (
| Data points | Sensitivity to | Sensitivity to |
|---|---|---|
| | 4.6 × 10^−1 | −2.4 × 10^−1 |
| | 4.5 × 10^−1 | −3.0 × 10^−1 |
| | 4.5 × 10^−1 | −3.4 × 10^−1 |
| | 4.3 × 10^−1 | −3.7 × 10^−1 |
| | 4.3 × 10^−1 | −4.0 × 10^−1 |
| | 4.2 × 10^−1 | −4.0 × 10^−1 |
| | 4.0 × 10^−1 | −4.0 × 10^−1 |
| | 3.9 × 10^−1 | −4.0 × 10^−1 |
| | 3.5 × 10^−1 | −2.0 × 10^−1 |
| | 3.8 × 10^−1 | −2.3 × 10^−1 |
| | 4.2 × 10^−1 | −2.7 × 10^−1 |
| | 4.4 × 10^−1 | −2.9 × 10^−1 |
| | 4.4 × 10^−1 | −3.1 × 10^−1 |
| | 4.4 × 10^−1 | −3.2 × 10^−1 |
| | 4.5 × 10^−1 | −3.2 × 10^−1 |
| Mean sensitivity | 4.2 × 10^−1 | −3.2 × 10^−1 |
We perform parameter estimation using the data from [
Probability distributions are obtained via Markov chain Monte Carlo and a 10th degree polynomial. (A) Distributions for individual parameters, normalized so that the total area is equal to 1. Red lines indicate the optimal (maximum likelihood) parameter values
We now apply the same parameter estimation procedure to the G-protein model allowing all 8 of the kinetic parameters to vary. In other words, we assume that all the parameters are unknown and would like to use our model to estimate these parameters. The parameters are assumed to be log-uniformly distributed in the ranges in
For this problem we choose a 5th degree polynomial surrogate that allows oversampling; the 5th degree polynomial space in 8 parameters has 1287 basis polynomials. We draw 1500 uniform random samples from the parameter space, evaluate the model at each by simulation, and construct the polynomial by least squares fitting. The resulting polynomial has mean absolute error 2.5 × 10^−2.
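The basis count quoted here follows from the standard formula for the dimension of the space of polynomials of total degree at most d in n variables, which is the binomial coefficient C(n + d, d). The short sketch below (the function name `n_basis` is ours, for illustration) reproduces the count for this case and can be used to check whether a given sample budget oversamples or undersamples a chosen polynomial space.

```python
from math import comb


def n_basis(n_params, degree):
    """Number of polynomials of total degree <= degree in n_params variables."""
    return comb(n_params + degree, degree)


# (number of parameters, polynomial degree) pairs that appear in this paper
cases = [(2, 10), (8, 5), (15, 5), (35, 5)]
counts = {case: n_basis(*case) for case in cases}

# 1500 samples against the 8-parameter, 5th degree space: oversampled
oversampled = 1500 > counts[(8, 5)]
```

For the 8-parameter, 5th degree case this gives 1287, so the 1500 samples exceed the number of unknown coefficients and ordinary least squares applies.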
Using the polynomial as a surrogate for the full model, we compute parameter sensitivities for the 8 parameters, and the mean sensitivities over the dataset are given in
Both mean sensitivities and the mean of the absolute value of the sensitivities are shown.
| Parameter | Mean sensitivity | Mean abs. value of sensitivity |
|---|---|---|
| | 8.2 × 10^−2 | 8.2 × 10^−2 |
| | −3.2 × 10^−2 | 3.2 × 10^−2 |
| | 9.2 × 10^−3 | 1.2 × 10^−2 |
| | 1.1 × 10^−3 | 6.3 × 10^−3 |
| | −6.2 × 10^−2 | 6.2 × 10^−2 |
| | 5.6 × 10^−4 | 7.4 × 10^−3 |
| | 3.1 × 10^−1 | 3.1 × 10^−1 |
| | −2.6 × 10^−1 | 2.6 × 10^−1 |
Next, we perform parameter estimation on all 8 parameters and obtain the distributions in
(A) Parameter distributions from ODE model (
We determined the mean values for each parameter distribution to create the mean parameter set (
The correlation between pairs of parameters can be calculated along with the individual distributions. A graphical representation of the correlations among the 8 parameters is given in
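Pairwise correlations are a direct by-product of the MCMC samples: each column of the chain is one parameter, and the sample correlation matrix summarizes how parameters co-vary in the posterior. The sketch below uses a synthetic three-column array as a hypothetical stand-in for a real chain (after burn-in); the column relationships are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 5000

# Hypothetical stand-in for post-burn-in MCMC samples of three parameters:
# column 1 is constructed to be negatively correlated with column 0,
# while column 2 is independent of both.
a = rng.standard_normal(n_samples)
b = -0.8 * a + 0.6 * rng.standard_normal(n_samples)
c = rng.standard_normal(n_samples)
chain = np.column_stack([a, b, c])

# rowvar=False treats each column as a variable and each row as a sample
corr = np.corrcoef(chain, rowvar=False)
```

A strongly negative off-diagonal entry, as between the first two columns here, indicates that an increase in one parameter can be compensated by a decrease in the other, information that a single-point estimate cannot provide.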
To capture the spatiotemporal dynamics of yeast cell polarization during mating, one needs a mechanistic spatial model. In this model, protein spatial dynamics are driven by two processes: surface diffusion on the cell membrane and reactions with other proteins in the system. This leads to a system of reaction-diffusion equations, similar to the model presented in [
The coefficients are given by
| Parameter | Description | Previous estimate | Range | Ref. |
|---|---|---|---|---|
| | Diffusion of R | 0.001 | ±10% | [ |
| | Diffusion of RL | 0.001 | ±10% | [ |
| | Diffusion of G | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of Ga | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of Gbg | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of Gd | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of C24m | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of C42 | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of C42a | 0.01 | [0.005, 0.02] | [ |
| | Diffusion of B1m | 0.01 | [0.005, 0.02] | [ |
| | RL association | 2 × 10^−3 nM^−1 s^−1 | ±10% | [ |
| | RL dissociation | 10^−2 | ±10% | [ |
| | R internalization | 4 × 10^−4 | ±10% | [ |
| | R synthesis | 4/ | ±10% | [ |
| | G-protein activation | 10^−5 × | ±10% | [ |
| | G-protein deactivation | 0.1 | ±10% | [ |
| | Heterotrimer association | 1 | ±10% | [ |
| | Cdc42 deactivation | 0.02 | [0.02, 2] | [ |
| | Cdc42 activation | 10^−5 × | [10^−5, 10^−3] × | [ |
| | G | 0.04 × | [0.004, 0.4] × | [ |
| | Bem1 recruitment of Cdc24 | 3.3 × 10^−3 × | [3.3 × 10^−4, 3.3 × 10^−2] × | [ |
| | Cdc24, membrane to cytoplasm | 1 | [0.1, 1] | [ |
| | Bem1, membrane to cytoplasm | 0.01 | [0.01, 1] | [ |
| | Bem1, cytoplasm to membrane | 10^−5 × | [10^−5, 10^−3] × | [ |
| | Cla4 activation | 0.006 | [0.0006, 0.06] | [ |
| | Cla4 deactivation | 0.01 | [0.001, 0.1] | [ |
| | Negative regulation of Cdc42 cycle | | [0.1, 10] × | [ |
| | Hill coefficient for | 100 | [1, 100] | [ |
| | Hill coefficient for | 8 | [1, 8] | [ |
| | Total Cdc24 | 2000 | [1000, 3000] | [ |
| | Total Bem1 | 3000 | [2000, 5000] | [ |
| | Total receptor | 10000 | ±10% | [ |
| | Total G-protein | 10000 | ±10% | [ |
| | Total Cdc42 | 10000 | [5000, 20000] | [ |
In our numerical simulations, the cell membrane is simulated as a circle centered at the origin with radius 2
The quantity of interest in this model is the extent of cell polarization, more specifically, the extent of active Cdc42 polarization. Therefore, we consider a scalar function of active Cdc42 (
We perform polynomial fitting using a Legendre polynomial basis to fit the response function
We use 5,000 points to fit a 5th order polynomial in the full 35-dimensional parameter space. The accuracy of the polynomial is evaluated on an additional 500 uniformly random points. A histogram of the errors between the model and polynomial is shown in
Once we have established a polynomial surrogate model, we can analytically compute parameter sensitivities. Assuming that each parameter is uniformly distributed in [−1, 1], the sensitivity of the response function
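As an illustration of why a polynomial surrogate permits analytic sensitivities, consider one common derivative-based measure: the expected value of the partial derivative of the response with respect to a parameter. Whether this matches the exact definition used in this paper is an assumption on our part, since the defining formula is not reproduced in this text. For a Legendre expansion with independent parameters uniform on [−1, 1], the mean of P_d is zero for d > 0 and the mean of its derivative is (P_d(1) − P_d(−1))/2, which equals 1 for odd d and 0 for even d. Only basis terms that depend on a single parameter with odd degree therefore contribute. The coefficient dictionary and function names below are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def mean_derivative_sensitivity(coeffs, n_params):
    """Expected partial derivatives of a Legendre expansion on [-1, 1]^n.

    coeffs maps multi-index tuples (k_1, ..., k_n) to expansion coefficients.
    This measure is an assumed definition, used here for illustration only.
    """
    sens = np.zeros(n_params)
    for k, c in coeffs.items():
        active = [j for j, d in enumerate(k) if d > 0]
        # Mixed terms and even-degree univariate terms average to zero
        if len(active) == 1 and k[active[0]] % 2 == 1:
            sens[active[0]] += c
    return sens


def mean_derivative_1d(degree, n_nodes=20):
    """Quadrature check of E[P_d'(x)] for x uniform on [-1, 1]."""
    nodes, weights = legendre.leggauss(n_nodes)
    unit = np.zeros(degree + 1)
    unit[degree] = 1.0                      # coefficient vector selecting P_d
    deriv = legendre.legval(nodes, legendre.legder(unit))
    return 0.5 * np.sum(weights * deriv)    # factor 0.5 is the uniform density


# Hypothetical two-parameter expansion (assumption, not from the paper)
example = {(0, 0): 4.0, (1, 0): 2.0, (2, 0): 1.0, (1, 1): 3.0, (0, 3): 0.5}
sens = mean_derivative_sensitivity(example, n_params=2)
```

Under this assumed measure the sensitivities are signed, and no sampling is required once the coefficients are known.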
We observe that many of the parameters have small sensitivity coefficients, and the parameters of primary importance are those associated with the Cdc42 cycle dynamics. Based on the parameter sensitivities in
In this 15-dimensional subspace, we can again perform polynomial fitting to obtain a surrogate model. We use 6000 points to fit a 5th order polynomial using
| Parameter | Sensitivity |
|---|---|
| | 1.4 × 10^−3 |
| | −4.0 × 10^−3 |
| | −1.3 × 10^−2 |
| | 1.7 × 10^−2 |
| | 2.5 × 10^−2 |
| | 4.4 × 10^−2 |
| | 5.5 × 10^−2 |
| | −5.6 × 10^−2 |
| | −5.9 × 10^−2 |
| | 6.1 × 10^−2 |
| | 7.0 × 10^−2 |
| | 7.7 × 10^−2 |
| | 9.8 × 10^−2 |
| | −1.3 × 10^−1 |
| | 1.4 × 10^−1 |
We wished to estimate the model parameters that could produce polarization by fitting to experimental data. The key species in yeast polarization is active Cdc42 (C42a) which we can monitor using the reporter Ste20-GFP, a fusion protein that binds active Cdc42 and possesses a fluorescent tag [
With these data, we can perform parameter estimation using the 15-parameter polynomial surrogate model and an MCMC method.
Parameter distributions based on MCMC with chain length 2 × 10^6 for the reduced 15-parameter PDE model. The parameter range is on a log scale except for the parameters
Sometimes it is desirable to obtain a single best parameter estimate (e.g. maximum likelihood) to visualize how closely the model can fit the data, and to determine the parameter values at that best fit. In this example, due to the large amount of uncertainty in the parameter distributions, there is no clear choice for such a point estimate. In fact, given the limited data relative to the large number of parameters, multiple “best” parameter estimates may exist. One approach is to take the mean of the MCMC iterates (
Another option is to use an optimization method such as simulated annealing to improve upon
Steady state solutions for the mean parameter set from the MCMC (solid black) and a parameter set identified via simulated annealing (dashed black). Polarization is depicted by the concentration of active Cdc42 (C42a, number of molecules/
In this work we apply novel methods from uncertainty quantification to perform parameter sensitivity analysis and parameter estimation of two models of yeast mating. The central innovation is the construction of polynomial surrogate models to replace simulation for calculating the model output. We demonstrate the accuracy of the polynomials by cross-validation on random sample points left out from the polynomial fitting. For Bayesian parameter estimation, the method provides a dramatic reduction in computational cost.
Typically, MCMC requires a model evaluation at every iteration. Since our Markov chain length for the 15-parameter model was 2 × 10^6, we would require 2 × 10^6 evaluations of the PDE model to steady state. It would likely take even more iterations for the MCMC to converge for the full 35-parameter model. The PDE is solved with an implicit method implemented in Fortran, and each evaluation takes 40 to 60 minutes of CPU time. Thus, the full MCMC would require on the order of 200 years of CPU time (2 × 10^6 evaluations at 40 to 60 minutes each is roughly 150 to 230 years). Further, MCMC is not inherently parallelizable, although advancements have been made in parallel MCMC methods [
Using the polynomial surrogate, we are able to practically eliminate the cost of MCMC by evaluating only a polynomial at each MCMC iteration. Computing a chain of length 2 × 10^6 takes only a few hours in MATLAB. In place of this cost, we must evaluate the full PDE model at the sample points used to fit the polynomial. For our full 35-parameter model, we use 5000 sample points to fit a polynomial to perform the sensitivity analysis. We are then able to reduce the parameter count to 15, and use 6000 additional samples to fit a polynomial in the reduced parameter space. Thus we require 11,000 model evaluations in total. There is also some cost to fit the polynomial via
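The cost comparison can be checked with back-of-the-envelope arithmetic using only the figures quoted in the text: a chain of 2 × 10^6 steps, 11,000 model evaluations for the two polynomial fits, and 40 to 60 CPU minutes per PDE solve. The sketch below ignores the polynomial fitting cost and the few hours of surrogate-based MCMC, so it is a rough estimate rather than a measured benchmark; the variable names are ours.

```python
MINUTES_PER_YEAR = 60 * 24 * 365

chain_length = 2_000_000          # MCMC iterations for the 15-parameter model
evals_for_fit = 5_000 + 6_000     # samples for the 35- and 15-parameter fits

direct_years = {}
surrogate_years = {}
for minutes_per_eval in (40, 60):
    # One PDE solve per MCMC step versus one PDE solve per fitting sample
    direct_years[minutes_per_eval] = chain_length * minutes_per_eval / MINUTES_PER_YEAR
    surrogate_years[minutes_per_eval] = evals_for_fit * minutes_per_eval / MINUTES_PER_YEAR

# Ratio of model evaluations, independent of the per-evaluation cost
reduction_factor = chain_length / evals_for_fit
```

The fitting samples are independent of one another, so unlike the steps of a single Markov chain they can be computed concurrently.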
The computational savings in the ODE test model are not as dramatic, since the ODE model is inexpensive to solve. In numerical tests for the 2-parameter ODE model with a 10th degree polynomial surrogate, we found a 20% reduction in CPU time in evaluating the polynomial vs. evaluating the model directly. In the 8-parameter model with a 5th degree polynomial surrogate, we found a more than 10-fold reduction in CPU time; we believe the greater reduction in cost is afforded by the lower polynomial degree. The computational savings afforded by using polynomial surrogates will vary depending on the ODE solver, the degree of the polynomials, and the time step required to solve the ODE. Whether a problem warrants the use of surrogate models will generally depend on the cost of evaluating the original model, the number of sample (data) points required for accurate parameter estimation, and the polynomial degree required to fit the model output.
The primary challenge with the method is constructing accurate polynomials. As we demonstrate in the ODE example, more sample points and a higher degree polynomial produce greater accuracy. One concern is the ability of the surrogate polynomials to describe highly nonlinear relationships between parameters and outputs arising from bifurcations. If the model output is discontinuous with respect to the parameters, for example, then the model output will not be well-approximated by polynomials. This issue may exist in the PDE model presented here, since it has previously been shown that the model for some parameter values possesses multistability contributing to the polarization [
Another issue is that one may make false assumptions in determining a response function. In the PDE model we choose a response function that quantifies the cell polarization at steady state, and thus we are assuming that the system settles to a steady state. While this seems to be a reasonable assumption for the system presented here, this may not always be the case. If a system has periodic solutions rather than a stable steady state in some region of the parameter space, then one would need to carefully consider how to build an appropriate response function. Unfortunately, it is not always clear a priori whether such solutions exist for a given system.
Finally, a third issue is the combinatorial increase in the number of polynomial coefficients as the number of parameters increases. The 5th degree polynomial for the 35-parameter model possesses 658,008 coefficients, and a 100-parameter 5th degree polynomial would possess over 96 million coefficients. For the PDE model we employ compressed sensing methods (
In the yeast G-protein ODE model, the parameter distributions inferred from the time-course and dose-response data are consistent with the parameter estimates and experimental measurements from [
The PDE model shows broad distributions for nearly all 15 parameters examined, indicating that a wide range of parameter values are compatible with good polarization of active Cdc42. The fact that the feasible region of the parameter
This work also highlights the inability of the current PDE model to produce the sharp polarization peak of active Cdc42 observed in cells. One explanation is that the model is missing important dynamics or positive feedback mechanisms that enhance cell polarization. In the future, we plan to include additional spatial dynamics such as the polarized transport of Cdc42 to the front of the projection, which is absent from the model.
The broadness of the obtained parameter distributions also implies that the current data are insufficient to obtain tight parameter estimates. In this study we focused on identifying parameter values that would produce polarization in the model versus an unpolarized state. Further data can be collected tracking the spatial dynamics of the other species in the model such as G
In our analysis, we presented only the sensitivity measure
Polynomial surrogates may also be used in methods for parameter estimation not addressed in this paper. In principle, polynomial surrogates can be applied to any type of model for which parameter ranges are known, and for any sampling-based method that requires model evaluations. By fitting polynomials to the quantities for which data is available, every model evaluation in a computational method can be replaced by a polynomial evaluation. While we have demonstrated this here only in the context of a Markov chain Monte Carlo method, the same principles may be used to accelerate the computations involved in other Bayesian methods for parameter estimation, such as rejection sampling and sequential Monte Carlo.
Yet another potential application of polynomial surrogates is to accelerate methods for Bayesian model selection. The idea behind Bayesian model selection is that we can recover a probability distribution for a model index parameter
Arrows indicate the conversion of protein species from inactive to active form or from cytoplasmic localization to membrane localization (where the protein is active). Solid dots represent reactions catalyzed by the connected proteins. Lines terminating in a vertical bar (instead of an arrow) represent inhibition. Species and reactions are described in the main text.
(EPS)
Correlations between the 8 kinetic parameters in the MCMC chain using the polynomial surrogate of Model 1.
(EPS)
Error in the 35-dimensional polynomial surrogate function for Model 2 fit using 5000 points, and measured (tested) at 500 uniform random samples.
(EPS)
Error in the 15-dimensional polynomial surrogate function for Model 2 fit using 6000 points, and measured (tested) via 10-fold cross-validation.
(EPS)
Correlations between the parameters in the MCMC chain using the 15-dimensional polynomial surrogate of Model 2.
(EPS)
Parameter values are taken from [
(PDF)
Experimental data for the given time points and
(PDF)
Ranges for the kinetic parameters used for parameter estimation of all 8 parameters in Model 1 (heterotrimeric G-protein model).
(PDF)
Sensitivity coefficients, in order of ascending magnitude, from sensitivity analysis of all 35 parameters in Model 2 using a 5th order surrogate polynomial fit to 5000 sample points.
(PDF)
The authors would like to thank He Yang for assistance with image processing, and Yeonjong Shin for assistance with uncertainty quantification. Large computations were carried out using the Ohio Supercomputer Center [