Fast uncertainty quantification for dynamic flux balance analysis using non-smooth polynomial chaos expansions

We present a novel surrogate modeling method that can be used to accelerate the solution of uncertainty quantification (UQ) problems arising in nonlinear and non-smooth models of biological systems. In particular, we focus on dynamic flux balance analysis (DFBA) models that couple intracellular fluxes, found from the solution of a constrained metabolic network model of the cellular metabolism, to the time-varying nature of the extracellular substrate and product concentrations. DFBA models are generally computationally expensive and present unique challenges to UQ, as they entail dynamic simulations with discrete events that correspond to switches in the active set of the solution of the constrained intracellular model. The proposed non-smooth polynomial chaos expansion (nsPCE) method is an extension of traditional PCE that can effectively capture singularities in the DFBA model response due to the occurrence of these discrete events. The key idea in nsPCE is to use a model of the singularity time to partition the parameter space into two elements on which the model response behaves smoothly. Separate PCE models are then fit in both elements using a basis-adaptive sparse regression approach that is known to scale well with respect to the number of uncertain parameters. We demonstrate the effectiveness of nsPCE on a DFBA model of an E. coli monoculture that consists of 1075 reactions and 761 metabolites. We first illustrate how traditional PCE is unable to handle problems of this level of complexity. We demonstrate that over 800-fold savings in computational cost of uncertainty propagation and Bayesian estimation of parameters in the substrate uptake kinetics can be achieved by using the nsPCE surrogates in place of the full DFBA model simulations. We then investigate the scalability of the nsPCE method by utilizing it for global sensitivity analysis and maximum a posteriori estimation in a synthetic metabolic network problem with a larger number of parameters related to both intracellular and extracellular quantities.


Introduction
The utility of mathematical modeling in biology is on the rise due to computational advancements and the increasing availability of data provided by high-throughput experimental techniques [1].Flux balance analysis (FBA) is widely used for modeling cellular metabolism in a large range of metabolic and biochemical engineering problems [2,3].Given a constrained metabolic network, FBA assumes the intracellular fluxes are regulated by the cell to optimize a predefined cellular objective function (e.g., maximizing the biomass growth rate [4]) subject to mass balances of the intracellular metabolites and other feasibility constraints (e.g., bounds on the substrate uptake and product secretion rates).However, FBA only identifies metabolic flux distributions at steady-state and, thus, provides no information on metabolite concentrations or the dynamic behavior of the fluxes.A dynamic extension to FBA, commonly referred to as dynamic FBA (DFBA), was originally developed in [5] and has been subsequently applied in several applications [6][7][8][9].In DFBA models, the intracellular fluxes are given by the solution of a FBA model, which is coupled to a set of dynamic equations that describes the time-varying nature of the extracellular substrate and product concentrations as a function of the extracellular environment [10].The key assumption in DFBA is that the intracellular fluxes equilibrate instantaneously.This "quasi steady-state" assumption is valid as long as the intracellular dynamics are significantly faster than the extracellular dynamics.
Generally, the prediction of the behavior of biological systems such as those described by DFBA models can be subject to various sources of uncertainty including unknown model parameters, unknown model structure, and experimental uncertainty such as measurement error [11].Accurate quantification of these uncertainties, as well as their impact on the quality of model predictions, is vital when applying these models in decision-support or optimization tasks such as parameter estimation or optimal experiment design.The task of uncertainty quantification (UQ) can be divided into two major problems: forward uncertainty propagation and inverse uncertainty estimation.The forward problem focuses on propagating all uncertainties through the model to predict the overall uncertainty in the outputs, whereas the inverse problem aims to calibrate the model with experimental data [12][13][14].However, the most commonly used UQ methods are intractable for expensive-to-evaluate computational models [15], which has severely limited their application to DFBA models.An overview of the various challenges in DFBA simulations can be found in [16].
Surrogate modeling techniques are being increasingly adopted to enable complex UQ analyses that would otherwise be impossible.Of the available surrogate modeling approaches, polynomial chaos expansions (PCEs) are one of the most commonly used methods for UQ, which have been shown to yield accurate representations of model outputs using limited computational resources in various engineering systems [17][18][19][20] as well as biological systems [21][22][23].However, an important underlying assumption in PCE is that the model response is a smooth function of the uncertain parameters such that the response can be accurately approximated by a collection of polynomial functions.For non-smooth models, PCE has been shown to either converge very slowly or even fail to converge altogether depending on the type of nonsmoothness [24,25].This is a critical challenge in DFBA models because they are known to become singular (i.e., lose differentiability) at certain time points due to the underlying quasi steady-state assumption [10,26,27], meaning that even state-of-the-art PCE methods are not directly applicable to DFBA models.
In this work, we propose an extension to PCE, referred to as non-smooth PCE (nsPCE), that can adequately capture the non-smooth behavior exhibited by DFBA models.The underlying concept behind the proposed nsPCE framework is that the time of occurrence of any singularity in a DFBA model is a smooth function of the parameters, which can be effectively modeled with a PCE.Thus, for any given time of interest, the PCE model of the singularity time can be used to partition the parameter space into two non-overlapping regions (or elements) that represent the collection of parameters for which the singularity has and has not occurred.Separate PCEs can then be constructed over each of these elements, leading to a piecewise polynomial approximation of the overall model response.We adopt a non-intrusive, regression-based approach for PCE construction from a limited number of expensive DFBA simulations.In particular, we take advantage of state-of-the-art sparse regression methods to systematically locate the terms that have the greatest impact on the model response out of a very large candidate set of terms.By exploiting sparsity, we can mitigate the curse-of-dimensionality that can plague traditional PCE, allowing the application of the proposed nsPCE approach to problems with reasonably large number of uncertain parameters.
To demonstrate the effectiveness of the nsPCE method, it is applied to accelerate Bayesian estimation of parameters in the substrate uptake kinetic expressions of diauxic growth of a batch monoculture of Escherichia coli on a glucose and xylose mixed media.The metabolic network reconstruction used for E. coli is iJ904, which is a genome-scale model that contains 1075 reactions and 761 metabolites [28].Parameter estimation is performed using measurements of the concentrations of extracellular metabolites and biomass that are taken at certain time points throughout the batch.We selected this particular system due to the fact that reported parameter estimates were determined from experimental data using a trial-and-error procedure [8].This was likely due to the computational complexity of the genome-scale DFBA model in conjunction with the limited data set that may not enable unique estimation of parameters.In addition, we demonstrate how nsPCE can be applied to vastly speedup forward UQ analyses including global sensitivity analysis and estimation of the probability distribution of the model response.To demonstrate the scalability of nsPCE, it is used for maximum a posteriori parameter estimation in a synthetic metabolic network problem with twenty unknown parameters related to quantities in both the intracellular reaction network and the extracellular environment.The codes that implement the proposed nsPCE method for generic DFBA models are provided at the repository [29].

Dynamic flux balance analysis models
We focus on modeling a microbial cultivation process using dynamic flux balance analysis (DFBA), in which the bioreactor is viewed as a combination of the fluid medium (extracellular environment) and the microorganisms (intracellular environment).Cell walls act as physical boundaries between these two phases, through which certain chemical metabolites are exchanged.The DFBA model can be mathematically formulated as [26] _ sðtÞ ¼ fðt; sðtÞ; vðsðtÞÞÞ; sðt 0 Þ ¼ s 0 ; ð1Þ where s denotes the state variables describing the extracellular environment (e.g., concentrations of substrates, biomass, and products) with time derivative _ s and initial conditions s 0 ; v denotes the metabolic fluxes that include both intracellular fluxes and exchange rates; A is the stoichiometric matrix of the metabolic network; and v LB (s) and v UB (s) are the lower and upper bounds on the fluxes, respectively, which are functions of the extracellular concentrations.The vector function f, specified by the set of mass balances in the extracellular medium, defines the rate of change of each component of s and must be integrated to determine the time evolution of extracellular concentrations.The scalar function h is the cellular objective that is maximized by the cells.Whenever more than one microbial species are present in the culture, then multiple flux balance models of the form (2) must be incorporated into (1) [10].
DFBA models can be classified as ordinary differential equations with embedded optimization wherein the lower-level FBA optimization can either be a linear or nonlinear program [30].A variety of methods have been developed for integrating DFBA models, which are summarized in S1 Text.We focus on the direct approach for integrating DFBA models in this work due to its ability to ensure accurate solutions through the use of error-controlled integration schemes.Another advantage of the direct approach is that a unique solution set to the FBA (2) can be obtained using lexicographic optimization [10,27], which may help overcome numerical challenges that can occur when using alternative DFBA simulators (e.g., see [31,Chapter 3]).Since the direct approach requires continuous monitoring and identification of any active set changes in (2), it constitutes a dynamic simulation with discrete events (i.e., a hybrid system).In the next section, we present the proposed nsPCE method that is capable of directly accounting for the hybrid nature of DFBA models.

Polynomial chaos expansions
Theoretical background.We consider a DFBA model with a set of M input parameters that are denoted by x = (x 1 , . .., x M ).These parameters can appear in the initial conditions s 0 , rate of change function f, cellular objective h, and/or the flux limits v LB , v UB .We look to develop a computationally cheap-to-evaluate representation of some output of the DFBA model referred to as the model response.The model response y ¼ MðxÞ can be any chosen function of the states or fluxes that appear in (1) and (2) including, for example, metabolite concentrations, growth rate, or time-to-consumption of any metabolite.The model response function M : R M !R need not be known analytically, and can be approximated using a finite number of model evaluations.We focus on the scalar response case y for notational simplicity.However, the developed procedure can be easily applied separately to each component of a vector of responses y.
Unless the parameters x are perfectly known, they must be treated as uncertain.Parameter uncertainty can generally be represented by a random vector X with some known probability density function (PDF).In this case, the model response also becomes a random variable with some unknown PDF that is implicitly defined by where * denotes "distributed as" and f X denotes the PDF of uncertain parameters.Determining the distribution f Y (or its statistical moments) of the model response represents the forward UQ problem that can be tackled in various ways, the majority of which require extensive sampling that is not feasible whenever M is a computationally expensive model.
where a α 2 R are coefficients of the expansion, C α : R M !R are multivariate polynomials, α = (α 1 , . .., α M ) is a multi-index that identifies the degree of the multivariate polynomials in each of the input parameters X i , and N ¼ f1; 2; . ..g is the set of positive integers.The polynomial basis functions are required to be orthonormal with respect to the parameter distribution, such that they satisfy where S is the support of the distribution of X and δ αβ is the Kronecker delta that is 1 whenever α = β and 0 otherwise.For computational purposes, the series (4) must be truncated after a finite number of P terms, which yields the following approximation where A is a finite set of multi-indices with cardinality equal to P, a 2 R P is a vector of the coefficients, and Ψ : R M !R P is a vector containing all polynomial basis functions.The expansion coefficients are defined to be those that minimize the mean-square error (MSE) between the exact representation (4) and the truncated PCE (6) The right-hand side of this expression represents the analytic solution to the MSE optimization problem and directly follows from the Hilbert projection theorem [32].The expressions in ( 5) and ( 7) involve multivariate integration over complicated nonlinear functions.As such, the construction of the polynomial basis and computation of the expansion coefficients are usually carried out numerically in practice, which leads to additional sources of error.The choice of A also plays an important role in PCE performance because A directly controls the number of coefficients that must be estimated.Larger P values require more computational effort and are more susceptible to numerical sources of error.An overview of state-of-the-art methods for addressing these challenges is provided next.
Orthonormal basis construction.The complexity of determining the polynomials fC α ðXÞg α2A depends fully on the structure of the PDF f X .Whenever the uncertain parameters are statistically independent, then (5) reduces to the tensor product of M univariate polynomials that are orthonormal with respect to each marginal density f X i .These polynomials have been analytically derived for many common PDFs [17], and can be found numerically for generic PDFs using algorithms in terms of the three-term recurrence relationship for orthogonal polynomials [33].There are two main approaches for handling the more general case that X has statistically dependent (or correlated) elements.The first approach involves transforming the generic random vector X into a standard random vector Z for which it is simpler to build the polynomial basis functions [34].Any isoprobabilistic transformation that preserves the PDFs of these random vectors can be utilized, though the most commonly used is the Rosenblatt transformation [35].The second approach involves applying a more sophisticated numerical procedure that is able to impose the conditions in (5) simultaneously in M dimensions.This includes the Gram-Schmidt process [36] as well as the modified Cholesky decomposition of the Gram moment matrix [37,38].
Sparse truncation and regression.We denote the approximate PCE with numerically estimated coefficients â as follows A variety of methods have been proposed for estimating the coefficients that can be broadly categorized as intrusive (e.g., Galerkin projection [12]) or non-intrusive (e.g., pseudo-spectral projection [39] or regression [15]).Here, we focus exclusively on non-intrusive methods.The phrase "non-intrusive" implies that coefficient estimates are obtained over a finite set of parameter realizations X ¼ fx ð1Þ ; . . .; x ðNÞ g, referred to as the experimental design (ED).These samples can be chosen in various ways including Monte Carlo sampling, quasi-random samples derived from Sobol or Halton sequences, or sparse grids to name a few [40].The computational model is then evaluated at every point in the ED, i.e., Y ¼ fy ð1Þ ; . . .; y ðNÞ g with y ðiÞ ¼ Mðx ðiÞ Þ for all i = 1, . .., N. As such, non-intrusive approaches are "black-box" in the sense that they can be applied to any function, even when this function is not explicitly known, and do not require any modification to the deterministic solver.We will focus on regression methods due to their flexibility when it comes to enforcing sparsity.In the regression approach, coefficients â are defined as those that minimize the leastsquare residual of the polynomial approximation over the ED X â ¼ argmin where A 2 R N�P is the model matrix that contains the values of all polynomial basis functions evaluated at all ED points.The solution of (9) requires a minimum number of sample points N � P to ensure a unique solution exists.Since every sample requires an expensive DFBA simulation here, the truncation scheme plays a central role in reducing the complexity of surrogate model construction.The total degree method is the most commonly used approach for specifying A, which looks to keep all polynomials up to a specified order p in the series.For total degree truncation, the set of multi-indices is defined as A ¼ fα 2 N M : kαk 1 � pg, where kαk 1 = α 1 + � � � + α M and P ¼ ðMþpÞ!M!p! .Due to the sharp increase in P as the polynomial order increases, the total degree truncation scheme can quickly lead to a prohibitive number of model evaluations, especially in high dimensions.This issue is often termed the curse-ofdimensionality, which is known to considerably limit standard PCE methods.
We look to take advantage of two approaches for overcoming the curse-of-dimensionality limitation.The first approach involves replacing the total order truncation with the so-called hyperbolic truncation scheme, which is defined as where 0 < q � 1. Lower values for q limit the number of high-order interaction terms considered, which directly lead to sparser solutions.The second approach looks to further sparsify the solution, without sacrificing potentially important interaction terms, by including a regularization term of the form lkãk 1 with λ � 0 in the least-squares problem (9).This regularization term is known to force the minimization to favor low-rank solutions and ensures the existence of a unique solution even when N < P.
The key challenge with regularization is a proper choice of λ, which indirectly specifies the number of non-zero coefficients included in the expansion.In this work, we use the hybrid least angle regression (LAR) method to solve the regularized version of (9).LAR is an efficient procedure for variable selection, which aims to select the predictors (i.e., polynomials C α ) that have the greatest impact on the model response among a potentially large set of candidates [41].Hybrid LAR is a variant of the original LAR that uses a modified cross-validation scheme to estimate the approximation error [19].This modification relies on only a single call to the LAR procedure, which provides significant savings in computational cost when compared to the original method.The relative MSE (RMSE), which is defined as ε = MSE/Var{Y}, is the natural choice of the approximation error in PCE and can be robustly estimated by the leaveone-out (LOO) cross-validation error ε LOO .Not only can ε LOO be calculated analytically for PCE models [42], but it is known to be much less sensitive to overfitting than the empirical estimator [43].
Provided a sensible sampling strategy has been chosen, the remaining parameters that must be selected are related to truncation p and q and the ED size N.We use a systematic procedure for selecting these parameters to achieve a target error level ε target .As discussed in [19], a basisadaptive strategy can help overcome potential limitations of an a priori fixed truncation set A by letting the maximum degree be driven directly from the data.The basic idea is to start with small values for p and q, estimate the coefficients using hybrid LAR, and calculate ε LOO .These steps are repeated for incremented values of p and q, and the algorithm returns the PCE model with the lowest error.Early stop criteria can easily be introduced to avoid an excessive number of iterations.However, when dealing with computationally expensive models, the number of model evaluations N dominates the cost of construction of the surrogate model.We therefore propose an iterative "greedy" approach for constructing the ED to ensure that N can be kept as small as possible.This sequential ED strategy can be summarized as 1.Initialize the current ED with a relatively small number of samples N init .
2. Train a sparse basis-adaptive PCE using the current ED and calculate ε LOO .
3. If ε LOO < ε target , stop the algorithm and return current PCE.Otherwise, enrich the current ED with N add more samples and return to Step 2.
Note that any method can be used in the training step of this algorithm.Thus, in the proposed nsPCE method, the desired accuracy level is the key parameter that must chosen by the user.

The nsPCE surrogate modeling method
The PCE method is guaranteed to converge as both the number of model evaluations N and number of terms in the expansion P increase; however, the rate of convergence can be very slow whenever M exhibits any singularities [24].This is a primary challenge in DFBA models since they can lose differentiability when a switch in the active set of the FBA problem (2) occurs.Inspired by [25], we look to take advantage of the following multi-element representation of PCE as it is capable of capturing non-smooth behavior where N e denotes the number of elements; S k , a k,α , and C k,α denote the local support, coefficient, and orthogonal polynomials in element k, respectively; S ¼ S N e k¼1 S k ; and I S k ðXÞ are indicator random variables defined by The indicator random variables can be used to define the following conditional random variables The local polynomials in (11) are orthogonal with respect to X k while the coefficients are similarly defined as in ( 7) but now in terms of X k .This implies that the same strategies discussed above for building the polynomials, estimating the coefficients using regularized least squares, truncating the expansion, and sequentially populating the ED can be utilized locally within each element.The remaining unanswered question is how to design the elements fS k g N e k¼1 to limit the growth in the number of model evaluations since N will scale approximately linearly with N e .The best decomposition should ensure that the model response behaves smoothly in every element.The proposed nsPCE method decomposes the support into two elements S 1 and S 2 that denote, respectively, the set of parameters for which the singularity has not and has occurred.This idea is best illustrated through a simple example.Consider the following non-smooth ODE system _ y ¼ À x if y > 0 and _ y ¼ 0 otherwise with initial condition y 0 > 0, whose solution is given by yðt; xÞ ¼ This function is not differentiable at the time point t s (x) = y 0 /x, which can be thought of as the "singularity manifold" in the parameter support space, i.e., t s is the boundary function that separates S 1 and S 2 .At any given time of interest t, the two elements can be defined in terms of t s (x) as follows S 1 ðtÞ ¼ fx 2 S : t s ðxÞ > tg; S 2 ðtÞ ¼ fx 2 S : t s ðxÞ � tg: ð15Þ Let us briefly analyze the behavior of these elements.The elements are continuous functions of time, meaning that every time of interest t requires a different decomposition.Whenever t is outside of the support of t s (X), then one of these sets is empty and we revert back to traditional PCE that covers the full support S. In light of this, we can easily generalize the idea to the case of multiple n s > 1 sequential singularities as long as the random variables ft s i ðXÞg n s i¼1 do not have overlapping supports.When multiple non-overlapping singularities are present, we must simply find the support in which t lies and define the two elements using that corresponding boundary function.The case of overlapping supports is more challenging due to the fact that more elements would need to be created based on the intersection of S 1 and S 2 for all active singularities.For the simple scalar example in ( 14), we can analytically derive the boundary function; however, this is not generally possible in DFBA models.Based on the observation that the singularity boundary depends smoothly on the parameters, we instead propose to construct a sparse PCE model to approximate the boundary in multiple dimensions, i.e., t s � tPCE s .The nsPCE method thus creates a surrogate model with the following structure for any where the coefficients âk are estimated from the sparse regression problem based on the local ED X k ¼ fx ð1Þ k ; . . .; g in terms of N k samples.Notice that the full DFBA model must be integrated when constructing tPCE s .Instead of discarding this information, it can be reused by storing the list of state and time points generated when integrating the DFBA model and then interpolating these points when calculating the model response function.Thus, we can use this approach to initialize the ED X, model response data Y, and singularity time data T s .Using T s along with the set definitions in (15), we can easily partition X and Y into the required local EDs.The sequential ED strategy is then applied in each element to ensure that the target error is achieved.
A flowchart that summarizes the main steps of the nsPCE method is shown in Fig 1 .By evaluating the nsPCE surrogate in (16), which is much cheaper to evaluate than the full model M, on a collection of Monte Carlo samples of the parameters, we can directly approximate statistical properties of Y including moments, parametric sensitivities, or even its full distribution.

Numerical implementation
The complete set of Matlab scripts that implement the nsPCE method is available at [29].All of the modifiable parameters in the algorithm are defined in the "User inputs" section of the main_pce.mscript, which automatically executes the steps summarized in Fig 1 .It is important to note that the scripts require the installation of two additional packages that integrate the DFBA model and construct sparse PCE models.The nsPCE scripts are written to be compatible with readily available DFBA and PCE toolboxes to provide flexibility.The simulation of DFBA models can be done with any non-smooth integration code including COBRA [44], ORCA [45], and DFBAlab [10].All files needed by the DFBA integrator should be placed in the dfba_model folder.We opt for DFBAlab in this work due to certain numerical advantages that it exhibits over the available alternatives (see [27,31] for more details).The sparse PCE operations are carried out using UQLab [43], which implements the hybrid LAR method as well as the required calculations to determine the cross-validation error ε LOO .The syntax in main_pce.m is heavily based on UQLab.Hence, some modifications to the source code may be needed to perform the same operations with other toolboxes.

Results
We present two separate case studies in this section.The first case study explores Bayesian estimation of six parameters related to the substrate uptake kinetics in a computationally expensive DFBA model of E. coli with a genome-scale metabolic network.The goal of the first case study is to demonstrate advantages of the proposed nsPCE method over alternatives as well as its application to a realistic problem that has been previously studied in the literature.The second case study focuses on maximum a posteriori estimation in a synthetic DFBA problem with a relatively large number of parameters, i.e., twenty uncertain parameters appearing in a variety of intracellular and extracellular quantities.The goal of the second case study is to provide preliminary evidence of the scalability of nsPCE as well as the fact that the method is applicable to a wide-variety of UQ applications.

Case study 1: Batch fermentation of E. coli monoculture
This case study is based on a DFBA model of a batch fermentation reactor consisting of an E. coli monoculture, which has been investigated for the production of valuable chemicals such as ethanol.Here, we focus on the initial phase of batch operation of the E. coli fermentation reactor under aerobic growth in a glucose and xylose mixed media [8].No ethanol production is observed under aerobic conditions (i.e., this phase is mainly used to increase the biomass), such that the concentration of ethanol can be omitted from the dynamics.This case study is commonly used as a benchmark for comparing DFBA solvers (see, e.g., [16,27,31]), as it exhibits stiff dynamics and multiple singularities.
The dynamic mass balance equations of the form (1) for the extracelluar environment can be summarized as follows _ bðtÞ ¼ mðtÞbðtÞ; _ g ðtÞ ¼ À u g ðtÞbðtÞ; where b(t), g(t), and z(t) denote the biomass, glucose, and xlyose concentrations at time t, respectively.The uptake kinetics for glucose, xylose, and oxygen are given by Michaelis-Menten kinetics u g ðtÞ ¼ u g;max gðtÞ K g þ gðtÞ ; where parameters u g,max , u z,max , u o,max , K g , K z , K o , and K ig correspond to the maximum substrate uptake rates, saturation constants, and inhibition constants.It is assumed that the reactor oxygen concentration, o(t), is controlled and is therefore constant.The growth rate μ(t), on the other hand, is determined from the metabolic network model of wild-type E. coli.The chosen metabolic network reconstruction was iJR904 [28], which contains 1075 reactions and 761 metabolites.The cells are assumed to maximize growth, implying ( 2) is an LP of the form where c is a vector of weights that represent the contribution of each flux to biomass formation while v g ext , v z ext , and v o ext are, respectively, the exchange fluxes for glucose, xylose, and oxygen (i.e., elements of the flux vector v).Thus, the metabolic network interacts with the extracellular environment through the exchange fluxes in (19).The initial conditions of the batch are assumed to be fixed at 0.03 g/L of inoculum, 15.5 g/L of glucose, and 8 g/L of xylose; the oxygen concentration is kept constant at 0.24 mmol/L; and A, c, v LB , and v UB are specified by the iJR904 model.However, the parameters in the substrate uptake rates (19) should be fit to experimental data since they cannot be easily predicted from first principles.This problem of identifying the model parameters was partially tackled in [8], where most of the parameters were fixed according to estimates provided in the literature while u z,max and K ig were adjusted by trial-and-error to match transient measurements of biomass, glucose, and xylose.The reported parameter estimates are given in S1 Table .Since o(t) is fixed, u o,max and K o can be lumped into a single parameter u o .These six parameters are unknown and here are modeled as a random vector whose elements are independent and uniformly distributed around ±10% of the nominal values.We selected this range to reflect a reasonable level of confidence in the reported literature values.In the following, we demonstrate how the proposed nsPCE surrogate modeling method can facilitate UQ tasks that are otherwise computationally intractable with respect to the full DFBA model.
All reported computations are performed in MATLAB R2016a on a MacBook Pro with 8 GB of RAM and a 2.6 GHz Intel i5 processor.The DFBA model is simulated using DFBAlab with default options for integration and LP optimization tolerances.CPLEX was used as the LP solver and MATLAB ode15s was used as the integrator.
Decomposition of parameter space.Before selecting the element decomposition, we must first simulate the DFBA model to locate any significant singularities.The extracellular glucose, xylose, and biomass concentration profiles are plotted in Fig 2 for one hundred randomly sampled parameter values.For a given realization of the parameter, the full simulation requires approximately 1.5 seconds of CPU time.
At the start of the batch, glucose is consumed preferentially over xylose.Once glucose has been depleted, the LP solution switches and xylose becomes the main carbon source.The final batch time is then specified as the time that both glucose and xylose have been fully depleted, at which point the LP becomes infeasible and the solution ceases to exist.The E. coli cells stop growing at this point due to the lack of a carbon source.Although physically the cells would begin to die in this situation, DFBA models cannot directly predict the cell death phase and thus we assume the biomass remains constant for simplicity.The time-to-consumption of glucose t g and xylose t z represent the two singularities in this problem, and clearly depend on the value of the model parameters.Since the singularity time functions cannot be derived analytically, we look to construct PCE approximations for both t g and t z .We investigate two different fitting methods: classical full PCE with coefficients estimated using ordinary least squares (OLS) and sparse basis-adaptive PCE with coefficients estimated using hybrid LAR.The degree of the polynomials is varied from 1 to 6 in the full PCE method, where N = 2P model evaluations are used for regression with P denoting the size of the basis.In the sparse PCE method, the maximum degree is allowed to vary from 1 to 20, and a hyperbolic truncation scheme (10) is used with q = 0.75.The experimental designs (EDs) are generated using Monte Carlo (MC) sampling with a fixed random seed to ensure repeatable results.Fig 3a and 3b show the RMSE as a function of the number of model evaluations used to fit surrogate models for t g and t z , respectively.The sparse PCE method consistently outperforms full PCE, achieving approximately an order-of-magnitude lower RMSE for all ED sizes.
The sparse PCE surrogate models for t g and t z are used in the nsPCE method to build surrogates for the extracellular concentrations.Additionally, these surrogate models contain useful information on which parameters influence the consumption of different substrates.The Sobol' indices of t g (X) and t z (X) are shown in Fig 4, which are a commonly used tool in global sensitivity analysis for ranking the parameters according to their contribution to the variance of the model response.The Sobol' indices can be computed analytically from the PCE coefficients [43], which requires less than one second of CPU time here.It is interesting to note that u g,max and u o mainly contribute to the variance of t g (X), while u g,max , u z,max , and u o are the significant contributors to the variance of t z (X).
The surrogate models can also be used to estimate the PDF of t g (X) and t z (X), as shown in Fig 4 .From the estimated PDFs, we find that t g (X) ranges from approximately 6.31 to 7.87 hr, whereas t z (X) ranges from approximately 7.33 to 9.12 hr.This suggests that the model response is a non-smooth function of X 2 S for any t 2 [6.31, 9.12] hr, so that we must split S into two disjoint regions according to 15.Since the supports of t g (X) and t z (X) partially overlap for any t 2 [7.33, 7.87] hr, additional elements should be introduced to ensure the model response is smooth.However, for times outside of this window, we can exclusively define the elements of the parameter space in terms of t g for times before 7.33 hr and t z for times after 7.87 hr.Plots of these two regions at times 6.5, 7.0, and 7.25 hr projected onto the two most sensitive For comparison purposes, we also show the decision boundary in green (along with 95% confidence limits with dashed green lines) learned from a support vector machine (SVM) binary classifier [46] that was trained using the same 500 data points.The SVM model is unable to capture the significant nonlinear behavior of the boundary as it evolves over time.Thus, SVM results in relatively large misclassification errors due to the limited training data.The sparse PCE model, however, is able to accurately represent the t g function over the full support (see the parity plot in Fig 5d ), which leads to a much more accurate representation of these two elements using limited data.
The "true" RMSE values reported in Fig 3 were estimated using a large validation set that consisted of 10,000 evaluations of the full DFBA model, which required over 3 hours of CPU time.Ideally, these additional model evaluations could be avoided by directly estimating the RMSE from the ED either empirically or using cross-validation techniques.The empirical estimate of the RMSE is based on sample-based approximations to the integral expressions for mean and variance.Cross-validation obtains a more robust RMSE estimate by splitting the ED into various training and validation sets, fitting different models with each training set, and averaging the prediction error of each model.We focus exclusively on ε LOO in this work.Table 1 gives the estimated RMSE values for the sparse PCE surrogate models fit using different ED sizes.We observe that the empirical estimator greatly underpredicts the "true" RMSE found from the large validation set.In fact, for the smallest size N = 10, the empirical estimate   The validation error is computed using a large set of samples not used in the fitting procedure.Cross-validation and empirical error, however, are computed using only points in the original experimental design.Cross-validation partitions the experimental design into various training and validation sets such that multiple models can be fit and their prediction errors averaged in order to compute more robust error estimates than its empirical counterpart.
Here, a leave-one-out cross-validation procedure is utilized. https://doi.org/10.1371/journal.pcbi.1007308.t001 Validation of nsPCE surrogate models.We have verified that the PCE surrogate models are able to accurately represent the singularity manifold that leads to non-smooth behavior in the states of the DFBA model.Thus, they can be used to build nsPCE surrogates for the extracellular concentrations based on the algorithm summarized in Fig 1 .We choose three quantities of interest for illustrative purposes: glucose at time 7.0 hr, xylose at time 8.0 hr, and biomass at time 8.0 hr.We look to compare so-called global PCE to the proposed nsPCE method for these three quantities of interest.In global PCE, a single surrogate model is constructed over the full parameter support, while nsPCE systematically breaks down the support into two disjoint elements using the singularity time function as a dividing boundary.To ensure a fair comparison, the expansion coefficients of both global PCE and nsPCE are estimated using the basis-adaptive hybrid LAR strategy with maximum degree varying from 1 to 20 and q = 0.75 in the hyperbolic truncation scheme (10).In addition, the ED in both approaches are sequentially enriched using MC sampling with a fixed random seed.To simplify the construction of the polynomial basis functions when training the nsPCE surrogate models, the elements S 1 and S 2 were outerbounded with hyper-rectangles.However, only parameter values that explicitly fall within these sets are incorporated into the local ED.This simple approach for dealing with elements of any shape is currently used in the provided scripts [29], but other ways of dealing with generic elements can also be explored.
The convergence properties of the nsPCE surrogate models for the three quantities of interest are compared to that of global PCE in Fig 6 .The nsPCE surrogates achieve significantly lower RMSE than the global PCE surrogates in virtually all cases considered, while requiring many fewer samples to converge to the target error level.In addition, global PCE saturates at the maximum number of ED samples for all three quantities of interest.This implies that global PCE is unable to achieve the desired accuracy levels, whereas nsPCE only saturates for the lowest target error of xylose.This behavior is expected since the convergence rate of global PCE is known to be substantially lowered whenever singularities are present in the model response function.Thus, nsPCE is able to significantly improve the rate of convergence based on a properly chosen elemental decomposition of the parameter support.To show that lower target error levels translate to improved predictions, parity plots for the three quantities of interest are shown in Fig 7. Note that global PCE has large prediction errors for particular values of the parameters (see the blue dots that largely deviate from the y = x line), which is likely due to the fact that an inherently non-smooth function is being represented by smooth polynomials.This is highly undesirable when using the PCE to predict specific response values, as opposed to predicting statistical quantities that average over the response values where individual points are not as important.The nsPCE surrogate models clearly mitigate this limitation of global PCE in a significant way since there are no outlier predictions in the set of 10,000 validation points.
Bayesian parameter inference.Here, we focus on the inverse UQ problem of estimating parameters from data, which can be greatly accelerated using nsPCE.The same data set used in [8] is utilized, which includes measurements of the extracellular biomass, glucose, and xylose concentrations at t 2 {5.5, 6.0, 6.5, 7.0, 7.25, 8.0, 8.25, 8.5} hr.The measurements are corrupted with noise  Given a set of measurements, the change in the state of information about the parameters is given by Bayes' rule [11] f XjD ðxjdÞ ¼ where f X|D is the posterior density; f D|X is the likelihood function; f X is the prior density; and f D is the evidence.As Bayesian inference looks to characterize the full posterior density, it directly provides an explicit representation of the uncertainty in the parameter estimates.The prior and likelihood function must be specified before solving (23).We assume the same uniform priors as those used to construct the nsPCE surrogate models, though these can differ in general.The likelihood function describes the discrepancy between the observed data and the model predictions in a probabilistic way.The likelihood function is specified by the data and noise models in (21) and (22), and is given by  Although we use a Gaussian likelihood here, the same Bayesian estimation approach can be applied to any choice of likelihood function and thus can be easily modified to incorporate other potentially important factors including sensor bias or asymmetric noise.Since ( 23) cannot be solved analytically, we must resort to sample-based approximations that rely on generating samples from the target posterior distribution.A variety of methods have been developed for sampling from the unknown posterior f X|D , including Markov Chain Monte Carlo (MCMC) [47][48][49] and sequential Monte Carlo (SMC) [50][51][52] algorithms.The proposed surrogate models can be used to accelerate any sampling-based method; however, we focus on SMC since this is a class of algorithms that can be made fully parallelized.SMC is based on the concept of importance sampling, which can be implemented in an iterative fashion such that the posterior is updated every time a new measurement becomes available.For a given number of particles N p , the SMC approximation to ( 23) can be summarized as follows: When the algorithm stops at time k f , the set of N p particles targets the posterior distribution of interest.We use systematic resampling in Step 3 due to its computational simplicity and good empirical performance, though a variety of other methods are available [50].
Step 2 is usually the computational bottleneck because the model must be repeatedly solved in order to evaluate the likelihood weight factors using (24).Therefore, we propose to replace the evaluation of v(t i ; x) with a nsPCE surrogate model v nsPCE (t i ; x) for every v 2 {b, g, z} and i = 1, . .., 8.We must then construct a total of 24 surrogates before running the SMC algorithm.The same basic strategy described in the previous section is used for constructing all 24 of the nsPCE surrogate models.Similarly to how the samples for the singularity time are used to initialize the ED in each element, we can store the list of state and time points generated when integrating the DFBA model and interpolate these points to calculate the extracellular concentrations at every time point of interest.By keeping a working ED that is used to initialize each element at every time point, we can greatly limit the number of expensive DFBA simulations that represent the computational bottleneck in SMC.The proposed algorithm in Fig 1 is run with a target error of ε target = 10 −3 , 250 initial ED samples, 10 ED samples added at each iteration, 2500 maximum ED samples, maximum degree varying from 1 to 20, and hyperbolic truncation with q = 0.75.The algorithm converged with cross-validated errors ε LOO below the desired tolerance using only a total of 1200 DFBA simulations to train all 24 nsPCE surrogate models.The basis-adaptive hybrid LAR method consistently estimated coefficients in less than 30 seconds, verifying that the DFBA simulations are the dominant cost in this case study.The validation RMSE values are summarized in S2 Table, which are all below the target error threshold.
Fig 8 shows the posterior density estimated using SMC with N p = 1 × 10 6 particles for a synthetic data set, where the likelihood weights are evaluated using the inexpensive nsPCE surrogate models.The synthetic data ('x' marks in S1 To verify that the SMC algorithm approximately converged with this many particles, we performed 10 separate bootstrap runs that produced a set of very similar posterior densities.Note that a discussion on challenges and open issues in Bayesian estimation is provided in the Discussion section.The SMC code is provided in the main_smc.mscript in [29]. The estimated posterior density in Fig 8 provides interesting physical insights.Three of the parameters (K g , K z , K ig ) are unobservable with the current data set since their posterior (blue) and prior (green) densities are equivalent.This observation could not be easily made before running the estimation procedure due to the nonlinear and indirect relationship between D and X.A change in the experimental conditions such as the initial conditions, controlled oxygen concentration, or substrate feed profiles can enhance the sensitivity of the data to parameters (K g , K z , K ig ).For example, running the batch at low glucose concentrations g(t) � K g results in a glucose uptake rate u g ðtÞ � u g;max gðtÞ K g that is a strong function of K g , whereas running the batch at high glucose concentrations (as done in this case study) produces a nearly constant uptake rate u g (t) � u g,max that is independent of K g .Although the data is sensitive to (u g,max , u z,max , u o ), these parameters are highly correlated as seen in the off-diagonal plots of their joint densities in Fig 8 .Thus, the currently available data from one single batch is insufficient for accurately estimating all the parameters of interest.The evolution of the marginal posterior densities of the observable parameters over time is shown in Fig 9 .Since glucose is mostly consumed by 7.25 hr, the densities of u g,max and u o remain constant for the remaining batch time.The density of u z,max , however, is constant before 7.25 hr because xylose remains mostly at its initial condition.Forward uncertainty propagation.Let Y ¼ MðXÞ denote the vector of all model responses.The forward UQ problem looks to characterize the uncertainty in the model predictions by propagating uncertainty in the parameters through M.This can involve estimating either the prior predictive distribution f Y (before any data has been collected), or the posterior predictive distribution f Y|D (after data has been obtained).The only difference between these two problems is that M is evaluated at i.i.d.samples drawn from the prior in the former and the posterior in the latter.The densities of the model predictions estimated using 1 × 10 6 samples are shown in Fig 10.By replacing the full DFBA model with the nsPCE surrogate model, these histograms were obtained in less than 1 minute of CPU time.As expected, the prior predictive distributions are much wider than the posterior predictive distributions, indicating there is significant uncertainty in the predictions before incorporating data.In addition, we see that many of these distributions have sharp changes and long tails due to the non-smooth behavior of the model responses, which can be accurately captured with the proposed nsPCE framework.It is also interesting to note that the posterior predictive distributions have low variance, even though the parameters are not perfectly estimated.This highlights the impact that nonlinearity can have on both estimation and uncertainty propagation.

Case study 2: Synthetic metabolic network
This case study is based on a synthetic metabolic network originally introduced in [31, Chapter 8].The goal of this case study is to show that the proposed nsPCE method can be applied to problems with a larger number of parameters as well as alternative UQ approaches.The synthetic metabolic network consumes a carbon source C, a nitrogen source N, and an oxygen source O to produce lipids L, ethanol E, biomass X, ATP, and some oxidation product COX.Although used for illustrative purposes, this network is meant to mimic the behavior of living organisms in the sense that: (i) E can only be produced in the absence of O, (ii) L can only be accumulated in the absence of N, (iii) there is a minimum ATP requirement, and (iv) the aerobic oxidation of C produces more energy than where the subscript ex denotes extracellular metabolites and all of the reactions are assumed to be unidirectional.The unknown stoichiometric coefficients are denoted by S i,j , where i represents the metabolite name and j represents the reaction name.The dynamic mass balance equations for the extracellular environment are given by where α is a penalty state that remains equal to zero until the state trajectories become infeasible (e.g., when all of the metabolites are depleted).A detailed discussion on how to determine the instantaneous penalty value γ is provided in [10], which is automatically computed in DFBAlab.We assume that the uptake kinetics are given by the following expressions where s = (X, C, N, O, L, E, COX, α) is the vector of extracellular species.A hierarchical set of objectives is used in the FBA problem (2) to ensure that unique reaction fluxes are obtained (see S3 Table ).A total of twenty parameters in this DFBA model, appearing in both intracellular and extracellular quantities, are assumed to be uniformly distributed between upper and lower bounds summarized in S4 Table .Global sensitivity analysis.To locate any possible singularities, we first simulate the DFBA model with randomly sampled parameter values.The results are shown in Fig 11 wherein we see that the penalty state becomes positive α(t) > 0 once all substrates are depleted, which introduces a strong discontinuity into the state profiles.Even though this is a considerably smaller metabolic network than the one considered in the E. coli case study, it still takes approximately 0.5 seconds of CPU time per realization of the parameter.Thus, it is still advantageous to construct a surrogate model to speedup both forward and inverse UQ problems.
We look to run the proposed nsPCE method (see Fig 1) using the time that the penalty state switches from zero to positive as the singularity time.As suggested in [31, Chapter 8], we consider the seven substrate and product concentrations (X, C, N, O, L, E, COX) at four time points t 2 {10, 20, 30, 40} hr as our main quantities of interest.The nsPCE method was applied in the same manner as described in the previous case study.Here, we specified a target error of ε target = 10 −3 , 100 initial ED samples, 100 ED samples added at each iteration, 1500 maximum ED samples, the maximum polynomial degree could vary from 1 to 30, and a hyperbolic truncation scheme with q = 0.6.The algorithm converged using a total of 2800 DFBA simulations.The resulting parity plots are shown in S2 Fig, which all have empirical RMSE values significantly below the target error.To further assess the accuracy of these models, we calculated the RMSE using 1000 additional samples that were not used during the training process.The validation RMSE averaged over the 28 models was found to be 8.1 × 10 −4 .Only 6 of the 28 surrogates had RMSE values slightly above the target, with the largest overall RMSE being 3.5 × 10 −3 , indicating that the surrogates are reasonably accurate representations of the original model.Note that these errors can be refined by specifying a lower ε target at the cost of more DFBA simulations.
Once the nsPCE surrogate models are constructed, they can be used to efficiently perform global sensitivity analysis in order to quantify the respective effects of each individual parameter on the variance of the model response.Although many sensitivity measures exist, we use Sobol' indices since they make no assumption on the underlying linearity or monotonicity of the model.The global sensitivity results for the various quantities of interest over time are shown in Fig 12 .A variety of interesting conclusions can be drawn from these results.For example, the model appears to be insensitive to K C and K iE , which is likely due to the fact that the batch was run at high carbon and low ethanol concentrations.In addition, the measurements of carbon, nitrogen, and oxygen are highly sensitive to their respective initial conditions C 0 , N 0 , and O 0 at the first measurement time of 10 hr; however, this sensitivity drops considerably as time evolves.We emphasize that obtaining such insights using random sampling on the full model can be prohibitively expensive, but requires negligible cost using the nsPCE surrogate models.
Optimization-based parameter estimation.We now utilize maximum a posteriori (MAP) estimation to estimate the unknown model parameters from synthetically-generated experimental data (see S5 Table ).The MAP estimate is defined as the mode of the posterior distribution, and can be stated directly as an optimization problem of the form [53] xMAP ðdÞ ¼ argmax where the prior acts as a regularization term that can stabilize the solution whenever the parameters cannot be uniquely inferred from the available data [54].We consider a Gaussian likelihood, with noise standard deviations reported in S5 Table, and a Gaussian prior whose mean is equal to the midpoint of the bounds in S4 Table and standard deviations equal to 10% of the mean values.Under the Gaussian restrictions, we can convert the MAP problem to the minimization of a regularized weighted least squares objective by applying a negative log transformation.We solved the optimization (28) using both the full DFBA and nsPCE surrogate models in order to assess the computational gains afforded by the nsPCE method.To ensure a fair comparison, we solved both of these MAP problems in Matlab using the non-smooth optimizer SolvOpt [55] with default parameter settings and the mean of the prior as the initial guess.The algorithm took approximately 2.5 hr to converge when using the full DFBA model, which was substantially reduced to less than 2 minutes (i.e., a factor of 60) when the full model was replaced with the nsPCE surrogates.Not only did the use of the nsPCE surrogate models accelerate the optimization, it also produced a solution with a lower overall objective function value.The objective improved from 471.73 to 1.56 when using the surrogate models as compared to 63.57 when using the full DFBA model.Convergence to a suboptimal solution is likely a consequence of numerical issues related to the stability of derivative approximation using finite difference in DFBA models, which were also observed in [31,Chapter 8].On the other hand, since the nsPCE surrogate models are defined in terms of simple polynomial functions, the finite difference derivative approximation seems to produce more stable iterations towards the solution of the MAP problem, at least in this particular case study.The predictions of the DFBA model under the MAP estimates found using the nsPCE surrogates are shown in Fig 13.We see that the predictions using the posterior parameter estimates very closely match the observed data, which is a large improvement when compared to the predictions based on the prior parameter estimates.Posterior distribution analysis.The MAP estimation (28) determines the parameters that maximize the posterior density.However, we need a characterization of the entire posterior to assess uncertainty in these estimates.Here, we use a Laplace approximation of the posterior density, which is based on a second-order Taylor series of −log(f X|D (x|d)) around the MAP estimate [56].As shown in [57], this leads to a Gaussian approximation of the posterior whose mean is equal to the MAP estimate and whose covariance is defined in terms of the model response sensitives.The approximated posterior marginal densities and 95% confidence regions for the twenty MAP parameter estimates are shown in Fig 14 .As can be seen, the true (unknown) parameter values are contained within the reported confidence regions.We also see that the parameters with the highest global sensitivity indices (see Fig 12) are accurately estimated, whereas the parameters that have little-to-no sensitivity to the data have much wider variances that are similar to that of the prior.Lastly, we observe that physically-related parameters exhibit a significant degree of correlation including, for example, nitrogen uptake parameters v max,N and K N .It is worth noting that the surrogate models can also enable the use of more advanced methods for posterior characterization such as randomized MAP [58], which would require the repeated solution of (28) with randomly perturbed data.

Discussion
In this work, we develop a novel surrogate modeling method for handling the non-smooth nature of computationally expensive dynamic flux balance analysis (DFBA) models.It is shown that surrogate models can vastly accelerate uncertainty quantification (UQ) tasks, such as calibrating the model with experimental data (inverse problem) and quantifying confidence in the model predictions (forward problem).The proposed surrogate modeling method is based on an extension of polynomial chaos expansion (PCE), which we refer to as non-smooth PCE (nsPCE).The main idea behind nsPCE is to systematically partition the parameter space into two non-overlapping regions (or elements) on which the model response behaves smoothly.The nsPCE uses a model of the time that the singularity occurs in order to define the boundary between these two elements.State-of-the-art (i.e., sparse basis-adaptive) regression methods are used to estimate the coefficients of the expansions, such that the overall model response is approximated by a sparse piecewise polynomial function.
We demonstrate the advantages of the nsPCE surrogate modeling method on two separate case studies.The first case study is based on a DFBA model of an E. coli fermentation reactor under aerobic growth in a glucose and xylose mixed media.A genome-scale metabolic network reconstruction with 1075 reactions and 761 metabolites is used to represent the intracellular behavior, which results in an expensive-to-simulate DFBA model that is prohibitive for use in most UQ tasks.Thus, we illustrate how both inverse and forward UQ can be significantly accelerated using nsPCE surrogate models on this problem.In particular, we use a Bayesian estimation method to infer six uncertain parameters related to the substrate uptake kinetics from data.The posterior parameter distribution is estimated using sequential Monte Carlo with 1 × 10 6 samples, which would have required *17 days of CPU time to compute using the full DFBA model, but takes less than one hour when using the nsPCE surrogate models including the cost of training the models.The resulting posterior distribution yields significant physical insights including that the available data set is insufficient to reliably estimate all six parameters, with three of the parameters being non-identifiable under the current experimental conditions.We then demonstrate the scalability of the proposed nsPCE method on a synthetic metabolic network problem with twenty unknown parameters that are related to both intracellular and extracellular quantities.We estimate these parameters using maximum a posteriori (MAP) estimation, and observe that the cost of the optimization algorithm can be reduced by a factor of 60 when using the nsPCE surrogates in place of the full DFBA model.Note that the observed speedups are expected to be even greater for more complex DFBA models, such as those with nonlinear cellular objectives, multiple cultures, or even larger metabolic networks due to the increased cost of the simulations.

Scalability properties of nsPCE
The nsPCE method is specifically constructed to take advantage of the hybrid LAR method for sparse regression, which was originally developed in [19].As such, nsPCE directly inherits the beneficial scalability properties of hybrid LAR that introduces two sources of sparsity into the expansions: (i) low-rank truncation that discards basis terms that lead to high-order interaction of the parameters that are irrelevant in most engineering problems and (ii) regularized least squares is used to systematically add basis terms that are strongly correlated to the model response.Additionally, the risk of over-fitting the surrogate model to the available data set can be reduced even further by making the approach basis-adaptive, i.e., separate PCE models are fit for varying maximum degrees and the one with the lowest error is selected.
The basis-adaptive hybrid LAR approach has been successfully applied to a wide-variety of problems and has consistently shown the ability to greatly mitigate the curse-of-dimensionality that is inherent in traditional PCE methods (see, for example, [19,59,60]).To the best of our knowledge, [60] tackled the largest problem to-date, which is a hydrogeological model with 78 parameters (68 identified to be sensitive) that can be accurately represented using a sparse PCE trained using only 2000 model evaluations.Although uncertainty in high-dimensional DFBA models has not been explored in the literature, these promising results and those shown in the synthetic case study give some confidence that nsPCE may be able to scale to the sizes needed to solve these challenging problems.Note that very recent work has shown that sparse PCEs can be applied to ultrahigh-dimensional problems (on the order of 10 4 parameters) by incorporating a dimensionality reduction step before training the surrogate model [61].It may be possible to use similar approaches to incorporate uncertainty in the complete set of intracellular model parameters into the nsPCE surrogate models.These are interesting and important challenges that deserve further investigation.

Further reducing the number of model evaluations
In this work, the surrogate models are trained using experimental designs (EDs) populated with random samples of the parameters.Recent work has demonstrated that the number of ED points needed to achieve a desired accuracy level can be further reduced by maximizing the information content of the sample locations.Multiple approaches have been developed to tackle this challenging problem, including coherence-optimal sampling [62] and numerical "moment-matching" optimization [34,37].The optimal placement of samples in arbitrary domain shapes in a sequential fashion remains largely unexplored in the literature.
Additionally, the current implementation of nsPCE involves only two elements; however, it is unclear if the convergence rate can be improved even more by further decomposing these elements.An adaptive approach for decomposing the random parameter space that uses sensitivity information to decide which elements to split was proposed in [25].A similar concept could be potentially utilized within nsPCE, though the method would likely benefit from the incorporation of more advanced geometries than simple boxes.

Considerations and challenges in parameter estimation
Many of the difficulties encountered during parameter estimation are related to poor identifiability of model parameters.Performing parameter identifiability tests can help mitigate these difficulties by ensuring the parameter estimation problem is well-posed, which is especially important when dealing with limited experimental data and/or considering a large number of model parameters.It is common to distinguish between structural and practical identifiability.Structural identifiability is a theoretical property of the model structure that depends only on the observation function and the manipulated input function.Since a structurally non-identifiable parameter is independent of the accuracy of available experimental data, it cannot be resolved by a refinement of existing measurements.The only remedy is a qualitatively new measurement or experiment that alters the structure of the mapping between the parameters and the data.In contrast, practical identifiability also takes into account the amount and quality of the measured data, meaning that it can in principle be resolved by improving the quality of the measurements or increasing the number of measured time points.A thorough treatment of these issues in the context of biological models can be found in, e.g., [63][64][65].To the best of our knowledge, structural and practical identifiability analysis has not been demonstrated on DFBA models, which is an interesting area for future work.It is important to note that, although many methods exist for detecting non-identifiable parameters, they often have restrictions on the class of functions so that they are not directly applicable to DFBA models.
Although not observed here, sequential Monte Carlo (SMC) can suffer from degeneracy wherein fewer and fewer particles retain significant weight.This is especially prevalent in high-dimensional problems including those with a large number of parameters or a large time horizon [66].In [67], it is shown that the degeneracy phenomenon occurs unless the sample size is chosen to be exponential in the dimension, which indicates some type of curse-ofdimensionality.This sample degeneracy can be protected against by adding a rejuvenation step that "moves" the resampled particles according to a Markov chain Monte Carlo (MCMC) transition kernel [51].This operation does not change the target distribution, but does reduce impoverishment since identical replicates of a single particle are replaced with new values.The most challenging part of the MCMC step is ensuring that the samples obtained realistically represent the desired distribution.It is known that convergence of the Markov chain fails for posteriors that are not proper, which can happen whenever the prior is improper (e.g., uniform density with infinite bounds) or non-identifiable parameters exist in the model [68].In these situations, neither the prior assumptions nor the likelihood that represents the experimental data sufficiently constrain the posterior distribution.As such, the convergence properties of SMC and MCMC methods may improve considerably by resolving parameter identifiability issues before running the algorithm [69].

Extensions to optimal experiment design
The selection of optimal conditions for conducting experiments (e.g., measurement times, initial conditions, and time-varying input profiles) is important for ensuring maximum information is extracted from the observations, especially when the experiments are expensive and time-consuming to perform.For example, it may be useful to change the feed rate or the measurement times in the considered case study so that the data ensures tight parameter estimates are obtained.Optimal experiment design (OED) has been extensively studied in the classical framework wherein the design criteria are defined as some scalar function of the Fisher information matrix (FIM) [70,71].More recently, OED has been tackled from a fully Bayesian perspective that replaces the approximated classical design criteria with an expected utility function that is rigorously chosen from a decision-theoretic point of view [72][73][74].
The nsPCE surrogate models could be used to efficiently evaluate classical or Bayesian design criteria at any fixed experimental condition.However, the parameter space decomposition depends strongly on the experiment, such that separate surrogates need to be constructed for all experiments of interest.This is not a major challenge when only a small number of experiments are considered, but may become intractable for continuous design spaces.Developing efficient procedures for both classical and Bayesian OED in genome-scale DFBA models is an important area for future research.One possible direction is to treat the experiment design variables as parameters when constructing the surrogate model, as suggested in [75] for global PCE.It would be interesting to see how well nsPCE can handle these additional dimensions, since the model responses would likely be highly sensitive to the design variables.
with v(s(t)) being an element of the solution set of the flux balance model vðsÞ 2 argmax v hðv; sÞ subject to : Av ¼ 0;

Fig 1 .
Fig 1. Flowchart for the proposed nsPCE surrogate modeling method.The model response function can be freely chosen by the user.The singularity time function should be specified implicitly as a function of the DFBA model states.This function can be identified by simulating the DFBA model with nominal parameters and locating at which time points a switch in the active set of the FBA solution occurs.The PCE coefficients are fit using the basis-adaptive version of the hybrid LAR method, while the ED is sequentially enriched to ensure that the target accuracy level is achieved.https://doi.org/10.1371/journal.pcbi.1007308.g001

Fig 2 .
Fig 2. Monte Carlo simulation of E. coli DFBA model.The genome-scale model is integrated from 0 to 8.5 hours for 100 different parameter realizations that are independently drawn from the uniform prior density.The consumption of xylose only occurs after glucose is fully exhausted, which is a strong function of the parameters.https://doi.org/10.1371/journal.pcbi.1007308.g002

Fig 3 .
Fig 3. Accuracy of singularity time surrogate models.RMSE versus the number of model evaluations (i.e., size of the experimental design) used to train the PCE model for (a) the glucose singularity t g and (b) the xylose singularity t z .The RMSE was estimated empirically from a validation set of 10,000 full DFBA simulations.https://doi.org/10.1371/journal.pcbi.1007308.g003

Fig 4 .
Fig 4. Uncertainty propagation with singularity time surrogate models.The estimated global sensitivity indices of (a) t g and (b) t z with respect to the uncertain parameters.The estimated PDF of (c) t g and (d) t z based on 1e+6 surrogate model evaluations, which only requires approximately 1 second of CPU time.https://doi.org/10.1371/journal.pcbi.1007308.g004

Fig 5 .
Fig 5. Parameter space decomposition over time.The decomposition of the parameter support into two nonoverlapping elements at (a) 6.5 hr, (b) 7.0 hr, and (c) 7.25 hr using a sparse PCE model of the glucose singularity time t g .The blue and red regions represent parameters for which t g (x) > t and t g (x) � t, respectively, projected onto the two most sensitive parameters.The green line represents the decision boundary learned using an SVM classifier trained with the same set of data as the sparse PCE model, while the dashed green lines represent the corresponding 95% confidence limits.(d) Parity plot for the sparse PCE model of t g for 1e4 validation points.https://doi.org/10.1371/journal.pcbi.1007308.g005

Fig 6 .
Fig 6.Convergence properties of nsPCE surrogate models.(a,b) Glucose concentration at time 7 hours.(c,d) Xylose concentration at time 8 hours.(e,f) Biomass concentration at time 8 hours.Left plots show the validation RMSE versus the specified error tolerance.Right plots show the total number of model evaluations based on a sequential ED construction, with a maximum of 1000 samples allowed.The global sparse basis-adaptive PCE results are also shown for comparison purposes.https://doi.org/10.1371/journal.pcbi.1007308.g006

Fig 7 .
Fig 7. Parity plots for nsPCE surrogate models.(a,b,c) Target RMSE level ε target = 10 −2 .(d,e,f) Target RMSE level ε target = 10 −3 .(g,h,i) Target RMSE level ε target = 10 −4 .The left, middle, and right columns correspond to glucose concentration at 7 hours, xylose concentration at 8 hours, and biomass concentration at 8 hours, respectively.The parity plots for global sparse basis-adaptive PCE are overlaid for comparison purposes.The global PCE has considerably larger error than nsPCE.https://doi.org/10.1371/journal.pcbi.1007308.g007 Fig) was obtained by simulating the genomescale E. coli DFBA model with fixed parameters (red lines in Fig 8) and adding random noise realizations (22) to the resulting model outputs.The 1200 DFBA simulations used to construct the surrogates require * 30 minutes of CPU time while the surrogate-based SMC algorithm, which takes advantage of vectorization, finishes in * 2 minutes of CPU time.Hence, over 800-fold savings in computational cost is achieved when compared to SMC without surrogates that would require approximately 17 days of CPU time under the same settings (1 × 10 6 DFBA simulations at a cost of 1.5 seconds per evaluation).The DFBA model predictions under the MAP estimates (i.e., parameters that maximize the posterior) are shown in S1 Fig, which closely match the observed data.

Fig 8 .
Fig 8. Posterior distribution of the estimated model parameters.The diagonal subplots represent marginal densities while the off-diagonal subplots represent two-dimensional projections of samples from the joint density.Blue denotes the posterior density while green denotes the prior density.The red line represents the true parameter values used to generate synthetic data for estimation purposes.https://doi.org/10.1371/journal.pcbi.1007308.g008

Fig 9 .
Fig 9. Evolution of the posterior marginal densities for the observable model parameters over time.Each subplot shows the histogram of parameter posterior samples estimated using the sequential Monte Carlo method.The x-axis represents the range of values of the parameters and the y-axis represents frequencies.The red line represents the true parameter values.https://doi.org/10.1371/journal.pcbi.1007308.g009

Fig 10 .
Fig 10.Predicted probability distributions of extracellular concentrations.(a) Model predictions using parameter samples from the prior.(b) Model predictions using parameter samples from the posterior.Each subplot shows the histogram of samples of the model output obtained by substituting i.i.d.samples from the parameter distribution into the corresponding ME-PCE surrogate model.The x-axis represents the range of values of the model outputs and the yaxis represents frequencies.https://doi.org/10.1371/journal.pcbi.1007308.g010

Fig 11 .
Fig 11.Monte Carlo simulation of the synthetic metabolic network.The synthetic DFBA model with twenty uncertain parameters is integrated from time 0 to 40 hours for 100 different parameter realizations drawn independently from the uniform prior density.The time profiles are shown for (a) biomass and lipids, (b) the substrates and products, and (c) the penalty state.https://doi.org/10.1371/journal.pcbi.1007308.g011

Fig 12 .
Fig 12. Global sensitivity indices for the quantities of interest in the synthetic metabolic network.(a)-(g) Global sensitivity indices for extracellular substrate and product concentrations at various time points with respect to the twenty uncertain parameters.https://doi.org/10.1371/journal.pcbi.1007308.g012

Fig 13 .Fig 14 .
Fig 13.Comparison of model predictions and data for the synthetic metabolic network.The model predictions for (a) biomass and lipids and (b) the substrates and products, shown with solid lines, were obtained by integrating the DFBA model with the MAP estimates of the parameters.The '□' marks represent the data that was obtained by corrupting the model predictions using the true (unknown) parameters with randomly generated noise.The dotted lines represent the model predictions based on the initial parameter guess that was used to initialization the optimizer.https://doi.org/10.1371/journal.pcbi.1007308.g013

Table 1 . Relative mean square error estimates for glucose singularity time surrogate models under multiple exper- imental design sizes.
1. Initialization: set k = 1 and generate samples and weights fx i ; w i g Reweighting: update the weights w i w i × w k (x i ) wherew k (x i ) / f D k |X (d k |x i ).3.Resampling: resample fx i ; w i g 4. Loop: set k k + 1 and fx i ; w i g