Scalable nonlinear programming framework for parameter estimation in dynamic biological system models

We present a nonlinear programming (NLP) framework for the scalable solution of parameter estimation problems that arise in dynamic modeling of biological systems. Such problems are computationally challenging because they often involve highly nonlinear and stiff differential equations as well as many experimental data sets and parameters. The proposed framework uses cutting-edge modeling and solution tools that are computationally efficient, robust, and easy to use. Specifically, our framework uses a time discretization approach that: i) avoids repetitive simulations of the dynamic model, ii) enables fully algebraic model implementations and computation of derivatives, and iii) enables the use of computationally efficient nonlinear interior-point solvers that exploit sparse and structured linear algebra techniques. We demonstrate these capabilities by solving estimation problems for synthetic human gut microbiome community models. We show that an instance with 156 parameters, 144 differential equations, and 1,704 experimental data points can be solved in less than 3 minutes using our proposed framework (while an off-the-shelf simulation-based solution framework requires over 7 hours). We also create large instances to show that the proposed framework is scalable and can solve problems with up to 2,352 parameters, 2,304 differential equations, and 20,352 data points in less than 15 minutes. The proposed framework is flexible and easy to use, can be broadly applied to dynamic models of biological systems, and enables the implementation of sophisticated estimation techniques to quantify parameter uncertainty, to diagnose observability/uniqueness issues, to perform model selection, and to handle outliers.


Introduction
Dynamic modeling is essential for understanding the behavior of biological systems. Systems of interest in this domain include microbial communities and microbiomes, gene regulatory networks, and metabolic pathways [1][2][3]. An important task that arises in modeling studies is validation against experimental data by using parameter estimation techniques. This task is computationally challenging because of the need to solve optimization problems constrained by differential equations. Challenges arise from the dimensionality, nonlinearity, and stiffness of the dynamic model, from the incomplete observation of the system states, from the need to estimate many parameters, and from the need to handle a large number of experimental data sets.
Extensive research on solution methods for estimation problems with differential equations has been reported in the computational biology literature (see [4,5] for comprehensive reviews). These methods target maximum likelihood estimation formulations (which are derived from Bayesian principles). In these formulations, one aims to find parameters that maximize the likelihood function. The most widely used strategy to handle such formulations is the so-called simulation-based approach. Here, the idea is to perform repetitive simulations of the dynamic model at different trial parameter values to identify a set of parameters that maximizes the likelihood. The trial parameter values are updated using derivative-based or derivative-free search schemes [6][7][8]. While the simulation-based approach is intuitive, repetitive solutions of large dynamic models become computationally expensive, and differential equation solvers can fail at trial parameter values that are non-physical or that trigger unstable responses. In addition, techniques to compute first- and second-order derivatives for derivative-based schemes (e.g., finite differences, forward and adjoint sensitivities) involve intrusive procedures and are often limited to first-order derivatives [7]. The need for derivatives can be bypassed by using derivative-free search schemes [9,10], which are widely popular in computational biology. Such methods include simulated annealing [11,12], genetic algorithms [13,14], particle swarms [15], approximate Bayesian computation [16,17], and various other methods [18,19]. Derivative-free schemes do not scale well in the number of parameters (a larger number of trial parameter values often needs to be explored compared to derivative-based schemes). Moreover, second-order derivative information is needed to determine if the parameter estimates are unique/observable given the available experimental data [20,21].
The uniqueness/observability test of the parameter estimates is based on curvature information of the likelihood function at the solution.
Simulation-based estimation frameworks previously reported in the computational biology literature have focused on problems that usually contain less than 100 parameters [10,16,22]. To the best of our knowledge, the largest estimation problem solved using a simulation-based framework contains 3,780 data points and 1,801 parameters [7]. Such a problem was solved (to partial optimality) using a derivative-based search scheme that uses first-order derivative information (using an adjoint method) and required over 5 hours of computing time. The scalability limitations of simulation-based approaches present an important obstacle in considering models of higher fidelity, in exploiting high-throughput experimental data, in analyzing parameter uncertainties, and in implementing sophisticated techniques such as ensemble modeling.
In this work, we propose a nonlinear programming (NLP) framework for solving estimation problems with embedded dynamic models [23,24]. The framework is based on a direct transcription approach wherein the dynamic model is converted into a large set of algebraic equations by applying time-discretization techniques. The algebraic equations are then embedded directly as constraints in the optimization problem (a nonlinear program, NLP). The NLPs arising from time discretization are of high dimension (easily reaching hundreds of thousands to millions of variables and constraints) but are also sparse and structured. Moreover, by transforming the dynamic model into algebraic equations, it becomes possible to use automatic differentiation techniques available in modern algebraic modeling languages to compute first and second derivatives. Exploitation of sparsity and structure, together with the availability of derivative information, enables the solution of estimation problems with complex dynamic models and the efficient handling of many parameters and experimental data sets.
Discretization-based estimation approaches have been widely studied in diverse fields such as chemical engineering [25][26][27] and aerospace engineering [28][29][30] (see [31] for a comprehensive review) but less so in computational biology. A major factor that has hindered wider adoption is the lack of easy-to-use computational frameworks that facilitate access to nonexpert users. In this work, we demonstrate that modern modeling and solution tools can be combined to create scalable, robust, easy-to-use, and flexible frameworks. We demonstrate the benefits by solving challenging estimation problems arising in microbial community models.
The proposed framework enables the implementation of higher-level tasks such as observability analysis and uncertainty quantification. Uncertainty quantification (UQ) seeks to characterize the parameter posterior distribution, which is necessary to obtain confidence levels/regions and parameter correlation information. Conventionally, UQ is performed by using second-order derivative (Hessian) information of the likelihood function to construct an approximate parameter posterior covariance matrix [24,32] or by using Markov chain Monte Carlo (MCMC) techniques [33][34][35][36]. The Hessian-based approach is scalable, but it requires intrusive computation (it cannot be automatically computed by the solver) and does not capture well the effects of nonlinearities and physical constraints [32]. In MCMC, one samples parameters from the prior parameter density and compares the associated model outputs with experimental data to decide whether or not to accept that sample. By repeating these accept/reject steps, one can construct an approximate parameter posterior. MCMC is rather easy to implement (it is not intrusive and does not require solving an optimization problem) but, being simulation-based, it also suffers from potential failures of the differential equation solver at non-physical parameter samples; moreover, it does not scale well with the number of parameters and cannot handle constraints directly (e.g., non-negative concentrations) [37]. In this work, we propose to overcome some of these challenges by using a randomized maximum a posteriori (rMAP) framework [36][37][38][39]. This method computes approximate (to second order) samples from the parameter posterior distribution by performing random perturbations on the experimental data and re-solving the estimation problem. This allows exploration of the parameter space more efficiently than the MCMC scheme because each sample can be computed in parallel (MCMC is sequential).
Moreover, the rMAP approach is non-intrusive, can capture the effects of nonlinearities and physical constraints, and avoids potential failures of differential equation solvers. The proposed estimation framework is flexible and can easily accommodate advanced estimation formulations. To demonstrate this, we implement formulations that use different prior regularization schemes and k-max norms (the mean of a specified fraction of largest values) to mitigate large outliers [40,41].
The main contributions of this paper are summarized as follows. First, we provide an overview of parameter estimation and uncertainty quantification that leverages state-of-the-art optimization methods. Second, we demonstrate the computational capability of the nonlinear programming framework to handle large-scale parameter estimation problems for biological models. Lastly, we propose a novel problem formulation for parameter estimation using risk measures. Fig 1 shows an outline of the parameter estimation framework presented in this paper.
The paper is organized as follows. The methods section provides a general form for the estimation problem under study and discusses how this can be cast as a sparse NLP by using time-discretization techniques. Furthermore, we introduce basic concepts behind NLP solvers that exploit sparsity and structure at the linear algebra level. In addition, we discuss rMAP and outlier mitigation schemes. In the results section, we demonstrate that the proposed framework can handle challenging estimation problems arising in microbial community models.

Fig 1. Mathematical models for biological systems are often expressed as systems of differential equations with parameters that need to be estimated from experimental data. We formulate the estimation problem using a maximum a posteriori (MAP) formulation. This yields optimization problems constrained by differential equations that are transformed into fully algebraic nonlinear programs by using discretization schemes. The resulting NLPs can be easily implemented in algebraic modeling languages such as JuMP and Plasmo.jl that compute derivatives automatically and that are interfaced to powerful interior-point optimization solvers that exploit sparsity and structure to achieve high computational efficiency. The proposed framework is scalable, robust, easy to use, and flexible. These capabilities facilitate high-level tasks such as identification of parameter observability/uniqueness issues, model selection, and uncertainty quantification. https://doi.org/10.1371/journal.pcbi.1006828.g001

Estimation for dynamic models
We consider estimation problems of the following form:

$$\min_{\theta,\,x_k(\cdot)} \;\; \varphi(\theta) + \sum_{k\in\mathcal{K}} \varphi_k(\eta_k, \bar{\eta}_k) \tag{1}$$

$$\text{s.t.}\;\; \frac{dx_k}{dt}(t) = f_k(x_k(t), \theta), \quad t\in[0,T_k],\; k\in\mathcal{K} \tag{2}$$

$$x_k(0) = x_k^0, \quad k\in\mathcal{K} \tag{3}$$

$$\bar{\eta}_k(t) = \phi_k(x_k(t), \theta), \quad t\in\mathcal{T}_k,\; k\in\mathcal{K} \tag{4}$$

$$0 \le h_k(x_k(t), \theta), \quad t\in[0,T_k],\; k\in\mathcal{K}. \tag{5}$$

Here, $\mathcal{K} := \{0, 1, \ldots, K\}$ is the set of experiments and $\mathcal{T}_k := \{t_1, t_2, \ldots, t_{n_k}\}$ is the set of measurement (sampling) times in experiment $k\in\mathcal{K}$. Time is denoted as $t\in[0,T_k]$, where $T_k\in\mathbb{R}_+$ is the duration of experiment $k\in\mathcal{K}$. The vector function $x_k:\mathbb{R}\to\mathbb{R}^{n_{x_k}}$ denotes the differential state time trajectories, $\theta\in\mathbb{R}^{n_\theta}$ are the model parameters, $\bar{\eta}_k(t)\in\mathbb{R}^{n_{\eta_k}}$ are the model-predicted outputs with corresponding experimental observations $\eta_k(t)\in\mathbb{R}^{n_{\eta_k}}$ at time $t\in\mathcal{T}_k$ and experiment $k\in\mathcal{K}$, and $x_k^0\in\mathbb{R}^{n_{x_k}}$ are the initial conditions for experiment $k\in\mathcal{K}$. For convenience in the notation, we define the output vectors $\bar{\eta}_k = (\bar{\eta}_k(t_1), \ldots, \bar{\eta}_k(t_{n_k}))$ and the experimental output vectors $\eta_k = (\eta_k(t_1), \ldots, \eta_k(t_{n_k}))$ for experiment $k\in\mathcal{K}$, as well as the total output vector $\bar{\eta} = (\bar{\eta}_1, \ldots, \bar{\eta}_K)$ and the total experimental output vector $\eta = (\eta_1, \ldots, \eta_K)$. The vector function $f_k(\cdot)$ denotes the dynamic model mapping, $\varphi(\cdot)$ and $\varphi_k(\cdot)$ are objective function mappings, $\phi_k(\cdot)$ is the state-to-output mapping, and $h_k(\cdot)$ is the constraint mapping. All mappings are assumed to be at least twice continuously differentiable with respect to all of their arguments. The estimation formulation (1)-(5) captures all the features of our proposed framework. Our framework, however, can also accommodate more general features; for instance, the initial conditions (3) can also be treated as unknown variables that need to be estimated, and we can define non-additive objective functions that penalize large errors.
Problem (1)-(5) can be derived from Bayesian principles. To see this, and to introduce some useful notation, we start from Bayes' theorem:

$$p(\theta \mid \eta) = \frac{p(\eta \mid \theta)\, p(\theta)}{p(\eta)}. \tag{6}$$

Here, $p(\theta\mid\eta)$ is the parameter posterior density (i.e., the parameter density given knowledge of the outputs), $p(\eta\mid\theta)$ is the output likelihood (i.e., the density of the outputs given knowledge of the parameters), $p(\theta)$ is the prior density (i.e., the parameter density before knowledge of the outputs), and $p(\eta)$ is the output marginal density. In a maximum a posteriori (MAP) formulation, the goal is to find the parameters that maximize $p(\theta\mid\eta)$. Because $p(\eta)$ does not depend on $\theta$, this can also be achieved by maximizing $p(\eta\mid\theta)p(\theta)$, which is in turn equivalent to maximizing the log-likelihood function $\mathcal{L}(\theta) = \log p(\theta) + \log p(\eta\mid\theta)$. If the outputs from the experiments $k\in\mathcal{K}$ are independent (which is usually the case), we have that $p(\eta\mid\theta) = \prod_{k\in\mathcal{K}} p(\eta_k\mid\theta)$ and thus:

$$\mathcal{L}(\theta) = \log p(\theta) + \sum_{k\in\mathcal{K}} \log p(\eta_k\mid\theta).$$

The observed outputs are random variables that are usually considered to be Gaussian, so that $\eta_k\mid\theta \sim \mathcal{N}(\bar{\eta}_k, \Sigma_k)$, where $\Sigma_k$ is the covariance matrix. The prior density $p(\theta)$ is also often assumed to be Gaussian, so that $\theta \sim \mathcal{N}(\bar{\theta}, \Sigma_\theta)$, where $\bar{\theta}$ is the mean of the prior distribution and $\Sigma_\theta$ is its covariance. With this, we obtain:

$$\mathcal{L}(\theta) = -\frac{1}{2}(\theta-\bar{\theta})^T \Sigma_\theta^{-1}(\theta-\bar{\theta}) - \frac{1}{2}\sum_{k\in\mathcal{K}} (\eta_k-\bar{\eta}_k)^T \Sigma_k^{-1}(\eta_k-\bar{\eta}_k) + \text{const}. \tag{7}$$

By comparing (7) with (1) we can see that minimizing $\varphi(\theta) + \sum_{k\in\mathcal{K}} \varphi_k(\eta_k, \bar{\eta}_k)$ is equivalent to minimizing $-\mathcal{L}(\theta)$. Here, the dynamic model together with the state-to-output mapping defines a parameter-to-output mapping of the form $\bar{\eta}_k := m_k(\theta)$. In a simulation-based estimation approach, the mapping $m_k(\theta)$ is computed by simulating the dynamic model (2)-(3) at a trial value $\theta$ using a differential equation solver and by evaluating the outputs at the sampling times $t\in\mathcal{T}_k$ using (4). In a discretization-based approach, the mapping $m_k(\theta)$ is not computed explicitly (we use it here only as a mathematical representation that helps explain some relevant concepts). Constraints (5) restrict the parameter space to be explored.
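As a concrete illustration of the objective in (7), the following minimal Python sketch evaluates the negative log-posterior (with constants dropped) for a Gaussian prior and Gaussian output noise. The function name and toy data are our own and not part of the paper's implementation; the predicted outputs are assumed to have been computed elsewhere (e.g., by a simulator or a discretized model).

```python
import numpy as np

def neg_log_posterior(theta, theta_prior, Sig_theta, etas, eta_preds, Sigs):
    """Negative log-posterior -L(theta) for Gaussian prior and output noise,
    up to additive constants (cf. Eq. (7))."""
    d = theta - theta_prior
    val = 0.5 * d @ np.linalg.solve(Sig_theta, d)        # prior term phi(theta)
    for eta, eta_bar, Sig in zip(etas, eta_preds, Sigs):  # one term per experiment k
        r = eta - eta_bar
        val += 0.5 * r @ np.linalg.solve(Sig, r)          # phi_k(eta_k, eta_bar_k)
    return val

# toy checks: perfect fit at the prior mean gives zero; a unit residual gives 0.5
theta = np.array([1.0, 2.0])
val0 = neg_log_posterior(theta, theta, np.eye(2),
                         [np.array([3.0])], [np.array([3.0])], [np.eye(1)])
val1 = neg_log_posterior(theta, theta, np.eye(2),
                         [np.array([4.0])], [np.array([3.0])], [np.eye(1)])
```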
After dropping all constant terms in the likelihood function, we obtain:

$$\varphi(\theta) = \frac{1}{2}(\theta-\bar{\theta})^T \Sigma_\theta^{-1}(\theta-\bar{\theta}) \tag{10}$$

$$\varphi_k(\eta_k, \bar{\eta}_k) = \frac{1}{2}(\eta_k-\bar{\eta}_k)^T \Sigma_k^{-1}(\eta_k-\bar{\eta}_k), \quad k\in\mathcal{K}. \tag{11}$$

The function $\varphi(\theta)$ is usually known as the prior term and provides a regularization effect that stabilizes the solution of the estimation problem when the parameters cannot be uniquely inferred from the available data [42][43][44][45]. This regularization term arises from the prior density $p(\theta)$ and provides a mechanism to encode knowledge on the parameters. Assuming that the prior density is Gaussian gives rise to a prior term defined by a weighted L2 norm. Recently, the machine learning community has also proposed the use of regularization terms that use L1 norms (e.g., $\varphi(\theta) = \|\theta-\bar{\theta}\|_1$). The L1 norm induces sparsity in the parameters and corresponds to assuming that the prior density is Laplacian. One can also show that an L1 norm acts as an exact penalty function and implicitly induces constraints on the parameters. Similarly, one can also use the inequality constraints $h_k(\cdot)$ to directly embed physical knowledge in the MAP formulation (e.g., concentrations can only be positive).
From (10) and (11) we see that the $\varphi_k(\cdot)$ are squared error terms and thus the MAP problem minimizes the sum of the squared errors across all experiments $k\in\mathcal{K}$. This approach offers limited control over large errors that might result from data outliers. Here, we propose to use a k-max norm to mitigate these issues. Our proposal is based on the observation that a k-max norm is equivalent to a conditional-value-at-risk (CVaR) norm [40,41,46]. The CVaR$_\beta$ norm of a vector $e = (e_1, \ldots, e_K)$ with components $e_k = \varphi_k(\eta_k, \bar{\eta}_k)$ is defined as the average of the $(1-\beta)$-fraction of largest elements of the vector (where $\beta\in[0,1]$ is a parameter that defines the size of the fraction) [46]. One can show that, as $\beta\to 1$, the CVaR norm approaches the largest fitting error and, as $\beta\to 0$, the CVaR norm approaches the average of the fitting errors (recovering, up to scaling, the standard MAP formulation). A key computational property of the CVaR norm is that it can be formulated as a standard optimization problem. In particular, the MAP problem with a CVaR error norm can be expressed as:

$$\min_{\theta,\,x_k(\cdot),\,\gamma} \;\; \varphi(\theta) + \gamma + \frac{1}{(1-\beta)K}\sum_{k\in\mathcal{K}} \left[\varphi_k(\eta_k,\bar{\eta}_k) - \gamma\right]_+ \quad \text{s.t. (2)-(5)},$$

where $[\cdot]_+ = \max(0, \cdot)$ is the max function and $\gamma$ is an auxiliary variable [41].
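The equivalence between the k-max average and the auxiliary-variable (Rockafellar-Uryasev) formulation of CVaR can be checked numerically. The sketch below uses our own function names and toy error values; for a finite sample, the minimization over the auxiliary variable $\gamma$ is attained at one of the data points, so a discrete search suffices.

```python
import numpy as np

def cvar_kmax(e, beta):
    """CVaR as the average of the largest (1-beta)-fraction of errors (k-max norm)."""
    e = np.sort(np.asarray(e))[::-1]
    k = int(round((1 - beta) * len(e)))     # number of largest terms kept
    return e[:k].mean()

def cvar_ru(e, beta):
    """Auxiliary-variable form: min_gamma gamma + mean([e - gamma]_+) / (1 - beta)."""
    e = np.asarray(e)
    cands = [g + np.maximum(e - g, 0.0).mean() / (1 - beta) for g in e]
    return min(cands)   # for finite samples the optimum is attained at a data point

errs = np.array([0.1, 0.5, 2.0, 0.3, 4.0])
v1 = cvar_kmax(errs, beta=0.6)   # average of the 2 largest errors: (4.0 + 2.0)/2
v2 = cvar_ru(errs, beta=0.6)     # should agree with v1
v3 = cvar_kmax(errs, beta=0.8)   # keeps only the single largest error
```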

Nonlinear programming formulation
To solve the MAP problem (1)-(5), we approximate the differential equations by using a discretization scheme. This enables the use of computationally efficient NLP solvers and facilitates high-level UQ and observability monitoring tasks.

Time discretization. Discretization schemes such as Euler, Runge-Kutta, and orthogonal collocation are commonly used to transform differential equations into algebraic ones. Orthogonal collocation is often preferred because accurate approximations can be obtained with few discretization points [23]. To simplify the presentation, we use an implicit Euler scheme, which can be shown to be a special type of orthogonal collocation scheme (it is a one-point Radau collocation scheme). We discretize the time domain $[0,T_k]$ into a set of intervals with fixed discrete time points $\{t_0, t_1, \ldots, t_{N_k}\}$, where $t_0 = 0$ and $t_{N_k} = T_k$. By applying an implicit Euler scheme, the dynamic model (2) is converted into a set of nonlinear algebraic equations of the form:

$$x_k^{j+1} = x_k^j + h_j\, f_k(x_k^{j+1}, \theta), \quad j = 0, \ldots, N_k-1,\; k\in\mathcal{K},$$

where $h_j = t_{j+1} - t_j$ is the time step. With this, we can approximate the MAP problem (1)-(5) using the NLP:

$$\min_{\theta,\,x_k^j} \;\; \varphi(\theta) + \sum_{k\in\mathcal{K}} \varphi_k(\eta_k, \bar{\eta}_k) \tag{16}$$

$$\text{s.t.}\;\; x_k^{j+1} = x_k^j + h_j\, f_k(x_k^{j+1}, \theta), \quad j = 0, \ldots, N_k-1,\; k\in\mathcal{K} \tag{17}$$

$$x_k(t_0) = x_k^0, \quad k\in\mathcal{K} \tag{18}$$

$$\bar{\eta}_k(t_j) = \phi_k(x_k^j, \theta), \quad t_j\in\mathcal{T}_k,\; k\in\mathcal{K} \tag{19}$$

$$0 \le h_k(x_k^j, \theta), \quad j = 0, \ldots, N_k,\; k\in\mathcal{K}. \tag{20}$$

Here, we use $x_k^j = x_k(t_j)$ as shorthand notation to represent the states at time $t_j$ in experiment $k$. It is important that a sufficient number of time discretization points be used so that the differential equations are accurately approximated. If the number of time points is not sufficient, the solution of the NLP can be sensitive to the choice of the number of points. To verify that the discretization is sufficient, one can assess the sensitivity of the solution with respect to the discretization; for example, one can increase the number of time points and check that the solutions remain reasonably close.
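The implicit Euler recursion requires solving a small nonlinear equation at each step. The following sketch illustrates this on a toy logistic growth model (our own choice, not the paper's microbiome model), solving the steps sequentially with Newton's method; in the NLP framework, all step equations are instead imposed simultaneously as constraints and solved together.

```python
import numpy as np

# Toy dynamic model: dx/dt = theta * x * (1 - x) (logistic growth)
def f(x, theta):
    return theta * x * (1.0 - x)

def df_dx(x, theta):
    return theta * (1.0 - 2.0 * x)

def implicit_euler(x0, theta, h, n_steps):
    """Integrate with implicit Euler; each step solves z = x + h*f(z) by Newton."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        z = x                                   # warm start from previous state
        for _ in range(50):                     # Newton on g(z) = z - x - h*f(z)
            g = z - x - h * f(z, theta)
            dg = 1.0 - h * df_dx(z, theta)
            step = g / dg
            z -= step
            if abs(step) < 1e-12:
                break
        x = z
        xs.append(x)
    return np.array(xs)

# integrate to T = 10; the exact solution approaches the carrying capacity x = 1
traj = implicit_euler(x0=0.1, theta=1.0, h=0.1, n_steps=100)
```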
For convenience, we express (16)-(20) in the following abstract form:

$$\min_{w}\; F(w) \tag{21}$$

$$\text{s.t.}\;\; P(w) = 0 \tag{22}$$

$$w \ge 0, \tag{23}$$

where $w\in\mathbb{R}^n$ is a large-dimensional vector containing all the discrete-time states $x_k^j$, the parameters $\theta$, and additional auxiliary variables. The mapping $F:\mathbb{R}^n\to\mathbb{R}$ is the objective function and $P:\mathbb{R}^n\to\mathbb{R}^m$ are the equality constraints, which contain the algebraic equations obtained from discretization of the dynamic model as well as other auxiliary equations. General inequality constraints can be transformed into equality constraints and simple non-negativity bounds by using auxiliary slack variables (i.e., $0\le h_k(x_k^j,\theta)$ can be written as $s_k = h_k(x_k^j,\theta)$ with $s_k\ge 0$). A useful representation of the NLP results from noticing that the parameters $\theta$ are the only complicating (coupling) variables across experiments $k\in\mathcal{K}$. Consequently, we can express the NLP in the structured form [47]:

$$\min_{\theta,\,w_k}\; F(\theta) + \sum_{k\in\mathcal{K}} F_k(w_k, \theta) \tag{24}$$

$$\text{s.t.}\;\; P_k(w_k, \theta) = 0, \quad k\in\mathcal{K} \tag{25}$$

$$(w_k, \theta) \ge 0, \quad k\in\mathcal{K}. \tag{26}$$

Here, the variable vector $w_k$ contains all the discrete-time states and auxiliary variables of experiment $k\in\mathcal{K}$, $F(\cdot)$ is the prior term, $F_k(\cdot)$ is the contribution of experiment $k\in\mathcal{K}$ to the likelihood function, and $P_k(\cdot)$ contains the discretized dynamic model equations and auxiliary equations for experiment $k\in\mathcal{K}$. As we discuss next, this representation can be used to derive parallel solution approaches.
Interior-point solvers. The NLPs that result from time discretization exhibit a high degree of algebraic sparsity (only a few variables appear in each constraint) and are highly structured. Sparsity and structure permeate down to the linear algebra operations performed inside the optimization solver. This is in sharp contrast to the simulation-based approach, which induces dense linear algebra operations in the space of the parameters $\theta$. Most modern large-scale NLP solvers, such as Ipopt and Knitro, seek to exploit sparsity and structure at the linear algebra level to achieve high computational efficiency [48,49]. Interior-point solvers, in particular, provide a flexible framework for doing so. These solvers replace the variable bounds with a logarithmic barrier function. In the context of NLP (21)-(23), this results in a logarithmic barrier subproblem of the form:

$$\min_{w}\; F(w) - \mu\sum_{i=1}^{n} \log w^{(i)} \quad \text{s.t.}\;\; P(w) = 0,$$

where $\mu\in\mathbb{R}_+$ is the so-called barrier parameter. The logarithmic term becomes large as $w^{(i)}$ approaches the boundary of the feasible region. This ensures that variables remain in the interior of the feasible region (hence the origin of the term barrier). A key observation is that one can recover a solution of the original NLP (21)-(23) by solving a sequence of barrier problems for decreasing values of $\mu$ [50]. An important property of interior-point methods is that the original NLP with bounds is converted into a sequence of NLPs with equality constraints only. This removes the combinatorial complexity of identifying the set of bounds that are active or inactive at the solution (a bottleneck in active-set solvers).

Sparse linear algebra. Interior-point methods enable efficient linear algebra implementations.
To explain how this is done, we note that the optimality conditions of the barrier problem are given by the following set of nonlinear equations:

$$\nabla_w F(w) + \nabla_w P(w)^T \lambda - \nu = 0 \tag{29}$$

$$P(w) = 0 \tag{30}$$

$$V W \mathbf{1} - \mu \mathbf{1} = 0, \tag{31}$$

where $\lambda\in\mathbb{R}^m$ are the Lagrange multipliers of the equality constraints, $\nu$ are the Lagrange multipliers of the bound constraints, $V = \text{diag}(\nu)$ and $W = \text{diag}(w)$ are diagonal matrices, and $\mathbf{1}$ is a vector of all ones. By applying Newton's method to (29)-(31) (and eliminating $\nu$), we obtain the linear algebra system:

$$\underbrace{\begin{bmatrix} H(w^\ell, \lambda^\ell) + (W^\ell)^{-1}V^\ell + \kappa_w I & \nabla_w P(w^\ell)^T \\ \nabla_w P(w^\ell) & 0 \end{bmatrix}}_{M^\ell(\kappa_w)} \begin{bmatrix} \Delta w^\ell \\ \Delta \lambda^\ell \end{bmatrix} = -\begin{bmatrix} \nabla_w F(w^\ell) + \nabla_w P(w^\ell)^T \lambda^\ell - \mu (W^\ell)^{-1}\mathbf{1} \\ P(w^\ell) \end{bmatrix}.$$

Here, $\ell$ is the Newton iteration index, $\Delta w^\ell$ is the search direction for the primal variables, $\Delta\lambda^\ell$ is the search direction for the dual variables, $H(w^\ell, \lambda^\ell)$ is the Hessian of the Lagrangian, and $M^\ell(\kappa_w)$ is known as the augmented matrix. The Newton step computation in a simulation-based approach operates only in the space of the parameters $\theta$ (the states are implicitly eliminated by simulation). In the time discretization approach, the Newton search is in the space of both the discretized states and the parameters (contained in the high-dimensional variable vector $w$). Interestingly, however, the augmented matrix found in typical applications is highly sparse (fewer than 1% of its entries are non-zero) [24]. The constant $\kappa_w\in\mathbb{R}_+$ is a Hessian regularization parameter that plays a key role in the context of parameter estimation. In particular, one can prove that the augmented matrix $M^\ell(\kappa_w)$ is non-singular (and thus the linear algebra system has a unique solution) if and only if the reduced Hessian matrix $Z^T H(w^\ell, \lambda^\ell) Z$ is positive definite and the Jacobian matrix $\nabla_w P(w^\ell)$ has full row rank. Here, $Z\in\mathbb{R}^{n\times n_\theta}$ is a matrix whose columns span the null space of the Jacobian (i.e., $\nabla_w P(w^\ell) Z = 0$).
Moreover, the number of columns of $Z$ equals the number of degrees of freedom (in our context, this is precisely the number of parameters). When the reduced Hessian is positive definite (i.e., all its eigenvalues are positive) and the Jacobian has full row rank, one can prove that the Newton step for the primal variables $\Delta w^\ell$ obtained from the solution of (29)-(31) is a descent direction for the objective function (i.e., $(\Delta w^\ell)^T \nabla_w F(w^\ell) < 0$) when the constraints are close to being satisfied (i.e., $P(w^\ell)\approx 0$). This is key because it indicates that the Newton step improves the objective function (in our context, the negative likelihood function). This property cannot be guaranteed when the reduced Hessian is not positive definite. When such a situation is encountered, one can increase the regularization parameter $\kappa_w$ until the reduced Hessian is positive definite and a descent direction is obtained. This approach is closely connected to the Levenberg-Marquardt method used in simulation-based estimation approaches (in which one regularizes the Hessian of the negative likelihood function as $-\nabla_{\theta\theta}\mathcal{L}(\theta^\ell) + \kappa_\theta I$) [51]. Another key observation is that, when the reduced Hessian is positive definite at the solution $w^*$, the estimated parameters are unique. This provides an indication that the experimental data is sufficiently informative to identify the parameters uniquely (i.e., the parameters are observable). We note that using a prior term $\varphi(\theta)$ in the MAP formulation has the effect of adding the positive definite matrix $\Sigma_\theta^{-1}$ to the reduced Hessian. This artificially regularizes the problem (as is done in the Levenberg-Marquardt scheme by adding the term $\kappa_\theta I$). Consequently, when testing for observability/uniqueness, it is necessary to drop the prior term from the MAP formulation. Testing for observability also requires exact second-order derivative information because the Hessian is needed.
In the time-discretization approach, such information can be obtained directly from algebraic modeling languages. Simulation-based solution approaches often cannot check observability of the parameters (computing second derivatives using adjoint and sensitivity schemes is complicated).
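The observability test can be sketched numerically: build a basis $Z$ for the null space of the constraint Jacobian and inspect the eigenvalues of $Z^T H Z$. The toy problems below are our own (a minimal sketch, not the paper's implementation): in the first, a single parameter is identified through the constraint; in the second, two parameters enter only through their sum, so one direction is unobservable and a zero eigenvalue appears.

```python
import numpy as np
from scipy.linalg import null_space

def reduced_hessian_eigs(H, J):
    """Eigenvalues of Z^T H Z, where the columns of Z span the null space of J."""
    Z = null_space(J)
    return np.linalg.eigvalsh(Z.T @ H @ Z)

# Observable toy problem: min (x - 1)^2 s.t. x - theta = 0, with w = (x, theta)
H1 = np.diag([2.0, 0.0])             # Hessian of the Lagrangian
J1 = np.array([[1.0, -1.0]])         # constraint Jacobian (full row rank)
eigs1 = reduced_hessian_eigs(H1, J1)   # single positive eigenvalue -> observable

# Unobservable: theta1, theta2 enter only through their sum, w = (x, th1, th2)
H2 = np.diag([2.0, 0.0, 0.0])
J2 = np.array([[1.0, -1.0, -1.0]])
eigs2 = reduced_hessian_eigs(H2, J2)   # one zero eigenvalue -> not observable
```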
Computing the eigenvalues of the reduced Hessian to check for positive definiteness is expensive. Interestingly, one can also determine whether the reduced Hessian is positive definite by using inertia information of the augmented matrix $M^\ell(\kappa_w)$. The inertia of a matrix $M$ is denoted as Inertia$(M) = \{n_+, n_-, n_0\}$, where $n_+$, $n_-$, and $n_0$ are the numbers of positive, negative, and zero eigenvalues of $M$, respectively. One can prove that the reduced Hessian matrix is positive definite if Inertia$(M^\ell(\kappa_w)) = \{n, m, 0\}$, where we recall that $n$ is the dimension of the variable vector $w$ and $m$ is the number of constraints. Notably, one can obtain the inertia of $M^\ell(\kappa_w)$ without computing the eigenvalues of the matrix. This is done by using modern sparse symmetric factorization routines such as MA57 or Pardiso [50]. Such routines factorize the matrix $M^\ell(\kappa_w)$ as $LBL^T$, where $L$ is a lower triangular matrix and $B$ is a block-diagonal matrix with $1\times 1$ and $2\times 2$ blocks on the diagonal. One can show that the numbers of positive and negative eigenvalues of $M^\ell(\kappa_w)$ equal the numbers of positive and negative eigenvalues of $B$ (which are easy to determine).
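This inertia computation can be illustrated with SciPy's dense $LBL^T$ (Bunch-Kaufman) routine; the helper name and the small saddle-point matrix are our own. For a KKT-like matrix with a positive definite Hessian block and a full-row-rank Jacobian, the inertia is $\{n, m, 0\}$.

```python
import numpy as np
from scipy.linalg import ldl

def inertia(M):
    """Inertia {n+, n-, n0} of symmetric M from an LBL^T factorization.
    By Sylvester's law of inertia, M = L B L^T shares its inertia with the
    block-diagonal factor B, whose eigenvalues are cheap to obtain (1x1/2x2 blocks)."""
    _, B, _ = ldl(M)
    lam = np.linalg.eigvalsh(B)
    tol = 1e-10 * max(1.0, np.abs(lam).max())
    return (int((lam > tol).sum()), int((lam < -tol).sum()),
            int((np.abs(lam) <= tol).sum()))

# A saddle-point (KKT-like) matrix with n = 2 variables and m = 1 constraint:
M = np.array([[2.0,  0.0,  1.0],
              [0.0,  3.0, -1.0],
              [1.0, -1.0,  0.0]])
inert = inertia(M)   # {2, 1, 0} certifies a positive definite reduced Hessian
```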
Modern interior-point solvers are equipped with highly sophisticated safeguarding techniques that enable the solution of highly nonlinear problems. A powerful approach is the filter line-search method, in which one seeks a step size $\alpha$ such that the trial Newton iterate $w^{\ell+1} = w^\ell + \alpha\Delta w^\ell$ either decreases the objective function or the constraint violation $\|P(w^\ell)\|$. If the step is accepted, the current values of the objective and constraint violation $(F(w^\ell), \|P(w^\ell)\|)$ are stored in a filter (a history of previous successful iterates). At the next iterate, one requires that the Newton step is not in the filter and that it improves either the objective or the constraint violation. This rather simple strategy is extremely effective in practice.
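The filter acceptance test can be sketched as follows. This is a deliberately simplified, hypothetical version (the actual mechanism in solvers such as Ipopt includes switching conditions, sufficient-decrease margins, and feasibility restoration); the function name and numbers are ours.

```python
# A trial point is acceptable if, against every (objective, violation) pair
# stored in the filter, it improves either the objective or the violation
# by at least a small margin (a hypothetical simplification of the real test).
def acceptable(F_trial, viol_trial, filter_pairs, margin=1e-5):
    return all(F_trial < F - margin * v or viol_trial < v - margin * v
               for F, v in filter_pairs)

filt = [(10.0, 1.0), (8.0, 2.0)]
ok1 = acceptable(9.0, 0.5, filt)    # improves violation vs. both entries
ok2 = acceptable(12.0, 3.0, filt)   # dominated by the entry (10.0, 1.0)
```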
We highlight the fact that the proposed discretization approach bypasses the need to repetitively simulate the dynamic model (the discretized dynamic model contained in $P(w)$ is solved progressively by Newton's method). This brings substantial computational savings. Moreover, since the discrete-time model is satisfied at the solution $w^*$ of the NLP (21)-(23), the discrete-time states $\{x_k^j\}^*$ approximate the state trajectories $x_k(t)$, $t\in[0,T_k]$, $k\in\mathcal{K}$. In the absence of inequality constraints, one can also show that the reduced Hessian $Z^T H(w^*, \lambda^*) Z$ approximates the Hessian of the negative log-likelihood function $-\nabla_{\theta\theta}\mathcal{L}(\theta^*)$ in a neighborhood of $w^*$ (which contains $\theta^*$). We thus have that the reduced Hessian approximates the inverse parameter covariance matrix $V_\theta^{-1}$. When inequality constraints are present, some of the parameters or state variables might hit their physical bounds, and this deteriorates the approximation. When the parameters are not unique (the reduced Hessian has zero eigenvalues), the parameter covariance matrix is singular. We highlight that such information can be obtained as a by-product of the solution from the nonlinear programming solver; the observability of the parameter estimation problem can therefore be analyzed without further computational effort. We also highlight that parameter observability/identifiability can be checked prior to solving the estimation problem [52,53]. Unfortunately, such a priori methods require computing parameter-to-output sensitivity matrices, which requires specialized implementations and might fail to capture the impact of nonlinearities (the sensitivity is computed only at a reference point). Moreover, such approaches do not capture the impact of constraints. Under the proposed NLP framework, observability is checked automatically by using the inertia of the augmented matrix at the MAP point.
Structured linear algebra. A key advantage of using interior-point solvers is that they enable modular linear algebra implementations. For instance, the multi-experiment structure of problem (24)-(26) permeates down to the linear algebra system, giving a system of the form:

$$\begin{bmatrix} K_1 & & & B_1^T \\ & \ddots & & \vdots \\ & & K_K & B_K^T \\ B_1 & \cdots & B_K & K_\theta \end{bmatrix} \begin{bmatrix} \Delta w_1 \\ \vdots \\ \Delta w_K \\ \Delta\theta \end{bmatrix} = -\begin{bmatrix} r_1 \\ \vdots \\ r_K \\ r_\theta \end{bmatrix}, \tag{33}$$

where $\Delta\theta$ is the Newton step for the parameters and $\Delta w_k = (\Delta x_k, \Delta\lambda_k)$ is the Newton step for the variables of experiment $k$. The above system is said to have a block-bordered-diagonal (BBD) structure. Here, $K_k$ is the augmented matrix block of experiment $k$, $K_\theta$ collects the contributions of the parameters and the prior term, and the border blocks $B_k$ couple the experiments to the parameters. The BBD matrix is a permutation of the augmented matrix $M^\ell(\kappa_w)$ (obtained by ordering variables by experiment). The BBD matrix can thus be expressed as $P^T M^\ell(\kappa_w) P$, where $P$ is a permutation matrix. The permutation does not affect the eigenvalues of the matrix. The BBD system (33) can be solved in parallel by using a Schur complement decomposition approach [47,54] or specialized preconditioning strategies [55]. In this work, we focus on the Schur complement approach. This requires the solution of the linear algebra systems:

$$S\,\Delta\theta = -r_\theta + \sum_{k\in\mathcal{K}} B_k K_k^{-1} r_k$$

$$K_k\,\Delta w_k = -r_k - B_k^T \Delta\theta, \quad k\in\mathcal{K}.$$

Here, $S = K_\theta - \sum_{k\in\mathcal{K}} B_k K_k^{-1} B_k^T$ is the Schur complement matrix, which has the same dimension as the number of degrees of freedom (in our case, the number of parameters). The key observation is that the experiment matrices $K_k$ can be factorized using an $LBL^T$ factorization (with MA57 or Pardiso) in parallel. As a result, Schur decomposition can achieve high computational efficiency in estimation problems with many experimental data sets.
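The Schur complement solve can be verified against a direct solve of the assembled BBD matrix. The sketch below uses random, well-conditioned toy blocks of our own choosing and, for simplicity, solves with a positive right-hand side (the sign convention is immaterial for the illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy BBD system: K experiment blocks K_k coupled to the parameter block
# K_theta only through the borders B_k.
n_k, n_th, K = 4, 2, 3
Ks = [np.eye(n_k) * (k + 2) + 0.1 * rng.standard_normal((n_k, n_k)) for k in range(K)]
Ks = [0.5 * (A + A.T) for A in Ks]                   # symmetrize
Bs = [rng.standard_normal((n_th, n_k)) for _ in range(K)]
K_th = 100.0 * np.eye(n_th)
rs = [rng.standard_normal(n_k) for _ in range(K)]
r_th = rng.standard_normal(n_th)

# Schur complement S = K_theta - sum_k B_k K_k^{-1} B_k^T (dimension = n_theta)
S = K_th - sum(B @ np.linalg.solve(Kb, B.T) for B, Kb in zip(Bs, Ks))
rhs = r_th - sum(B @ np.linalg.solve(Kb, r) for B, Kb, r in zip(Bs, Ks, rs))
d_th = np.linalg.solve(S, rhs)                        # parameter step
d_ws = [np.linalg.solve(Kb, r - B.T @ d_th)           # per-experiment back-solves
        for Kb, B, r in zip(Ks, Bs, rs)]              # (parallelizable over k)

# Verify against a direct solve of the assembled BBD matrix
M = np.zeros((K * n_k + n_th, K * n_k + n_th))
for k in range(K):
    i = k * n_k
    M[i:i + n_k, i:i + n_k] = Ks[k]
    M[i:i + n_k, -n_th:] = Bs[k].T
    M[-n_th:, i:i + n_k] = Bs[k]
M[-n_th:, -n_th:] = K_th
sol = np.linalg.solve(M, np.concatenate(rs + [r_th]))
err = np.abs(np.concatenate(d_ws + [d_th]) - sol).max()
```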
When using a Schur decomposition, one can obtain the inertia of the BBD matrix by using the Haynsworth inertia additivity formula:

$$\text{Inertia}(M^\ell(\kappa_w)) = \text{Inertia}(S) + \sum_{k\in\mathcal{K}} \text{Inertia}(K_k).$$

We recall that $n = n_\theta + \sum_k n_k$ and $m = \sum_k m_k$. Consequently, if Inertia$(K_k) = \{n_k, m_k, 0\}$ for all $k\in\mathcal{K}$, then Inertia$(M^\ell(\kappa_w)) = \{n, m, 0\}$ if and only if Inertia$(S) = \{n_\theta, 0, 0\}$ (i.e., the Schur complement is positive definite). One can obtain the inertia of the blocks $K_k$ and of $S$ using $LBL^T$ factorizations. This allows us to test observability of the parameters. We highlight that Schur decomposition and block cyclic reduction techniques can also be used to decompose the estimation problem along the time horizon. This enables the solution of problems with fine time resolutions and long horizons (see [56]). Unfortunately, scalable implementations of such techniques are currently not available (this is an interesting direction for future work). We also highlight that there exist generic parallel linear solvers (such as MA97) that seek to identify and exploit structure on-the-fly. Such approaches are in general not competitive with approaches that communicate the structure directly to the solver (as in Schur decomposition).
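The inertia additivity formula can be checked numerically on a small BBD saddle matrix (our own toy blocks; one experiment block is deliberately indefinite so the counts are non-trivial).

```python
import numpy as np

def inertia(A, tol=1e-9):
    """Inertia {n+, n-, n0} of a symmetric matrix (via eigenvalues, for small A)."""
    lam = np.linalg.eigvalsh(A)
    return (int((lam > tol).sum()), int((lam < -tol).sum()),
            int((np.abs(lam) <= tol).sum()))

K1 = np.array([[2.0, 1.0], [1.0, -1.0]])   # indefinite experiment block
K2 = np.diag([3.0, 4.0])                   # positive definite experiment block
B1 = np.array([[1.0, 0.0]])
B2 = np.array([[0.0, 1.0]])
K_th = np.array([[5.0]])

S = K_th - B1 @ np.linalg.solve(K1, B1.T) - B2 @ np.linalg.solve(K2, B2.T)

# Assemble the 5x5 BBD matrix
M = np.zeros((5, 5))
M[0:2, 0:2] = K1; M[2:4, 2:4] = K2; M[4:, 4:] = K_th
M[0:2, 4:] = B1.T; M[4:, 0:2] = B1
M[2:4, 4:] = B2.T; M[4:, 2:4] = B2

lhs = inertia(M)   # inertia of the full matrix
rhs = tuple(a + b + c for a, b, c in zip(inertia(S), inertia(K1), inertia(K2)))
```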

Uncertainty quantification
The estimation problem under the MAP framework gives the values of the parameters θ* that maximize the parameter posterior density. However, a characterization of the entire posterior is necessary to assess parameter uncertainty. The posterior covariance may be approximated from the reduced Hessian at the solution of the problem w*, and the covariance matrix can be used to determine ellipsoidal level sets of the posterior (confidence regions). This approach, however, might fail to capture nonlinear and constraint effects [24]. In this work, we circumvent these issues by using a randomized maximum a posteriori (rMAP) approach. Under this method, the posterior distribution is explored by using random perturbations of the experimental data (which can be easily parallelized). The rMAP framework can also deliver approximate samples from the parameter posterior distribution and implicitly captures nonlinear and constraint effects. To show this, we use the implicit mapping representation η̄_k = m_k(θ). Under this representation, the posterior density (6) can be expressed as:

p(θ | η) = (1/p(η)) exp(−(1/2)(m(θ) − η̂)^T Σ^{-1} (m(θ) − η̂)),   (40)

where m(θ) ≔ (θ, m_1(θ), ..., m_K(θ)), Σ ≔ diag(Σ_θ, Σ_1, Σ_2, ..., Σ_K), and η̂ ≔ (θ̄, η). Here, we redefine η ← η̂ to enable compact notation. If the mapping m_k(·) is continuously differentiable, we have that:

m(θ) = m(θ*) + ∇m(θ*)(θ − θ*) + O(‖θ − θ*‖²).   (41)

To enable compact notation, we define η* = m(θ*) and ∇m* = ∇m(θ*). Since θ* is a solution of the MAP problem, it satisfies the stationarity condition of (40):

∇m*^T Σ^{-1} (η* − η) = 0.

We use (41) to obtain a second-order Taylor approximation of the posterior:

p(θ | η) ≈ (1/p(η)) exp(−(1/2)(η* − η + ∇m*(θ − θ*))^T Σ^{-1} (η* − η + ∇m*(θ − θ*)))
= (exp(−(1/2)(η* − η)^T Σ^{-1} (η* − η)) / p(η)) exp(−(1/2)(θ − θ*)^T ∇m*^T Σ^{-1} ∇m* (θ − θ*)).

This implies that the posterior is approximately represented as:

θ | η ∼ N(θ*, (∇m*^T Σ^{-1} ∇m*)^{-1}).

We recall that the output error is Gaussian, and we can thus write η = m(θ) + ε with ε ∼ N(0, Σ).
We now consider the MAP problem with randomly perturbed data:

min_θ (m(θ) − (η + ε))^T Σ^{-1} (m(θ) − (η + ε)),   (45)

and note that:

(m(θ) − (η + ε))^T Σ^{-1} (m(θ) − (η + ε)) = (η* − η + ∇m*(θ − θ*) − ε)^T Σ^{-1} (η* − η + ∇m*(θ − θ*) − ε) + O(‖θ − θ*‖²) + C,

where C is a constant. Consequently, for sufficiently small ε, we can linearize the mapping m(·) to obtain an approximate solution of (45) of the form:

θ̃ = θ* + (∇m*^T Σ^{-1} ∇m*)^{-1} ∇m*^T Σ^{-1} ε.   (47)

Here, we observe that the right-hand side of (47) is Gaussian with mean θ* and covariance (∇m*^T Σ^{-1} ∇m*)^{-1}. We thus have that:

θ̃ ∼ N(θ*, (∇m*^T Σ^{-1} ∇m*)^{-1}).   (49)

Consequently, solving (45) provides an approximate sample from the posterior distribution p(θ | η). The sampling procedure (45) is accurate up to second order. To obtain exact samples from the posterior, one needs to implement a rigorous MCMC scheme. The MCMC scheme removes the bias that appears in the rMAP sample density, which results from the second-order approximation [35,36,37]. This effect is illustrated with an example in [37]. Several works in the literature, however, report that accurate posterior densities can be obtained using an rMAP scheme [38,57,58]. We also note that the rMAP scheme implicitly captures nonlinearities and physical constraints when computing samples from the posterior (MCMC does not capture constraints). In particular, solving the perturbed problem (45) corresponds to solving the MAP problem (1)-(5) with randomly perturbed data, and the MAP problem enforces constraints and handles the full nonlinear model. The computational framework provided in this work focuses on the solution of the estimation problem and seeks to highlight that these capabilities enable the implementation of advanced UQ procedures. A detailed analysis and comparison of different UQ methods for biological dynamical models is an interesting topic of future work. Systematic comparisons in other settings are provided in [35,36].
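The following sketch illustrates rMAP on a hypothetical linear-Gaussian toy problem, where each perturbed MAP problem has a closed-form solution; for simplicity only the output data are perturbed (the prior here is deliberately weak, so its perturbation is negligible):

```python
# Toy rMAP sketch on a linear-Gaussian model: eta_i = a_i * theta + noise,
# prior theta ~ N(0, s_th^2). Each perturbed solve yields one approximate
# posterior sample; their spread matches the covariance in (49).
import random

random.seed(0)
a = [1.0, 2.0, 0.5, 1.5]     # "model Jacobian" entries (hypothetical)
sigma = 0.2                  # output noise standard deviation
s_th = 10.0                  # weak prior standard deviation
theta_true = 0.7
eta = [ai * theta_true + random.gauss(0.0, sigma) for ai in a]

def map_solve(data):
    # Closed-form MAP estimate (regularized least squares):
    # argmin_th sum_i (a_i th - d_i)^2 / sigma^2 + th^2 / s_th^2
    num = sum(ai * di for ai, di in zip(a, data)) / sigma**2
    den = sum(ai * ai for ai in a) / sigma**2 + 1.0 / s_th**2
    return num / den

theta_star = map_solve(eta)

# rMAP: perturb the data with the output-noise distribution and re-solve.
samples = [map_solve([d + random.gauss(0.0, sigma) for d in eta])
           for _ in range(2000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
# Theory, eq. (49): samples ~ N(theta_star, (grad m^T Sigma^{-1} grad m)^{-1})
var_theory = 1.0 / (sum(ai * ai for ai in a) / sigma**2 + 1.0 / s_th**2)
print(mean, theta_star, var, var_theory)
```

In the nonlinear, constrained setting each `map_solve` call becomes a full NLP solve, which is exactly what makes the warm-starting discussed later valuable.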

Algebraic modeling platforms
Having an algebraic representation of the estimation problem has many practical and computational advantages. In particular, one can implement the estimation problem in easy-to-use and open-source modeling languages such as JuMP [59], Plasmo.jl [60], and Pyomo [61,62]. These modeling languages are equipped with automatic differentiation techniques that compute exact first and second derivatives. Derivative information is communicated to optimization solvers without any user intervention. Modern algebraic modeling languages such as Plasmo.jl and Pyomo also allow users to convey structural information to the solvers. This is beneficial in the case of parameter estimation, where the structure can be exploited to enable parallelism and the use of high-performance computing clusters. In our framework, we use the modeling language Plasmo.jl to express multi-experiment estimation problems as graphs.
Our implementation using Plasmo.jl is illustrated in Fig 2. The full Julia script is available at https://github.com/zavalab/JuliaBox/tree/master/MicrobialPLOS. We highlight that the same script can be used to solve the estimation problem with a general NLP solver such as Ipopt on a single-processor computer, or with a structure-exploiting parallel NLP solver such as PIPS-NLP on multiple parallel computing processors (which might be a multi-core computing server or a large-scale computing cluster). This allows users with limited knowledge of scientific computing to gain access to advanced high-performance computing capabilities.

Results
Human gut microbial communities and microbiomes are highly dynamic networks coupled by positive or negative interactions and numerous feedback loops that display complex behaviors [63][64][65][66]. The generalized Lotka-Volterra (gLV) model provides a useful approach to capture such behavior [67][68][69][70][71]. Specifically, gLV captures single-species growth rates as well as intra-species and inter-species positive and negative interactions. We apply the proposed NLP framework to estimate the growth and interaction parameters of the gLV model from experimental data collected in [66]. The microbial species involved in the experiments are shown in Table 1. Experiments were designed to study the synthetic ecology encompassing 12 prevalent human-associated intestinal species. The gLV model is given by:

dx_s/dt = μ_s x_s + Σ_{s'∈S} α_{ss'} x_s x_{s'},   s ∈ S,   (Model 1)

where S = {1, 2, ..., S} is the set of microbial species, x_s : ℝ → ℝ is the trajectory of the abundance of species s ∈ S, μ_s is the growth rate of species s, and α_{ss'} is the interaction parameter that captures the effect of the abundance of species s' on the growth rate of species s. Species s and species s' are referred to as the recipient species and the donor species, respectively. Since (Model 1) only includes smooth mappings, one can apply the parameter estimation framework presented in the previous section.
The parameters (growth rates and interactions) cannot be calculated directly from first principles and must be estimated from experimental data. The means of the prior densities of the parameters are assumed to be μ̄_s = ᾱ_{ss'} = 0 and their standard deviations are assumed to be σ_μ = σ_α = 1/√50. These values are determined empirically by selecting the standard deviations that give biologically feasible parameter estimates (the ranges of biologically feasible parameter values are 0.09 < μ_s < 2.1, −10 < α_{ss'} < 10 for s ≠ s', and −10 < α_{ss} < 0). The standard deviations of the output measurements are assumed to be σ_{k,s}(t) = 0.05 max(0.1, η_{k,s}(t)). There are a total of 156 parameters: 12 monospecies growth rates and 144 interaction parameters (12 × 12). The set of experiments K includes 12 monospecies experiments and 66 pairwise community experiments (a total of K = 78 experiments). The estimation problem contains a total of 144 differential equations (i.e., the model is a system of ordinary differential equations on ℝ¹⁴⁴). The computational characteristics of the estimation problem are summarized in Table 2 (labeled as P1).
The dynamic model is discretized using an implicit Euler scheme with five equally-spaced discretization points (monospecies experiments) and 120 equally-spaced discretization points (pairwise experiments). The sufficiency of the discretization is verified by checking the sensitivity of the solution with respect to changes in the discretization. The following scheme is used:

1. Solve the problem to obtain the set of parameters θ*.
2. Increase the number of discretization points by a factor of two (i.e., reduce the time step by half) and re-solve the problem to obtain the set of parameters θ̃.
3. Check whether θ* and θ̃ satisfy the sensitivity criterion:

|θ̃_i − θ*_i| ≤ ε_abs + ε_rel |θ*_i| for all i.   (50)

We use ε_abs = ε_rel = 0.01. We have verified that the above-mentioned discretization scheme yields solutions that satisfy (50).
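As a sketch of this procedure, the following pure-Python code discretizes a hypothetical 2-species gLV model with implicit Euler and applies the step-halving check (the parameter values, horizon, and tolerance are illustrative, not those of P1):

```python
# Sketch: implicit-Euler discretization of a 2-species gLV model with the
# step-halving refinement check. Each implicit step x_{n+1} = x_n + h f(x_{n+1})
# is solved by fixed-point iteration (adequate for the small steps used here).

mu = [0.8, 0.5]                      # growth rates mu_s (hypothetical)
alpha = [[-1.0, -0.3], [-0.2, -0.8]] # interactions alpha_{ss'} (hypothetical)

def f(x):
    # gLV right-hand side: dx_s/dt = x_s (mu_s + sum_{s'} alpha_{ss'} x_{s'})
    return [x[s] * (mu[s] + sum(alpha[s][t] * x[t] for t in range(2)))
            for s in range(2)]

def simulate(x0, T, n_steps):
    h = T / n_steps
    x = list(x0)
    for _ in range(n_steps):
        z = list(x)                   # initial guess for x_{n+1}
        for _ in range(100):          # fixed-point iteration on implicit step
            fz = f(z)
            z_new = [x[s] + h * fz[s] for s in range(2)]
            if max(abs(z_new[s] - z[s]) for s in range(2)) < 1e-12:
                z = z_new
                break
            z = z_new
        x = z
    return x

x_coarse = simulate([0.1, 0.1], T=10.0, n_steps=120)
x_fine = simulate([0.1, 0.1], T=10.0, n_steps=240)

# Refinement check: halving the step should change the solution negligibly.
diff = max(abs(a - b) for a, b in zip(x_coarse, x_fine))
print(x_coarse, x_fine, diff)
assert diff < 0.01
```

In the estimation setting the same check is applied to the parameter estimates θ* and θ̃ rather than to the simulated states.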

Observability and regularization
Parameter observability was checked by solving the MAP formulation for P1 (which uses the available experimental data) and by checking the inertia of the augmented system at the solution (reported by Ipopt). Here, we omitted the prior regularization term φ(·). We found that the parameters obtained from P1 are not unique (not observable from the available data). Moreover, we found that the estimated parameter values without regularization are unrealistic (non-physical); see Fig 3(a). This observation justifies the need to use prior information. We used L1 and L2 priors with uniform standard deviations (other forms of prior can also be used as long as they can be reformulated as smooth functions). The results are presented in Fig 3(b) and 3(c). Unique parameter estimates were found when L1 or L2 priors were used. We also found that the L1 prior induces sparser solutions (many parameters are zero). For the remainder of the results, we use the formulation with an L2 prior.
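The sparsity-inducing effect of the L1 prior can be seen already in a scalar toy problem, where both regularized estimates have closed forms (a sketch with hypothetical values; soft-thresholding is the standard closed-form result for the L1 case):

```python
# Sketch: why an L1 prior induces sparsity while an L2 prior does not.
# For min_t (t - a)^2 + lam * pen(t), the L2 penalty shrinks the estimate
# toward zero, while the L1 penalty sets it exactly to zero whenever
# |a| <= lam / 2 (soft thresholding).

def ridge(a, lam):
    # argmin (t - a)^2 + lam * t^2  ->  a / (1 + lam)
    return a / (1.0 + lam)

def lasso(a, lam):
    # argmin (t - a)^2 + lam * |t|  ->  soft threshold at lam / 2
    if a > lam / 2:
        return a - lam / 2
    if a < -lam / 2:
        return a + lam / 2
    return 0.0

print(ridge(0.4, 1.0), lasso(0.4, 1.0))   # 0.2 vs exactly 0.0
```

This is the one-dimensional analogue of the behavior in Fig 3(b) and 3(c): under the L1 prior, weak interaction parameters are driven exactly to zero.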

Model fitting
Model validation was performed by assessing the goodness of fit to the experimental data (Fig 4). We can see that the model is capable of fitting most of the data points, but there are a number of experiments where the model prediction deviates significantly from the experimental data (such experiments are highlighted with red boxes). Furthermore, we can observe outliers at single data points (highlighted with red circles). Poor fitting can be caused either by bad local minima (the optimization solver finds a local rather than a global solution) or by structural model error (the model structure is incapable of capturing the actual behavior of the system). To avoid bad local minima, we solved the MAP problem with multiple starting points. Such an approach increases the probability of finding the global optimum, but obtaining a rigorous certificate of global optimality is computationally challenging (rigorous global optimization techniques are currently not scalable to large problems). We found that the use of multiple starting points does not improve the model fit. Consequently, we attribute the fitting errors to the model structure itself. In particular, the gLV model neglects various physical and biological phenomena such as lag phases or interaction coefficients that change as a function of time [66]. To investigate structural errors, we solved the MAP problem with a variant of the gLV model: the saturable gLV model [72]:

dx_s/dt = μ_s x_s + Σ_{s'∈S} α_{ss'} x_s x_{s'} / (K_{ss'} + x_{s'}),   s ∈ S,

where K_{ss'} > 0 are additional interaction parameters. The saturable model exhibits a much higher degree of nonlinearity than the gLV model and includes S² more parameters (the number of degrees of freedom increases from S² + S to 2S² + S). As a result, the saturable gLV model provides more flexibility to improve the model fit. These results thus seek to illustrate that the proposed NLP framework can be used to handle challenging dynamic models.
The model fitting obtained with the saturable gLV form is illustrated in Fig 5. As can be seen, significant improvements were made; in particular, the overall fitting error was reduced by 30%. Increasing the number of degrees of freedom can cause overfitting, however, and this can make the model less predictive. Consequently, there is a trade-off between fit and predictability.
To mitigate the effect of outliers, we solved the MAP problem with a CVaR norm and β = 0.9 (to penalize the 10% largest errors). The relevant results are summarized in Fig 6. It can be observed that the fitting errors for the outliers obtained with the standard MAP formulation are reduced. The effect of the CVaR formulation is also evident when analyzing the prediction error histogram (see Fig 7). In particular, we observe that the tail of high prediction errors becomes less pronounced under the CVaR formulation: the mean of the 10% largest errors decreases by 18% (from 167.81 to 137.04). On the other hand, it can also be observed that the mean error increases under CVaR and that the tail of small errors shrinks. This illustrates the fundamental trade-off that usually arises in robust statistics. The behavior induced by CVaR aids estimator performance because it prevents overfitting to the experimental data sets.
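A minimal sketch of the empirical CVaR criterion (tail-mean form, with hypothetical error values) shows why it focuses the estimator on the worst-fit points rather than the average fit:

```python
# Sketch: with beta = 0.9, the empirical CVaR of the fitting errors is the
# mean of the largest 10% of errors, so minimizing it targets the worst-fit
# data points (e.g., outliers) instead of the average residual.

def cvar(errors, beta=0.9):
    """Mean of the (1 - beta) fraction of largest errors."""
    srt = sorted(errors, reverse=True)
    k = max(1, round(len(srt) * (1.0 - beta)))
    return sum(srt[:k]) / k

# Hypothetical residuals with one outlier.
errors = [0.1, 0.2, 0.15, 0.05, 0.3, 0.25, 0.1, 0.2, 0.1, 5.0]
print(cvar(errors, beta=0.9))         # tail mean, dominated by the outlier
print(sum(errors) / len(errors))      # ordinary mean error
```

In the MAP formulation the tail is not computed by sorting; the CVaR objective is expressed with auxiliary variables and smooth constraints so that it remains an NLP.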

Inference (posterior) analysis
We used rMAP to assess the uncertainty of the 156 parameters estimated from P1 using the available experimental data. To do so, we draw data perturbations as η_{k,s}(t) ← η_{k,s}(t) + ε_{k,s}(t) with ε_{k,s}(t) ∼ N(0, σ_{k,s}(t)²). We solved 500 MAP problems to obtain parameter samples and used these to approximate the covariance matrix of the posterior. The standard deviations are shown in Fig 8(a) and the marginals of the posterior are shown in Fig 9. A large standard deviation indicates that the estimated parameter value is not reliable. We note that about half of the parameters can be estimated reliably, while the other half exhibit significant uncertainty. This indicates that more experimental data should be obtained. From the sample covariance, we generated 95% ellipsoidal confidence regions for each pair of parameters. The correlation plots of μ_s against α_{ss'} for s, s' ∈ S are shown in Fig 10, and the Pearson correlation coefficients are shown in Fig 8(b). In an ideal case, the parameters should be uncorrelated because the data should be sufficient to estimate each parameter reliably. With our data set, however, we observe strong correlations between the parameters μ_s and α_{ss} in Fig 10, and strong positive and negative correlations can also be found in Fig 8(b). Furthermore, since the rMAP framework provides the entire approximate distribution, we can perform more sophisticated analyses of its characteristics. In particular, one can investigate the third and fourth central moments (Fig 11) to examine the skewness and kurtosis of the distribution. Such information can be used to assess the deviation of the posterior from a normal distribution: if a posterior is strictly normal, its third and fourth central moments should be 0 and 3σ⁴, respectively. We observe that many of the posterior distributions deviate from these values and are thus not close to normal.
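The moment checks can be sketched as follows (with a deliberately small, hypothetical sample; for a normal distribution the third and fourth central moments are 0 and 3σ⁴):

```python
# Sketch of the normality checks on posterior samples: the third central
# moment detects skewness, and the fourth is compared against 3 * sigma^4.
# A symmetric sample passes the skewness check but can still fail kurtosis.

def central_moment(xs, p):
    m = sum(xs) / len(xs)
    return sum((x - m) ** p for x in xs) / len(xs)

samples = [-2.0, -1.0, 0.0, 1.0, 2.0]   # symmetric, hypothetical samples
m2 = central_moment(samples, 2)          # variance sigma^2
m3 = central_moment(samples, 3)          # ~0 for a symmetric distribution
m4 = central_moment(samples, 4)          # compare against 3 * sigma^4

print(m2, m3, m4, 3 * m2 ** 2)
# m3 = 0 (symmetric), but m4 = 6.8 < 3*sigma^4 = 12: lighter tails than normal
```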

Computational scalability
We assessed the computational scalability of the estimation framework by analyzing problems with different sizes and characteristics. Problem P1 was implemented in the algebraic modeling platform JuMP and solved with the NLP solver Ipopt configured with the sparse linear solver MA57. Problem P1 with the gLV model and an L2 prior was solved in 134 seconds and 78 NLP iterations on a standard computing server with an Intel(R) Xeon(R) CPU E5-2698 v3 processor running at 2.30GHz. Problem P1 with the gLV model and an L1 prior was solved in 219 seconds and 68 NLP iterations on the same hardware. A comparable problem requires over 7 hours to solve using a simulation-based approach implemented in Matlab that uses finite differences to obtain first derivatives [66]. Despite the significant gains in computational performance obtained with Ipopt, its solution time scales nearly quadratically with the number of data sets. To overcome this scalability issue, we compared the performance of the serial solver Ipopt against that of the parallel solver PIPS-NLP (which uses a Schur complement decomposition to perform linear algebra operations). To test the scalability of PIPS-NLP, we generated a larger version of the estimation problem (labeled as P2) by adding synthetic data sets. The NLP corresponding to P2 has over one million variables and constraints (but the number of parameters is the same as that of P1). This problem was implemented in Plasmo.jl. The benefit of using a parallel approach is clearly seen in Fig 12: PIPS-NLP solved P2 in less than 10 minutes and 94 NLP iterations using 16 cores, while Ipopt requires around 30 minutes and 67 iterations. Furthermore, Ipopt found a different local solution, and the solution from PIPS-NLP had a better objective value. In rMAP-based uncertainty quantification, the main computational challenge was the repetitive solution of the optimization problems.
However, this challenge can be overcome by reusing existing solution information. The number of iterations required by an NLP solver can be greatly reduced when a good starting point (initial guess of the solution) is available; this is often referred to as warm-starting. Since only small modifications are made to the original problem to formulate the rMAP problems, the NLP solution of an rMAP problem is very similar to that of the original problem. Thus, by warm-starting the NLP with the original solution, the computational effort required to solve the rMAP problems can be significantly reduced. In particular, most rMAP sampling problems were solved in fewer than 10 NLP iterations, while the original problem required 78 NLP iterations.
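The warm-starting effect can be illustrated with a scalar Newton solve of a hypothetical problem f(x; p) = x³ − p = 0: re-solving a slightly perturbed problem from the previous solution takes far fewer iterations than a cold start:

```python
# Sketch: warm starting a Newton iteration. Solving the perturbed problem
# (p = 8.1) from the solution of the original problem (p = 8.0) converges in
# far fewer iterations than starting cold.

def newton(p, x0, tol=1e-10, max_iter=100):
    """Newton's method for x**3 - p = 0; returns (solution, iterations)."""
    x, iters = x0, 0
    while abs(x ** 3 - p) > tol and iters < max_iter:
        x -= (x ** 3 - p) / (3 * x ** 2)   # Newton step
        iters += 1
    return x, iters

x_star, cold_iters = newton(8.0, x0=0.5)    # original problem, cold start
_, cold2 = newton(8.1, x0=0.5)              # perturbed problem, cold start
_, warm = newton(8.1, x0=x_star)            # perturbed problem, warm start
print(cold_iters, cold2, warm)
assert warm < cold2
```

The same logic applies to the rMAP problems: each perturbed MAP problem is a small perturbation of the original NLP, so its solution lies close to the original one.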
We also assessed the computational scalability for estimation problems with larger numbers of species in the microbial community (which increases the number of differential equations and parameters). Here, we generated synthetic data using simulations for larger communities. The generated problems are summarized in Table 2. The number of parameters and the number of data points scale nearly quadratically with the size of the community. The computation times are shown in Fig 13. The results indicate that, by using PIPS-NLP, one can solve estimation problems with up to 48 species in less than 15 minutes and 40 NLP iterations (using 12 parallel computing cores). We highlight that, to the best of our knowledge, problem S4 is the largest estimation problem reported in the computational biology literature. This problem contains 2,304 differential equations, 2,352 parameters, and 20,352 data points. The corresponding NLP contains 1.3 million variables and constraints.

Discussion
The high computational efficiency achieved with the proposed framework can enable kinetic modeling of complex biological systems ranging from biomolecular networks to high-dimensional microbial communities [74]. Indeed, the proposed framework can be used to construct and analyze high-fidelity models of whole cells or microbiomes [75,76]. In particular, these methods can be applied to develop predictive dynamic models of multi-gene synthetic circuits interacting with host-cell processes for accurately predicting cell growth and synthetic circuit activity [77], or kinetic models of metabolite transformations driving community dynamics. The proposed framework can also be used to handle more sophisticated models that arise in biological systems, such as delay differential equations, differential-algebraic equations, and partial differential equations. Dynamic biological models with embedded metabolic flux formulations (which give rise to non-smooth behavior) can be handled with the proposed framework by using reformulations [78]. Estimation problems for stochastic differential equations (such as stochastic chemical kinetics models) cannot be handled with the proposed framework and remain a challenging class of problems [79].
The methods provided in this work will advance our capability to integrate mechanistic modeling frameworks with large-scale experimental data. Furthermore, uncertainty quantification and observability analysis can provide valuable information to guide and accelerate experimental data collection. These capabilities are also essential in diagnosing structural model errors. The proposed framework uses state-of-the-art and easy-to-use modeling and solution tools that can be broadly applied to diverse biological systems and that are accessible to a wide range of users. In the future, the proposed framework can be interfaced with systems biology markup languages such as SBML [80] and CellML [81] to allow broader applicability. Together, these advances can ultimately transform biology into a predictive and model-guided discipline.