Differential methods for assessing sensitivity in biological models

Differential sensitivity analysis is indispensable in fitting parameters, understanding uncertainty, and forecasting the results of both thought and lab experiments. Although there are many methods currently available for performing differential sensitivity analysis of biological models, it can be difficult to determine which method is best suited for a particular model. In this paper, we explain a variety of differential sensitivity methods and assess their value in some typical biological models. First, we explain the mathematical basis for three numerical methods: adjoint sensitivity analysis, complex perturbation sensitivity analysis, and forward mode sensitivity analysis. We then carry out four instructive case studies. (a) The CARRGO model for tumor-immune interaction highlights the additional information that differential sensitivity analysis provides beyond traditional naive sensitivity methods, (b) the deterministic SIR model demonstrates the value of using second-order sensitivity in refining model predictions, (c) the stochastic SIR model shows how differential sensitivity can be attacked in stochastic modeling, and (d) a discrete birth-death-migration model illustrates how the complex perturbation method of differential sensitivity can be generalized to a broader range of biological models. Finally, we compare the speed, accuracy, and ease of use of these methods. We find that forward mode automatic differentiation has the quickest computational time, while the complex perturbation method is the simplest to implement and the most generalizable.


Introduction
In many mathematical models underlying parameters are poorly specified. This problem is particularly acute in biological and biomedical models. Model predictions can have profound implications for scientific understanding, further experimentation, and even public-policy decisions. For instance, in an epidemic some model parameters can be tweaked by societal or scientific interventions to drive infection levels down. Differential sensitivity can inform medical judgement about the steps to take with the greatest impact at the least cost. Similar considerations apply in economic modeling. Additionally, parameter estimation for model fitting usually involves differential sensitivity through maximum likelihood or least squares criteria. These optimization techniques depend heavily on gradients and Hessians with respect to parameters. While some parameter estimation methods rely on Bayesian computational techniques [1] rather than gradients, these techniques tend to scale poorly as the number of model parameters increases. A common way to alleviate the poor scaling of Bayesian inference is Hamiltonian Monte Carlo [2], which itself requires gradient calculations. Techniques for assessing sensitivity of stochastic models often rely on the gradient-dependent Fisher information matrix of the model, which is the basis for a variety of multi-step local sensitivity analysis techniques for discrete stochastic models [3].
Calculation of gradients and Hessians of a model can also be important in other steps of the scientific process. For example, iterative model development [4] involves using the Fisher information matrix to inform experimental design. Extended Kalman filtering [5] incorporates differential sensitivity into model construction. Regardless of the method, parameter estimation is an important step in fitting a biological model, and the success of this step strongly impacts the ultimate utility of the model. Understanding the uses and limitations of differential sensitivity can aid in determining the identifiability of model parameters, how sensitive they are to experimental error or measurement noise, and the overall importance of their existence in the model. Finally, it is worth noting that while local sensitivity analysis is the focus of this manuscript, global sensitivity analysis often relies on local differential sensitivity estimates to inform optimal stepsizes in regional searching [6] or to resolve inconsistencies that arise when local sensitivity is non-monotonic [7].
In any case it is imperative to know how sensitive model predictions are to changes in parameter values. Unfortunately, assessment of model sensitivity can be time consuming, computationally intensive, inaccurate, and simply confusing. Most models are nonlinear and resistant to exact mathematical analysis. Understanding their behavior is only approachable by solving differential equations or intensive and noisy simulations. Sensitivity analysis is often conducted over an entire bundle of neighboring parameters to capture interactions. If the parameter space is large or high-dimensional, it is often unclear how to choose representative points from this bundle. Faced with this dilemma, it is common for modelers to fall back on varying just one or two parameters at a time. Model predictions also often take the form of time trajectories. In this setting, sensitivity analysis is based on lower and upper trajectories bounding the behavior of the dynamical system.
The differential sensitivity of a model quantity is measured by its gradient with respect to the underlying parameters at their estimated values. The existing literature on differential sensitivity is summarized in the modern references [8,9]. There are a variety of software packages that evaluate parameter sensitivity. For example, the Julia software DifferentialEquations.jl [10] makes sensitivity analysis routine for many problems. Additionally, PESTO [11] is a current Matlab toolbox for parameter estimation that uses adjoint sensitivities implemented as part of the CVODES method from SUNDIALS [12]. Although the physical sciences have widely adopted the method of differential sensitivity [13,14], the papers and software generally focus on a single sensitivity analysis method rather than a comparison of the various approaches. This singular focus leaves open many questions when biologists conduct sensitivity analyses. Should the continuous sensitivity equations be used, or would automatic differentiation of solvers be more efficient on biological models? On the types of models biologists generally explore, would implicit parallelism within the sensitivity equations be beneficial, or would the overhead cost of thread spawning overrule any benefits? How close do simpler methods based on complex perturbation get to these techniques? The purpose of the current paper is to explore these questions on a variety of models of interest to computational biologists.
In the current paper we also suggest an accurate method of approximating gradients that compares favorably against forward automatic differentiation techniques, provided a model involves analytic functions without discontinuities, maxima, minima, absolute values, or any other excursion outside the universe of analytic functions. In the sections immediately following, we summarize known theory, including the important adjoint method for computing the sensitivity of functions of solutions [13,14]. Then we illustrate sensitivity analysis for a few deterministic models and a few stochastic models. Our exposition includes some straightforward Julia code that readers can adapt to their own sensitivity needs. These examples are followed by an evaluation of the accuracy and speed of the suggested numerical methods. The concluding discussion summarizes our experience, indicates limitations of the methods, and suggests new potential applications.
For the record, here are some notational conventions used throughout the paper. All functions that we differentiate have real or real-vector arguments and real or real-vector values. All vectors and matrices appear in boldface. The superscript indicates a vector or matrix transpose. For a smooth real-valued function f(x), we write its gradient (column vector of partial derivatives) as rf(x) and its differential (row vector of partial derivatives) as df ðxÞ ¼ rf ðxÞ T .
If g(x) is vector-valued with ith component g i (x), then the differential (Jacobi matrix) dg(x) has ith row dg i (x). The chain rule is expressed as the equality d½f � gðxÞ� ¼ df ½gðxÞ�dgðxÞ of differentials. The transpose (adjoint) form of the chain rule is rf � gðxÞ ¼ dgðxÞ T rf ½gðxÞ�. For a twice-differentiable function, the second differential (Hessian matrix) d 2 f(x) = drf(x) is the differential of the gradient. Finally, i will denote ffi ffi ffi ffi ffi ffiffi À 1 p .

Forward method
S3 Appendix briefly discusses sensitivity analysis for the linear constant coefficient system Þx t ð Þ of ordinary differential equations (ODEs). Sensitivity of the nonlinear system d dt x t; β ð Þ ¼ f xðtÞ; β ½ � can be evaluated by differentiating the original ODE with respect to β j , interchanging the order of differentiation, and numerically integrating the system This formulation of the problem depends on knowing x(t, β). In practice, one solves the system d dt xðt; βÞ r β xðt; βÞ jointly, where d β x[t, β] is the Jacobi matrix of x(t, β) with respect to β. This is commonly referred to as forward sensitivity analysis and is carried out by software suites such as Differen-tialEquations.jl and SUNDIALS CVODES [12]. We note that a common implementation of sensitivity analysis is to base calculations on directional derivatives. Thus, the directional derivative f fxðtÞ þ �r x f ½xðtÞ; β�; βg À f ½xðtÞ; β� � version of the forward method allows one to evolve dynamical systems without ever computing full Jacobians. The forward method can also be applied when quantities of interest are defined recursively.

Adjoint methods
The adjoint method is incorporated in the biological parameter estimation software PESTO through CVODES [12]. This method [8,9] is defined directly on a function g[x(β), β] of the solution of the ODE. The adjoint method introduces a Lagrange multiplier λ(β), numerically solves the ODE system forward in time over [t 0 , t n ], then solves the system d β lðβÞ ¼ d x f ½xðβÞ; β�lðβÞ þ d β g½xðβÞ; β�; for λ(β) in reverse time, and finally uses the introduced parameter to compute derivatives via lðt; βÞd β xðt; βÞdt: The second and third stages are commonly combined by appending the last equation to the set of ODEs being solved in reverse. This tactic achieves a lower computational complexity than other techniques, which require solving an n-dimensional ODE system p times for p parameters. In contrast, the adjoint method solves an n-dimensional ODE forwards and then solves an n-dimensional and a p-dimensional system in reverse, changing the computational complexity from OðnpÞ to Oðn þ pÞ. Whether such asymptotic cost advantages lead to more efficiency on practical models is precisely one of the points studied in this paper.
Alternatively, one can find the partial derivatives using finite differences. The simplest method here is to compute a slightly perturbed trajectory x(t, β+Δv) and form the forward differences xðt; β þ DvÞ À xðt; βÞ D at all specified time points as approximations to the forward directional derivatives of x(t, β) in the direction v. Choosing v to be unit vectors along each coordinate axis gives ordinary partial derivatives. The accuracy of this crude method suffers from round-off error in subtracting two nearly equal function values. These round-off errors are in addition to the usual errors committed in integrating the differential equation numerically. Round-off errors can be ameliorated by using central differences rather than forward differences. However, the central difference method requires twice the number of computations as the forward difference method. Thus, the choice of a difference method depends on prioritization of accuracy versus computational efficiency. In small models, computational efficiency may be less of a priority, in which case central difference methods are preferred.

Complex perturbation methods
There is a far more accurate way of computing model sensitivity when the function f[x, β] defining the ODE is analytic in the parameter vector β [15]. An analytic function can be expanded in a locally convergent power series around every point of its domain. This implies that the trajectory x(t, β) is also analytic in β. For a real analytic function g(β) of a single variable β, the derivative approximation in the complex plane avoids roundoff and is highly accurate for Δ>0 very small [16,17]. Thus, in calculating a directional derivative of x(t, β), it suffices to (a) solve the governing ODE d dt x t; β ð Þ ¼ f xðtÞ; β ½ � with β+Δiv replacing β, (b) take the imaginary part of the result, and (c) divide by Δ. To make these calculations feasible, the computer language implementing the calculations should support complex arithmetic and ideally have an automatic dispatching mechanism so that only one implementation of each function is required. In contrast to numerical integration of the joint system (Eq 1), the complex perturbation method is much more simply parallelizable across parameters.
The following straightforward Julia routine for computing sensitivities takes advantage of the simplicity of multithreading the complex perturbation method by parameter. This function requires a function f(p): R n 7 !R m of a real vector p declared as complex. The perturbation scalar Δ should be small and real, say 10 −10 to 10 −12 in double precision. If the parameters p j vary widely in magnitude, then a good heuristic is to perturb p j by p j di instead of di. The returned value df is an m×n real matrix. The Julia commands real and complex effect conversions between real and complex numbers, and Julia substitutes im for i ¼ ffi ffi ffi ffi ffi ffiffi À 1 p . We will employ these functions later in some case studies. A recent extension [18] of the complex perturbation method facilitates accurate approximation of second derivatives. The relevant formula is where e pi=4 ¼ ð1 þ iÞ= ffi ffi ffi 2 p . Roundoff errors can now occur but are usually manageable. Here we present a novel result for how to extend the complex perturbation method to approximate mixed partials. Our derivation is condensed into the following equations This approximation is accurate to order O(Δ 6 ) and allows us to infer that Imagg½x þ e pi=4D ðe j þ e k Þ� þ g½x À e pi=4D ðe j þ e k Þ� Since we can approximate @ 2 @b 2 j g β ð Þ and @ 2 @b 2 k g β ð Þ, we can now approximate @ 2 @b j @b k g β ð Þ to order O(Δ 4 ). These approximations are derived in S1 Appendix. The Julia code for computing second derivatives function hessian(f::F, p, Δ) where F d2f = zeros(length(p), length(p)) # hessian dp = Δ � (1.0 + 1.0 � im) / sqrt ( operates on a scalar-valued function f(u) of a real vector p declared as complex. The second-order complex perturbation method can also be multithreaded by parameter, provided the unmixed second partials are computed prior to the mixed ones. Because roundoff error is now a concern, the perturbation scalar Δ should be in the range 10 −3 to 10 −6 in double precision. The returned value d 2 f is a symmetric matrix.

Automatic differentiation
Another technique one can use to calculate the derivatives of model solutions is to differentiate the numerical algorithm that calculates the solution. This can be done with computational tools collectively known as automatic differentiation [19]. Forward mode automatic differentiation is performed by carrying forward Jacobian-vector products at each successive calculation. This is accomplished by defining higher-dimensional numbers, known as dual numbers [20], coupled to primitive functions f(x) through the action f ða þ b�Þ ¼ f ðaÞ þ �df ðaÞb: Here � is a dimensional marker, similar to the complex i, which is a two-dimensional number. For a composite function f = f 2 � f 1 , the chain rule is df ðaÞb ¼ df 2 ½f 1 ðaÞ�df 1 ðaÞb. The ith column of the Jacobian appears in the expression f ðx þ e i �Þ ¼ f ðxÞ þ �r i f ðxÞ. Since computational algorithms can be interpreted as the composition of simpler functions, one need only define automatic differentiation on a small set of base cases (such as +, � , sin, and so forth, known as the primitives) and then apply the accepted rules in sequence to differentiate more elaborate functions. The For-wardDiff.jl package [20] in Julia accomplishes this by defining dispatches for such primitives on a dual number type and provides convenience functions for easily extracting common objects like gradients, Jacobians, and Hessians. Hessians are calculated by layering automatic differentiation twice on the same algorithm to effectively take the derivative of a derivative.
In this form, forward mode automatic differentiation shares many similarities to the complex perturbation methods described above without the requirement that the extension of f(x) be complex analytic. At every stage of the calculation f(x) must be differentiable, a weaker yet still restrictive assumption. Conveniently, automatic differentiation allows for arbitrarily many derivatives to be calculated simultaneously. By defining higher-dimensional dual numbers that act independently via one can calculate entire Jacobians in a single function call f ða þ P i e i � i Þ. This use of higher-dimensional dual numbers is a practice known as chunking. Chunking reduces the number of primal (non-derivative) calculations required for computing the Jacobian. Because the For-wardDiff.jl package uses chunking by default, we will investigate the extent to which this detail is applicable in biological models.

Case studies
We now explore applications of differential sensitivity to a few core models in oncology and epidemiology.

CARRGO model
The CARRGO model [21] was designed to capture the tumor-immune dynamics of CAR Tcell therapy in glioma. The CARRGO model generalizes to other tumor cell-immune cell interactions. Its governing system of ODEs follows cancer cells x as prey and CAR T-Cells y as predators. This model captures Lotka-Volterra dynamics with logistic growth of the cancer cells. Our numerical experiments assume the parameter values and initial conditions y ¼ 1 � 10 À 6 =day; r ¼ 6 � 10 À 2 =day; g ¼ 1 � 10 9 cells; x 0 ¼ 1:25 � 10 4 cells; y 0 ¼ 6:25 � 10 2 cells suggested by Sahoo et al. [21].
A traditional sensitivity analysis hinges on solving the system of ODEs and displaying the solutions at a chosen future time across an interval or rectangle of parameter values. Fig 1 shows how x(t) and y(t) vary at t = 1000 days under joint changes of κ 1 and κ 2 , where κ 1 is the rate at which cancer cells are destroyed in an interaction with an immune cell, and κ 2 is the rate at which immune cells are recruited after such an interaction. This type of analysis directly portrays how a change in one or two parameters impacts the outcome of the system. Surprisingly, the number of cancer cells x(t) depends strongly on κ 2 but only weakly on κ 1 . In contrast, the number of immune cells y(t) depends comparably on both parameters, perhaps because the initial population of immune cells is much smaller than the initial population of cancer cells.
There are limitations to this type of sensitivity analysis. How many solution curves should be examined? What time is most informative in displaying system changes? Is it necessary to compute sensitivity over such a large range of parameters when the trends are so clear? These ambiguities cloud our understanding and require far more computing than is necessary. Differential sensitivity successfully addresses these concerns. Gradients of solutions immediately yield approximate solutions in a neighborhood of postulated parameter values. The relative importance of different parameters in determining species levels can be determined from inspection of the gradient. Furthermore, modern software easily delivers the gradient along entire solution trajectories. There is no need to solve for an entire bundle of neighboring solutions.
Differential assessment is far more efficient. The required calculations involve solving an expanded system of ordinary differential equations just once under the automatic differentiation method or solving the system once for each parameter under the complex perturbation method. Either way, the differential method is much less computationally intensive than the traditional method of solving the ODE system over an interval for each parameter or over a rectangle for each pair of parameters. Here is our brief Julia code for computing sensitivity via the complex perturbation method.  p = complex([6.0e-9, 3.0e-11, 1.0e-6, 6.0e-2, 1.0e9]); # parameters x0 = complex([1.25e4, 6.25e2]); # initial values (d, tspan) = (1.0e-13, (0.0,1000.0)); # step size and time interval in days solution = sensitivity(x0, p, d, tspan); # find solution and partials CARRGO1 = plot(solution [1][:, 1], label = "x1", xlabel = "days", ylabel = "cancer cells x1", xlims = (tspan [1],tspan [2])) CARRGO2 = plot(solution [1][:, 2], label = "d1x1", xlabel = "days", ylabel = "p1 sensitivity", xlims = (tspan [1],tspan [2])) In the Julia code the parameters κ 1 , κ 2 , θ, ρ, and γ and the variables x and y exist as components of the vector p and x, respectively. The two plot commands construct solution curves for cancer and its sensitivity to perturbations of κ 1 . Fig 2 reinforces the conclusions drawn from the heatmaps, but more clearly and quantitatively. Additionally, differential sensitivity allows for the assessment of the sensitivity over the course of time, rather than just at a single time or small set of times. For example, the sensitivity of x with respect to γ in this model exhibits both large positive and large negative values over the course of time. Measured as the difference in x caused by a difference in γ at our endtime t = 1000, these effects tend to cancel each other out and fail to communicate the impact of the parameter γ on x at intermediate times. In brief, the scaled sensitivity of cancer cells x is much more dependent on carrying capacity γ later in the simulation, while the model sensitivity to birth rate ρ is most pronounced around the earlier time t = 200.

Deterministic SIR model
The deterministic SIR model follows the number of infectives I(t), the number of susceptibles S(t), and the number of recovereds R(t) during an epidemic. These three subpopulations satisfy the ODE system where η is the daily infection rate per encounter and δ is the daily rate of progression to immunity per person. For SARS-CoV-2, current estimates [22] of η range from 0.0012 to 0.48, and estimates of δ range from 0.0417 to 0.0588 [23]. As an alternative to solving the extended set of differential equations, we again use the complex perturbation method to evaluate parameter sensitivities.
The following Julia code for the complex perturbation method reuses the generic sensitivity function from the CARRGO model example. Our parameter choices roughly capture measurements for the COVID-19 virus from early in the pandemic [22,23]. Fig 3 plots the susceptible curve and its sensitivities. In this case all three curves conveniently occur on comparable scales. Fig 3 captures not only the pronounced parameter sensitivity early in the pandemic but also the symmetry between the δ and η parameters.

Second-order expansions of solution trajectories
In predicting nearby solution trajectories, the second-order Taylor expansion improves accuracy over the first-order expansion The improved accuracy achieved by including second-order terms often justifies their computation. The complex perturbation method permits straightforward computation of second derivatives via approximations (Eq 2) and (Eq 3). The DiffEqSensitivity.jl and ForwardDiff.jl packages implement both adjoint and forward difference methods for computing the second derivatives of differential equation systems. For example, the top right panel of Fig 4 shows that the solution trajectory of infected individuals bends dramatically with a change in parameters. This behavior is much better reflected in the second-order prediction compared to the first-order prediction, which over-corrects at the peak. The Euclidean distance between the actual and predicted trajectories at the sampled time points is about 25.4 in the first-order case and only about 9.06 in the second-order case, a reduction of over 60% in prediction error. By contrast, the trajectory of the recovered individuals steadily increases in a much more linear fashion. The bottom left panel of Fig 4 shows that the first-order prediction now remains reasonably accurate over a substantial period. Even so, the discrepancy between the predicted solutions grows so that by day 100 the Euclidean distance between the first-order prediction and the actual trajectory exceeds 154, compared to about 34.0 for the second-order prediction. Thus, calculating second-order sensitivity is helpful in both highly non-linear systems and systems with long time scales.

Stochastic SIR model
We now illustrate sensitivity calculations in the stochastic SIR model. This model postulates an original population of size n with i infectives and s susceptibles. The parameters δ and η again capture the rate of progression to immunity and the infection rate per encounter. Since extinction of the infectives is certain, we focus on the time to elimination of the infectives. It is also convenient to follow the vector (i, n), where n = i+s is the sum of the number of infectives i plus the number of susceptibles s. The mean time t in to elimination of all infectives satisfies the recurrence for 0<i<n together with the boundary conditions The expression for t ii stems from adding the expected time for the i!i−1 transition, plus the expected time i−1!i−2, and so forth. This system of equations can be solved recursively for i = n, n−1,. . .0 starting with n = 1. Once the values for a given n are available, n can be incremented, and a new round is initiated. Ultimately the target size n = N is reached. Taking partial derivatives of the recurrence (Eq 6) yields a new system of recurrences that can also be solved recursively in tandem with the original recurrence. The complex perturbation method is easier to implement and comparable in accuracy to the partial derivative method.
Another important index of the SIR process is the mean number of infectives m in ever generated starting with i initial infectives and n total people. These expectations can be calculated via the recurrences for 0<i<n together with the boundary conditions m ii ¼ i and m 0n ¼ 0: One can compute the sensitivities of the m in to parameter perturbations in the same way as the t in . Here is the Julia code for the two means and their sensitivities via the complex perturbation method. Note how our earlier differential function plays a key role.  The left column of Fig 5 displays a heatmap of the expected total number of individuals infected and the right column displays a heatmap of the expected days to extinction of the infection process. Rows 2 and 3 show the sensitivites of these quantities to the η and δ parameters in the stochastic SIR model.
It is interesting to compare results from differential sensitivity to estimates from stochastic simulations. To see the difference in accuracy, we calculated the average number of individuals infected and the average time to extinction by stochastic simulation using the software package BioSimulator.jl [24]. Table 1 records the analytic and simulated means of these outcomes in the SIR model. As Table 1 indicates, the simulated means over r = 100 runs are roughly comparable to the analytic means, but the standard errors of the simulated means are large. Because

PLOS COMPUTATIONAL BIOLOGY
Differential methods for assessing sensitivity in biological models the standard errors decrease as 1 ffiffi r p , it is difficult to achieve much accuracy by simulation alone. In more complicated models, simulation is so computationally intensive and time consuming that it is nearly impossible to achieve accurate results. Of course, the analytic method is predicated on the existence of an exact solution or an algorithm for computing the same.
Parameter sensitivities inform our judgment in interesting and helpful ways. For example, derivatives of both the total number of infecteds and the time to extinction with respect to η are very small except in a narrow window of the η parameter. This suggests that we focus further simulations, sensitivity analysis, and possible interventions on the region of parameter space where η falls in these windows. Derivatives with respect to δ also depend mostly on η except at very small values of δ. These conclusions are harder to draw from noisy simulations alone.

Branching processes
Branching process models offer another opportunity for checking the accuracy of sensitivity calculations. For simplicity we focus on birth-death-migration processes [25]. These are multitype continuous-time processes [26,17] that can be used to model the early stages of an epidemic over a finite graph with n nodes, where nodes represent cities or countries. On node i we initiate a branching process with birth rate β i >0 and death rate δ i >0. The migration rate from node i to node j is λ ij �0. All rates are per person, and each person is labeled by a node. Let λ i = ∑ j6 ¼i λ ij be the sum of the migration rates emanating from node i. Given this notation, the mean infinitesimal generator of the process is the matrix The entries of the matrix e tΩ ¼ ½m ij ðtÞ� represent the expected number of people at node j at time t starting from a single person of type i at time 0. The process is irreducible when the pure migration process corresponding to the choice β i = δ i = 0 for all i is irreducible. Equivalently, the process is irreducible when the graph representing transition probabilities is strongly connected. Henceforth, we assume the process is irreducible and let Γ denote the mean infinitesimal generator of the pure migration process. The process is subcritical, critical, or supercritical depending on whether the dominant eigenvalue ρ of O is negative, zero, or positive.
To determine the local sensitivity of ρ to a parameter θ [26,27], suppose its left and right eigenvectors v and w are normalized so that vw = 1. Differentiating the identity Ow = ρw with respect to θ yields If we multiply this by v on the left and invoke the identities vO = ρv and vw = 1 we find that Because @ @d i Ω ¼ À @ @b i Ω, it follows that an increase in δ i has the same impact on ρ as the same decrease in β i . The sensitivity of v and w can be determined by an extension of this reasoning [28]. The extinction probabilities e i of the birth-death-migration satisfy the system of algebraic equations for all i. This is a special case of the vector extinction equation for a general branching process with offspring generating function P i (x) for a type i person [29]. For a subcritical or critical process, e = 1. For a supercritical process all e i 2(0,1). Iteration is the simplest way to find e. Starting from e 0 = 0, the vector sequence e n = P(e n−1 ) satisfies 0 � e n � e nþ1 � e and converges to a solution of the extinction equations. Here all inequalities apply component-wise. To find the differential [28] of the extinction vector e with respect to a vector θ of parameters, we assume that the branching process is supercritical and resort to implicit differentiation The indicated inverse does, in fact, exist. Alternatively, one can compute an entire extinction curve e(t) whose component e i (t) supplies the probability of extinction before time t starting from a single person of type. This task reduces to solving the ODE for d dt e t ð Þ by the methods previously discussed.
The following Julia code computes the sensitivities of the extinction probability for a twonode process by the complex perturbation method.
The average number a ij of infected individuals of type j ultimately generated by a single initial infected individual of type i is also of interest. The matrix A = (a ij ) of these expectations can be calculated via the matrix equation where F is the offspring matrix One can determine the local sensitivity of the expected numbers of total descendants by differentiating the equation A = (I n −F) −1 . The result depends on the sensitivity of the expected offspring matrix F. Julia code for the complex perturbation method with two nodes follows.

Results
We now measure the accuracy, computational speed, and prediction error for adjoint, forward mode, and complex perturbation methods. To account for the variety of settings encountered by biologists, we include two additional ODE models in our comparisons. The ROBER model describes chemical reactions typical of enzymatic behavior [30] and furnishes an example of a stiff ODE system. More information on the ROBER model can be found in S2 Appendix. To compare the three methods in a high-dimensional ODE model, we turn to the mammalian cell cycle (MCC) model. Our MCC model is a simplified version of the original MCC model constructed by Gerard and Goldbetor [31], as explained in more detail in S2 Appendix. The model comprises 11 equations and 15 parameters and captures aspects of cell reproduction and cycling mediated by chemical signaling via cell-state dependent proteins such as tumor repressors, transcription factors, and other DNA replication checkpoints. The model relies on cell state as opposed to cell mass and nicely replicates sequential progression along the cell cycle.

Accuracy
It is important to understand how close computed differential sensitivities are to true differential sensitivities. Unfortunately, the latter are almost always unavailable for ODE models. For the stochastic SIR and branching process models, true sensitivities are well matched by the approximate sensitivities delivered by the complex perturbation methods, provided the complex perturbation is small enough [32]. As a proxy for comparison to true values in ODE models, one can compute the Euclidean distance between sensitivities delivered by the complex perturbation method and the methods relying on the chain rule. In general, we find that these distances are very small.
For the forward and adjoint sensitivities of non-stiff ODEs such as the SIR and CARRGO models, it is known that as one decreases the tolerance of the underlying ODE solver, the solution and its sensitivities converge to their true values [33]. To demonstrate that the same behavior occurs in our cases, we compute the sensitivities @ @Z S of the SIR model and @ @p 1 x 1 of the ROBER model at t = 1000 using the adjoint, forward, and complex perturbation methods at a variety of tolerances ranging from 1×10 −2 to 1×10 −8 . Fig 6 shows that all three method types (adjoint, forward, and complex perturbation) ultimately converge. In the non-stiff case (the SIR model), the adjoint method requires a step size of 1.0 to converge, while the stiff case (the ROBER model) requires a much smaller step size of 0.1 to converge. Each method converges at a different rate and potentially from a different direction. In the case of a relatively small, non-stiff model, the complex perturbation method converges more quickly (and at a higher tolerance) than the other methods. Notably, when the tolerance for the adjoint method is too weak the error rate increases more dramatically than for the forward method. This behavior becomes even more pronounced if we consider a stiff ODE model such as ROBER. In this case it is worth noting that the forward and complex perturbation methods converge, albeit under a more stringent tolerance. The adjoint method however struggles to converge for the ROBER model unless the step size is decreased to 0.1 (as shown in the Fig 6). While the smaller step size does allow the adjoint method to converge even in the stiff case, this smaller step size is much more computationally intensive and, in many cases, may be infeasible.

The speed versus accuracy trade-off
The trade-off between computational speed and accuracy is relevant to solving ODE systems whether they are stiff or not. Fig 7 displays the time versus error trade-off for both the SIR (non-stiff) and ROBER (stiff) models. In this case, error is calculated as the Euclidean distance between the derivatives calculated at various error tolerances and the derivatives calculated at a strict tolerance of 1×10 −8 (for the SIR model) and 1×10 −5 (for the ROBER model). We chose these tolerances as the strictest possible that are numerically realistic for each model. Fig 6 demonstrates that our choices are strict enough for the methods to reach convergence. We display errors versus time in a log-log plot averaged over compartments and parameters and normalized by length of time. We do not include the adjoint method in this comparison due to its difficulties in convergence and large computational cost.   between speed and accuracy in both the stiff (ROBER) and non-stiff (SIR) cases. In both cases, the forward method can be computed more quickly for equal errors than the complex perturbation method. As expected, the ROBER model has a less steep slope compared to the SIR model, indicating that the returns in accuracy grow more slowly per time invested for a stiff ODE system.

Computational speed
Speed is an important attribute of any computational method, especially when it is performed without the benefit of computational clusters or distributed computing resources. Our speed comparisons offer a first look at the efficiency gains possible with multithreading. In implementing multithreading for both the complex perturbation and forward mode methods, we call the Polyester.jl package to compute each partial derivative in a separate thread. All computations were done in Julia version 1.7.1 on a Windows operating system with an Intel Core i7-8565U CPU.
In addition to multithreading, the forward method as implemented in ForwardDiff.jl package provides the capability of multichunking. This involves splitting the equations in each system into different chunks to be solved separately. While forward methods do benefit from chunking, this tactic is unavailable in many packages outside of ForwardDiff.jl or outside of the Julia language. For biologists who depend on other packages and computer languages, it may be more pertinent to focus on the non-chunked results for the forward method. Tables 2, 3, 4 and 5 record the computational speed of the complex perturbation, forward, and adjoint methods (and their multithreaded and multi-chunked versions, as applicable) for four ODE systems models (SIR, CARRGO, ROBER, and MCC). Our comparisons of the firstorder methods show that the forward and complex perturbation methods perform comparably, while the adjoint method performs orders of magnitude slower than the other two. The fastest method is the multichunked forward method, with the complex perturbation method a close second for the simpler ODE systems such as SIR and CARRGO. For the stiff (ROBER) and large (MCC) models however, the complex perturbation method falls further behind the multichunk forward mode method. This could be expected from the larger gap between the time versus accuracy curves in the ROBER model as compared with the SIR model and illustrated in Fig 7. However, it is noteworthy that naive implementations of forward mode differentiation lack the advantage of chunking and are consequently slower than the complex perturbation method.
The adjoint method also has the worst time performance of the second-order methods by orders of magnitude. Both the forward and complex perturbation methods performed well in all four ODE systems models, with the complex perturbation method performing particularly well in models where the number of parameters is not large compared to the number of equations.
While multi-threading usually decreases computational time for both first-order and second-order methods, it does not decrease computational time by as wide of a margin as expected. Many of the solver methods for stiff ODEs rely on BLAS operations that are already internally optimized by running on multiple threads. Explicitly multi-threading sensitivity methods therefore restricts the number of threads available for BLAS operations, adversely affecting their performance. In addition to the reduced efficiency of BLAS operations, multithreading incurs a start-up cost for each thread. These start-up costs may overshadow the benefits of multi-threading if the amount of computation per thread is not high enough. Multithreaded methods require more allocations than other methods, and thus require more garbage collection. While time spent on garbage collection varies, we find that garbage collection can take over twice as much computational time in multi-threaded methods than in their single-threaded counterparts. Thus, multi-threading can only really start to improve computational efficiency when these additional costs are small compared to the cost of each computation. Multi-threading may even be less efficient in some cases. (Rodas5(autodiff = false)) with a tolerance of 1×10 −5 , reflecting the convergence tolerance. For the second-order adjoint method, the ForwardDiffOverAdjoint(QuadratureAdjoint(autodiff = false) solver option was used.

First-order Methods t end = 10 t end = 100 t end = 1000
Complex  Tables 6 and 7 compare the computational speeds of the different methods for the stochastic SIR and branching process models. As expected for the stochastic SIR model, computational speed varies roughly quadratically with the number N of individuals in the system. In the stochastic SIR model, the complex perturbation method proves to be twice as fast as the manual differentiation of (Eq 10) and (Eq 8) because the latter requires a larger number of individual computations. For the branching process model however, this trend reverses since manual Table 4. Computational time (μs) for the ROBER ODE model. Parameters match those previously introduced in this manuscript. Multithread refers to parallelism across parameters. Multichunk refers to parallelism across compartments. We invoke the Julia solver Rodas4(autodiff = false) with a tolerance of 1×10 −7 , reflecting the convergence tolerance. Second-order adjoint method not included at t = 1000 due to time constraints. For the second-order adjoint method, the ForwardDiffO-verAdjoint(QuadratureAdjoint(autodiff = false) solver option was used. differentiation relies on fast linear algebra rather than iteration and avoids the overhead of complex arithmetic. The derivatives of A are matrix equations, and in this case forward mode differentiation even without chunking performs as well as the complex perturbation method, although it does not scale as well to larger systems (N = 1000). However, in the case of the derivatives of e, which are calculated using recursion, neither implementation of forward mode differentiation can be computed as quickly as the complex perturbation method, and this difference increases with the size of the system. Other evidence not shown suggests that the complex perturbation method can reliably evaluate sensitivities where solutions depend on linear algebra and/or recurrence relations. In summary, unless derivatives are quite complicated, manual differentiation is generally more computationally efficient than either the complex perturbation method or the forward method. In computing second derivatives, we expect the tables will be turned. To their credit, the forward and complex perturbation methods do not require formulating derivatives analytically in advance and are consequentially easier to implement.

Prediction error
In general, prediction error measures how well the first and second-order sensitivities capture the change in behavior of a model. Since we have previously shown that the various methods for computing differential sensitivity yield nearly the same results, prediction error is a good metric for determining the value of differential sensitivity in a particular model. We measure prediction error by the Euclidean norms Other norms, such as the ℓ 1 and ℓ 1 norms, yield similar results. In the ODE models, f(x) denotes a matrix trajectory so the Frobenius norm applies. To capture proportional prediction errors, we normalize all vector outputs by their length and all matrix outputs by the square of their length.
Prediction accuracy varies widely between models and even between parameters. As we expect however, second-order approximations are more accurate in prediction. Tables 8, 9 and 10 record prediction errors for each model. For the ODE systems, we see that stiffness highlights the added value of the second-order approximations. In the ROBER and CARRGO models, the second-order approximations have an order of magnitude less prediction error than the first-order approximations. However, stiffness does not appear to impact how the prediction errors grow over time. The ROBER and MCC models do not suffer from increased errors per time point after longer prediction intervals. In the stochastic SIR model, prediction error does not seem to be compounded at all; in fact, the error per value calculated decreases in the case of @M @Z . In the case of branching processes with many types N and large parameter sets, it is inadvisable to compare prediction accuracy across system sizes. However, we can conclude from these results that at least the prediction error does not compound as N increases. Furthermore, prediction accuracy for the branching process models appears to vary dramatically depending on the parameter in question.

Discussion
Our purpose throughout has been to demonstrate the ease and utility of incorporating differential sensitivity analysis in dynamical modeling. Because models are always approximate, and parameters are measured imprecisely, uncertainty plagues virtually all dynamical models. While improving models is incremental and domain specific, sensitivity analysis provides a handle on local parameter uncertainty across models.
Of the methods mentioned in this text, the adjoint method, forward method, and complex perturbation methods all require that the functions defining a model be differentiable in the underlying parameters. While the complex perturbation method has the additional requirement that these functions be complex analytic, it is the only method explored in this manuscript that can be extended to discrete stochastic models in addition to ODE systems. For the modeler who prefers a one size fits all approach, or who prefers to prioritize ease of implementation, we argue that the complex perturbation method should be the method of choice. In addition to its wide range of applicability, the complex perturbation method can be easily multi-threaded and requires only implementation of the component functions of the model. In contrast to the second-order complex perturbation method, forward differentiation slows dramatically in calculating a Hessian directly. It becomes competitive if one calculates the gradient of the gradient. The gradient of the gradient method is not always available natively and usually must be implemented separately as we have done in the current manuscript. Crucially, implementing a specialized forward mode method was possible due to the underlying automatic differentiation software's flexibility and support for composition.
In situations demanding computational speed, our results suggest that choosing a method tailored to a model may be pertinent. In the case of stochastic models, manually differentiating and applying the chain rule must be balanced against the complex perturbation method, which requires less effort up front but longer processing after the derivatives have been determined. For ODE systems models, forward mode is the most computationally efficient when multichunking is available. If multichunking is not available, then the complex perturbation method has comparable speed to the forward method when run with the same tolerance. In maximizing computational efficiency, it is important to note that the use of automatic differentiation tools may require more user input for algorithm selection or multi-threading implementation. Choice of software is critical as well; not all software packages with automatic forward differentiation support chunking as implemented in the ForwardDiff.jl package and that so dramatically improves the computational efficiency of this method.
There are additional challenges to computing model sensitivity that we do not address. For example, not all models use functions that are differentiable in their parameters. Additionally, models may be differentiable yet extremely stiff, in which case the computational time for each sensitivity method discussed here will suffer disproportionally as the number of parameters grows. Furthermore, assessing global parameter sensitivity is more challenging. It can be attacked by techniques such as Latin square hypercube sampling or Sobel quasi-random sampling, but these become infeasible in high dimensions [34]. Given the availability of appropriate software, differential sensitivity is computationally feasible, even for high-dimensional systems.
In the case of stochastic models, traditional methods require costly and inaccurate simulation over a bundle of parameter values. Differential sensitivity is often out of the question. Current automatic differentiation systems, such as PyTorch, Zygote and ForwardDiff, treat generated random numbers as constants, and thus are not reliable methods for use in calculating differential sensitivity of model outcomes that depend on these random variables. This limits the ability of researchers to understand a biological system and how it responds to parameter changes. If a system index such as a mean, variance, extinction probability, or extinction time can be computed by a reasonable algorithm, then differential parameter sensitivity analysis can be undertaken. We have indicated in a handful of examples how this can be accomplished.
In summary, across many models representative of computational biology, we have reached the following conclusions: a. Forward mode, adjoint, and complex perturbation sensitivity methods all converge to the same differential sensitivity values in non-stiff models, thus offering the same level of accuracy for all methods. For stiff models, forward mode and complex perturbation methods converge but adjoint sensitivity struggles and does not achieve convergence for realistic tolerance parameters.
b. Chunked forward mode automatic differentiation and forward mode sensitivity analysis tend to be the most computationally efficient on the tested models.
c. The complex perturbation methods described in this manuscript are competitive and often outperform the unchunked version of forward mode automatic differentiation, while being less sensitive to stiffness than the adjoint methods.
d. Shared memory multi-threading of the complex perturbation and forward mode automatic differentiation methods provides a performance gain but only in high-dimensional systems.
e. Forward mode automatic differentiation method requires that each step of a calculation is differentiable. This renders it unusable for calculating the derivative of ensemble means of discrete state models, such as birth-death processes. For these cases, the complex perturbation method outperforms manual differentiation.
f. The complex perturbation method is competitive with automatic differentiation methods in accuracy, is more straightforward to implement, and can be applied to a wider variety of methods.
These conclusions are tentative but supported by our limited number of biological case studies.
We note that the performance differences may change depending on the efficiency of the implementations. The Julia DifferentialEquations.jl library and its DiffEqSensitivity.jl package have been shown to be highly efficient, outperforming other libraries in both equation solving and derivative calculations in Python, MATLAB, C, and Fortran [19,33]. Details on the current state of performance can be found at https://github.com/SciML/SciMLBenchmarks.jl.
The automatic differentiation implementations in machine learning libraries optimize array operations much more than scalar operations. This can work to the detriment of forward mode AD. MATLAB or Python style vectorization improves the performance of forward mode AD sensitivity analysis by reducing interpreter overhead. Therefore, our conclusions serve as guidelines for the case where all implementations are well-optimized. For programming languages with high overheads or without compile-time optimization of the automatic differentiation passes, the balance in efficiency shifts more favorably towards the complex perturbation method.
One last point worth making is on the coding effort required by the various methods. Both automatic differentiation and the complex perturbation method have comparable accuracy when applied to systems of ODEs, with automatic differentiation having the advantage in speed when it is implemented with the additional level of parallelization provided by chunking. However, the complex perturbation method can easily be generalized to other kinds of objective functions and may be more straightforward to implement for those less sophisticated in computer science. While automatic differentiation is the basis of many large scientific packages, the code required for the complex perturbation methods is fully contained within this manuscript and is easily transferable to other programming languages with similar dispatching on complex numbers. This hard to measure benefit should not be ignored by practicing biologists who simply wish to quickly arrive at reasonably fast code.