Distilling identifiable and interpretable dynamic models from biological data

Mechanistic dynamical models allow us to study the behavior of complex biological systems. They can provide an objective and quantitative understanding that would be difficult to achieve through other means. However, the systematic development of these models is a non-trivial exercise and an open problem in computational biology. Currently, many research efforts are focused on model discovery, i.e. automating the development of interpretable models from data. One of the main frameworks is sparse regression, where the sparse identification of nonlinear dynamics (SINDy) algorithm and its variants have enjoyed great success. SINDy-PI is an extension which allows the discovery of rational nonlinear terms, thus enabling the identification of kinetic functions common in biochemical networks, such as Michaelis-Menten. SINDy-PI also pays special attention to the recovery of parsimonious models (Occam’s razor). Here we focus on biological models composed of sets of deterministic nonlinear ordinary differential equations. We present a methodology that, combined with SINDy-PI, allows the automatic discovery of structurally identifiable and observable models which are also mechanistically interpretable. The lack of structural identifiability and observability makes it impossible to uniquely infer parameter and state variables, which can compromise the usefulness of a model by distorting its mechanistic significance and hampering its ability to produce biological insights. We illustrate the performance of our method with six case studies. We find that, despite enforcing sparsity, SINDy-PI sometimes yields models that are unidentifiable. In these cases we show how our method transforms their equations in order to obtain a structurally identifiable and observable model which is also interpretable.


Introduction
Mathematical models are increasingly used to describe, monitor, analyze and predict the behavior of complex biological systems.One of the major benefits of using mathematical models to study biology is that they can provide an objective and quantitative understanding that would be difficult to achieve through any other means.In systems biology, dynamical models (typically sets of ordinary differential equations, ODEs) are widely used to provide mechanistic insights into the functioning of biological systems [1,2].
The use of dynamical systems theory originated in Newtonian mechanics is now pervasive in all the natural and engineering sciences [3].Dynamic models are highly versatile, enabling researchers to study complex biosystems from a range of different perspectives, such as (i) analyzing the effect of changes in conditions and scenarios different from those studied experimentally, (ii) guiding research by identifying key aspects that need to be further investigated, (iii) helping to generate new testable hypotheses, or (iv) guiding the design of interventions.However, the systematic development of mechanistic dynamic models is a non-trivial exercise.In the case of biological systems, the situation is particularly difficult due to the fact that we cannot rely on first principles in the same way as in e.g.physics.As a consequence, model development is one of the key open problems in mathematical biology [4].
Can we automate the development of mechanistic models?This question of model discovery (in the sense of symbolic reconstruction of equations) from data was already addressed by pioneering attempts in the field of artificial intelligence several decades ago [5][6][7].However, the data-driven automatic identification of nonlinear dynamic models has only been addressed more recently.In this area, several different statistical and machine learning frameworks have been considered, including symbolic regression [8,9], grammar-based methods [10,11], sparse regression [12], neural networks [13][14][15], Gaussian process regression [16,17] and Bayesian approaches [18][19][20].More detailed reviews can be found in [21][22][23][24][25].The sparse identification of nonlinear dynamics (SINDy) algorithm [12] has been particularly successful, and a number of extensions have been developed (see review in [26]).
In the case of biological systems, a large amount of research has been devoted to different classes of subproblems with different simplifying assumptions (such as e.g.static networks, non-mechanistic dynamic networks, linear dynamics, etc.), as reviewed by [27][28][29].In this work, we consider the more general problem of fully reconstructing interpretable (mechanistic and parameterized) nonlinear dynamic models from time-series data.Recently, several approaches using methods based on sparse regression, Bayesian identification or symbolic regression have appeared [18,[30][31][32][33][34][35][36].In this context, SINDy-PI [37] is an especially interesting parallel implicit version of SINDy because it allows the incorporation of implicit dynamics and rational nonlinear terms, thus enabling the discovery of kinetic functions (such as Michaelis-Menten) common in biochemical networks.
Many of these SINDy-based methods pay special attention to the recovery of parsimonious models, usually penalizing model complexity [38] or evaluating performance on a validation data-set [37].The objective is to find the simplest model which can explain the data, in agreement with the well known principle of Occam's razor.These strategies help to discard more complex models which would be indistinguishable (i.e. would explain the data equally well but adding spurious terms).Besides enforcing simplicity, a related key aspect in model discovery is ensuring structural identifiability and observability (SIO).The property of structural identifiability refers to the theoretical possibility of inferring the unknown parameters of a given model (assuming that its equations are known, except for the numerical values of the parameters) from observations of the model output, which typically consist of time-resolved measurements of its state variables, or of a subset of them [39].Likewise, observability is the possibility of inferring all the state variables of a model at a given time from future observations of a subset of them.Since lack of SIO makes it impossible to uniquely infer parameters and state variables, it can compromise the usefulness of the model [40][41][42][43][44][45][46].The analysis of these properties can be performed with symbolic computation tools [47], and numerical approaches have also been proposed for their study [48,49].However, to the best of our knowledge, ensuring SIO has not been considered in dynamic model discovery yet.
Here we present a methodology that ensures SIO in automatic model discovery in two possible scenarios: with and without prior knowledge.In both cases the end product is a dynamic model of a biological system consisting of (typically nonlinear) ODEs.The equations may contain rational terms, such as Michaelis-Menten kinetics, thus being suitable for the description of many biochemical processes.If there is no prior knowledge about the model structure, the methodology performs equation discovery with the SINDy-PI approach, and incorporates a SIO analysis as a post-processing stage.If there is prior knowledge (i.e.we have a candidate model), another SIO analysis is added as a pre-processing step.If the analyses reveal structural unidentifiabilities, a reparameterization step is carried out to ensure that the resulting model is fully identifiable and observable.Furthermore, equivalent model reformulations are generated to facilitate its interpretation in a mechanistic sense.
Using representative case studies, we illustrate how ignoring these structural properties can lead to wrong conclusions or poorly identified models.Although we demonstrate the use of the methodology with SINDy-PI, it is straightforward to apply it in combination with other automatic discovery methods.In particular, it could easily be adapted to future methods capable of considering partially-observed systems.
Overall, our study presents a novel and non-trivial integrated methodology to ensure that the discovered models are structurally identifiable and interpretable.To the best of our knowledge, this is the first study to address these questions in model discovery.Further, our method involves an original and non-obvious combination of algorithmic steps regarding structural identifiability analysis (SIO), reparameterization, reformulation and interpretability analysis.While the concepts of SIO and reparameterization draw on recent ideas developed in our group, the remaining steps and their integration represent fresh and innovative contributions to the field.

Methods
In this section we describe the methodology, which can be used in two different scenarios.Both of them entail performing model discovery (using SINDy-PI or a similar approach) and performing SIO analysis.If a model is structurally identifiable and observable, we say that it is FISPO (full input, state, and parameter observability).If the SIO analysis reveals that the model is not FISPO, our method suggests a reparameterization step.The two scenarios and their procedures are as follows: • Scenario (I): full model discovery from time-series data with no prior knowledge.Since we assume zero prior knowledge, we use SINDy-PI to discover a candidate model (CM).We then analyse its SIO.If it is not FISPO, we reparameterize it in order to obtain an equivalent model which is FISPO.Finally, we check if the model is interpretable, in the sense that it contains monomials and simple rational terms which belong to a dictionary of mechanistic kinetic terms.If not, we apply a symbolic reformulation step in order to render it interpretable.
• Scenario (II): model discovery from time-series data with prior knowledge.This scenario corresponds to situations where we seek model (in)validation and/or refinement.We assume good prior knowledge and time-series data, that is we are reasonably confident that our prior model (PM), which already represents the data quite well, is close to the 'true' one.
Here the motivation to use SINDy-PI is to compare this PM with an alternative candidate obtained via model discovery (CM).To this end we check the SIO of the PM, obtaining a reparameterized version if needed.In parallel, we apply SINDy-PI to the data, obtaining a CM, and we make sure that it is FISPO (using reparameterisation if not).If needed, we use model reformulation techniques to obtain interpretable versions of the CM and the PM.Finally, we perform a comparative analysis of these latter models.
An schematic diagram of our method considering these two scenarios is depicted in Fig 1 .In the remainder of this section we describe in detail each of the steps.

Automatic model discovery using sparse regression
We assume that the dynamical system is governed by classical reaction-rate nonlinear ordinary differential equations with the following form: where _ xðtÞ 2 R n is the state vector, p 2 R n p is the parameter vector, the function f (x(t), p) represents the dynamics, y(t) is the measurable output, and x 0 is the vector of initial conditions.SINDy [12] assumes a fully observed system, y(t) = x(t).In the reminder of this section, we will consider _ xðtÞ ¼ f ðxðtÞÞ to simplify the notation.SINDy also assumes that f(x(t)) can be expressed as the product of a suitable library function, Θ(x(t)), and a sparse vector ξ (indicating the active library terms), where each entry in the library function is a candidate term: YðxÞ ¼ ½y 1 ðxÞ y 2 ðxÞ y 3 ðxÞ . . .y p ðxÞ� ; ð1Þ By arranging the time-series data as a matrix, X = [x(t 1 ), . .., x(t m )], and its associated derivative matrix _ can be expressed as: where X corresponds to the sparse matrix of active terms.When the system includes rational terms, f (x) can be rewritten as: leading to the implicit problem formulation [31]: Eq 4 has a different kind of term in each side of the equality: the Left Hand Side (LHS), in which there are combinations of term involving the derivative data and the candidate library, and the Right Hand Side (RHS), in which we only have library terms.When f (x) includes rational terms, model complexity can be viewed as the number of terms in the LHS, as they will involve the denominator degree too.The generalized function library Θ(X) allows the inclusion of X and _ X.Under this consideration, the implicit problem formulation can be rewritten as: For example, if a model has two states and the chosen degree is 2, the function library for the first state, i.e. x 1 , will be: It should be noted that the function library for state x 2 will differ from Eq 6 as it will include _ x 2 instead of _ x 1 .The design of the library of candidate functions is a critical aspect of SINDy-PI.However, in the context of dynamic modelling of biochemical and biological phenomena, by including monomials up to an order of 5 or 6, we can accommodate the vast majority of nonlinear terms, such as e.g.mass-action kinetics, or feedback regulatory loops.Furthermore, common nonlinear rational terms such as Michaelis-Menten in enzyme kinetics, or Monod in microbial growth, can be inferred from such library due to the implicit nature of SINDy-PI.The case studies considered here cover a wide range of nonlinear terms, illustrating the capabilities of SINDy-PI in biological modelling.
The implicit form of Eq 5 admits the sparsest trivial solution X = 0.The implicit-SINDy algorithm [31] surmounted this issue by using nonconvex optimization to find the sparsest vector ξ in the null space.However this particular formulation is very sensitive towards noise levels, thereby affecting the robustness of the method.
In an effort to address this issue, Kaheman et al. [37] introduced SINDy-PI, a novel method that solves the problem using a sequence of convex relaxations of the non-convex optimization problem.By doing so, the algorithm can utilize the same noise levels as those employed in the original SINDy algorithm.The authors achieved this by assuming the knowledge of at least one term on the LHS, specifically of the form _ xDðxÞ.Consequently, Eq 5 can be rewritten as: where YðX; _ Xjy j ðX; _ XÞÞ denotes the library YðX; _ XÞ without the θ j element.SINDy-PI proceeds iteratively by examining each term within the library.In order to balance accuracy and complexity, the method performs Pareto optimal model selection.
To find the sparsest vector ξ, SINDy-PI considers the problem: SINDy-PI solves this non-convex optimization problem using a sequentially thresholded least-squares (STLSQ) approach.This method proceeds by iteratively solving the least squares term in the cost function, zeroing out elements of ξ that are below a certain threshold λ.This threshold must be fine-tuned to select the model that provides the best trade-off between accuracy and efficiency.To discover the model, SINDy-PI considers a finite set of λ values and proceeds by sweeping the library terms for each value of the threshold λ, obtaining a family of possible candidate models.Next, SINDy-PI performs model selection choosing the best tradeoff, i.e. the Pareto-optimal model.The Pareto front is obtained by considering a model complexity metric (such as the Akaike information criterion, AIC), and the score for each candidate model.Details of this process, illustrated with an example, are given in the Supporting Information.

Structural identifiability and observability analysis and reparameterization
Once a candidate model structure (CM) has been discovered, the next step is to analyse its structural identifiability and observability (SIO) [50].This test assesses the possibility of determining the values of the model parameters and state variables, respectively, from output measurements.These properties are structural (i.e. they depend only on the model equations) and hence they can be analysed a priori (i.e.before taking experimental measurements) using symbolic computation.They should not be confused with the so-called practical versions of these properties, which depend on the features of the experimental data and are analysed a posteriori, i.e. after performing measurements [39].
We can provide a mathematical definition of structural local identifiability (SLI) as follows.Let us denote by y(t, p) the output vector obtained with a parameter vector p at time t.(For fully observed systems y(t, p) = x(t, p), while for partially observed systems y typically consists of a subset of x.)We say that a parameter p i (which is the i th element of the parameter vector p 2 R n p ) is structurally locally identifiable (SLI) if, for almost any parameter vector p * 2 R n p , there is a neighbourhood N ðp * Þ such that: The definition of structural global identifiability is similar, but with the neighbourhood N ðp * Þ extending to the whole parameter space.In this paper we focus on SLI.There are several approaches for determining structural local identifiability and observability.We apply a differential geometry approach, which we explain briefly in these paragraphs.In this framework, parameters are treated as state variables that happen to be constant, i.e. the state vector is augmented with the parameters, x ¼ ðx T ; p T Þ T , and has dimension The augmented dynamic equations are _ x ¼ f ðxÞ; and the output function is y ¼ gðxÞ; omitting the dependence on time for ease of notation.
Thus, SLI is considered as a particular case of a more general property, observability, which describes the possibility of inferring the internal state of a model by observing its output vector -hence the use of the term FISPO for "full input, state, and parameter observability" (note that this concept also allows for the treatment of unknown inputs as additional state variables, a possibility that we will not consider in this paper).
We analyse SIO by building an observability-identifiability matrix and computing its rank.The matrix is built with Lie derivatives of the output function.The zero-order Lie derivative is L~f 0 gðxÞ ¼ gðxÞ; and for i � 1 the i-order Lie derivatives are obtained as: The observability-identifiability matrix O I is: A model is FISPO around a point x0 if the rank of its observability-identifiability matrix equals the number of its states and parameters, rank ðO I ðx 0 ÞÞ ¼ nx ¼ n x þ n p .If the rank is smaller, the model contains structurally unidentifiable parameters.By performing additional tests it is possible to determine which specific parameters are structurally identifiable, and which state variables are observable.
If a model is not FISPO, its calibration will almost surely produce wrong parameter estimates.Furthermore, structural unidentifiability is often linked with non-observability, in which case the simulations of some state variables will also be wrong.Thus, structural nonidentifiability and non-observability are undesirable features of a model's structure, which compromise its reliability as a source of biological insight.These features are caused by symmetries in the differential equations of the model that make its output invariant with respect to certain changes in their parameters and/or state variables [51][52][53].Said symmetries can be studied in the framework of Lie group theory.We say that a mapping of the form x * ¼ Xðx; εÞ; ð11Þ is a one-parameter Lie group of transformations (with ε being the parameter) if it has the following properties: it is smooth in x and analytic in ε, it satisfies the four group axioms (closure, associativity, and existence of an identity and an inverse), and the identity element can be chosen as ε = 0.The transformation above is also called a symmetry transformation, or a Lie symmetry.Examples of the simplest and possibly most common symmetries in biological modelling include the following: Translation: Scaling: Moebius: It is sometimes possible to remove or 'break' these symmetries by transforming the model equations via a suitable reparameterization.To this end, we first search for the symmetry transformations admitted by the model.If a model has such symmetries, it is overparameterized and therefore structurally unidentifiable.Then, we express the ε of those transformations in terms of other parameters, thus setting the value of one of the transformed parameters to one and removing it from the equations.The end result of the reparameterization is a FISPO model that has exactly the same dynamic behaviour as the original one.In previous work [54] we presented a methodology to perform such reparameterizations automatically, which has been integrated in the workflow described here.
In summary, if the SIO analysis of the CM reveals structural unidentifiability and/or nonobservability, our methodology applies a symmetry-breaking reparameterization that makes it FISPO.

Model reformulation for interpretability
The dynamic model obtained in the previous step supports the experimental data and is structurally identifiable and observable.However, the rational expressions in 3 may lack a clear biological interpretation.In the case of biological networks, we will need to reformulate expressions of the form N(x)/D(x) into terms that belong to a dictionary of interpretable terms.
Our model reformulation procedure seeks to transform it into simple monomials and rational terms that have a mapping with the dictionary of kinetic and regulatory terms compatible with the specific type of biological reaction network under study.Typically, this dictionary will include mass-action kinetics and simple rational functions (e.g.Michaelis-Menten for enzyme kinetics, or Hill for cooperative binding).However, care should be taken in order to ensure that the reformulation does not destroy identifiability and observability.Further, as shown in the case studies below, sometimes these rational terms can have high degrees, complicating model discovery.
Our reformulation procedure makes use of symbolic manipulation and involves the following steps: 1. Obtain the list of p non-trivial divisors of the denominator: 2. Obtain a family A of expressions composed of monomials (interpretable as e.g.mass-action kinetics), minimizing the number of rational terms and their degree, by obtaining the quotients and the residuals: 3. If any residuals r i (x) lack interpretability, factorize and simplify N(x) by means of the nested Horner form: obtaining a family of coupled and factorized equations with the same degree, but with monomials involving different state combinations: Thus, the Horner nested form gives different possible decompositions of the numerator.
Next, obtain a family B of reformulations by simplifying the rational terms using the divisors of Eq 15 and the Horner form of the numerator.
As an example, consider Eq 20 below, where we can decompose the fraction in the left into a monomial plus a simpler rational term as follows: 4. Match the monomials and simplified rational terms in families A and B with elements in the dictionary of canonical kinetic and regulatory expressions (or by inspection by a human domain expert), finding members that are fully interpretable.
5. Ensure that the resulting interpretable model is FISPO.If not, reparameterize and repeat until an interpretable and identifiable model is obtained.

Implementation
We implemented our methodology, as depicted in Fig 1, using Matlab and the Symbolic Math Toolbox, integrating the following components: • Sparse regression using SINDy-PI [37] with some modifications, as detailed in the Supporting Information.
• Reformulation for interpretability, implementing the algorithm described above using symbolic manipulation.
The resulting code is available at https://doi.org/10.5281/zenodo.7713047.In order to facilitate reproducibility and illustrate the results at each step of the workflow, we have included interactive notebooks (Matlab live scripts) and reports (in HTML format) for each of the case studies described below.More details are given in S1 File.

Results
Below, we apply our methodology to a set of challenging case studies.(Table 1 summarises their main features).These examples are presented in order to illustrate the performance of our method for a variety of situations of increasing complexity, from models without rational terms and fully identifiable and observable (FISPO) structure, to larger (in terms of number of parameters and states), non-FISPO models with more difficult non-linearities, as indicated by the different maximum degrees in their rational terms.
In order to illustrate all the steps and capabilities of our workflow, we consider Scenario (II) in all these examples.For each problem, a ground truth (GT) model is defined and subsequently considered as prior model (PM) for the sake of simplicity but without loss of generality.This GT model is used to generate training data sets in all the case studies.After confirming the identifiability and interpretability of the final discovered model M*, we also assess its structural, parametric and predictive accuracy.The predictive power is evaluated taking into account conditions different from those used for generating the training data.Details regarding the training data generation and the conditions to evaluate predictive accuracy are given in S1 File.

Lorenz system (Lorenz)
This case study involves the well-known Lorenz system [57], which is a classical example of dynamic model with chaotic behaviour.This model was previously used in [12] to demonstrate the original SINDy algorithm.The governing equations describe the dynamics of a fluid layer warmed from below and cooled from above: where x 1 is proportional to the rate of convection, x 2 to the horizontal temperature variation and x 3 to the vertical one.For certain values of parameters a, b, and c, the system exhibits chaotic dynamics.
We consider the ideal Scenario (II) case where the prior model (PM) is the same as the nominal (or ground truth, GT) model.We generate a synthetic training data set using the GT model and settings similar to [12] (details in Supporting Information).Following the workflow in Fig 1 , we perform structural and identifiability analysis and confirm that the PM is fully identifiable and observable (FISPO).We then apply SINDy-PI to the training data, obtaining the following candidate model (CM): Our algorithm then confirms that this inferred CM model is FISPO and interpretable (thus, it corresponds to M*).Further, it is fully equivalent to the expanded ground truth model in terms of structural, parametric and predictive accuracy, as shown in Fig 2 .In summary, in this case study we have the ideal situation where both the nominal and the inferred models are fully observable and identifiable.As we will see below, this situation might change as soon as we consider rational terms in the dynamics.

Competition between bacteria and the immune system (Immunity)
This model describes the influence of quorum sensing signaling molecules (QSSM) on the competition between bacteria and the immune system, as studied in [58].The following differential equations depict the dynamics for the concentrations of bacteria ( _ x 1 ) and immune cells ( _ x 2 ): where it is assumed that bacteria grow logistically at rate a with an effective carrying capacity of the environment given by parameter k, and that they are cleared by the immune system following a mass action term ex 1 x 2 .The rational term at the end of Eq 23a represents the modulation of QSSM in the competition between bacteria and the immune system.We consider Eqs 23a and 23b as the GT model.We consider Scenario (II) again, assuming that the prior model (PM) is the same as the ground truth (GT) or nominal model.Considering that the unknown parameters are a, k, e, β, γ, α, S, d and δ, the structural identifiability analysis of the PM indicates that two of the three parameters involved in the rational term are unidentifiable.Specifically, there is a scaling symmetry between γ and α, which is probably the most common type of symmetry in biological models [63].Our reparameterization step indicates that this issue can be solved by dividing the numerator and denominator by one of the unidentifiable parameters; for example, if we choose α, Eq 23a will be: where γ/α = γ*.Thus our new reference model will be the following PM*: Next, our workflow proceeds by applying SINDy-PI to a data set generated with the GT model, obtaining the following candidate model (CM): Interestingly, our method then finds that this CM is not FISPO due to three structurally unidentifiable parameters: p 4 , p 5 , p 6 .The reformulation step is then able to find structurally identifiable reformulations of the form: We chose j = 6, but the same result can be obtained with j = 4, 5. Denoting as p * j ¼ p j p 6 ; j ¼ 4; 5, the resulting dynamic system becomes identifiable.Eq 27 is re-arranged as: The resulting model is fully identifiable, but the rational term does not match the one in the GT describing the modulation of QSSM.However, the reformulation step in our workflow produces an interpretable form M* which is FISPO: Fig 3 illustrates the importance of ensuring structural identifiability and observability.The CM found by SINDy-PI is not FISPO, so there are other different parameter realizations producing exactly the same output, as shown by CM2.This means that if this CM structure is used for parameter identification, the estimated parameters will not be unique, i.e. there exist different parameterizations of CM in full agreement with the same output measurements.However, the reformulated model M* is FISPO, i.e. there is a unique set of parameter values compatible with the output.Finally, in Fig 4 we confirm the structural, parametric and predictive accuracy of M*.

Stress response in bacteria (Bacterial)
This model describes the stress response in Bacillus subtilis [59].It was used by Mangan et al [31] to illustrate how an implicit SINDy approach was able to infer biological nonlinear dynamics.Under nutrient limitation, the majority of B. subtilis cells switch to sporulation, but a small fraction switch to an alternative behaviour, the so called state of competence, in which they are capable of taking up extracellular DNA.This latter fraction might subsequently return to vegetative growth.Su ¨el et al [59] described the regulatory system of this mechanism using a dynamic model with two states.In dimensionless form, the ground truth (GT) model for this example is: where  kinetics where the exponent indicates the level of cooperativity (2 and 5, respectively).The last term in both equations represents the degradation of both ComK and ComS.We again consider PM = GT and check the identifiability of PM.Considering unknown parameters a 1 , a 2 , a 3 , b 1 and b 2 , the model is fully identifiable, thus PM* = GT.
Next, SINDy-PI is applied to the training data generated using GT, obtaining the following candidate model (CM): This CM has 16 parameters, p j , j = 1, . .., 16, and the SIO analysis reveals that all of them are non-identifiable, with the exception of p 1 .The reformulation step indicated that we can obtain an identifiable model with four scaling transformations, one per rational term, i.e. the second term in Eq 31a is scaled by p j , j 2 [2,3,4], and the third term by p k , k 2 [5, . .., 9].In Eq 31b the same strategy is applied for p l , l 2 [10,11,12], and p m , m 2 [13, . . ., 16].That is: Choosing j = 4, k = 7, l = 11, m = 14, the resulting structurally identifiable model is: where * denotes a reparameterized parameter.This reformulated model is now fully identifiable, but no longer directly interpretable: Eq 33a does not explicitly have the term involving the autoregulation of ComK.By means of the reformulation procedure, we are able to recover the autoregulation and degradation terms as in Eq 30a: This reformulated model M* is structurally identifiable and interpretable, and equivalent to the PM.

Microbial growth (Microbial)
This case study considers microbial growth in a batch reactor, as presented by [64] and later used by [60] to study identifiable reparameterizations of unidentifiable systems.The following model describes the dynamics of microbial and substrate concentrations assuming Monod kinetics (similar in functional form to Michaelis-Menten enzyme kinetics): where x 1 and x 2 represent the concentrations of microorganisms and growth-limiting substrate, respectively.The rational term in Eq 35a is the Monod kinetic term, where μ is the maximum growth velocity and K s the substrate concentration corresponding to 1 2 m.In Eq 35a, the same rational term appears scaled by γ (the yield coefficient) to represent the depletion of substrate.The last term in Eq 35a describes the death of microorganisms assuming first order kinetics where K d is the decay rate.We consider a prior model (PM) that matches the ground truth (GT) model, Eqs 35a and 35b.When the initial conditions are known and different from zero, our algorithm confirms that this PM is structurally identifiable.Next, our workflow discovers the following dynamics using SINDy-PI: Next, the FISPO step finds that only p 1 is identifiable, i.e. parameters p i , i = 2, . .., 7 are unidentifiable.The reformulation step finds that it is possible to find an identifiable form by scaling each rational term by the same unidentifiable parameter.For simplicity, we have chosen that Then, the resulting structurally identifiable model is: However, Eqs 37a and 37b are not directly interpretable because they do not contain the expected Monod kinetics terms explicitly.Next, the reformulation step finds an equivalent structure which is both interpretable and identifiable (M*): This inferred model (M*) is compared to the ground truth in Fig 6, confirming its structural, parametric and predictive accuracy.

Cell cycle in the colonic crypt (Crypt)
This example considers a cell population model describing the cell renewal cycle in the colonic crypt [61].This cycle is heavily regulated and the model was used to explain the rupture of homeostasis and the initiation of tumorigenesis.The equations describing the dynamics are: where the state variables represent the populations of stem cells (x 1 ), semi-differentiated cells (x 2 ), and fully-differentiated cells (x 3 ).Stem cells have first order kinetics for renewal (rate given parameter a 3 ), differentiation (parameter a 2 ), and death (parameter a 1 ).Semi-differentiated cells have similar renewal, differentiation and death kinetics (with parameters b i ), plus a source term due to the differentiation of stem cells.Fully differentiated cells are generated from semi-differentiated cells with first order rate b 2 and removed with a rate modulated by parameter g.The rational terms correspond to saturating feedback mechanism in the differentiation rates.We take the above model as GT, and PM = GT.Our algorithm finds that this PM is not structurally identifiable: it is not possible to uniquely infer a 1 , a 3 , b 1 and b 3 due to the presence of a translation symmetry.Next, the reformulation step finds a reparameterized prior model (PM*): where a * 3 ¼ a 3 À a 1 and b * 3 ¼ b 3 À a 1 .Next, SINDy-PI is applied to the training data, obtaining the following candidate model (CM): Considering p i , i = 1, ‥, 21 as unknown parameters, the FISPO algorithm indicates that only p 1 , p 2 , p 16 , p 17 and p 21 are structurally identifiable.The reformulation step finds the following structurally identifiable alternative: The above model is not directly interpretable, but the reformulation process is able to find the following interpretable and identifiable reformulation M*: This discovered model M* is fully equivalent to the identifiable version of the ground truth model in terms of structural, parametric and predictive accuracy, as shown in Fig 7 .This example reinforces the importance of checking the identifiability of both the ground truth and the inferred model.

Oscillations in yeast glycolysis (Glycolysis)
Glycolysis is the transformation (in a series of reactions catalyzed by enzymes) of glucose into smaller molecules to produce energy for the cell.In many cell types, glycolysis exhibits oscillations in the concentrations of many intermediate metabolites.This phenomena has been particularly well studied in yeast cells.Wolf and Heinrich [62] studied the oscillatory dynamics of a simplified reaction scheme for yeast glycolysis under anaerobic conditions, where alcoholic fermentation takes place, proposing the following mathematical description: where the state variables represent the concentrations in the cell of glucose (x 1 ), the pool of triose phosphates (x 2 ), 1,3-bisphosphoglycerate (x 3 ), pool of pyruvate and acetaldehyde (x 4 ), NADH (x 5 ), ATP (x 6 ), and x 7 represents the pool of pyruvate and acetaldehyde in the external solution.We consider here the same formulation and parameter values as in [31,37].We take the above as GT, and assume PM = GT.Considering all parameters as unknown (c i , i = 1, 2, 3; d i , i = 1, ‥, 4; e i , i = 1, . .., 4; f i , i = 1, . .., 5; g i , i = 1, 2; h i , i = 1, . .., 5 and j i , i = 1, 2, 3), our algorithm confirms that the model is structurally identifiable and observable, i.e.PM* = PM.This problem is quite challenging for SINDy-PI due to its large number of states and parameters, and the large degree in several terms, leading to a very large library of candidate functions (over 3000 terms).However, it is able to correctly recover the following candidate model (CM): This model has 29 inferred coefficients, p i , i = 1, . .., 29.Our algorithm analyzes their identifiability and finds that the parameters with indices i = 2, 3, 4, 7, 8, 9, 24, 25, 26 are unidentifiable.Next, the reformulation step finds possible reparameterizations by scaling the rational terms.for the last rational term.The end result is an interpretable and identifiable model M*:

Discussion
In recent years, innovations in numerical methods and machine learning have been combined to improve our ability to understand complex systems.Currently, the three main classes of methods to learn equations from data are symbolic regression [65], neural-network approaches [13], and library-based sparse regression [12].Recent reviews of these categories and their overlaps can be found in [66,67].
In particular, data-driven model discovery methods for nonlinear dynamic systems have seen very significant advancements [22,24].The field has seen growth in terms of both sophistication and the range of applications.The fundamental aim remains the same: to discern the underlying mathematical models that govern the behaviour of complex, possibly high-dimensional and nonlinear systems, using measurement data.In this study we have investigated certain aspects of automatic model discovery techniques to derive mechanistic models of biological systems from time-series data.Specifically, we have focused on possible structural deficiencies of their end result, the inferred model equations.As a reference method we have chosen SINDy-PI [37], a recent sparse regression-based methodology that is particularly suited for computational biology due to its ability to capture complex nonlinearities and rational terms.
However, it should be noted that our approach can be combined with other model discovery methods.In any case, SINDy-PI has several advantages over other model discovery methods.First, it is several orders of magnitude more robust to noise than previous approaches based on sparse regression.This means that it can learn implicit ordinary and partial differential equations and conservation laws from limited and noisy data.Second, it can discover models with very complex structure, including implicit dynamics and rational nonlinearities (such as e.g.Michaelis-Menten kinetics), which are common in biological applications.Third, it is still quite computationally efficient thanks to its parallel nature and the exploitation of a library of canonical nonlinear terms.Such a library is particularly attractive when modelling the dynamics of biological networks based on mechanistic assumptions, such as mass-action kinetics.
Since by design SINDy-PI enforces parsimonious models (with the lowest complexity to support the data), it usually produces interpretable equations with excellent predictive power.However, we have shown that sometimes these models lack structural identifiability, which means that using the discovered model structure for parameter estimation might give wrong estimates, compromising its usefulness and reliability.
To address this issue we have presented a methodology that, combined with SINDy-PI, facilitates the inference of identifiable and interpretable dynamic models.Our method integrates symbolic algorithms that analyse a model's structural identifiability and observability (SIO), reparameterize it to achieve SIO if needed, and reformulate it to make it biologically interpretable.We have illustrated its use in two scenarios, with and without prior knowledge, using six challenging case studies corresponding to different kinds of biological systems, including complex regulatory mechanisms.
Our results highlight additional challenges due to non-obvious issues in the relationship between model reformulation, identifiability and interpretability, and show how our approach is able to successfully surmount them.Importantly, our method is modular and can be easily integrated with other model discovery strategies.While we have demonstrated its application in combination with SINDy-PI, other methods could have been used as well.Furthermore, its calculations are entirely symbolic, i.e. they are not affected by numerical issues caused by insufficient or noisy data (which do however limit the application of the accompanying model discovery method).
Future work will be devoted to model discovery in partially observed systems, where the structural identifiability problem will surely be exacerbated, and observability issues-i.e. the impossibility of inferring some of the unmeasured state variables-are to be expected.It should be noted that, as a matter of fact, our methodology is applicable to partially observed systems in its present form.However, model discovery for such systems is still in its infancy (see the recent work by [68]), hence in this study we have considered fully observed systems.Another possible area of improvement is computational efficiency.While our pipeline can be applied to systems with several states and a few dozen parameters, as demonstrated with the Glycolysis example, scaling up to larger models is challenging.The main bottleneck is currently the model reparameterization step performed with AutoRepar, which involves symbolic computations that can be very memory-consuming.We are working on improving the efficiency of the algorithms in order to alleviate its computational cost.
Other important avenues of research which are currently being explored include (i) improved approaches for the design of the library of candidate functions [69], (ii) better incorporation of partial prior knowledge [70], and (iii) taking into account noisy and missing data, uncertainty quantification and applications to real-world data-sets [71][72][73].Since identifiability and observability play a major role in these scenarios, we believe that our methodology will be a useful tool in these explorations.

Fig 2 .
Fig 2. Lorenz case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Parameter accuracy: center, matching parametric ODEs for PM and M*.Predictive accuracy: on the right, time evolution of the different states (x 1 , x 2 and x 3 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g002

Fig 3 .
Fig 3. Immunity model.Structural unidentifiability in CM (unidentifiable parameters in red, identifiable parameters in blue) leads to the same output dynamics when different parameterizations are considered, as can be seen in CM2.In contrast, the reformulation M* is FISPO and therefore there is a unique set of parameters compatible with the output measurements.https://doi.org/10.1371/journal.pcbi.1011014.g003

x 1 and x 2
represent the concentration levels of the ComK and ComS proteins.The rational terms arise from time-scale separation assumptions about the regulation: an autoregulatory positive feedback loop of ComK plus and indirect negative feedback loop mediated by ComS.In Eq 30a, a 1 corresponds to the minimal rate of ComK production.The second term describes the autoregulation (via a positive feedback loop) of ComK activating its own production, where a 2 is the fully activated rate of ComK generation.The first term in Eq 30b describes the negative feedback loop regulating the repression of ComS, where b 1 is the maximum rate of ComS expression.Both the auto-activation of ComK and the repression of ComS follow Hill

Fig 4 .
Fig 4. Immunity case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Parameter accuracy: center, matching parametric ODEs for PM and M*.Predictive accuracy: on the right, time evolution of the different states (x 1 and x 2 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g004 Fig 5 shows the assessment of the structural, parametric and predictive accuracy of the inferred model M*.

Fig 5 .
Fig 5. Bacterial case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Parameter accuracy: center, matching parametric ODEs for PM and M*.Predictive accuracy: on the right, time evolution of the different states (x 1 and x 2 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g005

Fig 6 .
Fig 6.Microbial case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Parameter accuracy: center, matching parametric ODEs for PM and M*.On the right, predictive accuracy: time evolution of the different states (x 1 and x 2 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g006

Fig 7 .
Fig 7. Crypt case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Parameter accuracy: center, matching parametric ODEs for PM and M*.Predictive accuracy: on the right, time evolution of the different states (x 1 , x 2 and x 3 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g007

Fig 8
Fig 8 illustrates the excellent structural, parametric and predictive accuracy of the inferred model.

Fig 8 .
Fig 8. Yeast-Glycolysis case study.Structural accuracy: on the left, active terms in ξ (non-zero terms of the prior model PM in black, and of the inferred model M* in green).Due to the large number of terms in ξ, the candidate functions are not shown.Parameter accuracy: center, matching parametric ODEs for PM and M*.Predictive accuracy: on the right, time evolution of the different states (x 1 , x 2 , x 3 , x 4 , x 5 , x 6 and x 7 ) of the PM and M* models.https://doi.org/10.1371/journal.pcbi.1011014.g008

Fig 1. Workflow of the methodology. Scenario (I) (solid black lines only): data
-driven full model discovery from (time-series) data with no prior knowledge.We apply SINDy-PI and test the SIO of the discovered candidate model (CM).If the CM is not FISPO, we reparameterize it.Next, we check if the model is interpretable; if not, we reformulate it via symbolic manipulation.The result is a FISPO interpretable model, M*.Scenario (II) (solid black lines + dashed, dark orange lines in the lower part): model discovery from (time-series) data with good prior knowledge.In this scenario we seek model (in)validation and/or refinement.We have a prior model (PM) which we want to compare with an alternative candidate discovered from data (CM).To this end, we check the SIO of the PM and reparameterize if needed.In parallel, we apply SINDy-PI to the data to obtain a CM, and we make sure it is FISPO (using reparameterisation if not).Then, we use model reformulation techniques to ensure interpretable versions (M* and PM*) if needed.Lastly, we perform a comparative analysis. https://doi.org/10.1371/journal.pcbi.1011014.g001

Table 1 . Main features of the case studies: Relevant references and main characteristics of the models considered in the case studies.
The fourth and fifth rows show the maximum degree of N(x) and D(x) in Eq 4. The last row indicates if the original (ground truth, GT) model is fully identifiable and observable (FISPO).
That is, considering the first rational term scaled by p 4 , then p * 2 ¼ p 2 p 4 and p * 3 ¼ p 3 p 4 ; scaling the second term with p 9 produces p * 8 ¼ p 8 p 9 and p * 7 ¼ p 7 p 9 ; and using p 25 yields p * 24 ¼ p 24 p 25 and p * 26 ¼ p 26