Systems biology informed deep learning for inferring parameters and hidden dynamics

Mathematical models of biological reactions at the system level lead to a set of ordinary differential equations with many unknown parameters that need to be inferred using relatively few experimental measurements. Having a reliable and robust algorithm for parameter inference and prediction of the hidden dynamics has been one of the core subjects in systems biology, and is the focus of this study. We have developed a new systems-biology-informed deep learning algorithm that incorporates the system of ordinary differential equations into the neural networks. Enforcing these equations effectively adds constraints to the optimization procedure, which manifest themselves as an imposed structure on the observational data. Using a few scattered and noisy measurements, we are able to infer the dynamics of unobserved species, external forcing, and the unknown model parameters. We have successfully tested the algorithm on three different benchmark problems.


Introduction
Systems biology aims at a system-level understanding of biological systems, which is a holistic approach to deciphering the complexity of biological systems [1]. To understand biological systems, we must understand the structures of the systems (both their components and the structural relationships) and their dynamics [2]. The dynamics of a biological system is usually modeled by a system of ordinary differential equations (ODEs), which describes the time evolution of the concentrations of chemical and molecular species in the system. Once the pathway structure of the chemical reactions is known, the ODEs can be derived using kinetic laws, e.g., the law of mass action or the Michaelis-Menten kinetics [3]. System-level biological models usually introduce unknown parameters that must be estimated accurately and efficiently. Thus, one central challenge in computational modeling of these systems is the estimation of model parameters (e.g., rate constants or initial concentrations) and the prediction of model dynamics (e.g., the time evolution of experimentally unobserved concentrations). Hence, considerable attention has been devoted to the problem of parameter estimation in the systems biology community. In particular, extensive research has been conducted on the applications of different optimization techniques, such as linear and nonlinear least-squares fitting [4], genetic algorithms [5], evolutionary computation [6], and more [7]. Considerable interest has also been raised by Bayesian methods [8,9], which can extract information from noisy data. The main advantage of Bayesian methods is the ability to infer the whole probability distributions of the unknown parameters, rather than just a point estimate. More recently, parameter estimation for computational biology models has been tackled by algorithms in the framework of control theory.
These algorithms were originally developed for the problem of estimating the time evolution of the unobserved components of the state of a dynamical system. In this context, extended Kalman filtering [10], unscented Kalman filtering [11], and ensemble Kalman methods [12] have been applied as well. In addition, different methods have also been developed to address the issue of hidden variables and dynamics [13,14], but in their examples the number of observable variables is about one half of the total number of variables, whereas, as we will show in the results below, our proposed method requires fewer observable variables (e.g., one out of eight in the cell apoptosis model) to correctly infer the unknown parameters and predict all unobserved variables.
Due to technical limitations, however, biological reaction networks are often only partially observable. Usually, experimental data are insufficient considering the size of the model, which results in parameters that are non-identifiable [15] or only identifiable within confidence intervals (see more details in [16]). Furthermore, a large class of models in systems biology are sensitive to parameter values that are distributed over many orders of magnitude. Such sloppiness is also a factor that makes parameter estimation more difficult [17]. In the process of parameter inference, two issues accounting for a system's (non-)identifiability have to be considered: structural identifiability, which is related to the model structure independently of the experimental data [18,19]; and practical identifiability, which is related to the amount and quality of the measured data. A priori structural identifiability can be used to address the question of unique estimation of the unknown parameters based on the postulated model. However, a parameter that is structurally identifiable may still be practically non-identifiable when the model is exact but the measurements are noisy or sparse [20].
In this work, we introduce a new deep learning [21] method, systems-informed neural networks, based on the method of physics-informed neural networks [22,23], to infer the hidden dynamics of experimentally unobserved species as well as the unknown parameters in the system of equations. By incorporating the system of ODEs into the neural networks (through adding the residuals of the equations to the loss function), we effectively add constraints to the optimization algorithm, which makes the method robust to measurement noise and to few, scattered observations. In addition, since system-level biological models are typically large, our algorithm is designed to be computationally scalable and feasible, and its output is interpretable even though it depends on a high-dimensional parameter space.

Materials and methods
Throughout this paper, we assume that the systems biological process can be modeled by a system of ordinary differential equations (ODEs) of the following form:

dx/dt = f(x, t; p),    x(T_0) = x_0,    (1a)

where the state vector x = (x_1, x_2, ..., x_S) represents the concentrations of S species, and p = (p_1, p_2, ..., p_K) are the K parameters of the model, which remain to be determined. Hence, the system of ODEs will be identified once p is known. The measurements are modeled as

y = h(x) + ε,    (1b)

where y = (y_1, y_2, ..., y_M) are the M measurable signals (consistent with the ODE system), which we can measure experimentally and which could possibly be contaminated with a white noise ε of Gaussian type with zero mean and standard deviation σ. The output function h is determined by the design of the experiments that are used for parameter inference. While h could, in general, be any function, in most models it is assumed to be a linear selection of M ≤ S species:

y_m = x_{s_m} + ε_m,    m = 1, 2, ..., M,    (1c)

i.e., y_1, y_2, ..., y_M are the noisy measurements of the species x_{s_1}, x_{s_2}, ..., x_{s_M} among all S species.
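As a concrete, hypothetical instance of Eq (1), the following sketch defines a two-species mass-action system dx/dt = f(x, t; p) and a linear, noisy observation of a single species (M = 1 < S = 2); the species, rate constants, and noise level are illustrative and not taken from any of the benchmark models.

```python
import numpy as np

# Hypothetical two-species system: x1 decays at rate p1,
# x2 is produced from x1 and decays at rate p2.
def f(x, t, p):
    p1, p2 = p
    dx1 = -p1 * x[0]
    dx2 = p1 * x[0] - p2 * x[1]
    return np.array([dx1, dx2])

def h(x, sigma, rng):
    # Linear observation (Eq (1c)): observe only species x1,
    # corrupted by zero-mean Gaussian noise with standard deviation sigma.
    return x[0] + sigma * rng.normal()

# Forward-Euler integration to generate a synthetic noisy trajectory.
rng = np.random.default_rng(0)
x, dt, p = np.array([1.0, 0.0]), 0.01, (0.5, 0.2)
ys = []
for n in range(1000):
    x = x + dt * f(x, n * dt, p)
    ys.append(h(x, sigma=0.05, rng=rng))
```

Trajectories like `ys` play the role of the scattered observations used later to train the network.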

Systems-informed neural networks and parameter inference
Based on the method of physics-informed neural networks proposed in [22], we introduce a deep learning framework that is informed by the systems biology equations that describe the kinetic pathways (Eq (1a)). A neural network with parameters θ takes time t as the input and outputs a vector of the state variables x̂(t; θ) = (x̂_1(t; θ), x̂_2(t; θ), ..., x̂_S(t; θ)) as a surrogate of the ODE solution x(t) (Fig 1). In our network, in addition to the standard layers, e.g., the fully-connected layer, we add three extra layers to make the network training easier, described as follows.
• Input-scaling layer. Because the ODE system could have a large time domain, i.e., the input time t could vary by orders of magnitude, we first apply a linear scaling function to t, i.e., t̃ = t/T, such that t̃ ~ O(1). We can simply choose T as the maximum value of the time domain.
• Feature layer. In many models, the ODE solution may have a certain pattern, e.g., periodicity in the yeast glycolysis model and fast decay in the cell apoptosis model, and thus it is beneficial for network training to construct a feature layer according to these patterns. Specifically, we employ the L functions e_1(·), e_2(·), ..., e_L(·) to construct the L features e_1(t̃), e_2(t̃), ..., e_L(t̃). The choice of the features is problem dependent, and we can select the features based on the available observations. For example, if the observed dynamics has certain periodicity (e.g., the yeast glycolysis model), then we can use sin(kt) as the features, where k is selected based on the period of the observed dynamics; if the observed dynamics decays fast (e.g., the cell apoptosis model), then we can use e^{−kt} as a feature. However, if there is no clear pattern observed, the feature layer can also be removed.
• Output-scaling layer. Because the outputs x̂_1, x̂_2, ..., x̂_S may have different magnitudes, similar to the input-scaling layer, we add another scaling layer to transform the outputs of the last fully-connected layer x̃_1, x̃_2, ..., x̃_S (of order one) to x̂_1, x̂_2, ..., x̂_S, i.e.,

x̂_s = k_s x̃_s,    s = 1, 2, ..., S.    (2)

Here, k_1, k_2, ..., k_S are chosen as the magnitudes of the mean values of the ODE solutions x_1, x_2, ..., x_S, respectively.
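The three extra layers can be sketched as a plain NumPy forward pass. The layer sizes, the chosen features, the scaling constants T and k_s, and the random weights below are illustrative assumptions, not the architectures listed in Table 1.

```python
import numpy as np

T = 10.0                          # assumed maximum time of the domain
k = np.array([1.0, 0.1, 100.0])   # assumed magnitudes of the S = 3 species

def swish(z):
    # Swish(x) = x * sigmoid(x), the activation used in this paper
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16)); b1 = np.zeros(16)   # features -> hidden
W2 = rng.normal(size=(16, 3)); b2 = np.zeros(3)    # hidden -> order-one outputs

def x_hat(t):
    t_scaled = t / T                                    # input-scaling layer
    e = np.array([np.sin(2 * np.pi * t_scaled),         # feature layer
                  t_scaled])                            # (periodic example)
    hidden = swish(e @ W1 + b1)                         # fully-connected layer
    x_tilde = hidden @ W2 + b2                          # outputs of order one
    return k * x_tilde                                  # output-scaling layer

out = x_hat(2.5)
```

In practice the hidden stack is deeper and the weights are trained, but the data flow is as above.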
The next key step is to constrain the neural network to satisfy both the scattered observations of y and the ODE system (Eq (1a)). This is realized by constructing the loss function from terms corresponding to the observations and to the ODE system. Specifically, let us assume that we have the measurements of y_1, y_2, ..., y_M at the times t_1, t_2, ..., t_{N_data}, and that we enforce the network to satisfy the ODE system at the time points t_1, t_2, ..., t_{N_ode}. We note that the times t_1, ..., t_{N_data} and t_1, ..., t_{N_ode} are not necessarily on a uniform grid, and they could be chosen at random. Then, the total loss is defined as a function of both θ and p:

L(θ, p) = L_data(θ) + L_ode(θ, p) + L_aux(θ),    (3)

where

L_data(θ) = Σ_{m=1}^{M} w_m^data (1/N_data) Σ_{n=1}^{N_data} (y_m(t_n) − x̂_{s_m}(t_n; θ))²,    (4)

L_ode(θ, p) = Σ_{s=1}^{S} w_s^ode (1/N_ode) Σ_{n=1}^{N_ode} (dx̂_s/dt|_{t_n} − f_s(x̂(t_n; θ), t_n; p))²,    (5)

L_aux(θ) = Σ_{s=1}^{S} w_s^aux [(x_s(T_0) − x̂_s(T_0; θ))² + (x_s(T_1) − x̂_s(T_1; θ))²].    (6)

L_data is associated with the M sets of observations y given by Eq (1c), while L_ode enforces the structure imposed by the system of ODEs given in Eq (1a). We employ automatic differentiation (AD) to analytically compute the derivative dx̂_s/dt|_{t_n} in L_ode (see more details of AD in [23]). The third, auxiliary loss term L_aux is introduced as an additional source of information for the system identification, and involves two time instants T_0 and T_1. It is essentially a component of the data loss; however, we prefer to separate it from the data loss, since in the auxiliary loss data are given for all state variables at these two time instants. We note that L_data and L_aux measure the discrepancy between the network and the measurements, and thus are supervised losses, while L_ode is based on the ODE system, and thus is unsupervised. In the last step, we infer the neural network parameters θ as well as the unknown parameters of the ODEs p simultaneously by minimizing the loss function via gradient-based optimizers, such as the Adam optimizer [24]:

(θ*, p*) = argmin_{θ, p} L(θ, p).    (7)

We note that our proposed method is different from meta-modeling [25,26], as we optimize θ and p simultaneously.
The weight coefficients w_1^data, ..., w_M^data in Eq (4), w_1^ode, ..., w_S^ode in Eq (5), and w_1^aux, ..., w_S^aux in Eq (6) are used to balance the M + 2S loss terms. In this study, we manually select these weight coefficients such that the weighted losses are of the same order of magnitude during network training. We note that this guideline makes the weight selection much easier, although there are many weights to be determined. These weights may also be chosen automatically, e.g., by the method proposed in [27]. In this study, the time instants t_1, t_2, ..., t_{N_data} for the observations are chosen randomly in the time domain, while the time instants t_1, t_2, ..., t_{N_ode} used to enforce the ODEs are chosen on an equispaced grid. Additionally, in the auxiliary loss function, the first set of data comprises the initial conditions at time T_0 for the state variables. The second set includes the values of the state variables at an arbitrary time instant T_1 within the training time window (not too close to T_0); in this work, we consider the midpoint time for the cell apoptosis model, and the final time instant for the yeast glycolysis and ultradian endocrine models. If data are available at another time point, that point can be used instead.
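For a one-species toy problem dx/dt = −px, the three loss terms of Eqs (4)-(6) can be computed as follows. The closed-form surrogate and the finite-difference derivative are stand-ins for the neural network and automatic differentiation, and all weights are set to one; everything here is illustrative.

```python
import numpy as np

p_guess = 0.4                                # current estimate of the ODE parameter
x_hat = lambda t: np.exp(-0.5 * t)           # stand-in surrogate (true solution)
dxdt = lambda t, eps=1e-6: (x_hat(t + eps) - x_hat(t - eps)) / (2 * eps)

t_data = np.random.default_rng(0).uniform(0, 5, 20)   # random observation times
y = np.exp(-0.5 * t_data)                             # noiseless observations
t_ode = np.linspace(0, 5, 50)                         # equispaced ODE points

# Data loss (Eq (4)): misfit at the observation times.
L_data = np.mean((y - x_hat(t_data)) ** 2)
# ODE loss (Eq (5)): residual of dx/dt + p*x = 0 at the ODE points.
L_ode = np.mean((dxdt(t_ode) + p_guess * x_hat(t_ode)) ** 2)
# Auxiliary loss (Eq (6)): all states at the two instants T0 and T1.
T0, T1 = 0.0, 5.0
L_aux = (1.0 - x_hat(T0)) ** 2 + (np.exp(-0.5 * T1) - x_hat(T1)) ** 2
L_total = L_data + L_ode + L_aux
```

With the surrogate equal to the true solution, L_data and L_aux vanish, while L_ode stays positive because p_guess = 0.4 differs from the true rate 0.5; minimizing over p drives the residual to zero.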

Analysis of system's identifiability
A parameter p_i is identifiable if the confidence interval of its estimate p̂_i is finite. In systems identification problems, two different forms of identifiability, namely structural and practical, are typically encountered. Structural non-identifiability arises from a redundant parameterization in the formal solution of y, due to an insufficient mapping h of the internal states x to the observables y in Eq (1) [15]. A priori structural identifiability has been studied extensively, e.g., using power series expansions [28] and differential algebraic methods [29], yet mostly limited to linear models, as the problem is particularly difficult for nonlinear dynamical systems. Furthermore, practical non-identifiability cannot be detected with these methods, since experimental data are disregarded.
A parameter that is structurally identifiable may still be practically non-identifiable. Practical non-identifiability is intimately related to the amount and quality of the measured data, and manifests itself in an infinite confidence interval. Different methods have been proposed to estimate the confidence intervals of the parameters, such as a local approximation of the Fisher Information Matrix (FIM) [20] and the bootstrapping approach [30]. Another approach is to quantify the sensitivity of the system dynamics to variations in its parameters using a probabilistic framework [31]. For identifiability analysis, we primarily use the FIM method, which is detailed in S1 Text.
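A minimal sketch of the FIM-based analysis, for a hypothetical observable y(t) = p1·exp(−p2·t) whose sensitivities are available analytically; the general method (Eq (S3) in S1 Text) follows the same pattern with numerically computed sensitivities.

```python
import numpy as np

# FIM = S^T S / sigma^2, where S[n, i] = dy(t_n)/dp_i. The covariance of
# the estimates is approximated by FIM^{-1}: its diagonal gives lower
# bounds on the variances, and its normalization gives the correlations R.
p1, p2, sigma = 2.0, 0.5, 0.1          # illustrative values
t = np.linspace(0, 5, 40)
S = np.column_stack([np.exp(-p2 * t),             # dy/dp1
                     -p1 * t * np.exp(-p2 * t)])  # dy/dp2
FIM = S.T @ S / sigma**2
C = np.linalg.inv(FIM)                 # approximate covariance of (p1, p2)
std = np.sqrt(np.diag(C))              # lower bounds on the standard deviations
R = C / np.outer(std, std)             # correlation matrix
```

A near-singular FIM (|R_ij| close to 1) would flag the corresponding parameter pair as practically non-identifiable.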

Implementation
The algorithm is implemented in Python using the open-source library DeepXDE [23]. The width and depth of the neural networks (listed in Table 1) depend on the size of the system of equations and the complexity of the dynamics. We use the Swish function, Swish(x) = x/(1 + e^{−x}) [32], as the activation function σ shown in Fig 1, and the feature layer is listed in Table 2.
For the training, we use an Adam optimizer [24] with default hyperparameters and a learning rate of 10^{−3}, where the training is performed using the full batch of data. As the total loss consists of two supervised losses and one unsupervised loss, we perform the training using the following two-stage strategy:

Step 1. Considering that supervised training is usually easier than unsupervised training, we first train the network using the two supervised losses L_data and L_aux for some iterations, such that the network can quickly match the observed data points.
Step 2. We further train the network using all the three losses.
We found empirically that this two-stage training strategy speeds up the network convergence. The number of iterations for each stage is listed in Table 1.
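The two-stage strategy can be illustrated on a toy inference problem for dx/dt = −px with true p = 0.5, using a one-parameter closed-form surrogate x̂(t) = exp(−at) in place of the network and plain gradient descent with hand-derived gradients in place of Adam with backpropagation; all values are illustrative.

```python
import numpy as np

t = np.linspace(0, 4, 50)
y = np.exp(-0.5 * t)              # noiseless observations of x

a, p, lr = 1.5, 0.1, 0.05         # surrogate parameter, ODE parameter, step size

# Step 1: supervised data loss only -- fit the surrogate to the data.
for it in range(3000):
    r = np.exp(-a * t) - y
    a -= lr * np.mean(2 * r * (-t) * np.exp(-a * t))

# Step 2: add the ODE residual loss and optimize a and p jointly.
for it in range(3000):
    x = np.exp(-a * t)
    r_data = x - y
    r_ode = -a * x + p * x        # d x_hat/dt - f(x_hat; p) = (p - a) * x
    grad_a = (np.mean(2 * r_data * (-t) * x)
              + np.mean(2 * r_ode * (-x + (a - p) * t * x)))
    grad_p = np.mean(2 * r_ode * x)
    a -= lr * grad_a
    p -= lr * grad_p
```

Step 1 pins the surrogate to the observations (a → 0.5), after which the ODE residual in Step 2 pulls the unknown parameter p to its true value.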

Yeast glycolysis model
The model of oscillations in yeast glycolysis [33] has become a standard benchmark problem for systems biology inference [34,35], as it represents complex nonlinear dynamics typical of biological systems. We use it here to study the performance of our deep learning algorithm for parsimonious parameter inference with only two observables. The system of ODEs for this model as well as the target parameter values and the initial conditions are included in S2 Text. To represent experimental noise, we corrupt the observation data by a Gaussian noise with zero mean and standard deviation σ* = cμ, where μ is the standard deviation of each observable over the observation time window and c ranges from 0 to 0.1. We start by inferring the dynamics using noiseless observations of two species, S_5 (the concentration of NADH) and S_6 (the concentration of ATP), only. These two species are the minimum number of observables we can use to effectively infer all the parameters in the model. S1 Fig shows the noiseless data generated synthetically by solving the system of ODEs in S2 Text with the parameters listed in S1 Table. We sample data points within the time frame of 0 to 10 minutes at random and use them for training the neural networks, where the neural network is informed by the governing ODEs of the yeast model as explained above. S2 Fig shows the inferred dynamics for all the species predicted by the systems-informed neural networks, plotted against the exact dynamics generated by solving the system of ODEs. We observe excellent agreement between the inferred and exact dynamics within the training time window. The neural network learns the input data given by the scattered observations (shown by symbols in S2 Fig) and is able to infer the dynamics of the other species due to the constraints imposed by the system of ODEs.
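The noise model σ* = cμ can be reproduced as follows; the periodic trajectory is a stand-in for the solved glycolysis dynamics, not the actual model output.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
S5 = 1.0 + 0.3 * np.sin(2 * np.pi * t)   # stand-in periodic observable

c = 0.1                                  # 10% noise level
mu = S5.std()                            # spread of the observable over the window
S5_noisy = S5 + c * mu * rng.normal(size=S5.shape)
```

Scaling the noise by each observable's own standard deviation keeps the corruption comparable across species with very different magnitudes.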
Next, we verify the robustness of the algorithm to noise. For that purpose, we introduce additive Gaussian noise with noise level c = 10% to the observational data. The input training data are shown in Fig 2 for the same species (S_5 and S_6) as the observables, where, similar to the previous test, we sample random scattered data points in time. Results for the inferred dynamics are shown in Fig 3. The agreement between the inferred and exact dynamics is excellent considering the relatively high level of noise in the training data. Our results show that the equations enforced in the loss function L_ode act as constraints on the neural network that effectively prevent overfitting to the noisy data. One advantage of encoding the equations is their regularization effect, which removes the need for any additional L1 or L2 regularization.
Our main objective in this work, however, is to infer the unknown model parameters p. This can be achieved simply by training the neural network for its parameters θ as well as the model parameters using backpropagation. The results for the inferred model parameters, along with their target values, are given in Table 3 for both test cases (i.e., with and without noise in the observations). The first thing to note is that the parameters can be identified within a confidence interval. Estimation of the confidence intervals a priori is the subject of structural identifiability analysis, which is not in the scope of this work. Second, practical identifiability analysis can be performed to identify the practically non-identifiable parameters based on the quality of the measurements and the level of the noise. We have performed local sensitivity analysis by constructing the Fisher Information Matrix (FIM) (S1 Text) and the correlation matrix R derived from the FIM.
The inferred parameters from both noiseless and noisy observations are in good agreement with their target values. The most significant difference (close to 30%) can be seen for the parameter N (the total concentration of NAD+ and NADH). However, given that the glycolysis system (S2 Text) is identifiable (cf. [33,35] and S3 Fig), and that the inferred dynamics shown in S2 Fig and Fig 3 match the exact dynamics very well, the inferred parameters are valid. We used Eq. (S3) in S1 Text to estimate the standard deviations of the model parameters. The σ_i estimates for the parameters are lower bounds, and thus may not be informative here. Further, these estimates are derived based on a local sensitivity analysis. A structural/practical identifiability analysis [15] or a bootstrapping approach to obtain the parameter confidence intervals is probably more relevant here. Using the FIM, we are able to construct the correlation matrix R for the parameters. Nearly perfect correlations (|R_ij| ≈ 1) suggest that the FIM is singular and that the correlated parameters may not be practically identifiable. For the glycolysis model, as shown in S3 Fig, no perfect correlations can be found in R (except for the anti-diagonal elements), which suggests that the model described by S2 Text is practically identifiable. In the example above, we considered 500 data measurements, but in systems biology we often lack the ability to observe system dynamics on a fine time scale. To investigate the performance of our method on a set of sparse data points, we used only 200 data points, and still obtained good inferred dynamics for the species S_1, S_2, S_5 and S_6 (S4 Fig).

Cell apoptosis model
Although the glycolysis model is highly nonlinear and difficult to learn, we have shown that its parameters can be identified. To investigate the performance of our algorithm for non-identifiable systems, we study a cell apoptosis model, which is a core sub-network of the signal transduction cascade regulating the programmed cell death-against-survival phenotypic decision [36]. The equations defining the cell apoptosis model and the values of the rate constants for the model are taken from [36] and listed in S2 Table. Although the model is derived using simple mass-action kinetics and its dynamics is easy to learn with our algorithm, most of the parameters are not identifiable due to both structural and practical non-identifiability. To infer the dynamics of this model, we only use 120 random samples of measurements collected for one observable (x_4), where we assume that the measurements are corrupted by a zero-mean Gaussian noise with 5% standard deviation, as shown in Fig 4. Furthermore, it is possible to use different initial conditions in order to produce different cell survival outcomes. The initial conditions for all the species are given in S3 Text, while we use x_7(0) = 2.9 × 10^4 (molecules/cell) to model cell survival (Fig 4, top) and x_7(0) = 2.9 × 10^3 (molecules/cell) to model cell death (Fig 4, bottom).

Table 3. Parameter values for the yeast glycolysis model and the corresponding inferred values. The standard deviations are estimated using Eq. (S3) in S1 Text as practical non-identifiability analysis based on the FIM.

Parameter | Target value | Inferred value (noiseless observations) | Inferred value (noisy observations) | Standard deviation
Using the systems-informed neural networks and the noisy input data, we are able to infer most of the dynamics (including x_3, x_4, x_6, x_7 and x_8) of the system, as shown in S5 Fig and Fig 5. These results show good agreement between the inferred and exact dynamics of the cell survival/apoptosis models using only one observable.
We report the inferred parameters for the cell apoptosis model in Table 4, where we have used noisy observations on x 4 under two scenarios of cell death and survival for comparison. The results show that four parameters (k 1 , k d2 , k d4 and k d6 ) can be identified by our proposed method with relatively high accuracy, as indicated by the check mark (✓) in the last column of Table 4. We observe that the standard deviations for most of the parameter estimates are orders of magnitude larger than their target values, and thus the standard deviations estimated using the FIM are not informative in the practical identifiability analysis. The only informative standard deviation is for k d6 (indicated by the symbol †), and k d6 is inferred with relatively high accuracy by our method.
To obtain a better picture of the practical identifiability analysis, we have plotted the correlation matrix R in S6 Fig. We observe perfect correlations (|R_ij| ≈ 1) between some parameters. Specifically, the parameter pairs (k_1, k_d1) and (k_3, k_d3) have correlations above 0.99 for the cell survival model, which suggests that these parameters may not be identifiable. This is generally in agreement with the parameter inference results in Table 4, with some exceptions. Our parameter inference algorithm suggests that k_1 is identifiable, whereas k_d1 is not, for the cell survival model. Thus, in order to increase the power of the practical identifiability analysis and to complement the correlation matrix, we have computed the FIM null eigenvectors, and for each eigenvector we identified the most dominant coefficients, which are plotted in Fig 6. We observe that there are six null eigenvectors associated with the zero eigenvalues of the FIM for both the cell survival and cell death models. The most dominant coefficient in each null eigenvector is associated with a parameter that can be considered practically non-identifiable. The identifiable parameters include k_1, k_d4 and k_d6 (indicated by the symbol ? in Table 4), which agree well with the results of our algorithm. On the contrary, our algorithm successfully infers one more parameter, k_d2, than the above analysis suggests. This could be due to the fact that checking practical identifiability using the FIM can be problematic, especially for partially observed nonlinear systems [37]. We have similar results for the cell death model.
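The null-eigenvector computation can be sketched as below, for a contrived three-parameter FIM in which the first two sensitivity columns are proportional (e.g., only a combination of those two parameters affects the observable); the sensitivities are illustrative, not those of the apoptosis model.

```python
import numpy as np

# Sensitivity matrix with column 1 = 2 * column 0, so the FIM has exactly
# one zero eigenvalue with null direction proportional to (2, -1, 0).
S = np.array([[1.0, 2.0, 0.3],
              [2.0, 4.0, 0.1],
              [0.5, 1.0, 0.9]])
FIM = S.T @ S

eigval, eigvec = np.linalg.eigh(FIM)
# Null eigenvectors: eigenvalues that are numerically zero.
null = eigvec[:, eigval < 1e-10 * eigval.max()]
# Dominant coefficient of each null eigenvector flags the parameter
# that can be considered practically non-identifiable.
dominant = [int(np.argmax(np.abs(null[:, j]))) for j in range(null.shape[1])]
```

Here the single null eigenvector is dominated by parameter 0, flagging it as non-identifiable in this contrived example.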

Ultradian endocrine model
The final test case for assessing the performance of the proposed algorithm is to infer the parameters of the ultradian model for the glucose-insulin interaction. We use a relatively simple ultradian model [38] with 6 state variables and 30 parameters. This model is developed in a non-pathophysiologic context and represents relatively simple physiologic mechanics. In this model, the main variables are the plasma insulin concentration I_p, the interstitial insulin concentration I_i, the glucose concentration G, and a three-stage filter (h_1, h_2, h_3) that reflects the response of the plasma insulin to glucose levels [38]. The resulting system of ODEs, the nominal values for the parameters of the ultradian model, and the initial conditions for the 6 state variables are given in S4 Text.
The nutritional driver I_G(t) is the systematic forcing of the model that represents the external sources of glucose from nutritional intake. Although the nutritional intake (modeled by the N discrete nutrition events) is required to be defined and properly recorded by the patients, it is not always accurately recorded or may contain missing values. Therefore, it would be useful to employ systems-informed neural networks not only to infer the model parameters given the nutrition events, but also to assume that the intake is unknown (hidden forcing) and infer the nutritional driver in Eq (S7f) (S4 Text) at the same time.

Table 4. Parameter values for the cell apoptosis model and their corresponding inferred values. The standard deviations are estimated using Eq. (S3) in S1 Text as practical identifiability analysis using the Fisher Information Matrix. The symbols ✓, † and ? in the last column denote that the corresponding parameter is identifiable using our proposed method, the FIM standard deviation, and the null-eigenvector analysis, respectively.
Model parameter inference given the nutrition events. We consider an exponential-decay functional form for the nutritional intake, I_G(t) = Σ_{j=1}^{N} m_j k exp(k(t_j − t)), where the decay constant k is the only unknown parameter and the three nutrition events are given by the (t_j, m_j) pairs (300, 60), (650, 40) and (1100, 50) (min, g). The only observable is the glucose level measurements G shown in Fig 7 (generated here synthetically by solving the system of ODEs), which are sampled randomly to train the neural networks for the time window of 0 to 1800 minutes. Because we only use the observation of G and have more than 20 parameters to infer, we limit the search range for these parameters. The range of seven parameters is adopted from [38], and the range for the other parameters is set as (0.2x, 1.8x), where x is the nominal value of that parameter (S3 Table).
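The intake function can be evaluated directly. The zeroing of each term before its event time (t < t_j) is our reading of the model's usage, and k = 0.03 is an illustrative value rather than the inferred one.

```python
import numpy as np

# Nutrition events (t_j [min], m_j [g]) from the text.
events = [(300, 60), (650, 40), (1100, 50)]

def I_G(t, k):
    # I_G(t) = sum_j m_j * k * exp(k * (t_j - t)) for t >= t_j
    # (each term is zeroed before its event time).
    return sum(m * k * np.exp(k * (tj - t)) * (t >= tj) for tj, m in events)

t = np.linspace(0, 1800, 1801)
intake = I_G(t, k=0.03)       # k = 0.03 is an assumed decay constant
```

Each event contributes a jump of height m_j·k at t = t_j followed by an exponential decay, so the total glucose delivered by an event integrates to m_j.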
For the first test case, we set the parameters V_p, V_i and V_g to their nominal values and infer the rest of the parameters. The inferred values are given in Table 5 (column Test 1), where we observe good agreement between the target and inferred values. For the second test, we also infer the values of V_p, V_i and V_g (Table 5). Although the inferred parameters are slightly worse than in Test 1, when using the inferred parameters we are able to solve the equations for unseen time instants with high accuracy. We perform forecasting for the second test case after training the algorithm using the glucose data in the time interval of t = 0 to 1800 min and inferring the model parameters. Next, we consider that there is a nutrition event at time t_j = 2000 min with a carbohydrate intake of m_j = 100 g. As shown in Fig 8, we are able to forecast with high accuracy the glucose-insulin dynamics, more specifically the glucose levels following the nutrition intake.
Model parameter inference with hidden nutrition events. As detailed in the following, one of the significant advantages of the systems-informed neural network is its ability to infer the hidden systematic forcing in the model. For example, in the glucose-insulin model, the nutritional driver I G is the forcing that we aim to infer as well. Here, we use the glucose measurements to train the model for the time interval t = 0 − 1800 min shown in Fig 7, while we assume that the time instants and quantities of three nutritional events are additionally unknown.
We found that it is difficult to infer all the parameters as well as the timing and carbohydrate content of each nutrition event. However, given V_p, V_i, V_g and the timing of each nutrition event, the algorithm is capable of inferring the other model parameters as well as the carbohydrate content (S4 Table, column Test 1). Having estimated the nutrition events as well as all other unknown parameters, we are able to forecast the glucose levels for t = 1800 to 3000 min, assuming there has been a nutritional intake of (t_j, m_j) = (2000, 100). The predictions for the glucose G and the nutritional driver I_G are shown in S7 Fig, which shows excellent agreement in the forecasting of glucose levels. For the second test, we also infer the values of V_p, V_i and V_g, and the result is slightly worse (S4 Table, column Test 2, and S8 Fig).
If both the timing and the carbohydrate content of each nutrition event are unknown, the algorithm is also capable of inferring them, provided that certain model parameters are assumed known. We found that the selection of the known parameters is important. As shown in S4 Table,

Discussion
We presented a new and simple-to-implement "systems-biology-informed" deep learning algorithm that can reliably and accurately infer the hidden dynamics described by a mathematical model in the form of a system of ODEs. The system of ODEs is encoded into a plain, otherwise "uninformed", deep neural network and is enforced by minimizing a loss function that includes the residuals of the ODEs. Enforcing the equations in the loss function adds additional constraints to the learning process, which leads to several advantages of the proposed algorithm: first, we are able to infer the unknown parameters of the system of ODEs once the neural network is trained; second, we can use a minimal amount of data on a few observables to infer the dynamics and the unknown parameters; third, the enforcement of the equations adds a regularization effect that makes the algorithm robust to noise (we have not used any other regularization technique); and lastly, the measurements can be scattered, noisy, and few. The problem of structural and practical non-identifiability (such as the one encountered in the cell apoptosis model) is a long-standing problem in the field of systems identification and has been under extensive research, e.g., [39]. Structural non-identifiabilities originate from incomplete observation of the internal model states. Because structural non-identifiability is independent of the accuracy of the experimental measurements, we cannot resolve it by a refinement of existing measurements; one possible way to resolve this issue is to increase the number of observed species. Our focus in this study is mostly on practical identifiability, which can guide us to redesign the experiment, improve the model, or collect more experimental measurements.
In this study, we used the FIM and local sensitivity analysis for the identifiability analysis, but we note that the FIM has many limitations and can be problematic, especially for partially observed nonlinear systems [37]; hence, other advanced alternatives [15,40] should be used in future works. However, our goal in this work was not to perform a systematic identifiability analysis, but rather to use identifiability analysis to explain some of our findings.

Conclusion
We have used three benchmark problems to assess the performance of the algorithm, including a highly nonlinear glycolysis model, a non-identifiable cell apoptosis model, and an ultradian glucose-insulin model for glucose forecasting based on the nutritional intake. Given the system of ODEs and the initial conditions of the state variables, the algorithm is capable of accurately inferring the whole dynamics with one or two observables, where the unknown parameters are also inferred during the training process. An important and very useful outcome of the algorithm is its ability to infer the systematic forcing or driver in the model, such as the nutritional intake in the glucose-insulin model. In this work, we considered the synthetic data of three small problems to test the performance and limitations of the proposed method. We will apply our method to larger problems and real data (e.g., the dataset introduced in [41]) in future work.

Supporting information

S7 Fig. Scattered observations of glucose level are randomly sampled from 0 to 1800 min and used for training. The parameter k in the intake function I_G as well as the carbohydrate content (m_j) of each nutrition event are treated as unknown, while V_p, V_i, V_g and the timing (t_j) of each nutrition event are given. (PDF)

S8 Fig. Scattered observations of glucose level are randomly sampled from 0 to 1800 min and used for training. The parameter k in the intake function I_G as well as the carbohydrate content (m_j) of each nutrition event are treated as unknown, while the timing (t_j) of each nutrition event is given. (PDF)

Scattered observations of glucose level are randomly sampled from 0 to 1800 min and used for training. The timing (t_j) and carbohydrate content (m_j) of each nutrition event are treated as unknown, while the parameter k in the intake function I_G is given. (PDF)

S1 Table. Full list of parameters for the glycolytic oscillator model [33]. (PDF)

S2 Table. Full list of parameters for the cell apoptosis model [36]. (PDF)

S3 Table. Full list of parameters for the ultradian glucose-insulin model [42]. The search range for the first 7 parameters is adopted from [38], and the range for the other parameters is (0.2x, 1.8x), where x is the nominal value of that parameter. (PDF)