Fig 1.
Pipeline for data-driven model discovery for dynamical systems with model selection.
A. Overview of the pipeline. Raw data is converted to batch training data for learning hybrid dynamical models using different hyperparameters. Dynamics approximated by the best hybrid dynamical model are used for inferring ODE models with sparse regression. Inferred ODE models are evaluated on model fit and extrapolation. B. Each sample in the raw training data is split using a sliding window to obtain samples on shorter time spans. These short samples are assembled into batches for training. C. A hybrid dynamical model incorporating a neural network NN(x) is trained by simulating over the training time span and backpropagating the loss between simulated data and training data. D. The trained hybrid dynamical model is input to sparse regression for equation learning via the STLSQ algorithm, and the final inferred model is output.
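The sliding-window batching of panel B can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the toy trajectory, and the shuffling step are assumptions; the window and batch sizes match the hyperparameters reported for setting 1 in Fig 2.

```python
import numpy as np

def sliding_window_batches(trajectory, window, batch_size, rng=None):
    """Split one trajectory (T x D array) into overlapping windows of
    length `window`, then group the shuffled windows into batches."""
    rng = np.random.default_rng() if rng is None else rng
    # All windows obtained by sliding one step at a time (panel B).
    windows = np.stack([trajectory[i:i + window]
                        for i in range(len(trajectory) - window + 1)])
    order = rng.permutation(len(windows))  # shuffle before batching
    return [windows[order[i:i + batch_size]]
            for i in range(0, len(order), batch_size)]

data = np.random.rand(100, 2)  # toy trajectory: 100 time steps, 2 states
batches = sliding_window_batches(data, window=10, batch_size=10)
```

Each batch is then simulated over its short time span, which keeps the loss landscape for the hybrid model better conditioned than fitting the full trajectory at once.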
Fig 2.
Evaluation of hybrid dynamical model fits on the Lotka-Volterra model.
A. The Lotka-Volterra system describes predator-prey relationships in an ecosystem, here parameterized by (α, β, γ, δ). B. Example simulation with parameters (1.3, 0.9, 0.8, 1.8). The population dynamics oscillate on a stable limit cycle. C and D. To evaluate model discovery methods, additive or multiplicative noise is added to the underlying deterministic dynamics at different noise levels. The mean trajectories of 200 samples for each noise model are shown; ribbons represent ± 3 s.d. E and F. Comparison of fits using hybrid dynamical models. In setting 1, good fits to the data were not obtained (high validation loss). Training parameters: learning rate 0.001; window size 10; batch size 10. In setting 2, a good fit to the data was obtained. Training parameters: learning rate 0.01; window size 5; batch size 5.
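The data generation in panels B–D can be sketched as follows, assuming the standard Lotka-Volterra form consistent with the known terms αx1 and −δx2 quoted in Fig 4; the initial condition, time span, and the exact noise construction are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lotka-Volterra with (alpha, beta, gamma, delta) = (1.3, 0.9, 0.8, 1.8), panel B.
a, b, g, d = 1.3, 0.9, 0.8, 1.8

def lotka_volterra(t, x):
    x1, x2 = x
    return [a * x1 - b * x1 * x2,   # prey: growth minus predation
            g * x1 * x2 - d * x2]   # predator: predation minus decay

t_eval = np.linspace(0, 10, 101)
sol = solve_ivp(lotka_volterra, (0, 10), [1.0, 1.0], t_eval=t_eval, rtol=1e-8)

# Panels C and D: additive vs multiplicative noise at a 5% level (illustrative).
rng = np.random.default_rng(0)
noise = 0.05
additive = sol.y + noise * sol.y.std() * rng.standard_normal(sol.y.shape)
multiplicative = sol.y * (1 + noise * rng.standard_normal(sol.y.shape))
```

Repeating the noise draw 200 times and averaging reproduces the mean-and-ribbon summaries shown in panels C and D.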
Table 1.
Model selection configuration for Lotka-Volterra.
Table 2.
Best ODE model inferred from Lotka-Volterra data at each noise level.
Fig 3.
Inferred models from Lotka-Volterra datasets.
For each model, the first term in each equation (αx1 in the x1 equation and −δx2 in the x2 equation) is known and the remaining terms are inferred. Underline denotes terms that are incorrect relative to the true model (Eqs 4). A–C. Inferred models with lowest AICc for datasets with additive noise at 1, 5 and 10%. D–F. Inferred models with lowest AICc for datasets with multiplicative noise at 1, 5 and 10%. G. The model with correct terms could be inferred at 10% additive noise, but was not ranked highest by AICc. H. The model with correct terms could be inferred at 10% multiplicative noise, but was not ranked highest by AICc.
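The AICc ranking used in panels A–H can be sketched with the common least-squares form of the corrected Akaike information criterion; whether the paper uses exactly this RSS-based formulation is an assumption, and the numbers below are illustrative.

```python
import numpy as np

def aicc(rss, n, k):
    """Corrected Akaike information criterion for a least-squares fit
    with n observations and k free parameters (RSS-based form)."""
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction

# A sparser model can be ranked higher despite a slightly worse fit,
# because the parameter-count penalty outweighs the small RSS difference:
sparser_wins = aicc(rss=1.1, n=100, k=4) < aicc(rss=1.0, n=100, k=9)
```

This penalty is also why a model with all correct terms can lose the ranking (panels G and H): a slightly sparser but wrong model can achieve a lower AICc on noisy data.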
Table 3.
Lowest AICc values of models inferred from Lotka-Volterra datasets using various methods.
Fig 4.
Comparison of derivatives approximated from Lotka-Volterra datasets using various methods.
A. Derivatives approximated by base SINDy, which uses the method of finite differences. B. Derivatives approximated by finite differences, with known terms (i.e. αx1 and −δx2 in Eqs 5) subtracted, as used by base-known. C. Derivatives approximated by a pure neural network formulation, in which the full right-hand side of an ODE system is approximated by a neural network. D. Derivatives approximated by the neural network in the hybrid model, where the neural network is fitted to the partial dynamics (Eqs 5).
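Panels A and B can be sketched on a toy signal: central finite differences give the raw derivative estimate (base SINDy), and subtracting the known term leaves only the unknown partial dynamics as the regression target (base-known). The sinusoidal signal and noise level are illustrative; α = 1.3 is taken from the parameters in Fig 2.

```python
import numpy as np

# Toy noisy signal standing in for x1(t).
t = np.linspace(0, 10, 201)
x1 = np.sin(t) + 0.01 * np.random.default_rng(1).standard_normal(t.size)

# Panel A: derivatives via central finite differences, as in base SINDy.
dx1 = np.gradient(x1, t)

# Panel B: subtract the known term (alpha * x1) so that only the unknown
# part of the right-hand side remains as the regression target.
alpha = 1.3
residual_target = dx1 - alpha * x1
```

Note that finite differences amplify measurement noise, which is the contrast the figure draws against the smoother neural-network approximations in panels C and D.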
Fig 5.
Evaluation of base SINDy fits and data generation for the repressilator model.
A. The repressilator model describes a coupled negative feedback loop between three interacting proteins. B. Example simulation of the repressilator model without noise. The concentrations of the three proteins oscillate and the system eventually reaches a stable limit cycle. C. Evaluation of base SINDy for inferring repressilator models from noise-free data on t = [0, 10]. Inferred equations (left) and simulation of the inferred model on t = [0, 30] (right). Underlined terms are incorrect in comparison to the true repressilator model (Eqs 6). D and E. To evaluate model discovery methods, additive or multiplicative noise is added to the underlying deterministic dynamics at different noise levels. The mean trajectories of 200 samples for each noise model are shown; ribbons represent ± 3 s.d.
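The simulation in panel B can be sketched with a minimal protein-only repressilator, in which each protein represses the next in the cycle through a Hill term and decays linearly (consistent with the known linear decay terms in Fig 6). The values of β, n, and the initial condition are illustrative, not the parameters used in the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

beta, n = 10.0, 3.0  # illustrative Hill parameters

def repressilator(t, x):
    x1, x2, x3 = x
    return [beta / (1 + x3**n) - x1,   # x3 represses x1; linear decay
            beta / (1 + x1**n) - x2,   # x1 represses x2
            beta / (1 + x2**n) - x3]   # x2 represses x3

sol = solve_ivp(repressilator, (0, 30), [1.0, 1.5, 2.0],
                t_eval=np.linspace(0, 30, 301), rtol=1e-8)
```

Simulating past the training span (t = [0, 10] to t = [0, 30]) is what exposes the extrapolation failures of the models inferred by base SINDy in panel C.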
Table 4.
Model selection configuration for the repressilator model.
Table 5.
Best ODE model inferred from repressilator data at each noise level.
Fig 6.
Models inferred from repressilator datasets.
For each model, the linear decay terms are known and the remaining terms are inferred. Underline denotes terms that are incorrect relative to the true model (Eqs 6). A–C. Inferred models with lowest AICc for datasets with additive noise at 0.1, 1 and 10%. D–F. Inferred models with lowest AICc for datasets with multiplicative noise at 0.1, 1 and 10%. G. The model with correct terms could be inferred at 10% additive noise, but was not ranked highest by AICc. H. The model with correct terms could be inferred at 10% multiplicative noise, but was not ranked highest by AICc.
Fig 7.
Model discovery of cell state transition dynamics from scRNA-seq data.
A. Diagram of possible cell state transitions during the epithelial-mesenchymal transition (EMT). B. EMT data with uncertainty estimates, generated for model training from scRNA-seq measurements of cell state transitions over pseudotime. C. Simulations from the ODE model inferred by base SINDy. D. Simulations from the ODE model inferred from the hybrid model (Eq 9); equations of the inferred model (right). x1: epithelial cell state; x2: intermediate cell state; x3: mesenchymal cell state. E. Simulations from the ODE model inferred from the pure NN model. For C–E, the vertical dashed lines indicate the end of the time span on which the inferred models were trained.
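The transition structure of panel A can be illustrated with a generic unidirectional model, epithelial → intermediate → mesenchymal. This is not Eq 9 or the model inferred in the paper; the rate constants k1 and k2 are hypothetical, and the sketch only shows the kind of three-state ODE being fitted.

```python
import numpy as np
from scipy.integrate import solve_ivp

k1, k2 = 0.5, 0.3  # hypothetical transition rates E -> I -> M

def emt(t, x):
    x1, x2, x3 = x  # epithelial, intermediate, mesenchymal fractions
    return [-k1 * x1,
            k1 * x1 - k2 * x2,
            k2 * x2]

sol = solve_ivp(emt, (0, 20), [1.0, 0.0, 0.0],
                t_eval=np.linspace(0, 20, 201), rtol=1e-8)
```

Because the transitions only move mass forward, the state fractions sum to one at all times, and the mesenchymal fraction dominates at late pseudotime, mirroring the qualitative behavior the inferred models are asked to extrapolate past the dashed lines.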