Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI

A major tenet in theoretical neuroscience is that cognitive and behavioral processes are ultimately implemented in terms of the neural system dynamics. Accordingly, a major aim for the analysis of neurophysiological measurements should lie in the identification of the computational dynamics underlying task processing. Here we advance a state space model (SSM) based on generative piecewise-linear recurrent neural networks (PLRNN) to assess dynamics from neuroimaging data. In contrast to many other nonlinear time series models which have been proposed for reconstructing latent dynamics, our model is easily interpretable in neural terms, amenable to systematic dynamical systems analysis of the resulting set of equations, and can straightforwardly be transformed into an equivalent continuous-time dynamical system. The major contributions of this paper are the introduction of a new observation model suitable for functional magnetic resonance imaging (fMRI) coupled to the latent PLRNN, an efficient stepwise training procedure that forces the latent model to capture the ‘true’ underlying dynamics rather than just fitting (or predicting) the observations, and of an empirical measure based on the Kullback-Leibler divergence to evaluate from empirical time series how well this goal of approximating the underlying dynamics has been achieved. We validate and illustrate the power of our approach on simulated ‘ground-truth’ dynamical systems as well as on experimental fMRI time series, and demonstrate that the learnt dynamics harbors task-related nonlinear structure that a linear dynamical model fails to capture. Given that fMRI is one of the most common techniques for measuring brain activity non-invasively in human subjects, this approach may provide a novel step toward analyzing aberrant (nonlinear) dynamics for clinical assessment or neuroscientific research.


Introduction
A central tenet in computational neuroscience is that computational processes in the brain are ultimately implemented through (stochastic) nonlinear neural system dynamics (Breakspear, 2017;Izhikevich, 2007;Wilson, 1999).This idea reaches from Hopfield's (Hopfield, 1982) early proposal on memory patterns as fixed point attractors in recurrent neural networks, working memory as rate attractors (Durstewitz et al., 2000;Wang, 2001), decision making as stochastic transitions among competing attractor states (Albantakis and Deco, 2009), motor or thought sequences as limit cycles or heteroclinic chains of saddle nodes (Rabinovich et al., 2008a;Rabinovich et al., 2008b), to the role of line attractors in parametric working memory (Machens et al., 2005;Rabinovich and Varona, 2011;Romo et al., 1999), neural integration (Seung et al., 2000), interval timing (Durstewitz, 2003), and more recent thoughts on the role of transient dynamics in cognitive processing (Balaguer-Ballester et al., 2017).To test and further develop such theories, methods for directly assessing system dynamics from neural measurements would be of great value.
Traditionally, mostly linear approaches like linear (Gaussian or Gaussian-Poisson) state space models (Macke et al., 2015;Paninski et al., 2010;Smith and Brown, 2003), Gaussian Process Factor Analysis (GPFA; Yu et al., 2009), Dynamic Causal Modeling (DCM; Friston et al., 2003), or (nonlinear, but model-free) delay embedding techniques (Balaguer-Ballester et al., 2011;Lapish et al., 2015), have been used for reconstructing state space trajectories from experimental recordings.While these are powerful visualization tools that may also give some insight into system parameters, like connectivity (Friston et al., 2003), linear dynamical systems (DS) are inherently very limited with regards to the range of dynamical phenomena they can produce (e.g.Strogatz, 2018).The representation of smoothed trajectories in the latent space may still inform the researcher about interesting aspects of the dynamics, but the inferred latent model on its own is not powerful enough to reproduce many interesting and computationally important phenomena like multi-stability, complex limit cycles, or chaos (Durstewitz, 2017a;Strogatz, 2018).More formally, given experimental observations  = {  } supposedly generated by some underlying latent dynamical process  = {  } (Figure 1), linear state space models may yield a useful approximation to the posterior (|), butdue to their linear limitationsthey will not produce an adequate mathematical model of the prior dynamics () that could generate the actual observations via (|).
In contrast, recurrent neural networks (RNNs) represent a class of nonlinear DS models which are universal in the sense that they can approximate arbitrarily closely the flow of any other dynamical system (Funahashi and Nakamura, 1993;Kimura and Nakano, 1998;Trischler and D'Eleuterio, 2016).
Hence, RNNs are, in theory, sufficiently powerful to emulate any type of brain dynamics.Based on previous work embedding RNNs into a statistical inference framework (Roweis and Ghahramani, 2000;Yu et al., 2005), we have recently developed a nonlinear state space model utilizing piecewiselinear RNNs (PLRNNs) for the latent dynamical process (Durstewitz, 2017b).In state space models, similar to sequential variational auto-encoders (VAEs; Bayer and Osendorfer, 2015;Chung et al., 2015;Kingma and Welling, 2013), one attempts to infer the system parameters  by maximizing a lower bound on the log-likelihood log  (𝐗|𝛉).In contrast to many other RNN-based approaches (Roweis and Ghahramani, 2000;Zhao and Park, 2018), including sequential VAEs (Pandarinath et al., 2018), our method returns a set of neuronally interpretable and partly analytically tractable dynamical equations that could be used to gain further insight into the generating system.
The present work further advances this powerful methodology along three major directions: First, given that fMRI is likely the most important non-invasive technique for gaining insight into human brain function in healthy subjects and psychiatric illness, we provide an observation ('decoder') model for the PLRNN that takes the hemodynamic response filtering into account.Second, we develop a stepwise initialization and training scheme that forces the latent PLRNN model toward the correct underlying dynamics: Good prediction of the time series observations and informative smooth latent trajectories may be achieved even without recreating a sufficiently good approximation to the underlying DS (as evidenced by the success of linear state space models).Through a kind of annealing protocol that places increasingly more burden of predicting the observations onto the latent process model, we enforce the correct dynamics.Third, we show that a Kullback-Leibler divergence defined across state space between the prior generative model dynamics () (independent of the observations) and the inferred latent states given the observations, (|), provides a good measure for how well the reconstructed DS (emulated by the PLRNN) can be expected to have captured the correct underlying system.Hence, our approach, rather than just inferring the latent space underlying the observations, attempts to force the system to capture the correct dynamics in its governing equations, and provides a quantitative sense of how well this worked for any empirically observed system for which the ground truth is not known.

PLRNN-based state space model (PLRNN-SSM)
We start by introducing our nonlinear state space model (SSM) and statistical inference framework (originally developed in Durstewitz, 2017b).Within a SSM, one aims to predict observed experimental time series   ∈ ℝ  from a potentially much lower dimensional set of latent variables   ∈ ℝ  and their temporal dynamics.Here we use a piecewise-linear recurrent neural network (PLRNN) (i.e., a RNN composed of rectified-linear units [ReLUs]) for modeling the unknown latent dynamics: (1)   =  −1 + ( −1 ) +  +   +   ,   ~(, ) 1 ~( 0 +  1 , ) where   is the latent state vector at time t=1…T,  ∈ ℝ x is a diagonal matrix of (linear) autoregression weights (or 'time constants'),  ∈ ℝ x an off-diagonal matrix of connection weights, and (  ) = max (  , 0) is an (element-wise) ReLU transfer function.  ∈ ℝ  denotes time-dependent external inputs that influence latent states through coefficient matrix  ∈ ℝ x , and   is a Gaussian white noise process with diagonal covariance matrix .(The basic model was modified from Durstewitz (2017b) to enable efficient estimation of bias parameters  and speeding up the inference algorithm by orders of magnitude.)The diagonal and off-diagonal structure of  and , respectively, besides having a physiological interpretation (see Durstewitz, 2017b), help to ensure that system parameters remain identifiable.
The observed time series are generated from the ReLU-transformed latent states (eq.( 1)) through a linear-Gaussian model: (2)   = (  ) +   ,   ~(, ) where   are the observed N-dimensional measurements at time t generated from   ,  ∈ ℝ x is a matrix of regression weights (factor loadings), and   denotes a Gaussian white observation noise process with diagonal covariance matrix .
Thus, the model is specified by the set of parameters  = { 0 , , , , , , , }, and we are interested in recovering  as well as the posterior distribution (|) over the latent state path  = { 1: } from the experimentally observed time series  = { 1: } and experimental inputs  = { 1: }.In the following, we will sometimes use the notation   = { 0 , , , , , } and   = {, } to exclusively refer to parameters in the evolution or observation equation, respectively.

Observation model for BOLD time series
An appealing feature of the SSM framework is that different measurement modalities and properties can be accommodated by connecting different observation models to the same latent model.In order to apply our model to fMRI time series, we need only to adapt observation eq. ( 2) to meet the distributional assumptions and temporal filtering of the blood-oxygen-level dependent (BOLD) signal, while retaining process eq. ( 1) with its universal approximation capabilities.In contrast to electrophysiological measurements, BOLD time-series are a strongly filtered, highly smoothed version of some underlying neural process, only accessible through the hemodynamic response function (HRF) (e.g.Worsley and Friston, 1995).Hence, we modified the observation model (eq.( 2)) such that the observed BOLD signal is generated from the latent states (eq.( 1)) through a linear-Gaussian model with HRF convolution: (3)   = (ℎ *  : ) +   +   ,   ~(, ) where   are the observed BOLD responses in N voxels at time t generated from  : , (concatenated into one vector and convolved with the hemodynamic response function).We also added nuisance predictors   ∈ ℝ  , which account for artifacts caused, e.g., by movements. ∈ ℝ x is the coefficient matrix of these nuisance variables, and ,  and   are the same as in eq. ( 2).Hence, the observation model takes the typical form of a General Linear Model for BOLD signal analysis as, e.g., implemented in the statistical parametric mapping (SPM) framework (Worsley and Friston, 1995).Note that while nuisance variables are assumed to directly blur into the observed signals (they do not affect the neural dynamics but rather the recording process), external stimuli presented to the subjects are, in contrast, assumed to exert their effects through the underlying neuronal dynamics (eq.( 1)).Thus, the fMRI PLRNN-SSM (termed 'PLRNN-BOLD-SSM') is now specified by the set of parameters  = { 0 , , , , , , , , }.Model inference is performed through a type of Expectation-Maximization (EM) algorithm (see Methods and full derivations in the supplement).
One complication here is that the observations in eq. ( 3) do not just depend on the current state   as in a conventional SSM, but on a set of states  : across several previous time steps.This severely complicates standard solution techniques for the E-step like extended or unscented Kalman filtering (Durbin and Koopman, 2012).Our E-step procedure (cf.Durstewitz, 2017b), however, combines a global Laplace approximation with an efficient iterative (fixed point-type) mode search algorithm that exploits the sparse, block-banded structure of the involved covariance (inverse Hessian) matrices, which is more easily adapted for the current situation with longer-term temporal dependencies (see Methods sect.'Model specification and inference' & Supplement for further details).

Stepwise initialization and training protocol
The EM-algorithm aims to compute (in the linear case) or approximate the posterior distribution (|) of the latent states given the observations in the E-step, in order to maximize the expected joint log-  Goodfellow et al., 2016;Hinton et al., 2006;LeCun et al., 2015;Talathi and Vartak, 2015).Since  1)) in successive optimization steps (see also Abarbanel et al., 2018).In brief, while early steps of the training scheme prioritize the fit to the observed measurements through the observation (or 'decoder') model (|) (eqns.( 2),(3)), subsequent annealing steps shift the burden of reproducing the observations onto the latent model () (eq.( 1)) by, at some point, fixing the observation parameters   , and then enforcing the temporal consistency within the latent model equations (as demanded by eq. ( 1)) by gradually boosting the contribution of this term to the loglikelihood (see Methods).

Evaluation of training protocol
We examined the performance of this annealing protocol in terms of how well the inferred model was capable of recovering the true underlying dynamics of the Lorenz system.This 3-dimensional benchmark system (equations and parameter values used given in Figure 3 legend), conceived by Edward Lorenz in 1963 to describe atmospheric convection (Lorenz, 1963), exhibits chaotic behavior in certain regimes (Figure 3A).We measured the quality of DS reconstruction by the Kullback-Leibler divergence   (  (),   (|)) between the spatial probability distributions   () over observed system states in -space from trajectories produced by the (true) Lorenz system and   (|) from trajectories generated by the trained PLRNN-SSM (  , in the following refers to this divergence evaluated in observation space, see eq. ( 9) in Methods, while  ̃ denotes a normalized version of this measure; see Figure 1 and Methods sect.'Reconstruction of benchmark dynamical systems' for details).Note that, importantly, our measure compares attractor geometries in state space (the target of the delay embedding theorems; Sauer et al., 1991;Takens, 1981), instead of comparing the fit directly on the time series themselves which can be highly misleading for chaotic systems (e.g.Wood, 2010).
Trajectories of length T=1000 were drawn with process noise (σ 2 =.3) from the Lorenz system and handed to the inference algorithm with M={8, 10, 12, 14} latent states (for statistics, a total of 100 such trajectories were simulated and model fits carried out on each).Models were trained through the 'PLRNN-SSM-anneal' protocol (Algorithm-1 in Methods) and compared to models trained from random initial conditions (termed 'PLRNN-SSM-random') in which parameters were randomly initialized (see Figure 2).9)).Bottom: analysis pipeline for experimental data.We used preprocessed fMRI data from human subjects undergoing a classic working memory n-back paradigm.First, nuisance variables, in this case related to movement, were collected.Then, time series obtained from regions of interest (ROI) were extracted, standardized, and filtered (in agreement with the study design).From these preprocessed time series, we derived the first principle components and handed them to the inference algorithm (once including and once excluding variables indicating external stimulus presentations during the experiment).With the inferred parameters, the system was then run freely to produce new trajectories which were compared to the state space distribution from the inferred trajectories via the Kullback-Leibler divergence   (see eq. ( 11)).
In general, the PLRNN-SSM-anneal protocol significantly decreased the normalized KL divergence  ̃ (eq.( 9)) and increased the joint log-likelihood when compared to the PLRNN-SSM-random initialization scheme (see Figure 2A, B, independent t-test on  ̃: t(686)=-16.3,p<.001, and on the expected joint log-likelihood: t(640)=11.32,p<.001).More importantly though, the PLRNN-SSManneal protocol produced more estimates for which  ̃ was in a regime in which the chaotic attractor could be well reconstructed (see Figure 3, grey shaded area indicates  ̃ values for which the chaotic attractor was reproduced).Furthermore, the expected joint log-likelihood increased (Figure 2D) while

Reconstruction of benchmark dynamical systems
After establishing an efficient training procedure designed to enforce recovery of the underlying DS by the prior model (eq.( 1)), we more formally evaluated dynamical reconstructions on the chaotic Lorenz system and on the van der Pol nonlinear oscillator.The van der Pol oscillator with nonlinear dampening is a simple 2-dimensional model for electrical circuits consisting of vacuum tubes (vdP; van der Pol, 1926 equations given in Figure 3).Figure 3 illustrates its flow field in the plane, together with several trajectories converging to the system's limit cycle (note that training was always performed on samples of the time series, not on the generally unknown flow field).
As for the Lorenz system, we drew 100 time series samples of length T=1000 with process noise (σ 2 =.1) using Runge-Kutta numerical integration, and handed each of those over to a separate PLRNN-SSM inference run with M={8, 10, 12, 14} latent states.As above, reconstruction performance was assessed in terms of the (normalized) KL divergence  ̃ (eq.( 9)) between the distributions over true and generated states in state space.In addition, for the chaotic attractor, the absolute difference between Lyapunov exponents (e.g.Kantz and Schreiber, 2004) from the true vs. the PLRNN-SSMgenerated trajectories was assessed, as another measure of how well hallmark dynamical characteristics of the chaotic Lorenz system had been captured.For the vdP (non-chaotic) oscillator, we instead assessed the correlation between the power spectrum of the true and the generated trajectories (see Methods sect.'Reconstruction of benchmark dynamical systems').
Overall, our PLRNN-SSM-anneal algorithm managed to recover the nonlinear dynamics of these two benchmark systems (see Figure 3).The inferred PLRNN-SSM equations reproduced the 'butterfly' structure of the somewhat challenging chaotic attractor very well (Figure 3D).The  ̃ measure effectively captured this reconstruction quality, with PLRNN reconstructions achieving values below  ̃ ≈ .4agreeing well with the Lorenz attractor's 'butterfly' structure as assessed by visual inspection (see Figure 3B).At the same time, for this range of KL ̃x values the deviation between Lyapunov exponents of the true and generated Lorenz system was generally very low (see Figure 3C, grey shaded area).If we accept this value as an indicator for successful reconstruction, our algorithm was successful in 15%, 24%, 35%, and 28% of all samples for M=8, 10, 12, and 14 states, respectively (note that our algorithm had access only to rather short time series of T=1000, to create a situation comparable to the fMRI data studied later).Importantly and in contrast to most previous studies, note we requested full independent generation of the original attractor object from the once trained PLRNN.
That is, we neither 'just' evaluated the posterior (|) conditioned on the actual observations (as, e.g. in Archer et al., 2015;or Pandarinath et al., 2018), nor did we 'just' assess predictions a couple of time steps ahead (as, e.g., in Durstewitz, 2017b), but rather defined a much more ambitious goal for our algorithm.
For the vdP system, our inference procedure yielded agreeable results in 20%, 31%, 25%, and 35% of all samples for M=8, 10, 12, and 14 states, respectively (grey shaded area in Figure 3F).Furthermore, around 50% of all estimates generated stable limit cycles and hence a topologically equivalent attractor object in state space, although these limit cycles varied a lot in frequency and amplitude compared to the true oscillator.Like for the Lorenz system, the  ̃ measure generally served as a good indicator of reconstruction quality (see Figure 3H), particularly when combined with the power spectrum correlation (Figure 3G), although low  ̃ values did not always guarantee and high values did not exclude the retrieval of a stable limit cycle.for van der Pol system.G. Correlation of the spectral density between true and reconstructed van der Pol systems.A significant negative correlation between the agreement in the power spectrum (high values on y-axis) and  ̃ again supports that the normalized KL divergence defined across state space (eq.( 9)) captures the dynamics (we note that measuring the correlation between power spectra comes with its own problems, however).H. Same as in D for van der Pol system.Note that even reconstructed systems with high  ̃ values may capture the limit cycle behavior and thus the basic topological structure of the underlying true system (in general, the 2-dimensional vdP system is likely easier to reconstruct than the chaotic Lorenz system; vice versa, low  ̃ values do not generally ascertain that the reconstructed system exhibits the same frequencies).

Reconstruction of experimental data
We next tested our PLRNN inference scheme, with a modified observation model that takes the hemodynamic response filtering into account (PLRNN-BOLD-SSM; see sect.'Observation model for BOLD time series'), on a previously published experimental fMRI data set (Koppe et al., 2014).In brief, the experimental paradigm assessed three cognitive tasks presented within repeated blocks, two variants of the well-established working memory (WM) n-back task: a 1-back continuous delayed response task (CDRT), a 1-back continuous matching task (CMT), and a (0-back control) choice reaction task (CRT).Exact details on the experimental paradigm, fMRI data acquisition, preprocessing, and sample information can be found in (Koppe et al., 2014).From these data obtained from 26 subjects, we preselected as time series the first principle component from each of 10 bilateral regions identified as relevant to the n-back task in a previous meta-analysis (Owen et al., 2005).These time series along with the individual movement vectors obtained from the SPM realignment procedure (see also Methods sect.'Data acquisition and preprocessing') were given to the inference algorithm for each subject: Models with M={1,…,10} latent states were inferred twice: once explicitly including, and once excluding external (experimental) inputs (i.e., in the latter analysis, the model had to account for fluctuations in the BOLD signal all by itself, without information about changes in the environment).
For experimentally observed time series, unlike for the benchmark systems, we do not know the ground truth (i.e., the true data generating process), and generally do not have access to the complete true state space either (but only to some possibly incomplete, nonlinear projection of it).Thus, we cannot determine the agreement between generated and true distributions directly in the space of observables, as we could for the benchmark systems.Therefore we use a proxy: If the prior dynamics is close to the true system which generated the experimental observations, and those represent the true dynamics well (at the very least, they are the best information we have), then the distribution of latent states constrained by the data, i.e. (|), should be a good representative of the distribution over latent states generated by the prior model on its own, i.e. ().Hence, our proxy for the reconstruction quality is the KL divergence   (  (|),   ()) (  for short, or, when normalized,  ̃; see eq. ( 11) in Methods) between the posterior (inferred) distribution   (|) over latent states  conditioned on the experimental data , and the spatial distribution   () over latent states as generated by the model's prior (governing the free-running model dynamics; we use capital letters, , and lowercase letters, , to distinguish between full trajectories and single vector points in state space, respectively).Note that the latent space defines a complete state space as we have that complete model available (also note that our measure, as before, assesses the agreement in state space, not the agreement between time series).
For the benchmark systems, our proposed proxy   was well correlated with the KL divergence   assessed directly in the complete observation space, i.e., between true and generated distributions (Figure 4A, r=.72 on a logarithmic scale, p<.001; likewise,   (  (|),   ()) and   (  (),   (|)) were generally correlated highly; r>.9, p<.001).Moreover, although especially for chaotic systems we would not necessarily expect a good fit between observed or inferred and generated time series (c.f.Wood, 2010),  ̃ on the latent space turned out to be significantly related to the correlation between inferred and generated latent state series in our case (on a logarithmic scale, see Figure 4B).That is, lower  ̃ values were associated with a better match of inferred and generated state trajectories.  (), is significantly lower than for the permutation bootstraps (all p<.01), suggesting that the prior dynamics contains taskrelated information.Unstable system estimates were removed from analysis.F. Example of inferred (top) and generated (bottom) state space trajectories of a sample subject projected down to the first 3 principle components for visualization purposes, color-coded for distinct task phases (see legend).
This tight relation was particularly pronounced in models including external inputs (Figure 4B blue,top).This is expected, as in this case the internal dynamics are reset or partly driven by the external inputs, which will therefore induce correlations between directly inferred and freely generated trajectories.Thus, overall,   was slightly lower for models including external inputs as compared to autonomous models (see also Figure 4C).One simple but important conclusion from this is that knowledge about additional external inputs and the experimental task structure may (strongly) help to recover the true underlying DS.This was also evident in the mean squared error on n-step ahead projections of generated as compared to true data (Figure 4D), i.e. when comparing predicted observations from the PLRNN-BOLD-SSM run freely for n time steps to the true observations (once again we stress, however, that a measure evaluated directly on the time series may not necessarily give a good intuition about whether the underlying DS has been captured well).Accuracy of n-stepahead predictions also generally improved with increasing latent state dimensionality, that is, adding latent states to the model appeared to enhance the dynamical reconstruction within the range studied here.
To ensure that the retrieved dynamics did not simply capture data variation related to background fluctuations in blood flow (or other systematic effects of no interest), we examined whether the generated trajectories carried task-specific information.For this purpose, we assessed how well we could classify the three experimental tasks (which demanded distinct cognitive processes) via linear discriminant analysis (LDA) based on the generated (through the prior model) latent state trajectories.
(We exclusively focused on classifying task phases, as these were pseudo-randomized across subjects, while 'resting' and 'instruction' phases occurred at fixed times, and we wanted to prevent significant classification differences which may occur either due to a fixed temporal order, or due to differences in presentation of experimental inputs during resting/instruction vs. proper task phases.) Figure 4E shows the relative classification error obtained when classifying the three tasks by the generated trajectories (solid colored lines) as compared to that from the directly inferred trajectories (dashed colored lines), and to bootstrap permutations of these trajectories (black solid and dashed lines).
Overall, for M>2 latent states, generated trajectories significantly reduced the relative classification error, even in the absence of any external stimulus information, suggesting that distinct cognitive processes were associated with distinct regions in the latent space, and that this cognitive aspect was captured by the PLRNN-BOLD-SSM prior model (see also Figure 4F for an example of a generated state space for a sample subject, and Figure 5).As observed for the ahead-prediction error above, performance improved with increasing latent state dimensionality.While adding dimensions will boost LDA classifications in general, as it becomes easier to find well separating linear discriminant surfaces in higher dimensions, we did not observe as strong a reduction in classification error for the permutation bootstraps, suggesting that at least part of the observed improvement was related to better reconstruction of the underlying dynamics.Of note, models which included external inputs enabled almost perfect classifications with as few as M=8 states.These results are not solely attributable to the model receiving external inputs, as these did not differentiate between cognitive tasks (i.e., number and type of inputs were the same for all tasks, see Methods sect.'Experimental paradigm').
Lastly, we observed that trained models in many cases produced interesting nonlinear dynamics, including stable limit cycles, chaotic attractors, and multi-stability between various attractor objects (Figure 6).This indicates that the fMRI data may indeed harbor interesting dynamical structure that one would not have been able to reveal with linear state space models like classical DCMs, at least not within the retrieved system of equations (as argued above, the inferred posterior (|) may still reflect this structure, but the model itself would not reproduce it).

Model specification and inference
The formulation of the state space model for BOLD time series (PLRNN-BOLD-SSM) is given in the Results section.To infer the parameters and latent variables of the model, we used Expectation-Maximization (EM) (Dempster et al., 1977;Durbin and Koopman, 2012).The EM algorithm maximizes a lower bound ℒ(, ) (also called the evidence lower bound, ELBO) of the log-likelihood log (|) given by (see Supplement sect.'PLRNN-BOLD-SSM model inference' for full details): (4) log (|) ≥ E  [log (, |)] + ((|)) = log (|) − ((|),   (|)) =: ℒ(, ), with (|) some proposal density over latent states, and ((|), (|)) the Kullback-Leibler divergence between proposal density (|) and true posterior (|).This expression can be derived by, e.g., using Jensen's inequality (e.g.Roweis and Ghahramani, 2000).From this we see that the bound becomes exact when proposal density (|) exactly matches the true posterior density (|) (defined through the latent state model here) which we aim to determine in the E-step (in contrast to variational inference where we assume (|) to come from some parameterized family of density functions, in EM we usually try to compute [linear case] or approximate (|) directly).

State estimation (E-Step).
In the E-step we seek  * ≔ arg max  ℒ( * , ) given a current parameter estimate  * .Since  * is assumed to be given, this amounts to minimizing the Kullback-Leibler divergence ((|), (|)).The common procedure for linear-Gaussian models (e.g., Kalman filter-smoother; Kalman, 1960;Rauch et al., 1965) is equating (|) = (|), and then determining the first two moments of the latter for performing the M-step.For the present model (|) is a highdimensional mixture of piecewise Gaussians for which 'explicit' integration (i.e., using tabulated Gaussian integrals) becomes unfeasible for large T and M. Typically, however, the piecewise Gaussians will have centers close to the origin (Figure S1; cf.Durstewitz, 2017b), and hence we resort to solving for the maximum a-posteriori (MAP) estimate of (|), expected to be close to E [𝐙|𝐗] (which is exactly so for a single Gaussian), and instantiate the state covariance matrix with the negative inverse Hessian around this maximizer (e.g.Smith & Brown 2003).Essentially, this is a global Gaussian approximation, or a Laplace approximation of the log-likelihood where we approximate log (|) ≈ log   (| max ) + log   ( max ) − 1 2 log|− max | + .using the maximizer  max of log   (, ) (note that the Hessian  max is constant around the maximizer) (Koyama et al., 2010;Paninski et al., 2010).
Taking this approach, letting Ω() ⊆ {1 … } refer to the set of all indices of units for which  , ≤ 0 and  Ω() to the matrix  that has all columns corresponding to indices in Ω() set to 0, the optimization objective in the E-Step may be formulated as: (5 Let us concatenate all state variables across m and t into one long column vector  = ( 11 , … ,  1 , … ,  1 , … ,   ) T ∈ ℝ  , and likewise arrange all matrices ,  Ω() , and so on, into large x block tri-diagonal matrices, and let us further collect all terms quadratic in z, linear in z, or constant (see Supplement for exact composition of these matrices).Defining  as the HRF convolution matrix,  Ω ≔ (Ι( 11 > 0), Ι( 21 > 0), … , Ι(  > 0)) T as an indicator vector with a 1 for all states  , > 0 and zeros otherwise, and  Ω ≔ ( Ω ) as the diagonal matrix formed from this vector, one can rewrite the optimization criterion eq.( 5) compactly as which is a piecewise quadratic function in z with solution vectors provided this solution is consistent with the current set Ω, i.e. is a true solution of eq. ( 6).For solving this set of piecewise linear equations, we use a simple Newton-type iteration scheme, similar to the one suggested in (Brugnano and Casulli, 2008), where we iterate between (1) solving eq. ( 6) for fixed  Ω and (2) flipping the bits in  Ω inconsistent with the obtained solution to eq. ( 6), until convergence.
Care is taken to avoid getting trapped in cyclic behavior, and a quadratic programming step may be added at the end to obtain the maximum given a fixed index set Ω (which seemed rarely necessary from our experience; see Durstewitz, 2017b for details).
These state covariance estimates are then used to compute, mostly analytically, the expectations

Parameter estimation (M-Step).
In the M-step, parameters are updated by seeking  * ≔ arg max  ℒ(,  * ) given  * from the E-step (since  * is assumed fixed and known in the E-step, note that the entropy over  becomes a constant in eq. ( 4) and drops out from the maximization).This boils down to a simple linear regression problem given that the ReLU nonlinearities have been resolved we can equivalently express the solution as ,  =  1:,1: ,  =  1:,+1:+ .
With these definitions, differentiating eq. ( 7) w.r.t Γ yields where  denotes an NxN identity matrix.Solutions for the latent state parameters   are given in the supplement.E-and M-steps are then iterated until convergence of the expected joint log-likelihood.

Stepwise model training procedure
We introduce here an efficient approach for pushing the latent model to capture the underlying DS that generated the observations.Our approach rests on a step-wise procedure in which we gradually increase the importance of fitting the latent state dynamics as compared to fitting the observations.
Since the latent state process and the observation process account for additive terms in the joint loglikelihood (eq.( 5)), the tradeoff between fitting the dynamics and fitting the observations is regulated by the ratio of the two covariance matrices  and  (eqn.( 1)-(3)( 5)).Hence, the idea of our training scheme is to begin with fitting the observation model and putting milder constraints on the latent process, using a linear latent model for initialization in a first step (or even factor analysis which places no constraints on the temporal relationship among latent states; cf.Roweis and Ghahramani, 2000), and then gradually decreasing ": " during training to enforce the temporal consistency of the latent model.Furthermore, one may force all burden of fitting the observations completely onto the latent model by fixing   from some step onwards.The complete training protocol is outlined in Algorithm-1.

Reconstruction of benchmark dynamical systems
We evaluated the performance of our PLRNN-SSM approach on two popular benchmark DS, the Lorenz equations and the van der Pol nonlinear oscillator (vdP).Within some parameter range, the 3dimensional Lorenz system exhibits a chaotic attractor and the 2-dimensional vdP-system exhibits a limit cycle (see Figure 3 for parameter settings used, system equations, and sample trajectories of the systems).We were interested in solutions where the true system dynamics is not just reflected in the directly inferred posterior distribution (|) over the PLRNN states { 1: } given the actual observations { 1: }, but also in the model's generative or prior distribution (), i.e. whether the once estimated PLRNN when run on its own would produce similar trajectories with the same dynamical properties as the ground truth system.
For evaluation, n=100 samples of (standardized) trajectories of length T=1000 were drawn from the ground truth systems using Runge-Kutta numerical integration and random initial conditions.PLRNN-SSMs were trained on these sample sets as described above for M={8, 10, 12, 14} latent states, using eq.( 2) for the observations (see also Figure 1).To probe our stepwise training protocol (Algorithm-1), PLRNN-SSM training under this protocol (termed 'PLRNN-SSM-anneal') was compared to simple EM training of the PLRNN-SSM started from random initializations of parameters (termed 'PLRNN-SSMrandom'; essentially just step 1 of Algorithm-1 with  directly fixed to 10 −3 ).
To quantify how well the true system dynamics was captured by the 'free-running' PLRNN (after training, but unconstrained by the observations), we used the Kullback-Leibler divergence defined across state space, i.e. integrating across space, not across time.Similar in spirit to the criteria defined for the classical delay embedding theorems (Kantz and Schreiber, 2004;Sauer et al., 1991;Takens, 1981), our measure therefore assessed the agreement between the original and reconstructed attractor geometries.Integrating across time (i.e., computing divergence between time series) is problematic for nonlinear DS, since two time series from the very same chaotic DS usually cannot be 0) Draw initial parameter estimates  (0) ~() from some suitable prior, constraint to max |eig(  + )| < 1 for biasing toward stable models (see also Macke et al., 2015).
expected to overlap very well with even miniscule differences in initial conditions (cf.Wood, 2010).For the ground truth benchmark systems, for which we have access to the true distribution   () and the complete state space, this KL divergence can be computed directly in observation space and was defined as where the integration is performed across -space, and   (|) is the distribution across observations generated from PLRNN simulations (i.e., after PLRNN-SSM training, but discarding the original set of time series observations   = { 1: } used for training).Hence, this measure assesses whether PLRNN-SSM-simulated trajectories in the limit fill the same volume of state space as the true DS trajectories, and in this sense whether the systems' attractor objects are topologically and geometrically 'equivalent'.(As a terminological remark, in the machine learning literature   (|) is often called the 'generative' or 'decoding' model, while (|) or (|) is sometimes referred to as the 'encoder' or 'recognition' model [e.g. Kingma and Welling, 2013;Rezende et al., 2014].Here we will, more generally, refer with   () to the (prior) distribution of latent states generated by the PLRNN independent of the training observations   = { 1: }, and with   (|) to the distribution of simulated observations produced from samples   ~ () according to the observation model [eq.
Lastly, to obtain an interpretable measure between 0 and 1, we normalized the KL divergence (termed  ̃) by dividing it by the expected maximum deviation. ̃ and the expected joint log-likelihood were compared between PLRNN-SSM-anneal and PLRNN-SSM-random via independent t-tests.For these analyses, all unstable system estimates were removed (10.3%).Furthermore, strong outliers with joint log-likelihood values < -1000 (which occurred only for PLRNN-SSM-random in 3.8% of cases) were removed.
A standard measure of chaoticity in nonlinear DS is the maximal Lyapunov exponent (Strogatz et al., 1994).We thus also assessed how well our KL measure correlated with the deviation in Lyapunov exponents between true and estimated systems.The Lyapunov exponent was assessed numerically by a linear regression fit to the initial slope of the log-Euclidean distance log  Δ ( (1) ,  (2) ) between initially close ( 0 < 10 −10 ) trajectories  (1) and  (2) as a function of time lag Δ, up to the point in the curve where a plateau indicating the full extent of the attractor object has been reached.For the van der Pol nonlinear (non-chaotic) oscillator, the agreement in the power spectra between the true and generated systems is more informative as a measure of how well the system dynamics has been captured (the maximum Lyapunov exponent for a stable limit cycle is 0), which was simply assessed by the average Pearson correlation.

Reconstruction of dynamical systems from experimental data
Experimental paradigm.The experimental paradigm assessed three cognitive tasks, two working memory (WM) n-back tasks -the continuous delayed response task (CDRT), and the continuous matching task (CMT) -and a choice reaction task (CRT), which served as 0-back control task.In all tasks, subjects were presented with a sequence of stimuli, and they had to respond to each stimulus (a triangle or a square) according to the task instruction.While in the CDRT participants were asked to indicate which stimulus was presented last, the CMT required participants to compare the current to the last stimulus and indicate whether they were the same or different (Gevins et al., 1990).In the CRT, participants had to simply indicate the current stimulus, and WM was not required.The paradigm is known to robustly activate the WM network.Each task was preceded by a resting period and an instruction phase.Tasks only differed w.r.t. the instruction phase, otherwise participants were faced with the same stimulus sequence, presented on a central screen at variable inter-stimulus intervals.
Data acquisition and preprocessing.Exact details on fMRI acquisition preprocessing, as well as information on the sample and consent of study participation can be found in (Koppe et al., 2014).In brief, 26 healthy subjects participated in the study, undergoing the experimental paradigm in a 1.5 GE Scanner.From these data, we chose to preselect voxel time series known to be relevant to the n-back task, as identified by a previous meta-analysis (Owen et al., 2005).This included the following Brodmann areas (BA): BA6 (supplementary motor), BA32 (anterior cingulate), BA46, BA9 (dorsolateral prefrontal cortices), BA45, BA47 (ventromedial prefrontal cortices), BA10 (orbitofrontal cortex), BA7, BA40 (parietal cortices), as well as the medial cerebellum.From each of these areas we extracted the first principle component.Given 10 bilateral regions, this amounted to extracting 20 voxel time series from each participant.Time series were mean centered, and mildly temporally smoothed by convolution with a Gaussian filter (σ 2 =1).
For each individual, the 20 extracted time series were entered as experimental observations  along with 6 nuisance predictors  (related to movement vectors obtained from the SPM realignment preprocessing procedure; Koppe et al., 2014) to the PLRNN-SSM inference procedure.Models were estimated both including and excluding experimental inputs.For the inclusion condition, experimental inputs  were defined as binary 'design' vectors of length K=5.The first two entries contained 1's for the presentation of the two stimulus types ('triangle' or 'square'), and the last 3 entries indicated by 1's the instruction phases of the three tasks; all other entries were set to 0. Note that during the actual task phases (following the instruction phases) the inference algorithm therefore (like the real subjects) received only information about the presented stimuli but not about the task phase itself.Models were estimated with  2 regularization and regularization factor λ=50.
Reconstruction measures.In the case of experimental data, in which the ground truth DS is not known, we do not have access to the data generating distribution   (), nor to the complete state space in general.We therefore used as a proxy for eq.( 9) the Kullback-Leibler divergence between the distribution over latent states obtained by sampling from the data-unconstrained prior   () and the data-constrained (i.e., inferred) posterior distribution   (|), arguing that the former should match closely with the latter if the actually observed  represent the underlying DS well (see Results section; also note that the -space is always complete by model definition, at least in the autonomous case).We again take the KL divergence across the system's state space (not time): (10)   (  (|),   ()) = ∫   (|) log   (|) () ∈ℝ  d .
To evaluate this integral, sampling from  (|), however, is difficult because of the known degeneracy problems with particle filters or other numerical samplers in high dimensions (Bengtsson et al., 2008;Li et al., 2014a).We therefore approximated both   (|) and   () as Gaussian mixtures across trajectory times, i.e. with   (|) ≈ , which is reasonable given that the PLRNN distribution is a mixture of piecewise Gaussians (see above).Just as in eqns.( 8) and ( 9) above, probabilities are therefore evaluated in space across all time points.The mean and covariance of (  | 1: ) and (  | −1 ) were obtained by marginalizing over the multivariate distributions (|) and   (), respectively, yielding E[  | 1: ], E[  | −1 ], and covariance matrices Var(  | 1: ) and Var(  | −1 ).Note that the covariance matrix of (|) was reestimated at the end of the full training procedure with the process noise matrix  set to the identity (i.e., to the last value it had before  was fixed qua Algorithm-1).Diagonal elements of the covariance matrix of (|) were further restricted to a minimum value of 1 (some lower bound on the variance turned out to be necessary to make   well defined almost everywhere).
For high-dimensional latent spaces, (asymptotically unbiased) approximation through MC sampling becomes computationally inefficient or unfeasible.For these cases, Hershey and Olson (2007) suggest a variational approximation to the integral in eq. ( 10) which we found to be in almost exact agreement with the results obtained through MC sampling: , where the terms in the exponentials refer to KL divergences between pairs of Gaussians, for which an analytical expression exists.
We normalized this measure by dividing by the KL divergence between   (|) and a reference distribution   () which was simply given by the temporal average across state expectations and variances along trajectories of the prior   () (i.e., by one big Gaussian in an, on average, similar location as the Gaussian mixture components, but eliminating information about spatial trajectory flows).(Note that we may rewrite the evidence lower bound as ℒ(, ) = E  [log (|)] − ((|), ()) with ((|), ()) ≈ ((|), ()), which has a similar form as eq.( 10) above, but computes the divergence across trajectories (time), not across space.)

Discussion
Theories about neural computation and information processing are often formulated in terms of nonlinear DS models, i.e. in terms of attractor states, transitions among these, or transient dynamics still under the influence of attractors or other salient geometrical properties of the state space (Hopfield, 1982;Rabinovich et al., 2008a;Wang, 2002).Given the success of DS theory in neuroscience, and the recent surge in interest in reconstructing trajectory flows and state spaces from experimental recordings (Churchland et al., 2007;Lapish et al., 2015;Laurent et al., 2001;Mante et al., 2013;Nichols et al., 2017), methodological tools which would return not only state space representations, but actually a model of the governing equations, would be of great benefit.Here we suggested a novel algorithm within an SSM framework that specifically forces the latent model, represented by a PLRNN, to capture the underlying dynamics in its intrinsic behavior, such that it can produce on its own time series of 'fake observations' that closely match the real ones.We also evaluated a measure, the KL divergence defined across state space (not time) between the inferred (posterior) and intrinsically generated (prior) distribution of latent states, which would give us a quantitative sense of how well the underlying DS has been captured in empirical situations where no ground truth is available.Finally, given that fMRI is the most common non-invasive technique to study human cognition in health and psychiatric illness, we derived a new observation model specifically for fMRI data that takes the HRF into account, and demonstrate that our approach recovers nonlinear dynamics and task-related trajectory flows from human fMRI recordings in a working memory paradigm.This, to our knowledge, has not been shown before.

Comparison to other approaches for identifying dynamical systems
The 'classical' technique for reconstructing attractor dynamics from experimental time series is delay embedding, based on the delay embedding theorems by Takens (1981) and Sauer et al. (1991).It has been used to disentangle task-related trajectory flows and attractor-like properties in experimentally assessed neuronal time series (Balaguer-Ballester et al., 2011;Lapish et al., 2015).However, as a completely non-parametric technique, delay embedding will not give a complete picture of the system's flow field, nor access to the governing equations.Linear dynamical systems, coupled to Gaussian or Poisson observation equations (Macke et al., 2015;Smith and Brown, 2003), and related approaches like GPFA (Yu et al., 2009), are quite popular in neurophysiology for obtaining smoothed trajectories and state spaces, but -due to their linear latent dynamics -are inherently unsuitable for reconstructing the underlying DS itself in most cases (as explained above, they may still yield a good approximation to the posterior (|), thus still useful, but they would fail to capture the generative dynamics itself).
To our knowledge, Roweis and Ghahramani (2000), and somewhat later Yu et al. (2005), were among the first to suggest an RNN for the latent model in order to reconstruct dynamics.These earlier contributions still focused more on in the inferred space (|), rather than on the fully generative capabilities of their models (at least were these not systematically analyzed), perhaps partly due to the fact that numerically less stable and efficient inference methods like the extended Kalman filter were employed at the time.Very recent work by Zhao and Park (2018) built on the radial basis function networks suggested by Roweis and Ghahramani (2000) for the latent model, and combined it with variational inference.They showed ahead predictions of their model for up to 1000 time steps.
Similarly, Pandarinath et al. (2018) recently proposed a sequential variational auto-encoder framework for inferring dynamics from neural recordings (although here as well the focus was more on the posterior encoding in the latent states, and on inference of initial conditions and perturbations).Both these models, however, are fairly complex and not directly interpretable in neural terms, and, moreover, hard to analyze with respect to their intrinsic dynamics.
The PLRNN framework offers several distinct advantages compared to other approaches: The equations have a fairly direct neural interpretation (Durstewitz, 2017b), in fact have the general form of neural rate equations that have been used to model various neural and cognitive phenomena (Song et al., 2016), and -due to their piecewise-linear structure -can also be easily translated into an equivalent continuous-time neural rate model (see also Ozaki, 2012).Dynamical phenomena can be analyzed more easily in PLRNNs than in other frameworks, e.g.fixed points and their stability can be determined analytically (Durstewitz, 2017b).Furthermore, ReLU-type activation functions appear to be a quite good approximation to the I/O-functions of many neocortical cell types (Hertäg et al., 2014;Hertäg et al., 2012), and, besides, are almost the default now in deep networks due to their favorable properties in optimization (Goodfellow et al., 2016), a feature our iterative state inference algorithm exploits as well.Finally, in contrast to most previous approaches, here we demonstrated that the prior PLRNN model on its own, after training, can produce the same attractor dynamics in state space as the true DS.
In the physics literature, several other methods based on reservoir computing (Pathak et al., 2017), RNNs formed from feedforward networks trained directly on the flow field (Trischler and D'Eleuterio, 2016; see also Funahashi and Nakamura, 1993), or LASSO regression combined with polynomial basis expansions (Brunton et al., 2016), have recently been discussed for identifying DS.Process noise is usually not included in these models, i.e. the latent dynamics is deterministic, which entails the risk that noise in the process is wrongly attributed to deterministic aspects of the dynamics.While some of these methods required hundreds of hidden states and millions of samples to reconstruct the van der Pol or Lorenz attractors (Trischler and D'Eleuterio, 2016), we found that as few as just eight latent states and a single time series of length 1000, within the range of typical fMRI data, can be sufficient for the PLRNN-SSM to rebuild the chaotic Lorenz attractor, another tremendous advantage in empirical settings.

Applications in fMRI research and beyond
In this contribution, we have derived a new observation model for fMRI that accounts for the HRF filtering of the BOLD signal.The HRF implies that current observations do not depend only on the system's current state (the common assumption in SSMs), but on a sequence of previous states, a situation handled relatively seamlessly by our PLRNN-SSM inference algorithm.fMRI is still the most common recording technique for monitoring brain function during cognitive and emotional processing in healthy and psychiatric subjects.Huge data bases have been compiled in large cohort studies over the past decade or so (e.g., the German National Cohort Study initiated by the Helmholtz association: https://www.helmholtz.de/en/research_infrastructures/national_cohort_study/;see also Collins and Varmus (2015)) as a reference for monitoring and assessing neurological and psychiatric dysfunction.
Although other noninvasive recording techniques with finer temporal resolution, like MEG/ EEG, may be more suitable for addressing questions about the DS basis of cognition, clinical research cannot afford to discard this large body of medically relevant data.
On the other hand, important hypotheses about the neural underpinnings of psychiatric conditions like schizophrenia, attention deficit hyperactivity disorder, or depression, have been formulated in terms of altered system dynamics (see Durstewitz et al., 2018 for a recent review).For instance, based on physiological single unit and synapse data combined with biophysical network models on dopamine modulation in prefrontal cortex, it has been suggested that a dysregulated dopamine system by overly 'deepening' cortical attractor landscapes may inhibit transitions among states, and thereby cause some of the (cognitive) symptoms in schizophrenia (Durstewitz and Seamans, 2008).This proposal has been supported by a number of neurophysiological and neuropsychological observations (e.g.Armbruster et al., 2012;Lapish et al., 2015), but a direct experimental evaluation of the specific changes in attractor basins in schizophrenia is still lacking.Tools like the one proposed here could be applied to directly test these types of hypotheses in human subjects recorded with fMRI.More generally, an extensive literature suggests that dynamical properties assessed from fMRI predict psychopathological conditions (e.g.Damaraju et al., 2014;Li et al., 2014b;Rashid et al., 2014), where the methodological framework proposed here could help to better understand the underlying dynamics and define targets for intervention (e.g. in the context of neurofeedback).
Beyond fMRI, most neuroimaging techniques, including, e.g., calcium imaging (Smetters et al., 1999) or imaging by voltage-sensitive dyes (Shoham et al., 1999) in neural tissue, involve some form of filtering that has to be taken into account when the goal is to capture underlying dynamical processes (like neural interactions) that evolve at a faster time scale.The present paper establishes, more generally, a framework for inferring nonlinear dynamics in situations where the measurement technique involves low-or band-pass-filtering of the process of interest.

Open issues and outlook
There is room for improvement in both our training algorithm and the measures used to evaluate its success in empirical situations.Our stepwise training algorithm was devised based on an intuitive heuristic, namely that by shifting the workload for fitting the observations onto the latent model and gradually increasing the requirements for its temporal consistency, a better representation of the unobserved system dynamics could be achieved.We could show that this was indeed the case when compared to a bootstrap (random) sample of models trained in the 'standard' way, and that our procedure seemed to work in general, but a more systematic theoretical derivation and testing of alternative schemes and explicitly designed optimization criteria (directly utilizing eq. ( 10)) would certainly be desirable in future work.
We also find it important that in testing the performance of different reconstruction algorithms not only 'good examples' that prove the basic concept ('my algorithm works') are shown, but a more thorough quantitative statistical evaluation of precisely how well it performed in what percentage of cases is provided, like the one attempted here (Figure 3,4).For applications to empirical data, for which we do not know the ground truth, an open issue is how we could best quantify how much confidence we could have in the reconstructed stochastic equations of motion.Cross-validation and out-of-sample prediction errors provide a guidance, but for DS it is less clear in terms of what these should be measured: It is known that for nonlinear systems with complex or chaotic dynamics standard squarederror or likelihood-based measures evaluated along time series are not too useful (e.g.Wood, 2010), since miniscule differences in initial conditions or noise perturbations may cause quick decorrelation of trajectories even if they come from the very same DS.We therefore decided to compare true and simulated data in terms of probability distributions across state space, arguing that if the observations come from the same attractor or system dynamics they should fill roughly the same volume of state space -this is more along the lines of a DS view which compares dynamical objects in terms of their geometrical or topological equivalence in state space (Kantz and Schreiber, 2004;Sauer et al., 1991;Sugihara et al., 2012;Takens, 1981), rather than the literal overlap among time series.Another corollary of this view is that to establish the equivalence between two DS, it is neither sufficient nor potentially even useful to predict observations just a couple of time steps ahead: In a chaotic noisy system, the prediction horizon is inherently limited to begin with (because of exponential divergence of trajectories), and one has to demonstrate that the 'general type' of long-term behavior in the limit is the same (e.g. a limit cycle of a certain periodicity and order) to claim that two DS are equivalent.We therefore suggested to evaluate performance in terms of completely newly generated ('faked') trajectories that the trained system produces when no longer guided by the actual observations (i.e., the prior   () rather than the posterior   (|)).
Especially in fMRI, however, the data space is often very high-dimensional (>10³) while at the same time often only a single time series sample of limited length (T≤1000) is available, i.e. the -space is very sparse.In these cases we cannot obtain a good approximation of the distribution (), as we could for the benchmarks, and hence our original measure is not directly applicable.Hence we reverted to performing the comparison in latent space, between two distributions we do have in principle available, the one constrained by the observations,   (|), and the other,   (), obtained from the completely freely running (simulated) system.We argued that if our actual observations  reflect the true dynamics well, then states obtained under   (|) should be highly likely a priori, i.e.
under   (), and hence these distributions should highly overlap.As direct sampling from   (|) is difficult and time-consuming, due to degeneracy problems, and the latent space dimensionality may also be prohibitively high, we approximated it by a mixture of Gaussians, which is a reasonable assumption for our ReLU-based RNN model and allows for an efficient analytical approximation to   (Hershey and Olsen, 2007).More generally, if we are only interested in dynamical equivalence, we may also want to accept translations, rotations, rescaling, and potentially other deformations of the true state space that do not change topological equivalence (Sauer et al., 1991;Takens, 1981).
Procrustes analysis (Krzanowski, 2000) could be performed to (partly) allow for such transformations (on the other hand, since   () and   (|) come from the same underlying model, in our specific case such transformations may neither be necessary, nor necessarily desired).3), main manuscript), the expected joint log-likelihood is given by depends on the temporal resolution of the time series, i.e. on the length n of the HRF vector.

State estimation (E-Step).
We assume that (|), like a Gaussian, could be specified by its first two moments, i.e. the mean and the covariance, and that, as for a Gaussian, the MAP estimator is a reasonably good approximation to the mean.Thus we aim to maximize the log-joint distribution over  and , log (, |) (as defined by Q Ω * () in the main manuscript) w.r.t. (see also Koyama et al., 2010;Paninski et al., 2010).
We have reformulated the optimization criterion in eq.( 6) (main manuscript) in 'big-matrix form', defining the set of 'active' states (i.e.,   > 0) through the binary vector  Ω ≔ Ι( > 0).The matrix  ∈ ℝ x in eq. ( 6) is a convolution matrix which for M=1 contains the elements of the HRF in the following form For M>1, we would add additional columns inserting M-1 zeros between each element of , and add additional rows by duplicating each row M-1 times, and shifting it by 1...M-1 positions, respectively.Below we will also give the full structure of the block-banded matrices ∈ ℝ x that occur in eq.( 6), restated here for convenience:   where k(j) is a set of row indices which pick out those rows in   that correspond to those parameters actually defined, i.e. which squeeze out all zeros from the j'th row of , and accordingly from the respective rows/columns on the r.h.s. of eq. ( 2).
Let us define the following expectation sums:

𝑇 𝑡=2
With this, the vector and matrix in eq. ( 2) can be written as , from which we select the columns k(j) from eq. ( 3), and the rows and columns k(j) out of eq. ( 4) to produce the solution for the j-th row in eq. ( 2).
Finally, the estimate for the initial condition is given by Reconstruction of benchmark dynamical systems.bin sizes are nearly monotonically related such that rank information on the quality of DS retrieval is conserved.However, the   spread is largest for Δx=1 such that qualitative differences in DS retrieval are differentiated more easily for this bin size, and hence this bin size was chosen for the evaluation in the main manuscript.

Figure 1 :
Figure 1: Analysis pipeline.Top: Analysis pipeline for simulated data.From the two benchmark systems (vdP and Lorenz systems), noisy trajectories were drawn and handed over to the PLRNN-SSM inference algorithm.With the inferred model parameters, completely new trajectories were generated and compared to the state space distribution over true trajectories via the Kullback-Leibler divergence   (see eq. (9)).Bottom: analysis pipeline for experimental data.We used preprocessed fMRI decreased (Figure2C) over the distinct training steps of the PLRNN-SSM-anneal protocol, indicating that each step further enhances the solution quality. ̃ and normalized log-likelihood were, however, only moderately correlated (r=-27, p<.001), as expected based on the formal considerations above (sect.'Stepwise initialization and training protocol').

Figure 2 :
Figure 2: Evaluation of stepwise training protocol on chaotic Lorenz attractor.A. Relative frequency of normalized KL divergences evaluated on the observation space ( ̃) after running the EM algorithm with the PLRNN-SSM-anneal (blue) and PLRNN-SSM-random (red) protocols on 100 distinct trajectories drawn from the Lorenz system (with T=1000, and M=8, 10, 12, 14).B. Same as A for normalized expected joint log-likelihood E (|) [log (, |)] (see Supplement eq.(1)).C. Decrease in   over the distinct training steps of 'PLRNN-SSM-anneal' (see Algorithm-1; the first step refers to a LDS initialization and was removed).D. Increase in (rescaled) expected joint log-likelihood across training steps 2-31-3 in 'PLRNN-SSM-anneal'.Since the protocol partly works by systematically scaling down Σ, for comparability the log-likelihood after each step was recomputed (rescaled) by setting Σ to the identity matrix.E. Representative example of joint log-likelihood increase during the EM iterations of the individual training steps 2-31-3 for a single Lorenz trajectory.Unstable system estimates (28% of all estimates) were removed from all figures.

Figure 3 :
Figure 3: Evaluation of training protocol and KL measure on dynamical systems benchmarks.A. True trajectory from chaotic Lorenz attractor (with parameters s=10, r=28, b=8/3).B. Distribution of  ̃ (eq.(9)) across all samples, binned at .05.For T=1000, around 26% of these samples (grey shaded area, pooled across different numbers of latent states M) captured the butterfly structure of the Lorenz attractor well (see also D). C. Estimated Lyapunov exponents for reconstructed Lorenz systems (estimated exponent for true Lorenz system .9, red line).A significant positive correlation between the absolute deviation in Lyapunov exponents for true and reconstructed systems with the  ̃ (r=.27, p<.001) further supports that the latter measures salient aspects of the nonlinear dynamics.D. Samples of PLRNN-generated trajectories for different  ̃ values.The grey shaded area indicates successful estimates.E. True van der Pol system trajectories (with μ=2 and ω=1).F. Same as in B but

Figure 4 :
Figure 4: Model evaluation on experimental data.A. Association between KL divergence measures on observation (  ) vs. latent space (  ) for the Lorenz system; y-axis displayed in log-scale.B. Association between  ̃ (eq.(11); in log scale) and correlation between generated and inferred state series for models with inputs (top, displayed in shades of blue for M=2…10), and models without inputs (bottom, displayed in shades of red for M=2…10).C. Distributions of  ̃ (y-axis) in an experimental sample of n=26 subjects for different latent state dimensions (x-axis), for models including (top) or excluding (bottom) external inputs.D. Mean squared error (MSE) between generated (not inferred) and true observations (y-axis) as a function of aheadprediction step (x-axis).E. Relative LDA classification error for different task phases based on the freely generated states (solid lines), the directly inferred states (dashed lines), and random permutations of states (black lines) for models including (blue) or excluding (red) stimulus inputs.Except for M=2, the classification error based on generated states, drawn from the prior model

Figure 5 :
Figure 5: Exemplary DS reconstruction in a sample subject.A. Top: Latent trajectories generated by the prior model projected down to the first 3 principle components for visualization purposes in a model including external inputs and M=6 latent states.Task separation is clearly visible in the generated state space (color-coded as in the legend), i.e. different cognitive demands are associated with different regions of state space (hard step-like changes in state are caused by the external inputs).Bottom: Observed time series (black) and their predictions based on the generated trajectories (red, with 90% CI in grey) for the same subject.B. Top: Same as A. for the same subject in a PLRNN without external inputs.Bottom: Same as B. for the same subject in a PLRNN without external inputs.*BA= Brodmann area, Le/Re=left/right, CRT= choice reaction task, CDRT=continuous delayed response task, CMT=continuous matching task.

Figure 6 :
Figure 6: Examples of highly nonlinear phenomena extracted from fMRI data (in systems with M=10 states, no external inputs).A. PLRNN-BOLD-SSM with 3 stable limit cycles (LC) estimated from one subject (top: subspace of state space for 3 selected states; bottom: time graphs).B. PLRNN with 2 stable limit cycles and one chaotic attractor, estimated from another subject.C. PLRNN with one stable limit cycle and one stable fixed point.D. Increase in average (log Euclidean) distance between initially infinitesimally close trajectories with time for chaotic attractor in B. (In A and B states diverging towards -∞ were removed, as by virtue of the ReLU transformation they would not affect the other states and hence overall dynamics).
within the expectations E[()], E[() T ], and E[()() T ], and hence criterion eq.(5) becomes simply quadratic.We can (analytically) solve for the parameters   of the observation model and   of the latent model separately.Because of the off-diagonal structure of , it is most efficient to obtain parameter solutions row-wise for the latent model parameters (i.e., separately for each state m=1…M), as spelled out in the supplement.For the observation model parameters, concatenating matrices  and  as  = [ ] ∈ ℝ x(+) , and concatenating convolved states and nuisance variables in   ∈ ℝ + , one can rewrite the observation equation term in (, ) ≔ E  [log (, |) ] as (7) Q  (  , ) = ∑ E[(ℎ *  : )(ℎ *  : ) T ]  =1

Figure S1 :
Figure S1: Likelihood landscape.Illustration of the model's likelihood landscape as a function of just two latent states z1 and z2.The likelihood consists of piecewise Gaussians which cut off at the zeros of the states; often they will cluster near the origin and give rise to a strongly elevated plateau of high-likelihood solutions, close to one full Gaussian.Red cross indicates MAP estimate.