Data-assisted reduced-order modeling of extreme events in complex dynamical systems

The prediction of extreme events, from avalanches and droughts to tsunamis and epidemics, depends on the formulation and analysis of relevant, complex dynamical systems. Such dynamical systems are characterized by high intrinsic dimensionality with extreme events having the form of rare transitions that are several standard deviations away from the mean. Such systems are not amenable to classical order-reduction methods through projection of the governing equations due to the large intrinsic dimensionality of the underlying attractor as well as the complexity of the transient events. Alternatively, data-driven techniques aim to quantify the dynamics of specific, critical modes by utilizing data-streams and by expanding the dimensionality of the reduced-order model using delayed coordinates. In turn, these methods have major limitations in regions of the phase space with sparse data, which is the case for extreme events. In this work, we develop a novel hybrid framework that complements an imperfect reduced order model, with data-streams that are integrated though a recurrent neural network (RNN) architecture. The reduced order model has the form of projected equations into a low-dimensional subspace that still contains important dynamical information about the system and it is expanded by a long short-term memory (LSTM) regularization. The LSTM-RNN is trained by analyzing the mismatch between the imperfect model and the data-streams, projected to the reduced-order space. The data-driven model assists the imperfect model in regions where data is available, while for locations where data is sparse the imperfect model still provides a baseline for the prediction of the system state. We assess the developed framework on two challenging prototype systems exhibiting extreme events. We show that the blended approach has improved performance compared with methods that use either data streams or the imperfect model alone. Notably the improvement is more significant in regions associated with extreme events, where data is sparse.

May 2, 2018 1/22 The complete description of these system through the governing equations is often challenging either because it is very hard/expensive to solve these equations with an appropriate resolution or due to the magnitude of the model errors.The very large dimensionality of their attractor in combination with the occurrence of important transient, but rare events, makes the application of classical order-reduction methods a challenging task.Indeed, classical Galerkin projection methods encounter problems as the truncated degrees-of-freedom are often essential for the effective description of the system due to high underlying intrinsic dimensionality.On the other hand, purely data-driven, non-parametric methods such as delay embeddings [16][17][18][19][20][21], equation-free methods [22,23], Gaussian process regression based methods [24], or recurrent neural networks based approaches [25] may not perform well when it comes to rare events, since the training data-sets typically contain only a small number of the rare transient responses.The same limitations hold for data-driven, parametric methods [26][27][28][29], where the assumed analytical representations have parameters that are optimized so that the resulted model best fits the data.Although these methods perform well when the system operates within the main 'core' of the attractor, this may not be the case when rare and/or extreme events occur.
We propose a hybrid method for the formulation of a reduced-order model that combines an imperfect physical model with available data streams.The proposed framework is important for the non-parametric description, prediction and control of complex systems whose response is characterized by both i) high-dimensional attractors with broad energy spectrum distributed across multiple scales, and ii) strongly transient non-linear dynamics such as extreme events.
We focus on data-driven recurrent neural networks (RNN) with a long-short term memory (LSTM) [30] that represents some of the truncated degrees-of-freedom.The key concept of our work is the observation that while the imperfect model alone has limited descriptive and prediction skills (either because it has been obtained by a radical reduction or it is a coarse-grid solution of the original equations), it still contains important information especially for the instabilities of the system, assuming that the relevant modes are included in the truncation.However, these instabilities need to be combined with an accurate description of the nonlinear dynamics within the attractor and this part is captured in the present framework by the recurrent neural network.Note, that embedding theorems [31,32] make the additional memory of the RNN to represent dimensions of the system that have been truncated, a property that provides an additional advantage in the context of reduced-order modeling [20,25].
We note that such blended model-data approaches have been proposed previously in other contexts.In [33,34], a hybrid forecasting scheme based on reservoir computing in conjunction with knowledge-based models are successfully applied to example high-dimensional spatiotemporal chaotic systems.In [35][36][37] the linearized dynamics were projected to low-dimensional subspaces and were combined with additive noise and damping that were rigorously selected to represent the effects nonlinear energy fluxes from the truncated modes.The developed scheme resulted in reduced-order stochastic models that efficiently represented the second order statistics in the presence of arbitrary external excitation.In [38] a deep neural network architecture was developed to reconstruct the near-wall flow field in a turbulent channel flow using suitable wall only information.These nonlinear near-wall models can be integrated with flow solvers for the parsimonious modeling and control of turbulent flows [39][40][41][42].In [43] a framework was introduced wherein solutions from intermediate models, which capture some physical aspects of the problem, were incorporated as solution representations into machine learning tools to improve the predictions of the latter, minimizing the reliance on costly experimental measurements or high-resolution, high-fidelity numerical solutions.[44] design a stable adaptive control strategy using neural networks for May 2, 2018 2/22 physical systems for which the state dependence of the dynamics is reasonably well understood, but the exact functional form of this dependence, or part thereof, is not, such as underwater robotic vehicles and high performance aircraft.In [29,[45][46][47] neural nets are developed to simultaneously learn the solution of the model equations using data.In these works that only a small number of scalar parameters is utilized to represent unknown dynamics, while the emphasis is given primarily on the learning of the solution, which is represented through a deep neural network.In other words, it is assumed that a family of models that 'lives' in a low-dimensional parameter space can capture the correct response.Such a representation is not always available though.Here our goal is to apply such a philosophy on the prediction of complex systems characterized by high dimensionality and strongly transient dynamics.We demonstrate the developed strategy in prototype systems exhibiting extreme events and show that the hybrid strategy has important advantages compared with either purely data-driven methods or those relying on reduced-order models alone.

Materials and methods
We consider a nonlinear dynamical system with state variable u ∈ R d and dynamics given by where F : R d → R d is a deterministic, time-independent operator with linear and nonlinear parts L and h respectively.We are specifically interested in systems whose dynamics results in a non-trivial, globally attracting manifold S ⊂ R d to which trajectories quickly decay.The intrinsic dimension of S is presumably much less than d.
In traditional Galerkin-based reduced-order model [48] one typically uses an ansatz of the form where the columns of matrix Y = [y 1 , ..., y m ] form an orthonormal basis of Y , an m-dimensional subspace of R d , and the columns of Z = [z 1 , ..., z d−m ] make up an orthonormal basis for the orthogonal complement Z = R d \ Y ; ξ and η are the projection coordinates associated with Y and Z; b is an offset vector typically made equal to the attractor mean state.This linear expansion allows reduction to take place through special choices of subspace Y and Z, as well as their corresponding basis.For example, the well-known proper orthogonal decomposition (POD) derives the subspace empirically to be such that the manifold S preserves its variance as much as possible when projected to Y (or equivalently, minimizing the variance when projected to Z), given a fixed dimension constraint m.
We show that such a condition enables reduction, by substituting Eq (2) into Eq (1) and projecting onto Y and Z respectively to obtain two coupled systems of differential equations: If on average |η| |ξ|, we may make the approximation that η = 0, leading to a m-dimensional system (ideally m d) May 2, 2018 3/22 which can be integrated in time.This is known as the flat Galerkin method.The solution to Eq (1) is approximated by u ≈ Yξ + b.Using (4) as an approximation to Eq (1) is known to suffer from a number of problems.First, the dimension m of the reduction subspace Y may be too large for |η| ≈ 0 to hold true.Second, the subspace Z is derived merely based on statistical properties of the manifold without addressing the dynamics.This implies that even if η has small magnitude on average it may play a big role in the dynamics of the high-energy space (e.g.acting as buffers for energy transfer between modes [49]).Neglecting such dimensions in the description of the system may alter its dynamical behaviors and compromise the ability of the model to generate reliable forecasts.
An existing method that attempts to address the truncation effect of the η terms is the nonlinear Galerkin projection [48,50], which expresses η as a function of ξ: yielding a reduced system The problems boils down to finding Φ, often empirically.Unfortunately, Φ is well-defined only when the inertial manifold S is fully parametrized by dimensions of Y (see Fig 1), which is a difficult condition to achieve for most systems under a reasonable m.Even if the condition is met, how to systematically find Φ remains a big challenge.) is projected to 2D plane parametrized by (ξ 1 , ξ 2 ).Parametrization is assumed to be imperfect, i.e. out-of-plane coordinate η 1 cannot be uniquely determined from (ξ 1 , ξ 2 ).Flat Galerkin method always uses the dynamics corresponding to η 1 = 0. Nonlinear Galerkin method uses the dynamics corresponding to η 1 = Φ(ξ 1 , ξ 2 ) where Φ is determined by some prescribed criterion (e.g.minimization of L2 error).

Data-assisted reduced-order modeling
In this section we introduce a new framework for improving the reduced-space model that assists, with data streams, the nonlinear Galerkin method.Our main idea relies on building an additional data-driven model from data series observed in the reduction space to assist the equation-based model equations (4).We note that the exact dynamics of ξ can be written as where F ξ is defined in Eq (4) and G : R m × R d−m → R m encompasses the coupling between ξ and η.We will refer to ψ = G(ξ, η) as the complementary dynamics since it can be thought of as a correction that complements the flat Galerkin dynamics F ξ .
The key step of our framework is to establish a data-driven model Ĝ to approximate G: where ξ(t), ξ(t − τ ), ... are uniformly time-lagged states in ξ up to a reference initial condition.The use of delayed ξ states makes up for the fact that Y may not be a perfect parametrization subspace for S. The missing state information not directly accessible from within Y is instead inferred from these delayed ξ states and then used to compute ψ.This model form is motivated by the embedding theorems developed by Whitney [31] and Takens [32], who showed that the attractor of a deterministic, chaotic dynamical system can be fully embedded using delayed coordinates.
We use the long short-term memory (LSTM) [30], a regularization of recurrent neural network (RNN), as the fundamental building block for constructing Ĝ.The LSTM has been recently deployed successfully for the formulation of fully data-driven models for the prediction of complex dynamical systems [25].Here we employ the same strategy to model the complementary dynamics while we preserve the structure of the projected equations.LSTM takes advantage of the sequential nature of the time-delayed reduced space coordinates by processing the input in chronological order and keeping memory of the useful state information that complements ξ at each time step.An overview of the RNN model and the LSTM is given in S1 Appendix.
Building from LSTM units, we use two different architectures to learn the complementary dynamics from data.The first architecture reads a sequence of ξ states, i.e. states projected to the d−dimensional subspace and outputs the corresponding sequence of complementary dynamics.The second architecture reads an input sequence and integrate the output dynamics to predict future.The details of both architectures are described below.

Data series
Both architectures are trained and tested on the same data set consisting of N data series, where N is assumed to be large enough such that the low-order statistics of S are accurately represented.Each data series is a sequence of observed values in reduced space Y , with strictly increasing and evenly spaced observation times.Without loss of generality, we assume that all data series have the same length.Moreover, the observation time spacing τ is assumed to be small so that the true dynamics at each step of the series can be accurately estimated with finite difference.We remark that for single-step prediction (architecture I below) increasing τ (while keeping the number of steps constant) is beneficial for training as it reduces the correlation between successive inputs.However, for multi-step prediction (architecture II), large τ incurs integration errors which quickly outweigh the benefit of having decorrelated inputs.Hence, we require small τ in data.

Architecture I
We denote an input sequence of length-p as {ξ 1 , ..., ξ p } and the corresponding finite-difference interpolated dynamics as { ξ1 , ..., ξp }.A forward pass in the first architecture works as follows (illustrated in Fig 2I ).At time step i, input ξ i is fed into a LSTM cell with n LSTM hidden states, which computes its output h i based on the received input and its previous memory states (initialized to zero).The LSTM output is then passed through an intermediary fully-connected (FC) layer with n FC hidden states and rectified linear unit (ReLU) activations to the output layer at desired dimension m.
Here h i is expected to contain state information of the unobserved η at time step i, reconstructed effectively as a function of all previous observed states {ξ 1 , . . ., ξ i−1 }.The model output is a predicted sequence of complementary dynamics { ψ1 , . . ., ψp }.
Optionally, h i can be concatenated with LSTM input ξ i to make up the input to the FC layer.The concatenation is necessary when n LSTM is small relative to m.Under such conditions, the LSTM hidden states h i are more likely trained to represent η i alone, as opposed to (ξ i , η i ) combined.ξ i thus needs to be seen by the FC layer in order to have all elements necessary in order to estimate ψ.If the LSTM cell has sufficient room to integrate incoming input with memory (i.e.number of hidden units larger than the intrinsic dimensionality of the attractor), the concatenation may be safely ignored.
In the case of models with lower complexity for faster learning a small n LSTM is preferred.However, finding the minimum working n LSTM , which is expected to approach the intrinsic attractor dimension, is a non-trivial problem.Therefore, it is sometimes desirable to conservatively choose n LSTM .In this case the LSTM unit is likely to have sufficient cell capacity to integrate incoming input with memory, rendering the input concatenation step unnecessary.
The model is trained by minimizing a loss function with respect to the weights of the LSTM cell and FC layer.The loss function is defined as a weighted sum of the mean squared error (MSE) of the complementary dynamics: where ψ i = ξi − F ξ (ξ) is the true complementary dynamics at step i.Note that for this architecture it is equivalent to defining the loss based on MSE of the total dynamics.For weights w i we use a step profile: where w 0 1 is used to weight the first p t steps when the LSTM unit is still under the transient effects of the cell states being initialized to zero.Predictions made during this period is therefore valued much less.In practice, p t is usually negatively correlated with the parametrization power of the reduction subspace Y and can be determined empirically.For optimization we use the gradient-based Adam optimizer [51] (also described in S1 Appendix) with early stopping.The gradient is calculated for small batches of data series (batch size n batch ) and across the entire training data for n ep epochs.
A notable property of this model architecture is that input representing the reduced state is always accurate regardless of any errors made in predicting the dynamics previously.This is undesirable especially for chaotic systems where errors tend to grow Dashed arrows represent optional connections depending on the capacity of LSTM relative to the dimension of ξ.Both architectures share the same set of trainable weights.For architecture I, predictions are made as input is read; input is always accurate regardless of any prediction errors made in previous steps.This architecture is used only for training.Architecture II makes prediction in a sequence-to-sequence (setup sequence to prediction sequence) fashion.Errors made early do impact all predictions that follow.This architecture is used for fine-tuning weights and multi-step-ahead prediction.
exponentially.Ideally, the model should be optimized with respect to the cumulative effects of the prediction errors.To this end, this architecture is primarily used for pre-training and a second architecture is utilized for fine-tuning and multi-step-ahead prediction.

Architecture II
The second architecture bears resemblance to the sequence-to-sequence (seq2seq) models which have been widely employed for natural language processing tasks [52,53] For this architecture we define the loss function as This definition is based on MSE of the total dynamics so that the model learns to 'cooperate' with the projected dynamics F ξ .For weights we use an exponential profile: where 0 < γ < 1 is a pre-defined ratio of decay.This profile is designed to counteract the exponentially growing nature of the errors in a chaotic system and prevent exploding gradients.Similar to architecture I, training is performed in batches using the Adam algorithm.
Architecture II, in contrast with the architecture I, finishes reading the entire input sequence before producing the prediction sequence.For this reason it is suitable for running multi-step-ahead predictions.Both architectures, however, share the same set of trainable weights used to estimate the complementary dynamics.Hence, we can utilize architecture I as a pre-training facility for architecture II because it tends to have smaller gradients (as errors do not accumulate over time steps) and thus faster convergence.This idea is very similar to teacher forcing method used to accelerate training (see [54]).On the other hand, architecture II is much more sensitive to the weights.Gradients tend to be large and only small learning rates can be afforded.For more efficient training, it is therefore beneficial to use architecture I to find a set of weights that already work with reasonable precision and perform fine-tuning with architecture II.In addition, the p t parameter for architecture I also provides a baseline for the set-up stage length s to be used for architecture II.
Another feature of architecture II is that the length of its prediction stage can be arbitrary.Shorter length limits the extent to which errors can grow and renders the model easier to train.In practice we make sequential improvements to the model weights by progressively increasing the length p of the prediction stage.
For convenience, the hyperparameters involved in each architecture are summarized in Table 1.

Fully data-driven modeling
Both of the proposed architectures can be easily adapted for a fully data-driven modeling approach (see [25]): for architecture I the sequence of total dynamics Doing so changes the distribution of model targets and implicitly forces the model to learn more.For comparison, we examine the performance of this fully data-driven approach through the example applications in the following section.

A chaotic intermittent low-order atmospheric model
We consider a chaotic intermittent low-order atmospheric model, the truncated Charney-DeVore (CDV) equations, developed to model barotropic flow in a β-plane channel with orography.The model formulation used herein is attributed to [55,56], and employs a slightly different scaling and a more general zonal forcing profile than the original CDV.Systems dynamics are governed by the following ordinary differential equations: where the model coefficients are given by for m = 1, 2.Here we examine the system at a fixed set of parameters (x * 1 , x * 4 , C, β, γ, b) = (0.95, −0.76095, 0.1, 1.25, 0.2, 0.5), which is found to demonstrate chaotic intermittent transitions between zonal and blocked flow regime, caused by the combination of topographic and barotropic instabilities [55,56]  For reduction of the system we attempt the classic proper orthogonal decomposition (POD) whose details are described in S1 Appendix.The basis vectors of the projection subspace are calculated using the method of snapshots on a uniformly sampled time series of length 10,000 obtained by integrating Eq (13).The first five POD modes collectively account for 99.6% of the total energy.However, despite providing respectable short-term prediction accuracy, projecting the CDV system to its most energetic five modes completely changes the dynamical behavior and results in a single globally attracting fixed point instead of a strange attractor.The difference between exact and projected dynamics can be seen in terms of the two most energetic POD coefficients, ξ 1 , ξ 2 , in Fig 3C (left and middle subplots).
In the context of our framework, we construct a data-assisted reduced-order model that includes the dynamics given by the 5-mode POD projection.We set n LSTM = 1 (because one dimension is truncated) and n FC = 16.Input to the FC layer is a concatenation of LSTM output and reduced state because n LSTM = 1 is sufficient to represent the truncated mode.Data is obtained as 10,000 trajectories, each with p = 200 and τ = 0.01.We use 80%, 10%, 10% for training, validation and testing respectively.For this setup it proves sufficient, based on empirical evidence, to train the assisting data-driven model with Architecture I for 1000 epochs, using a batch size of 250.The trained weights are plugged in architecture II to generate sequential predictions.As we quantify next, it is observed that (a) the trajectories behave much like the 6-dimensional CDV system in the long term by forming a similar attractor, as May 2, 2018 10/22 shown in Fig 3C , and (b) the short-term prediction skill is boosted significantly.We quantify the improvement in prediction performance by using two error metricsroot mean squared error (RMSE) and correlation coefficient.For comparison we also include prediction errors when using a purely data-driven model based on LSTM.RMSE in ith reduced dimension is computed as where ξ (n) i (t l ) and ξ(n) i (t l ) represent the truth and prediction for the nth test trajectory at prediction lead time t l respectively.The results are plotted in Fig 4 .We end remark that the predictions obtained by the proposed data-assisted model are significantly better than the projected model, as well as than the purely data-driven approach.Low error levels are maintained by the present approach even when the other methods under consideration exhibit significant errors.
The anomaly correlation coefficient (ACC) [57] measures the correlation between anomalies of forecasts and those of the truth with respect to a reference level and is where ξ i is the reference level set to the observation average by default.ACC takes a maximum value of 1 if the variation pattern of the anomalies of forecast is perfectly coincident with that of truth and a minimum value of -1 if the pattern is completely reversed.Again, the proposed method is able to predict anomaly variation patterns which are almost perfectly correlated with the truth at very large lead times when the predictions made by the compared methods are mostly uncorrelated ( We emphasize that the presence of the equation-driven part contributes largely to the long-term stability (vs.purely data driven models) while the data-driven part serves to improve the short-term prediction accuracy.These two ingredients of the dynamics complement each other favorably in achieving great prediction performance.In addition, the data-assisted approach successfully produces a chaotic structure that is similar to the one observed in the full-dimensional system, a feat that cannot be replicated by either methods using equation or data alone.

Intermittent bursts of dissipation in Kolmogorov flow
We consider the two-dimensional incompressible Navier-Stokes equations where u = (u x , u y ) is the fluid velocity defined over the domain (x, y) ∈ Ω = [0, 2π] × [0, 2π] with periodic boundary conditions, ν = 1/Re is the non-dimensional viscosity equal to reciprocal of the Reynolds number and p denotes the pressure field over Ω.We consider the flow driven by the monochromatic Kolmogorov forcing f (x) = (f x , f y ) with f x = sin(k f y) and f y = 0. k f = (0, k f ) is the forcing wavenumber.
Following [58], the kinetic energy E, dissipation D and energy input I are defined as Due to spatial periodicity, it is natural to examine the velocity field in Fourier space.The divergence-free velocity field u admits the following Fourier series expansion: where k = (k 1 , k 2 ) is the wavenumber and a(k, t) = −a(−k, t) for u to be real-valued.
For notation clarity, we will not explicitly write out the dependence on t from here on.Substituting Eq (19) into the governing equations Eq (17) we obtain the evolution equations for a as (more details are presented in S1 Appendix) The first term suggests that any mode with wavenumber k is directly affected, in a nonlinear fashion, by pairs of modes with wavenumbers p and q such that k = p + q.A triplet of modes {p, q, k} satisfying this condition is referred to as a triad.It is worth noting that a mode which does not form a triad with mode k can still have an indirect effect on dynamics ȧ(k) through interacting with modes that do form a triad with k.
In [58], it is found that the most revealing triad interaction to observe, in the interest of predicting intermittent bursts in the energy input/dissipation, is amongst May 2, 2018 13/22 modes (0, k f ), (1, 0) and (1, k f ).Shortly prior to an intermittent event, mode (1, 0) transfers a large amount of energy to mode (0, k f ), leading to rapid growth in the energy input rate I and subsequently the dissipation rate D (see Fig 5A).However, projecting the velocity field and dynamics to this triad of modes and their complex conjugates fails to faithfully replicate the dynamical behaviors of the full system (the triad only accounts for 59% of the total energy; Fourier energy spectrum is shown in Fig 5B).We use the present framework to complement the projected triad dynamics.
Quantities included in the model are a(1, 0), a(0, k f ) and a(1, k f ) and their conjugate pairs, which amount to a total of six independent dimensions.Data is generated as a single time series of length 10 5 at ∆t = 1 intervals, by integrating the full model equations Eq (17) using a spectral grid of size 32 (wavenumbers truncated to −16 ≤ k 1 , k 2 ≤ 16) [59].Each data point is then taken as an initial condition from which a trajectory of length 1 (200 steps of 0.005) is obtained.The sequence of states along the trajectory is projected to make up the 6-dimensional input to the LSTM model.The ground truth total dynamics is again approximated with first-order finite differences.The first 80% of the data is used for training, 5% for validation and the remaining 15% for testing.
For this problem it is difficult to compute the true minimum parametrizing dimension so we conservatively choose n LSTM = 70 and n FC = 38.It is found that the models do not tend to overfit, nor is their performance sensitive to these hyper-parameters around the chosen value.Since the number of hidden units used in the LSTM is large relative to the input dimension, they are not concatenated with the input before entering the output layer.We first perform pre-training with architecture I for 1000 epochs and fine-tune the weights with architecture II.Due to the low-energy nature of the reduction space, transient effects are prominent (see S1 Appendix) and thus a sizable set-up stage is needed for training and prediction with architecture II.Using a sequential training strategy, we keep s = 100 fixed and progressively increase prediction length at p = {10, 30, 50, 100} (see S1 Appendix).At each step, weights are optimized for 1000 epochs using a batch size of 250.The hyperparameters defining the loss functions are p t = 60, w 0 = 0.01 and γ = 0.98, which are found empirically to result in favorable weight convergence.
Similar to the CDV system, we measure the prediction performance using RMSE and correlation coefficient.Since the modeled Fourier coefficients are complex-valued, the sum in Eq (15) is performed on the squared complex magnitude of the absolute error.
The resulting normalized test error curves are shown in Figs 6 and 7 respectively, comparing the proposed data-assisted framework with the original projected model and the fully data-driven approach as the prediction lead time increases.At 0.5 lead time (approximately 1 eddy turn-over time t e ), the data-assisted approach achieves 0.13, 0.005, 0.058 RMSE in mode [0, 4], [1,0] and [1,4] respectively.Predictions along a sample trajectory is shown in Fig 8.
Overall, the data-assisted approach produces the lowest error, albeit narrowly beating the fully data-driven model but significantly outperforming the projected model (88%, 86% and 95% reduction in error for the three modes).This is because data is used to assist a projected model that ignores a considerable amount of state information which contribute heavily to the dynamics.It is therefore all up to the data-driven model to learn this missing information.For this reason we observe similar performance between data-assisted and fully data-driven models.However, when we classify the test cases into regular and rare events based on the value of |a(1, 0)| and examine the error performance separately, the advantage of the data-assisted approach is evident in the latter category, especially for mode (1,4).The is mainly due to (a) rare events appear less frequently in data such that the corresponding dynamics is not learned well compared to regular events and (b) the triad of Fourier modes selected play more prominent role in rare events and therefore the projected dynamics contain relevant dynamical information.Nevertheless, errors for rare events are visibly higher (about 5 times), attesting to their unpredictable nature in general.
To better understand the favorable properties of the hybrid scheme when it comes to the prediction of extreme events we plot the probability density function (pdf) of complementary and total dynamics (see eq. Eq (7)), calculated from the 10 5 -point training data set with a kernel density estimator (Fig 9).The dynamics values are standardized so that 1 unit in horizontal axis represent 1 standard deviation.We immediately notice that total dynamics in every dimension have a fat-tailed distribution.This signifies that the data set contains several extreme observations, more than 10 standard deviations away from the mean.For data-driven models these dynamics are difficult to learn due to their sporadic occurrence in sample data and low density in phase space.
In contrast, the marginal pdf of the complementary dynamics have noticeably different characteristics, especially in both the real and imaginary parts of mode (1, 4) (and to a smaller degree for the real part of mode (0, 4)).The distribution is bimodal-like; more importantly, density falls below 10 −4 level within 3 standard deviations.Because of its concentrated character, this is a much better conditioned target distribution, as the data-driven scheme would never have to learn extreme event dynamics; these are captured by the projected equations.As expected, the error plot in

Conclusion
We introduce a data assisted framework for reduced-order modeling and prediction of extreme transient events in complex dynamical systems with high-dimensional attractors.The framework utilizes a data-driven approach to complement the dynamics given by imperfect models obtained through projection, i.e. in cases when the  projection subspace does not perfectly parametrize the inertial manifold of the system.Information which is invisible to the subspace but important to the dynamics is extracted by analyzing the time history of trajectories (data-streams) projected in the subspace, using a RNN strategy.The LSTM based architecture of the employed RNN allows for the modeling of the dynamics using delayed coordinates, a feature that significantly improves the performance of the scheme, complementing observation in fully data-driven schemes.
We showcase the capabilities of the present approach through two illustrative examples exhibiting intermittent bursts: a low dimensional atmospheric system, the Charney-DeVore model and a high-dimensional system, the Kolmogorov flow described by Navier-Stokes equations.For the former the data-driven model helps to improve significantly the short-term prediction skill in a high-energy reduction subspace, while faithfully replicating the chaotic attractor of the original system.In the infinite-dimensional example it is clearly demonstrated that in regions characterized by extreme events the data-assisted strategy is more effective than the fully data-driven prediction or the projected equations.On the other hand, when we consider the performance close to the main attractor of the dynamical system the purely data-driven approach and the data-assisted scheme exhibit comparable accuracy.
The present approach provides a non-parametric framework for the improvement of imperfect models through data-streams.For regions where data is available we obtain corrections for the model, while for regions where no data is available the underlying model still provides a baseline for prediction.The results in this work emphasize the value of this hybrid strategy for the prediction of extreme transient responses for which data-streams may not contain enough information.In the examples considered the imperfect models where obtained through projection to low-dimensional subspaces.It is important to emphasize that such imperfect models should contain relevant dynamical information for the modes associated to extreme events.These modes are not always the most energetic modes (as illustrated in the fluids example) and numerous efforts have been devoted for their characterization [58,60,61].
Apart of the modeling of extreme events, the developed blended strategy should be of interest for data-driven modeling of systems exhibiting singularities or singular perturbation problems.In this case the governing equations have one component that is particularly challenging to model with data, due to its singular nature.For such systems it is beneficial to combine the singular part of the equation with a data-driven scheme that will incorporate information from data-streams.Future work will focus on the application of the formulated method in the context of predictive control [62][63][64] for turbulent fluid flows and in particular for the suppression of extreme events.

Fig 2 .
Fig 2. Computational graph for model architecture I and II.Yellow nodes are input provided to the network corresponding to sequence of states and blue nodes are prediction targets corresponding to the complementary dynamics (plus states for architecture II).Blocks labeled 'FC' are fully-connected layers with ReLU activations.Dashed arrows represent optional connections depending on the capacity of LSTM relative to the dimension of ξ.Both architectures share the same set of trainable weights.For architecture I, predictions are made as input is read; input is always accurate regardless of any prediction errors made in previous steps.This architecture is used only for training.Architecture II makes prediction in a sequence-to-sequence (setup sequence to prediction sequence) fashion.Errors made early do impact all predictions that follow.This architecture is used for fine-tuning weights and multi-step-ahead prediction.

Fig 3 .
Fig 3. CDV system.(A) 10 4 points sampled from the CDV attractor, projected to (x 1 , x 4 ) plane.(B) Example time series for x 1 ; blocked flow regime is shaded in red.(C) Length-2000 trajectory projected to the first two POD modes (normalized) integrated using the CDV model (left), 5-mode POD projected model (middle) and data-assisted model (right).Despite preserving 99.6% of the total variance, the 5-mode projected model has a single fixed point as opposed to a chaotic attractor.Data-assisted model, however, is able to preserve the geometric features of the original attractor.
Fig 4 -second row).In the third and fourth rows of Fig 4 we illustrate the improvement that we obtain with the data-assisted approach throughout the systems attractor, i.e. in both zonal and blocked regimes.In the third row of Fig 4 the flow in the zonal regime is shown and in the fourth row we demonstrate the flow transitions into the blocked regime around t = 20.In both cases, the data-assisted version clearly improves the prediction accuracy.

Fig 6 .
Fig 6.Kolmogorov flow -RMSE vs. time.Errors are computed for 10 4 test trajectories (legend: fully data-driven ; data-assisted ; triad ).The RMSE in each mode is normalized by the corresponding amplitude E(k) = E[|a(k)| 2 ].A test trajectory is classified as regular if |a(1, 0)| > 0.4 at t = 0 and rare otherwise.Performance for regular, rare and all trajectories are shown in three columns.Data-assisted model has very similar errors to those of purely data-driven models for regular trajectories, but the performance is visibly improved for rare events.

Fig 7 .
Fig 7. Kolmogorov flow: ACC vs. time.Values are computed for (A) regular and (B) rare trajectories classified from 10 4 test cases.Legend: fully data-driven ; data-assisted ; triad dynamics .Real and imaginary parts are treated independently.Similarly to RMSE in Fig 6, improvements in predictions made by the data-assisted model are more prominent for rare events.

Fig 7
Fig 7  suggests that the biggest improvement from a purely data-driven approach to a data-assisted approach is indeed for mode(1,4).

Fig 9 .
Fig 9. Marginal probability density function of total dynamics (top row) and complementary dynamics (bottom row).Horizontal axes are scaled by standard deviations of each quantity.For the real and imaginary parts of mode (1, 4) (and real part of mode (0, 4) to a smaller degree) triad dynamics help remove large-deviation observations in the complementary dynamics.
.It consists of two stages (illustrated in Fig2 II): a set-up stage and a prediction stage.The set-up stage has the same structure as architecture I, taking as input a uniformly spaced sequence of s reduced-space states which we call {ξ −s+1 , ξ −s+2 ..., ξ 0 }.No output, however, is produced until the very last step.This stage acts as a spin-up such that zero initializations to the LSTM memory no longer affects prediction of dynamics at the beginning of the next stage.The output of the set-up stage is a single prediction of the complementary dynamics ψ0 corresponding to the last state of the input sequence and the ending LSTM memory states.This dynamics is combined with F ξ (ξ 0 ) to give the total dynamics at ξ 0 .The final state and dynamics are passed to an integrator to obtain the first input state of the prediction stage ξ1 .During the prediction stage, complementary dynamics is predicted iteratively based on the newest state prediction and the LSTM memory content before combined with F ξ dynamics to generate the total dynamics and subsequently the next state.After p prediction steps, the output of the model is obtained as a sequence of predicted states { ξ1 , . . ., ξp } and a sequence of complementary dynamics { ψ1 , . . ., ψp }.

Table 1 .
Summary of hyperparameters for data-driven model architectures.{ ξ1 , . . ., ξp } is used as the training target in place of the complementary dynamics and for architecture II the FC layer output is directly integrated to generate the next state.
. These highly transient instabilities render this model an appropriate test case for evaluating the developed methodology.The two distinct regimes are manifested through x 1 and x 4 (Fig 3Aand 3B).