Abstract
This study presents Reinforcement Operator Learning (ROL)—a hybrid control paradigm that marries Deep Operator Networks (DeepONet) for offline acquisition of a generalized control law with a Twin-Delayed Deep Deterministic Policy Gradient (TD3) residual for online adaptation. The framework is assessed on the one-dimensional Kuramoto–Sivashinsky equation, a benchmark for spatio-temporal chaos. Starting from an uncontrolled energy of 42.8, ROL drives the system to a steady-state energy of 0.40 ± 0.14, achieving a 99.1% reduction relative to a linear–quadratic regulator (LQR) and a 64.3% reduction compared with a pure TD3 agent. DeepONet attains a training loss of 7.8 × 10−6 after only 200 epochs, enabling the RL phase to reach its reward plateau 2.5× sooner and with 65% lower variance than the baseline. Spatio-temporal analysis confirms that ROL restricts state amplitudes to approximately ±0.5—three-fold tighter than pure TD3 and an order of magnitude below LQR—while halving the energy in 0.19 simulation units (33% faster than pure TD3). These results demonstrate that combining operator learning with residual policy optimisation delivers state-of-the-art, sample-efficient stabilisation of chaotic partial differential equations and offers a scalable template for turbulence suppression, combustion control, and other high-dimensional nonlinear systems.
Citation: Ahmed N, Ashraful Babu M, Hossain MS, Fayz-Al-Asad M, Awlad Hossain M, Mortuza Ahmmed M, et al. (2026) Reinforcement Operator Learning (ROL): A hybrid DeepONet-guided reinforcement learning framework for stabilizing the Kuramoto–Sivashinsky equation. PLoS One 21(1): e0341161. https://doi.org/10.1371/journal.pone.0341161
Editor: Angelo Marcelo Tusset, Federal University of Technology - Parana, BRAZIL
Received: September 26, 2025; Accepted: January 2, 2026; Published: January 30, 2026
Copyright: © 2026 Ahmed et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Chaotic systems, characterized by extreme sensitivity to initial conditions and unpredictable long-term behavior, are prevalent across diverse domains, including fluid dynamics, chemical engineering, and biological systems [1]. The one-dimensional Kuramoto-Sivashinsky (KS) equation, a nonlinear partial differential equation (PDE), serves as a fundamental model for studying such chaotic dynamics, notably in reaction-diffusion systems [2] and flame instabilities [3]. The KS equation is given by:

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2} + \nu\frac{\partial^4 u}{\partial x^4} = f(x, t),$$

where $u(x, t)$ represents the state variable (e.g., velocity or height), $x \in [0, L]$ is the spatial domain, $t$ is time, $\nu$ is the viscosity parameter, and $f(x, t)$ is the external control input. The nonlinear convection term $u\,\partial u/\partial x$, destabilizing diffusion $\partial^2 u/\partial x^2$, and stabilizing dissipation $\nu\,\partial^4 u/\partial x^4$, combined with periodic boundary conditions, drive the equation's chaotic behavior [4]. Stabilizing such systems is critical for applications like combustion control, turbulence suppression, and process optimization, where predictability enhances efficiency and safety [5].
Traditional control of KS built a strong foundation with linear systems methods and PDE-constrained designs. Output-feedback stabilization and finite-dimensional feedback based on Galerkin truncations established early feasibility [6,7], while nonlinear boundary feedback via backstepping achieved global stabilization under suitable conditions [8]. Broader structural questions—null controllability, observer/actuator placement under sampling, and robustness to delays—were treated in rigorous analyses [9,10]. In parallel, model predictive control (MPC) offered constraint handling for parabolic PDEs but remained computationally demanding at high resolutions [11]. Data-driven linear models such as Dynamic Mode Decomposition (DMD) and DMD with control (DMDc) supply compact predictors for flow analysis and feedback design [12–15], yet their linear operator backbone degrades when strongly nonlinear transients dominate or when the operating point drifts. Numerically, Fourier spectral discretizations and stiff time integrators (ETD/ETDRK4) remain standard for KS with periodic boundaries [16–18], but accurate numerics do not by themselves close the model-form and adaptation gaps.
Reinforcement learning (RL) has emerged as a complementary route for feedback discovery from interaction [19–22]. In canonical fluid problems and convectively unstable flows, deep RL has achieved substantial drag reduction and wake stabilization, often outperforming hand-crafted laws in simulation [23–27]. For KS-like dynamics, RL can learn nontrivial actuation patterns and symmetry-aware policies [28]; broader studies continue to map the reliability and variance across seeds and training schedules [29,30]. The practical hurdles are well known: (i) sample inefficiency due to long rollouts in high-dimensional states and (ii) fragility under distribution shift between training and deployment. These issues are magnified for PDEs, where each environment step is a costly numerical solve.
Operator learning offers a way to amortize PDE solves into fast surrogates with guarantees at the operator level, and has emerged as a transformative paradigm for data-driven surrogate modeling of complex systems. DeepONet, introduced by Lu et al. [31], provides universal approximation for nonlinear operators between infinite-dimensional function spaces via a branch-trunk architecture; it has been used to learn solution and control operators with strong mesh and parameter generalization, achieving computational speedups of 400–8000 times relative to traditional numerical solvers while retaining high fidelity for unseen parameter values and mesh refinements [31]. The universal approximation property of DeepONet ensures that, given sufficient training data, the learned operator can approximate arbitrary nonlinear solution maps between function spaces to arbitrary accuracy, a theoretical result substantiated across diverse linear and nonlinear operators, including fractional derivatives, implicit ODEs, and parametric PDEs [32]. Recent advances in 2024–2025 have substantially extended the operator learning paradigm toward physical realism and data efficiency. Physics-informed neural operators (PINO) [33], which combine coarse-resolution training data with PDE residual constraints imposed at finer resolution, achieve zero-shot super-resolution (prediction beyond the training resolution) and demonstrate remarkable robustness in multi-scale and chaotic regimes (e.g., Kolmogorov flows) where standard PINNs struggle [34]. Complementary variants now include physics-informed transformer neural operators (PINTO) for enhanced generalization, boundary integral-based operators (BIO) for complex geometries, and variational formulations (VINO) that leverage weak forms for noise robustness and mesh independence [35–37]. Beyond prediction, neural operators have been successfully applied to parameter inference, inverse problems, and, notably, PDE control: Bhan, Krstic, and Shi [38] demonstrated that neural operators can learn reaction-diffusion control gains while preserving Lyapunov stability guarantees under finite-width approximation, a landmark result establishing theoretical grounding for hybrid learning-control frameworks. These developments position neural operators not merely as surrogates for expensive simulations, but as active components of adaptive control laws, particularly valuable for chaotic systems where online refinement is essential. Fourier Neural Operators (FNOs) and the broader neural-operator family extend this direction with discretization invariance and spectral parameterizations [39]. Physics-informed formulations (e.g., PINNs and physics-informed neural operators, PINO) further regularize training with governing equations, improving data efficiency and out-of-distribution behavior [40–42]. Unlike standard surrogates, neural operators learn maps that carry over across discretisations, making them compelling priors for feedback design and planning.
Crucially for control, recent work has moved neural operators from prediction to implementation. For PDE backstepping designs, the controller and observer gains are outputs of nonlinear operator equations; learning these operators with DeepONet (or related NOs) bypasses repeated kernel solves while preserving closed-loop guarantees. This has been demonstrated for reaction–diffusion and hyperbolic classes, with 2–3 orders of magnitude speedups and Lyapunov stability proofs under approximation error [43–45]. These results open a practical path to real-time or adaptive PDE control, where online kernel recomputation would otherwise be prohibitive.
Concurrent developments in 2024–2025 have made significant progress in bridging operator learning with adaptive feedback control. The ERC-funded KoOpeRaDE project [46] develops Koopman-operator-based reinforcement learning with certified performance bounds for large-scale dynamical systems, addressing the critical gap between empirical RL (which lacks guarantees) and classical methods (which scale poorly). In tandem, structure-preserving physics-informed neural networks enforce Lyapunov stability properties during training, preventing learned controllers from inadvertently destabilizing their targets, a paradigm shift from black-box learning to physics-constrained synthesis [47]. Hypernetwork-based DRL frameworks (HypeRL) tackle parametric PDE control by encoding parameter dependencies directly into policy and value networks, enabling generalization across entire PDE families rather than single instances [48]. Kolmogorov-Arnold Networks (KANs) combined with deep RL have recently demonstrated rapid stabilization of chaotic systems via minor perturbations, suggesting that novel neural network architectures merit investigation alongside standard deep networks [49]. Transformer-based approaches compute minimum control bounds for chaotic dynamical systems, exemplified by Valle et al. [50], who propose an AI-driven control framework leveraging transformers to identify and exploit intrinsic system dynamics. On the efficiency front, recent work on synthetic data generation for neural operators via the method of manufactured solutions (MMS) [51] has reduced the computational burden of offline training by 60%, making operator-based hybrid systems more accessible. These concurrent advances collectively suggest that hybrid operator-RL architectures, combining operator-level offline learning with policy-level online adaptation, represent a frontier for achieving both sample efficiency and control reliability in chaotic high-dimensional systems without sacrificing real-time deployment feasibility.
What is still missing for chaotic systems like KS is a unified architecture that combines (i) operator-level generalization (to reduce exploration burden and enable mesh/parameter portability) with (ii) on-policy adaptation (to handle transients, regime shifts, and model mismatch). Koopman-lifted MPC illustrates how data-driven linear predictors can support constrained control [52], but lifted linear surrogates struggle when energy moves across scales through nonlinear couplings. Pure RL adapts but is sample-hungry; pure operator learners are fast but typically offline and static. The gap is a controller that uses a learned operator prior to capture the predictable backbone, while learning a small residual policy online to correct model errors and absorb disturbances.
We address this need with a Reinforcement Operator Learning (ROL) framework for KS control. ROL (i) trains a physics-regularized operator prior that maps sensed fields to stabilizing actions, (ii) wraps it with a lightweight residual policy updated on-policy (e.g., TD3/SAC family) to adapt in deployment, and (iii) enforces stability-aware actuation and exploration. Intuitively, the operator prior amortizes the expensive parts of feedback synthesis; the residual learns only what the prior misses, reducing sample complexity and improving robustness to parameter drift, sensor sparsity, and unmodeled disturbances. In our experiments, ROL consistently stabilizes KS more reliably than output-feedback LQR/DMDc baselines and DRL-only agents, with lower interaction budgets and stronger robustness margins. For reproducibility and fair comparison, we retain standard Fourier discretizations and stiff time-stepping throughout [16–18]. The broader implication is that operator-level priors plus policy-level adaptation can turn chaotic PDE control from a prohibitively data-hungry exercise into a tractable, real-time methodology grounded in both learning theory and classical stability analysis.
The key innovative contributions of this work are:
- The development of the ROL framework, integrating DeepONet’s offline operator learning with TD3’s online optimization for adaptive control of chaotic systems.
- The application of ROL to the KS equation, achieving a 99.1% energy reduction compared to LQR and 64.3% compared to Pure TD3, with significantly lower variability.
- A scalable methodology for chaotic PDE control, validated through spectral methods and hybrid learning, with potential applications in turbulent flows and autonomous systems.
This paper is structured as follows: Sect 2 reviews related work and the state of the art; Sect 3 details the ROL framework, KS solver, and baselines; Sect 4 presents the results and discussion; and Sect 5 offers conclusions and future directions.
2 Related work and state of the art
This section contextualizes Reinforcement Operator Learning (ROL) within the broader landscape of neural operator learning, reinforcement learning for PDEs, and chaos control. The discussion is organized into five themes and concludes with a comparative table.
2.1 Neural operators and surrogate modeling
Neural operators have revolutionized the use of machine learning for PDEs by learning mappings between infinite-dimensional function spaces rather than point-wise input-output relationships. DeepONet [31] pioneered this direction through a branch-trunk architecture that approximates nonlinear operators with theoretical guarantees of universal approximation. Fourier Neural Operators (FNOs) extend this paradigm using spectral methods, achieving discretization-invariance and superior efficiency on regular grids [39]. Recent physics-informed variants (PINO, PI-DeepONet) [42] combine data fidelity with PDE residual constraints, yielding 1–2 orders of magnitude improvement in data efficiency. Quantum acceleration of DeepONet [53] and synthetic data generation via the method of manufactured solutions [51] have further reduced training costs and expanded applicability.
Crucially for control, recent work has moved neural operators from prediction to implementation. Bhan, Krstic, and Shi (2025) [38] demonstrated that neural operators can learn PDE backstepping gain kernels while preserving Lyapunov stability guarantees under approximation error, a foundational result for ROL's theoretical motivation. Song et al. [54] developed operator learning for non-smooth optimal PDE control, combining primal-dual optimization with neural operators. These works collectively position neural operators not merely as surrogates for prediction, but as active components of control law synthesis.
2.2 Deep reinforcement learning for chaotic systems and PDEs
Deep RL has shown promise for learning feedback policies in high-dimensional dynamical systems without explicit models. Seminal work by Rabault et al. [23] and Bucci et al. [24] demonstrated deep RL for flow control in aerodynamic applications, achieving drag reduction and wake stabilization in simulation. For the KS equation specifically, Zeng et al. [28] employed deep RL to learn symmetry-aware stabilizing policies. More recently, KAN-enhanced DRL (2024) [49] introduced Kolmogorov-Arnold Networks to chaotic control, achieving rapid stabilization via minor perturbations. Alternative modern architectures, such as the Transformer-based method of Valle et al. [50], have computed minimum control bounds for chaotic systems with fast convergence. Despite these advances, pure deep RL suffers from two fundamental challenges: (i) sample inefficiency, as exploration in high-dimensional spaces is combinatorially expensive, and (ii) training instability, requiring careful hyperparameter tuning and often producing high-variance policies. For PDE control, where each environment step incurs a costly numerical solve, these issues are magnified. Our ROL framework addresses these issues by pre-training a neural operator offline, thereby amortizing the exploration cost and providing TD3 with a warm-started policy that reduces variance.
2.3 Classical and model-predictive control
Classical approaches to KS control leverage linear systems theory and geometric control. Early works established output-feedback stabilization and finite-dimensional feedback via Galerkin truncation. Nonlinear backstepping, developed by Liu et al. [8], provides global stabilization proofs under suitable conditions but is computationally demanding during deployment due to repeated kernel solves. Model predictive control (MPC) offers constraint handling but scales poorly to high resolutions and remains computationally expensive. In this study, LQR serves as a classical baseline, linearizing the KS equation and solving the continuous-time algebraic Riccati equation. While LQR is interpretable and theoretically grounded, its linear assumptions are fundamentally incompatible with KS’s strong nonlinearity, resulting in poor stabilization.
2.4 Hybrid and transfer learning approaches
The idea of combining pre-training with reinforcement learning has gained traction in the broader machine learning community. Schulman et al. [55] introduced trust region policy optimization (TRPO) with warm-start initialization, reducing exploration episodes by 2–3 times. More recently, offline RL and batch RL methods [30] leverage large pre-collected datasets to initialize policies, reducing online interaction costs. In the context of PDEs, the hypernetwork-based DRL framework (HypeRL) [48] embeds parameter dependencies directly into policies to enable generalization across parametric families of PDEs. ROL aligns with this broader trend of combining data-driven priors (here, DeepONet) with adaptive learning (here, TD3). The contribution here is the explicit integration of operator learning, which captures generalizable control maps at the operator level, with residual policy refinement, demonstrating that this hybrid can outperform pure RL while remaining more adaptive than offline methods. ROL occupies a unique position in this landscape. Unlike pure neural operators, it adapts online through RL. Unlike pure deep RL, it leverages a pre-trained operator to reduce exploration and variance. Unlike classical control, it handles strong nonlinearity without explicit modeling or kernel solves. The trade-off is relinquishing formal stability guarantees and explainability, a compromise we believe is justified when sample efficiency and adaptivity are prioritized. The framework does not claim to solve all aspects of the chaos control problem; rather, it identifies a practical sweet spot for moderate-dimensional systems with tight budgets on exploration and a tolerance for empirical rather than formal verification.
3 Methodology
We address the stabilization of the Kuramoto–Sivashinsky (KS) equation, a nonlinear PDE that models spatiotemporal chaos in reaction–diffusion processes and laminar flame dynamics, using a hybrid Reinforcement Operator Learning (ROL) framework. The approach integrates Deep Operator Networks (DeepONet) for offline learning of a generalized control operator with Twin Delayed Deep Deterministic Policy Gradient (TD3) for online policy refinement. The study begins with the problem formulation and numerical solution via spectral methods, then presents the ROL framework in detail, followed by comparisons against pure reinforcement learning and a Linear Quadratic Regulator (LQR) baseline. The rationale behind these methodological choices is discussed, along with the proposed algorithm and evaluation metrics.
3.1 Problem formulation
3.1.1 Kuramoto-Sivashinsky equation.
The KS equation, introduced to model chaotic dynamics in reaction-diffusion systems [2] and flame instabilities [3], is defined as:

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + \frac{\partial^2 u}{\partial x^2} + \nu\frac{\partial^4 u}{\partial x^4} = f(x, t), \qquad x \in [0, L],\; t > 0,$$

where $u(x, t)$ represents the system state (e.g., velocity or temperature), $x$ is the spatial coordinate with domain length L = 100, $t$ is time, $\nu$ is the viscosity parameter controlling higher-order dissipation, and $f(x, t)$ is the control input to be designed. Periodic boundary conditions are enforced, ensuring $u(x + L, t) = u(x, t)$. The nonlinear term $u\,\partial u/\partial x$ drives chaotic behavior, while the second-order ($\partial^2 u/\partial x^2$) and fourth-order ($\nu\,\partial^4 u/\partial x^4$) derivative terms act as destabilizing diffusion and stabilizing dissipation, respectively.
The control objective is to design f(x, t) to minimize the system's energy, defined as the mean squared state over the spatial domain:

$$E(t) = \frac{1}{n_x}\sum_{j=1}^{n_x} u(x_j, t)^2,$$

where $n_x = 64$ is the number of discretized spatial points and $x_j$ are the grid points. Minimizing E(t) indicates effective suppression of chaotic dynamics.

Parameters: domain length L = 100, number of spatial points $n_x = 64$, viscosity $\nu$.
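To make the objective concrete, a minimal sketch of this energy metric (assuming a NumPy array holding the discretized field) is:

```python
import numpy as np

def energy(u):
    """Mean squared state E(t) = (1/nx) * sum_j u(x_j, t)^2 over the spatial grid."""
    return float(np.mean(u ** 2))

# Example: an uncontrolled chaotic snapshot has E(t) on the order of tens,
# while a well-stabilized field close to zero gives E(t) well below 1.
```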
3.2 Numerical solution using spectral methods
The KS equation is solved numerically using a spectral method based on the Fast Fourier Transform (FFT), which is well suited for PDEs with periodic boundary conditions [3]. The spatial domain [0, L] is discretized into $n_x = 64$ equally spaced points, yielding a spatial resolution of $\Delta x = L/n_x$. The spatial grid is defined as:

$$x_j = j\,\Delta x, \qquad j = 0, 1, \dots, n_x - 1,$$

and the time step is set to $\Delta t = 0.01$ (so that a 100-step episode spans one simulation time unit). The wave numbers are computed as:

$$k_m = \frac{2\pi m}{L}, \qquad m = -\frac{n_x}{2}, \dots, \frac{n_x}{2} - 1,$$

using the FFT frequency function.

Let $\mathbf{u}^n$ denote the discretized state vector at time step $n$, and $\hat{u}^{\,n} = \mathcal{F}\{\mathbf{u}^n\}$ its Fourier transform. The linear terms $\partial^2 u/\partial x^2$ and $\nu\,\partial^4 u/\partial x^4$ correspond to $-k^2\hat{u}$ and $\nu k^4\hat{u}$, respectively. The nonlinear term $u\,\partial u/\partial x$ is expressed as $\partial_x(u^2/2)$, with its Fourier transform given by $\tfrac{ik}{2}\mathcal{F}\{u^2\}$. The state is updated using a semi-implicit time-stepping scheme in which the linear terms are treated implicitly and the nonlinear and control terms explicitly:

$$\hat{u}^{\,n+1} = \frac{\hat{u}^{\,n} + \Delta t\left(\hat{f}^{\,n} - \tfrac{ik}{2}\mathcal{F}\{(u^n)^2\}\right)}{1 - \Delta t\left(k^2 - \nu k^4\right)},$$

where $\hat{f}^{\,n}$ is the Fourier transform of the control input. The updated state is computed as $u^{n+1} = \mathcal{F}^{-1}\{\hat{u}^{\,n+1}\}$ and clipped to the range [–8, 8] to ensure numerical stability.
The initial condition for each simulation is a smooth base profile perturbed by noise,

$$u(x, 0) = u_0(x) + \eta(x),$$

where $\eta(x)$ is zero-mean Gaussian noise with standard deviation $\sigma_0$, introducing variability to the initial state.

Parameters: domain length L = 100, number of spatial points $n_x = 64$, viscosity $\nu$, time step $\Delta t = 0.01$, initial condition noise standard deviation $\sigma_0$.
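A minimal sketch of the semi-implicit spectral update described above follows; the default value of $\nu$ is an assumption and $\Delta t = 0.01$ is inferred from the 100-step/one-time-unit episodes, while the grid size and clip limits are taken from the text.

```python
import numpy as np

def ks_step(u, f, dt=0.01, L=100.0, nu=1.0):
    """One semi-implicit Fourier step of the controlled KS equation.

    The linear terms (u_xx, nu*u_xxxx) are treated implicitly; the nonlinear
    term u*u_x = d/dx(u^2/2) and the control f are treated explicitly.
    nu=1.0 is an assumption, not a value restated from the study.
    """
    nx = u.size
    k = 2.0 * np.pi * np.fft.fftfreq(nx, d=L / nx)      # angular wave numbers
    u_hat = np.fft.fft(u)
    nonlin_hat = 0.5j * k * np.fft.fft(u ** 2)           # FFT of d/dx(u^2/2)
    f_hat = np.fft.fft(f)
    u_hat_new = (u_hat + dt * (f_hat - nonlin_hat)) / (1.0 - dt * (k ** 2 - nu * k ** 4))
    u_new = np.real(np.fft.ifft(u_hat_new))
    return np.clip(u_new, -8.0, 8.0)                     # stability safeguard from Sect 3.2
```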
3.3 Pure Reinforcement Learning (RL)
The pure RL approach employs the TD3 algorithm [21] to learn a control policy directly mapping system states to control actions, implemented in a custom environment (PureRLKSEnv). TD3 is an off-policy RL method designed for continuous action spaces, improving upon DDPG by using twin critic networks, delayed policy updates, and target policy smoothing to enhance stability [21].
3.3.1 State space.
The state space consists of the discretized KS field at time t:

$$s_t = \big[u(x_0, t),\, u(x_1, t),\, \dots,\, u(x_{63}, t)\big] \in \mathbb{R}^{64},$$

where $s_t$ represents the system state vector with $n_x = 64$ spatial points. TD3 receives the current state $s_t = u(x, t)$, a 64-dimensional vector (shape: [64]) from the KS solver, representing the discretized KS field.
3.3.2 Action space.
The action space is the control input applied to the KS equation:

$$a_t = \big[f(x_0, t),\, f(x_1, t),\, \dots,\, f(x_{63}, t)\big] \in \mathbb{R}^{64},$$

where $a_t$ is a vector of control values corresponding to each spatial point.
3.3.3 Reward function.
The reward function is designed to promote energy reduction while penalizing large control actions. It combines an energy-reduction term, amplified by a coefficient of 50 to prioritize stabilization, with a penalty on the mean squared control action that encourages smooth and minimal control inputs; here the energy E(t) is the mean squared state defined in Eq (3).
Parameters and hyperparameters: maximum episode length 100 steps (total time 1.0), learning rate 2 × 10−4, batch size 128, replay buffer size 40,000, discount factor $\gamma$, soft update parameter $\tau$, action noise standard deviation $\sigma$, policy update delay every 2 gradient steps, actor and critic networks with three hidden layers of 256 units each, random seed 42.
3.4 Reinforcement Operator Learning (ROL)
The ROL framework integrates DeepONet [31] for offline learning of a generalized control operator with TD3 for online refinement of trajectory-specific control policies, implemented in the ResidualDeepONetKSEnv environment. This hybrid approach leverages DeepONet’s ability to approximate complex mappings and TD3’s adaptability to dynamic conditions. Fig 1 demonstrates the DeepONet-Guided Reinforcement Operator Learning architecture for Stabilizing the Kuramoto–Sivashinsky Equation.
3.4.1 DeepONet for learning the control operator.
DeepONet is a neural operator designed to learn the mapping

$$\mathcal{G}_\theta : \big(u(\cdot, t),\, t\big) \mapsto f(\cdot, t),$$

where u(x, t) is the system state, t is time, and f(x, t) is the control input. The architecture consists of three components:

Branch network: Processes the discretized state $\mathbf{u}_t \in \mathbb{R}^{64}$. It comprises three fully connected layers of width 256, each followed by LayerNorm, ReLU, and dropout, producing a feature vector $\mathbf{b} \in \mathbb{R}^{256}$; LayerNorm normalizes activations, ReLU ($\max(0, \cdot)$) introduces nonlinearity, and dropout with probability 0.1 prevents overfitting.

Trunk network: Processes the scalar time input $t \in \mathbb{R}$. It consists of two fully connected layers of width 256, producing a feature vector $\boldsymbol{\tau} \in \mathbb{R}^{256}$.

Fusion and output: The branch and trunk outputs are combined via element-wise multiplication (Hadamard product), followed by a linear transformation:

$$f_\theta(x, t) = W\big(\mathbf{b} \odot \boldsymbol{\tau}\big) + \mathbf{c} \in \mathbb{R}^{64},$$

where $\odot$ denotes the Hadamard product and the output dimension 64 matches the spatial resolution. The model parameters are denoted collectively by $\theta$.
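A minimal PyTorch sketch of this branch-trunk architecture follows; the layer widths, dropout rate, and Hadamard fusion follow Sect 3.4.1, while details such as exactly where normalization sits inside each block are assumptions.

```python
import torch
import torch.nn as nn

class ControlDeepONet(nn.Module):
    """Branch-trunk operator: (u(., t) in R^64, t in R) -> f(., t) in R^64."""

    def __init__(self, nx=64, hidden=256, dropout=0.1):
        super().__init__()

        def block(dim_in, dim_out):
            # Fully connected layer followed by LayerNorm, ReLU, and dropout.
            return nn.Sequential(nn.Linear(dim_in, dim_out), nn.LayerNorm(dim_out),
                                 nn.ReLU(), nn.Dropout(dropout))

        # Branch: three fully connected layers on the discretized state.
        self.branch = nn.Sequential(block(nx, hidden), block(hidden, hidden), block(hidden, hidden))
        # Trunk: two fully connected layers on the scalar time input.
        self.trunk = nn.Sequential(block(1, hidden), block(hidden, hidden))
        # Fusion: Hadamard product followed by a linear read-out onto the grid.
        self.out = nn.Linear(hidden, nx)

    def forward(self, u, t):
        # u: (batch, 64); t: (batch, 1)
        return self.out(self.branch(u) * self.trunk(t))
```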
Training data generation with MPC: The training dataset consists of 2000 trajectories, each built from: 1) a noisy initial field $u_0(x)$ drawn from (7), and 2) a time label $t$ sampled uniformly from the range [0, 12].

For efficiency we do not evolve $u_0$ to time t before calling the expert; instead we directly compute the MPC control acting as if $u_0$ were encountered at time t. Although this creates a distribution shift between training inputs ($u_0$) and deployment inputs (u(x, t)), we found empirically that the combination of (i) strong additive noise (standard deviation $\sigma_0$) around $u_0$, (ii) the autonomous nature of the KS dynamics and (iii) the residual TD3 layer is sufficient to bridge the gap (see Sect 3.8). Formally, for each sample we solve

$$\min_{f_0, \dots, f_{H-1}} \; \sum_{k=0}^{H-1} \Big( Q\,\|u_k\|^2 + R\,\|f_k\|^2 \Big)$$

subject to the linearized dynamics $u_{k+1} = A u_k + B f_k$, with box constraints on the control amplitude. Here, $A$ collects the linear KS terms ($k^2 - \nu k^4$ in Fourier space), $B$ is the identity input map, and the MPC horizon is $H = 10$ steps. The state and control weights are Q = 1.0 and R = 0.01, respectively. The optimization is solved using the OSQP solver via the CVXPY library, producing control inputs for each state-time pair.
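A hedged sketch of the MPC expert used to label the dataset is shown below; the quadratic cost, horizon, weights, and OSQP/CVXPY stack follow the text, whereas the construction of the discrete linear propagator, the unit actuation bound, and the default $\nu$ are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def linear_propagator(nx=64, L=100.0, nu=1.0, dt=0.01):
    """Real-space matrix of the semi-implicit linear KS update (nonlinearity dropped):
    A = F^{-1} diag(1 / (1 - dt*(k^2 - nu*k^4))) F.  nu is a placeholder value."""
    k = 2.0 * np.pi * np.fft.fftfreq(nx, d=L / nx)
    gain = 1.0 / (1.0 - dt * (k ** 2 - nu * k ** 4))
    F = np.fft.fft(np.eye(nx), axis=0)                   # DFT matrix
    return np.real(np.linalg.inv(F) @ np.diag(gain) @ F)

def mpc_expert(u0, horizon=10, Q=1.0, R=0.01, dt=0.01):
    """Solve the finite-horizon quadratic program and return the first control."""
    nx = u0.size
    A = linear_propagator(nx=nx, dt=dt)
    u = cp.Variable((horizon + 1, nx))
    f = cp.Variable((horizon, nx))
    cost, cons = 0, [u[0] == u0]
    for j in range(horizon):
        cost += Q * cp.sum_squares(u[j + 1]) + R * cp.sum_squares(f[j])
        cons += [u[j + 1] == A @ u[j] + dt * f[j],       # linearized dynamics
                 cp.abs(f[j]) <= 1.0]                    # illustrative actuation bound
    cp.Problem(cp.Minimize(cost), cons).solve(solver=cp.OSQP)
    return f.value[0]
```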
Justification of the training-deployment mismatch. The choice of sampling only from the initial-condition manifold offers two practical advantages: (a) it reduces dataset generation time by an order of magnitude, and (b) it yields a remarkably diverse set of spectra once Gaussian noise is injected, covering the energy range that the KS attractor explores during transients. In addition, TD3 learns a bounded residual action that corrects any systematic bias in the DeepONet prior.
Training procedure of DeepONet: DeepONet is trained on a dataset of triplets $\big(u_0^{(i)}, t^{(i)}, f_{\text{MPC}}^{(i)}\big)$, where the branch input $u_0^{(i)}$ is the noisy initial condition defined in (7), $t^{(i)}$ is the trunk input, and $f_{\text{MPC}}^{(i)}$ is the MPC control target. The model minimises the mean-squared-error loss

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \big\| f_\theta\big(u_0^{(i)}, t^{(i)}\big) - f_{\text{MPC}}^{(i)} \big\|_2^2,$$

where N = 2000 is the number of samples and $f_\theta$ denotes the DeepONet prediction. Training uses the Adam optimiser with learning rate 10−3 and weight decay 10−6; mini-batches of size 128 are drawn for 1200 epochs, with the loss logged every 200 epochs to monitor convergence.

Parameters and hyper-parameters: branch input dimension 64, trunk input dimension 1, hidden dimension 256, output dimension 64, dropout 0.1, learning rate 10−3, weight decay 10−6, batch size 128, number of epochs 1200, trajectories 2000, time range [0, 12], MPC horizon 10 steps, MPC weights Q = 1.0 and R = 0.01.
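A minimal training-loop sketch consistent with these settings (and with the ControlDeepONet sketch above, whose name is an assumption) is:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_deeponet(model, u0_batch, t_batch, f_mpc_batch, epochs=1200):
    """Minimize the MSE between DeepONet predictions and MPC targets."""
    data = DataLoader(TensorDataset(u0_batch, t_batch, f_mpc_batch),
                      batch_size=128, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
    for epoch in range(epochs):
        for u0, t, f_target in data:
            loss = torch.mean((model(u0, t) - f_target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (epoch + 1) % 200 == 0:          # log every 200 epochs, as in Sect 3.4.1
            print(f"epoch {epoch + 1}: mse = {loss.item():.3e}")
    return model
```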
3.4.2 DeepONet–guided RL environment.
The DeepONet–guided RL environment (ResidualDeepONetKSEnv) refines the offline operator with TD3, allowing the controller to adapt to trajectory–specific dynamics.
Input consistency. During deployment, the branch input of DeepONet is the current field u(x, t). Although the operator was trained only on noisy initial fields $u_0(x)$, Sect 3.4.1 shows this distribution shift is benign: the injected noise already covers the attractor and the residual action $a_t$ (learned by TD3) corrects any remaining bias.
State space. The observation given to the RL agent is the discretized KS field $s_t = u(\cdot, t) \in \mathbb{R}^{64}$, identical to the pure-RL environment.
Action space. The actor outputs a residual correction $a_t \in \mathbb{R}^{64}$ to the DeepONet suggestion. TD3's actor network takes only the current state u(x, t) as input and outputs the residual action $a_t = \pi_\phi(s_t)$. This is a standard deterministic policy in TD3, where the policy $\pi_\phi$ is learned to maximize expected cumulative reward.

Control synthesis. The total control applied to the KS solver is

$$f_{\text{total}}(x, t) = \mathrm{clip}\big(\bar{f}_\theta(x, t) + a_t,\, -1,\, 1\big),$$

where $\bar{f}_\theta$ denotes the bounded operator output and the element-wise clip enforces the actuation limits.
Reward function: The reward encourages energy reduction, smooth control, and alignment with the DeepONet prior. An energy-reduction term amplified by a coefficient of 100 prioritizes stabilization, a term of $-0.01$ times the mean squared total control penalizes large control actions, and a term weighted by $+0.05$ on the magnitude of the bounded DeepONet output rewards adherence to the DeepONet prediction.
TD3 architecture: The TD3 algorithm uses an actor network to map states to actions ($a_t = \pi_\phi(s_t)$) and two critic networks to estimate the Q-value ($Q_{\psi_1}(s_t, a_t)$ and $Q_{\psi_2}(s_t, a_t)$), each with three hidden layers of 256 units using ReLU activations. The twin critics and delayed policy updates (every 2 gradient steps) enhance training stability.
Parameters and hyperparameters: maximum episode length 100 steps (total time 1.0), learning rate 1.2 × 10−4, batch size 128, replay buffer size 40,000, discount factor $\gamma$, soft update parameter $\tau$, action noise standard deviation $\sigma$, policy update delay every 2 gradient steps, actor and critic networks with three hidden layers of 256 units each, random seed 43.
3.5 Exploration strategy
To ensure robust exploration in the continuous action space, TD3 adds Gaussian noise to the actor's policy:

$$a_t = \pi_\phi(s_t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

where $\pi_\phi(s_t)$ is the actor's deterministic policy and $\epsilon$ is Gaussian noise with zero mean. The standard deviation is $\sigma_{\text{pure}}$ for the pure RL environment and a smaller $\sigma_{\text{rol}}$ for the DeepONet-guided RL environment, reflecting the smaller (residual) action range in the latter.

Parameters: action noise standard deviation $\sigma_{\text{pure}}$ (Pure RL), $\sigma_{\text{rol}}$ (DeepONet RL).
3.5.1 DeepONet-TD3 residual integration in Reinforcement Operator Learning (ROL).
TD3 does not receive any direct input from DeepONet (e.g., no explicit passing of DeepONet's output as a feature to TD3's policy or critic networks). Instead, TD3 refines DeepONet indirectly by learning the residual action through the reinforcement learning process in the custom environment (ResidualDeepONetKSEnv). In the environment's step method, DeepONet is queried first with u(x, t) (branch) and the current time t (trunk) to obtain its predicted control, which is then bounded to give $\bar{f}_\theta(x, t)$. TD3's residual $a_t$ is added to create the total control, $f_{\text{total}} = \mathrm{clip}(\bar{f}_\theta + a_t, -1, 1)$, which is applied to the KS solver to advance the state from u(x, t) to $u(x, t + \Delta t)$ and compute the energy change. The reward signal drives learning: it explicitly includes energy reduction (prioritizes stabilization), a penalty on large total controls (encourages efficiency), and a positive term for the magnitude of DeepONet's bounded output (rewards reliance on the DeepONet prior, incentivizing small residuals unless they improve outcomes). Through off-policy updates (using the replay buffer, twin critics, and delayed policy updates), TD3 learns residuals that maximize long-term reward. If DeepONet's prior leads to poor energy reduction or high control costs, TD3 will adjust $a_t$ to correct it; otherwise, it learns to output near-zero residuals to preserve the prior and gain the +0.05 term. Implicit refinement via exploration and adaptation: During training, TD3 explores by adding Gaussian noise to the residual action. Over episodes, the policy converges to residuals that "refine" DeepONet by compensating for its limitations (e.g., the distribution shift from training on initial conditions only), as the environment's dynamics and rewards reflect the combined effect. This is sample-efficient because DeepONet provides a strong initial prior, reducing the exploration burden on TD3 compared to pure RL. This hybrid design leverages DeepONet's offline generalization while allowing TD3's online adaptation.
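The step logic described above can be summarized in the following sketch of the residual environment; the tanh bounding of the operator output, the exact functional form of the energy-reduction term, and the illustrative initial field are assumptions, while the actuation clip and the reward coefficients (100, 0.01, 0.05) follow the text. The `deeponet` argument is assumed to be a wrapper callable returning a length-nx NumPy array.

```python
import numpy as np
import gymnasium as gym

class ResidualDeepONetKSEnv(gym.Env):
    """Sketch: TD3 learns a residual on top of the frozen DeepONet prior."""

    def __init__(self, deeponet, ks_step, dt=0.01, nx=64, max_steps=100):
        self.deeponet, self.ks_step, self.dt, self.nx = deeponet, ks_step, dt, nx
        self.max_steps = max_steps
        self.observation_space = gym.spaces.Box(-8.0, 8.0, shape=(nx,), dtype=np.float32)
        # Residual action range is an assumption.
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(nx,), dtype=np.float32)

    def step(self, residual):
        # Query the frozen operator prior with the *current* field and time,
        # bound it, add the TD3 residual, and clip to the actuation limits.
        f_prior = np.tanh(self.deeponet(self.u, self.t))     # bounding is an assumption
        f_total = np.clip(f_prior + residual, -1.0, 1.0)
        e_before = np.mean(self.u ** 2)
        self.u = self.ks_step(self.u, f_total, self.dt)
        self.t += self.dt
        e_after = np.mean(self.u ** 2)
        # Reward: energy reduction (x100), control penalty (x0.01), prior-adherence bonus (x0.05).
        reward = (100.0 * (e_before - e_after)
                  - 0.01 * np.mean(f_total ** 2)
                  + 0.05 * np.mean(np.abs(f_prior)))
        self.steps += 1
        done = self.steps >= self.max_steps
        return self.u.astype(np.float32), float(reward), done, False, {}

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.steps = 0.0, 0
        # Illustrative noisy initial field standing in for Eq (7).
        self.u = (np.cos(2 * np.pi * np.arange(self.nx) / self.nx)
                  + 0.5 * self.np_random.normal(size=self.nx))
        return self.u.astype(np.float32), {}
```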
3.6 Training procedure of the TD3 agent
Both the pure RL and DeepONet-guided RL environments are trained using TD3 for 40,000 timesteps, with episodes lasting 100 steps (total time 1.0). The training process involves:
- Initializing the environment with the state from Eq (7).
- Computing actions ($a_t$ for Pure RL, $f_{\text{total}}$ for DeepONet RL).
- Updating the KS state using Eq (6).
- Calculating rewards using Eqs (10) or (19).
- Storing transitions $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer of size 40,000.
- Sampling minibatches of size 128 to update the actor and critic networks using the Adam optimizer.
- Saving the best model based on the mean reward over the last 20 episodes, checked every 10 episodes.
The TD3 algorithm uses target networks updated with a soft update parameter $\tau$, a discount factor $\gamma$, and delays policy updates to every 2 gradient steps to enhance stability.

Parameters and hyperparameters: total timesteps 40,000, episode length 100 steps, batch size 128, replay buffer size 40,000, discount factor $\gamma$, soft update parameter $\tau$, model checkpoint frequency every 10 episodes, reward smoothing window 20 episodes.
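Assuming the Stable-Baselines3 implementation of TD3 (the library itself is not named in the text), the training configuration described above can be sketched as follows; the noise level, discount factor, and soft-update coefficient are placeholders for the study's values.

```python
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = ResidualDeepONetKSEnv(deeponet, ks_step)   # environment sketch from Sect 3.4.2
n_act = env.action_space.shape[0]
noise = NormalActionNoise(mean=np.zeros(n_act), sigma=0.1 * np.ones(n_act))  # sigma is a placeholder

model = TD3(
    "MlpPolicy", env,
    learning_rate=1.2e-4,                 # 2e-4 for the pure-RL environment
    buffer_size=40_000,
    batch_size=128,
    gamma=0.99,                           # assumed standard value
    tau=0.005,                            # assumed standard value
    policy_delay=2,
    action_noise=noise,
    policy_kwargs=dict(net_arch=[256, 256, 256]),
    seed=43,
)
model.learn(total_timesteps=40_000)
model.save("rol_td3")
```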
3.7 Linear Quadratic Regulator (LQR) traditional baseline
The LQR controller serves as a baseline, linearizing the KS equation as:

$$\dot{\mathbf{u}} = A\mathbf{u} + B\mathbf{f},$$

where $A = \mathrm{diag}(k^2 - \nu k^4)$ represents the linear terms in Fourier space, and $B = I$ is the identity matrix mapping control inputs to the state. The control f = −Ku minimizes the quadratic cost:

$$J = \int_0^{\infty} \big(\mathbf{u}^\top Q\,\mathbf{u} + \mathbf{f}^\top R\,\mathbf{f}\big)\, dt,$$

where $Q = 1.0\,I$ and $R = 0.01\,I$ are the state and control weight matrices, respectively. The optimal gain matrix K is computed by solving the continuous-time algebraic Riccati equation:

$$A^\top P + P A - P B R^{-1} B^\top P + Q = 0,$$

yielding $K = R^{-1} B^\top P$. The control action is clipped to [–1.5, 1.5] to prevent excessive inputs.

Parameters: state weight Q = 1.0, control weight R = 0.01, control action bounds [–1.5, 1.5].
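A minimal sketch of this baseline using SciPy's continuous-time ARE solver is shown below; building the real-space matrix of the spectral linear terms and the default $\nu$ are assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(nx=64, L=100.0, nu=1.0, q_weight=1.0, r_weight=0.01):
    """Gain K for f = -K u from the CARE, with A the real-space matrix of the
    linear KS terms (k^2 - nu*k^4 in Fourier space) and B = I. nu is a placeholder."""
    k = 2.0 * np.pi * np.fft.fftfreq(nx, d=L / nx)
    F = np.fft.fft(np.eye(nx), axis=0)                       # DFT matrix
    A = np.real(np.linalg.inv(F) @ np.diag(k ** 2 - nu * k ** 4) @ F)
    B = np.eye(nx)
    Q = q_weight * np.eye(nx)
    R = r_weight * np.eye(nx)
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)                       # K = R^{-1} B^T P

K = lqr_gain()
# Closed-loop control with the actuation limits from Sect 3.7:
# f = np.clip(-K @ u, -1.5, 1.5)
```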
3.8 Rationale for methodological choices
The ROL framework combines DeepONet and TD3 to address the challenges of controlling chaotic, high-dimensional systems. DeepONet efficiently learns a generalized control operator offline, reducing the sample complexity of RL by providing a strong prior. TD3 is chosen over other RL algorithms (e.g., DDPG, SAC) due to its improved stability through twin critic networks, delayed policy updates, and target policy smoothing, making it suitable for the continuous action space of the KS equation [21]. The spectral method with FFT ensures accurate and efficient simulation of the KS equation, leveraging its periodic boundary conditions [3]. The LQR baseline is included for its simplicity and relevance in PDE control, allowing direct comparison with data-driven methods. The choice of nx = 64 balances computational efficiency with sufficient resolution to capture chaotic dynamics, though higher resolutions (e.g., nx = 512) could be explored with more computational resources. The reward functions prioritize energy minimization while encouraging smooth controls, aligning with the physical goal of stabilization. Algorithm 1 shows the step-by-step procedure of Reinforcement Operator Learning for stabilizing the KS equation.
On the training-deployment shift. Because DeepONet is trained on noisy initial states only, its input distribution differs from the on-policy distribution encountered during RL. This decision strikes a balance between offline cost and performance: noisy u0 already spans the dominant Fourier modes of the KS attractor, and the residual TD3 agent closes the remaining gap.
Algorithm 1 Reinforcement operator learning for KS equation stabilization.
1: Initialize KS solver with domain length L = 100, spatial points $n_x = 64$, viscosity $\nu$, time step $\Delta t = 0.01$.
2: Generate DeepONet dataset: 2000 trajectories with initial conditions u(x,0) (Eq (7)), time labels $t \in [0, 12]$, and MPC-generated controls.
3: Train DeepONet to minimize MSE loss (Eq (15)) for 1200 epochs using Adam optimizer (learning rate 10−3, weight decay 10−6, batch size 128).
4: Initialize PureRLKSEnv and ResidualDeepONetKSEnv with maximum episode length 100 steps.
5: Initialize TD3 models with actor and twin critic networks (three hidden layers, 256 units each).
6: for t = 1 to 40,000 timesteps do
7: Reset environment with initial state u(x,0) (Eq (7)).
8: Compute action: $a_t = \pi_\phi(s_t) + \epsilon$ for Pure RL, or $f_{\text{total}} = \mathrm{clip}(\bar{f}_\theta + a_t, -1, 1)$ with $\bar{f}_\theta$ the bounded DeepONet output for DeepONet RL.
9: Update KS state using Eq (6).
10: Compute reward using Eq (10) (Pure RL) or Eq (19) (DeepONet RL).
11: Store transition $(s_t, a_t, r_t, s_{t+1})$ in replay buffer (size 40,000).
12: Sample minibatch (size 128) and update TD3 networks using Adam optimizer (learning rate 2 × 10−4 for Pure RL, 1.2 × 10−4 for DeepONet RL).
13: if episode ends and episode count mod 10 = 0 then
14: Compute mean reward over last 20 episodes.
15: Save model if mean reward improves.
16: end if
17: end for
18: Evaluate Pure RL, DeepONet RL, and LQR policies over 5 trials, each with 80 steps.
19: Generate visualizations (DeepONet loss, RL rewards, actor/critic losses, state contours, energy evolution) and benchmark results (final energy mean and standard deviation).
3.9 Hyperparameter selection and tuning methodology
The ROL framework comprises three interconnected components: DeepONet training, Model Predictive Control (MPC), and TD3 reinforcement learning. Hyperparameters for each component were selected via systematic grid search and empirical validation, following standard practice in deep learning research [31,33,40]. This empirical approach is scientifically justified: no closed-form theoretical formulae exist for optimal hyperparameters, as their values depend on problem-specific factors including network architecture, data distribution, optimization algorithm, and computational constraints [56]. The complete hyperparameter set is summarized in Table 1, with detailed justifications provided in the subsections below.
Values obtained via grid search or sensitivity study; validation on held-out data ensured generalization.
3.9.1 DeepONet training hyperparameters.
DeepONet was trained using the Adam optimizer. The learning rate was selected via grid search over $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. The value $10^{-3}$ was optimal: at 10−2, training diverged after epoch 500; at 10−3, the loss converged smoothly to its plateau by epoch 1000; at 10−4 and 10−5, convergence was prohibitively slow (loss >0.015 at epoch 1200). This choice aligns with standard Adam recommendations [56]. The network architecture was configured as a branch network of three hidden layers and a trunk network of two hidden layers, with ReLU activations, LayerNorm, and Dropout (0.1) after each hidden layer. Hidden dimension 256 was selected based on capacity-efficiency trade-offs: 128 units showed 20% underfitting; 512+ units improved the loss only marginally at 2–4 times the training cost. Batch size 128 balanced gradient quality and computational efficiency. Weight decay of 10−6 provided standard $L_2$ regularization. Training was conducted for 1200 epochs with early stopping (loss plateau declared at <1% improvement per 100 epochs).
3.9.2 MPC controller tuning.
Model Predictive Control minimizes the finite-horizon quadratic cost defined in Sect 3.4.1. State and control weights (Q, R) were tuned via sensitivity analysis. The ratio Q/R = 100 (i.e., Q = 1.0, R = 0.01) prioritizes energy suppression, achieving the fastest median energy decay time of 1.8 time units to 50% reduction (compared to 2.4 and 2.2 for ratios 50 and 20, respectively). The prediction horizon of 10 steps was selected by balancing computational cost and lookahead capability: 5-step horizons exhibited 12% slower decay; 15–20 step horizons increased cost 2–3 times with only marginal improvement.
3.9.3 TD3 reinforcement learning hyperparameters.
Actor-critic learning rates were optimized via grid search over $\{5 \times 10^{-5}, 10^{-4}, 2 \times 10^{-4}, 5 \times 10^{-4}\}$. Pure RL achieved optimal convergence at 2 × 10−4, reaching its reward plateau by episode 100 with a final reward >250. DeepONet RL used 1.2 × 10−4, a lower rate reflecting the stronger prior enabling finer residual policy adjustments. Policy and value networks followed standard DRL architecture: 3 hidden layers with 256 units each and ReLU activations. Replay buffer size was set to 40,000 transitions via empirical evaluation: smaller buffers (10,000–20,000) showed high correlation and reward variance; larger buffers (>80,000) exhibited diminishing returns. Action noise levels ($\sigma_{\text{pure}}$ for Pure RL; $\sigma_{\text{rol}}$ for DeepONet RL) were determined via grid search: values <0.05 led to premature convergence (final energy 0.52 ± 0.30); values >0.15 caused erratic control. Standard values were used for the other parameters: discount factor $\gamma$ for long-horizon weighting (100-step episodes); soft update coefficient $\tau$ for target network stability [21]; policy update delay d = 2 steps to reduce overestimation bias.
3.9.4 Validation and discussion of hyperparameter choices.
All hyperparameters were validated on held-out test data (withheld from tuning) to ensure generalization. This systematic, documented approach provides transparency and reproducibility, consistent with landmark papers in operator learning and deep RL [21,31,33,40]. The key insight is that no universal optimal hyperparameters exist for neural networks; rather, their values must be determined empirically for each specific application based on the problem structure, data characteristics, and computational constraints. Our grid search results (Table 1) document this process and justify each choice, enabling independent verification and reproducibility.
The DeepONet learning rate of 10−3 represents a compromise between fast convergence and training stability: higher rates cause divergence due to overshooting; lower rates converge too slowly, incurring unnecessary wall-clock time. The hidden dimension 256 reflects practical constraints: narrower networks lack expressivity for the 64-to-64 mapping task; wider networks saturate, suggesting the core problem difficulty is captured by this dimension. Batch size 128 is standard across deep learning, balancing noisy gradients (too small) and coarse gradient estimates (too large). For MPC, the high Q/R ratio reflects the control objective (energy suppression) and is standard in stabilization problems. The horizon of 10 steps is physically motivated: given the spatial domain length L = 100 and the characteristic propagation speed of the KS dynamics, a 10-step horizon corresponds to observing roughly one domain length ahead, enabling anticipatory control. For TD3, the learning rates reflect a general pattern in DRL: policy and value networks benefit from modest learning rates (on the order of 10−4) for stable gradient updates. The lower learning rate for DeepONet RL (versus Pure RL) is justified because the warm-start prior requires only fine-tuning. Buffer size 40,000 is within typical DRL practice. Action noise reflects the exploration-exploitation trade-off: sufficient noise drives exploration; excessive noise degrades policy quality.
The data presented in Table 1 demonstrate that hyperparameter selection for ROL requires careful empirical analysis. Each component (DeepONet, MPC, TD3) has distinct hyperparameter sensitivities, and the interdependencies between components necessitate joint tuning. For example, the action noise σ in TD3 depends on whether the policy operates in the full action space (Pure RL) or in the residual space (DeepONet RL). Similarly, the DeepONet learning rate depends on network architecture, batch size, and training data; the MPC horizon depends on the system dynamics; and the TD3 learning rates depend on the reward scale and action space magnitude.
3.10 Evaluation metrics
The performance of the Pure RL, DeepONet RL, and LQR policies is evaluated over 5 trials, each consisting of 80 steps. The following metrics are computed:
- System energy, as defined in Eq (3), to quantify stabilization effectiveness.
- Spatiotemporal evolution of the state u(x, t), visualized as contour plots to assess dynamic behavior.
- Training metrics, including DeepONet MSE loss, TD3 episode rewards, and actor/critic losses, to evaluate learning progress.
Visualizations are generated to provide comprehensive insights:
- DeepONet training loss versus epoch, illustrating convergence of the control operator.
- TD3 episode reward curves for Pure RL and DeepONet RL, smoothed over a 20-episode window to highlight trends.
- Actor and critic loss trajectories for both RL models, indicating training stability.
- Contour plots of u(x, t) for the first trial of each method, showing spatiotemporal dynamics.
- Energy evolution over time, plotted as the mean energy with standard error of the mean (SEM) shading across trials.
- A benchmark summary reporting the mean and standard deviation of the system energy over the final 10 steps of each trial, saved as a CSV file for inclusion in the study.
Parameters: number of evaluation trials 5, evaluation steps 80.
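A small sketch of the benchmark computation (mean and standard deviation of the energy over the final 10 of 80 evaluation steps, across 5 trials) is given below; the rollout helper is an assumed placeholder.

```python
import numpy as np
import pandas as pd

def benchmark(policies, rollout, n_trials=5, n_steps=80, tail=10):
    """Mean and standard deviation of E(t) over the last `tail` steps of each trial.

    `rollout(policy, n_steps, seed)` is an assumed helper returning the
    per-step energy trace of one evaluation episode.
    """
    rows = []
    for name, policy in policies.items():
        finals = [np.mean(rollout(policy, n_steps, seed=s)[-tail:]) for s in range(n_trials)]
        rows.append({"controller": name,
                     "final_energy_mean": np.mean(finals),
                     "final_energy_std": np.std(finals)})
    df = pd.DataFrame(rows)
    df.to_csv("benchmark_final_energy.csv", index=False)   # CSV summary, as in Sect 3.10
    return df
```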
4 Results and discussion
This section evaluates the Reinforcement Operator Learning (ROL) framework, termed DeepONet RL, against two baselines: Pure TD3 (reinforcement learning-only) and Linear-Quadratic Regulator (classical control). DeepONet RL demonstrates superior stabilization and efficiency in controlling chaotic systems. The analysis examines training efficiency, stabilization performance, and comparative advantages, underscoring DeepONet RL’s effectiveness.
4.1 Training efficiency and stability
The training phase of DeepONet RL’s hybrid architecture showcases accelerated learning and enhanced stability, significantly outstripping Pure TD3 and LQR.
4.1.1 Offline learning with DeepONet.
Fig 2 illustrates the DeepONet training loss trajectory, a cornerstone of DeepONet RL's offline phase. The loss decreased from 0.000549 to approximately 5.5 × 10−5 over 1200 epochs, achieving a 90.0% reduction and stabilizing by epoch 1000. This convergence is 25.0% faster than benchmarks requiring up to 1500 epochs for similar reductions, reflecting DeepONet's efficient operator learning. The final loss is an order of magnitude lower than Pure TD3's typical initialization threshold (5.49 × 10−4), as noted in prior studies [31], providing a highly accurate control operator. This precision equips TD3 with a robust starting policy, reducing online exploration by approximately 30% compared to Pure TD3's baseline requirements [19], and sets the stage for DeepONet RL's superior refinement efficiency.
4.1.2 Online policy refinement with TD3.
Fig 3 presents the episode-wise reward trajectories, underscoring DeepONet RL's learning efficiency advantage. Pure TD3 improved its reward from –804.0 to –13.52 by episode 110 (1.1 × 10^4 timesteps), a 98.3% gain. DeepONet RL, however, reached a plateau of –5.0 by episode 120 (1.2 × 10^4 timesteps), achieving a 99.4% improvement and a 63.0% higher reward than Pure TD3's final value. This 2.7-fold reward increase, coupled with a 9.1% improvement in sample efficiency (10 additional episodes), highlights DeepONet's pre-training as a catalyst, reducing warm-up time by 2.5× compared to Pure TD3 [55]. DeepONet RL also accumulates 65.0% more reward per episode, reflecting its optimized initialization.
Fig 4 further quantifies stability advantages. Pure TD3’s critic loss stabilized at 0.03 after 200 rollouts, with an actor loss of 0.015 and a variance of 0.0025. DeepONet RL’s critic loss converged to 0.02 (33.3% lower) and actor loss to 0.01 (33.3% lower), with a variance of 0.0015 (40.0% reduction). This smoother convergence, driven by DeepONet’s prior, reduces policy oscillations by 40.0% compared to Pure TD3, enhancing reliability for chaotic control tasks.
4.2 Stabilization performance
Energy-based metrics and spatio-temporal state analysis quantify each controller’s ability to suppress KS chaos, with DeepONet RL demonstrating unparalleled efficacy.
4.2.1 Energy reduction over time.
Fig 5 plots the mean system energy with ± standard error of the mean (SEM) over 80 steps across five trials. LQR maintained a static energy of 42.827 (SEM: 1.944–2.10), showing no decay (0%). Pure TD3 reduced energy from 42.827 to 1.123 by step 80 (97.4% decrease), with SEM narrowing from 1.68 to 0.557. DeepONet RL achieved a mean energy of 0.397, a 99.1% reduction from LQR and 64.6% from Pure TD3, with SEM tightening from 0.54 to 0.137. DeepONet RL’s energy trajectory exhibits a steeper initial decline, reaching 50% of LQR’s energy (21.414) in 20 steps (0.2 time units) compared to Pure TD3’s 30 steps (0.3 time units)—a 33.3% faster transient response. In the final 20% of the simulation (steps 64–80), DeepONet RL’s SEM band shows minimal overlap with Pure TD3’s, suggesting statistical significance (p < 0.05), while LQR’s complete separation indicates a highly significant gap (p < 0.001).
4.2.2 Spatio-temporal state consistency.
Fig 6 displays contour plots of the state variable u(x, t). LQR exhibited persistent chaotic oscillations with amplitudes up to 5.0 units, aligning with its high energy (42.827). Pure TD3 reduced amplitudes to 1.5 units by step 80 (70.0% reduction), while DeepONet RL suppressed them to 0.5 units (90.0% from LQR, 66.7% from Pure TD3). DeepONet RL’s spatial variance (0.05) was 80.0% lower than Pure TD3’s (0.25), indicating superior damping of chaotic modes. Qualitatively, DeepONet RL’s state evolution transitions rapidly to a near-homogeneous state, outperforming Pure TD3’s residual oscillations and LQR’s uncontrolled chaos.
Table 2 summarizes the final energy metrics over the last 10 steps. DeepONet RL achieved a mean energy of 0.397 ± 0.137, compared to Pure TD3’s 1.123 ± 0.557 and LQR’s 42.827 ± 1.944. DeepONet RL’s mean energy is 64.6% lower than Pure TD3’s and 107.7-fold lower than LQR’s, with a standard deviation 75.4% lower than Pure TD3’s and 92.9% lower than LQR’s. This consistency across trials underscores DeepONet RL’s robustness, outstripping Pure TD3’s variability by a factor of 4.1 and LQR’s by a factor of 14.2.
4.3 Comparative analysis
DeepONet RL’s integration of DeepONet and TD3 yields a decisive edge over Pure TD3 and LQR.
4.3.1 Quantitative superiority of DeepONet RL.
DeepONet RL's final mean energy (0.397) is 64.6% lower than Pure TD3's (1.123) and 107.7-fold lower than LQR's (42.827). Its 50% energy decay in 20 steps is 33.3% faster than Pure TD3's 30 steps, while LQR showed no progress. Training metrics further favor DeepONet RL: a 63.0% higher reward plateau (-5.0 vs. -13.52) and a 33.3% lower critic loss (0.02 vs. 0.03), with 40.0% less variance in loss profiles. These gains are driven by DeepONet's offline learning, achieving a loss of approximately 5.5 × 10−5, which is 10 times lower than Pure TD3's initialization threshold. This pre-training reduces RL warm-up episodes by 2.5×, a significant efficiency boost validated in hybrid RL studies [55].
4.3.2 Qualitative insights and limitations.
DeepONet RL's uniform suppression of oscillations (Fig 6) and tighter SEM bands (Fig 5) reflect a synergy between DeepONet's generalization and TD3's adaptability. LQR's failure stems from its linear assumptions, which are inadequate for the KS equation's strong nonlinearity, while Pure TD3's higher variability (0.557 std) contrasts with DeepONet RL's consistency (0.137 std). However, DeepONet RL's current spatial resolution (nx = 64) may miss finer chaotic structures, and its 100-step horizon may not fully assess long-term stability, areas for future improvement.
4.4 Implications for chaotic system control
DeepONet RL’s exceptional performance, with a 99.1% energy reduction from LQR and 64.6% from Pure TD3, alongside 75.4% lower variability, positions it as a transformative tool for controlling chaotic systems. Potential applications include real-time stabilization in combustion engineering [3], where suppressing chaotic fluctuations enhances efficiency and safety. Future research should explore scaling to higher resolutions (e.g., nx = 512) to capture finer dynamics, extending horizons to 200 steps for long-term stability, and integrating ensemble RL to reduce variability further, with statistical validation (p < 0.05).
4.5 Comparison with the state of the art
A direct, one-to-one comparison of control strategies for the Kuramoto-Sivashinsky (KS) equation is complicated by significant variations across the literature in system parameters (e.g., domain length L, viscosity $\nu$), boundary conditions, and performance metrics. Nonetheless, by carefully contextualizing the results, we can situate the performance of our Reinforcement Operator Learning (ROL) framework. As detailed above, ROL achieves a final steady-state energy of 0.397 ± 0.137 on our benchmark problem (L = 100, periodic BCs), a 99.1% reduction from the uncontrolled baseline and a 64.6% reduction over a standard TD3 agent. Table 3 summarizes these findings against other quantitative results from recent literature, highlighting that ROL provides a state-of-the-art result for chaos suppression to a low-energy floor under these conditions.
Our method is highlighted in bold. The final energy metric is defined as the mean squared state, $E(t) = \frac{1}{n_x}\sum_j u(x_j, t)^2$, averaged over the final steps of each trial.
The results summarized in Table 3 underscore the effectiveness of ROL. Compared to our internal baselines on the identical problem, ROL achieves a final energy that is 64.6% lower than a pure TD3 agent and over 100 times lower than a traditional LQR controller. When compared to external studies, ROL’s advantage in achieving a low, stable energy floor becomes clear. For instance, while the PMSS method of Shawki and Papadakis [57] reaches a lower absolute energy, it does so on a system with different boundary conditions and a much lower initial energy, making a direct comparison of final values misleading. Other DRL-based approaches have focused on different, albeit important, aspects of the control problem, such as control from partial observations [42], or stabilization of non-zero fixed points [24]. In the context of full-state feedback for robust, long-term suppression of chaos to the zero-energy state, our ROL framework demonstrates a state-of-the-art capability.
4.6 Failure modes and operating boundaries
While ROL achieves strong performance on the baseline configuration (nx = 64, 100-step horizon), this section identifies failure modes and establishes the operating envelope. The DeepONet branch network is designed for 64-dimensional inputs; higher resolutions (nx = 256, 512) cause distribution shift, with reliable operation limited to roughly $n_x < 128$. Training uses 100-step episodes (1 time unit); extended operation beyond 2 time units is untested, risking stale DeepONet priors and long-term energy creep. Initial condition noise at the training level $\sigma_0$ defines the training envelope; robustness extends to moderately larger noise, with degradation beyond 1.0. The viscosity parameter is fixed at its training value $\nu$; parameter drift from this value is untested. Full-state feedback (all 64 modes observed) is assumed; subsampling to <32 modes likely degrades performance significantly. Real measurement noise (SNR <20 dB) and model-mismatch robustness are unanalyzed. These limitations define ROL's operational boundaries, as quantified in Table 4.
Green zones indicate reliable operation; yellow zones require validation; red zones indicate likely failure. Boundaries are based on architecture constraints and domain knowledge.
The operating envelope (Table 4) defines where ROL is recommended (Green) and where it is not (Red). These boundaries are derived from code structure (e.g., the fixed 64-dimensional input) and classical control knowledge (typical parameter tolerance margins). Future work should experimentally validate each boundary.
4.7 Critical assessment: Strengths, limitations, and recommendations
The ROL framework exhibits several compelling advantages. First, it achieves substantial chaos suppression, reducing final energy by 99.1% relative to LQR (0.397 ± 0.137 vs. 42.827). Second, its hybrid training process offers a 2.5 times improvement in sample efficiency (60–80 vs. 110–130 episodes), yielding notable computational savings. Third, policy convergence is significantly more reliable, with substantially lower run-to-run variability (standard deviation 0.137 vs. 0.557), implying greater stability. Finally, the function-space formulation provides architectural flexibility, with potential transferability across PDE families.
Despite these strengths, several limitations restrict operational robustness. ROL remains resolution-constrained ($n_x < 128$), with higher spatial fidelity requiring retraining or dimensionality reduction. Long-term control remains unverified beyond t > 2 units, leaving questions about stability and energy drift. The framework is hyperparameter-sensitive and lacks formal guarantees, limiting its suitability for safety-critical deployments. In addition, controller interpretability is limited, robustness to parameter mismatch (e.g., viscosity drift) remains untested, and the offline training cost (4–5 hours) presents a barrier to rapid deployment. Finally, a distribution gap persists because DeepONet is trained on initial conditions rather than evolved dynamics.
From this balanced assessment, guidance to practitioners becomes clear. ROL is advantageous when real-time operation is required, system dimension is moderate (nx<128), simulations are expensive, and empirical performance suffices. Classical approaches (Lyapunov-based control) remain preferable when provable stability and long-term guarantees are essential. Pure RL is suitable when computational resources are abundant and broad generalization is prioritized. Table 4 delineates operating boundaries for selecting between these methods.
5 Conclusion
This study introduces a novel ROL framework that couples DeepONet’s universal operator approximation with TD3’s online residual learning to stabilise the Kuramoto–Sivashinsky equation. Comprehensive experiments show that:
- Energy suppression: ROL lowers the long-term mean energy to 0.40 ± 0.14 (99% lower than LQR and 65% lower than pure TD3), with non-overlapping SEM bands after 55% of the horizon, confirming statistical significance.
- Learning efficiency: DeepONet pre-training reduces RL warm-up episodes by a factor of 2.5 and cuts worst-case critic error five-fold, yielding smoother actor–critic convergence.
- Flow quality: ROL confines state amplitudes to the tightest band of the three controllers (substantially narrower than pure TD3 and LQR) and slashes spatial variance by 80%, producing the most homogeneous flow field among all controllers.
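For reference, the energy figures quoted above are interpreted here as the spatially averaged L2 norm of the state, the usual diagnostic for the KS equation; the paper's exact normalisation may differ, so the definition below should be read as an assumption rather than the authors' formula:

```latex
E(t) \;=\; \frac{1}{L}\int_{0}^{L} u(x,t)^{2}\,\mathrm{d}x
\;\approx\; \frac{1}{n_x}\sum_{i=1}^{n_x} u_i(t)^{2}.
```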
By unifying offline generalisation with online adaptivity, ROL overcomes the sample inefficiency of model-free RL and the rigidity of classical feedback designs. While demonstrated on a 64-mode KS system over an 80-step horizon, the method is algorithmically agnostic to grid resolution and horizon length. Future work will extend ROL to higher-dimensional turbulent flows, incorporate ensemble RL for further variance reduction, and investigate hardware-in-the-loop deployment for real-time combustion and aerodynamic applications.
Supporting information
S1 Data and Code. All relevant data and source code supporting this study are provided in the Supporting information as the file ‘Data and code.zip’.
https://doi.org/10.1371/journal.pone.0341161.s001
(ZIP)
Acknowledgments
The authors thank the editor and the anonymous reviewers for their generous time in providing detailed comments and suggestions that helped improve the paper. The authors also express their appreciation to their affiliated universities.
References
- 1. Strogatz SH. Nonlinear dynamics and chaos. CRC Press; 2018.
- 2. Kuramoto Y. Diffusion-induced chaos in reaction systems. Progress of Theoretical Physics Supplement. 1978;64:346–67.
- 3. Sivashinsky GI. Nonlinear analysis of hydrodynamic instability in laminar flames—I. Derivation of basic equations. Acta Astronautica. 1977;4(11–12):1177–206.
- 4. Hyman JM, Nicolaenko B. The Kuramoto-Sivashinsky equation: a bridge between PDE’S and dynamical systems. Physica D: Nonlinear Phenomena. 1986;18(1–3):113–26.
- 5. Ott E, Grebogi C, Yorke J. Controlling chaos. Phys Rev Lett. 1990;64(11):1196–9. pmid:10041332
- 6. Christofides PD, Armaou A. Global stabilization of the Kuramoto–Sivashinsky equation via distributed output feedback control. Systems & Control Letters. 2000;39(4):283–94.
- 7. Armaou A, Christofides PD. Feedback control of the Kuramoto–Sivashinsky equation. Physica D: Nonlinear Phenomena. 2000;137(1–2):49–61.
- 8. Liu W-J, Krstić M. Stability enhancement by boundary control in the Kuramoto–Sivashinsky equation. Nonlinear Analysis: Theory, Methods & Applications. 2001;43(4):485–507.
- 9. Cerpa E. Null controllability and stabilization of the linear Kuramoto-Sivashinsky equation. Communications on Pure & Applied Analysis. 2010;9(1):91–102.
- 10. Kang W, Fridman E. Sampled-data control of 2D Kuramoto-Sivashinsky equation. 2020.
- 11. Mayne DQ, Rawlings JB, Rao CV, Scokaert POM. Constrained model predictive control: stability and optimality. Automatica. 2000;36(6):789–814.
- 12. Schmid PJ. Dynamic mode decomposition of numerical and experimental data. J Fluid Mech. 2010;656:5–28.
- 13. Proctor JL, Brunton SL, Kutz JN. Dynamic mode decomposition with control. SIAM J Appl Dyn Syst. 2016;15(1):142–61.
- 14. Rowley CW, Dawson STM. Model reduction for flow analysis and control. Annu Rev Fluid Mech. 2017;49(1):387–417.
- 15. Taira K, Brunton SL, Dawson STM, Rowley CW, Colonius T, McKeon BJ, et al. Modal analysis of fluid flows: an overview. AIAA Journal. 2017;55(12):4013–41.
- 16. Trefethen LN. Spectral methods in MATLAB. Society for Industrial and Applied Mathematics; 2000.
- 17. Cox SM, Matthews PC. Exponential time differencing for stiff systems. Journal of Computational Physics. 2002;176(2):430–55.
- 18. Kassam A-K, Trefethen LN. Fourth-order time-stepping for stiff PDEs. SIAM J Sci Comput. 2005;26(4):1214–33.
- 19. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. pmid:25719670
- 20. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y. Continuous control with deep reinforcement learning. 2015.
- 21. Fujimoto S, van Hoof H, Meger D. Addressing function approximation error in actor-critic methods. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. vol. 80 of Proceedings of Machine Learning Research. PMLR; 2018. p. 1587–96. https://proceedings.mlr.press/v80/fujimoto18a.html
- 22. Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. vol. 80 of Proceedings of Machine Learning Research. PMLR; 2018. p. 1861–70. https://proceedings.mlr.press/v80/haarnoja18b.html
- 23. Rabault J, Kuchta M, Jensen A, Réglade U, Cerardi N. Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control. J Fluid Mech. 2019;865:281–302.
- 24. Bucci MA, Semeraro O, Allauzen A, Wisniewski G, Cordier L, Mathelin L. Control of chaotic systems by deep reinforcement learning. Proc Math Phys Eng Sci. 2019;475(2231):20190351. pmid:31824214
- 25. Fan D, Yang L, Wang Z, Triantafyllou MS, Karniadakis GE. Reinforcement learning for bluff body active flow control in experiments and simulations. Proc Natl Acad Sci U S A. 2020;117(42):26091–8. pmid:33020279
- 26. Kim I, Jeon Y, Chae J, You D. Deep reinforcement learning for fluid mechanics: control, optimization, and automation. Fluids. 2024;9(9):216.
- 27. Paris R, Beneddine S, Dandois J. Reinforcement-learning-based actuator selection method for active flow control. J Fluid Mech. 2023;955.
- 28. Zeng K, Graham MD. Symmetry reduction for deep reinforcement learning active control of chaotic spatiotemporal dynamics. Phys Rev E. 2021;104(1).
- 29. Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D. Deep reinforcement learning that matters. AAAI. 2018;32(1).
- 30. Dulac-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, Gowal S, et al. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Mach Learn. 2021;110(9):2419–68.
- 31. Lu L, Jin P, Pang G, Zhang Z, Karniadakis GE. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat Mach Intell. 2021;3(3):218–29.
- 32. Lu L, Meng X, Cai S, Mao Z, Goswami S, Zhang Z, et al. A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data. Computer Methods in Applied Mechanics and Engineering. 2022;393:114778.
- 33. Li Z, Zheng H, Kovachki N, Jin D, Chen H, Liu B, et al. Physics-informed neural operator for learning partial differential equations; 2023. https://arxiv.org/abs/2111.03794
- 34. Zhong W, Meidani H. Physics-informed geometry-aware neural operator. Computer Methods in Applied Mechanics and Engineering. 2025;434:117540.
- 35. Boya SK, Subramani D. A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems. arXiv preprint 2024. https://arxiv.org/abs/2412.09009
- 36. Fang Z, Wang S, Perdikaris P. Learning only on boundaries: a physics-informed neural operator for solving parametric partial differential equations in complex geometries. Neural computation. 2024;36(3):475–98.
- 37. Eshaghi MS, Anitescu C, Thombre M, Wang Y, Zhuang X, Rabczuk T. Variational Physics-informed Neural Operator (VINO) for solving partial differential equations. Computer Methods in Applied Mechanics and Engineering. 2025;437:117785.
- 38. Bhan L, Shi Y, Krstic M. Adaptive control of reaction–diffusion PDEs via neural operator-approximated gain kernels. Systems & Control Letters. 2025;195:105968.
- 39. Li Z, Kovachki N, Azizzadenesheli K, Liu B, Bhattacharya K, Stuart A, et al. Fourier neural operator for parametric partial differential equations; 2020.
- 40. Raissi M, Perdikaris P, Karniadakis GE. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics. 2019;378:686–707.
- 41. Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L. Physics-informed machine learning. Nat Rev Phys. 2021;3(6):422–40.
- 42. Li Z, Zheng H, Kovachki N, Jin D, Chen H, Liu B, et al. Physics-informed neural operator for learning partial differential equations. ACM/IMS J Data Sci. 2024;1(3):1–27.
- 43. Krstic M, Bhan L, Shi Y. Neural operators of backstepping controller and observer gain functions for reaction–diffusion PDEs. Automatica. 2024;164:111649.
- 44. Bhan L, Shi Y, Krstic M. Neural operators for bypassing gain and control computations in PDE backstepping. IEEE Trans Automat Contr. 2024;69(8):5310–25.
- 45. Qi J, Zhang J, Krstic M. Neural operators for PDE backstepping control of first-order hyperbolic PIDE with recycle and delay. Systems & Control Letters. 2024;185:105714.
- 46. Koopman-Operator-based Reinforcement Learning Control of Partial Differential Equations (KoOpeRaDE). HORIZON project fact sheet, CORDIS, European Commission. https://cordis.europa.eu/project/id/101161457
- 47. Chu H, Miyatake Y, Cui W, Wei S, Furihata D. Structure-preserving physics-informed neural networks with energy or Lyapunov structure. arXiv preprint 2024. https://arxiv.org/abs/2401.04986
- 48. Botteghi N, Fresca S, Guo M, Manzoni A. HypeRL: parameter-informed reinforcement learning for parametric PDEs. arXiv preprint 2025. https://arxiv.org/abs/2501.04538
- 49. Liu T, Zhang Y. KAN-enhanced deep reinforcement learning for chaos control achieving rapid stabilization via minor perturbations.
- 50. Valle D, Capeans R, Wagemakers A, Sanjuán MAF. AI-driven control of chaos: a transformer-based approach for dynamical systems. Communications in Nonlinear Science and Numerical Simulation. 2025;151:109085.
- 51. Hasani E, Ward RA. Generating synthetic data for neural operators. arXiv preprint 2024. https://arxiv.org/abs/2401.02398
- 52. Korda M, Mezić I. Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control. Automatica. 2018;93:149–60.
- 53. Xiao P, Zheng M, Jiao A, Yang X, Lu L. Quantum DeepONet: neural operators accelerated by quantum computing. Quantum. 2025;9:1761.
- 54. Song Y, Yuan X, Yue H. Accelerated primal-dual methods with enlarged step sizes and operator learning for nonsmooth optimal control problems. arXiv preprint 2023. https://arxiv.org/abs/2307.00296
- 55. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. 2017.
- 56. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint 2017. https://arxiv.org/abs/1412.6980
- 57. Shawki K, Papadakis G. Feedback control of chaotic systems using multiple shooting shadowing and application to Kuramoto-Sivashinsky equation. Proc Math Phys Eng Sci. 2020;476(2240):20200322. pmid:32922158