Abstract
Bimanual piano rhythm training must maintain precise interlimb timing under limited practice time and fatigue constraints, while feedback on performance is typically available only after an exercise. A piano practice gym environment (PianoGym) is used as a reproducible simulator for fatigue-constrained piano rhythm training under post-action feedback. The training task is formulated as a fatigue-constrained, post-action, partially observable Markov decision process (POMDP). In this POMDP, a controller observes beat asynchrony, dominance gap, synchronization fidelity, and two fatigue signals, and selects the next exercise from a finite library of structured practice actions. To handle delayed measurements and fatigue feasibility under the simulator budget, we introduce a dual-timescale safety layer. The slow Lagrangian part tracks a long-horizon average constraint using revealed true fatigue, while the fast predictive guard screens candidate actions using the online fatigue estimate. On top of this layer, a piano model predictive controller (PianoMPC) uses certainty-equivalent planning: it performs finite-horizon rollouts over a calibrated surrogate environment model and searches only within guard-filtered action sets. In the main three-profile experiment, PianoMPC achieves mean time-to-mastery values of 24.4 to 28.2 steps and FeasibleRate values of 0.90 to 0.95 under the shared environment-side guard. Under the same environment-side guard, it also outperforms bandit and value-based agents. These results indicate that model-predictive planning can convert a fixed operational fatigue budget into faster progress in fatigue-aware piano practice within the PianoGym simulator and its stated surrogate fatigue and skill-dynamics assumptions.
Citation: Meng X, Shi H, Liu N, Pan Z, Xia Y (2026) PianoGym: Safe post-action piano rhythm training with fatigue constraints. PLoS One 21(6): e0351141. https://doi.org/10.1371/journal.pone.0351141
Editor: Bruno Alejandro Mesz, Universidad Nacional de Tres de Febrero, ARGENTINA
Received: December 2, 2025; Accepted: May 22, 2026; Published: June 16, 2026
Copyright: © 2026 Meng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All source code and simulation data necessary to reproduce the reported results are publicly available in the PianoGym repository at https://github.com/xiay9/PianoGym. The repository includes the simulator, controller and baseline implementations, experiment scripts, plotting scripts, dependency file, generated result files, and instructions for reproducing the manuscript tables and figures. This study uses only simulated data and does not rely on external or human-subject datasets.
Funding: This work was supported by the High-level Talent Start-up Fund of Xi’an International University under grant XAIU202547 to Xiaoyu Meng. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Bimanual piano rhythm training requires precise interlimb timing while operating under fatigue limits and limited practice time [1,2]. In typical instructional settings, the coordination state is inferred only after an exercise, from three music-specific observables, namely beat asynchrony (ms), dominance gap (ms), and synchronization fidelity in [0,1]. The next exercise must be selected before new measurements become available [3,4]. This post-action observation pattern is further complicated by recent findings that synchronization stability depends on effector, modality, and presentation order, and that training trajectories are non-monotonic and context-dependent [5–8]. These results imply that practice should be adapted to measurements rather than follow fixed routines. In this paper, safety refers to fatigue feasibility under an explicitly defined simulator budget, not to clinical or physical safety certification for human learners. Under these conditions, we must design a controller that converts a limited fatigue budget into faster progress while keeping both average and peak fatigue within allowed limits.
Related work can be grouped in three directions. First, music motor research has characterized rhythmic coordination, effector dependence, rate specificity, and variability profiles. However, these results are typically analyzed offline and are seldom converted into online decision rules that operate directly on music-specific observables and current fatigue levels [3,4,7,9–11]. Second, safe and constrained decision making has developed runtime enforcement and Lagrangian-style methods. Yet many controllers still encode safety only through soft penalties or keep evaluation separate from selection, which makes it difficult to reuse the same safety interface during data collection and testing [12–15]. Third, planning and sequencing for music sessions and real-time accompaniment have shown the benefit of goal-aware ordering and model-based synchronization. However, learner-side skill transfer, interference across limbs, and explicit fatigue constraints are often omitted [16–19]. A unified formulation is therefore required to connect structured skill dynamics, post-action observations, and explicit safety constraints in a form that can be deployed in instructional environments.
The present study considers a constrained training problem in which a controller receives post-action observations of asynchrony, dominance gap, and synchronization fidelity. From these observations, it must select the next structured practice action from a finite library. Each action is tagged with difficulty, skill targets, learning and transfer parameters, time cost, and a positive fatigue increment. The setting exhibits delayed measurements and partial observability of multi-skill proficiency. It also features heterogeneous transfer across left hand, right hand, polyrhythm, and switching skills, together with limited exploration due to safety requirements and lesson length. The central challenges are to act on post-action observations, to model transfer, interference, fatigue accumulation, and retention under noise, and to enforce safety on two time scales while preserving progress toward mastery.
To address these challenges, a piano practice gym environment (PianoGym) is defined. PianoGym implements a post-action, fatigue-constrained partially observable Markov decision process (POMDP) with a structured practice library and music-specific observables. A dual-timescale safety layer combines a Lagrangian update that adapts to long-horizon average fatigue with a short-horizon predictive guard that filters actions whose predicted fatigue would violate the safety margin. On top of this layer, piano model predictive control (PianoMPC) rolls out the model over a finite horizon and searches only within the safe action sets returned by the guard. Experiments in the PianoGym environment compare this controller with bandit and value-based agents under the same environment-side guard in order to isolate the effect of planning depth and agent-level safety components. The main contributions of this paper include:
- We formulate fatigue-aware bimanual piano rhythm training as a post-action POMDP that connects structured practice actions, music-specific observables, latent skill dynamics, and fatigue constraints in a single reproducible benchmark.
- We introduce PianoMPC, a certainty-equivalent model-predictive controller that performs fatigue-aware lookahead planning over a finite practice library. This design enables the controller to allocate the same simulator fatigue budget more effectively toward faster mastery.
- We evaluate PianoMPC across three learner profiles and a 3 × 3 task suite under a shared environment-side guard. PianoMPC reduces mean time to mastery (TTM) compared with reactive and value-based baselines while maintaining simulator-level fatigue feasibility.
The remainder of the paper is organized as follows. Section Related work reviews related work. Section Methods presents the modeling assumptions, the safety mechanisms, and the PianoMPC controller. Section Experiments describes the task suite, learner profiles, and evaluation metrics, and reports results on efficiency, feasibility, and robustness. Section Conclusion concludes and outlines possible extensions. Section Limitations discusses the scope of the certainty-equivalence, fatigue, and skill-coupling assumptions.
Related work
Observables and coordination in piano practice
Music-facing observables obtained from Musical Instrument Digital Interface (MIDI) or sensor systems make it possible to condition practice on measured coordination changes. Studies on effectors, modality, and tempo show that synchronization stability depends on hand, presentation, and rate rather than being uniform across conditions, which motivates interfaces that operate directly on these music-specific measurements [3,4]. Recent piano-specific modeling work shows that skilled piano control cannot be reduced to simple timing accuracy alone. Upper-limb coordination during octave playing depends on biomechanical constraints [20]. Expressive piano output also reflects nonlinear mappings between dynamic control parameters and sound [21]. Brink et al. and Maarup et al. further reported that variability structure and a bodily hierarchy across voice, hands, and feet shape coordination transitions. These findings in turn motivate explicit noise modeling and the use of standardized observables for practice controllers [9,22]. Longitudinal neuroimaging on piano training further indicates context-dependent, non-monotonic plasticity, so progress cannot be treated as stationary during practice [7,23,24].
Fatigue studies on pianists show that repetitive or demanding segments induce local muscular fatigue and degrade timing and key velocity. They also show that subjective rest does not always match objective indicators [1,2]. These findings support returning a noisy online fatigue estimate for immediate action filtering together with a delayed true fatigue value for long-timescale statistics. Interpersonal and duet studies report that joint action and perturbations modulate interbrain synchronization and the balance of self-other integration [11,25]. Such effects are outside the scope of the single-learner, post-action interface considered here and are therefore not modeled.
Safety in sequential decision making
Safe and constrained reinforcement learning encodes performance under cost limits through Lagrangian or proximal updates, which yields policies that track average constraints during learning [14,26]. Complementary lines of work enforce safety through certified sets and predictive shielding that override unsafe actions near constraint boundaries and provide forward-invariance style guarantees [15,27]. More recent studies combine runtime enforcement with learned policies in safety-critical systems, where an external enforcer corrects an agent whenever formal safety rules would be violated [12,28–30].
A large fraction of these approaches assumes fully observed states or synchronous cost feedback and operates on a single time scale, as in representative value-based reinforcement learning under full observability [31,32]. Shielding under partial observability is closely related because it uses a model-based safety filter when the agent state is incomplete [33]. In fatigue-aware piano rhythm training the controller receives post-action music metrics and two distinct fatigue signals. Safety must therefore be split into a slow Lagrangian adaptation that uses revealed true fatigue and a fast predictive guard that uses the online estimate. Experiments in PianoGym are designed with a shared environment-side guard so that different decision rules can be compared under the same fatigue budget.
Planning and model predictive control for practice
Optimization-based sequencing on music platforms shows that exploiting position-aware and locally sequential preferences increases within-session engagement. This supports ordered practice rather than independent ranking of exercises [18]. Model-based real-time accompaniment further demonstrates that temporal alignment with human performers improves when the controller adapts online [19]. These studies validate goal-aware, model-based lookahead in musical interaction. However, they generally do not model learner-side skill transfer, interference across tasks, or fatigue-limited practice budgets, and the control inputs are typically not drawn from a finite library.
Model predictive controllers have been adapted for rhythmic or periodic motor tasks. Examples include basis-function parameterization for fast gait generation and combinations of central pattern generators with MPC [34,35]. Gain-scheduled and bio-rhythm-informed MPC maintains performance under time-varying operating points, which indicates that predictive controllers can respect physiological cycles without losing responsiveness [36,37]. The piano practice setting considered here differs because actions come from a finite, structured library with tagged difficulty, skill targets, and positive fatigue increments, and planning is restricted to guard-filtered safe action sets. The combination of post-action observability, dual-timescale safety, and library-based MPC is not addressed in the above MPC and sequencing literature.
Methods
Problem statement and interaction protocol
Fatigue-aware piano practice is modeled as a constrained, post-action, partially observable Markov decision process (CPOMDP) with music-specific observables and two fatigue signals. This common interface makes it possible to analyze both sample efficiency and safety.
Post-action partially observable process
At decision step $t$, the learner occupies an unobserved latent state
\[ s_t = (x_t, f_t, m_t), \]
where $x_t$ contains the proficiencies of $K$ rhythm-related skills, $f_t$ is the accumulated cognitive and motor fatigue, and $m_t$ is a memory or retention trace of the same dimension as $x_t$.
The controller maintains an internal estimate $\hat{s}_t$ of this state and chooses a structured practice action $a_t$ from it.
The environment then evolves according to the true state,
\[ s_{t+1} \sim P(\cdot \mid s_t, a_t), \]
that is, the transition kernel depends on $(s_t, a_t)$ and not on the estimate. After the transition, the environment reveals three post-action signals: an observation vector that collects the music-facing measurements (beat asynchrony, dominance gap, and synchronization fidelity), a noisy online fatigue estimate intended for immediate safety screening, and the revealed true fatigue intended for long-run constraint adaptation. The scalar $r_t$ denotes the immediate reward revealed after executing $a_t$; it is computed by a reward map from the post-action observation and the executed action. The two fatigue signals are consumed by different safety components in Section Dual-timescale fatigue safety. Policies in Section PianoMPC controller are defined on the filtered state estimate $\hat{s}_t$, as shown in Fig 1.
Objective and constraints
Rhythm mastery is declared once the mastery criteria in Eq. (6) hold for W = 3 consecutive decision steps, where W denotes the consecutive-mastery window. The main performance indicator is the time to mastery (TTM), defined as the smallest index t at which Eq. (6) is satisfied; a default value is used if the condition is not met within the episode.
Because every practice segment induces fatigue, the process is constrained by the long-run average of the true fatigue (Eq. (7)), which must not exceed a task-dependent threshold. The average is taken over the episode length in the PianoGym environment. The control objective (Eq. (8)) is therefore to maximize the expected cumulative reward subject to this constraint.
Here the constraint term is the expected long-run fatigue cost under policy $\pi$, and the expectation is taken over trajectories induced by $\pi$ and the simulator dynamics. In the PianoGym instantiation used for experimentation, the cumulative reward in Eq. (8) is monotone in TTM, so reporting TTM is sufficient to compare controllers. The reward-based formulation is nevertheless retained because the model-predictive controller in Section PianoMPC controller optimizes a finite-horizon surrogate of Eq. (8).
Environment and predictive model
The controller performs model-predictive rollouts and therefore requires a deterministic approximation of the environment dynamics. This section makes the abstract CPOMDP in Section Problem statement and interaction protocol concrete by specifying the latent variables, the structured action library, the skill and fatigue updates, and the stochastic observation layer from which rewards are computed. All parameters below are treated as known constants of the PianoGym environment and are not learned by the controller. This makes PianoMPC a certainty-equivalent planner: it plans from the current filtered estimate as if that estimate were the rollout state, rather than maintaining a full posterior belief over latent skills.
Parameterization and calibration
PianoGym uses a fixed parameterization so that all controllers are compared in the same simulator rather than learning a different environment model. The action-library structure, skill targets, transfer signs, and fatigue-load ordering were manually specified from the musical roles of the exercises. The numerical learning rates, forgetting rate, fatigue gain, rest recovery rate, observation scales, and noise levels were then set once on an internal calibration suite of simulated pilot profiles spanning balanced, mild-left-weak, and severe-left-weak learners. The calibration target was not to fit a human cohort, but to place asynchrony, dominance gap, fidelity, fatigue, and TTM in interpretable ranges aligned with the evaluation thresholds. After this calibration step, the parameters were frozen for every baseline, ablation, and robustness experiment. Sensitivity to these choices is assessed in the dynamics-mismatch, scoped-mismatch, threshold-window, and guard-dependence analyses in Section Experiments. Table 1 summarizes these parameter groups and their sources.
State and action space instantiation
The latent skill vector contains K = 5 rhythm-relevant components describing left-hand control, right-hand control, 2:3 polyrhythm coordination, 3:4 polyrhythm coordination, and fast switching. The memory trace has the same dimension and plays the role of a smoothed retention level.
The action set is a finite library of predefined structured practice actions and is exposed to the controller as its discrete action space. Each action is encoded as a tuple of six tags: a difficulty tag $d$, a target vector that specifies which skills the segment targets, a vector of per-skill learning rates, a sparse transfer or interference matrix, a positive fatigue load, and an execution time cost. This representation separates the musical target, the learning intensity, and the physiological load. The library also contains an explicit REST action, which is kept available at every decision step.
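As an illustration, the action encoding can be sketched as a small Python record. The class name `PracticeAction`, the field names, the example exercise `LEFT_SLOW`, and all numeric values are hypothetical and are not taken from the PianoGym repository; giving REST a zero fatigue load is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

K = 5  # left hand, right hand, 2:3 polyrhythm, 3:4 polyrhythm, fast switching


@dataclass(frozen=True)
class PracticeAction:
    name: str
    difficulty: float                        # difficulty tag d
    targets: Tuple[int, ...]                 # 0/1 mask: which skills are targeted
    learning_rates: Tuple[float, ...]        # per-skill learning rates
    transfer: Tuple[Tuple[float, ...], ...]  # K x K transfer (+) / interference (-)
    fatigue_load: float                      # positive for practice actions
    time_cost: float                         # execution time cost


def zeros(k):
    """Return a k x k matrix of zeros as nested tuples."""
    return tuple(tuple(0.0 for _ in range(k)) for _ in range(k))


def with_entry(matrix, i, j, value):
    """Return a copy of the matrix with entry (i, j) set to value."""
    rows = [list(r) for r in matrix]
    rows[i][j] = value
    return tuple(tuple(r) for r in rows)


# Hypothetical exercise: slow left-hand drill with a small spill-over to switching.
LEFT_SLOW = PracticeAction(
    name="LEFT_SLOW",
    difficulty=0.3,
    targets=(1, 0, 0, 0, 0),
    learning_rates=(0.2, 0.0, 0.0, 0.0, 0.0),
    transfer=with_entry(zeros(K), 4, 0, 0.1),
    fatigue_load=0.15,
    time_cost=1.0,
)

# Assumption: REST targets nothing and adds no fatigue load.
REST = PracticeAction("REST", 0.0, (0,) * K, (0.0,) * K, zeros(K), 0.0, 1.0)

LIBRARY = (LEFT_SLOW, REST)
```

Keeping the record immutable means the same library object can be shared by the environment and the planner without risk of one side modifying the tags.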
Skill dynamics
Skill adaptation results from diminishing returns, difficulty matching, and fatigue attenuation. For an executed action $a_t$, the gain of each targeted skill is scaled by a diminishing-returns factor that reduces gains near mastery. The difficulty matching gate, shown in Fig 2, keeps the tagged difficulty of candidate actions close to the current skill estimate; it becomes active when the projected skill level approaches the declared difficulty $d$.
Fatigue attenuates learning through a factor with a fixed rate and a lower bound.
Transfer, interference, and process noise are then added.
The next skill vector is clipped element-wise to the admissible range, with the same clipping applied to every coordinate. This clipping is part of the environment step and is mirrored by the controller.
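The structure of this update can be sketched in a few lines of Python. The functional forms below are assumptions chosen only to match the verbal description: a factor (1 - x) for diminishing returns, a Gaussian-shaped difficulty gate, an exponential fatigue attenuation with a floor, and clipping to [0, 1]. The parameter names `kappa`, `phi_min`, and `gate_width` and their values are hypothetical, and process noise is omitted.

```python
import math


def clip01(v):
    """Clip every coordinate to [0, 1] (assumed admissible range)."""
    return [min(1.0, max(0.0, vi)) for vi in v]


def skill_step(x, targets, rates, transfer, difficulty, fatigue,
               kappa=1.0, phi_min=0.2, gate_width=0.25):
    """Noise-free skill update under the assumed functional forms."""
    n_targets = sum(targets)
    # Projected skill level: mean proficiency over the targeted skills.
    projected = (sum(xi * ui for xi, ui in zip(x, targets)) / n_targets
                 if n_targets else 0.0)
    # Assumed gate: largest when projected skill matches the tagged difficulty.
    gate = math.exp(-((projected - difficulty) ** 2) / (2.0 * gate_width ** 2))
    # Assumed attenuation: exponential in fatigue with a lower bound.
    phi = max(phi_min, math.exp(-kappa * fatigue))
    # Direct gains with diminishing returns near mastery.
    direct = [u * r * (1.0 - xi) * gate * phi
              for xi, u, r in zip(x, targets, rates)]
    # Transfer (+) and interference (-) driven by the direct gains.
    k = len(x)
    spill = [sum(transfer[i][j] * direct[j] for j in range(k)) for i in range(k)]
    return clip01([xi + di + si for xi, di, si in zip(x, direct, spill)])
```

With a single targeted skill at 0.5, matched difficulty, and zero fatigue, the gate and attenuation both equal 1, so the gain reduces to the learning rate times (1 - 0.5).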
Fatigue and memory dynamics
For practice actions, fatigue increases with the action load (Eq. (17)). For the REST action, fatigue recovers at a fixed recovery rate (Eq. (18)).
Retention is maintained through an exponential smoother, and forgetting is represented as a small decay proportional to missing retention, with a fixed decay rate. This decay is also part of the dynamics mirrored by the controller.
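A minimal sketch of these three updates follows. The additive load, the multiplicative rest recovery, and the reading of "missing retention" as the amount by which a skill exceeds its memory trace are assumptions, as are the parameter names `rest_rate`, `alpha`, and `decay` and their default values.

```python
def fatigue_step(f, load, is_rest, rest_rate=0.3):
    """Fatigue grows by the action load, or recovers geometrically during REST."""
    return (1.0 - rest_rate) * f if is_rest else f + load


def memory_step(m, x, alpha=0.1):
    """Exponential smoother: the memory trace moves toward the current skill."""
    return [(1.0 - alpha) * mi + alpha * xi for mi, xi in zip(m, x)]


def forget(x, m, decay=0.02):
    """Assumed forgetting rule: skill above its retention trace decays slightly."""
    return [xi - decay * max(0.0, xi - mi) for xi, mi in zip(x, m)]
```

Under this reading, a skill that has only recently improved (and is therefore well above its trace) loses more per step than a skill that has been stable for a long time.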
Observation and reward generation
Given the latent state and the executed action $a_t$, the environment samples the musical observation from explicit action-dependent observation equations. These equations use fixed skill weights $w_A$ for asynchrony, action-specific scale parameters, and an action-specific dominance-gap offset. Asynchrony and the dominance gap each carry their own noise term, and the fidelity signal is a clipped logistic-normal proxy.
These equations make the quantitative link explicit: higher relevant skill lowers asynchrony and improves fidelity, left-right imbalance changes dominance gap, and higher fatigue increases timing error and lowers fidelity. They are simulator observation equations, not fitted physiological measurement laws. The online fatigue estimate is a noisy readout of fatigue.
To aggregate heterogeneous observables into a single learning signal, each measurement $v$ is standardized using fixed statistics $(\mu_v, \sigma_v)$:
\[ z(v) = \frac{v - \mu_v}{\sigma_v}, \qquad z(v)_+ = \max\{0, z(v)\}. \]
Here $z(v)_+$ denotes the positive part of the standardized residual. Asynchrony and dominance gap are standardized in this way. Fidelity already lies in [0,1] and is used in raw form. The immediate reward combines these terms with fixed reward weights and subtracts the duration cost of the executed action multiplied by a fixed coefficient. The reward penalizes poor musical coordination and long practice segments and keeps duration-related effects comparable across actions.
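The aggregation step can be illustrated with a short sketch. The sign convention (penalize standardized asynchrony and dominance gap, reward raw fidelity, subtract a duration cost) follows the verbal description, but the exact linear form, the weight tuple `w`, the coefficient `beta`, and the statistics in `STATS` are assumptions for illustration.

```python
def z_plus(v, mean, std):
    """Positive part of the standardized residual."""
    return max(0.0, (v - mean) / std)


def reward(async_ms, domgap_ms, fid, duration, stats,
           w=(1.0, 1.0, 1.0), beta=0.1):
    """Assumed linear reward: penalties for timing error and duration, bonus for fidelity."""
    za = z_plus(async_ms, *stats["async"])
    zd = z_plus(domgap_ms, *stats["domgap"])
    return -w[0] * za - w[1] * zd + w[2] * fid - beta * duration


# Hypothetical (mean, std) pairs in milliseconds.
STATS = {"async": (30.0, 10.0), "domgap": (20.0, 10.0)}
```

Taking the positive part means that measurements better than the reference mean earn no extra credit, so the reward only responds to residual error.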
Dual-timescale fatigue safety
Fatigue must remain within specified limits both on average and at the next decision instant. Safety is therefore enforced on two coupled time scales, as shown in Fig 3. A slow Lagrangian adaptation tracks the long-run constraint in Eq. (7) using the revealed true fatigue. A fast predictive guard filters unsafe actions using the online estimate. In this paper, the safety layer refers to the simulator-level fatigue-feasibility interface used to screen practice actions. During experimentation, the same one-step guard is implemented as an environment-side guard for all agents in order to provide a common safety interface. The environment-side guard takes precedence and can overwrite any action proposed by a controller according to a fixed fallback rule.
Slow timescale: Lagrangian adaptation
At the beginning of step $t$, the controller has access to the true fatigue $f_t$ that was revealed at the end of the previous transition. An exponential moving average of revealed fatigue is updated first, and the Lagrange multiplier for fatigue is then advanced by projected gradient ascent on the gap between this average and the fatigue threshold (Eq. (28)).
The maximum operator in Eq. (28) keeps the Lagrange multiplier nonnegative, as required for the fatigue inequality constraint. The resulting multiplier $\lambda_t$ converts the average constraint into an instantaneous penalty and is kept fixed for the entire decision step $t$. During planning in Section PianoMPC controller, this multiplier penalizes predicted trajectories whose fatigue rises above the threshold.
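The slow update can be sketched as follows. This is the textbook form of an exponential moving average followed by projected gradient ascent; the smoothing weight `rho`, the step size `eta`, and the argument names are hypothetical and not taken from the paper.

```python
def lagrangian_step(f_bar, lam, f_true, f_max, rho=0.1, eta=0.5):
    """One slow-timescale update from the newly revealed true fatigue."""
    # Exponential moving average of revealed fatigue.
    f_bar = (1.0 - rho) * f_bar + rho * f_true
    # Projected gradient ascent: the max keeps the multiplier nonnegative.
    lam = max(0.0, lam + eta * (f_bar - f_max))
    return f_bar, lam
```

While the running average stays below the threshold the multiplier is pushed back toward zero, so the penalty only builds up during sustained overload.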
Fast timescale: predictive guard
The fast component restricts actions that would produce an immediate overload. At step $t$, the internal fatigue estimate (computed in Section PianoMPC controller) is combined with the deterministic dynamics Eq. (17)–Eq. (18) and the action-specific loads in order to predict the next-step fatigue for every action in the library. Two margins are used for two different purposes. The conservative one-step margin defines a primary step cap, whereas the relaxed guard slack defines a peak diagnostic threshold used by the environment-side guard and by guard-consistency experiments. The primary one-step set (Eq. (29)) contains the actions whose predicted next-step fatigue respects the step cap. For a guard horizon $H_g$, the relaxed peak condition (Eq. (30)) is evaluated against the diagnostic threshold.
The executed guard uses Eq. (29) as the ordinary action filter, and REST is always available. The relaxed band in Eq. (30) is logged for calibration and can affect execution only in the fallback case where no non-rest primary action survives; otherwise it does not enlarge the ordinary safe set. Accordingly, the one-step margin sets the conservative execution cap, whereas the guard slack is used only as a relaxed peak-diagnostic threshold. When the environment-side guard is present, the environment applies the same interface to the finally proposed action.
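A minimal sketch of such a guard is shown below. It assumes that the one-step fatigue prediction is the current estimate plus the action load, that the two caps are passed in directly as `step_cap` and `relaxed_cap`, and that a rejected proposal is replaced by the lightest surviving practice action. That replacement rule is a hypothetical stand-in for the paper's fixed fallback rule, which is not spelled out in this section.

```python
REST = "REST"


def primary_safe_set(f_hat, loads, step_cap):
    """Actions whose predicted next-step fatigue respects the step cap; REST always passes."""
    safe = {a for a, load in loads.items()
            if a != REST and f_hat + load <= step_cap}
    safe.add(REST)
    return safe


def guard(proposed, f_hat, loads, step_cap, relaxed_cap):
    """Return (executed_action, was_replaced) for a proposed action."""
    safe = primary_safe_set(f_hat, loads, step_cap)
    if proposed in safe:
        return proposed, False
    non_rest = sorted(safe - {REST})
    if non_rest:
        # Assumed rule: fall back to the lightest surviving practice action.
        return min(non_rest, key=lambda a: loads[a]), True
    # Fallback case: no non-rest primary action survives, so the relaxed
    # band may admit the proposal; otherwise the learner rests.
    if f_hat + loads[proposed] <= relaxed_cap:
        return proposed, False
    return REST, True
```

The `was_replaced` flag is the kind of signal from which a guard replacement rate could be computed as an intervention diagnostic.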
PianoMPC controller
This section introduces PianoMPC, a finite-horizon, certainty-equivalent controller that plans only over actions certified as safe by the dual-timescale mechanism. The controller uses the deterministic model in Section Environment and predictive model, the multipliers in Section Dual-timescale fatigue safety, and the same structured action library as the environment.
Certainty-equivalent state tracking
At the start of step $t$, immediately after receiving the post-action signals generated by action $a_{t-1}$, the controller forms an updated estimate $\hat{s}_t$.
The update is carried out in two stages.
First, a prediction step propagates the previous estimate $\hat{s}_{t-1}$ through the deterministic model, which applies the noise-free dynamics Eq. (12)–Eq. (20) and Eq. (17)–Eq. (18). This produces a prior for step $t$.
Second, a correction step aligns the fatigue component with the revealed true fatigue while keeping the skill and memory components unchanged.
This certainty-equivalent design is motivated by two characteristics of the setting. Musical observables (Async, DomGap, Fid) are noisy and task dependent, so back-propagating them into all skill coordinates at every step would increase variance and computational load. In contrast, the environment reveals the true fatigue without noise. Using it to anchor the fatigue estimate yields an accurate fatigue signal for both the slow Lagrangian update and the fast guard. The observables are therefore used through the deterministic observation and reward model in Section Environment and predictive model and are not used for state correction.
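The two-stage update amounts to a predict step followed by overwriting one component. The sketch below uses a hypothetical `State` record and a toy deterministic model, `toy_model`, that stands in for the noise-free PianoGym dynamics; neither name comes from the paper.

```python
from typing import Callable, NamedTuple, Tuple


class State(NamedTuple):
    x: Tuple[float, ...]  # skill estimate
    f: float              # fatigue estimate
    m: Tuple[float, ...]  # memory trace estimate


def track(prev: State, action, revealed_fatigue: float,
          model: Callable[[State, object], State]) -> State:
    """Predict with the deterministic model, then correct the fatigue component only."""
    prior = model(prev, action)                # prediction step
    return prior._replace(f=revealed_fatigue)  # correction step


def toy_model(s: State, action: int) -> State:
    """Toy stand-in: the practiced skill rises by 0.1 and fatigue by 0.2."""
    x = list(s.x)
    x[action] = min(1.0, x[action] + 0.1)
    return State(tuple(x), s.f + 0.2, s.m)
```

Because only the fatigue coordinate is corrected, any mismatch between the model and the simulator accumulates in the skill and memory estimates, which is the price of the certainty-equivalent design.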
Finite-horizon planning
Given $\hat{s}_t$ and the current multiplier $\lambda_t$, the controller solves a receding-horizon problem of length $H$, as shown in Fig 4. A candidate action sequence assigns one action to each of the $H$ lookahead steps. The deterministic model is rolled forward along this sequence, starting from $\hat{s}_t$. At each lookahead step $t + h$, the predictive guard in Section Dual-timescale fatigue safety is applied to produce the future safe set, using the predicted fatigue contained in the rolled-out state.
The immediate reward at lookahead index $h$ is obtained by applying the deterministic observation-and-reward model to the predicted state and to the action at that index. The planning problem (Eq. (36)) maximizes the sum of these predicted rewards over the horizon, with predicted fatigue penalized through the multiplier and with every action restricted to its safe set.
The multiplier computed in Eq. (28) is held constant over the horizon, so that the optimization in Eq. (36) depends only on the predicted trajectories.
The discrete optimization in Eq. (36) is solved by a beam search of width $b$. At each depth, only the best $b$ partial sequences consistent with the safe sets are kept, which yields a computational cost that grows with the beam width, the horizon, and the size of the filtered action sets. After evaluating all depth-H sequences retained by the beam, the first action of the best sequence is executed. At the next decision step, the state is re-estimated and the planning procedure is repeated. This produces a receding horizon controller that respects both long-run and short-horizon fatigue constraints and that is used in Section Experiments.
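The beam search over guard-filtered actions can be sketched generically. The planner below takes the model, reward, and guard as callables; the penalty `lam * max(0, fatigue - f_max)` is an assumed form of the multiplier term, and the toy model (`LOAD`, `GAIN`, `CAP`, and the `toy_*` functions) is hypothetical and exists only to make the example runnable.

```python
def beam_plan(state, actions, step, reward, fatigue_of, safe,
              horizon=3, beam=20, lam=0.0, f_max=1.0):
    """Return the first action of the best guard-consistent sequence found."""
    beams = [(0.0, [], state)]  # (score, action sequence, predicted state)
    for _ in range(horizon):
        candidates = []
        for score, seq, s in beams:
            for a in safe(s, actions):  # expand only guard-filtered actions
                s2 = step(s, a)
                penalty = lam * max(0.0, fatigue_of(s2) - f_max)
                candidates.append((score + reward(s, a, s2) - penalty,
                                   seq + [a], s2))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]  # keep the best partial sequences
    return beams[0][1][0]


# Toy model: state is (skill, fatigue).
LOAD = {"hard": 0.5, "easy": 0.2, "REST": 0.0}
GAIN = {"hard": 0.3, "easy": 0.1, "REST": 0.0}
CAP = 0.6


def toy_step(s, a):
    skill, f = s
    return (skill + GAIN[a], 0.5 * f if a == "REST" else f + LOAD[a])


def toy_safe(s, actions):
    return [a for a in actions if a == "REST" or s[1] + LOAD[a] <= CAP]


def toy_reward(s, a, s2):
    return s2[0] - s[0]
```

Because REST always passes the toy guard, every beam entry has at least one successor and the candidate list is never empty.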
Experiments
This section evaluates whether model-predictive planning improves learning under fatigue-constrained practice. All experiments use the same PianoGym action library and the same environment-side short-horizon safety guard. Two safety regimes are reported. In the main comparison reported in Section Main comparison under the safety guard, the environment fatigue hard constraint is enabled for the three learner profiles, so that agents differ only in their decision rules under a common safety interface. In the task suite study reported in Section Robustness across task families, each profile follows its default configuration. The balanced and mild-left-weak profiles keep only the guard, whereas the severe-left-weak profile enables both the guard and the hard constraint.
Dataset
Task composition
A 3 × 3 task suite is constructed in PianoGymEnv (see Table 2). The rows are the three learner profiles used throughout the experiments, and for each profile three transfer levels (weak, medium, strong) are used. This gives 9 tasks in total.
Scale and execution protocol
Unless otherwise stated, S = 10 independent random seeds are run for each task, and each seed generates one trajectory. A trajectory runs for at most the maximum episode length and terminates early once the mastery condition is satisfied. The resulting dataset consists of records of states, actions, observations, and fatigue. Under this configuration the total sample size is on the order of $10^5$, which is sufficient for reporting means and 95% confidence intervals.
Symmetry between left-weak and right-weak profiles
Only the left-weak variants (mild-left-weak, severe-left-weak) are reported in the main text. The data generator can swap left and right skill dimensions and action targets to obtain the right-weak counterparts. Because the action-library fatigue parameters, the short-horizon safety guard, and the PianoMPC state estimator are symmetric in the two dimensions, the mirrored tasks yield the same rankings. Their safety statistics are also almost identical. The corresponding results can be reproduced from the released code or from the supplementary material.
Observation and action interface
The environment uses a post-action interface. After executing action $a_t$ at time $t$, it returns the musical observation together with the online fatigue estimate and the true fatigue. The action set contains several structured practice actions and an explicit rest action REST. For comparability, a common environment-side short-horizon safety guard is applied to every agent. Each agent first proposes a candidate action. The guard then filters actions using the current online fatigue estimate and produces the primary safe set. The environment-side guard also logs whether the proposed action would pass the relaxed diagnostic band. Only actions in this primary set are executed, except for the relaxed fallback case defined in Section Dual-timescale fatigue safety. Some agents, such as PianoMPC and the fatigue-aware LinUCB variant, forecast future fatigue and therefore tend to propose actions that already satisfy the guard. Other agents do not anticipate the guard, but their final executed actions are still filtered by the same environment-side rule. This design keeps the safety mechanism identical across methods and lets the comparison focus on the quality of the decision rules.
Experimental setup
One simulation step corresponds to a practice segment of about 30–60 seconds. This correspondence fixes the scale of the fatigue load, the recovery rate, and the typical time to mastery (TTM). All observation scalings and associated noise levels are estimated on an internal calibration set and then frozen for all experiments. The calibration set contains simulated pilot runs from the three learner profiles and was used only to set observation ranges and noise levels before the reported experiments were run.
Unless specified otherwise, the experiments use K = 5 skill dimensions and a finite action library that includes the REST action. The mastery condition requires a window of W = 3 consecutive successful steps. The key safety hyperparameters are the profile-specific fatigue threshold for the balanced, mild-left-weak, and severe-left-weak profiles, the one-step guard margin, the relaxed diagnostic slack, and the guard horizon. For the main comparison, the MPC planning horizon is H = 3; the horizon sweep separately evaluates other horizon values. The beam search width is b = 20. Dynamics parameters include a memory smoothing factor and a forgetting rate. The reward uses fixed weights and a time-cost coefficient.
The per-step computational complexity of PianoMPC grows with the planning horizon, the beam width, the size of the safe action set, and $T_F$, the cost of one forward prediction. In practice, the finite action library combined with the predictive guard ensures that the safe action set remains small, keeping the controller computationally efficient.
Evaluation metrics
All results are computed by the evaluation module in PianoGym and are averaged across tasks and seeds. Unless otherwise stated, means and 95% confidence intervals are computed across independent seeds within the stated profile or task aggregation unit. For rate metrics, figure displays are clipped to the valid interval [0,1] or [0,100]% without changing the underlying means. The metrics below are defined on a single trajectory in terms of its per-step fatigue values, the environment fatigue threshold, the mastery window W, and the maximum episode length.
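The seed-level aggregation and display clipping described above can be sketched as follows. The text does not state how the 95% intervals are constructed, so the symmetric normal-approximation interval used here (z = 1.96) is an assumption, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def seed_mean_ci(values: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Mean and symmetric CI across independent seeds.

    Assumption: normal approximation with z = 1.96; the interval used in the
    paper may be constructed differently.
    """
    m = mean(values)
    half = z * stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
    return m, m - half, m + half


def clip_for_display(lo: float, hi: float, lower: float = 0.0, upper: float = 1.0) -> Tuple[float, float]:
    """Clip interval endpoints to the valid range; the mean itself is untouched."""
    return max(lower, lo), min(upper, hi)


# Hypothetical per-seed FeasibleRate values close to the upper bound.
rates = [0.98, 1.0, 1.0, 0.99]
m, lo, hi = seed_mean_ci(rates)
lo_disp, hi_disp = clip_for_display(lo, hi)
```

Here the unclipped upper endpoint exceeds 1, so only the displayed interval is truncated while the reported mean stays the same.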
Time to mastery (TTM) quantifies sample efficiency by measuring how many steps are required to achieve stable mastery for a window of length W. Formally, TTM is the first step at which the consecutive-success counter reaches W, and it is set to the maximum episode length if the mastery condition is not satisfied within the episode. TTM is the primary performance metric used to rank agents in the sections from Main comparison under the safety guard through Effect of fatigue-threshold and mastery-window choices (lower is better). The counter records consecutive successful steps up to and including step t and resets to zero whenever any criterion in Eq. (6) is not satisfied.
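The counter-based definition of TTM can be sketched in a few lines. This is a minimal illustration, assuming 1-indexed steps and a per-step Boolean that already encodes whether all mastery criteria were met; the function name and arguments are hypothetical.

```python
from typing import Optional, Sequence


def time_to_mastery(success: Sequence[bool], window: int = 3, t_max: Optional[int] = None) -> int:
    """First step (1-indexed) at which `window` consecutive successes are reached.

    Returns `t_max` when mastery is never reached within the episode.
    Assumption: `success[t-1]` is True when every mastery criterion holds at step t.
    """
    if t_max is None:
        t_max = len(success)
    counter = 0
    for t, ok in enumerate(success[:t_max], start=1):
        counter = counter + 1 if ok else 0  # reset on any failed criterion
        if counter >= window:
            return t
    return t_max


# A failure at step 4 resets the counter, so mastery is reached at step 7.
ttm = time_to_mastery([False, True, True, False, True, True, True], window=3)
```

An episode that never produces W consecutive successes is assigned the episode limit, which is why an agent that always hits the run limit shows no variation in TTM.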
Task return is computed by recording the instantaneous raw reward at each step and aggregating it over the trajectory.
Safety metrics follow the dual-timescale safety design, consisting of a short-horizon guard and a long-horizon fatigue limit. Each fatigue sample is thresholded against the environment fatigue threshold. Following standard practice in constrained and safe reinforcement learning [14], FeasibleRate, OverloadRate, and AvgViolation are reported.
FeasibleRate and OverloadRate are complementary by definition, while AvgViolation measures the magnitude of fatigue violations rather than their frequency. Guard replacement rate is reported as an intervention diagnostic for proposed-versus-executed actions and is not treated as a safety success metric. For this reason, the results are interpreted as fatigue-feasibility outcomes under the simulator budget rather than as absolute safe or unsafe behavior.
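A minimal sketch of these trajectory-level diagnostics is given below. It assumes a strict comparison against the threshold and averaging over all steps of the trajectory; the exact conventions of the PianoGym evaluation module may differ, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def safety_metrics(fatigue: Sequence[float], threshold: float) -> Dict[str, float]:
    """FeasibleRate, OverloadRate, and AvgViolation for one trajectory.

    Assumptions: a step is an overload when fatigue > threshold (strict), and
    AvgViolation averages the excess over all steps, not only violating ones.
    """
    n = len(fatigue)
    overload = sum(1 for f in fatigue if f > threshold) / n
    avg_violation = sum(max(0.0, f - threshold) for f in fatigue) / n
    return {
        "FeasibleRate": 1.0 - overload,  # complementary to OverloadRate
        "OverloadRate": overload,
        "AvgViolation": avg_violation,
    }


def guard_replacement_rate(proposed: Sequence[str], executed: Sequence[str]) -> float:
    """Fraction of steps where the guard replaced the agent's proposed action."""
    replaced = sum(1 for p, e in zip(proposed, executed) if p != e)
    return replaced / len(proposed)


metrics = safety_metrics([0.2, 0.5, 0.9, 1.1], threshold=0.8)
replacement = guard_replacement_rate(["HARD", "EASY", "HARD"], ["REST", "EASY", "HARD"])
```

In the toy trajectory, two of four samples exceed the threshold, so both rates are 0.5, while AvgViolation reports the mean excess of 0.1.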
A progress metric called rhythm independence gain is also reported. This metric uses the same thresholds as the mastery condition defined by the environment and therefore supports cross-task comparison. Additional diagnostics, including memory retention and offline evaluation coverage, appear in the public implementation.
Main comparison under the safety guard
Baselines
All baselines observe the same post-action signals, use the same action set, and are filtered by the same environment-side guard. Our PianoMPC agent and the fatigue-aware LinUCB variant also run the same short-horizon guard inside the policy, whereas the other baselines rely only on the environment-side guard. Thus differences in Section Main comparison under the safety guard come mainly from the decision rule. Thompson [38] is a lightweight posterior-sampling bandit under delayed feedback. LinUCB [39] is a contextual bandit on PianoGym features but does not plan future fatigue. Bayesian MAB [40] adds discounting and change-point handling for non-stationary profiles. CCB-DF [41] learns from delayed counterfactual rewards. DQN [31] is a value-based agent that relies on the external guard for safety. Safe-AC [32] treats fatigue above the threshold as a cost in a Lagrangian actor-critic. Auto-Curriculum [42] orders exercises and inserts REST, mimicking human teaching without MPC-style planning.
Results on three profiles
This section reports the main comparison under the common safety guard. Table 3 summarizes three learner profiles and eight agents. FeasibleRate values range from 0.89 to 0.97 in the main comparison, with nonzero overload rates still possible. For this reason, the primary comparison is mean TTM interpreted together with FeasibleRate and violation diagnostics, rather than a binary safe/unsafe label.
PianoMPC achieves mean TTM values of 24.4, 26.0, and 28.2 steps across the balanced, mild-left-weak, and severe-left-weak profiles, respectively. This range is clearly below the second group, formed by BayesianMAB with 35–40 steps. All other methods require between the low 40s and about 60 steps, with LinUCB, DQN, CCB-DF and AutoCurriculum often around 50–60 steps.
Fig 5 panel (a) shows the same three-level structure. LinUCB keeps the highest FeasibleRate, about two percentage points above PianoMPC, yet this improvement is marginal compared with the extra training time. The large TTM gap appears because LinUCB reacts myopically when fatigue is near the limit and spends many steps on low-value rest choices. PianoMPC instead plans the fatigue budget over the horizon, so it converts the same fatigue allowance into progress. Fig 5 panel (b) follows the time pattern, since shorter runs accumulate less negative reward. Panel (c) reports that some slower methods reach slightly higher coordination, which reflects longer exposure rather than stronger learning rules. Fig 6 shows that PianoMPC improves reward earliest and keeps the advantage through the early practice steps. This behaviour is consistent with planning under post-action observations and with the dual safety design of the environment.
A guard-dependence ablation compares the default environment-side guard, a weaker guard, and no guard. PianoMPC remains the fastest method in all three settings, with macro mean TTM values of 25.67, 24.50, and 25.77 steps and FeasibleRate values of 0.921, 0.920, and 0.927, respectively. Under the default environment-side guard, the guard replacement rate for PianoMPC is 0.000, whereas replacement rates for several reactive baselines are much higher. This indicates that the guard is a common execution interface and diagnostic, rather than the sole driver of PianoMPC’s efficiency.
Fig 5. Panel labels show mean TTM, reward, and coordination gain; error bars are 95% seed CIs.
Fig 6. Curves show sliding-averaged raw reward; shaded bands are 95% seed CIs.
Effect of planning horizon
To assess how much lookahead is needed, the planning horizon H was varied. This test used the balanced profile while all other settings were fixed. The sweep uses the same balanced-profile seed schedule and agent-update pipeline as the main comparison, so the H = 3 point matches the PianoMPC entry in Table 3; H = 10 is an additional sweep setting rather than the table configuration. Fig 7 shows the resulting trade-off between training speed and fatigue feasibility. A myopic policy (H = 1) achieves the lowest balanced-profile mean TTM at 24.1 steps, but its FeasibleRate is 0.902 and its macro FeasibleRate across profiles is 0.850. The H = 1 setting is therefore treated as a low-feasibility setting rather than categorically unsafe.
Fig 7. Left axis shows TTM; right axis shows FeasibleRate on a zoomed scale with visual headroom above 100%; symmetric 95% CIs are clipped to the valid 0–100% range for display.
At the main-comparison setting H = 3, the balanced-profile mean TTM is 24.4 steps and FeasibleRate is 0.945. Longer horizons raise FeasibleRate to 1.000 for H = 5 and H = 10, with balanced-profile mean TTM values of 27.3 and 25.8 steps. This pattern indicates that short lookahead already captures useful near-term fatigue structure, while longer lookahead mainly changes the feasibility-speed trade-off. Thus planning depth, not only the shared guard, explains the efficiency gap.
Effect of agent-level safety components
This experiment evaluates the safety modules inside the agent policy. The short-horizon guard and the soft fatigue penalty are each either enabled or removed. Fig 8 reports a profile-aware bar plot of mean TTM and realized fatigue-overload rate, computed as the complement of FeasibleRate, across all three learner profiles. The environment guard remains enabled for every variant; the ablation removes only agent-side components such as the internal guard or the soft penalty.
Fig 8. Bars show macro means; open markers show profile-specific values. Overload rate denotes the realized trajectory overload rate after the shared environment-side guard.
With both PianoMPC safety components active, the macro mean realized overload rate is 8.2% and mean TTM is 26.2 steps. Removing the soft penalty gives a similar realized overload rate of 8.3% and a mean TTM of 26.7 steps. Thus, under the shared environment-side guard, PianoMPC’s internal guard and replanning dominate this ablation, while the soft penalty has only a small effect on the realized feasibility-speed trade-off.
The LinUCB variants remain slower, with mean TTM values between 49.2 and 53.1 steps. Their realized overload rates stay between 3.5% and 8.0%, indicating a clearer guard-dependent feasibility-speed trade-off for the reactive bandit variants. Safe-AC appears between these groups, with mean TTM of 42.5 steps and an 8.2% realized overload rate, so it improves over the bandit variants but does not match PianoMPC’s efficiency.
Consistency between windowed and peak-guard safety metrics
This experiment checks whether the online guard diagnostics behave consistently as the guard slack changes. Fig 9 reports online peak-violation rate and guard false-negative rate for PianoMPC on the balanced profile. The diagnostic window is held fixed in this plot so that the figure focuses on the guard horizon Hg and the relaxed slack.
Fig 9. Panels show peak-violation and false-negative rates across guard horizons at a fixed diagnostic window.
The online peak-violation rate decreases as the relaxed slack increases, which is expected because the diagnostic threshold becomes less conservative. The false-negative rate remains low across the displayed settings and approaches zero for part of the displayed range. These rates are diagnostics of the guard calibration; they do not mean that violations are allowed as a goal of the method. The online guard is used for action filtering, whereas windowed violation metrics are retained as offline checks of the realized trajectory.
Robustness across task families
This experiment tests whether the advantage of planning holds under different task structures. The suite combines three categories (Balanced, Mild-left-weak, Severe-left-weakness) with three transfer levels (weak, medium, strong). Table 4 reports time to mastery and FeasibleRate for the four strongest agents. The suite includes nonzero overload rates in several severe-left-weakness settings, so these values should be read as fatigue-feasibility diagnostics rather than as binary safety claims. We keep these defaults to test robustness under realistic per-task settings rather than under a single globally tuned constraint.
PianoMPC achieves the lowest mean TTM in every cell of the table. On weak transfer tasks it needs about 43–47 steps, while CCB-DF needs around 78 steps and Thompson stays above 70 steps. BayesianMAB lies between them, at 49–54 steps, and is therefore the second tier. This shows that, when transfer is scarce and fatigue is enforced, planning still converts practice into progress more efficiently than reactive bandits.
When transfer becomes medium or strong, every agent improves, but the gains are not equal. PianoMPC drops to roughly 25 steps for medium transfer and to 14–17 steps for strong transfer across all categories. BayesianMAB and CCB-DF also speed up, yet they remain 10–20 steps slower than the planner. In the severe-left-weakness tasks, PianoMPC is the only agent that stays below 30 mean steps at medium transfer while keeping FeasibleRate above 0.9. This indicates that the planning controller remains effective across this task suite under the evaluated fatigue limits.
Robustness to dynamics misspecification
This experiment studies how sensitive the agents are to errors in the task dynamics. Three environment parameters are scaled around the nominal value one. The parameters control memory decay, fatigue accumulation, and gating sharpness. Fig 10 reports the change in four metrics with respect to the baseline run. Values close to zero indicate stable behaviour under misspecification.
Fig 10. Panels show metric changes from nominal; AutoCurriculum reaches the run limit in scanned settings.
Across all settings, PianoMPC stays close to the zero line. For memory errors, its time to mastery changes by at most about four steps, and FeasibleRate varies within roughly three tenths of a percentage point. Average fatigue and learning slope also show only small shifts. This pattern suggests that online replanning absorbs moderate model errors without extra tuning.
Non-planning agents react much more strongly. For misspecified forgetting rates, LinUCB is affected the most and can require more than 50 additional steps. AutoCurriculum reaches the run limit in every scanned setting, so its sensitivity in time to mastery cannot be observed; its zero sensitivity in some panels is therefore a ceiling artifact rather than evidence of greater robustness. Several bandit baselines also lose around one percentage point of feasibility when the fatigue gain is inaccurate. Misspecification of the gating parameter pushes CCB-DF and LinUCB even further from the origin. These results indicate that methods with fixed exploration schedules are more sensitive to the tested dynamics errors, whereas the predictive controller changes less under the same simulator changes.
Additional scoped simulator-change tests keep the calibrated fatigue model fixed while increasing observation noise and varying transfer/interference strength. In these tests, PianoMPC retained the lowest overall mean TTM and a competitive FeasibleRate when observation noise was increased by two to three times and transfer strength was scaled by 1.2. These results support robustness within the tested simulator changes, but they do not imply robustness to arbitrary model error.
Effect of fatigue-threshold and mastery-window choices
This experiment studies how evaluation hyperparameters affect reported performance. Only the fatigue threshold and the mastery window change; agent policies remain identical. Fig 11 reports both TTM and FeasibleRate for PianoMPC, BayesianMAB, and CCB-DF on the balanced profile, directly addressing whether the ranking depends on a single operational mastery or fatigue definition.
Fig 11. Panels show TTM and FeasibleRate under threshold scaling and mastery windows.
Panel (a) scales the fatigue limit from 0.8 to 1.2 times the nominal value. The PianoMPC curve stays between 22.5 and 25.5 steps. CCB-DF varies between about 35 and 45 steps and peaks at the unscaled threshold. BayesianMAB stays within the 31–36 step band and never overtakes PianoMPC. The corresponding FeasibleRate panel shows that the threshold sweep changes feasibility levels but does not remove PianoMPC’s TTM advantage. Thus the relative ordering in Section Main comparison under the safety guard is not caused by a single threshold choice.
The mastery-window sweep is limited to short windows because longer windows act mainly as stress tests and can obscure the operational definition of TTM. The default W = 3 is marked in both mastery-window panels. PianoMPC remains the fastest method over the displayed window range, while FeasibleRate varies because changing W changes episode termination times and the fatigue samples included in the trajectory average. This behaviour shows that TTM is an operational PianoGym metric, not a universal definition of musical mastery.
Discussion
In PianoGym, model-predictive planning improves mean TTM while maintaining fatigue-feasibility diagnostics under the stated surrogate fatigue model. Under a shared environment-side guard, PianoMPC achieves clearly lower time-to-mastery than reactive bandits. This advantage is supported by the horizon and policy-side safety ablations. The two safety diagnostics are consistent once their purposes are separated. The effect persists across task families, tested dynamics changes, and evaluation hyperparameters. The no-guard and guard-calibration analyses further indicate that PianoMPC’s efficiency is not driven solely by the default environment-side guard. Safety-related results are therefore interpreted as simulator-level fatigue-feasibility findings rather than as claims about real human physiology.
Conclusion
This paper studied sample-efficient learning for fatigue-constrained piano practice under a unified safety interface. PianoGym provides a reproducible simulation benchmark for post-action, fatigue-constrained piano rhythm training, and PianoMPC demonstrates how model-predictive planning can improve mean TTM under this benchmark. Across three learner profiles and the 3 × 3 task suite, PianoMPC achieved the lowest mean TTM while maintaining the reported FeasibleRate range under the evaluated fatigue budget. Ablations on planning depth and agent-level safety components suggest that short lookahead can be sufficient in this simulator and that, under the shared environment-side guard, the soft penalty has only a small effect on the realized feasibility-speed trade-off. Alternative safety diagnostics show that tuning the online guard changes absolute rates without reversing the comparative pattern in the evaluated settings.
Robustness tests show stable behavior under the evaluated dynamics misspecification, scoped simulator changes, guard settings, and mastery definitions. The observed advantage was therefore not limited to the default setting in the main table. The resulting claim is simulator-level: PianoMPC is an effective planner for the stated PianoGym surrogate model and fatigue interface. Open directions include learning guard thresholds, richer state estimation, multi-task scheduling, physical-instrument evaluation, and personalized nonlinear fatigue models.
Limitations
This study has several limitations related to its modeling assumptions. PianoMPC uses a certainty-equivalent design and rolls out a deterministic model from the current filtered estimate. It does not propagate a full posterior over latent skills, fatigue, or observation noise. This choice makes planning efficient for a finite exercise library, but it can understate uncertainty when observations are noisy or learner dynamics change abruptly. PianoGym also represents bimanual rhythm ability with five task-level skill coordinates. These coordinates are coupled by transfer and interference terms. This representation supports controlled benchmarking, but it does not capture the full neuromuscular, biomechanical, or expressive coupling involved in real piano performance.
The scope of the results is also limited by the simulator-level fatigue and evaluation definitions. Fatigue is modeled as a scalar operational index with additive exercise load, linear rest recovery, and a multiplicative effect on learning gain. It is not a validated physiological model of muscular, cognitive, or biomechanical fatigue. Therefore, TTM, FeasibleRate, and mastery thresholds should be interpreted as operational benchmark metrics rather than universal definitions of musical mastery or clinical safety.
References
- 1. Goubault E, Verdugo F, Pelletier J, Traube C, Begon M, Dal Maso F. Exhausting repetitive piano tasks lead to local forearm manifestation of muscle fatigue and negatively affect musical parameters. Sci Rep. 2021;11(1):8117. pmid:33854088
- 2. Takemi M, Akahoshi M, Ushiba J, Furuya S. Behavioral and physiological fatigue-related factors influencing timing and force control learning in pianists. Sci Rep. 2023;13(1):21646. pmid:38062126
- 3. Ito K, Watanabe T, Horinouchi T, Matsumoto T, Yunoki K, Ishida H, et al. Higher synchronization stability with piano experience: relationship with finger and presentation modality. J Physiol Anthropol. 2023;42(1):10. pmid:37337272
- 4. Barchet AV, Henry MJ, Pelofi C, Rimmele JM. Auditory-motor synchronization and perception suggest partially distinct time scales in speech and music. Commun Psychol. 2024;2(1):2. pmid:39242963
- 5. Whitton SA, Jiang F. Sensorimotor synchronization with visual, auditory, and tactile modalities. Psychol Res. 2023;87(7):2204–17. pmid:36773102
- 6. Roman IR, Roman AS, Kim JC, Large EW. Hebbian learning with elasticity explains how the spontaneous motor tempo affects music performance synchronization. PLoS Comput Biol. 2023;19(6):e1011154. pmid:37285380
- 7. Olszewska AM, Gaca M, Droździel D, Jednoróg K, Marchewka A, Herman AM. Piano Training Induces Dynamic Neuroplasticity of Bimanual Coordination but Not Auditory Processing in Young Adults. J Neurosci Res. 2025;103(7):e70067. pmid:40650444
- 8. Labrou K, Zaman CH, Turkyasar A, Davis R. Following the Master’s Hands: Capturing Piano Performances for Mixed Reality Piano Learning Applications. In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 2023. 1–8. https://doi.org/10.1145/3544549.3585838
- 9. Brink KJ, Kim SK, Sommerfeld JH, Amazeen PG, Stergiou N, Likens AD. Pink noise promotes sooner state transitions during bimanual coordination. Proc Natl Acad Sci U S A. 2024;121(31):e2400687121. pmid:39042677
- 10. Deosdad-Díez M, Marco-Pallarés J. Note-by-note predictability modulates rhythm learning and its neural components. NPJ Sci Learn. 2025;10(1):59. pmid:40841516
- 11. Lender A, Perdikis D, Gruber W, Lindenberger U, Müller V. Dynamics in interbrain synchronization while playing a piano duet. Ann N Y Acad Sci. 2023;1530(1):124–37. pmid:37824090
- 12. Vuppala SRH, Allen N, Pinisetty S, Roop P. A Formal Approach for Safe Reinforcement Learning: A Rate-Adaptive Pacemaker Case Study. In: International Conference on Runtime Verification. Springer; 2024. p. 3–21. https://doi.org/10.1007/978-3-031-74234-7_1
- 13. Wang J, Gao R, Zha H. Reliable Off-Policy Evaluation for Reinforcement Learning. Operations Research. 2024;72(2):699–716.
- 14. Achiam J, Held D, Tamar A, Abbeel P. Constrained Policy Optimization. In: International conference on machine learning, 2017. 22–31.
- 15. Bastani O, Li S. Safe Reinforcement Learning via Statistical Model Predictive Shielding. In: Robotics: Science and Systems XVII, 2021. https://doi.org/10.15607/rss.2021.xvii.026
- 16. Chen J, Xue W, Tan X, Ye Z, Liu Q, Guo Y. FastSAG: towards fast non-autoregressive singing accompaniment generation. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024. 7618–26. https://doi.org/10.24963/ijcai.2024/843
- 17. Wu Y, Cooijmans T, Kastner K, Roberts A, Simon I, Scarlatos A. In: 2024. 53328–45.
- 18. Moor D, Yuan Y, Mehrotra R, Dai Z, Lalmas M. Exploiting Sequential Music Preferences via Optimisation-Based Sequencing. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023. 4759–65. https://doi.org/10.1145/3583780.3615476
- 19. Wang H, Zhang X, Iida F. Human–Robot Cooperative Piano Playing With Learning-Based Real-Time Music Accompaniment. IEEE Trans Robot. 2024;40:4650–69.
- 20. Wang H, Nonaka T, Abdulali A, Iida F. Coordinating upper limbs for octave playing on the piano via neuro-musculoskeletal modeling. Bioinspir Biomim. 2023;18(6):10.1088/1748-3190/acfa51. pmid:37714178
- 21. Scimeca L, Ng C, Iida F. Gaussian process inference modelling of dynamic robot control for expressive piano playing. PLoS One. 2020;15(8):e0237826. pmid:32797107
- 22. Mårup SH, Møller C, Vuust P. Coordination of voice, hands and feet in rhythm and beat performance. Sci Rep. 2022;12(1):8046. pmid:35577815
- 23. Garzón B, Helms G, Olsson H, Brozzoli C, Ullén F, Diedrichsen J, et al. Cortical changes during the learning of sequences of simultaneous finger presses. Imaging Neurosci (Camb). 2023;1:imag–1–00016. pmid:40799721
- 24. Jünemann K, Engels A, Marie D, Worschech F, Scholz DS, Grouiller F, et al. Increased functional connectivity in the right dorsal auditory stream after a full year of piano training in healthy older adults. Sci Rep. 2023;13(1):19993. pmid:37968500
- 25. Kohler N, Novembre G, Gugnowska K, Keller PE, Villringer A, Sammler D. Cortico-cerebellar audio-motor regions coordinate self and other in musical joint action. Cereb Cortex. 2023;33(6):2804–22. pmid:35771593
- 26. Bhatnagar S, Jayant AK. Model-Based Safe Deep Reinforcement Learning Via a Constrained Proximal Policy Optimization Algorithm. In: Advances in Neural Information Processing Systems 35, 2022. 24432–45. https://doi.org/10.52202/068431-1774
- 27. Ames AD, Coogan S, Egerstedt M, Notomista G, Sreenath K, Tabuada P. Control Barrier Functions: Theory and Applications. In: 2019 18th European Control Conference (ECC), 2019. 3420–31. https://doi.org/10.23919/ecc.2019.8796030
- 28. Xiao W, Lyu Y, Dolan J. Model-based Dynamic Shielding for Safe and Efficient Multi-agent Reinforcement Learning. In: International Joint Conference on Autonomous Agents and Multiagent Systems, 2023. 1587–96. https://doi.org/10.65109/wdom6367
- 29. Bejarano FP, Brunke L, Schoellig AP. Safety Filtering While Training: Improving the Performance and Sample Efficiency of Reinforcement Learning Agents. IEEE Robot Autom Lett. 2025;10(1):788–95.
- 30. Vaaler A, Husa SJ, Menges D, Larsen TN, Rasheed A. Modular control architecture for safe marine navigation: Reinforcement learning with predictive safety filters. Artificial Intelligence. 2024;336:104201.
- 31. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. pmid:25719670
- 32. Adjei P, Tasfi N, Gomez-Rosero S, Capretz MAM. Safe Reinforcement Learning for Arm Manipulation with Constrained Markov Decision Process. Robotics. 2024;13(4):63.
- 33. Carr S, Jansen N, Junges S, Topcu U. Safe Reinforcement Learning via Shielding under Partial Observability. AAAI. 2023;37(12):14748–56.
- 34. Zhang A, Wan W, Harada K. Fast pivoting gait generation by model predictive control designed with basis functions. Advanced Robotics. 2022;36(15):735–49.
- 35. Qiao Y, Wei W, Li Y, Xu S, Wei L, Hao X, et al. CPG-MPC controller for wheel-fin-flipper integrated amphibious robot. IR. 2023;50(6):900–16.
- 36. Chai F, Ge Q, Yin Y, Li D, Wang Y. Modeling and control of dissolved oxygen in recirculating aquaculture systems: A circadian rhythm analysis approach and GSMPC controller. Computers and Electronics in Agriculture. 2024;227:109515.
- 37. Zhang T, Wang X, Chen G, Liu F, Zha F, Guo W. Walk2Run: A Bio-Rhythm-Inspired Unified Control Framework for Humanoid Robot Walking and Running. J Bionic Eng. 2025;22(6):2849–63.
- 38. Chapelle O, Li L. An empirical evaluation of thompson sampling. Advances in Neural Information Processing Systems. 2011;24.
- 39. Silva N, Werneck H, Silva T, Pereira ACM, Rocha L. Multi-Armed Bandits in Recommendation Systems: A survey of the state-of-the-art and future directions. Expert Systems with Applications. 2022;197:116669.
- 40. Santana P, Moura J. A Bayesian Multi-Armed Bandit Algorithm for Dynamic End-to-End Routing in SDN-Based Networks with Piecewise-Stationary Rewards. Algorithms. 2023;16(5):233.
- 41. Cai R, Lu R, Chen W, Hao Z. Counterfactual contextual bandit for recommendation under delayed feedback. Neural Comput & Applic. 2024;36(23):14599–613.
- 42. Wei Y, Zhang H, Wang Y, Huang C. Maneuver Decision-Making through Automatic Curriculum Reinforcement Learning without Handcrafted Reward Functions. Applied Sciences. 2023;13(16):9421.