PianoGym: Safe post-action piano rhythm training with fatigue constraints

Xiaoyu Meng; Hui Shi; Ningning Liu; Zhuangzhuang Pan; Yan Xia

doi:10.1371/journal.pone.0351141

Abstract

Bimanual piano rhythm training must maintain precise interlimb timing under limited practice time and under fatigue constraints, while feedback on performance is typically available only after an exercise. A piano practice gym environment (PianoGym) is used as a reproducible simulator for fatigue-constrained piano rhythm training under post-action feedback. The training task is formulated as a fatigue-constrained, post-action, partially observable Markov decision process (POMDP). In this POMDP, a controller observes beat asynchrony, dominance gap, synchronization fidelity, and two fatigue signals, and selects the next exercise from a finite library of structured practice actions. To handle delayed measurements and fatigue feasibility under the simulator budget, we introduce a dual-timescale safety layer. The slow Lagrangian part tracks a long-horizon average constraint using revealed true fatigue, while the fast predictive guard screens candidate actions using the online fatigue estimate. On top of this layer, a piano model predictive controller (PianoMPC) uses certainty-equivalent planning: it performs finite-horizon rollouts over a calibrated surrogate environment model and searches only within guard-filtered action sets. In the main three-profile experiment, PianoMPC achieves mean time-to-mastery values of 24.4 to 28.2 steps and FeasibleRate values of 0.90 to 0.95 under the shared environment-side guard. Under the same environment-side guard, it also outperforms bandit and value-based agents. These results indicate that model-predictive planning can convert a fixed operational fatigue budget into faster progress in fatigue-aware piano practice within the PianoGym simulator and its stated surrogate fatigue and skill-dynamics assumptions.

Citation: Meng X, Shi H, Liu N, Pan Z, Xia Y (2026) PianoGym: Safe post-action piano rhythm training with fatigue constraints. PLoS One 21(6): e0351141. https://doi.org/10.1371/journal.pone.0351141

Editor: Bruno Alejandro Mesz, Universidad Nacional de Tres de Febrero, ARGENTINA

Received: December 2, 2025; Accepted: May 22, 2026; Published: June 16, 2026

Copyright: © 2026 Meng et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All source code and simulation data necessary to reproduce the reported results are publicly available in the PianoGym repository at https://github.com/xiay9/PianoGym. The repository includes the simulator, controller and baseline implementations, experiment scripts, plotting scripts, dependency file, generated result files, and instructions for reproducing the manuscript tables and figures. This study uses only simulated data and does not rely on external or human-subject datasets.

Funding: This work was supported by the High-level Talent Start-up Fund of Xi’an International University under grant XAIU202547 to Xiaoyu Meng. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Bimanual piano rhythm training requires precise interlimb timing while operating under fatigue limits and limited practice time [1,2]. In typical instructional settings, the coordination state is inferred only after an exercise, from three music-specific observables, namely beat asynchrony (ms), dominance gap (ms), and synchronization fidelity in [0,1]. The next exercise must be selected before new measurements become available [3,4]. This post-action observation pattern is further complicated by recent findings that synchronization stability depends on effector, modality, and presentation order, and that training trajectories are non-monotonic and context-dependent [5–8]. These results imply that practice should be adapted to measurements rather than follow fixed routines. In this paper, safety refers to fatigue feasibility under an explicitly defined simulator budget, not to clinical or physical safety certification for human learners. Under these conditions, we must design a controller that converts a limited fatigue budget into faster progress while keeping both average and peak fatigue within allowed limits.

Related work can be grouped in three directions. First, music motor research has characterized rhythmic coordination, effector dependence, rate specificity, and variability profiles. However, these results are typically analyzed offline and are seldom converted into online decision rules that operate directly on music-specific observables and current fatigue levels [3,4,7,9–11]. Second, safe and constrained decision making has developed runtime enforcement and Lagrangian-style methods. Yet many controllers still encode safety only through soft penalties or keep evaluation separate from selection, which makes it difficult to reuse the same safety interface during data collection and testing [12–15]. Third, planning and sequencing for music sessions and real-time accompaniment have shown the benefit of goal-aware ordering and model-based synchronization. However, learner-side skill transfer, interference across limbs, and explicit fatigue constraints are often omitted [16–19]. A unified formulation is therefore required to connect structured skill dynamics, post-action observations, and explicit safety constraints in a form that can be deployed in instructional environments.

The present study considers a constrained training problem in which a controller receives post-action observations of asynchrony, dominance gap, and synchronization fidelity. From these observations, it must select the next structured practice action from a finite library. Each action is tagged with difficulty, skill targets, learning and transfer parameters, time cost, and a positive fatigue increment. The setting exhibits delayed measurements and partial observability of multi-skill proficiency. It also features heterogeneous transfer across left hand, right hand, polyrhythm, and switching skills, together with limited exploration due to safety requirements and lesson length. The central challenges are to act on post-action observations, to model transfer, interference, fatigue accumulation, and retention under noise, and to enforce safety on two time scales while preserving progress toward mastery.

To address these challenges, a piano practice gym environment (PianoGym) is defined. PianoGym implements a post-action, fatigue-constrained partially observable Markov decision process (POMDP) with a structured practice library and music-specific observables. A dual-timescale safety layer combines a Lagrangian update that adapts to long-horizon average fatigue with a short-horizon predictive guard that filters actions whose predicted fatigue would violate the safety margin. On top of this layer, piano model predictive control (PianoMPC) rolls out the model over a finite horizon and searches only within the safe action sets returned by the guard. Experiments in the PianoGym environment compare this controller with bandit and value-based agents under the same environment-side guard in order to isolate the effect of planning depth and agent-level safety components. The main contributions of this paper include:

We formulate fatigue-aware bimanual piano rhythm training as a post-action POMDP that connects structured practice actions, music-specific observables, latent skill dynamics, and fatigue constraints in a single reproducible benchmark.
We introduce PianoMPC, a certainty-equivalent model-predictive controller that performs fatigue-aware lookahead planning over a finite practice library. This design enables the controller to allocate the same simulator fatigue budget more effectively toward faster mastery.
We evaluate PianoMPC across three learner profiles and a 3 × 3 task suite under a shared environment-side guard. PianoMPC reduces mean TTM compared with reactive and value-based baselines while maintaining simulator-level fatigue feasibility.

The remainder of the paper is organized as follows. Section Related work reviews related work. Section Methods presents the modeling assumptions, the safety mechanisms, and the PianoMPC controller. Section Experiments describes the task suite, learner profiles, and evaluation metrics, and reports results on efficiency, feasibility, and robustness. Section Conclusion concludes and outlines possible extensions. Section Limitations discusses the scope of the certainty-equivalence, fatigue, and skill-coupling assumptions.

Related work

Observables and coordination in piano practice

Music-facing observables obtained from Musical Instrument Digital Interface (MIDI) or sensor systems make it possible to condition practice on measured coordination changes. Studies on effectors, modality, and tempo show that synchronization stability depends on hand, presentation, and rate rather than being uniform across conditions, which motivates interfaces that operate directly on these music-specific measurements [3,4]. Recent piano-specific modeling work shows that skilled piano control cannot be reduced to simple timing accuracy alone. Upper-limb coordination during octave playing depends on biomechanical constraints [20]. Expressive piano output also reflects nonlinear mappings between dynamic control parameters and sound [21]. Brink et al. and Maarup et al. further reported that variability structure and a bodily hierarchy across voice, hands, and feet shape coordination transitions. These findings in turn motivate explicit noise modeling and the use of standardized observables for practice controllers [9,22]. Longitudinal neuroimaging on piano training further indicates context-dependent, non-monotonic plasticity, so progress cannot be treated as stationary during practice [7,23,24].

Fatigue studies on pianists show that repetitive or demanding segments induce local muscular fatigue and degrade timing and key velocity. They also show that subjective rest does not always match objective indicators [1,2]. These findings support returning a noisy online fatigue estimate for immediate action filtering together with a delayed true fatigue value for long-timescale statistics. Interpersonal and duet studies report that joint action and perturbations modulate interbrain synchronization and the balance of self-other integration [11,25]. Such effects are outside the scope of the single-learner, post-action interface considered here and are therefore not modeled.

Safety in sequential decision making

Safe and constrained reinforcement learning encodes performance under cost limits through Lagrangian or proximal updates, which yields policies that track average constraints during learning [14,26]. Complementary lines of work enforce safety through certified sets and predictive shielding that override unsafe actions near constraint boundaries and provide forward-invariance style guarantees [15,27]. More recent studies combine runtime enforcement with learned policies in safety-critical systems, where an external enforcer corrects an agent whenever formal safety rules would be violated [12,28–30].

A large fraction of these approaches assumes fully observed states or synchronous cost feedback and operates on a single time scale, as in representative value-based reinforcement learning under full observability [31,32]. Shielding under partial observability is closely related because it uses a model-based safety filter when the agent state is incomplete [33]. In fatigue-aware piano rhythm training the controller receives post-action music metrics and two distinct fatigue signals. Safety must therefore be split into a slow Lagrangian adaptation that uses revealed true fatigue and a fast predictive guard that uses the online estimate. Experiments in PianoGym are designed with a shared environment-side guard so that different decision rules can be compared under the same fatigue budget.

Planning and model predictive control for practice

Optimization-based sequencing on music platforms shows that exploiting position-aware and locally sequential preferences increases within-session engagement. This supports ordered practice rather than independent ranking of exercises [18]. Model-based real-time accompaniment further demonstrates that temporal alignment with human performers improves when the controller adapts online [19]. These studies validate goal-aware, model-based lookahead in musical interaction. However, they generally do not model learner-side skill transfer, interference across tasks, or fatigue-limited practice budgets, and the control inputs are typically not drawn from a finite library.

Model predictive controllers have been adapted for rhythmic or periodic motor tasks. Examples include basis-function parameterization for fast gait generation and combinations of central pattern generators with MPC [34,35]. Gain-scheduled and bio-rhythm-informed MPC maintains performance under time-varying operating points, which indicates that predictive controllers can respect physiological cycles without losing responsiveness [36,37]. The piano practice setting considered here differs because actions come from a finite, structured library with tagged difficulty, skill targets, and positive fatigue increments, and planning is restricted to guard-filtered safe action sets. The combination of post-action observability, dual-timescale safety, and library-based MPC is not addressed in the above MPC and sequencing literature.

Methods

Problem statement and interaction protocol

Fatigue-aware piano practice is modeled as a constrained, post-action, partially observable Markov decision process (CPOMDP) with music-specific observables and two fatigue signals. This common interface makes it possible to analyze both sample efficiency and safety.

Post-action partially observable process

At decision step t, the learner occupies an unobserved latent state

\[ s_t = (x_t, f_t, m_t) \tag{1} \]

where x_t contains the proficiencies of K rhythm-related skills, f_t is the accumulated cognitive and motor fatigue, and m_t is a memory or retention trace of the same dimension.

The controller maintains an internal estimate and chooses a structured practice action according to

(2)

The environment then evolves according to the true state

(3)

that is, the transition kernel depends on (s_t, a_t) and not on the estimate. After the transition, the environment reveals three post-action signals

(4)

where

(5)

collects the music-facing measurements (beat asynchrony, dominance gap, and synchronization fidelity). A second signal is a noisy online fatigue estimate intended for immediate safety screening, and a third is the revealed true fatigue intended for long-run constraint adaptation. The scalar r_t denotes the immediate reward revealed after executing a_t; it is produced by a reward map from the post-action observation and the executed action. The two fatigue signals are consumed by different safety components in Section Dual-timescale fatigue safety. Policies in Section PianoMPC controller are defined on the filtered state estimate, as shown in Fig 1.


Fig 1. Post-action interaction protocol.

https://doi.org/10.1371/journal.pone.0351141.g001

Objective and constraints

Rhythm mastery is declared once

(6)

hold for W = 3 consecutive decision steps, where W denotes the consecutive-mastery window. The main performance indicator is the time to mastery (TTM), defined as the smallest index t at which Eq. (6) is satisfied, with a fallback value assigned if the condition is not met within the episode.

Because every practice segment induces fatigue, the process is constrained by the long-run average of the true fatigue:

(7)

with a task-dependent fatigue threshold, where the average is taken over the episode length in the PianoGym environment. The control objective is therefore

(8)

Here the constraint term denotes the expected long-run fatigue cost under the policy, and the expectation is taken over trajectories induced by the policy and the simulator dynamics. In the PianoGym instantiation used for experimentation, the cumulative reward in Eq. (8) is monotone in TTM, so reporting TTM is sufficient to compare controllers. The reward-based formulation is nevertheless retained because the model-predictive controller in Section PianoMPC controller optimizes a finite-horizon surrogate of Eq. (8).

Environment and predictive model

The controller performs model-predictive rollouts and therefore requires a deterministic approximation of the environment dynamics. This section makes the abstract CPOMDP in Section Problem statement and interaction protocol concrete by specifying the latent variables, the structured action library, the skill and fatigue updates, and the stochastic observation layer from which rewards are computed. All parameters below are treated as known constants of the PianoGym environment and are not learned by the controller. This makes PianoMPC a certainty-equivalent planner: it plans from the current filtered estimate as if that estimate were the rollout state, rather than maintaining a full posterior belief over latent skills.

Parameterization and calibration

PianoGym uses a fixed parameterization so that all controllers are compared in the same simulator rather than learning a different environment model. The action-library structure, skill targets, transfer signs, and fatigue-load ordering were manually specified from the musical roles of the exercises. The numerical learning rates, forgetting rate, fatigue gain, rest recovery rate, observation scales, and noise levels were then set once on an internal calibration suite of simulated pilot profiles spanning balanced, mild-left-weak, and severe-left-weak learners. The calibration target was not to fit a human cohort, but to place asynchrony, dominance gap, fidelity, fatigue, and TTM in interpretable ranges aligned with the evaluation thresholds. After this calibration step, the parameters were frozen for every baseline, ablation, and robustness experiment. Sensitivity to these choices is assessed in the dynamics-mismatch, scoped-mismatch, threshold-window, and guard-dependence analyses in Section Experiments. Table 1 summarizes these parameter groups and their sources.


Table 1. Parameter sources in PianoGym.

https://doi.org/10.1371/journal.pone.0351141.t001

State and action space instantiation

The latent skill vector contains K = 5 rhythm-relevant components

(9)

describing left-hand control, right-hand control, 2:3 polyrhythm coordination, 3:4 polyrhythm coordination, and fast switching. The memory trace has the same dimension and plays the role of a smoothed retention level.

The action set is a finite library of predefined structured practice actions and is exposed to the controller as its discrete action space. Each action is encoded as

(10)

where the components are, in order, a difficulty tag, a target vector that specifies which skills the segment targets, a vector of per-skill learning rates, a sparse transfer or interference matrix, the positive fatigue load, and the execution time cost. This representation separates the musical target, the learning intensity, and the physiological load. The library also contains an explicit REST action,

(11)

which is kept available at every decision step.

Skill dynamics

Skill adaptation results from diminishing returns, difficulty matching, and fatigue attenuation. For an executed action a_t,

(12)

where the diminishing-returns factor reduces gains near mastery. The difficulty-matching gate, shown in Fig 2, is used to filter candidate actions so that their tagged difficulty stays close to the current skill estimate. It is given by

(13)

and becomes active when the projected skill level approaches the declared difficulty d.

Fig 2. Difficulty-matching gate mechanism.

https://doi.org/10.1371/journal.pone.0351141.g002

Fatigue attenuates learning through

(14)

with a fixed attenuation rate and a lower bound on the attenuation factor.

Transfer, interference, and process noise are then added:

(15)

The next skill vector is clipped element-wise to the admissible range

(16)

where, for a vector v, the clipping operator applies to every coordinate v_i. This clipping is part of the environment step and is mirrored by the controller.

Fatigue and memory dynamics

For practice actions other than REST, fatigue increases with the action load:

(17)

For the REST action, fatigue recovers at a fixed recovery rate:

(18)

Retention is maintained through an exponential smoother:

(19)

and forgetting is represented as a small decay proportional to missing retention:

(20)

with a fixed decay rate. This decay is also part of the dynamics mirrored by the controller.

Observation and reward generation

Given the latent state and the executed action a_t, the environment samples the three music-facing observables (asynchrony, dominance gap, and fidelity) from explicit action-dependent observation equations. Let w_A be the fixed skill weights used for asynchrony. The equations also use action-specific scale parameters and an action-specific dominance-gap offset.

(21)

where . The dominance gap is generated as

(22)

where . The fidelity signal is a clipped logistic-normal proxy

(23)

These equations make the quantitative link explicit: higher relevant skill lowers asynchrony and improves fidelity, left-right imbalance changes dominance gap, and higher fatigue increases timing error and lowers fidelity. They are simulator observation equations, not fitted physiological measurement laws. The online fatigue estimate is obtained by

(24)

To aggregate heterogeneous observables into a single learning signal, each measurement v is standardized using fixed reference statistics:

(25)

Here z(v)⁺ denotes the positive part of the standardized residual. Asynchrony and dominance gap are standardized in this way. Fidelity already lies in [0,1] and is used in raw form. The immediate reward is

(26)

where the weights on the observables are fixed reward weights, the duration term is the time cost of the executed action, and the remaining coefficient scales that cost. The reward penalizes poor musical coordination and long practice segments and keeps duration-related effects comparable across actions.
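
As an illustration of how the standardized positive part and the reward aggregation fit together, the following Python sketch may help. It is not the paper's Eq. (26): the function names, the reference statistics in `stats`, the default weights, the time coefficient, and the signs of the individual terms are all assumptions chosen only to match the verbal description (penalize excess asynchrony, excess dominance gap, and duration; use fidelity in raw form).

```python
def positive_z(value, mean, scale):
    """Positive part of the standardized residual: max(0, (value - mean) / scale)."""
    return max(0.0, (value - mean) / scale)


def immediate_reward(async_ms, domgap_ms, fidelity, duration,
                     stats, weights=(1.0, 1.0, 1.0), time_coef=0.1):
    """Illustrative reward aggregation (assumed signs and weights)."""
    w_async, w_gap, w_fid = weights
    # Asynchrony and dominance gap are standardized; only excess over the
    # reference mean is penalized.
    z_async = positive_z(async_ms, *stats["async"])
    z_gap = positive_z(domgap_ms, *stats["domgap"])
    # Fidelity already lies in [0, 1] and enters in raw form.
    return (-w_async * z_async - w_gap * z_gap
            + w_fid * fidelity - time_coef * duration)


# Hypothetical reference statistics: (mean, scale) per observable.
example_stats = {"async": (20.0, 10.0), "domgap": (10.0, 5.0)}
print(immediate_reward(30.0, 10.0, 0.8, 1.0, example_stats))
```

Because only the positive part is penalized, measurements that are already better than the reference level contribute nothing further to the reward in this sketch.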

Dual-timescale fatigue safety

Fatigue must remain within specified limits both on average and at the next decision instant. Safety is therefore enforced on two coupled time scales, as shown in Fig 3. A slow Lagrangian adaptation tracks the long-run constraint in Eq. (7) using the revealed true fatigue. A fast predictive guard filters unsafe actions using the online estimate. In this paper, the safety layer refers to the simulator-level fatigue-feasibility interface used to screen practice actions. During experimentation, the same one-step guard is implemented as an environment-side guard for all agents in order to provide a common safety interface. The environment-side guard takes precedence and can overwrite any action proposed by a controller according to a fixed fallback rule.


Fig 3. Dual-timescale safety layer.

https://doi.org/10.1371/journal.pone.0351141.g003

Slow timescale: Lagrangian adaptation

At the beginning of step t, the controller has access to the true fatigue f_t that was revealed at the end of the previous transition. An exponential moving average of revealed fatigue is updated as

(27)

and the Lagrange multiplier for fatigue is advanced by projected gradient ascent

(28)

The maximum operator keeps the Lagrange multiplier nonnegative, as required for the fatigue inequality constraint. The multiplier converts the average constraint into an instantaneous penalty and is kept fixed for the entire decision step t. During planning in Section PianoMPC controller, this multiplier penalizes predicted trajectories whose fatigue rises above the threshold.
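
The slow-timescale update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a transcription of Eqs. (27) and (28): the function name `lagrangian_step`, the smoothing weight `alpha`, the step size `eta`, and the exact form of both updates are hypothetical.

```python
def lagrangian_step(f_bar, lam, f_true, f_max, alpha=0.1, eta=0.05):
    """One slow-timescale update (illustrative form, assumed parameters)."""
    # Exponential moving average of the revealed true fatigue.
    f_bar_new = (1.0 - alpha) * f_bar + alpha * f_true
    # Projected gradient ascent: the multiplier grows while the running
    # average exceeds the threshold, shrinks otherwise, and is clamped at 0.
    lam_new = max(0.0, lam + eta * (f_bar_new - f_max))
    return f_bar_new, lam_new


# Example: the multiplier stays at zero until the running average of
# fatigue crosses the (hypothetical) threshold of 0.3.
f_bar, lam = 0.0, 0.0
for f_true in [0.2, 0.5, 0.9, 0.9, 0.9, 0.9]:
    f_bar, lam = lagrangian_step(f_bar, lam, f_true, f_max=0.3)
    print(round(f_bar, 4), round(lam, 4))
```

The projection onto the nonnegative reals is what allows the penalty to vanish entirely when the learner is comfortably within budget, so the planner is not discouraged from practicing in that regime.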

Fast timescale: predictive guard

The fast component restricts actions that would produce an immediate overload. At step t, the internal fatigue estimate (computed in Section PianoMPC controller) is combined with the deterministic dynamics in Eqs. (17)–(18) and the action-specific loads to predict the next-step fatigue for every candidate action. Two margins are used for two different purposes. The conservative one-step margin defines a primary step cap, whereas the relaxed guard slack defines a peak diagnostic threshold used by the environment-side guard and by guard-consistency experiments. The primary one-step set is

(29)

For a guard horizon H_g, the relaxed peak condition is

(30)

The executed guard uses Eq. (29) as the ordinary action filter, and REST is always available. The relaxed band in Eq. (30) is logged for calibration and can affect execution only in the fallback case where no non-rest primary action survives; otherwise it does not enlarge the ordinary safe set. Accordingly, the one-step margin is used as the conservative execution cap, whereas the relaxed slack is used only as a relaxed peak-diagnostic threshold. When the environment-side guard is present, the environment applies the same interface to the finally proposed action.
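
A simplified sketch of the primary one-step filter is given below. It rests on several assumptions that are not taken from the paper: fatigue is predicted as additive in the action load and as multiplicative decay under REST, the cap is taken to be the threshold minus the one-step margin, and the fallback rule simply substitutes REST. The relaxed band of Eq. (30) is omitted, and all names and numbers are hypothetical.

```python
REST = "REST"


def predict_next_fatigue(f_hat, action, loads, rest_rate):
    """Assumed one-step fatigue model: additive load, multiplicative rest recovery."""
    if action == REST:
        return (1.0 - rest_rate) * f_hat
    return f_hat + loads[action]


def primary_safe_set(f_hat, loads, f_max, margin, rest_rate=0.3):
    """Practice actions whose predicted next-step fatigue stays under the cap, plus REST."""
    cap = f_max - margin  # assumed form of the conservative step cap
    safe = [a for a in loads
            if predict_next_fatigue(f_hat, a, loads, rest_rate) <= cap]
    safe.append(REST)  # REST is kept available at every decision step
    return safe


def environment_guard(proposed, f_hat, loads, f_max, margin, rest_rate=0.3):
    """Pass the proposed action through, or overwrite it with REST (assumed fallback)."""
    safe = primary_safe_set(f_hat, loads, f_max, margin, rest_rate)
    return proposed if proposed in safe else REST


# Hypothetical two-action library: fatigue load per practice action.
loads = {"easy_left_hand": 0.1, "hard_polyrhythm": 0.5}
print(primary_safe_set(0.4, loads, f_max=1.0, margin=0.2))
print(environment_guard("hard_polyrhythm", 0.4, loads, f_max=1.0, margin=0.2))
```

Applying the same `environment_guard` to every agent's proposal is what makes the comparison across decision rules share one safety interface.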

PianoMPC controller

This section introduces PianoMPC, a finite-horizon, certainty-equivalent controller that plans only over actions certified as safe by the dual-timescale mechanism. The controller uses the deterministic model in Section Environment and predictive model, the multipliers in Section Dual-timescale fatigue safety, and the same structured action library as the environment.

Certainty-equivalent state tracking

At the start of step t, immediately after receiving the post-action signals generated by the previous action, the controller forms an updated state estimate.

The update is carried out in two stages.

First, a prediction step propagates the previous estimate through the deterministic model:

(31)

where the deterministic model applies the noise-free dynamics in Eqs. (12)–(20), including the fatigue updates in Eqs. (17)–(18). This produces a prior for step t.

Second, a correction step aligns the fatigue component with the revealed true fatigue while keeping the skill and memory components unchanged:

(32)

(33)

(34)

This certainty-equivalent design is motivated by two characteristics of the setting. Musical observables (Async, DomGap, Fid) are noisy and task dependent, so back-propagating them into all skill coordinates at every step would increase variance and computational load. In contrast, the environment reveals the true fatigue without noise. Using it to anchor the fatigue component of the estimate yields an accurate fatigue signal for both the slow Lagrangian update and the fast guard. The observables are therefore used through the deterministic observation and reward model in Section Environment and predictive model and are not used for state correction.
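
The two-stage update (predict with the deterministic model, then overwrite only the fatigue component with the revealed value) can be sketched as follows. The `Estimate` container, the function `track_state`, and the toy dynamics in `toy_model` are hypothetical stand-ins; the actual noise-free dynamics are those of Eqs. (12)–(20).

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Estimate:
    skills: Tuple[float, ...]
    fatigue: float
    memory: Tuple[float, ...]


def track_state(prev: Estimate, prev_action: str, revealed_fatigue: float,
                model: Callable[[Estimate, str], Estimate]) -> Estimate:
    """Certainty-equivalent tracking: predict, then anchor fatigue."""
    # Prediction step: propagate the previous estimate through the model.
    prior = model(prev, prev_action)
    # Correction step: replace the fatigue component with the revealed true
    # fatigue; skill and memory components are left as predicted.
    return replace(prior, fatigue=revealed_fatigue)


def toy_model(est: Estimate, action: str) -> Estimate:
    """Placeholder dynamics used only to make the sketch runnable."""
    gain = 0.0 if action == "REST" else 0.1
    skills = tuple(min(1.0, x + gain) for x in est.skills)
    fatigue = 0.5 * est.fatigue if action == "REST" else est.fatigue + 0.2
    return Estimate(skills, fatigue, est.memory)


est = Estimate(skills=(0.2, 0.4), fatigue=0.1, memory=(0.2, 0.4))
print(track_state(est, "practice", revealed_fatigue=0.35, model=toy_model))
```

One consequence of this design, visible in the sketch, is that any error in the predicted skill components is never corrected by the observables; it is only reset in the fatigue coordinate.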

Finite-horizon planning

Given the current state estimate and the current multiplier, the controller solves a receding-horizon problem of length H, as shown in Fig 4. For a candidate action sequence of length H, the deterministic model is rolled forward as

(35)

starting from the current estimate. At each lookahead step t + h, the predictive guard in Section Dual-timescale fatigue safety is applied to produce the future safe set using the predicted fatigue contained in the predicted state.

Fig 4. Model predictive control with receding horizon.

https://doi.org/10.1371/journal.pone.0351141.g004

The immediate reward at lookahead index h is obtained by applying the deterministic observation-and-reward model to the predicted state and to the candidate action at that index. For compactness, this predicted reward is denoted by a single symbol. The planning problem is

(36)

(37)

(38)

The multiplier computed in Eq. (28) is held constant over the horizon, so that the optimization in Eq. (36) depends only on the predicted trajectories.

The discrete optimization in Eq. (36) is solved by a beam search of width b. At each depth, only the best b partial sequences consistent with the safe sets are kept, which yields a computational cost proportional to the horizon H, to the beam width b, and to the size of the filtered action sets. After evaluating all depth-H sequences retained by the beam, the first action of the best sequence is executed. At the next decision step, the state is re-estimated and the planning procedure is repeated. This produces a receding-horizon controller that respects both long-run and short-horizon fatigue constraints and that is used in Section Experiments.
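
The beam search described above can be sketched generically in Python. The planner below is an illustration, not the released implementation: the callables `step`, `reward`, `safe_actions`, and `fatigue_of`, the hinge-style penalty `lam * max(0, fatigue - f_max)`, and the choice to score each action on the predicted post-action state are all assumptions. The toy dynamics after the function exist only to make the example runnable.

```python
def beam_plan(state, step, reward, safe_actions, fatigue_of,
              lam, f_max, horizon=3, beam=20):
    """Return the first action of the best guard-consistent sequence."""
    beams = [(0.0, state, [])]  # (cumulative score, predicted state, actions)
    for _ in range(horizon):
        candidates = []
        for score, s, seq in beams:
            # Expand only actions that the guard admits at this state.
            for a in safe_actions(s):
                s_next = step(s, a)
                penalty = lam * max(0.0, fatigue_of(s_next) - f_max)
                candidates.append(
                    (score + reward(s_next, a) - penalty, s_next, seq + [a]))
        # Keep the best `beam` partial sequences at this depth.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam]
    return beams[0][2][0]


# Toy model: state = (skill, fatigue); practice raises both, REST halves fatigue.
def toy_step(s, a):
    skill, fat = s
    return (skill, 0.5 * fat) if a == "REST" else (min(1.0, skill + 0.2), fat + 0.3)


def toy_safe(s):
    # REST is always admissible; practice only if predicted fatigue stays <= 0.8.
    return ["practice", "REST"] if s[1] + 0.3 <= 0.8 else ["REST"]


first = beam_plan((0.0, 0.0), toy_step, lambda s, a: s[0], toy_safe,
                  fatigue_of=lambda s: s[1], lam=1.0, f_max=0.7)
print(first)
```

Because REST is always in the safe set, the candidate list is never empty, so the planner always returns an action. Only the first action is executed; the plan is recomputed at the next step from the re-estimated state.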

Experiments

This section evaluates whether model-predictive planning improves learning under fatigue-constrained practice. All experiments use the same PianoGym action library and the same environment-side short-horizon safety guard. Two safety regimes are reported. In the main comparison reported in Section Main comparison under the safety guard, the environment fatigue hard constraint is enabled for the three learner profiles, so that agents differ only in their decision rules under a common safety interface. In the task suite study reported in Section Robustness across task families, each profile follows its default configuration. The balanced and mild-left-weak profiles keep only the guard, whereas the severe-left-weak profile enables both the guard and the hard constraint.

Dataset

Task composition

A 3 × 3 task suite is constructed in PianoGymEnv (see Table 2). The rows are the three learner profiles used throughout the experiments, and for each profile three transfer levels (weak, medium, strong) are used. This gives 9 tasks in total.


Table 2. Profiles in the PianoGym task suite.

https://doi.org/10.1371/journal.pone.0351141.t002

Scale and execution protocol

Unless otherwise stated, S = 10 independent random seeds are run for each task, and each seed generates one trajectory. A trajectory runs for at most the maximum episode length and terminates early once the mastery condition is satisfied. This produces per-step records of states, actions, observations, and fatigue. Under this configuration the total sample size is on the order of 10⁵, which is sufficient for reporting means and 95% confidence intervals.

Symmetry between left-weak and right-weak profiles

Only the left-weak variants (mild-left-weak, severe-left-weak) are reported in the main text. The data generator can swap left and right skill dimensions and action targets to obtain the right-weak counterparts. Because the action-library fatigue parameters, the short-horizon safety guard, and the PianoMPC state estimator are symmetric in the two dimensions, the mirrored tasks yield the same rankings. Their safety statistics are also almost identical. The corresponding results can be reproduced from the released code or from the supplementary material.

Observation and action interface

The environment uses a post-action interface. After executing action a_t at time t, it returns the music-facing observation together with the online fatigue estimate and the true fatigue. The action set contains several structured practice actions and an explicit rest action REST. For comparability, a common environment-side short-horizon safety guard is applied to every agent. Each agent first proposes a candidate action. The guard then filters actions using the current online fatigue estimate and produces the primary one-step safe set of Eq. (29), with the one-step margin at its default value in the default experiments. The environment-side guard also logs whether the proposed action would pass the relaxed diagnostic band of Eq. (30). Only actions in this primary set are executed, except for the relaxed fallback case defined in Section Dual-timescale fatigue safety. Some agents, such as PianoMPC and the fatigue-aware LinUCB variant, forecast future fatigue and therefore tend to propose actions that already satisfy the guard. Other agents do not anticipate the guard, but their final executed actions are still filtered by the same environment-side rule. This design keeps the safety mechanism identical across methods and lets the comparison focus on the quality of the decision rules.

Experimental setup

One simulation step corresponds to a practice segment of about 30–60 seconds. This correspondence fixes the scale of the fatigue load, the recovery rate, and the typical time to mastery (TTM). All observation scalings and associated noise levels are estimated on an internal calibration set and then frozen for all experiments. The calibration set contains simulated pilot runs from the three learner profiles and was used only to set observation ranges and noise levels before the reported experiments were run.

Unless specified otherwise, the experiments use K = 5 skill dimensions and an action library of size , which includes the REST action. The mastery condition requires a window of W = 3 consecutive successful steps. Key hyperparameters are set as follows: the profile-specific fatigue threshold is for balanced, mild-left-weak, and severe-left-weak profiles; the one-step guard margin is ; the relaxed diagnostic slack is ; and the guard horizon is . For the main comparison, the MPC planning horizon is H = 3; the horizon sweep separately evaluates . The beam search width is b = 20. Dynamics parameters include a memory smoothing factor and a forgetting rate . The reward weights are set to , and the time-cost coefficient is .

The per-step computational complexity of PianoMPC is given by , where T_F is the cost of one forward prediction . In practice, the finite action library combined with the predictive guard ensures that the safe action set remains small, keeping the controller computationally efficient.
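To make the cost argument concrete, the sketch below implements a generic guard-filtered beam search over a toy deterministic surrogate model and counts forward predictions. Everything in it is an assumption made for illustration: the toy dynamics in `toy_step`, the safety check `toy_is_safe`, the numerical constants, and the function `beam_plan` itself. It is not the PianoMPC implementation. In this sketch the number of forward predictions per decision is at most the horizon times the beam width times the library size, and the guard can only reduce that count.

```python
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, float]  # (skill, fatigue) in a toy surrogate model

LOAD = {"DRILL": 0.3}  # hypothetical fatigue load per practice action


def toy_step(state: State, action: str) -> Tuple[State, float]:
    """Toy forward prediction: returns (next_state, reward)."""
    skill, fatigue = state
    if action == "REST":
        return (skill, max(0.0, fatigue - 0.2)), -0.01  # time cost only
    gain = 0.1 * (1.0 - fatigue)                        # fatigue damps learning
    return (skill + gain, fatigue + LOAD[action]), gain - 0.01


def toy_is_safe(state: State, action: str) -> bool:
    """Toy one-step guard: predicted fatigue must stay at or below 0.8."""
    return action == "REST" or state[1] + LOAD[action] <= 0.8


def beam_plan(state: State,
              actions: Sequence[str],
              step: Callable[[State, str], Tuple[State, float]],
              is_safe: Callable[[State, str], bool],
              horizon: int = 3,
              beam_width: int = 20) -> Tuple[Optional[str], int]:
    """Return (first action of the best plan, number of forward predictions)."""
    beam: List[Tuple[float, State, List[str]]] = [(0.0, state, [])]
    calls = 0
    for _ in range(horizon):
        candidates = []
        for score, s, seq in beam:
            for a in actions:
                if not is_safe(s, a):
                    continue                      # guard-filtered: never expanded
                nxt, reward = step(s, a)
                calls += 1
                candidates.append((score + reward, nxt, seq + [a]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]            # keep the best partial plans
    best_sequence = beam[0][2]
    return (best_sequence[0] if best_sequence else None), calls


first_action, n_calls = beam_plan((0.0, 0.0), ["DRILL", "REST"],
                                  toy_step, toy_is_safe)
```

Only the first action of the best plan is returned, matching the usual receding-horizon pattern in which the plan is recomputed after each new observation.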

Evaluation metrics

All results are computed by the evaluation module in PianoGym and are averaged across tasks and seeds. Unless otherwise stated, means and 95% confidence intervals are computed across independent seeds within the stated profile or task aggregation unit. For rate metrics, figure displays are clipped to the valid interval [0,1] or [0,100]% without changing the underlying means. Consider a trajectory

where denotes fatigue, denotes the environment fatigue threshold, W denotes the mastery window, and denotes the maximum episode length.

Time to mastery (TTM) quantifies sample efficiency by measuring how many steps are required to achieve stable mastery for a window of length W. Formally,

(39)

with if the mastery condition is not satisfied within . TTM is the primary performance metric used to rank agents in Section Main comparison under the safety guard to Section Effect of fatigue-threshold and mastery-window choices (lower is better). The counter records consecutive successful steps up to and including step t and resets to zero whenever any criterion in Eq. (6) is not satisfied.
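The counter-based definition above can be sketched in a few lines of Python. This is a hedged illustration rather than the PianoGym evaluation module: the function name `time_to_mastery`, the boolean per-step success flags, and the 1-indexed step convention are assumptions made for the example.

```python
from typing import Optional, Sequence


def time_to_mastery(success: Sequence[bool],
                    window: int = 3,
                    t_max: Optional[int] = None) -> int:
    """First step at which `window` consecutive successes have occurred.

    success[t-1] is True when every mastery criterion holds at step t.
    Returns t_max (assumed censoring value) if mastery is never reached.
    """
    if t_max is None:
        t_max = len(success)
    counter = 0
    for t, ok in enumerate(success[:t_max], start=1):
        counter = counter + 1 if ok else 0   # reset on any failed criterion
        if counter >= window:
            return t
    return t_max


# A failure at step 3 resets the counter, so mastery is reached at step 6.
example = time_to_mastery([True, True, False, True, True, True, False])
```

With the default W = 3, an episode that never strings together three successful steps is scored at the maximum episode length, which is why a method that always reaches the run limit shows no TTM variation.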

Task return is computed by recording the instantaneous raw reward and aggregating

(40)

(41)

Safety metrics follow the dual-timescale safety design, consisting of a short-horizon guard and a long-horizon fatigue limit. Each fatigue sample is thresholded as

Following standard practice in constrained and safe reinforcement learning [14], the following quantities are reported:

(42)

(43)

(44)

(45)

FeasibleRate and OverloadRate are complementary by definition, while AvgViolation measures the magnitude of fatigue violations rather than their frequency. Guard replacement rate is reported as an intervention diagnostic for proposed-versus-executed actions and is not treated as a safety success metric. For this reason, the results are interpreted as fatigue-feasibility outcomes under the simulator budget rather than as absolute safe or unsafe behavior.
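The relationships among these quantities can be illustrated with a short Python sketch. It assumes, for the example only, that overload means fatigue strictly above the threshold and that AvgViolation averages the positive excess over all steps; the function name `safety_metrics` and its arguments are hypothetical and may differ from the released evaluation module.

```python
from typing import Dict, Optional, Sequence


def safety_metrics(fatigue: Sequence[float],
                   threshold: float,
                   proposed: Optional[Sequence[str]] = None,
                   executed: Optional[Sequence[str]] = None) -> Dict[str, Optional[float]]:
    """Trajectory-level fatigue-feasibility metrics (illustrative definitions)."""
    n = len(fatigue)
    overload = [f > threshold for f in fatigue]          # thresholded samples
    overload_rate = sum(overload) / n
    feasible_rate = 1.0 - overload_rate                  # complementary by definition
    # Magnitude of violations: mean positive excess over the threshold.
    avg_violation = sum(max(0.0, f - threshold) for f in fatigue) / n

    replacement_rate = None
    if proposed is not None and executed is not None:
        # Intervention diagnostic: how often the guard changed the proposal.
        replaced = [p != e for p, e in zip(proposed, executed)]
        replacement_rate = sum(replaced) / len(replaced)

    return {
        "FeasibleRate": feasible_rate,
        "OverloadRate": overload_rate,
        "AvgViolation": avg_violation,
        "GuardReplacementRate": replacement_rate,
    }
```

Two trajectories can share the same FeasibleRate while differing in AvgViolation, which is why frequency and magnitude are reported separately.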

A progress metric called rhythm independence gain is also reported:

(46)

This metric uses the same thresholds as the mastery condition defined by the environment and therefore supports cross-task comparison. Additional diagnostics, including memory retention and offline evaluation coverage, appear in the public implementation.

Main comparison under the safety guard

Baselines

All baselines observe the same post-action signals, use the same action set, and are filtered by the same environment-side guard. Our PianoMPC agent and the fatigue-aware LinUCB variant also run the same short-horizon guard inside the policy, whereas the other baselines rely only on the environment-side guard. Thus differences in Section Main comparison under the safety guard come mainly from the decision rule. Thompson [38] is a lightweight posterior-sampling bandit under delayed feedback. LinUCB [39] is a contextual bandit on PianoGym features but does not plan future fatigue. Bayesian MAB [40] adds discounting and change-point handling for non-stationary profiles. CCB-DF [41] learns from delayed counterfactual rewards. DQN [31] is a value-based agent that relies on the external guard for safety. Safe-AC [32] treats fatigue above as cost in a Lagrangian actor-critic. Auto-Curriculum [42] orders exercises and inserts REST, mimicking human teaching without MPC-style planning.

Results on three profiles

This section reports the main comparison under the common safety guard. Table 3 summarizes three learner profiles and eight agents. FeasibleRate values range from 0.89 to 0.97 in the main comparison, with nonzero overload rates still possible. For this reason, the primary comparison is mean TTM interpreted together with FeasibleRate and violation diagnostics, rather than a binary safe/unsafe label.

Table 3. Overall comparison of time to mastery (TTM) and FeasibleRate (Feas.). Values are mean ± 95% CI over 10 runs. Best values are highlighted in bold and second best are underlined.

https://doi.org/10.1371/journal.pone.0351141.t003

PianoMPC achieves mean TTM values of 24.4, 26.0, and 28.2 steps across the balanced, mild-left-weak, and severe-left-weak profiles, respectively. This range is clearly below the second group, formed by BayesianMAB with 35–40 steps. All other methods require between the low 40s and about 60 steps, with LinUCB, DQN, CCB-DF and AutoCurriculum often around 50–60 steps.

Fig 5 panel (a) shows the same three-tier structure. LinUCB keeps the highest FeasibleRate, about two points above PianoMPC, but this gain is small relative to the extra training time it requires. The TTM gap arises because LinUCB reacts myopically when fatigue is near the limit and spends many steps on low-value rest choices. PianoMPC instead plans the fatigue budget over its horizon, so it converts the same fatigue allowance into more progress. Fig 5 panel (b) follows the TTM pattern, since shorter runs accumulate less negative reward. Panel (c) shows that some slower methods reach slightly higher coordination gain, which reflects longer exposure rather than stronger learning rules. Fig 6 shows that PianoMPC improves reward earliest and keeps the advantage through the early practice steps. This behaviour is consistent with planning under post-action observations and with the dual safety design of the environment.

A guard-dependence ablation compares the default environment-side guard, a weaker guard, and no guard. PianoMPC remains the fastest method in all three settings, with macro mean TTM values of 25.67, 24.50, and 25.77 steps and FeasibleRate values of 0.921, 0.920, and 0.927, respectively. Under the default environment-side guard, the guard replacement rate for PianoMPC is 0.000, whereas replacement rates for several reactive baselines are much higher. This indicates that the guard is a common execution interface and diagnostic, rather than the sole driver of PianoMPC’s efficiency.

Fig 5. Quantitative results for the main experiment.

Panel labels show mean TTM, reward, and coordination gain; error bars are 95% seed CIs.

https://doi.org/10.1371/journal.pone.0351141.g005

Fig 6. Learning curves for the main experiment.

Curves show sliding-averaged raw reward; shaded bands are 95% seed CIs.

https://doi.org/10.1371/journal.pone.0351141.g006

Effect of planning horizon

To assess how much lookahead is needed, the planning horizon H was varied. This test used the balanced profile while all other settings were fixed. The sweep uses the same balanced-profile seed schedule and agent-update pipeline as the main comparison, so the H = 3 point matches the PianoMPC entry in Table 3; H = 10 is an additional sweep setting rather than the table configuration. Fig 7 shows the resulting trade-off between training speed and fatigue feasibility. A myopic policy (H = 1) achieves the lowest balanced-profile mean TTM at 24.1 steps, but its FeasibleRate is 0.902 and its macro FeasibleRate across profiles is 0.850. The H = 1 setting is therefore treated as a low-feasibility setting rather than categorically unsafe.

Fig 7. Effect of PianoMPC planning horizon on the balanced profile.

Left axis shows TTM; right axis shows FeasibleRate on a zoomed scale with visual headroom above 100%; symmetric 95% CIs are clipped to the valid 0–100% range for display.

https://doi.org/10.1371/journal.pone.0351141.g007

At the main-comparison setting H = 3, the balanced-profile mean TTM is 24.4 steps and FeasibleRate is 0.945. Longer horizons raise FeasibleRate to 1.000 for H = 5 and H = 10, with balanced-profile mean TTM values of 27.3 and 25.8 steps. This pattern indicates that short lookahead already captures useful near-term fatigue structure, while longer lookahead mainly changes the feasibility-speed trade-off. Thus planning depth, not only the shared guard, explains the efficiency gap.

Effect of agent-level safety components

This experiment evaluates the safety modules inside the agent policy. The short-horizon guard and the soft fatigue penalty are enabled or removed. Fig 8 reports a profile-aware bar plot of mean TTM and realized fatigue-overload rate, computed as , across all three learner profiles. Note that the environment guard remains enabled for every variant and the ablation removes only agent-side components such as the internal guard or the soft penalty.

Fig 8. Agent-side safety-module ablation across three learner profiles.

Bars show macro means; open markers show profile-specific values. Overload rate denotes the realized trajectory overload rate, , after the shared environment-side guard.

https://doi.org/10.1371/journal.pone.0351141.g008

With both PianoMPC safety components active, the macro mean realized overload rate is 8.2% and mean TTM is 26.2 steps. Removing the soft penalty gives a similar realized overload rate of 8.3% and a mean TTM of 26.7 steps. Thus, under the shared environment-side guard, PianoMPC’s internal guard and replanning dominate this ablation, while the soft penalty has only a small effect on the realized feasibility-speed trade-off.

The LinUCB variants remain slower, with mean TTM values between 49.2 and 53.1 steps. Their realized overload rates stay between 3.5% and 8.0%, indicating a clearer guard-dependent feasibility-speed trade-off for the reactive bandit variants. Safe-AC appears between these groups, with mean TTM of 42.5 steps and an 8.2% realized overload rate, so it improves over the bandit variants but does not match PianoMPC’s efficiency.

Consistency between windowed and peak-guard safety metrics

This experiment checks whether the online guard diagnostics behave consistently as the guard slack changes. Fig 9 reports online peak-violation rate and guard false-negative rate for PianoMPC on the balanced profile. The diagnostic window is fixed at in this plot so that the figure focuses on the guard horizon H_g and the relaxed slack .

Fig 9. Guard-calibration diagnostics for PianoMPC on the balanced profile.

Panels show peak-violation and false-negative rates across guard horizons at fixed diagnostic window .

https://doi.org/10.1371/journal.pone.0351141.g009

The online peak-violation rate decreases as increases, which is expected because the diagnostic threshold becomes less conservative. The false-negative rate remains low across the displayed settings and approaches zero at . These rates are diagnostics of the guard calibration; they do not mean that violations are allowed as a goal of the method. The online guard is used for action filtering, whereas windowed violation metrics are retained as offline checks of the realized trajectory.

Robustness across task families

This experiment tests whether the advantage of planning holds under different task structures. The suite combines three learner categories (balanced, mild-left-weak, severe-left-weak) with three transfer levels (weak, medium, strong). Table 4 reports time to mastery and FeasibleRate for the four strongest agents. The suite includes nonzero overload rates in several severe-left-weak settings, so these values should be read as fatigue-feasibility diagnostics rather than as binary safety claims. We keep the default per-task settings in order to test robustness under realistic conditions rather than under a single globally tuned constraint.

Table 4. Robustness of top agents across the 3 × 3 task suite. Cells report mean TTM and FeasibleRate with 95% seed CIs.

https://doi.org/10.1371/journal.pone.0351141.t004

PianoMPC achieves the lowest mean TTM in every cell of the table. On weak transfer tasks it needs about 43–47 steps, while CCB-DF needs around 78 steps and Thompson stays above 70 steps. BayesianMAB lies between them, at 49–54 steps, and is therefore the second tier. This shows that, when transfer is scarce and fatigue is enforced, planning still converts practice into progress more efficiently than reactive bandits.

When transfer becomes medium or strong, every agent improves, but the gains are not equal. PianoMPC drops to roughly 25 steps for medium transfer and to 14–17 steps for strong transfer across all categories. BayesianMAB and CCB-DF also speed up, yet they remain 10–20 steps slower than the planner. In the severe-left-weak tasks, PianoMPC is the only agent that stays below 30 mean steps at medium transfer while keeping FeasibleRate above 0.9. This indicates that the planning controller remains effective across this task suite under the evaluated fatigue limits.

Robustness to dynamics misspecification

This experiment studies how sensitive the agents are to errors in the task dynamics. Three environment parameters are scaled around the nominal value one. The parameters control memory decay, fatigue accumulation, and gating sharpness. Fig 10 reports the change in four metrics with respect to the baseline run. Values close to zero indicate stable behaviour under misspecification.
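The sweep described above amounts to perturbing one multiplier at a time and reporting metric differences from the nominal run. The Python sketch below shows that bookkeeping under stated assumptions: `sensitivity_sweep`, the parameter names, and the stand-in `toy_evaluate` function are hypothetical and do not reproduce PianoGym dynamics or any reported number.

```python
from typing import Callable, Dict, Iterable, Tuple

Metrics = Dict[str, float]


def sensitivity_sweep(evaluate: Callable[[Dict[str, float]], Metrics],
                      nominal: Dict[str, float],
                      scales: Iterable[float]) -> Dict[Tuple[str, float], Metrics]:
    """Scale one parameter at a time and return metric changes from nominal."""
    scales = list(scales)
    baseline = evaluate(dict(nominal))
    deltas: Dict[Tuple[str, float], Metrics] = {}
    for name in nominal:
        for scale in scales:
            params = dict(nominal)
            params[name] = nominal[name] * scale     # perturb a single multiplier
            metrics = evaluate(params)
            # Values near zero indicate stable behaviour under misspecification.
            deltas[(name, scale)] = {k: metrics[k] - baseline[k] for k in baseline}
    return deltas


def toy_evaluate(params: Dict[str, float]) -> Metrics:
    """Stand-in for a full simulation run; the formulas are arbitrary."""
    return {
        "ttm": 25.0 + 10.0 * (params["fatigue_gain"] - 1.0)
                    + 5.0 * (params["memory_decay"] - 1.0),
        "feasible_rate": 0.95 - 0.05 * (params["fatigue_gain"] - 1.0),
    }


nominal = {"memory_decay": 1.0, "fatigue_gain": 1.0, "gating_sharpness": 1.0}
deltas = sensitivity_sweep(toy_evaluate, nominal, [0.8, 1.0, 1.2])
```

A metric that is censored at the run limit returns a delta of zero for every scale in such a sweep, so flat lines need to be checked against the censoring value before being read as robustness.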

Fig 10. Sensitivity of agents to dynamics misspecification.

Panels show metric changes from nominal; AutoCurriculum reaches the run limit in scanned settings.

https://doi.org/10.1371/journal.pone.0351141.g010

Across all settings, PianoMPC stays close to the zero line. For memory errors, its time to mastery changes by at most about four steps, and FeasibleRate varies within roughly three tenths of a percentage point. Average fatigue and learning slope also show only small shifts. This pattern suggests that online replanning absorbs moderate model errors without extra tuning.

Non-planning agents react much more strongly. For misspecified forgetting rates, LinUCB is affected the most and can require more than 50 additional steps. AutoCurriculum reaches the run limit in all scanned settings, so its sensitivity in time to mastery cannot be observed; its zero sensitivity in some panels is therefore a ceiling artifact rather than evidence of greater robustness. Several bandit baselines also lose around one percentage point of feasibility when the fatigue gain is inaccurate. Misspecification of the gating parameter pushes CCB-DF and LinUCB even further from the origin. These results indicate that methods with fixed exploration schedules are more sensitive to the tested dynamics errors, whereas the predictive controller changes less under the same simulator changes.

Additional scoped simulator-change tests keep the calibrated fatigue model fixed while increasing observation noise and varying transfer/interference strength. In these tests, PianoMPC retained the lowest overall mean TTM and a competitive FeasibleRate when observation noise was increased by two to three times and transfer strength was scaled by 1.2. These results support robustness within the tested simulator changes, but they do not imply robustness to arbitrary model error.

Effect of fatigue-threshold and mastery-window choices

This experiment studies how evaluation hyperparameters affect reported performance. Only the fatigue threshold and the mastery window change; agent policies remain identical. Fig 11 reports both TTM and FeasibleRate for PianoMPC, BayesianMAB, and CCB-DF on the balanced profile, directly addressing whether the ranking depends on a single operational mastery or fatigue definition.

Fig 11. Effect of changing evaluation hyperparameters.

Panels show TTM and FeasibleRate under threshold scaling and mastery windows.

https://doi.org/10.1371/journal.pone.0351141.g011

Panel (a) scales the fatigue limit from 0.8 to 1.2 times the nominal value. The PianoMPC curve stays between 22.5 and 25.5 steps. CCB-DF varies between about 35 and 45 steps and peaks at the unscaled threshold. BayesianMAB stays within the 31–36 step band and never overtakes PianoMPC. The corresponding FeasibleRate panel shows that the threshold sweep changes feasibility levels but does not remove PianoMPC’s TTM advantage. Thus the relative ordering in Section Main comparison under the safety guard is not caused by a single threshold choice.

The mastery-window sweep is limited to because longer windows act mainly as stress tests and can obscure the operational definition of TTM. The default W = 3 is marked in both mastery-window panels. PianoMPC remains the fastest method over the displayed window range, while FeasibleRate varies because changing W changes episode termination times and the fatigue samples included in the trajectory average. This behaviour shows that TTM is an operational PianoGym metric, not a universal definition of musical mastery.

Discussion

In PianoGym, model-predictive planning improves mean TTM while maintaining fatigue-feasibility diagnostics under the stated surrogate fatigue model. Under a shared environment-side guard, PianoMPC achieves clearly lower time-to-mastery than reactive bandits. This advantage is supported by the horizon and policy-side safety ablations. The two safety diagnostics are consistent once their purposes are separated. The effect persists across task families, tested dynamics changes, and evaluation hyperparameters. The no-guard and guard-calibration analyses further indicate that PianoMPC’s efficiency is not driven solely by the default environment-side guard. Safety-related results are therefore interpreted as simulator-level fatigue-feasibility findings rather than as claims about real human physiology.

Conclusion

This paper studied sample-efficient learning for fatigue-constrained piano practice under a unified safety interface. PianoGym provides a reproducible simulation benchmark for post-action, fatigue-constrained piano rhythm training, and PianoMPC demonstrates how model-predictive planning can improve mean TTM under this benchmark. Across three learner profiles and the 3 × 3 task suite, PianoMPC achieved the lowest mean TTM while maintaining the reported FeasibleRate range under the evaluated fatigue budget. Ablations on planning depth and agent-level safety components suggest that short lookahead can be sufficient in this simulator and that the soft penalty has only a small effect on the realized feasibility-speed trade-off under the shared guard. Alternative safety diagnostics show that tuning the online guard changes absolute rates without reversing the comparative pattern in the evaluated settings.

Robustness tests show stable behavior under the evaluated dynamics misspecification, scoped simulator changes, guard settings, and mastery definitions. The observed advantage was therefore not limited to the default setting in the main table. The resulting claim is simulator-level: PianoMPC is an effective planner for the stated PianoGym surrogate model and fatigue interface. Open directions include learning guard thresholds, richer state estimation, multi-task scheduling, physical-instrument evaluation, and personalized nonlinear fatigue models.

Limitations

This study has several limitations related to its modeling assumptions. PianoMPC uses a certainty-equivalent design and rolls out a deterministic model from the current filtered estimate. It does not propagate a full posterior over latent skills, fatigue, or observation noise. This choice makes planning efficient for a finite exercise library, but it can understate uncertainty when observations are noisy or learner dynamics change abruptly. PianoGym also represents bimanual rhythm ability with five task-level skill coordinates. These coordinates are coupled by transfer and interference terms. This representation supports controlled benchmarking, but it does not capture the full neuromuscular, biomechanical, or expressive coupling involved in real piano performance.

The scope of the results is also limited by the simulator-level fatigue and evaluation definitions. Fatigue is modeled as a scalar operational index with additive exercise load, linear rest recovery, and a multiplicative effect on learning gain. It is not a validated physiological model of muscular, cognitive, or biomechanical fatigue. Therefore, TTM, FeasibleRate, and mastery thresholds should be interpreted as operational benchmark metrics rather than universal definitions of musical mastery or clinical safety.
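The three modeling choices named above (additive exercise load, linear rest recovery, and a multiplicative effect on learning gain) can be written down compactly. The Python sketch below is one plausible reading under stated assumptions: the function names, the clipping at zero, the linear damping factor, and all constants are hypothetical and are not taken from the PianoGym code.

```python
from typing import Dict

REST = "REST"


def fatigue_step(fatigue: float,
                 action: str,
                 load: Dict[str, float],
                 recovery: float) -> float:
    """Scalar fatigue update: additive load for practice, linear recovery for rest."""
    if action == REST:
        return max(0.0, fatigue - recovery)   # assumed floor at zero
    return fatigue + load[action]


def learning_gain(base_gain: float,
                  fatigue: float,
                  sensitivity: float = 1.0) -> float:
    """Multiplicative fatigue effect on learning gain (assumed linear damping)."""
    return base_gain * max(0.0, 1.0 - sensitivity * fatigue)


# Hypothetical numbers: one drill raises fatigue, and the next drill learns less.
load = {"DRILL": 0.3}
f1 = fatigue_step(0.2, "DRILL", load, recovery=0.25)
g1 = learning_gain(0.1, f1)
```

Under this reading, resting trades one step of zero skill gain for a larger gain on later steps, which is the trade-off a planner can exploit and a purely reactive rule cannot.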

References

1. Goubault E, Verdugo F, Pelletier J, Traube C, Begon M, Dal Maso F. Exhausting repetitive piano tasks lead to local forearm manifestation of muscle fatigue and negatively affect musical parameters. Sci Rep. 2021;11(1):8117. pmid:33854088
2. Takemi M, Akahoshi M, Ushiba J, Furuya S. Behavioral and physiological fatigue-related factors influencing timing and force control learning in pianists. Sci Rep. 2023;13(1):21646. pmid:38062126
3. Ito K, Watanabe T, Horinouchi T, Matsumoto T, Yunoki K, Ishida H, et al. Higher synchronization stability with piano experience: relationship with finger and presentation modality. J Physiol Anthropol. 2023;42(1):10. pmid:37337272
4. Barchet AV, Henry MJ, Pelofi C, Rimmele JM. Auditory-motor synchronization and perception suggest partially distinct time scales in speech and music. Commun Psychol. 2024;2(1):2. pmid:39242963
5. Whitton SA, Jiang F. Sensorimotor synchronization with visual, auditory, and tactile modalities. Psychol Res. 2023;87(7):2204–17. pmid:36773102
6. Roman IR, Roman AS, Kim JC, Large EW. Hebbian learning with elasticity explains how the spontaneous motor tempo affects music performance synchronization. PLoS Comput Biol. 2023;19(6):e1011154. pmid:37285380
7. Olszewska AM, Gaca M, Droździel D, Jednoróg K, Marchewka A, Herman AM. Piano Training Induces Dynamic Neuroplasticity of Bimanual Coordination but Not Auditory Processing in Young Adults. J Neurosci Res. 2025;103(7):e70067. pmid:40650444
8. Labrou K, Zaman CH, Turkyasar A, Davis R. Following the Master’s Hands: Capturing Piano Performances for Mixed Reality Piano Learning Applications. In: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 2023. 1–8. https://doi.org/10.1145/3544549.3585838
9. Brink KJ, Kim SK, Sommerfeld JH, Amazeen PG, Stergiou N, Likens AD. Pink noise promotes sooner state transitions during bimanual coordination. Proc Natl Acad Sci U S A. 2024;121(31):e2400687121. pmid:39042677
10. Deosdad-Díez M, Marco-Pallarés J. Note-by-note predictability modulates rhythm learning and its neural components. NPJ Sci Learn. 2025;10(1):59. pmid:40841516
11. Lender A, Perdikis D, Gruber W, Lindenberger U, Müller V. Dynamics in interbrain synchronization while playing a piano duet. Ann N Y Acad Sci. 2023;1530(1):124–37. pmid:37824090
12. Vuppala SRH, Allen N, Pinisetty S, Roop P. A Formal Approach for Safe Reinforcement Learning: A Rate-Adaptive Pacemaker Case Study. In: International Conference on Runtime Verification. Springer; 2024. p. 3–21. https://doi.org/10.1007/978-3-031-74234-7_1
13. Wang J, Gao R, Zha H. Reliable Off-Policy Evaluation for Reinforcement Learning. Operations Research. 2024;72(2):699–716.
14. Achiam J, Held D, Tamar A, Abbeel P. Constrained policy optimization. In: International conference on machine learning, 2017. 22–31.
15. Bastani O, Li S. Safe Reinforcement Learning via Statistical Model Predictive Shielding. In: Robotics: Science and Systems XVII, 2021. https://doi.org/10.15607/rss.2021.xvii.026
16. Chen J, Xue W, Tan X, Ye Z, Liu Q, Guo Y. FastSAG: towards fast non-autoregressive singing accompaniment generation. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024. 7618–26. https://doi.org/10.24963/ijcai.2024/843
17. Wu Y, Cooijmans T, Kastner K, Roberts A, Simon I, Scarlatos A. In: 2024. 53328–45.
18. Moor D, Yuan Y, Mehrotra R, Dai Z, Lalmas M. Exploiting Sequential Music Preferences via Optimisation-Based Sequencing. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023. 4759–65. https://doi.org/10.1145/3583780.3615476
19. Wang H, Zhang X, Iida F. Human–Robot Cooperative Piano Playing With Learning-Based Real-Time Music Accompaniment. IEEE Trans Robot. 2024;40:4650–69.
20. Wang H, Nonaka T, Abdulali A, Iida F. Coordinating upper limbs for octave playing on the piano via neuro-musculoskeletal modeling. Bioinspir Biomim. 2023;18(6):10.1088/1748-3190/acfa51. pmid:37714178
21. Scimeca L, Ng C, Iida F. Gaussian process inference modelling of dynamic robot control for expressive piano playing. PLoS One. 2020;15(8):e0237826. pmid:32797107
22. Mårup SH, Møller C, Vuust P. Coordination of voice, hands and feet in rhythm and beat performance. Sci Rep. 2022;12(1):8046. pmid:35577815
23. Garzón B, Helms G, Olsson H, Brozzoli C, Ullén F, Diedrichsen J, et al. Cortical changes during the learning of sequences of simultaneous finger presses. Imaging Neurosci (Camb). 2023;1:imag–1–00016. pmid:40799721
24. Jünemann K, Engels A, Marie D, Worschech F, Scholz DS, Grouiller F, et al. Increased functional connectivity in the right dorsal auditory stream after a full year of piano training in healthy older adults. Sci Rep. 2023;13(1):19993. pmid:37968500
25. Kohler N, Novembre G, Gugnowska K, Keller PE, Villringer A, Sammler D. Cortico-cerebellar audio-motor regions coordinate self and other in musical joint action. Cereb Cortex. 2023;33(6):2804–22. pmid:35771593
26. Bhatnagar S, Jayant AK. Model-Based Safe Deep Reinforcement Learning Via a Constrained Proximal Policy Optimization Algorithm. In: Advances in Neural Information Processing Systems 35, 2022. 24432–45. https://doi.org/10.52202/068431-1774
27. Ames AD, Coogan S, Egerstedt M, Notomista G, Sreenath K, Tabuada P. Control Barrier Functions: Theory and Applications. In: 2019 18th European Control Conference (ECC), 2019. 3420–31. https://doi.org/10.23919/ecc.2019.8796030
28. Xiao W, Lyu Y, Dolan J. Model-based Dynamic Shielding for Safe and Efficient Multi-agent Reinforcement Learning. In: International Joint Conference on Autonomous Agents and Multiagent Systems, 2023. 1587–96. https://doi.org/10.65109/wdom6367
29. Bejarano FP, Brunke L, Schoellig AP. Safety Filtering While Training: Improving the Performance and Sample Efficiency of Reinforcement Learning Agents. IEEE Robot Autom Lett. 2025;10(1):788–95.
30. Vaaler A, Husa SJ, Menges D, Larsen TN, Rasheed A. Modular control architecture for safe marine navigation: Reinforcement learning with predictive safety filters. Artificial Intelligence. 2024;336:104201.
31. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33. pmid:25719670
32. Adjei P, Tasfi N, Gomez-Rosero S, Capretz MAM. Safe Reinforcement Learning for Arm Manipulation with Constrained Markov Decision Process. Robotics. 2024;13(4):63.
33. Carr S, Jansen N, Junges S, Topcu U. Safe Reinforcement Learning via Shielding under Partial Observability. AAAI. 2023;37(12):14748–56.
34. Zhang A, Wan W, Harada K. Fast pivoting gait generation by model predictive control designed with basis functions. Advanced Robotics. 2022;36(15):735–49.
35. Qiao Y, Wei W, Li Y, Xu S, Wei L, Hao X, et al. CPG-MPC controller for wheel-fin-flipper integrated amphibious robot. IR. 2023;50(6):900–16.
36. Chai F, Ge Q, Yin Y, Li D, Wang Y. Modeling and control of dissolved oxygen in recirculating aquaculture systems: A circadian rhythm analysis approach and GSMPC controller. Computers and Electronics in Agriculture. 2024;227:109515.
37. Zhang T, Wang X, Chen G, Liu F, Zha F, Guo W. Walk2Run: A Bio-Rhythm-Inspired Unified Control Framework for Humanoid Robot Walking and Running. J Bionic Eng. 2025;22(6):2849–63.
38. Chapelle O, Li L. An empirical evaluation of thompson sampling. Advances in Neural Information Processing Systems. 2011;24.
39. Silva N, Werneck H, Silva T, Pereira ACM, Rocha L. Multi-Armed Bandits in Recommendation Systems: A survey of the state-of-the-art and future directions. Expert Systems with Applications. 2022;197:116669.
40. Santana P, Moura J. A Bayesian Multi-Armed Bandit Algorithm for Dynamic End-to-End Routing in SDN-Based Networks with Piecewise-Stationary Rewards. Algorithms. 2023;16(5):233.
41. Cai R, Lu R, Chen W, Hao Z. Counterfactual contextual bandit for recommendation under delayed feedback. Neural Comput & Applic. 2024;36(23):14599–613.
42. Wei Y, Zhang H, Wang Y, Huang C. Maneuver Decision-Making through Automatic Curriculum Reinforcement Learning without Handcrafted Reward Functions. Applied Sciences. 2023;13(16):9421.

