No substantial change in the balance between model-free and model-based control via training on the two-step task

Human decisions can be habitual or goal-directed, also known as model-free (MF) or model-based (MB) control. Previous work suggests that the balance between the two decision systems is impaired in psychiatric disorders such as compulsion and addiction, via overreliance on MF control. However, little is known about whether the balance can be altered through task training. Here, 20 healthy participants performed a well-established two-step task that differentiates MB from MF control, across five training sessions. We used computational modelling and functional near-infrared spectroscopy to assess changes in decision-making and brain hemodynamics over time. Mixed-effects modelling revealed overall no substantial changes in MF and MB behavior across training. Although our behavioral and brain findings show task-induced changes in learning rates, these parameters have no direct relation to either MF or MB control or the balance between the two systems, and thus do not support the assumption of training effects on MF or MB strategies. Our findings indicate that training on the two-step paradigm in its current form does not support a shift in the balance between MF and MB control. We discuss these results with respect to implications for restoring the balance between MF and MB control in psychiatric conditions.

Introduction
Decision-making is suggested to rely on at least two parallel and distinct systems: a retrospectively-driven system based on acquired habits, and a prospective goal-directed system based on deliberate planning [1][2][3][4][5][6][7]. Since these two systems sometimes promote different choices, it is possible to differentiate their relative contribution to decision-making when action-outcome contingencies change, although in reality additional systems may guide decision-making [8] such that increasing reliance on one system does not always decrease reliance on the other [9]. Habits allow performing routines under consistent circumstances with little effort, and can be acquired through reinforcement learning where decisions rewarded in the past are more likely to be repeated in the future [10]. In contrast, goal-directed behavior requires the consideration of potential future outcomes of alternative actions based on the implementation of planned actions and outcomes. In computational terms, these two strategies are described as model-free (MF) and model-based (MB) decision control [1,2,11], respectively. These two strategies are often thought to be employed in parallel, but the arbitration between them, as determined by situations, actions and outcomes, has to be learned by exploration of the state-transition prediction error [12].
An imbalance between the two systems in favor of the MF system has been related to maladaptive choices in psychiatric disorders [28,29]. For example, excessive overreliance on habitual control has been shown in obsessive-compulsive disorder (OCD) [24,25] when rigid habits result in inadequate, repetitive and self-deleterious compulsive actions. Behavioral control may then become insensitive towards negative long-term consequences [30], which has been shown to correlate with altered prefrontal signals [31]. Beyond OCD, deficits in goal-directed behavior have also been reported in patients with addiction [24,32,33,34,35], social anxiety [36,37] and schizophrenia [38][39][40]. The finding of similar MB deficits across different psychiatric disorders reinforces the idea of a trans-diagnostic symptom approach [41].
Based on the assumption that overreliance on MF control can result in harmful habits and that MB learning is protective against the formation of those habits [42], the question arises of whether MB strategies can be strengthened by training. The well-characterized two-step task [1] (Fig 1) that promises to differentiate MB from MF learning through the implementation of parametric decision variables [43], is a likely candidate for such a training approach. The two-step task requires continuously updating action values for optimal behaviour under randomly fluctuating reward probabilities. It may therefore encourage goal-directed learning [1] and does not induce overtraining, which in animals has been shown to encourage MF strategies [44]. Indeed, a previous study by Economides et al. [9] suggested that short-term training on the two-step paradigm (768 trials across three consecutive days) improves MB control while leaving MF control unaffected, but only when participants were placed under additional cognitive load via a secondary task. The present work hypothesized that more intensive training (1005 trials across five sessions, each separated by a week) on the same task [1] may both reduce MF and strengthen MB control in the long-term. In addition, we aimed to evaluate whether behavioral training effects would be accompanied by changes in prefrontal brain activations. In order to facilitate future clinical studies, we utilized functional near-infrared spectroscopy (fNIRS), which is more readily available and easier to integrate into clinical settings than, for example, fMRI.

Ethics statement
All participants gave written informed consent. The study was approved by the governmental ethics committee (KEK Zurich) and conducted in accordance with the Declaration of Helsinki.

Participants
Thirty-three healthy participants (age 25.5 ± 4.4 mean ± STD, 17 females) were recruited at the University of Zurich. Exclusion criteria were psychiatric or neurological disorders or current medication.

Experimental protocol
We used the two-step task by Daw et al. [1] (Fig 1) programmed in MATLAB (The MathWorks, MA) [45] with the Psychophysics Toolbox [46]. The task consisted of 201 trials, each comprising two stages. In the 1st stage, participants chose between two options ('states') represented by geometrical coloured shapes. In the 2nd stage, participants were presented with either of two more states which were rewarded with money (0.2 Swiss Francs) or not (zero). Which 2nd stage state was presented depended probabilistically on the 1st stage choice according to a fixed common (70% of trials) and uncommon (30% of trials) transition scheme. In order to encourage learning, reward probability for each 2nd stage stimulus fluctuated slowly and independently by adding independent Gaussian noise (mean 0, SD .025), with reflecting boundaries at .25 and .75 [1].
Trials were separated by an inter-trial-interval of random duration between 5 and 11 seconds. If participants failed to make choices within 2 seconds, the trial was excluded from analysis. The goal of the task for the participants was to identify the rewarding 2nd stage state and make the 1st stage choice accordingly. To achieve this, participants were required to build an internal model of both 1st stage transitions and of 2nd stage reward probabilities.
Prior to the first training session, participants underwent extensive self-paced computer-based instructions and performed 50 practice trials (approx. 20 minutes). Instructions gave detailed information about the task structure, the fixed transition probabilities between 1st and 2nd stage and the varying reward probabilities at the 2nd stage. Participants were instructed to win as much reward as they could and that they would be paid depending on their cumulative performance across a randomly drawn one-third of all trials in each session. Each participant performed five training sessions on five days (total of 1005 trials) separated by a week.
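The slowly drifting reward probabilities described above can be illustrated with a short sketch. It is not the task code used in the study (which was written in MATLAB); the function name, the number of 2nd stage stimuli (four) and the uniform starting values are assumptions made for illustration only.

```python
import numpy as np

def reward_random_walk(n_trials=201, n_options=4, sd=0.025,
                       lower=0.25, upper=0.75, seed=0):
    """Gaussian random walk with reflecting boundaries (illustrative sketch).

    n_options and the uniform starting values are assumptions; the text only
    specifies the noise (mean 0, SD .025) and the boundaries (.25 and .75).
    """
    rng = np.random.default_rng(seed)
    p = np.empty((n_trials, n_options))
    p[0] = rng.uniform(lower, upper, size=n_options)
    for t in range(1, n_trials):
        step = p[t - 1] + rng.normal(0.0, sd, size=n_options)
        # Reflect any value that crossed a boundary back into the interval.
        step = np.where(step > upper, 2 * upper - step, step)
        step = np.where(step < lower, 2 * lower - step, step)
        p[t] = step
    return p

probs = reward_random_walk()
print(probs.shape, probs.min(), probs.max())
```

Each column is one independent walk; because the step size is small relative to the interval width, a single reflection per step is enough to keep values inside the boundaries.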

fNIRS instrumentation
A NIRSport instrument (NIRx Medical Technologies LLC) was used to record cortical hemodynamic responses during task performance in each session. Regions of interest were selected to correspond to the vmPFC (Fpz, Fp1, Fp2, AFz) and dlPFC (FC5, FC6, FFC5h, FFC6h, FC4, FC3) which have both been suggested to represent pure MB strategies [27], and the ilPFC (F7, F8, FFC7h, FFC8h, F5, F6) that is thought to encode the arbitrator between the MF and the MB system [27] (Fig 2, S1 Table). Regions corresponding to the MF system, such as DLS [27], were not recorded because fNIRS has a limited depth of tissue penetration and therefore cannot record subcortical areas.
The NIRSport system utilizes time-multiplexed dual-wavelength light-emitting diodes (wavelengths 760 nm and 850 nm) with photo-electrical detectors (Siemens, Germany). Sources and detectors were placed in a head cap providing a source-detector distance of approximately 30 mm. Custom made short channels (approx. 10 mm) were used to remove superficial tissue contributions. Functional recordings acquired using LabVIEW (National Instruments, Austin, TX, USA) were pre-processed including baseline correction, detrending and band-pass filtering [47]. Data were visually inspected for motion artifacts ("steps" and "spikes") that were removed in 15 participants using NIRSlab [48]. The change in total hemoglobin concentration, Δ[tHb], was chosen as primary parameter of interest because it is thought to be more specific for mapping cerebral activity [49,50]. Trial-by-trial estimates of Δ[tHb] were derived using the general linear model (GLM) approach [51,52] by convolving a stick function at actual choice with a hemodynamic response function for NIRS data [53]. We only modelled 1st stage choices because hemodynamic responses to 1st and 2nd stage choices could not be unambiguously separated due to the short inter-trial-interval [52,54].

Fig 1. (Left) Two-step task. Each 1st stage led to a 2nd stage in 70% of trials (common transition) and in 30% of trials to another 2nd stage (uncommon transition). Reward probabilities (p(reward)) for each 2nd stage fluctuated across trials between 25% and 75% according to Gaussian random walks [1]. (Right) Model predictions. Predictions on MF versus MB learning for the probability to repeat the choice from the previous trial (p(stay)) as a function of reward (R+ = rewarded vs. R- = unrewarded) and transition (C = common vs. U = uncommon) at the previous trial. MF predicts a main effect of 'reward' and no effect of 'transition', whereas MB predicts an interaction effect of 'reward × transition'. Mixed effects of both MB and MF are typically identified in the two-step task [1]. Figure adapted from [67].
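The trial-by-trial GLM idea (a stick function at each choice, convolved with a hemodynamic response function, then regressed onto the signal) can be sketched as follows. This is a generic illustration, not the authors' pipeline: the gamma-shaped response, its parameters, and the function names are assumptions, and the NIRS-specific response function of [53] is not reproduced here.

```python
import math
import numpy as np

def gamma_hrf(fs, duration=30.0, shape=6.0, scale=1.0):
    """Gamma-shaped response sampled at fs Hz (assumed shape, peak near 5 s)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    h = t ** (shape - 1) * np.exp(-t / scale) / (math.gamma(shape) * scale ** shape)
    return h / h.max()

def trialwise_glm(signal, onsets_s, fs):
    """Estimate one amplitude per trial by ordinary least squares."""
    n = len(signal)
    hrf = gamma_hrf(fs)
    X = np.zeros((n, len(onsets_s) + 1))
    for j, onset in enumerate(onsets_s):
        stick = np.zeros(n)
        stick[int(round(onset * fs))] = 1.0      # stick function at choice onset
        X[:, j] = np.convolve(stick, hrf)[:n]    # convolve and truncate to signal length
    X[:, -1] = 1.0                               # constant baseline term
    betas, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return betas[:-1], X

# Synthetic check: build a signal from known amplitudes and recover them.
fs, onsets = 10, [5.0, 15.0, 25.0]
_, X = trialwise_glm(np.zeros(600), onsets, fs)
y = X @ np.array([1.0, 2.0, 3.0, 0.5])
est, _ = trialwise_glm(y, onsets, fs)
print(np.round(est, 3))
```

With closely spaced trials the single-trial regressors overlap, which is one reason only 1st stage choices were modelled in the study.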

Data analysis
Data analysis was performed to assess overall training outcomes (response times and reward rates), followed by logistic and linear mixed-effects (LME) regression (behavioral choice and hemodynamic responses), computational modelling (behavioral choice), and a simulation to relate LME and modelling.

Response times and reward rates
Training effects on response times and reward rates were assessed using repeated measures ANOVA with Bonferroni correction. In case of significant main effects, polynomial contrasts were assessed.

LME regression
We first analyzed stay-versus-switch behavior on 1st stage choices of each trial to dissociate the relative influence of MF and MB control. As mentioned above, MF learning predicts that rewarded choices will lead to a repetition of that choice irrespective of a following common or uncommon transition, because the transition structure is not considered (Fig 1); a reward after an uncommon transition would therefore adversely increase the value of the chosen 1st stage state without updating the value of the unchosen state. By contrast, MB strategy predicts an interaction between transition and reward, because an uncommon transition inverts the effect of a subsequent reward (Fig 1); a reward after an uncommon transition would therefore increase the probability to choose the previously unchosen 1st stage state. Hence, MF behavior has been suggested to be quantifiable as main effect of 'reward' and no effect of 'transition', whereas MB behavior may be quantified as interaction effect of 'reward × transition' [55]. LME regression was fitted using the glmer function from the lme4 package [56] in R [57] for the effects of 'reward' (coded as rewarded 1, unrewarded -1), 'transition' (coded as common 1, uncommon -1) and their interaction 'reward × transition' (choice ~ reward × transition + (1 + reward × transition | subject)) in predicting each trial's choice (coded as switch 0 and stay 1, relative to the previous trial) with states being treated independently [58]. Following previous work [43], we also included an additional random 'correct' predictor capturing the tendency of the agent to repeat correct choices, in order to prevent differences in action values at the start of the trial from appearing as a spurious loading on the transition-outcome interaction predictor [43]; the inclusion of this predictor only marginally affected results. The function anova from the lme4 package was used to extract F-stats and p-values.
To graphically demonstrate training effects on the balance between MF and MB control, the LME coefficients indexing MF (effect of 'reward') and MB (interaction of 'reward × transition') control were illustrated following previous work [9,55].
Analogous to the behavioral choice data, the scaled hemodynamic responses in vmPFC, dlPFC and ilPFC were fitted using linear mixed-effects (LME) regression based on the lmer function from the lme4 package [56] in R [57]. The relation between the behavioral and brain LME coefficients was assessed using Pearson product moment correlation.
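The logic behind the 'reward' main effect and the 'reward × transition' interaction can be illustrated without a mixed-effects model by tabulating raw stay probabilities for a single participant. The sketch below is a simplification, not the glmer analysis described above: it ignores random effects and the 'correct' predictor, and the function name and 0/1 input coding are assumptions.

```python
import numpy as np

def stay_effects(choice, reward, common):
    """Raw stay probabilities by previous reward and transition type.

    choice: 1st stage choice per trial (0/1)
    reward: 1 if the trial was rewarded, else 0
    common: 1 if the transition was common, else 0
    """
    choice, reward, common = (np.asarray(a) for a in (choice, reward, common))
    stay = (choice[1:] == choice[:-1]).astype(float)   # stay = 1, switch = 0
    prev_r, prev_c = reward[:-1], common[:-1]           # predictors from the previous trial
    p = {}
    for r in (1, 0):
        for c in (1, 0):
            mask = (prev_r == r) & (prev_c == c)
            p[(r, c)] = stay[mask].mean() if mask.any() else np.nan
    # Main effect of reward (MF signature) and reward x transition interaction (MB signature).
    reward_effect = 0.5 * ((p[1, 1] + p[1, 0]) - (p[0, 1] + p[0, 0]))
    interaction = (p[1, 1] - p[0, 1]) - (p[1, 0] - p[0, 0])
    return p, reward_effect, interaction

# Toy sequence in which the agent stays after every reward, regardless of transition.
p, mf_index, mb_index = stay_effects(choice=[0, 0, 1, 1, 0],
                                     reward=[1, 0, 1, 0, 1],
                                     common=[1, 1, 0, 0, 1])
print(mf_index, mb_index)  # 1.0 0.0
```

In the toy sequence the pattern is purely reward-driven, so the reward contrast is maximal and the interaction contrast is zero.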

Computational model
Since LME one-step effects reflect not only expression of MF and MB strategies but also parametric changes within the two systems and may therefore mislead interpretations [59], we compared the LME results with computational modelling of the two-step task [1,60].
Based on the original hybrid model by Daw et al. [1], we compared eight different model variants as implemented in the Emfit toolbox (https://www.quentinhuys.com/pub/emfit/) in MATLAB (MathWorks, MA) [61] using priori Bayesian model comparison. The model with the best fit to the data was a variant of the original model which has two separate betas, one for the MB system and one for the MF system, rather than a weight explicitly trading off the two components as the weighting parameter (ω) in Daw et al. [1] (S2 Table). Model selection was based on the lowest integrated Bayesian information criterion (iBIC) score which is the sum of integrals over the individual parameters [60].
For details on the models see Huys et al. [60]. In brief, the MF strategy is computed using the SARSA(λ) temporal difference (TD) model, which learns the task by strengthening or weakening associations between 1st stage states and 1st stage actions depending on whether the action is followed by a reward or not [62]. It simply predicts that 1st stage actions that resulted in a reward are more likely to be repeated in the next trial with the same initial state [1]. This is quantified by calculating the value for each state-action pair at each stage of each trial with the model allowing different learning rates α1 and α2 for 1st and 2nd stages, respectively. The reinforcement eligibility parameter (λ) determines the update of 1st stage actions by the 2nd stage prediction error (Q_TD), with λ = 1 being the case of Fig 1 (MF) in which only the final reward is important, and λ = 0 being the purest case of the TD algorithm in which only the 2nd stage value plays a role. On the other hand, MB strategy uses an internal model of the task structure to determine 1st stage choices that will most likely result in a reward [1]. It thus considers which 2nd stages are most frequently rewarded in recent trials and selects 1st stage actions that most likely lead there. This is quantified by mapping state-action pairs to the transition function, the common or the uncommon transition. The action value (Q_MB) is thus computed at each trial from the estimates of the rewards and transition probabilities (Fig 1, MB). Choice randomness is reflected in the softmax inverse temperature parameter at the 2nd stage (β2) that controls how deterministic choices are, and p captures perseveration (p > 0) or switching (p < 0) in 1st stage choices. Finally, contrary to the original model [1] that uses a weighted sum (Q_NET) of MF and MB strategies (weighting parameter, ω) at the 1st stage, the model variant has two separate betas, one for the MB system and one for the MF system.
The model variant thus tests whether the assumption of the original model [1] that the two approaches coincide at the 2nd stage (i.e., that Q_MB = Q_TD, Q_NET = Q_MB = Q_TD at the 2nd stage) holds true.
Taken together, the hybrid model variant outputs seven free parameters: bMB and bMF, the betas governing the tradeoff between MB and MF actions; the inverse temperature parameter at the 2nd stage (β2); the 1st (α1) and 2nd (α2) stage learning rates; the reinforcement eligibility parameter (λ); and p, which captures first-order perseveration. All five training sessions across participants (N = 100) were fitted simultaneously with all data treated as derived from the same prior distribution.
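For orientation, the verbal description above can be written out as a set of update and choice rules. The block below is a sketch following the standard formulation of the hybrid model in [1], adapted to two separate betas; the exact parameterization in the Emfit toolbox may differ, and the symbols s_1, s_2, a_1, a_2, r and rep(a) (an indicator that action a repeats the previous 1st stage choice) are notation introduced here for illustration.

```latex
\begin{align}
\delta_1 &= Q_{TD}(s_2,a_2) - Q_{TD}(s_1,a_1)
  && \text{1st stage prediction error} \\
\delta_2 &= r - Q_{TD}(s_2,a_2)
  && \text{2nd stage prediction error} \\
Q_{TD}(s_1,a_1) &\leftarrow Q_{TD}(s_1,a_1) + \alpha_1\,\delta_1 + \alpha_1\,\lambda\,\delta_2
  && \text{MF update, 1st stage} \\
Q_{TD}(s_2,a_2) &\leftarrow Q_{TD}(s_2,a_2) + \alpha_2\,\delta_2
  && \text{update, 2nd stage} \\
Q_{MB}(s_1,a) &= \sum_{s'} P(s' \mid s_1,a)\,\max_{a'} Q_{TD}(s',a')
  && \text{MB value, 1st stage} \\
P(a_1 = a \mid s_1) &=
  \frac{\exp\!\big(b_{MB}\,Q_{MB}(s_1,a) + b_{MF}\,Q_{TD}(s_1,a) + p\cdot\mathrm{rep}(a)\big)}
       {\sum_{a'} \exp\!\big(b_{MB}\,Q_{MB}(s_1,a') + b_{MF}\,Q_{TD}(s_1,a') + p\cdot\mathrm{rep}(a')\big)}
  && \text{1st stage choice} \\
P(a_2 = a \mid s_2) &=
  \frac{\exp\!\big(\beta_2\,Q_{TD}(s_2,a)\big)}{\sum_{a'} \exp\!\big(\beta_2\,Q_{TD}(s_2,a')\big)}
  && \text{2nd stage choice}
\end{align}
```

With λ = 1 the two prediction errors combine so that the 1st stage value moves directly toward the obtained reward; with λ = 0 only the 2nd stage value drives the 1st stage update, matching the two limiting cases described in the text.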
The bounded model parameters were transformed to an unconstrained scale via exponential transformation for parameters bMB, bMF, β2 according to Eq 1 and via sigmoid transformation for parameters α1, α2, and λ according to Eq 2, with x' denoting the transformed value:

$$x' = \exp(x) \tag{1}$$

$$x' = \frac{1}{1 + \exp(-x)} \tag{2}$$

To assess training effects on the seven model parameters, one-way repeated measures ANOVA with Bonferroni correction was performed, in accordance with the assumption of the fitting procedure that sessions were drawn from the same Gaussian prior distribution. In case of significant main effects, polynomial contrasts were assessed. To validate the goodness of fit, the subject-specific BIC [63] was compared between sessions using repeated measures ANOVA.
To assess test-retest reliability of the model parameters, the Intraclass Correlation Coefficient (ICC) was used. ICC were computed as type ICC(2,k) according to the Shrout and Fleiss convention [64], i.e., a two-way random-effects model with absolute agreement. P-values of the hypothesis test ICC = 0 based on alpha level p < 0.05 were reported. ICC < 0.4, 0.4-0.75, > 0.75 are considered poor, moderate and excellent reliability, respectively [65].
To assess test-retest repeatability, the Coefficient of Variation (CV), defined as the ratio of the standard deviation to the absolute mean, was calculated [66]. CV is a measure of precision with higher values indicating greater level of dispersion expressed in percentage (%) and therefore allows for comparison between model parameters independent of their units (in contrast to the ICC that is based on units).
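The two reliability measures can be sketched for a participants × sessions matrix of one model parameter. The ICC follows the Shrout and Fleiss ICC(2,k) formula from two-way ANOVA mean squares; for the CV, the use of the sample standard deviation (ddof = 1) and the pooling of values are assumptions, since the text does not specify them. Function names are illustrative.

```python
import numpy as np

def icc_2k(x):
    """ICC(2,k): two-way random effects, absolute agreement, average of k sessions.

    x: array of shape (n_subjects, k_sessions).
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between sessions
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

def cv_percent(values):
    """Coefficient of variation in percent: SD / |mean| * 100 (sample SD assumed)."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / abs(values.mean())

scores = [[1, 2], [3, 4], [5, 6]]   # toy data: 3 subjects, 2 sessions
print(round(icc_2k(scores), 4))     # 0.9412
print(cv_percent([1, 2, 3]))        # 50.0
```

In the toy data every subject shifts by the same amount between sessions, so the residual variance is zero and the ICC is reduced only by the systematic session offset, which absolute agreement penalizes.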

Simulating the relation between LME and modelling
To evaluate the relation between LME and modelling, a simulation was conducted to assess how LME regression captures the seven parameters (bMB, bMF, β2, α1, α2, λ, p). For this purpose, data were generated for 1000 subjects with 201 trials each by independently changing each of the seven parameters within the distribution of the untransformed values obtained from the actual data (5th, 25th, 50th, 75th, 95th percentile, across sessions S1-S5) while keeping the remaining parameters constant at the median (S4 Table). Based on the simulation, we estimated the relative parameter-specific changes in LME coefficients for MF control ('reward' effect) and MB control ('reward × transition' interaction) for each parameter. This was done by computing the correlation between the independent parameter changes and the induced changes in LME coefficients and describing them as parameter-specific correlation indices (MF CI and MB CI).
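A minimal sketch of such a simulation is given below: one synthetic agent following a seven-parameter hybrid model plays the two-step task, and the 'reward' and 'reward × transition' contrasts are read off from raw stay probabilities rather than from an LME fit. It is an illustration under stated assumptions, not the Emfit-based procedure used in the study: the function name, default parameter values, initial values of zero, and the use of raw contrasts instead of regression coefficients are all assumptions.

```python
import numpy as np

def simulate_stay_effects(b_mb, b_mf, beta2=3.0, alpha1=0.5, alpha2=0.5,
                          lam=0.6, persev=0.0, n_trials=201, seed=0):
    """Simulate one hybrid MF/MB agent and return stay-probability contrasts."""
    rng = np.random.default_rng(seed)
    trans = np.array([[0.7, 0.3], [0.3, 0.7]])     # P(2nd stage state | 1st stage action)
    p_rew = rng.uniform(0.25, 0.75, size=(2, 2))   # reward probability per state and action
    q_mf = np.zeros(2)                             # MF values of 1st stage actions
    q2 = np.zeros((2, 2))                          # 2nd stage values (state x action)
    prev = None                                    # (action, rewarded, common) of last trial
    stays = {(r, c): [] for r in (0, 1) for c in (0, 1)}
    for _ in range(n_trials):
        q_mb = trans @ q2.max(axis=1)              # MB values from the transition model
        rep = np.zeros(2)
        if prev is not None:
            rep[prev[0]] = 1.0                     # perseveration bonus for last choice
        v = b_mb * q_mb + b_mf * q_mf + persev * rep
        a1 = int(rng.random() < 1.0 / (1.0 + np.exp(-(v[1] - v[0]))))
        s2 = int(rng.random() < trans[a1, 1])
        common = int(s2 == a1)
        a2 = int(rng.random() < 1.0 / (1.0 + np.exp(-beta2 * (q2[s2, 1] - q2[s2, 0]))))
        r = int(rng.random() < p_rew[s2, a2])
        if prev is not None:
            stays[(prev[1], prev[2])].append(int(a1 == prev[0]))
        d1 = q2[s2, a2] - q_mf[a1]                 # 1st stage prediction error
        d2 = r - q2[s2, a2]                        # 2nd stage prediction error
        q_mf[a1] += alpha1 * d1 + alpha1 * lam * d2
        q2[s2, a2] += alpha2 * d2
        prev = (a1, r, common)
        # Drift reward probabilities with reflecting boundaries at .25 and .75.
        p_rew = p_rew + rng.normal(0.0, 0.025, size=(2, 2))
        p_rew = np.where(p_rew > 0.75, 1.5 - p_rew, p_rew)
        p_rew = np.where(p_rew < 0.25, 0.5 - p_rew, p_rew)
    # Note: a cell with no trials would yield NaN; unlikely at these trial counts.
    p = {key: float(np.mean(val)) for key, val in stays.items()}
    return {
        "p_stay": p,
        "reward": 0.5 * ((p[1, 1] + p[1, 0]) - (p[0, 1] + p[0, 0])),
        "interaction": (p[1, 1] - p[0, 1]) - (p[1, 0] - p[0, 0]),
    }

# Sweep one parameter (here alpha2) while holding the others fixed.
for a2_rate in (0.1, 0.5, 0.9):
    out = simulate_stay_effects(b_mb=3.0, b_mf=3.0, alpha2=a2_rate, n_trials=5000)
    print(a2_rate, round(out["reward"], 3), round(out["interaction"], 3))
```

Sweeping a single parameter while holding the others fixed, as in the loop above, mirrors the design of the simulation described in the text and makes it visible that parameters other than bMB and bMF can move both contrasts.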

Results
Twenty participants (mean ± STD = 24.9 ± 3.1 age, 9 females) completed five training sessions (mean duration 51.8 minutes, repeated measures ANOVA F 4,76 = 1.82, p = 0.133). 13 additional participants were excluded because of non-adherence to at least one training session (n = 12) or due to technical problems (n = 1, failure of data synchronization).

Computational model
We then fitted the seven model parameters (bMB, bMF, β 2 , α 1 , α 2 , λ, p) of the model variant [60] to the behavioral choice data and found the best fitting parameters to be reasonably  Table).
No training effects were observed on MB control (bMB) and MF control (bMF), indicating no support for our main hypothesis that task training changes the relative strength between the two systems. MB control (bMB) was slightly stronger compared to MF control (bMF) across all sessions (t-test F1,98 = 2.13, p = 0.036), supporting the assumption that participants relied slightly more on MB strategies (Table 3, Fig 5).
The remaining parameters also did not sufficiently argue for a shift between MB and MF control. While the parameters β2, λ, and p revealed no training effects, significant training effects were found on α1 and α2 learning rates, which decreased across sessions (repeated measures ANOVA: α1 F4,76 = 2.52, p = 0.048; α2 F4,76 = 5.27, p = 0.001). Hence, even if changes in learning rates may indicate some changes within the MF or within the MB system, they cannot be assigned to the balance or the relative expression between them. For example, increases in α1/α2 might represent some change in the MF/MB system, and might indicate that participants consider more MF/MB strategies in the LME regression, yet this does not provide sufficient evidence to conclude that there is a change in the expression of MF relative to MB, or vice versa. Goodness of model fit was also not affected by training as evidenced by the subject-specific BIC per session (F4,76 = 1.39, p = 0.247) (Table 3, Fig 5), suggesting that there was no evidence of training-induced systematic changes in decision-making strategies not captured by the model. Across sessions, some of the parameters correlated weakly with the changes in 1st and 2nd stage response times as expected from the training patterns (Table 4). There were no significant correlations between the model parameters and NIRS responses to any of the critical trial conditions (those that were preceded by a rare/common trial, those that were rewarded/unrewarded, all p > 0.05, S5 Table), indicating that NIRS responses did not inform on the behavioral changes captured by the model.
Test-retest reliability was moderate to high for all parameters, bMB (ICC = 0.83), bMF (ICC = 0.85), β2 (ICC = 0.71), α1 (ICC = 0.83), α2 (ICC = 0.73), λ (ICC = 0.89), p (ICC = 0.90), whereas test-retest repeatability was low for all parameters, bMB (CV = 71%), bMF (CV = 47%), β2 (CV = 33%), α1 (CV = 49%), α2 (CV = 47%), λ (CV = 36%), p (CV = 59%) (Table 5, Fig 6). The ICC results suggest that the two-step task has potential as behavioral marker for individual variation in performance, whereas the low degree of precision indicates that inter-subject variation was similar to intra-subject variation.

Simulating the relation between LME and modelling
To better understand how the LME results (suggesting training effects on MF and MB control) related to results from the computational model (suggesting no training effects on MF and MB control), we speculated that even if the computational model fully captures the learning system, choices are not only influenced by the balance between MF-MB control, but also by other model parameters. We simulated choice data with different parameter values, to understand each parameter's independent impact on the LME. This suggested that the LME is capturing changes in all seven parameters differently (S1 and S2 Figs, S6 Table). Effects of 'reward' were primarily positively correlated with changes in the parameters bMF (correlation index MF CI = 0.992), α1 (MF CI = 1.000), λ (MF CI = 0.970) and p (MF CI = -0.742), i.e., a decrease in any of these parameter values results in decreasing LME coefficients for MF control; while 'reward × transition' interactions seemed to be primarily positively correlated with changes in the parameters bMB (MB CI = 0.983), β2 (MB CI = 0.995) and α2 (MB CI = 0.996), i.e., a decrease in any of these parameter values results in decreasing LME coefficients for MB. Note that magnitudes of these indices should only be interpreted in the context of the simulation. In summary, these findings indicate that even under the assumption that the model fully captures the cognitive system mediating learning in this task, LME one-step effects not only reflect contribution of the MF and MB systems, but also parametric changes within the two systems.
This means that interpreting the 'reward' and 'reward × transition' coefficients as directly indexing MF and MB control may be misleading. One interpretation of our discrepant results therefore is that the LME results capture changes in α1 and α2, which did change between sessions. Because these two parameters have no direct relation to either MF or MB control or the balance between the two systems, this does not support an assumption of training effects on MF or MB strategies. To corroborate these conclusions, we provide an illustration that the regression coefficients based on our simulations allow reconstructing the actual LME pattern that we observe from our fitted computational model coefficients (S3 Fig). It should however be noted that the method presented here, designed to assess how LME regression captures the seven model parameters, cannot be reversed, i.e., the model parameters themselves cannot be recovered. The method can therefore only be applied and interpreted in the context of the LME.

LME main effects (Left). Each bar represents the stay probability (p(stay)) or mean tHb response across all participants and all sessions. Error bars represent standard error of the mean. Behavioral choice revealed 'reward' effects (R+ = rewarded vs. R- = unrewarded) (solid line with significance asterisks) and 'reward × transition' interactions (C = common vs. U = uncommon) (dashed line with significance asterisks), while ilPFC revealed a 'reward' effect (solid line with significance asterisks). See S1 Fig for details. LME coefficients (Right). Between sessions, behavioral choice revealed 'reward × session' and 'reward × transition × session' interactions, whereas no such effects were found on vmPFC, dlPFC and ilPFC. Error bars represent standard error of the estimate. Significant post-hoc comparisons on the interaction effects are Bonferroni corrected and highlighted (*). See Table 2.

Power analysis
Since the presented results are negative findings, we performed a post-hoc power analysis using a previously published distribution of the parameters bMF and bMB [68]. Under the assumption that our training changes parameters linearly over the five sessions, that it does not change the variance in the parameters over individuals, and that the test-retest-reliability of the parameters is zero (i.e., that between-subject variation in the parameters is not due to stable traits), then our sample size of N = 20 would have been sufficient to detect an at least 80% change in bMF and an at least 120% change in bMB with 80% power at an alpha level of 5%. Assuming a test-retest reliability of 0.5, we had sufficient power to detect a 60% change in bMF and an 85% change in bMB; and at a test-retest reliability of 0.8, these values were 35% change in bMF and a 55% change in bMB.
Goodness of model fit as evidenced by the subject-specific BIC was not affected by training; the smaller the BIC the better the fit. See Table 3 for statistics.

Discussion
In this paper, we tested a hypothesis that training humans on a two-step task reduces the influence of MF control whilst strengthening the influence of MB control. Such training may be relevant for assessing psychiatric conditions including compulsion or addiction, because of their reported association with an overreliance on habits [24]. Our results show that the two-step task reliably assesses individual MF and MB behavior but that training on the two-step task in its current form does not support a shift in the balance between the two systems. Training on the two-step task may thus require further adaptations in order to reduce MF control or compensate for deficits in goal-directed choice. Although the current study was conducted in healthy subjects and may therefore not be directly generalizable to psychiatric populations with premorbid, i.e., pre-training, deficits in MB control, our results may contribute to the current debate on how the two-step task could be adjusted to be used as training tool and to advance its application in the trans-diagnostic evaluation of psychiatric conditions [43,67].

Fig 6. Intraclass Correlation Coefficients (ICC [65]) and Coefficients of Variation (CV) of the seven parameters (bMB, bMF, β2, α1, α2, λ, p). See Table 5 for statistics. https://doi.org/10.1371/journal.pcbi.1007443.g006

Reliability of MF or MB control
Results of the behavioral model indicated higher test-retest reliability for the two-step task (overall ICC = 0.95, Table 5, Fig 6) than previously reported in a literature review (approx. mean ICC = 0.7) [69]. Although the purpose of the present study was the evaluation of training that was supposed to change behavior and thus requires caution in the interpretation of reliability, our findings suggest that the two-step task has potential as a behavioral marker to characterize individual behavior. The high reliability was associated with low precision (overall CV = 106%, Table 5, Fig 6), indicating that the standard deviation exceeded the mean value, in other words, that inter-subject variation was similar to intra-subject variation.
Together, this suggests that the model does reflect individual variation but is not precise.

No substantial change in MF or MB control via training
Results of the behavioral model suggest that training on the two-step task in its current form does not affect the balance between MF and MB control, as exemplified by a relatively stable pattern of the bMF and bMB parameters across sessions (Table 3, Fig 5). The only convincing training effects were reflected in decreasing α1 and α2 learning rates. This indicates that the degree to which participants incorporated new information decreased as task training progressed. Considering these modeling results and our simulations on the relation of model parameters and choice behaviors, the LME effects on behavioral data and the brain data (Table 2, Fig 4) most likely do not reflect changes in the balance between MF and MB control, nor in the individual systems, but merely capture changes in α1 and α2 based on the parametric mapping on LME (S1 Fig, S4 Table). Together, these findings suggest that training on the two-step task induced no substantial changes in decision strategies besides affecting learning rates. Although the results support some correspondence between behavioral choice and ilPFC, our results do not support a previous hypothesis that ilPFC arbitrates between the MF and MB system [27]. As a limitation, our power analysis indicates that a larger sample would be required to find small training effects (e.g. parameter change smaller than 50% at a parameter test-retest reliability of r = 0.8).

Comparison with previous training study
A previous training study utilizing the same two-step task by Economides et al. [9] reported evidence of training effects. Training increased MB control (as evidenced by an increased α learning rate, an increased weighting parameter ω and increased 'reward × transition' interactions), while leaving MF control unaffected (as evidenced by unchanged 'reward' effects). These behavioral changes were however observed following the concurrent introduction of a secondary load task, and the authors conjectured that the addition of load may have been necessary to expose training-induced changes in behavior in the two-step task. There are also several other possible explanations for this disparity. One likely candidate is the difference in training intervals. Economides et al. [9] trained subjects over three consecutive days, whereas the present study trained subjects over five days separated by a week. Another reason might be the difference in training intensity. Economides et al. [9] trained subjects on 768 trials, whereas we trained subjects on 1005 trials, almost one-third more trials. A third reason might be differences in statistical analysis methodology. Economides et al. [9] did not test for interactions between reward, transition and session in the LME and made use of an additional slope parameter sigma (σ) that allowed the weighting parameter ω to shift across training sessions when fitting data across all sessions; notably, an implementation of the sigma (σ) parameter in our model did not change overall results (analysis not included in this article).

Interpreting the lack of changes in MF and MB control
Within each session of the present study, participants followed slightly more MB strategies, as indicated by a median ratio between bMB and bMF of 1.07 (p = 0.041) (compared to a median weighting parameter ω of 0.39, indicating more reliance on the MF strategy, reported by Daw et al. [1]; S3 Table). Hence, participants were able to establish an internal model of the task by considering the dynamic interactions between rewards and transitions, although training did not strengthen that internal model. The missing training effect might be due to a natural re-equilibration of the balance to its default setting, i.e., the MF system, which is less computationally demanding. The arbitrator responsible for inhibiting the default habitual control and engaging the MB system [27] may have become weaker towards the end of training due to habituation. Additional confounders such as tiredness and monotony induced by the high number of repetitions may have favored less effortful MF strategies, as supported by the progressively faster response times observed across sessions (Table 1, Fig 3). Demotivation or devaluation may also be explained by the missing trade-off between performance accuracy and reward rates (Table 1, Fig 3). It is well established that payoffs in the two-step task do not differ between strictly MF agents, strictly MB agents, or even agents who choose randomly [43,67]. These findings suggest that the stochasticity of the two-step task imposes a low ceiling on achievable performance, preventing MB control from outperforming simple MF strategies [43]. It might therefore have been rational for participants not to invest in the higher cognitive costs of MB strategies, as they did not pay off.
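The comparison above between a bMB/bMF ratio and a weighting parameter ω can be put on a common scale with a small conversion. The sketch below assumes the common reparameterization bMB = β·ω and bMF = β·(1 − ω), which is not stated in this section and should be read as an assumption; the function names are hypothetical. Note also that a median of ratios need not equal the ratio implied by a median ω, so the numbers are indicative only.

```python
def ratio_to_omega(ratio):
    """Convert a bMB/bMF ratio into a weighting parameter omega,
    assuming bMB = beta * omega and bMF = beta * (1 - omega)."""
    return ratio / (1.0 + ratio)


def omega_to_ratio(omega):
    """Convert a weighting parameter omega (0 <= omega < 1) into the
    implied bMB/bMF ratio under the same assumption."""
    return omega / (1.0 - omega)


# Values taken from the text: median ratio 1.07 (present study)
# and median omega 0.39 (Daw et al. [1]).
omega_present = ratio_to_omega(1.07)   # about 0.517
ratio_daw = omega_to_ratio(0.39)       # about 0.639

print(f"ratio 1.07 -> omega = {omega_present:.3f}")
print(f"omega 0.39 -> ratio = {ratio_daw:.3f}")
```

Under this assumption, a ratio of 1.07 corresponds to an ω just above 0.5, i.e. a near-even balance with a slight MB advantage, while ω = 0.39 corresponds to a ratio below 1, i.e. more weight on MF control.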
The missing training effect might also point to the employment of a third decision-making strategy, namely sophisticated automatization, that is distinct from pure MF and MB learning [43]. Previous simulations suggested that the two-step paradigm may, or even should, promote such a third control system [43]. Faced with recurrent transitions, there might be an increased incentive to deconstruct the task and identify stimuli for automatized responses. This may produce behavior that mimics goal-directed planning but in fact arises from a fixed mapping of a limited number of states onto habitual responses and automatable strategies [13,43]. This kind of automatization could indeed be beneficial, as it may render MB control less susceptible to distraction [9]. An argument for automatization may thus be that it reduces the computational cost associated with MB planning, making MB reasoning more efficient, although without explicitly affecting the balance between MF and MB decision processes.
To enable training effects on MB learning while also allowing for some degree of automatization, several task adaptations have been proposed, such as increasing payoff attractiveness by enhancing the trade-off between performance accuracy and reward, sharpening contrasts between transition and reward probabilities, increasing the complexity of decision trees while compensating with simpler transitions, and masking high-frequency repetitions by alternating task settings to reduce the burden of automatization [43,67]. Using such incentives to boost model-based control has also been suggested as a useful intervention across a range of personality traits and latent psychiatric symptom constructs [70].

Conclusion
Previous evidence suggests that an imbalance between MF and MB control may be a common mechanism in various psychiatric disorders. The potential to rebalance such decision strategies through task training therefore remains a promising therapeutic approach. The present study suggests that training on the two-step task in its current form does not change the balance between MF and MB control. An evaluation in psychiatric populations is required to assess whether the present results can be translated into a trans-diagnostic framework [50].
Supporting information

S1 Table. fNIRS setup.
Sources, detectors, and channels for selected regions of interest (ROIs) representing the model-free system (vmPFC, dlPFC) and the arbitrator (ilPFC) illustrated in   [1] as implemented in the Emfit toolbox [73] that account for differences in model complexity. Each model was assessed across all five training sessions. Model variants may consist of separate parameters for 1st and 2nd stage choices (α1/2 = learning rate; β1/2 = softmax inverse temperature), an eligibility trace (λ), first-order perseveration (p), two separate betas, one for the model-free system (bMF) and one for the model-based system (bMB), or a weighting parameter (ω) that determines the balance between model-free (ω = 0) and model-based (ω = 1) control. In simpler models, parameters were fixed between 1st and 2nd stage choices. Model llm2b2alr is the original hybrid model by Daw et al. [1]. Bold-face denotes the winning model variant ll2bmfbmb2alr based on the lowest integrated Bayesian information criterion (iBIC) score that was used in the present analysis. (DOCX)

S3 Table. Best-fitting parameter estimates.
Best-fitting parameter estimates (β1, β2, α1, α2, λ, ω and p) shown as median plus 25th and 75th percentile across sessions S1-S5 obtained with the model variant in the present analysis in comparison with the estimates obtained with the original model by Daw et al. [1]. Note that the parameter p has a different scale in the model variant. (DOCX)

S4 Table. Distribution of simulated parameter values.
Simulation data were generated for each of the seven parameters (untransformed values) within the distribution of the untransformed values obtained from the actual data (5th, 25th, 50th, 75th, 95th percentile, across sessions S1-S5) while keeping the remaining parameters constant at the median. (DOCX)

S5 Table. Correlation between model parameters and NIRS responses.
Listed are the Pearson correlations between the model parameters (bMB, bMF, β2, α1, α2, λ, p) and the averaged NIRS responses within critical trials (those that were preceded by a rare/common trial, those that were rewarded/unrewarded) on the single-subject level across all sessions. The results indicated no significant correlations. (DOCX)

S6 Table. Simulated correlation indices.
Listed are the inferred MF and MB correlation indices (MF CI and MB CI) for each parameter (bMB, bMF, β2, α1, α2, λ, p) approximating the parameter-specific change in LME coefficients for MF control ('reward' effect) and MB control ('reward × transition' interaction). Positive versus negative correlation indices indicate that parameters are positively versus negatively correlated with LME coefficients. Note that the magnitudes of these indices should only be interpreted in the context of the simulation. (DOCX)

S1 Fig. LME main effects per session.
Each bar represents the stay probability (p(stay)) or mean tHb response across all participants for each session. For each session, bars from left to right represent R+C, R+U, R-C, R-U (R+ = rewarded vs. R- = unrewarded, C = common vs. U = uncommon, as detailed in Fig 4). Error bars represent standard error of the mean. See Table 2 for statistics. (TIF)

S2 Fig. Effects of independent parameter changes on LME.
Results of the simulation assessing independent changes of the parameters (bMB, bMF, β2, α1, α2, λ, p) on LME. (Top) Inferred LME regression main effects. For each percentage change, bars from left to right represent R+C, R+U, R-C, R-U (R+ = rewarded vs. R- = unrewarded, C = common vs. U = uncommon, as detailed in Fig 4). (Bottom) Inferred LME coefficients representing parameter-specific changes in LME coefficients for MF control ('reward' effect) and MB control ('reward × transition' interaction). (TIF)

S3 Fig. Correlation indices used to reconstruct LME coefficients.
Illustration of a simple approximation to reconstruct the patterns of the MF ('reward' effect) and MB ('reward × transition' interaction) coefficients for comparison with the actual LME. Reconstruction was done by multiplying the correlation indices (MF CI and MB CI, S6 Table) with the actual parameter values (bMB, bMF, β2, α1, α2, λ, p). (Left) To reconstruct the MF coefficients, the mean values of the parameters primarily affecting MF control (bMF, α1, and λ) multiplied with the corresponding MF CI per session were summed for illustration. (Right) To reconstruct the MB coefficients, the mean values of the parameters primarily affecting MB control (bMB, β2, and α2) multiplied with the corresponding MB CI per session were summed for illustration. According to the actual LME results, data are shown in comparison with the reference session S1. (TIF)

Supervision: Dominik R. Bach, Lisa Holper.