The authors have declared that no competing interests exist.

Conceived and designed the experiments: ME ZKN MGM RJD. Performed the experiments: ME ZKN AL. Analyzed the data: ME ZKN AL. Wrote the paper: ME ZKN AL MGM RJD.

Model-based and model-free reinforcement learning (RL) have been suggested as algorithmic realizations of goal-directed and habitual action strategies. Model-based RL is more flexible than model-free but requires sophisticated calculations using a learnt model of the world. This has led model-based RL to be identified with slow, deliberative processing, and model-free RL with fast, automatic processing. In support of this distinction, it has recently been shown that model-based reasoning is impaired by placing subjects under cognitive load—a hallmark of non-automaticity. Here, using the same task, we show that cognitive load does not impair model-based reasoning if subjects receive prior training on the task. This finding is replicated across two studies and a variety of analysis methods. Thus, task familiarity permits use of model-based reasoning in parallel with other cognitive demands. The ability to deploy model-based reasoning in an automatic, parallelizable fashion has widespread theoretical implications, particularly for the learning and execution of complex behaviors. It also suggests a range of important failure modes in psychiatric disorders.

Automaticity develops with task familiarity. One possible explanation is that automaticity arises when performance of the task becomes habitual, or model-free. Here we asked whether goal-directed, or model-based, reasoning could also become automatic, or resistant to distraction. We used a well-characterized task that differentiates model-based from model-free action. We replicate previous findings that distraction strongly impairs model-based reasoning in task-naive subjects. However, in subjects with prior exposure to the task, distraction does not impair model-based reasoning. This suggests that humans can deploy sophisticated and flexible reasoning more extensively than previously thought.

A wealth of experimental data indicates that the brain uses at least two distinct decision-making strategies in value-guided choice. One involves prospective reasoning about action-outcome contingencies, while the other retrospectively links rewards to actions [

A compelling computational account of these two control mechanisms draws on reinforcement learning (RL) theory [

Contemporary theories posit that model-based reasoning engages limited-resource executive functions [

However, studies of model-based decision-making often utilize tasks in which the stimuli, contingencies and other task parameters are novel to the subject. This raises the possibility that reliance on limited-resource executive functions is not an intrinsic property of model-based reasoning, but is instead a characteristic of reasoning with an unfamiliar model. In everyday life, tasks become "second-nature" with experience and are subsequently more easily used as building blocks for increasingly complex tasks. It remains untested whether this is entirely due to the formation of efficient habits, or if what is "second-nature" can include sophisticated reasoning with a model of the world.

Here, we used a two-step decision task that engages both model-free and model-based reasoning [

Model-free and model-based decision strategies make different predictions about choice dependence on transitions and rewards from previous trials. We used computational modeling and logistic regression to quantify the contribution of model-free and model-based strategies when subjects performed the two-step task, either alone (single-task condition) or in combination with a demanding concurrent task (dual-task condition). The latter represents a high load condition. We also wanted to test whether the effect of load changed with practice.
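As a rough illustration of how these differing predictions can be read off behavior, the sketch below computes the probability of repeating the first-stage choice as a function of the previous trial's transition and outcome. It is a minimal example under stated assumptions, not the analysis code used in this study: the function name `stay_probabilities` and the input encoding ("A"/"B" choices, "common"/"uncommon" transitions, 0/1 rewards) are hypothetical.

```python
from collections import defaultdict


def stay_probabilities(choices, transitions, rewards):
    """Return P(repeat first-stage choice) grouped by the previous trial's
    (transition, reward) pair.

    choices     -- first-stage choice per trial, e.g. "A" or "B" (assumed encoding)
    transitions -- "common" or "uncommon" per trial (assumed encoding)
    rewards     -- 1 (rewarded) or 0 (unrewarded) per trial
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [number of stays, number of trials]
    for t in range(1, len(choices)):
        key = (transitions[t - 1], rewards[t - 1])
        counts[key][0] += int(choices[t] == choices[t - 1])
        counts[key][1] += 1
    return {key: stays / total for key, (stays, total) in counts.items()}


# Toy sequence in which the agent repeats a choice only after reward.
probs = stay_probabilities(
    choices=["A", "A", "A", "B", "A"],
    transitions=["common", "uncommon", "common", "uncommon", "common"],
    rewards=[1, 1, 0, 0, 1],
)
```

In this toy sequence the stay probability depends on reward alone, the pattern expected of a purely model-free agent. A purely model-based agent would instead show a reward by transition interaction, staying after rewarded common and unrewarded uncommon trials.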

To this end we trained subjects on the two-step task for 3 consecutive days and introduced intermittent periods of high load. An initial group of 22 healthy subjects, referred to as the ‘high load group’, experienced the dual-task condition on each day of training. This allowed us to characterize choice under load across the entire training period. A second group of 23 healthy subjects, referred to as the ‘low load group’, experienced the dual-task condition on day 3 only. This allowed us to determine how training on the two-step task alone would impact choice under load.

We hypothesized that model-based calculations would become less reliant on executive resources following training, independent of whether training included or excluded load, leading to a reduction in the detrimental effect of cognitive load on model-based choice.

We analyzed data using previously described reinforcement learning (RL) models [

We first sought to validate that choice in the two-step task reflected a mix of both model-free and model-based valuations [

The weighting parameter

Next, we fit the hybrid model to data from days 2 and 3 of training in the ‘high load group’, separately for single-task and dual-task trials. We were interested in whether subjects abruptly switch their choice strategy at the start of a given training day, or alternatively, whether a gradual shift in behavioral control emerges across days. We performed paired t-tests on parameter estimates from Bayesian model inference. In the single-task condition, we found evidence for a moderate shift towards more model-based choice, as indexed by higher

The weighting parameter

To corroborate the finding that

In addition to differences in the value of

Computational modeling relies on fitting several model parameters that can share variance, which can complicate interpretation when the true value of more than one parameter differs between two conditions. We therefore employed a logistic regression to validate the main findings from our model. We quantified the degree to which choice on the current trial reflected a model-free and a model-based influence with respect to events occurring on the preceding 3 trials (see

During single-task trials, we identified both a significant model-free and model-based influence on choice extending up to 3 trials in the past (all p < 0.05), consistent with subjects utilizing a hybrid of both systems (

Results of a logistic regression that considers model-free and model-based influences on choice in the current trial with respect to events that occurred up to 3 trials in the past. (A) Regressors indicate whether events on trials $t_{-1}$, $t_{-2}$ and $t_{-3}$ increase (coded as +1) or decrease (coded as -1) the probability of choosing fractal A according to a model-free or a model-based system (6 total regressors). Model-free coefficients are plotted on the left-hand side of the x-axis, and model-based coefficients on the right-hand side. Data from day 1 and day 3 are plotted in the top and bottom panels respectively. Coefficients corresponding to the single-task are shown in blue, and those corresponding to the dual-task are shown in orange. Vertical lines represent SEM. * denotes p < 0.05, ‡ denotes p = 0.09. (B) We combined coefficients across $t_{-1}$, $t_{-2}$ and $t_{-3}$, and derived single estimates of the degree to which model-free (plotted on the y-axis) and model-based (plotted on the x-axis) control were dominant in choice. Vertical lines represent 95% confidence intervals. A line through the origin represents points at which model-free and model-based valuations have an equal influence on choice.

To our surprise, we were unable to identify a model-free influence in either group in the high load (dual-task) condition (see

In keeping with other studies utilizing the two-step task [

Mean numerical Stroop accuracy during dual-task trials was 81.9% on day 1, 85.5% on day 2, and 89.5% on day 3 for the ‘high load group’. Thus, performance on the secondary task demonstrated an approximately linear improvement across training days (day 2 vs. day 1: paired t(21) = 2.53, p = 0.019; day 3 vs. day 2: paired t(21) = 3.88, p < 0.001; day 3 vs. day 1: paired t(21) = 5.34, p < 0.001). Mean numerical Stroop accuracy for the ‘low load group’, in which subjects only experienced the dual-task condition on day 3 of training, was 83.2%, and thus comparable to the ‘high load group’.

Here we asked whether reliance on finite executive resources [

There are several possible accounts for these findings. First, subjects may change the way they calculate the contingencies of the task following training. From a neural perspective, model calculations may be implemented in new brain areas such that they no longer overlap with those used in the concurrent task. Training has previously been shown to cause "off-loading" in tasks requiring executive resources, including an implementational shift from prefrontal to parietal and striatal regions [

Second, resilience to load could emerge if auxiliary processes (other than reasoning with the structure of the task itself) become more efficient. For example, some cognitive resources may be required for identifying the various stimuli, for tracking events that occurred on previous trials, and for recalling learned values at the second stage. There may also be resource requirements for maintaining belief distributions over meta-parameters, such as whether the task structure changes or new fractals appear, what appropriate learning rates are, when model-based reasoning should be deployed [

Third, subjects might learn to perform model-based calculations at the end of each trial ("offline"), rather than at the beginning of the next trial. When used to update a cached or habitual value accessed for the next choice, such offline calculation could relieve the need to store the current reward in memory until the beginning of the next trial. In turn, this might allow better allocation of executive resources to the concurrent task. Indeed, a recent experiment has suggested that the model-based system can “train” the model-free system by replaying and simulating experience offline, and that this in turn allows for choice under load that appears model-based [

A final consideration is that choice under load after training may not be truly model-based. Increasingly sophisticated choice heuristics (for example, applying Q-value updates to the opposite first-stage transition following an uncommon transition) can permit behavior that is increasingly difficult to distinguish from fully model-based behavior in the two-step task [

Our regression analysis suggests the possibility that the reduction in

In addition, we found higher w values on day 3 of training in the ‘high load group’ than in the ‘low load group’ in both trial types. Because the ‘high load group’ had more prior exposure to the Stroop task in this comparison, their higher

In our computational model, load affected not just

At first glance our result might appear contrary to a standard view that increasing training produces a shift from goal-directed (model-based) to habitual (model-free) control. For example, it is well established that extended training reduces sensitivity to outcome devaluation [

A central feature of human learning is the ability to acquire very complex task structures, which often involve performing multiple subtasks in parallel. One way to achieve this parallelism is to reduce the subtasks to habits, reflecting fixed and inflexible action patterns. Our work suggests that even when subtasks are performed in parallel, each subtask can realize sophisticated and flexible model-based reasoning. This lends richness to ideas on the range of behavioral repertoires that humans can express. It is also consistent with the notion of "models" throughout processing hierarchies in the brain, from low-level sensory processing to high-level cognition [

The possibility that model-based reasoning can become automatic suggests new failure modes (and treatment avenues) in psychiatric disorders. If maladaptive models become automatic, they may lead to behavior that is both sophisticated and pernicious. Conversely, if adaptive models fail to become automatic when they should, they may fail to compete with maladaptive habits, especially under stress or cognitive load. Yet another possible failure mode is that experience calcifies models into true, inflexible habits rather than automatic models.

In summary, we present data that challenge a widespread notion in decision-making that "goal-directed" and "deliberative" are synonymous. We suggest that the dependence of goal-directed reasoning on serial executive resources can lessen with task experience. This could be important in the acquisition of progressively more complex behavior, with implications for therapies that aim to restore normal decision-making in psychiatric disorders.

Written informed consent was obtained from all participants prior to the experiment and the UCL Research Ethics Committee approved the study (project number 3450/002).

Previous studies in our laboratory and others have shown that 20 to 25 participants provide sufficient power to quantify the contribution of model-free and model-based strategies in the two-step task [

In line with [

In the ‘high load group’, subjects performed alternating blocks of single-task (two-step alone; 128 trials) and dual-task (64 trials) trials until two blocks of each trial type were completed (256 single-task trials and 128 dual-task trials in total). This protocol was repeated across three consecutive days. Subjects received 20 practice trials of each trial type at the start of day 1. In the ‘low load group’, subjects performed 256 trials of the single-task (two-step alone) condition on each of two consecutive days, while the protocol on day 3 was identical to that of the ‘high load group’. Subjects in the ‘low load group’ received 20 practice trials of the single-task condition at the start of day 1, and 20 practice trials of the dual-task condition at the start of day 3.

Subjects performed a two-step decision task based on [

Dual-task trials followed the same procedure, except that subjects had to simultaneously perform a numerical Stroop task [

The reward probabilities associated with second-stage fractals were governed by independently drifting Gaussian random walks (SD = 0.025). We generated a pool of fifteen random walks for which reward probabilities did not exceed ~0.75 or fall below ~0.25. For each subject, three walks were selected at random from the pool for use on each successive day of training. Thus, walks were continuous between blocks of single-task and dual-task trials.
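One way such a pool could be produced is by rejection sampling, sketched below. This is an illustrative sketch under stated assumptions rather than the procedure actually used: the starting point (drawn uniformly within the bounds), the hard bounds of exactly 0.25 and 0.75, the treatment of a "walk" as a single probability trajectory, and the function name `bounded_random_walk` are all hypothetical.

```python
import random


def bounded_random_walk(n_trials, sd=0.025, lo=0.25, hi=0.75, rng=None,
                        max_tries=100_000):
    """Draw one Gaussian random walk of reward probabilities and accept it
    only if every value stays within [lo, hi] (rejection sampling)."""
    rng = rng or random.Random()
    for _ in range(max_tries):
        p = rng.uniform(lo, hi)  # assumed starting point; not stated in the text
        walk = [p]
        while len(walk) < n_trials:
            p += rng.gauss(0.0, sd)  # independent Gaussian drift per trial
            if not lo <= p <= hi:
                break  # walk left the allowed range: reject and redraw
            walk.append(p)
        else:
            return walk  # loop completed without leaving the bounds
    raise RuntimeError("no walk stayed within bounds; relax the limits")


rng = random.Random(0)
pool = [bounded_random_walk(100, rng=rng) for _ in range(15)]  # pool of fifteen
walks_for_subject = rng.sample(pool, 3)  # one walk per training day
```

Rejection sampling matches the description of a pool filtered by its range. Reflecting boundaries would be an alternative design that never discards a walk but changes the drift statistics near the limits.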

Based on [ the task comprises three states ($s_A$ for the first-stage fractal pair; $s_B$ and $s_C$ for the second-stage fractal pairs) where two possible actions ($a_A$, $a_B$) can be taken from each state. The goal of each RL algorithm is to learn a state-action value function $Q(s, a)$. On trial $t$, first- and second-stage states are denoted $s_{1,t}$ and $s_{2,t}$ respectively, while first- and second-stage choices (actions) are indicated as $a_{1,t}$ and $a_{2,t}$. Since there is no reward at the first stage, $r_{1,t}$ is always zero, while $r_{2,t}$ can be zero or one.

The model-free algorithm was temporal difference Q-learning [

Note that for the first-stage choice, $r_{i,t}$ is always zero and

After outcome delivery, the second-stage RPE is used to update the first-stage action value $Q_{TD}(s_{1,t}, a_{1,t})$ according to the eligibility trace λ, which assigns credit to the first-stage action without the need for an additional step.

Thus, in the event that λ = 0, choice is driven by the estimated value of the second-stage state on the previous trial. Consistent with previous studies [
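The update equations for this section did not survive in this copy, so the following is a minimal sketch of a temporal difference update with an eligibility trace as commonly written for the two-step task. Several details are assumptions rather than the authors' confirmed implementation: a single learning rate `alpha` shared across stages, bootstrapping from the value of the chosen second-stage action, and the hypothetical function name `td_update`.

```python
def td_update(q, s1, a1, s2, a2, reward, alpha, lam):
    """Update state-action values in place after one two-step trial.

    q      -- dict mapping state -> list of two action values
    s1, a1 -- first-stage state and chosen action index
    s2, a2 -- second-stage state and chosen action index
    reward -- second-stage reward (0 or 1); the first-stage reward is always 0
    alpha  -- learning rate (assumed shared across stages)
    lam    -- eligibility trace, lambda
    """
    # First-stage RPE: no immediate reward, so it is driven by the
    # current value of the second-stage action that was chosen.
    delta1 = 0.0 + q[s2][a2] - q[s1][a1]
    q[s1][a1] += alpha * delta1

    # Second-stage RPE: driven by the immediate reward.
    delta2 = reward - q[s2][a2]
    q[s2][a2] += alpha * delta2

    # Eligibility trace: pass the second-stage RPE back to the first-stage action.
    q[s1][a1] += alpha * lam * delta2
    return delta1, delta2


q = {"A": [0.0, 0.0], "B": [0.0, 0.0], "C": [0.0, 0.0]}
td_update(q, "A", 0, "B", 1, reward=1, alpha=0.5, lam=0.0)
value_after_trial_1 = q["A"][0]  # unchanged: with lambda = 0 the reward is not passed back
td_update(q, "A", 0, "B", 1, reward=1, alpha=0.5, lam=0.0)
value_after_trial_2 = q["A"][0]  # now reflects the second-stage value learned on trial 1
```

With λ = 0, the first-stage value only changes on the trial after the second-stage value was learned, which is the one-trial lag described above. With λ = 1, the reward reaches the first-stage value within the same trial.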

A model-based RL algorithm involves learning a set of contingencies between actions and states (a state-transition function), estimating a reward value for each state, and then combining the two by iterative expectation. Here, since first-stage transitions are probabilistic, a player must map action-state pairs to a probability distribution over subsequent states.

One can approximate subjects’ estimate of the transition probabilities by assuming they believe one of two alternatives: that the common transitions are from $s_A$ to $s_B$ given $a_A$ and from $s_A$ to $s_C$ given $a_B$ (or vice versa). A previous study has shown this scheme settles on the true transition matrix after the first few trials and fits subjects’ choices better than implementing a traditional trial-by-trial learning algorithm [

Since the second-stage action is the only choice associated with immediate reward, and is the final step in a trial, an agent can learn the value of the second-stage state in a manner equivalent to temporal difference Q-learning (as above). Thus, $Q_{TD}(s_{2,t}, a_{2,t})$ is simply an estimate of the immediate reward $r_{2,t}$, and the model-based algorithm converges with model-free learning at this stage.

By combining the transition function with the second-stage values we can define the values of the two first-level actions (using Bellman’s equation) as follows:
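The equation itself is missing from this copy. A plausible form, written in the notation of the surrounding text, is given below. The symbol $Q_{MB}$, the index $j$, and the use of a maximum over second-stage actions are assumptions based on the standard formulation of this task, not a verbatim restoration.

```latex
Q_{MB}(s_A, a_j) = P(s_B \mid s_A, a_j) \max_{a \in \{a_A, a_B\}} Q_{TD}(s_B, a)
                 + P(s_C \mid s_A, a_j) \max_{a \in \{a_A, a_B\}} Q_{TD}(s_C, a)
```

Each first-stage action is thus valued by the best available second-stage value in each successor state, weighted by the estimated probability of reaching that state.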

For the hybrid model we consider contributions from both model-free and model-based RL. First-stage action values were defined as the weighted sum of values from the algorithms described above as follows:
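The weighted sum can be sketched as follows. This is an illustrative example, not the fitted model: the function name `hybrid_first_stage_values`, the data layout, and the transition probabilities of 0.7 and 0.3 used in the example are assumptions. The direction of the weight follows the convention stated elsewhere in this paper, where w = 1 is pure model-based and w = 0 is pure model-free.

```python
def hybrid_first_stage_values(q_td, transition, w, first_state="A"):
    """Combine model-based and model-free first-stage values.

    q_td       -- dict mapping state -> list of two model-free action values
    transition -- transition[a] maps each second-stage state to
                  P(state | first_state, action a)
    w          -- weight on the model-based value (1 = pure model-based)
    """
    q_net = []
    for a in (0, 1):
        # Model-based value: expected best second-stage value under the model.
        q_mb = sum(p * max(q_td[s2]) for s2, p in transition[a].items())
        q_net.append(w * q_mb + (1.0 - w) * q_td[first_state][a])
    return q_net


q_td = {"A": [0.2, 0.4], "B": [0.6, 0.1], "C": [0.3, 0.9]}
transition = [{"B": 0.7, "C": 0.3}, {"B": 0.3, "C": 0.7}]  # assumed probabilities
pure_mb = hybrid_first_stage_values(q_td, transition, w=1.0)
pure_mf = hybrid_first_stage_values(q_td, transition, w=0.0)
mixed = hybrid_first_stage_values(q_td, transition, w=0.5)
```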

When fitting data across all sessions, we included a slope parameter sigma ($\sigma$) governing a shift in the model-free/model-based weight across days, with $w_D$ as the new weighting parameter.

At the second stage, all three models (model-free, model-based, hybrid) converge.

For each model, values were converted to action probabilities using a sigmoid (softmax) function:
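The choice rule is not reproduced in this copy, so the sketch below shows one common way to combine a softmax with the inverse temperature β and lapse rate ε listed among the model parameters. The specific lapse form (mixing the softmax with a uniform distribution) and the function name `choice_probabilities` are assumptions, not the confirmed parameterization.

```python
import math


def choice_probabilities(values, beta, epsilon=0.0):
    """Softmax over action values with an assumed uniform lapse component.

    values  -- action values for the available actions
    beta    -- softmax inverse temperature
    epsilon -- lapse rate: probability mass spread evenly over all actions
    """
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    n = len(values)
    return [(1.0 - epsilon) * e / total + epsilon / n for e in exps]


probs = choice_probabilities([1.0, 0.0], beta=1.0)
```

With two actions and no lapse, the softmax reduces to a logistic sigmoid of β times the value difference, which is why the two terms are used interchangeably here.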

When fitting data from individual days, we considered a hybrid RL model that included a single learning rate (

When fitting data across all days, we considered a family of (nested) hybrid RL models in which specific parameters were omitted or included as fixed versus free parameters. More complex models included separate RL parameters for first and second stage choices, an eligibility trace, and a slope parameter that permitted the weighting between model-free and model-based control to shift across days. See

The model fitting routine follows that previously described by Huys and colleagues [_{i}, for each subject, _{i}, given a vector of each subject’s choices,_{i}:

We used a hierarchical (random effects) model-fitting approach, with the assumption that parameter estimates were normally distributed at the group level, where

The intractable integral above was estimated by Expectation-Maximization (EM). The E-step at the

We used a Laplace approximation, which assumes that the likelihood surface is normally distributed around the maximum a posteriori parameter estimate:
^{(k)} of the normal prior distribution, mean ^{2}, were updated as follows:

We compared models by Bayesian model evidence, _{1} … _{N}|_{int}:
_{1} … _{N}| is the total number of choices made by all subjects, and |

The right hand expression approximates the integral by summing over
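The expressions referred to in this passage are missing from this copy. For orientation, an integrated BIC of the kind used in hierarchical model fitting is often written as below. The symbols are assumptions chosen to match the fragments that remain: $c_i$ for the choices of subject $i$, $h$ for individual parameters, $\hat{\theta}$ for the fitted group-level prior parameters, $|M|$ for the number of prior parameters, and $K$ for the number of samples drawn from the prior.

```latex
\mathrm{BIC}_{int} = -2 \log p(c_1 \ldots c_N \mid \hat{\theta}) + |M| \log |c_1 \ldots c_N|

\log p(c_1 \ldots c_N \mid \hat{\theta})
  = \sum_{i=1}^{N} \log \int p(c_i \mid h)\, p(h \mid \hat{\theta})\, dh
  \approx \sum_{i=1}^{N} \log \frac{1}{K} \sum_{k=1}^{K} p(c_i \mid h^{k})
```

Lower values indicate a better trade-off between fit and complexity, which is consistent with the winning model being the one with the lowest iBIC score.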

In line with recent studies using the two-step task, we considered model-free and model-based influences on choice in the current trial, with respect to events that occurred up to 3 trials in the past [ Regressors described whether events on trials $t_{-1}$, $t_{-2}$ and $t_{-3}$ would increase (coded as +1) or decrease (coded as -1) the probability of choosing A according to a model-free or a model-based system (6 regressors in total). Importantly, if a trial involved a common transition, both systems make identical predictions. However, opposing predictions emerge following uncommon transitions. We implemented a random-effects logistic regression in Matlab (MathWorks) and performed one-sample t-tests on the resulting coefficient estimates for the 6 regressors, separately for trained (day 3) versus untrained (day 1), and dual-task (high load) versus single-task (low load) conditions (see
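The coding of the six regressors can be sketched as follows. This is a hedged illustration of the +1/-1 scheme described above, written in Python rather than the Matlab used in the study. The function name `lagged_regressors`, the key names, and the input encoding are hypothetical.

```python
def lagged_regressors(choices, transitions, rewards, t, max_lag=3):
    """Build model-free and model-based regressors for the choice on trial t.

    Each regressor is +1 if the events on an earlier trial should raise the
    probability of choosing "A" under that system, and -1 if they should
    lower it. Requires t >= max_lag.
    """
    row = {}
    for lag in range(1, max_lag + 1):
        k = t - lag
        chose_a = 1 if choices[k] == "A" else -1
        outcome = 1 if rewards[k] else -1
        # Model-free: reward favors repeating the chosen action,
        # non-reward favors the other action, whatever the transition.
        mf = chose_a * outcome
        # Model-based: same prediction after a common transition,
        # opposite prediction after an uncommon transition.
        mb = mf if transitions[k] == "common" else -mf
        row[f"mf_lag{lag}"] = mf
        row[f"mb_lag{lag}"] = mb
    return row


row = lagged_regressors(
    choices=["A", "B", "A", "A"],
    transitions=["common", "uncommon", "uncommon", "common"],
    rewards=[1, 1, 0, 0],
    t=3,
)
```

Rows built this way for every trial from the fourth onward would form the design matrix for a logistic regression on whether "A" was chosen.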

We performed a logistic regression on data from the ‘low load group’ on day 3 of training to estimate the relationship between choice on trial $t$ and events on trials $t_{-1}$ up to $t_{-3}$. Here, regression coefficients can be interpreted as reflecting a model-free or model-based influence on choice, where larger coefficients indicate a stronger influence. In the single-task condition (blue bars), model-free and model-based coefficients were significantly different from 0 (up to 3 trials in the past), suggesting that subjects used a hybrid of both strategies. In the dual-task (high load) condition (orange bars), we observed a significant influence of a model-based system that did not differ from the single-task condition, up to 3 trials in the past. In contrast, we found no significant influence of a model-free system. These results are consistent with data from the ‘high load group’ (see

Bar plots show the average probability with which subjects chose to repeat their first-stage action on the subsequent trial as a function of the transition (common vs. uncommon) and outcome (rewarded vs. unrewarded) on the previous trial. Blue bars correspond to common transitions and red bars correspond to uncommon transitions. Vertical lines represent SEM. (

Results of a Bayesian model comparison that accounted for differences in model complexity. The hybrid model, which incorporated influences from both model-free and model-based control, fit subject data better than pure model-free and model-based RL algorithms across both trial types (single-task versus dual-task) and both groups (‘high load group’ day 1, ‘low load group’ day 3). Bold-face denotes the winning model (lowest iBIC score) for each condition. α = learning rate; β = softmax inverse temperature; ε = lapse rate; w = model-free/model-based weight. The eligibility trace, λ (not shown), was set to 1 in all cases. w was set to 0 and 1 for pure model-free and pure model-based RL respectively.

Results of a Bayesian model comparison that accounts for differences in model complexity. More complex model variants include those that have separate parameters for first and second stage choices, an eligibility trace, and a parameter for capturing shifts in model-free versus model-based control across days (σ). In simpler models, RL parameters were fixed between first and second stage choices, the eligibility trace was fixed at 1, and σ was set to 0. Bold-face denotes the winning model (lowest iBIC score) for each condition. Parameters followed by a superscript of 1 or 2 correspond to first-stage or second-stage choices respectively. α = learning rate; β = softmax inverse temperature; ε = lapse rate; w = model-free/model-based weight; λ = eligibility trace; σ = slope governing a shift in model-free/model-based weight (w) across days.

Best-fitting parameter estimates shown separately for each group and condition (single-task versus dual-task), using data concatenated across all 3 days of training. Values represent mean parameter fits across all subjects. * represents fixed parameter values. Parameters followed by a superscript of 1 or 2 correspond to first-stage or second-stage choices respectively. In simpler models, λ was fixed at 1 and σ was set to 0. α = learning rate; β = softmax inverse temperature; ε = lapse rate; w = model-free/model-based weight; λ = eligibility trace; σ = slope governing a shift in model-free/model-based weight (w) across days.

Table shows the group-level output of a logistic regression on first-stage switch-stay behavior, separately for single-task (‘high load group’ and ‘low load group’) and dual-task trials, from data concatenated across all 3 training sessions. We note that ‘reward x day’ was orthogonalized with respect to reward, and in turn ‘reward x transition x day’ was orthogonalized with respect to ‘reward x transition’. These regressors thus account for variance unexplained by the simpler main effect or 2-way interaction respectively (see

We would like to thank Peter Dayan, Peter Smittenaar, Ross Otto and Kevin Miller for helpful discussions.