The authors have declared that no competing interests exist.
Many accounts of decision making and reinforcement learning posit the existence of two distinct systems that control choice: a fast, automatic system and a slow, deliberative system. Recent research formalizes this distinction by mapping these systems to “model-free” and “model-based” strategies in reinforcement learning. Model-free strategies are computationally cheap, because action values can simply be retrieved from a look-up table constructed through trial and error, but they are sometimes inaccurate. In contrast, model-based strategies compute action values through planning in a causal model of the environment, which is more accurate but also more cognitively demanding. It is assumed that this trade-off between accuracy and computational demand plays an important role in the arbitration between the two strategies, but we show that the hallmark task for dissociating model-free and model-based strategies, as well as several related variants, does not embody such a trade-off. We describe five factors that reduce the effectiveness of the model-based strategy on these tasks by reducing its accuracy in estimating reward outcomes and decreasing the importance of its choices. Based on these observations, we describe a version of the task that formally and empirically obtains an accuracy-demand trade-off between model-free and model-based strategies. Moreover, we show that human participants spontaneously increase their reliance on model-based control on this task, compared to the original paradigm. Our novel task and our computational analyses may prove important in subsequent empirical investigations of how humans balance accuracy and demand.
When you make a choice about what groceries to get for dinner, you can rely on two different strategies. You can make your choice by relying on habit, simply buying the items you need to make a meal that is second nature to you. However, you can also plan your actions in a more deliberative way, realizing that the friend who will join you is a vegetarian, and therefore you should not make the burgers that have become a staple in your cooking. These two strategies differ in how
Theoretical accounts of decision making emphasize a distinction between two systems competing for control of behavior [
Recent research formalizes the dual-system architecture in the framework of reinforcement learning [
Currently, the dominant method that aims to dissociate mechanisms of behavioral control within the reinforcement learning framework is the “two-step task” introduced by Daw, Gershman, Seymour, Dayan, and Dolan [
(A) State transition structure of the original two-step paradigm. Each first-stage choice has a high probability of transitioning to one of two states and a low probability of transitioning to the other. Each second-stage choice is associated with a probability of obtaining a binary reward. (B) To encourage learning, the second-stage reward probabilities change slowly over the course of the experiment.
The fundamental problem in reinforcement learning is estimation of state-action values (cumulative future reward), which an agent then uses to choose actions. In the dual-system theory, the fast and automatic system corresponds to a “model-free” reinforcement learning strategy, which estimates state-action values from trial-and-error learning [
The slow and deliberative system corresponds to a “model-based” learning strategy that possesses operating characteristics complementary to the model-free strategy. This strategy learns an explicit causal model of the environment, which it uses to construct plans (e.g., by dynamic programming or tree search). In contrast to the habitual nature of the model-free strategy, the capacity to plan enables the model-based strategy to flexibly pursue goals. While more computationally expensive (hence slower and more effortful) than the model-free approach, it has the potential to be more accurate, because changes in the environment can be immediately incorporated into the model. The availability of a causal model also allows the model-based strategy to solve the credit-assignment problem optimally.
The dual-system framework sketched above can account for important findings in the reinforcement learning literature, such as insensitivity to outcome devaluation following overtraining of an action-reward contingency [
How might the brain arbitrate between model-free and model-based strategies? Since the model-based strategy attains more accurate performance through effortful computation, people can (up to a point) increase reward by engaging this system. However, in time-critical decision making settings, the model-based strategy may be too slow to be useful. Furthermore, if cognitive effort enters into the reward function [
Here, we will first describe in detail the design of the Daw two-step task, and the reinforcement-learning model of this task [
In the two-step task, participants make a series of choices between two stimuli, which lead probabilistically to one of two second-stage states (
These low-probability transitions allow for a behavioral dissociation between habitual and goal-directed choice. Since the model-free strategy is insensitive to the structure of the task, it will simply increase the likelihood of performing an action if it previously led to reward, regardless of whether this reward was obtained after a common or rare transition. Choice dictated by the model-based strategy, on the other hand, reflects an interaction between the transition type and reward on the previous trial (
(A) For model-free agents, the probability of repeating the previous choice depends only on whether a reward was obtained, and not on the transition structure. (B) Model-based behavior is reflected in an interaction between the previous transition and outcome, increasing the probability of transitioning to the state where reward was obtained. (C) Behavioral performance on this task reflects features of both model-based and model-free decision making: a main effect of previous reward and its interaction with the previous transition.
Behavior on the Daw two-step task can be modeled using an established dual-system reinforcement-learning model [
We now describe how these learning rules apply specifically to the two-step task. The reward prediction error is different for the two stages of the task. Since
Since there is no third stage, the second-stage prediction error is driven by the reward
Both the first- and second-stage values are updated at the second stage, with the first-stage values receiving a prediction error down-weighted by the eligibility trace decay, λ. Thus, when λ = 0, only the values of the current state get updated.
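To make these updates concrete, the sketch below implements the model-free learning rules in Python, assuming the standard SARSA(λ)-style formulation with learning rate α (alpha) and eligibility-trace decay λ (lam); the variable names and array layout are our own.

```python
import numpy as np

def model_free_update(Q1, Q2, a1, s2, a2, reward, alpha, lam):
    """One trial of model-free (SARSA(lambda)-style) updating in the two-step task.

    Q1: first-stage action values (one entry per first-stage action)
    Q2: 2 x n_actions array of second-stage action values
    """
    # First-stage prediction error: there is no immediate reward, so it is
    # driven by the value of the second-stage action that was selected.
    delta1 = Q2[s2, a2] - Q1[a1]
    Q1[a1] += alpha * delta1

    # Second-stage prediction error: driven by the obtained reward,
    # since there is no third stage.
    delta2 = reward - Q2[s2, a2]
    Q2[s2, a2] += alpha * delta2

    # The first-stage value also receives the second-stage error,
    # down-weighted by the eligibility-trace decay; with lam = 0,
    # only the values of the current state are updated.
    Q1[a1] += alpha * lam * delta2
    return Q1, Q2
```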
At the second stage, the learning of the immediate rewards is equivalent to the model-free learning, since those Q-values are simply an estimate of the immediate reward
The model-based values are defined in terms of Bellman’s equation [
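As a concrete illustration, the sketch below computes the model-based first-stage values from the learned transition probabilities and the second-stage values; it assumes the usual formulation in which the value of each second-stage state is the value of its best action, and the function names are our own.

```python
import numpy as np

def model_based_values(Q2, transition_probs):
    """Model-based first-stage action values via the Bellman equation.

    Q2: 2 x n_actions array of second-stage action values (shared with the
        model-free system, since they estimate the immediate reward).
    transition_probs: 2 x 2 array with transition_probs[a, s] = P(s | a).
    """
    # The value of each second-stage state is the value of its best action.
    state_values = Q2.max(axis=1)
    # Expected value of each first-stage action under the learned model.
    return transition_probs @ state_values
```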
Again, at the second stage the decision is made using only the model-free values. We used the softmax rule to translate these Q-values to actions. This rule computes the probability for an action, reflecting the combination of the model-based and model-free action values weighted by an inverse temperature parameter. At both states, the probability of choosing action
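A minimal sketch of this choice rule follows, assuming the first-stage values are a w-weighted mixture of model-based and model-free values passed through a softmax with inverse temperature β; the perseveration (stickiness) terms used in the full model are omitted for brevity.

```python
import numpy as np

def softmax_choice_probs(q_mb, q_mf, w, beta):
    """Choice probabilities from a softmax over mixed action values.

    w: weighting parameter (1 = fully model-based, 0 = fully model-free)
    beta: inverse temperature controlling choice stochasticity
    """
    q_net = w * q_mb + (1.0 - w) * q_mf      # weighted mixture of values
    prefs = beta * q_net
    prefs -= prefs.max()                     # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()
```

At the second stage, the same rule is applied to the model-free values alone, consistent with the description above.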
In order to test whether the Daw two-step task embodies a trade-off between goal-directed behavior and reward, we estimated the relationship between control (model based vs. model free) and reward by Monte Carlo simulation.
For each simulation, we generated a new set of four series of independently drifting reward probabilities across 201 trials according to a Gaussian random walk (mean = 0,
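The sketch below generates such a set of drifting reward probabilities with reflecting bounds; the drift standard deviation shown here is illustrative, and the 0.25-0.75 bounds match the original task (widening them to 0-1 implements the change discussed later).

```python
import numpy as np

def drifting_reward_probs(n_trials=201, n_arms=4, sd=0.025,
                          lower=0.25, upper=0.75, rng=None):
    """Independently drifting reward probabilities for the second-stage actions.

    Each series follows a Gaussian random walk (mean 0, standard deviation sd)
    with reflecting bounds at `lower` and `upper`.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.empty((n_trials, n_arms))
    probs[0] = rng.uniform(lower, upper, size=n_arms)
    for t in range(1, n_trials):
        step = probs[t - 1] + rng.normal(0.0, sd, size=n_arms)
        # Reflect any values that drift past the bounds back into range.
        step = np.where(step > upper, 2 * upper - step, step)
        step = np.where(step < lower, 2 * lower - step, step)
        probs[t] = np.clip(step, lower, upper)
    return probs
```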
(A) Surface plot of the standardized linear effect of the weighting parameter on reward rate in the original version of the two-step task. Each point reflects the average of 1000 simulations of a dual-system reinforcement-learning model of behavior of this task with different sets of drifting reward probabilities, as a function of the learning rate and inverse temperature of the agents. The red circle shows the median fit. Importantly, across the entire range of parameters, the task does not embody a trade-off between habit and reward. (B) An example of the average relationship between the weighting parameter and reward rate with inverse temperature = 5.0 and α = 0.5 (mirroring the median fits reported by Daw and colleagues [
The striking feature of this surface map is that the regression coefficients are uniformly close to zero, indicating that none of the parameterizations yielded a linear relationship between model-based control and reward rate.
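One way to obtain such a coefficient for a single cell of the parameter grid is sketched below: simulate many agents that differ in their weighting parameter w, record each agent's reward rate, and regress the standardized reward rate on the standardized w. The assumption that w is drawn uniformly across simulations, and the helper `simulate_agent`, are ours; the original procedure may differ in detail.

```python
import numpy as np

def standardized_effect_of_w(simulate_agent, alpha, beta, n_sims=1000, rng=None):
    """Standardized linear effect of the weighting parameter on reward rate.

    simulate_agent(w, alpha, beta, rng) is assumed to run one dual-system
    agent on a freshly generated set of drifting reward probabilities and
    return its average reward per trial.
    """
    rng = np.random.default_rng() if rng is None else rng
    w_values = rng.uniform(0.0, 1.0, size=n_sims)
    rewards = np.array([simulate_agent(w, alpha, beta, rng) for w in w_values])
    # With both variables standardized, the OLS slope equals their correlation.
    w_z = (w_values - w_values.mean()) / w_values.std()
    r_z = (rewards - rewards.mean()) / rewards.std()
    return float(np.mean(w_z * r_z))
```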
Since its conception, the design of the Daw two-step task has been used in many similar sequential decision making tasks. Given the surprising absence of the accuracy-demand trade-off in the original task, it is important to investigate whether related versions of this paradigm are subject to the same shortcoming.
In one of these variants, developed by Dezfouli and Balleine [
The red circle shows the median fit. Similar to the Daw variant, this task does not capture a trade-off between accuracy and demand across all tested parameterizations.
A second variant, reported by Doll and colleagues [
(A) Surface plot of the linear relationship between the weighting parameter and reward rate in the Doll version of the two-step task. The red circle shows the median fit. Similar to the Daw variant, this task does not capture a trade-off between accuracy and demand across all tested parameterizations, except for a slightly elevated region of parameter space with high inverse temperature and low learning rate. (B) Behavioral predictions in this task. The model-free system learns separate values for each action in each state, so outcomes only affect choices in the same start state. Our simulation of model-free behavior revealed elevated likelihood of staying after a reward from the other state, since this means there is a current high-probability option that the model-free system has been learning about after transitioning there from both start states. The model-based system (on the right) treats start states as equivalent, since they both afford the same transitions, so choices are not affected by whether the previous start state was the same or different.
The dissociation between habit and planning in this task follows a different logic. Here, it is assumed that only model-based learners use the implicit equivalence between the two first-stage states, and can generalize knowledge across them. Therefore, for a model-based learner, outcomes at the second level should equally affect first-stage preferences on the next trial, regardless of whether this trial starts with the same state as the previous trial or a different one. For model-free agents, however, rewards that are received following one start state should not affect subsequent choices from the other start state. According to Doll and colleagues [
However, our simulations revealed that the behavioral profile for the model-free learner also showed a slightly elevated likelihood to stay with the previous choice after a reward if the start state was different (
In order to assess whether the Doll two-step task embodied a trade-off between accuracy and demand, we again estimated the relationship between the weighting parameter and reward rate. This analysis (depicted in
Despite the substantial differences between these variants of the two-step task, we found that none of them encompasses a motivational trade-off between planning and reward. This observation naturally raises a question: Why does planning not produce an increased reward rate in this task? What characteristics of the paradigm distort the accuracy-demand trade-off?
We investigate five potential explanations. These are not mutually exclusive; rather, they may have a cumulative effect, and they may also interact with each other. First, we show that the sets of drifting reward probabilities that are most often employed are marked by relatively low distinguishability. Second, we show that the rate of change in this paradigm is slow and does not require fast online (model-based) flexibility. Third, we show that the rare transitions in the Daw two-step task diminish the reward-maximizing effect of a model-based choice. Fourth, we show that the presence of the choice at the second stage decreases the importance of the choice at the first stage, which is the only phase where the model-based system has an influence. Fifth, we show that the stochastic reward observations in this task do not carry enough information about the value of the associated stimuli. We use simulations of performance on novel tasks to demonstrate these five points and, as a result, develop a novel paradigm that embodies an accuracy-demand trade-off.
In the two-step task, the difference between model-based and model-free strategies only carries consequences for the first stage, since the second stage values are identical for both strategies. Therefore, the finding that model-based control is not associated with an increased reward rate suggests that the first-stage choices the agent makes do not carry importance, for example, because the reward outcomes at the second stage are too similar. In the original version of the two-step task, the reward probabilities have a lower bound of 0.25 and an upper bound of 0.75. This feature results in a distribution of differences between reward probabilities that is heavily skewed left (
(A) Distribution of differences in reward probabilities between the actions of each trial. (B) Increasing the range of probabilities increases the average linear effect between model-based control and reward for a parameter space associated with high inverse temperatures and relatively low learning rate. Average parameter fits in the original report do not lie within this region of increased sensitivity to the accuracy-demand trade-off.
One straightforward way to increase the differences between the second-stage options is to maximize the range of reward probabilities by setting the lower bound to 0 and the upper bound to 1 (e.g., [
It is also possible that the changes over time in the second-stage reward probabilities, depicted in
In order to explore the effect of the drift rate (i.e., the standard deviation of the Gaussian noise that determines the random walks of the reward distributions) on the accuracy-demand trade-off, we performed simulations of the generative reinforcement learning model with inverse temperature parameter
The results, depicted in
(A) The effect of the size of the drift rate on the relationship between model-based control and reward, for two-step tasks with a narrow and a broad reward probability range. (B) Increasing the range of probabilities and the drift substantially increases the average linear effect between model-based control and reward when the inverse temperature is high.
These analyses show that the rate of change of the reward probabilities in the original Daw two-step task is too slow to promote model-based planning. The relationship between reward and model-based control becomes stronger when the drift rate of the Gaussian random walk governing the reward probabilities is moderately increased, and this effect is especially pronounced when these probabilities are more dissociable. However, even though these two factors contribute substantially to the absence of the accuracy-demand trade-off in the Daw two-step task, we found that a task that adjusted for their shortcomings only obtained a modest trade-off between reward and goal-directed control.
Because the Daw two-step task employs rare transitions, model-based choices at the first stage do not always lead to the state that the goal-directed system selected. This feature of the task might lead to a weakening of the relationship between model-based control and reward rate. The task structure employed by Doll and colleagues [
To assess the influence of the deterministic task structure, we simulated performance on the Doll version of the two-step task with sets of reward probabilities with the wider range and increased drift rate (a bounded Gaussian random walk with
(A) Because of the deterministic transitions, model-based choices in the Doll two-step task always result in the desired state outcome. Combined with increased distinguishability and increased drift rate in the reward probabilities, this task results in a substantial increase in the relationship between planning and reward. (B) When this task structure is adapted to include stochastic transitions, the relationship between planning and reward is significantly reduced, indicating an important contribution of the rare transitions in diminishing the accuracy-demand trade-off in the original paradigm.
Even though this result is consistent with the assumption that model-based choices in the Daw two-step task lead to the desired state less often than in the deterministic version of the two-step task, it is equally possible that the second task shows an increased accuracy-demand trade-off because it introduces the possibility of generalization across actions, and not because of the elimination of the rare transitions. To disentangle these two possibilities, we simulated reinforcement-learning performance on a hybrid task with two starting states but with rare transitions (Simulation 3b;
As noted above, model-based and model-free strategies make divergent choices only at the first stage of the multi-step paradigms we have considered so far; at the second stage, both strategies perform a biased selection weighted towards the reward-maximizing option. Thus, the advantage of model-based control over model-free control is approximately bounded by the difference between the maximum value of all actions available in one second-stage state and the maximum value of all actions available in the other second-stage state. Intuitively, as the number of actions available within each second-stage state grows, this difference will shrink, because both second-stage states will likely contain some action close to the maximum possible reward value (i.e., a reward probability of 1). Conversely, the difference between the maximum value actions available in each second-stage state will be greatest when only a single action is available in each state. This design should favor the largest possible difference in the rate of return between model-based and model-free strategies.
To quantify this, we generated 10,000 sets of reward probabilities in this task (according to a Gaussian random walk with reflecting bounds at 0 and 1 and
Since the model-based system only contributes to the first-stage decision, we simulated performance of the reinforcement-learning model in a deterministic two-step task in which the second-stage states do not contain a choice between two actions. In this task, the average difference in reward probabilities that the model-based system uses to make a choice at the first stage is 33%, an increase in comparison to the task that implements a binary choice in the second-stage states.
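The sketch below illustrates this point numerically: it estimates the average gap between the best reward probabilities of the two second-stage states as a function of how many actions each state contains (the drift standard deviation is illustrative). With a single action per state the gap is largest, consistent with the roughly 33% difference noted above.

```python
import numpy as np

def bounded_walk(n_trials, n_arms, sd, rng):
    """Gaussian random walk on [0, 1] with reflecting bounds."""
    p = np.empty((n_trials, n_arms))
    p[0] = rng.uniform(0.0, 1.0, n_arms)
    for t in range(1, n_trials):
        step = p[t - 1] + rng.normal(0.0, sd, n_arms)
        step = np.abs(step)                            # reflect at 0
        step = np.where(step > 1.0, 2.0 - step, step)  # reflect at 1
        p[t] = np.clip(step, 0.0, 1.0)
    return p

def mean_max_value_gap(n_actions_per_state, n_sets=10_000, n_trials=201,
                       sd=0.2, rng=None):
    """Average absolute difference between the best reward probabilities
    of the two second-stage states."""
    rng = np.random.default_rng() if rng is None else rng
    gaps = np.empty(n_sets)
    for i in range(n_sets):
        p = bounded_walk(n_trials, 2 * n_actions_per_state, sd, rng)
        best_a = p[:, :n_actions_per_state].max(axis=1)
        best_b = p[:, n_actions_per_state:].max(axis=1)
        gaps[i] = np.abs(best_a - best_b).mean()
    return float(gaps.mean())

# mean_max_value_gap(1) should exceed mean_max_value_gap(2): adding actions
# shrinks the advantage of choosing the better second-stage state.
```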
To assess whether this change to the task resulted in a stronger accuracy-demand trade-off, we simulated performance on this task and estimated the strength of the relationship between the weighting parameter and reward rate, across the same range of reinforcement-learning parameters (Simulation 4;
In order to determine the value of an action in the two-step task, the stochastic nature of the task requires participants to sample the same action repeatedly and integrate their observations. In other words, since each outcome is either a win or a loss, the information contained in one observation is fairly limited. Here, we will test whether the high amount of ambiguity associated with each observation contributes to the absence of the accuracy-demand trade-off in the two-step task. One way to increase the informativeness of an outcome observation is to replace the drifting reward probabilities at the second stage with drifting scalar rewards, so that the payoff of each action is exactly identical to its value [
In order to test whether the reward distributions for the second-stage actions would improve the information obtained from each observation, we performed a series of simulations for two simple reinforcement learning tasks (
(A) We ran simulations of RL agents on two different two-armed bandit tasks. For one, the reward distributions indicate the reward probability associated with each action. The other task does not include binomial noise; instead, each action pays off a reward that is directly proportional to its value in the reward distribution. (B) Agents show greater accuracy in choosing the highest-value action on the task where the two-armed bandit pays off points instead of affording a probability to win a reward, especially when both the inverse temperature and learning rate are high. (C) The Q-values of each action show stronger correlations with their objective reward value in the task where the two-armed bandit paid off points instead of affording a probability to win a reward.
We first compared the model’s performance on these two tasks by computing the accuracy of its choices, i.e., how often it selected the action with the highest reward probability or reward payoff.
As a second metric of the information contained in each outcome observation, we computed the correlation between the model’s action values and the actual payoffs in the simulations reported above. We expected that the increased precision in outcome observations in the payoff condition would lead to a tighter coupling between the Q-values of the model and the objective values as compared to the probabilities.
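A compact sketch of these two bandit simulations and both metrics is given below; the parameter defaults and the drift of the latent values are illustrative rather than taken from the original simulations.

```python
import numpy as np

def simulate_bandit(scalar_payoff, n_trials=200, alpha=0.5, beta=5.0,
                    drift_sd=0.05, rng=None):
    """Q-learning on a two-armed bandit with drifting latent values.

    If scalar_payoff is True, the obtained reward equals the arm's current
    latent value; otherwise it is a Bernoulli draw with that probability.
    Returns (choice accuracy, correlation between Q-values and latent values).
    """
    rng = np.random.default_rng() if rng is None else rng
    values = rng.uniform(0.0, 1.0, 2)            # latent value of each arm
    Q = np.zeros(2)
    n_correct, q_trace, v_trace = 0, [], []
    for _ in range(n_trials):
        prefs = beta * Q
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()                     # softmax choice rule
        a = rng.choice(2, p=probs)
        n_correct += int(a == values.argmax())
        reward = values[a] if scalar_payoff else float(rng.random() < values[a])
        Q[a] += alpha * (reward - Q[a])          # delta-rule update
        q_trace.append(Q.copy())
        v_trace.append(values.copy())
        # Latent values drift via a Gaussian random walk with reflecting bounds.
        step = np.abs(values + rng.normal(0.0, drift_sd, 2))
        values = np.clip(np.where(step > 1.0, 2.0 - step, step), 0.0, 1.0)
    corr = np.corrcoef(np.ravel(q_trace), np.ravel(v_trace))[0, 1]
    return n_correct / n_trials, corr
```

Running `simulate_bandit(True)` and `simulate_bandit(False)` repeatedly should reproduce the qualitative pattern described here: higher accuracy and a tighter Q-value correlation when the payoff is a scalar.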
Next, we assessed whether this increased performance in the payoff condition would result in a stronger accuracy-demand trade-off in the deterministic two-step task. We reasoned that if the agent obtained a more accurate estimation of the second-stage action values, then the model-based system would be better positioned to maximize reward. To test this prediction, we again estimated the strength of the relationship between the weighting and reward rate, across the range of reinforcement-learning parameters (Simulation 5;
(A) The surface plot of the relationship between model-based control and reward in the novel two-step task with reward payoffs at the second stage. The inclusion of this fifth factor substantially increased the accuracy-demand trade-off in the two-step paradigm. (B) An example of the average relationship between the weighting parameter and reward rate with inverse temperature = 10 and α = 0.4.
In a recent study, Akam and colleagues [
This approach—i.e., a comparison of optimal parameter settings under model-free versus model-based control—provides an important existence proof of the potential benefits of model-based control. However, their way of quantifying the accuracy-demand trade-off differs significantly from the current approach. In order to get a more comprehensive overview of the accuracy-demand trade-off in the Akam two-step task, we again estimated the strength of the relationship between the weighting parameter and reward rate, across the same range of reinforcement learning parameters (
This feature of the task means that an increase in model-based control, keeping all other RL parameters fixed, is not likely to yield significantly increased total reward, because reinforcement learning parameters tend to vary widely across individuals [
Despite these concerns, both tasks achieve an accuracy-demand trade-off, and in this respect represent a substantial improvement over the Daw two-step task. Future empirical work should compare the empirical correlations between reward and model-based control for our task and the Akam two-step task, so as to gain a fuller understanding of their respective merits.
We have identified several key factors that reduce the accuracy-demand trade-off in the Daw two-step task. We found that the sets of drifting reward probabilities that are most often employed in this task are marked by low distinguishability and a rate of change that is too slow to benefit from flexible online adaptation. We also showed that the rare transitions in the original task and the presence of multiple choices in the second-stage states diminished the effect of model-based decisions on reward rate. Finally, we showed that the stochastic reward observations in this task do not carry sufficient information about the value of the associated stimuli. In addition to identifying these factors, we have provided improvements to the paradigm targeting each shortcoming.
We calculated the volume under the surface of coefficients of the linear relationship between the weighting parameter and the reward rate for each of the paradigms in the section above. Across these simulations, we progressively included elements that strengthened the relationship, as summarized in this figure.
Here, we have presented a progression of five factors that enhance the accuracy-demand trade-off in the two-step task. Which of these factors contributed most to the increase in this strength?
In order to test the effect of factor order in our analyses, we computed the surface of regression coefficients for all 32 possible combinations of our binary factors (2^5), using the same procedure as described above (omitting the cases where β = 0, or α = 0). Next, we computed the volume under the surface as an approximation of the average strength of the relationship between model-based control and reward for each of these simulations.
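One simple way to approximate this volume is to weight each regression coefficient by the area of its cell in the (α, β) grid and sum, as sketched below; the treatment of the grid spacing is our own choice.

```python
import numpy as np

def volume_under_surface(coeffs, alphas, betas):
    """Approximate volume under a surface of regression coefficients.

    coeffs: 2-D array with coeffs[i, j] corresponding to alphas[i], betas[j].
    """
    d_alpha = np.gradient(alphas)        # approximate cell widths along alpha
    d_beta = np.gradient(betas)          # approximate cell widths along beta
    cell_area = np.outer(d_alpha, d_beta)
    return float(np.sum(coeffs * cell_area))
```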
Each dot represents the volume under the surface of linear regression coefficients for one task, and is plotted as a function of the number of ‘beneficial’ factors that are included in each task’s design. The gray line represents the average increase in the strength of the relationship between model-based control and reward.
The converse is also true: all factors had a similar and small individual effect on the original Daw paradigm. To see this, compare the score of the original task with 0 factors to the scores of all tasks with 1 factor. The strength of the effect in the original task was only 1.3 standard deviations removed from the tasks with 1 factor, and even slightly better than the 1-factor task with the smallest effect. Most importantly, even if each individual factor did not substantially increase the total effect compared to the original paradigm, their joint inclusion increased the strength of the relationship between model-based control and reward rate by a factor of approximately 230.
At least in theory, we have developed a paradigm that embodies an accuracy-demand trade-off between model-based control and reward rate. Next, we attempt to validate this paradigm by having human participants perform either a novel version of the two-step task with the improved features described above, or the original version of the two-step task as described by Daw and colleagues [
In addition, the comparison between these two paradigms allows us to test whether human participants spontaneously modulate the balance between model-free and model-based control depending on whether a novel task favors model-based control. So far, we have discussed the accuracy-demand trade-off uniquely as it is instantiated in the two-step task. However, if the novel paradigm embodies an empirical accuracy-demand trade-off, then the results of this study allow us to test whether the brain also computes a cost-benefit trade-off between the two systems. We predicted that average model-based control would be elevated in the novel paradigm, since planning was incentivized in this task [
Four hundred and six participants (range: 18–70 years of age; mean: 33 years of age; 195 female) were recruited on Amazon Mechanical Turk to participate in the experiment. Participants gave informed consent, and the Harvard Committee on the Use of Human Subjects approved the study.
One hundred and ninety-nine participants completed 125 trials of the novel two-step reinforcement-learning task. The structure of the task was based on the procedure developed in the previous section. The remaining two hundred and seven participants completed 125 trials of the two-step task with the original Daw structure [
(A) State transition structure of the paradigm. At the first stage, participants choose between one of two pairs of spaceships. Each choice deterministically leads to a second-stage state that was associated with a reward payoff that changed slowly according to a random Gaussian walk over the duration of the experiment. Note that the choices in the two different first-stage states are essentially equivalent. (B) Predicted behavior from the generative reinforcement-learning model of this task (using median parameter estimates, and
The most important feature of the task is that the spaceships in the two first-stage states were essentially equivalent. For each pair, one spaceship always led to the red planet and alien, whereas the other always led to the purple planet and alien. Because of this equivalence, we were able to dissociate model-based and model-free contributions to choice behavior, since only the model-based system generalizes across the equivalent start-state options by computing each action’s value as its expected future reward. Therefore, model-based and model-free strategies make qualitatively different predictions about how second-stage rewards influence first-stage choices on subsequent trials. Specifically, for a pure model-based learner, each outcome at the second stage should affect first-stage preferences on the next trial, regardless of whether this trial starts with the same or the other pair of spaceships. In contrast, under a pure model-free strategy a reward obtained after one pair of spaceships should not affect choices between the other pair.
As explained in detail above, model-based and model-free strategies make qualitatively different predictions about how second-stage rewards influence first-stage choices on subsequent trials. Specifically, choice under a pure model-free strategy should not be affected by the type of transition (common vs. rare) observed on the previous trial (see
Before completing the full task, participants were extensively trained on different aspects of the task. Participants who completed the novel paradigm first learned about the value of space treasure and antimatter, and the change in payoffs from both space mines, by sampling rewards from two different aliens. Next, they learned about the deterministic transitions between spaceships and planets during a phase in which they were instructed to travel to one planet until accurate performance was reached. Participants who completed the Daw paradigm sampled from aliens with different reward probabilities, and were extensively instructed on the transition structure. Finally, both groups of participants practiced the full task for 25 trials. There was no response deadline for any of the sections of the training phase. The colors of the planets and aliens in this phase were different from those in the experimental phase.
We used our reinforcement learning model of the novel task to produce behavioral predictions for a pure model-free and pure model-based decision maker, and an agent with a mixture between model-free and model-based control. This model was largely the same as before, with the exception of how the transition structures were learned.
Recall that participants who completed the novel paradigm performed a practice phase in which they were taught a set of deterministic transitions between the four spaceships and two different planets. Next, they were told that in the experimental phase, the rules and spaceships were the same as in the practice phase, but that there would be new planets. Therefore, we assumed that participants would initially treat each spaceship as equally likely to travel to either of the two planets, until they observed one transition for a first-stage state. After this observation, the model immediately infers the veridical transition structure for that first-stage state.
The participants who completed the Daw paradigm of the two-step task learned about the transition structure through instruction and direct experience in a practice phase with two different planets. They were also told that the rules and spaceships would be the same, but that the planets would be new. Therefore, we assumed that participants initially treated transitions between the spaceships and the planets as equally probable. Next, we characterized transition learning by assuming that participants chose between three possible transition structures as a function of how many transitions they observed between the states and actions: a flat structure with equal probabilities between all states and actions, or two symmetric transition structures with opposite transition probabilities of 70% and 30% between the two spaceships and planets.
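The sketch below gives one simplified, maximum-likelihood reading of this selection rule: the candidate transition structure that best explains the observed transition counts is adopted. The exact selection rule used in the fitted model may differ; the candidate structures follow the 70%/30% description above.

```python
import numpy as np

def select_transition_structure(counts):
    """Choose among three candidate transition structures given observed counts.

    counts[a, s] = number of observed transitions from first-stage action a
    to second-stage state s.
    """
    candidates = {
        "flat": np.full((2, 2), 0.5),                    # equal probabilities
        "common_A": np.array([[0.7, 0.3], [0.3, 0.7]]),  # symmetric structure 1
        "common_B": np.array([[0.3, 0.7], [0.7, 0.3]]),  # symmetric structure 2
    }
    # Multinomial log-likelihood of the observed counts under each candidate.
    log_liks = {name: float(np.sum(counts * np.log(T)))
                for name, T in candidates.items()}
    best = max(log_liks, key=log_liks.get)
    return best, candidates[best]
```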
As we have argued above, in our novel paradigm the differences in the probability of repeating the previous first-stage choice do not show a major qualitative difference between a purely model-based and model-free strategy, when plotted as a function of whether the previous start state is the same as or different from the current start state and whether a reward was obtained on the previous trial (
The lack of qualitative differences in single-trial staying behavior between the model-free and mixture strategies places special importance on model-fitting to quantify the balance between habit and control. Not only does model-fitting incorporate an influence of all previous trials on choice, but it also provides a numerical value for the relative weighting of model-based and model-free strategies (the
In order to demonstrate that standard model-fitting procedures are sufficient to robustly estimate
An alternative way to correct for the influence of reward in the previous trials is by predicting ‘staying’ behavior through a multilevel logistic regression analysis that accounts for this influence with a predictor that incorporates information about the outcome of the previous choice [
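As an illustration of this kind of analysis, the sketch below fits a single-level logistic regression of stay behavior on the previous outcome and whether the start state repeats, using the statsmodels formula interface on synthetic stand-in data; the multilevel (participant-level) structure described in the text is omitted for brevity, and the column names are our own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: one row per trial.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "prev_reward": rng.integers(0, 2, 500),  # outcome on the previous trial
    "same_state": rng.integers(0, 2, 500),   # current start state matches the previous one
    "stay": rng.integers(0, 2, 500),         # previous first-stage choice repeated
})

# Single-level logistic regression; the analysis in the text additionally
# models participant-level (multilevel) effects.
fit = smf.logit("stay ~ prev_reward * same_state", data=df).fit(disp=0)
print(fit.params)
```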
In order to estimate each participant’s weighting parameter, we fitted one of two reinforcement learning models to each participant’s data, depending on which task they completed. This model was equivalent to the models described above, with the exception of the input into the softmax decision rule:
We used maximum
Participants were excluded from analysis if they timed out on more than 20% of all trials (more than 25), and we excluded all trials on which participants timed out (average 2.7%). After applying these criteria, data from 381 participants were submitted to the model-fitting procedure.
For the participants who completed the Daw task, we found that a reward on the previous trial increased the probability of staying with the previous trial’s choice [
(A) Behavioral performance on the Daw task showed both a main effect of previous outcome and an interaction between previous outcome and transition type, suggesting that behavior reflected both model-based and model-free strategies. (B) Behavioral performance on the novel paradigm showed a significant difference in stay behavior between the same and different start-state conditions after a reward, suggesting that behavior was not fully model-based. Error bars indicate within-subject SEM.
For the participants who completed the new paradigm, we found that a positive reward on the previous trial significantly increased staying behavior above chance for both same and different current start states, (
The reinforcement learning models described above incorporate the (decayed) experience of all previous trials into choice and are better able to dissociate the contributions of the two strategies. Each model consists of a model-free system that updates action values using temporal-difference learning and a model-based system that learns the transition model of the task and uses this to compute action values online. The weighting parameter
We first investigated whether the inclusion of either stickiness parameter (
Second, we used model comparison with both goodness-of-fit measures to analyze whether the hybrid model including the
In summary, the model fits presented below used all six free parameters for the participants that completed the Daw paradigm, but omitted the stimulus stickiness parameters for the participants that completed the novel paradigm. These parameter estimates and their quartiles are depicted in
| Paradigm | Predictor |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|
| Daw | 25th percentile | 2.35 | 0.11 | 0.25 | 0.03 | -0.03 | 0.00 |
|  | Median | 3.35 | 0.34 | 0.65 | 0.21 | 0.05 | 0.27 |
|  | 75th percentile | 3.88 | 0.57 | 1.00 | 0.40 | 0.19 | 0.66 |
| Novel | 25th percentile | 0.51 | 0.01 | 0.07 | -0.29 | 0.04 |  |
|  | Median | 0.72 | 0.67 | 0.62 | -0.06 | 0.48 |  |
|  | 75th percentile | 3.31 | 1.00 | 1.00 | 0.14 | 0.85 |  |
Across participants, we found that the median weighting parameter
Of greatest relevance to our present aims, we found that the weighting parameter was positively related to our measure of the reward rate that controlled for average chance performance for the novel task (
We found a positive correlation in the novel paradigm, but not in the original paradigm, suggesting that we successfully established a tradeoff between model-based control and reward in the two-step task. Dashed lines indicate the 95% confidence interval.
Next, in order to quantify the average gain in points across the entire range of
These results validate the accuracy-demand trade-off of the novel two-step paradigm, and also demonstrate that the original Daw two-step paradigm does not embody such a trade-off.
The distinction between planning and habit lies at the core of behavioral and neuroscientific research, and plays a central role in contemporary dual process models of cognition and decision making. Modern reinforcement learning theories formalize the distinction in terms of model-based and model-free control, bringing new computational precision to the long-recognized trade-off between accuracy and demand in decision making. In principle, the model-based strategy attains more accurate performance through increased effort relative to the computationally inexpensive but more inaccurate model-free strategy.
Yet, building on prior work [
First, we found that the trade-off depends on highly distinguishable reward probabilities. Broadening the range of possible reward probabilities (from 0 to 1) contributed a small, but measurable effect on the relationship between model-based control and reward (Simulation 1,
It is likely that more than these five factors alone moderate the effect of model-based control on accuracy. For example, in the Akam version of the two-step task, rewards alternate between blocks of opposite reward probabilities, so that one option strictly dominates the other until the next alternation is implemented. As discussed, this change to the paradigm resulted in a strong trade-off between control and reward in a selective region of parameter space. It is plausible that there are alternative versions of the two-step task that embody an even stronger trade-off than those discussed here, and we look forward to a comparison of how those relate to the current paradigm.
In addition to the difference in the strength of the accuracy-demand trade-off between paradigms, we also found that the novel two-step task elicited greater average model-based control in our participants than the original Daw two-step task. This result is one of the first pieces of behavioral evidence suggesting an adaptive trade-off between model-based and model-free control. Put simply, participants reliably shifted towards model-based control when this was a more rewarding strategy. This may indicate that participants store “controller values” summarizing the rewards associated with model-based and model-free control. However, there are alternative explanations for this result. For example, it is possible that the presence of the deterministic transition structure or the introduction of negative rewards induced increased model-based control, triggered by a Pavlovian response to these types of task features. In other words, the increase in planning might not reflect a motivational trade-off, but rather a simple decision heuristic that does not integrate computational demand and accuracy. Future investigations, where task features and reward are independently manipulated, will be able to provide more conclusive evidence that people adaptively weigh the costs and benefits of the two strategies against each other.
Although the original Daw two-step task does not embody an accuracy-demand trade-off, choice behavior on this task nonetheless reflects a mixture of model-based and model-free strategies. Furthermore, the degree of model-free control on this task is predicted by individual difference measures such as working memory capacity [
This analysis can help explain the types of experimentally induced shifts in control allocation that have been reported using the two-step task, as well as those that have not. Prior research has demonstrated several factors that increase the control of model-free strategies on decision making. Control shifts to the model-free system with extensive experience [
Our novel paradigm opens up the possibility of studying the neural mechanism underlying the trade-off between model-based and model-free control. The first and most influential neuroimaging study of the two-step task [
One potential limitation of the current paradigm is that it does not afford a simple qualitative characterization of model-based versus model-free control based exclusively on the relationship between reward (vs. punishment) on one trial and a consistent (vs. inconsistent) behavioral policy on the subsequent trial. As depicted in
Indeed, our exploration of this point revealed an apparent mystery and suggests a potentially illuminating explanation. Although our full model fits of participant data indicate a high degree of model-based control, this trend is not at all evident in their raw stay probabilities, conditioned on reward in the previous trial. Not only do we fail to find the high stay probability we would expect for trials on which the associated stage-one choice was previously rewarded (assuming some influence of model-based control), but in fact we find an even lower stay probability than would be expected given a computational model of pure model-free control. How can we explain this divergence between our empirical result and the predictions of our generative model? Recent work on the influence of working memory capacity on reinforcement learning may shed some light on this puzzling finding. Collins and Frank [
Finally, we observed a shift in arbitration between model-based and model-free control when comparing the original and novel versions of the two-step paradigm. Specifically, participants in the novel paradigm were more likely to adopt the model-based strategy compared to those who completed the Daw version of the task. This result is one of the first pieces of evidence that people negotiate an accuracy-demand trade-off between model-based and model-free strategies, and is consistent with a large body of literature that suggests that increased incentives prime more intense controlled processing [
In recent years, the Daw two-step task has become the gold standard for describing the trade-off between accuracy (model-based control) and computational demand (model-free control) in sequential decision making. Our computational simulations of this task reveal that it does not embody such a trade-off. We have developed a novel version of this task that theoretically and empirically obtains a relationship between model-based control and reward (a proxy for the accuracy-demand trade-off). The current investigation reveals a critical role for computational simulation of predicted effects, even if these appear to be intuitive and straightforward. It also introduces a new experimental tool for behavioral and neural investigations of cost-benefit trade-offs in reinforcement learning. Finally, it opens new avenues for investigating the features of specific tasks, or domains of task, that favor model-based over model-free control.
We found that the size of the drift rate affected the strength of the relationship between model-based control and reward in a non-monotonic fashion, with the largest effect found at moderate values of the drift rate (0.1–0.3) and with a broad reward probability range. Importantly, the results of this analysis show that this effect was not only found in the particular parameterization depicted in
Each dot represents the volume under the surface of linear regression coefficients for one task, and is plotted as a function of the number of ‘beneficial’ factors that are included in each task’s design. The gray line represents the average increase in the strength of the relationship between model-based control and reward. These results are qualitatively identical to those reported in
We thank Catherine Hartley for generously sharing her stimuli, and the members of the Moral Psychology Research Laboratory and the Computational Cognitive Neuroscience Laboratory for their advice and assistance.