
Fig 1.

Design of the Daw two-step task.

(A) State transition structure of the original two-step paradigm. Each first-stage choice has a high probability of transitioning to one of two states and a low probability of transitioning to the other. Each second-stage choice is associated with a probability of obtaining a binary reward. (B) To encourage learning, the second-stage reward probabilities change slowly over the course of the experiment.
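
For concreteness, here is a minimal sketch of the drifting reward probabilities in panel (B), assuming a Gaussian random walk with reflecting boundaries. The drift SD of 0.025 and the bounds of 0.25 and 0.75 match values commonly used in the Daw task, but treat the exact constants as assumptions:

```python
import numpy as np

def drifting_reward_probs(n_trials, n_options=4, sd=0.025, lo=0.25, hi=0.75, seed=0):
    """Gaussian random walk with reflecting boundaries for each
    second-stage option's reward probability (panel B)."""
    rng = np.random.default_rng(seed)
    probs = np.empty((n_trials, n_options))
    p = rng.uniform(lo, hi, size=n_options)  # random starting points
    for t in range(n_trials):
        probs[t] = p
        p = p + rng.normal(0.0, sd, size=n_options)
        p = np.where(p > hi, 2 * hi - p, p)  # reflect off the upper bound
        p = np.where(p < lo, 2 * lo - p, p)  # reflect off the lower bound
    return probs
```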


Fig 2.

Probability of repeating the first-stage choice for three agents.

(A) For model-free agents, the probability of repeating the previous choice depends only on whether a reward was obtained, not on the transition structure. (B) Model-based behavior is reflected in an interaction between the previous transition and outcome: the agent selects the first-stage action most likely to return it to the state where reward was obtained. (C) Behavioral performance on this task reflects features of both model-based and model-free decision making: a main effect of previous reward and an interaction between previous reward and previous transition.
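
The stay probabilities plotted in each panel can be computed directly from a trial sequence. A minimal sketch (the data layout and variable names are assumptions, not the paper's analysis code):

```python
import numpy as np
import pandas as pd

def stay_probabilities(choices, transitions, rewards):
    """P(repeat first-stage choice) split by previous outcome and previous
    transition type. choices: first-stage choice per trial; transitions:
    'common' or 'rare' per trial; rewards: 0/1 outcome per trial."""
    choices = np.asarray(choices)
    df = pd.DataFrame({
        "stay": choices[1:] == choices[:-1],
        "prev_reward": np.asarray(rewards)[:-1],
        "prev_transition": np.asarray(transitions)[:-1],
    })
    return df.groupby(["prev_transition", "prev_reward"])["stay"].mean()
```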


Fig 3.

Results of simulation of accuracy-demand trade-off in the Daw two-step task.

(A) Surface plot of the standardized linear effect of the weighting parameter on reward rate in the original version of the two-step task. Each point reflects the average of 1000 simulations of a dual-system reinforcement-learning model of behavior on this task with different sets of drifting reward probabilities, as a function of the learning rate and inverse temperature of the agents. The red circle shows the median fit. Importantly, across the entire range of parameters, the task does not embody a trade-off between habit and reward. (B) An example of the average relationship between the weighting parameter and reward rate with inverse temperature = 5.0 and α = 0.5 (mirroring the median fits reported by Daw and colleagues [8]), across 1000 simulations. (C) The probabilities of repeating the first-stage action as a function of the previous reward and transition for a purely model-free agent and a purely model-based agent.
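
A stripped-down sketch of the dual-system model behind these simulations, assuming the standard hybrid architecture: model-free SARSA(λ) values are mixed with model-based planning values by the weighting parameter w and passed through a softmax with inverse temperature β. The published model includes additional components (e.g., perseveration, stage-specific parameters) that are omitted here:

```python
import numpy as np

def softmax_choice(q, beta, rng):
    p = np.exp(beta * (q - q.max()))
    p /= p.sum()
    return rng.choice(len(q), p=p)

def run_hybrid_agent(reward_probs, w, alpha, beta, lam=1.0, p_common=0.7, seed=0):
    """Average reward rate of a simplified hybrid agent on the Daw task.
    reward_probs: array of shape (n_trials, 2 second-stage states, 2 actions)."""
    rng = np.random.default_rng(seed)
    q_mf1 = np.zeros(2)    # model-free first-stage action values
    q2 = np.zeros((2, 2))  # second-stage action values (shared by both systems)
    total = 0.0
    for t in range(len(reward_probs)):
        # model-based values: expected value of the best second-stage action
        q_mb1 = np.array([
            p_common * q2[0].max() + (1 - p_common) * q2[1].max(),
            p_common * q2[1].max() + (1 - p_common) * q2[0].max(),
        ])
        q1 = w * q_mb1 + (1 - w) * q_mf1  # weighted mixture of the two systems
        a1 = softmax_choice(q1, beta, rng)
        s2 = a1 if rng.random() < p_common else 1 - a1  # stochastic transition
        a2 = softmax_choice(q2[s2], beta, rng)
        r = float(rng.random() < reward_probs[t, s2, a2])
        # SARSA(lambda) updates for the model-free system
        q_mf1[a1] += alpha * (q2[s2, a2] - q_mf1[a1])
        delta2 = r - q2[s2, a2]
        q2[s2, a2] += alpha * delta2
        q_mf1[a1] += alpha * lam * delta2
        total += r
    return total / len(reward_probs)
```

Sweeping w across [0, 1] for each (α, β) cell, regressing reward rate on w, and averaging over many simulated drifts reproduces the kind of surface shown in panel (A).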


Fig 4.

Surface plot of the linear relationship between the weighting parameter and reward rate in the Dezfouli and Balleine version of the two-step task.

The red circle shows the median fit. As with the Daw variant, this task does not capture a trade-off between accuracy and demand in any of the tested parameterizations.


Fig 5.

Results of simulation of the Doll two-step task.

(A) Surface plot of the linear relationship between the weighting parameter and reward rate in the Doll version of the two-step task. The red circle shows the median fit. As with the Daw variant, this task does not capture a trade-off between accuracy and demand across the tested parameterizations, except in a slightly elevated region of parameter space with high inverse temperature and low learning rate. (B) Behavioral predictions in this task. The model-free system learns separate values for each action in each start state, so outcomes only affect choices made from the same start state. Our simulation of model-free behavior revealed an elevated likelihood of staying after a reward obtained from the other start state: such a reward signals a currently high-probability option that the model-free system has learned about through transitions from both start states. The model-based system (right) treats the start states as equivalent, since both afford the same transitions, so choices are unaffected by whether the previous start state was the same or different.


Fig 6.

The influence of the range of reward probabilities.

(A) Distribution of differences in reward probabilities between the actions on each trial. (B) Increasing the range of probabilities increases the average linear effect of model-based control on reward in a region of parameter space with high inverse temperatures and relatively low learning rates. The average parameter fits in the original report do not lie within this region of increased sensitivity to the accuracy-demand trade-off.


Fig 7.

The influence of the drift rate.

(A) The effect of drift rate on the relationship between model-based control and reward, for two-step tasks with narrow and broad reward probability ranges. (B) Jointly increasing the range of probabilities and the drift rate substantially increases the average linear effect of model-based control on reward when the inverse temperature is high.


Fig 8.

The influence of a deterministic task structure.

(A) Because of the deterministic transitions, model-based choices in the Doll two-step task always lead to the desired second-stage state. Combined with increased distinguishability and an increased drift rate in the reward probabilities, this task yields a substantial increase in the relationship between planning and reward. (B) When this task structure is adapted to include stochastic transitions, the relationship between planning and reward is significantly reduced, indicating that the rare transitions contribute substantially to diminishing the accuracy-demand trade-off in the original paradigm.


Fig 9.

The influence of reducing the number of second-stage actions.

Because of the deterministic transitions, model-based choices in the Doll two-step task always lead to the desired second-stage state. Combined with increased distinguishability and an increased drift rate in the reward probabilities, this task yields a substantial increase in the relationship between planning and reward.


Fig 10.

The influence of the type of reward distribution (points vs probabilities) on choice accuracy.

(A) We ran simulations of RL agents on two different two-armed bandit tasks. In one, the reward distributions specify the reward probability associated with each action. The other task does not include binomial noise: instead, each action pays off a reward directly proportional to its value in the reward distribution. (B) Agents show greater accuracy in choosing the highest-value action on the task where the two-armed bandit pays off points instead of affording a probability of winning a reward, especially when both the inverse temperature and the learning rate are high. (C) The Q-value of each action shows a stronger correlation with its objective reward value in the task where the two-armed bandit paid off points instead of affording a probability of winning a reward.
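
A sketch of this bandit comparison, assuming drifting latent values and a softmax Q-learner; the drift parameters and value range here are illustrative assumptions:

```python
import numpy as np

def bandit_accuracy(alpha, beta, n_trials=200, pays_points=False, seed=0):
    """Fraction of trials on which a softmax Q-learner picks the
    higher-valued arm of a drifting two-armed bandit.
    pays_points=False: 0/1 reward drawn with probability equal to the
    arm's latent value (binomial noise). pays_points=True: the arm pays
    its latent value directly as points (no binomial noise)."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(0.25, 0.75, size=2)  # latent values / reward probabilities
    q = np.zeros(2)
    correct = 0
    for _ in range(n_trials):
        p = np.exp(beta * (q - q.max()))
        p /= p.sum()
        a = rng.choice(2, p=p)
        correct += int(a == v.argmax())
        r = v[a] if pays_points else float(rng.random() < v[a])
        q[a] += alpha * (r - q[a])
        v = np.clip(v + rng.normal(0.0, 0.025, size=2), 0.25, 0.75)  # drift
    return correct / n_trials
```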


Fig 11.

The influence of removing binomial noise from the reward distributions at the second stage.

(A) Surface plot of the relationship between model-based control and reward in the novel two-step task with reward payoffs at the second stage. The inclusion of this fifth factor substantially increased the accuracy-demand trade-off in the two-step paradigm. (B) An example of the average relationship between the weighting parameter and reward rate with inverse temperature = 10 and α = 0.4.


Fig 12.

Surface plot of the relationship between model-based control and reward in the Akam and colleagues [9] version of the two-step task with alternating blocks of reward probabilities at the second-stage states.


Fig 13.

Comparison of trade-off between model-based control and reward across different paradigms.

We calculated the volume under the surface of coefficients of the linear relationship between the weighting parameter and the reward rate for each of the paradigms described above. Across these simulations, we progressively included elements that strengthened the relationship, as summarized in this figure.
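
One plausible implementation of this volume-under-surface summary, assuming the coefficients lie on a regular (learning rate, inverse temperature) grid and using trapezoidal integration; the paper's exact numerical method is an assumption here:

```python
import numpy as np

def volume_under_surface(coefs, alphas, betas):
    """Trapezoidal integration of the grid of regression coefficients over
    the (learning rate, inverse temperature) parameter space, producing one
    summary number per task. coefs has shape (len(alphas), len(betas))."""
    return np.trapz(np.trapz(coefs, x=betas, axis=1), x=alphas)
```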


Fig 14.

Volume under the surface for all 32 tasks generated by the 5 binary factors discussed in this paper.

Each dot represents the volume under the surface of linear regression coefficients for one task, and is plotted as a function of the number of ‘beneficial’ factors that are included in each task’s design. The gray line represents the average increase in the strength of the relationship between model-based control and reward.


Fig 15.

Design of the novel two-step task.

(A) State transition structure of the paradigm. At the first stage, participants are presented with one of two pairs of spaceships and choose between them. Each choice deterministically leads to a second-stage state associated with a reward payoff that changes slowly over the course of the experiment according to a Gaussian random walk. Note that the choices in the two different first-stage states are essentially equivalent. (B) Predicted behavior from the generative reinforcement-learning model of this task (using median parameter estimates, and w = 0.5 for the agent with a mixture of strategies). Note that in this task the model does not produce qualitatively different behavior across the systems, unlike the predictions shown in Fig 5. Instead, the differences in behavior are subtler, and differences in strategy arbitration are therefore better captured with model-fitting techniques.
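
A sketch of a single trial of this task, assuming deterministic transitions and Gaussian-walk payoffs; the drift SD and payoff range below are illustrative assumptions, not the task's published values:

```python
import numpy as np

def novel_task_trial(choice, payoffs, rng, drift_sd=2.0, lo=0.0, hi=9.0):
    """One trial of the novel task in panel (A). The transition is
    deterministic and identical for both first-stage states (spaceship
    pairs), so the start state can be ignored here. payoffs holds the
    current payoff of each second-stage state and drifts by a Gaussian
    random walk between trials."""
    s2 = choice                 # deterministic transition to second stage
    reward = payoffs[s2]
    payoffs = np.clip(payoffs + rng.normal(0.0, drift_sd, size=payoffs.shape),
                      lo, hi)   # bounded walk (clipping used for simplicity)
    return s2, reward, payoffs
```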


Fig 16.

Behavioral performance on the two-step tasks.

(A) Behavioral performance on the Daw task showed both a main effect of previous outcome and an interaction between previous outcome and transition type, suggesting that behavior reflected both model-based and model-free strategies. (B) Behavioral performance on the novel paradigm showed a significant difference in stay behavior between the same and different start-state conditions after a reward, suggesting that behavior was not fully model-based. Error bars indicate within-subject SEM.
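
Within-subject SEMs of this kind are commonly computed with the Cousineau-Morey method; whether the paper used exactly this recipe is an assumption. A minimal sketch:

```python
import numpy as np

def within_subject_sem(data):
    """Within-subject SEM (Cousineau-Morey): remove each subject's mean,
    add back the grand mean, then take the SEM per condition with a
    bias correction. data: (n_subjects, n_conditions) condition means."""
    n_subj, n_cond = data.shape
    centered = data - data.mean(axis=1, keepdims=True) + data.mean()
    sem = centered.std(axis=0, ddof=1) / np.sqrt(n_subj)
    return sem * np.sqrt(n_cond / (n_cond - 1))  # Morey (2008) correction
```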


Table 1.

Best-fitting parameter estimates shown as median plus quartiles across participants.


Fig 17.

Relationship between the estimated weighting parameters and adjusted reward rate in the Daw and novel two-step paradigms.

We found a positive correlation in the novel paradigm, but not in the original paradigm, suggesting that we successfully established a trade-off between model-based control and reward in the two-step task. Dashed lines indicate the 95% confidence interval.
