The authors have declared that no competing interests exist.

Analyzed the data: TA. Wrote the paper: TA RC PD. Conceived and designed the simulations: TA RC PD. Performed the simulations: TA.

The recently developed ‘two-step’ behavioural task promises to differentiate model-based from model-free reinforcement learning, while generating neurophysiologically-friendly decision datasets with parametric variation of decision variables. These desirable features have prompted its widespread adoption. Here, we analyse the interactions between a range of different strategies and the structure of transitions and outcomes in order to examine constraints on what can be learned from behavioural performance. The task involves a trade-off between the need for stochasticity, to allow strategies to be discriminated, and a need for determinism, so that it is worth subjects’ investment of effort to exploit the contingencies optimally. We show through simulation that under certain conditions model-free strategies can masquerade as being model-based. We first show that seemingly innocuous modifications to the task structure can induce correlations between action values at the start of the trial and the subsequent trial events in such a way that analysis based on comparing successive trials can lead to erroneous conclusions. We confirm the power of a suggested correction to the analysis that can alleviate this problem. We then consider model-free reinforcement learning strategies that exploit correlations between where rewards are obtained and which actions have high expected value. These generate behaviour that appears model-based under these, and also more sophisticated, analyses. Exploiting the full potential of the two-step task as a tool for behavioural neuroscience requires an understanding of these issues.

Planning is the use of a predictive model of the consequences of actions to guide decision making. Planning plays a critical role in human behaviour, but isolating its contribution is challenging because it is complemented by control systems which learn values of actions directly from the history of reinforcement, resulting in automatized mappings from states to actions often termed habits. Our study examined a recently developed behavioural task which uses choices in a multi-step decision tree to differentiate planning from value-based control. We compared various strategies using simulations, showing a range that produce behaviour that resembles planning but in fact arises as a fixed mapping from particular sorts of states to action. These results show that when a planning problem is faced repeatedly, sophisticated automatization strategies may be developed which identify that there are in fact a limited number of relevant states of the world each with an appropriate fixed or habitual response. Understanding such strategies is important for the design and interpretation of tasks which aim to isolate the contribution of planning to behaviour. Such strategies are also of independent scientific interest as they may contribute to automatization of behaviour in complex environments.

Humans and other animals are thought to use a mixture of different strategies to learn to choose actions that lead to positive outcomes and prevent negative outcomes [

Dissociating the contributions of model-based and model-free RL to behaviour is challenging because under many circumstances, including most laboratory-based reward-guided decision-making tasks, they are expected to produce similar behaviour. Outcome devaluation (or, more generally, revaluation) has traditionally been used as a gold-standard test to demonstrate the use of a simple forward model predicting the specific outcomes of actions [

Research using outcome devaluation paradigms has established that learnt actions are initially specified by model-based RL, but can transition to being devaluation insensitive given extensive training under appropriate conditions [

Recent approaches to behavioural neuroscience derive substantial explanatory value from parametric variation of decision variables in the context of large decision datasets. It is therefore desirable to develop tasks which achieve these ends, but also exhibit the critical feature of outcome devaluation–namely the wherewithal to discriminate model-based and model-free RL. The two-step task [


Rewards obtained (or not) at the second step modify the subject's estimates of the values of the second-step states, which are themselves the outcomes of the first step actions. On trials with rare transitions, the second-step state whose value is changed by obtaining or not a reward is normally reached from the first-step action that was not chosen. This suggests that a model-based agent which understands the true mapping between first-step actions and second-step states will behave differently from a model-free agent which does not use this knowledge. Model-based and model-free control can indeed be dissociated by evaluating how the events on one trial, specifically the transition (common or rare) and outcome (rewarded or not), affect the probability of repeating the same choice at the first step on subsequent trials.
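As a rough illustration of how this dependence can be tabulated, the following Python sketch computes the probability of repeating the first-step choice on the next trial, split by the transition and outcome of the current trial. The function name `stay_probabilities`, the trial encoding and the six-trial example session are assumptions made for illustration; this is not the code used to produce the paper's figures.

```python
from collections import defaultdict

def stay_probabilities(choices, common, rewarded):
    """Fraction of trials on which the next first-step choice repeats the
    current one, grouped by (transition, outcome) of the current trial.

    choices  -- first-step choice on each trial (any hashable label)
    common   -- True if the transition on that trial was common, False if rare
    rewarded -- True if the trial was rewarded
    """
    stays = defaultdict(int)
    counts = defaultdict(int)
    # The final trial has no following choice, so it is excluded.
    for t in range(len(choices) - 1):
        key = ("common" if common[t] else "rare",
               "rewarded" if rewarded[t] else "unrewarded")
        counts[key] += 1
        stays[key] += choices[t + 1] == choices[t]
    return {key: stays[key] / counts[key] for key in counts}

# Hypothetical six-trial session, for illustration only.
choices = ["A", "A", "B", "B", "A", "A"]
common = [True, False, True, True, False, True]
rewarded = [True, True, False, True, False, True]
probs = stay_probabilities(choices, common, rewarded)
```

Real sessions contain far more trials than this toy example; with only a handful of trials per condition the estimated probabilities are not meaningful.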

Three sorts of analysis of these effects are in common use. The simplest is to look directly at the probability of repeating the first-step choice on just the next trial–this is called the ‘stay’ probability. A model-free strategy in which the value of the chosen first step action is updated directly by the trial outcome produces a pattern in which the subject tends to stay following rewarded and switch following non-rewarded trials, with no effect of transition (

There is strong evidence that human subjects who have been explicitly told in advance about the transition structure and drifting reward probabilities pursue model-based strategies, potentially integrating them with model-free influences [

Here we consider a stripped down version of the task which substantially improves the payoff for model-based strategies relative to chance level and model-free control. We show that seemingly innocuous changes to the task induce correlations between events which can allow model-free RL to masquerade as model-based. We first show that correlation between action values at the start of trials and the subsequent trial events can cause the stay probability analysis, when applied to the behaviour of purely model-free agents, to exhibit the transition-outcome interaction classically interpreted as indicative of model-based RL. We further show that a previously proposed modification to the analysis [

A second, and more pernicious, issue arises from the correlation between where rewards are obtained (second-step state

The original two-step task and a simplified version with enhanced contrast between good and bad options, termed the reduced two step task, are shown in

We initially simulated the behaviour of a model-free and a model-based agent on both versions of the task. The model-free agent (strictly speaking a

As reported previously for the original two-step task [

We evaluated the performance, i.e. the fraction of rewarded trials, achieved by the model-based and ^{−6}), confirming that the modifications made in the reduced task increased the contrast between good and bad options and differentiated the performance achieved by different strategies.

Behaviour simulated from the ^{−10}, t-test for non-zero predictor loading) (


Comparison of the behaviour of all agent types discussed in the paper on the reduced task. Far left panels–Stay probability plots. Centre left panels–Predictor loadings for logistic regression model predicting whether the agent will repeat the same choice as a function of 4 predictors; Stay–a tendency to repeat the same choice irrespective of trial events, Outcome–a tendency to repeat the same choice following a rewarded trial, Transition–a tendency to repeat the same choice following common transitions, Transition x outcome interaction–a tendency to repeat the same choice dependent on the interaction between transition (common/rare) and outcome (rewarded/not). Centre right panels–Predictor loadings for logistic regression analysis with additional ‘correct’ predictor which captures a tendency to repeat correct choices. Right panels–Predictor loadings for lagged logistic regression model. The model uses a set of 4 predictors at each lag, each of which captures how a given combination of transition (common/rare) and outcome (rewarded/not) predicts whether the agent will repeat the choice a given number of trials in the future, e.g. the ‘rewarded, rare’ predictor at lag -2 captures the extent to which receiving a reward following a rare transition predicts that the agent will choose the same action two trials later. Legend for right panels is at bottom of figure. Error bars in all plots show SEM across sessions. Agent types: (

Why are action values at the start of the trial correlated with subsequent trial events, specifically the transition-outcome interaction? There are two steps in the argument. First, the difference in action values between chosen and not-chosen action is on average larger for trials where the agent chooses the correct action, i.e. that which commonly leads to the state with high reward probability, than for trials where the agent chooses the incorrect option. When the difference in action values is small, the agent has little evidence that one option is better than the other, and is more likely to choose the incorrect action. Additionally, due to the stochastic softmax decision rule the agent sometimes chooses the action with lower subjective value, and such ‘exploratory’ choices are more likely to be incorrect. Second, choosing the correct, rather than incorrect, action changes the probabilities of observing different combinations of trial events. Rewarded common transitions and unrewarded rare transitions are more likely to occur following a correct action than they are to occur following an incorrect action. Conversely, rewarded rare transitions and unrewarded common transitions are more likely to occur following an incorrect action.
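The second step of this argument can be checked with a short calculation. The sketch below assumes the reduced-task values of 0.8/0.2 for common/rare transitions and 0.8/0.2 for the reward probabilities in the good and bad second-step states; the function names and the dictionary layout are illustrative assumptions.

```python
def event_probabilities(correct, p_common=0.8, p_good=0.8, p_bad=0.2):
    """Probability of each (transition, outcome) combination given whether
    the first-step choice was correct, i.e. commonly leads to the
    second-step state with the higher reward probability."""
    # After a correct choice the common transition reaches the good state;
    # after an incorrect choice it reaches the bad state.
    p_rew_common = p_good if correct else p_bad
    p_rew_rare = p_bad if correct else p_good
    p_rare = 1 - p_common
    return {
        ("common", "rewarded"): p_common * p_rew_common,
        ("common", "unrewarded"): p_common * (1 - p_rew_common),
        ("rare", "rewarded"): p_rare * p_rew_rare,
        ("rare", "unrewarded"): p_rare * (1 - p_rew_rare),
    }

def p_congruent(p):
    """Total probability of rewarded common plus unrewarded rare trials."""
    return p[("common", "rewarded")] + p[("rare", "unrewarded")]

after_correct = event_probabilities(correct=True)
after_incorrect = event_probabilities(correct=False)
```

Under these assumed probabilities, rewarded common and unrewarded rare trials together make up 0.8 of the trials that follow a correct choice, but only 0.2 of the trials that follow an incorrect choice.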

To summarise: the difference in action values going into the trial correlates with the probability of choosing the correct option. Whether the agent chooses the correct option determines the probability of observing each combination of subsequent trial events. Therefore when trials are divided into groups by outcome and transition, the action values at the start of the trial show a transition-outcome interaction (

This effect is not restricted to block based reward probabilities; it can also be observed when reward probabilities change as random walks (

Data simulated from the

It is possible to modify the logistic regression analysis of stay probabilities to prevent differences in action values at the start of the trial from appearing as a spurious loading on the transition-outcome interaction predictor. This can be done by including an additional ‘correct’ predictor which captures the tendency of the agent to repeat correct choices, as originally suggested in [^{−12}) (

Including the correct predictor reduced, but failed to completely remove, loading on the transition-outcome interaction predictor for the ^{−3}, t-test for non-zero predictor loading) (

An alternative way of differentiating model-based and model-free strategies is a lagged logistic regression analysis which examines the effect on choice probability of trial events at different lags relative to the current trial (Miller et al. Soc. Neurosci. Abstracts 2013, 855.13). Fig

Various other factors have been suggested as influencing strategies, including eligibility traces for MF algorithms, the possibilities of continual learning of the transition probabilities, and also outcome- and transition-independent perseveration. We also considered the effects of all of these on the statistics of choice.

Although a

It is typically assumed that subjects on the two-step task understand that the transition probabilities linking the first step actions to second-step states are fixed, and hence do not update their estimates of these based on the transitions they experience trial to trial. As this assumption may not be valid for subjects who do not have prior information about the task structure, we evaluated the behaviour of a model-based agent which learned the transition matrix online by updating its estimate of the transition probabilities for the chosen action on each trial based on the experienced transition (

Human subjects typically show a perseveration bias on the two-step task [

We have so far considered only agents whose state representation corresponds to that used by the experimenter to define the task. However, identifying those states that are relevant for behaviour is a substantial component of the real control problem faced by organisms and there is no guarantee that when faced with a decision task, subjects will adopt the same state representation conceived by the experimenter. In the two-step task there is an underlying latent state that is relevant to behaviour–whether the reward probabilities are higher in state

We first consider a simple way of exploiting the correlations. The two-step task has a circular structure in which subjects cycle repeatedly through the decision state, second-step states and trial outcomes. This repeating structure provides opportunities for subjects to learn predictive relationships between events on one trial and the actions that are likely to lead to reward on the subsequent trial. One such predictive relationship is that the location where reward is obtained on one trial predicts which choice on the next trial is likely to lead to reward. That is, if a reward is obtained in state

It is plausible that over-trained animals could learn to use the location of reward as a discriminative stimulus to guide choice on the next trial, as animals straightforwardly learn to use discriminative sensory stimuli of various sorts as cues for the best action to take next [

For all the agents in

Performance achieved by different agent types in the original (^{−5}.

As noted, the reward-as-cue strategy works because there is in fact a latent, unobservable state of the world that is important to the decision problem–whether the reward probability is higher in state

The behaviour of the latent-state agent looked qualitatively very similar to that of the model-based agent. The one trial back stay probability analyses showed a transition-outcome interaction (

Data likelihood for maximum likelihood fits of different agent types (indicated by x-axis labels; MB–Model based, RC–Reward-as-cue, LS–Latent-state) to data simulated from each agent type (indicated by labels above axes) on the reduced (^{−4} except for that between the fit of the reward-as-cue and latent-state agents to data simulated from the reward-as-cue agent which is significant at P = 0.027.

With parameters optimized to maximize reward, the latent-state agent achieved performance that was not significantly different from the model-based agent (

Finally, we evaluated the behaviour of agents using the reward-as-cue and latent-state strategies on the original version of the task (

We have provided a detailed analysis of the performance of a number of different RL strategies on variants of the two-step task. Since in the original task, complex and taxing strategies only garner modestly more reward than simple ones, it might seem attractive to alter the task to enhance the discrimination. We showed some dangers inherent in this idea, in that induced correlations can make discrimination harder. We also generalized this analysis to more complicated model-free strategies.

In particular, we identified two ways in which behaviour on the two-step task could, under certain conditions, be incorrectly identified as arising from prospective model-based evaluation of actions. The first issue is with the stay probability analysis commonly used as a metric of subjects’ strategies. We showed that rather than reflecting only the action value update occurring on a given trial, which is distinct for model-based and model-free action evaluation, stay probabilities can also be affected by action values at the start of the trial. This can cause the behaviour of a model-free agent to exhibit a stay probability transition-outcome interaction, which is classically interpreted as a signature of model-based behaviour. The second issue is the existence of alternative strategies which use different state representations from the basic states that define the task structure and produce behaviour which is similar to that of a model-based agent though not dependent on prospective evaluation of the outcome of actions.

The possibility that purely model-free agents can exhibit a transition-outcome interaction effect on stay probability has been discussed in two prior studies using the original two-step task [

The effect of trial start action values can be corrected for in a logistic regression analysis of stay probabilities by including a predictor which captures the tendency to choose the action which was correct on the previous trial, i.e. to repeat correct choices. Such a modification to the regression analysis was proposed in [

The second issue we have identified with the two-step task is that, due to its repeating structure, subjects could, in principle, learn to exploit correlations between where rewards are obtained and the expected value of first step actions, to produce behavioural strategies that look similar to model-based behaviour but do not use prospective evaluation of actions. One simple strategy which we termed ‘reward-as-cue’ learns a fixed mapping between events on one trial and choice on the next trial (e.g. reward in state

On the large simulated datasets used in this study, behaviour simulated from latent-state and model-based agents could be differentiated by model-comparison, and this probably represents the best approach to doing so in experimental data. Data simulated from the latent-state agent was fit with higher likelihood (Fig ^{−5}). However, several factors will make this discrimination more difficult when working with experimental data. Firstly, the size of experimental datasets is typically substantially smaller, reducing the resolution of model comparison approaches. Secondly, the quantitative details of fitted models are unlikely to exactly match subjects’ strategies. Thirdly, subjects’ behaviour may be generated by a mixture of interacting control systems using different strategies. Whether latent-state and model-based strategies can be discriminated using model comparison in a given behavioural dataset is ultimately an empirical question.

Is it plausible that subjects could learn latent-state type strategies in the two-step task? Many paradigms for humans and animals show evidence of aspects of this. It is apparent in probabilistic reversal learning tasks, in which humans [

Both the reward-as-cue and latent-state strategies (termed collectively ‘extended-state’ strategies) work by exploiting the regularity in the task structure that the location where rewards are obtained correlates with which first step action has higher reward probability. Evidence for this regularity accrues slowly as it is only across multiple reversals in the reward probabilities that the correlation becomes apparent. It therefore seems probable that if subjects do learn to exploit this regularity, the strategy would only arise after extended experience with the task. In the original version of the task used typically in the human literature, subjects do a total of ~200 trials. The limited number of trials performed, and the fact that human subjects have been trained to understand the true task structure—presumably priming the use of a model-based strategy—both argue against the possibility that the apparently model-based behaviour reported in the bulk of the human literature in fact arises from extended-state strategies. Indeed, it is only after substantial additional training [

Latent state strategies go beyond classical model-free RL and are interesting in their own right. Indeed, although they do not use a model which predicts future state given chosen action, which following [

Various options exist to minimise the probability that apparently model-based behaviour is in fact due to such strategies. One would be to avoid overtraining subjects, limiting the total number of trials they perform. However, this precludes generating very large behavioural datasets to better quantify the effect of manipulations or the relationship between behaviour and neural activity. A second possibility is to accept that it may be difficult to disambiguate extended-state from classical model-based strategies purely from behaviour, and use neural data to try and disambiguate the strategy used by subjects. A final potential option is to modify the two-step task to introduce reversals into the transition matrix which maps the first step choice to second-step state. In this task variant, not only does the reward probability in each second-step state change over time, but the action which must be chosen to reach a given second-step state also changes. Model-based control that performs incremental learning of the current transition probabilities (one of the variants discussed above), can adjust in a straightforward manner to this change; one could even imagine coupling simple latent state inference for just the transition structure (as in conventional probabilistic reversal learning) to model-based RL. However, the task modification substantially increases the complexity of pure latent state strategies. Reversals in the transition matrix break the fixed predictive relationship in the original task between where reward is obtained and which action at the first step is likely to lead to reward. To solve this version through a fixed mapping from an inferred latent state to action requires latent states that are non-linear combinations of where rewards have been obtained and which actions have led to which states.

The possibility we have identified here for model-free strategies to masquerade as model-based mirrors proposals that apparently model-free behaviour on the two-step task may in fact be due to model-based selection applied to action sequences [

The two-step task latent state strategy provides an example of how agents may turn a planning problem into a set of automatized state-response mappings if there is a limited set of relevant states of the world, each with their own appropriate response. Even if the planning problem is large, with a great diversity of possible solutions, e.g. navigating from home to work, with experience the decision may be automatized to a mapping from a small number of relevant states of the world, e.g. is it rush hour, to a set of options which are known to work best in each condition. Such automatization is more sophisticated than stimulus-response habits as typically envisioned; the states of the world that evoke the response may be high level abstractions rather than directly observable stimuli, and the responses may be action sequences, or options in the hierarchical RL formalism [

All simulations and analysis were conducted in Python. Full code used to produce the paper figures is included in supplementary material (

All tasks used in the paper shared the common structure that on each trial an initial choice between two actions, termed action A and action B, led probabilistically to one of two states, termed state

The following variants of the two-step task were used in the simulations:

Version of the task described in [

The probability of common/rare transitions was 0.8/0.2. There was one action available in each second-step state. Except where stated otherwise, reward probabilities alternated every 50 trials between blocks with reward probability 0.8/0.2 in states

In describing the action value updates used by the different agents we use the following variables:

Q(s_{1}, a_{1}): The value of the first-step action chosen on the trial.

Q(s_{2}, a_{2}): The value of the second-step action chosen on the trial.

All agents used a softmax decision rule with an inverse temperature parameter.
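For readers unfamiliar with the softmax rule, a minimal Python sketch of its standard form is given here. The function name, the symbol `beta` for the inverse temperature and the example action values are assumptions for illustration, not a reproduction of the code used for the paper's simulations.

```python
import math

def softmax_choice_probs(values, beta):
    """Choice probabilities under a softmax rule with inverse temperature beta."""
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical action values for the two first-step actions.
probs = softmax_choice_probs([0.6, 0.4], beta=5.0)
```

With two actions the rule reduces to a logistic function of the difference in action values scaled by the inverse temperature. Small value differences therefore give choice probabilities close to 0.5, which is why ‘exploratory’ choices of the lower-valued action occur.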

The update rules used by the agents were as follows:

The action value update rules used by the

The updates to Q(s_{1}, a_{1}) and Q(s_{2}, a_{2}) were applied sequentially at the end of each trial.

At the start of each trial the model-based agent computed action values for the first step actions as:

Where Q(s_{1}, a_{i}) is the value of first-step action a_{i}, V(s_{j}) is the value of the second-step state s_{j}, and P(s_{j}|a_{i}) is the true probability of reaching second-step state s_{j} after choosing action a_{i}. In the original task the second-step state value V(s_{j}) was the maximum of the two action values available in that state; V(s_{j}) = max_{l} (Q(s_{j}, a_{l})). In the reduced task the second-step state value V(s_{j}) was the value of the one action available in that state.
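The model-based computation described here, in which each first-step action is valued by the expected value of the second-step states it leads to, can be sketched in Python as follows. The transition-weighted sum is the standard form of this computation and is assumed here. The function name, the ordering of actions and states, and the example state values are also illustrative assumptions.

```python
def model_based_values(trans_probs, state_values):
    """First-step action values as the transition-weighted sum of
    second-step state values.

    trans_probs  -- trans_probs[i][j] is the probability of reaching
                    second-step state j after first-step action i
    state_values -- state_values[j] is the value of second-step state j
    """
    return [sum(p * v for p, v in zip(row, state_values))
            for row in trans_probs]

# Transition probabilities of 0.8 (common) and 0.2 (rare), as in the task.
# Which action commonly leads to which state is assumed for illustration.
trans_probs = [[0.8, 0.2],   # action A
               [0.2, 0.8]]   # action B
state_values = [0.7, 0.3]    # hypothetical current state-value estimates
q_first_step = model_based_values(trans_probs, state_values)
```

Because the first-step values are recomputed from the second-step state values on every trial, a change in one state's value immediately affects the value of the action that commonly leads to it, whichever action was actually chosen.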

The action value update rule used by the model-based agent at the end of each trial was:

For the model-based agent that learned the transition probabilities, the transition matrix P(s_{j}|a_{i}) linking the first-step actions to the second-step states was learnt online from experienced transitions. The update rule for the agent's estimate of transition probabilities was:

Where _{1},

In

The reward-as-cue agent treated the choice between actions A and B as occurring in one of four different states on each trial, corresponding to the 4 combinations of the outcome (1 or 0) and second-step state (

Where Q(s_{1}, a_{1}) is the value of the action chosen at the first step in the relevant state. In the original task where there was a choice at the second-step, action values for the second step actions were updated as:

The reward-as-cue agent on the original task used separate softmax inverse temperatures and learning rates at the first and second steps.

The latent-state agent believed there were two states of the world, one of which had reward probabilities of (P_{good}, P_{bad}) in the two second-step states, and the other of which had reward probabilities of (P_{bad}, P_{good}) in the same two states. On the reduced task version P_{good} = 0.8 and P_{bad} = 0.2. On the original task version P_{good} = 0.625 and P_{bad} = 0.375.

At the start of each trial the agent performed a Bayesian update of the probability that the world was in each of these states based on the previous trial events. The agent then updated the probability that the world was in each state to account for the possibility that the world reversed in state between the previous and current trial, which was assumed to occur with probability ω. The agent used a probabilistic mapping from its estimate of the state of the world to choice, choosing with probability (1 – ε) the action with higher reward probability in the most probable state, and with probability ε the action with higher reward probability in the less probable state. On the original task where there was a choice at the second-step, action values for the second step actions were updated as:
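The belief update and choice rule described at the start of this paragraph (not the second-step value update) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the indexing of states and actions, and the example values of the reversal probability `omega` and lapse probability `epsilon` are hypothetical, and the default reward probabilities are those of the reduced task. The likelihood of the transition itself is omitted because it is assumed not to depend on the latent state.

```python
def latent_state_update(p_first_good, second_step, rewarded,
                        p_good=0.8, p_bad=0.2, omega=0.05):
    """One trial's belief update for a two-state latent-state agent.

    p_first_good -- prior probability that second-step state 0 is the one
                    with the higher reward probability
    second_step  -- 0 or 1, the second-step state reached on the trial
    rewarded     -- True if the trial was rewarded
    omega        -- assumed per-trial probability of a reversal
    """
    # Reward probability in the visited state under each hypothesis.
    p_rew_h0 = p_good if second_step == 0 else p_bad
    p_rew_h1 = p_bad if second_step == 0 else p_good
    # Likelihood of the observed outcome under each hypothesis.
    like_h0 = p_rew_h0 if rewarded else 1 - p_rew_h0
    like_h1 = p_rew_h1 if rewarded else 1 - p_rew_h1
    # Bayes' rule.
    posterior = (like_h0 * p_first_good /
                 (like_h0 * p_first_good + like_h1 * (1 - p_first_good)))
    # Allow for a reversal between this trial and the next.
    return posterior * (1 - omega) + (1 - posterior) * omega

def choice_probs(p_first_good, epsilon=0.1):
    """Probabilities of choosing action 0 and action 1, where action i is
    assumed to lead commonly to second-step state i. Ties in the belief
    are broken in favour of action 0 (an arbitrary choice)."""
    p_action_0 = 1 - epsilon if p_first_good >= 0.5 else epsilon
    return p_action_0, 1 - p_action_0

# Starting from an uninformative belief, a reward is obtained in state 0.
belief = latent_state_update(0.5, second_step=0, rewarded=True)
p_choose = choice_probs(belief)
```

Under these assumed parameter values a single reward in state 0 moves the belief from 0.5 to 0.77. The agent then chooses the action that commonly leads to state 0 with probability 0.9.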

The parameter values of the model-based agent on both tasks were set to:

To ensure that average behaviour for the different agents was comparable, the parameters of the other agents were set by maximum likelihood fitting to data simulated from the model-based agent. This resulted in the following agent parameters:

Original task: | |

Reward-as-cue agent: | _{first step}=0.00184, _{first step} = 4.82, _{second step} = 0.499, _{first step} = 4.98 |

Latent-state agent: | |

Reduced task: | |

Reward-as-cue agent: | |

Latent-state agent: |

To evaluate the performance of the different agents in

In all logistic regression analyses, the dependent variable was the subject’s choice, coded as stay vs switch, such that positive values of the predictor promote staying with the previous choice. Predictors used in the analysis took the following values as a function of trial events:

Loading on the transition-outcome interaction predictor as a function of agent parameter values for behaviour simulated from different agent types on the reduced version of the task. Agent types: (


Comparison of the behaviour of all agent types discussed in the paper on the original task. Far left panels–Stay probability plots. Centre left panels–Predictor loadings for logistic regression model predicting whether the agent will repeat the same choice as a function of predictors; stay, outcome, transition, transition-outcome interaction. Centre right panels–Predictor loadings for logistic regression analysis with additional ‘correct’ predictor. Right panels–Predictor loadings for lagged logistic regression model. Error bars in all plots show SEM across sessions. Agent types: (


Comparison of behaviour simulated on reduced task by agents with intermediate values of the lambda parameter that controls the relative contribution of the


Comparison of behaviour simulated on reduced task by a model-based agent that learned the transition probabilities online from the experienced transitions. (


Comparison of behaviour simulated on reduced task by (


BIC scores for maximum likelihood fits of different agent types (indicated by x-axis labels; MB–Model based, RC–Reward-as-cue, LS–Latent-state) to data simulated from each agent type (indicated by labels above axes; (


The authors thank Evan Russek, Kevin Miller, Bruno Miranda, Eric DeWitt, Nathaniel Daw and Anthony Dickinson for useful discussions.