Combined model-free and model-sensitive reinforcement learning in non-human primates

doi:10.1371/journal.pcbi.1007944

Fig 1.

Two-stage decision task.

(A) Timeline of events. Eye fixation was required while a red fixation cue was shown; otherwise, subjects could saccade freely and indicate their decision (arrow as an example) by moving a manual joystick in the direction of the chosen stimulus. Once the second-stage choice had been made, the nature of the outcome was revealed by a secondary reinforcer cue (here, the pause symbol represents high reward). Once the latter cue was off the screen, there was a fixed 500 ms delay and the possibility of a further delay (for both medium and low rewards) before juice was provided (for both high and medium rewards). (B) The state-transition structure (kept fixed throughout the experiment). Each second-stage stimulus had an independent reward structure: the outcome level (defined by the magnitude of the reward and the delay to its delivery) remained the same for a minimum number of trials (a uniformly distributed pseudorandom integer between 5 and 9) and then either stayed at the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels.
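The outcome-level schedule described in (B) can be illustrated with a minimal sketch. It is not the authors' task code. It assumes that the stay-or-change draw happens once at the end of each minimum-length block, that every trial counts towards the block length, and that a change picks either of the other two levels with equal probability. The function name, the level labels and the starting level are hypothetical.

```python
import random

# Hypothetical labels for the three outcome levels named in the caption.
LEVELS = ("high", "medium", "low")

def simulate_outcome_levels(n_trials, rng, start_level="high"):
    """Return one outcome level per trial for a single second-stage stimulus.

    Assumption: after each block of 5 to 9 trials, the level stays with
    probability 1/3 and otherwise switches to one of the other two levels
    with equal probability.
    """
    levels = []
    current = start_level
    while len(levels) < n_trials:
        block_len = rng.randint(5, 9)  # uniform integer, both ends inclusive
        levels.extend([current] * block_len)
        if rng.random() >= 1 / 3:  # change with probability 2/3
            current = rng.choice([lv for lv in LEVELS if lv != current])
    return levels[:n_trials]

schedule = simulate_outcome_levels(200, random.Random(0))
```

Because a "stay" draw merges two consecutive blocks, every run of identical levels in the simulated schedule lasts at least 5 trials, apart from a final run cut off by the end of the session.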

Fig 2.

The impact of both reward and transition information on first-stage choice behaviour.

(A) Likelihood of first-stage choice repetition, averaged across sessions, as a function of reward and transition on the previous trial. Error bars depict SEM. (B-C) Logistic regression results on first-stage choice with the contributions of the reward main effect (B) and the reward × transition interaction term (C) from the five previous trials. Dots represent fixed-effects coefficients for each session (red when p < 0.05, grey otherwise). (D-F) Similar results obtained from simulations (100 runs per session, respecting the exact reward structure subjects experienced) using the best-fit Hybrid+ model. Bar and error bar values correspond, respectively, to mixed-effects coefficients and their SE. Dashed lines illustrate the exponential best fit on the mean fixed-effects coefficients of each trial into the past. ** α = 0.01 and * α = 0.05 in a two-tailed one-sample t-test with null-hypothesis mean equal to zero for the fixed-effects estimates.
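The quantity plotted in (A) for a single session can be sketched as follows. This is an illustrative computation rather than the authors' analysis code: the function name and the string labels for choices, rewards and transitions are hypothetical, and averaging across sessions and the SEM are left out.

```python
from collections import defaultdict

def stay_probabilities(choices, rewards, transitions):
    """Proportion of repeated first-stage choices, grouped by the reward
    and transition type experienced on the previous trial."""
    counts = defaultdict(lambda: [0, 0])  # key -> [stays, total]
    for t in range(1, len(choices)):
        key = (rewards[t - 1], transitions[t - 1])
        counts[key][0] += int(choices[t] == choices[t - 1])
        counts[key][1] += 1
    return {key: stays / total for key, (stays, total) in counts.items()}

# Toy session with made-up data, for illustration only.
choices = ["A", "A", "B", "B", "A"]
rewards = ["high", "low", "high", "low", "high"]
transitions = ["common", "common", "rare", "rare", "common"]
p_stay = stay_probabilities(choices, rewards, transitions)
```

In this toy session the choice is repeated after both high-reward trials and switched after both low-reward trials, regardless of transition type, so the result does not depend on the transition.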

Table 1.

Best fitting mixed-effects hyperparameters from the best models of each reinforcement learning approach.

Fig 3.

The impact of both reward and transition information on first-stage choice reaction time.

(A) The z-scored first-stage reaction time (RT) difference between previous-common and previous-rare trials, averaged across sessions, as a function of reward on the previous trial (higher z-scores indicate faster responses when the previous transition was rare). Error bars depict SEM. (B-C) Multiple linear regression results on first-stage reaction time with the contributions of the reward main effect (B) and the reward × transition interaction term (C) from the five previous trials. Dots represent the fixed-effects coefficients for each session (red when p < 0.05, grey otherwise). Bar and error bar values correspond, respectively, to the mixed-effects coefficients and their SE. Dashed lines illustrate the exponential best fit on the mean fixed-effects coefficients of each trial into the past. ** α = 0.01 and * α = 0.05 in a two-tailed one-sample t-test with null-hypothesis mean equal to zero.
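A rough sketch of the per-session measure in (A) is given below. It is not the authors' pipeline. It assumes that RTs are z-scored within a session using the population standard deviation, and that the difference is taken as the mean after common transitions minus the mean after rare transitions, which matches the caption's sign convention. The function name and labels are hypothetical.

```python
from statistics import mean, pstdev

def rt_common_minus_rare(rts, rewards, transitions):
    """Mean z-scored RT after common transitions minus mean z-scored RT
    after rare transitions, split by the previous trial's reward."""
    mu, sd = mean(rts), pstdev(rts)  # assumption: population SD
    z = [(rt - mu) / sd for rt in rts]
    groups = {}
    for t in range(1, len(rts)):
        key = (rewards[t - 1], transitions[t - 1])
        groups.setdefault(key, []).append(z[t])
    diffs = {}
    for reward in set(rewards[:-1]):
        common = groups.get((reward, "common"))
        rare = groups.get((reward, "rare"))
        if common and rare:  # both transition types must be observed
            diffs[reward] = mean(common) - mean(rare)
    return diffs

# Toy session with made-up RTs (ms): slower after common, faster after rare.
rts = [400, 500, 300, 500, 300]
rewards = ["high"] * 5
transitions = ["common", "rare", "common", "rare", "common"]
diffs = rt_common_minus_rare(rts, rewards, transitions)
```

A positive value for a reward level means that responses were faster after rare transitions than after common ones at that level.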
