Removal of reinforcement improves instrumental performance in humans by decreasing a general action bias rather than unmasking learnt associations

Performance during instrumental learning is commonly believed to reflect the knowledge that has been acquired up to that point. However, recent work in rodents found that instrumental performance was enhanced during periods when reinforcement was withheld, relative to periods when reinforcement was provided. This suggests that reinforcement may mask acquired knowledge and lead to impaired performance. In the present study, we investigated whether such a beneficial effect of removing reinforcement translates to humans. Specifically, we tested whether performance during learning was improved during non-reinforced relative to reinforced task periods using signal detection theory and a computational modelling approach. To this end, 60 healthy volunteers performed a novel visual go/no-go learning task with deterministic reinforcement. To probe acquired knowledge in the absence of reinforcement, we interspersed blocks without feedback. In these non-reinforced task blocks, we found an increased d’, indicative of enhanced instrumental performance. However, computational modelling showed that this improvement in performance was not due to an increased sensitivity of decision making to learnt values, but to a more cautious mode of responding, as evidenced by a reduction of a general response bias. Together with an initial tendency to act, this is sufficient to drive differential changes in hit and false alarm rates that jointly lead to an increased d’. To conclude, the improved instrumental performance in the absence of reinforcement observed in studies using asymmetrically reinforced go/no-go tasks may reflect a change in response bias rather than unmasking latent knowledge.


Results separately for the two experiments
We acquired two independent data sets (each N = 30), which we pooled for the main analyses. Here, we present the results separately for Experiment 1 (Fig A-C and Table A) and for Experiment 2.


Behavioural analysis of the pooled data set
Here, we report the post hoc tests for the difference in the sensitivity d' between probe and pre-probe trials (Table I). Additionally, we analysed the difference between probe and post-probe trials of the pooled data (Fig G and Table J).
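For reference, the sensitivity measure can be computed from hit and false-alarm counts as in this minimal sketch (the log-linear correction for extreme rates is an assumption and may differ from the correction used in the paper):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (adding 0.5 to every cell) keeps the
    z-transform finite when a rate is exactly 0 or 1; this choice of
    correction is an assumption, not necessarily the paper's.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Illustrative counts: 18 hits / 2 misses on go trials,
# 6 false alarms / 14 correct rejections on no-go trials.
print(d_prime(18, 2, 6, 14))
```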

Analysis of modelling results
We compared all models with five additional learning rates evenly log-spaced between 0.01 and 0.2 to verify that the results do not depend on the learning rate α = 0.06, which we chose for the main analysis (Table K and Fig H). The results of the parameter recovery for all free parameters of the bias model are also shown here (Fig I and Table L).
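The robustness grid can be generated as in the following sketch, together with a standard delta-rule update under a fixed learning rate (the function name and the initial value q0 are illustrative, not taken from the paper):

```python
import numpy as np

# Five learning rates evenly log-spaced between 0.01 and 0.2, as in the
# robustness check (alpha = 0.06 from the main analysis is fitted separately).
alphas = np.logspace(np.log10(0.01), np.log10(0.2), num=5)

def q_trajectory(rewards, alpha, q0=0.5):
    """Delta-rule (Rescorla-Wagner) value updates with a fixed learning
    rate; q0 = 0.5 is an illustrative initial value."""
    q, out = q0, []
    for r in rewards:
        q += alpha * (r - q)  # prediction-error update
        out.append(q)
    return out

# Larger alpha -> faster convergence towards a constant reward of 1.
for a in alphas:
    print(f"alpha = {a:.3f}, final Q = {q_trajectory([1.0] * 20, a)[-1]:.3f}")
```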

Model validation
Because the SDT analysis is subject to artefacts, we did not have quantitative measures for the model validation. We therefore plotted go-response probabilities based on model simulations and visually inspected whether the goodness of model fit supported the BIC outcomes. First, we validated whether the individual parameters of the baseline model are necessary to describe the general behaviour (Fig J). Second, we compared different types of forgetting (Fig K). Third, we tested whether the temperature or the bias model could reproduce the behavioural change in probe trials (Fig L).

Individual parameters
First, we ignored the probe trials and checked whether all parameters of the baseline model are needed to describe the empirical behaviour: participants' go-response probabilities for both go and no-go trials started high, with the probability for go trials staying high and the probability for no-go trials decreasing over time.
We started with a model containing two free parameters, a softmax temperature and a general bias. This simple model could not describe the empirical data, as the simulated go-response probabilities start relatively low and the probabilities for go and no-go trials
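A minimal sketch of such a two-parameter choice rule, assuming the common parameterisation in which a general bias is added to the go value inside a softmax (names and values here are illustrative, not the paper's exact specification):

```python
import math

def p_go(q_go, q_nogo, temperature, go_bias):
    """Probability of a go response: softmax over the two action values,
    with a general bias added to the go value. This parameterisation is
    a common convention and may differ in detail from the paper's."""
    x = (q_go + go_bias - q_nogo) / temperature
    return 1.0 / (1.0 + math.exp(-x))

# With equal action values, a positive bias produces a tendency to act,
# and removing the bias yields indifference (more cautious responding).
print(p_go(0.5, 0.5, temperature=1.0, go_bias=1.0))
print(p_go(0.5, 0.5, temperature=1.0, go_bias=0.0))
```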

Types of forgetting
There are several ways to implement a decay of option values due to forgetting. We implemented two different types of forgetting and compared them to our baseline model. First, we set up a model in which values decay towards the initial Q-value. This model performed worse than the baseline model: due to the low initial Q-value, the go-response probabilities start low, with the probabilities for no-go trials staying low and the probabilities for go trials increasing over time, which is not in line with the participants' behaviour (BIC = 547.74 ± 116.46, Fig K.A).

Another approach is to apply forgetting when no feedback for the go-response is received (instead of forgetting after no-go responses). Again, this model performed worse than the baseline model and could not capture the participants' behaviour (BIC = 600.70 ± 117.24, Fig K.B).
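The two forgetting variants compared in this section can be sketched as follows (function and parameter names are hypothetical; only the direction of the decay follows the text):

```python
def decay_towards_initial(q, decay, q0):
    """First variant: on every eligible trial the value relaxes back
    towards the initial Q-value q0."""
    return q + decay * (q0 - q)

def decay_when_no_feedback(q, decay, q0, feedback_received):
    """Second variant: forgetting is applied only when no feedback for
    the go-response was received; otherwise the value is unchanged."""
    return q if feedback_received else q + decay * (q0 - q)

# A learnt value of 0.8 drifts back towards q0 = 0.5 at rate 0.1 ...
print(decay_towards_initial(0.8, 0.1, 0.5))
# ... but stays put in the second variant when feedback was received.
print(decay_when_no_feedback(0.8, 0.1, 0.5, feedback_received=True))
```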

Modelling the behaviour in probe trials
In probe trials, participants' go-response probabilities for both go and no-go trials decreased.
Based on the baseline model, we then implemented two models differentiating between reinforced and probe trials. In the temperature model, we fitted a softmax temperature separately for each trial type. It performed worse than the baseline model, and the decrease in go-