Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making

doi:10.1371/journal.pcbi.1009070

Fig 6

A. Model-based surprise modulates model-free learning. A1. The learning rate of the model-free branch as a function of the model-based surprise, after fitting parameters to the behavior of all participants (see Eq 9 in Methods). The model-free learning rate for highly surprising transitions is more than 8 times greater than that for expected transitions. A2. Three modules from the block diagram of Fig 4C. There are two types of interactions between the model-based and the model-free branches of SurNoR: (i) the model-based branch modulates the learning rate of the model-free branch and (ii) the weighted (arrow thickness) outputs of the model-based and the model-free branches influence action selection (hybrid policy). A3. The histogram of surprise values across all trials of 12 participants. The distribution is multimodal, with high surprise for the unexpected transitions in the first episode of block 2, medium surprise whenever a transition is experienced for the first time, and low surprise for the expected transitions. A4. The relative importance of model-free (MF) compared to model-based (MB) in the weighting scheme of the hybrid policy during different episodes. Vertical axis: dominance of model-free (see Methods). Values larger than one (dashed line) indicate that the model-free branch dominates action selection. Error bars show the standard error of the mean.

B. Action choice probability indicates that surprise boosts learning during a single episode. Action choice probabilities of participants (data, grey) are compared with those of SurNoR and Hyb+N at the first visit to state 7 (B1) or state 3 (B2) in episodes 1 (left) and 2 (right) of block 2. B1. In state 7, action 1 is the good action before the swap, and action 4 is the good action after the swap. Error bars show the standard error of the mean, and the black dashed line corresponds to the action probability under random choice (0.25). In episode 2, SurNoR assigns a significantly higher probability to action 4 than to action 1, while according to the Hybrid model without surprise modulation, the action probabilities of action 1 and action 4 are not significantly different. B2. In state 3, action 4 is the good action before the swap, and action 1 is the good action after the swap. Behavioral data and SurNoR show a more rapid re-adaptation to the good action than Hyb+N. Note that only 8 of the 12 participants encountered state 7 in the first episode of block 2 before reaching the goal. We therefore limit the data analysis to these 8 participants in B1 but use the data of all 12 participants in B2.
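The two interactions described in panel A2 can be illustrated with a minimal Python sketch. It is not the paper's implementation: Eq 9 is not reproduced on this page, so the saturating form of `modulated_learning_rate`, the function and parameter names, and all numerical values (`base_rate`, `gain`, `half_point`) are hypothetical choices made only to show a learning rate that grows with surprise. Likewise, `hybrid_policy` assumes a softmax over a weighted sum of model-based and model-free action values, which is one common way to build a hybrid policy.

```python
import math


def modulated_learning_rate(surprise, base_rate=0.1, gain=8.0, half_point=1.0):
    """Hypothetical saturating map from surprise to a model-free learning rate.

    Not Eq 9 of the paper. The rate equals base_rate at zero surprise and
    approaches gain * base_rate as surprise grows (all values illustrative).
    """
    s = max(surprise, 0.0)
    boost = s / (s + half_point)  # 0 at zero surprise, tends to 1 for large s
    return base_rate * (1.0 + (gain - 1.0) * boost)


def hybrid_policy(q_mb, q_mf, w_mb, w_mf):
    """Softmax over a weighted sum of model-based and model-free values.

    The weights w_mb and w_mf play the role of the arrow thickness in
    panel A2 (an assumed form, not taken from the paper's Methods).
    """
    prefs = [w_mb * mb + w_mf * mf for mb, mf in zip(q_mb, q_mf)]
    top = max(prefs)  # subtract the maximum for numerical stability
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


# Four actions, as in the task. Made-up values: the model-free branch still
# prefers the old good action (index 0), the model-based branch the new one
# (index 3).
q_mf = [1.0, 0.0, 0.0, 0.2]
q_mb = [0.0, 0.0, 0.0, 1.0]
probs = hybrid_policy(q_mb, q_mf, w_mb=2.0, w_mf=1.0)
```

With both weights set to zero, every action receives probability 0.25, which corresponds to the random-choice level marked by the dashed line in panel B.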

doi: https://doi.org/10.1371/journal.pcbi.1009070.g006