Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making

doi:10.1371/journal.pcbi.1009070

Fig 1.

Experimental paradigm.

A. After image onset, participants had to wait for 700–1700ms (randomly chosen) until four grey disks were presented at the bottom of the image. After clicking on one disk, a blank screen was presented for another random interval of 700 to 1700ms. The next image appeared afterwards. Different participants saw different images, but the underlying structure was identical for all participants. The goal image is a ‘thumb-up’ image in this example. The blue lines indicate the window of EEG analysis. B. Structure of the environment during block 1. There were 10 states with 4 actions each plus a goal state (G). States 1–7 are progressing states and states 8–10 are trap states. For each progressing state, one action led participants to the next progressing state, two actions led participants to one of the trap states, and one action made participants stay at the current state. The action which made participants stay at the current state is shown for states 1, 3, and 7, as an example. For each trap state, three actions led participants to one of the trap states, and one action led participants to state 1. Not all action arrows are drawn for the trap states to simplify illustration. C. Average number of actions of participants during block 1 (blue) and block 2 (red): The 1st episode of block 2 was significantly shorter than the 1st episode of block 1 (one-sample t-test, p-value = 0.035). Error bars show the standard error of the mean, and each grey point shows the data of one participant. D. Environment used in block 2: The images presenting state 3 and state 7 (in red) were swapped. Other transitions remained unchanged.
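
The block 1 structure described in panel B can be sketched as a transition table. This is only an illustration: the caption gives the number of actions of each type per state, but not which action index does what or which trap state each action leads to, so those assignments are drawn at random here and are assumptions. The names `build_environment`, `TRAPS`, and `GOAL` are hypothetical.

```python
import random

N_PROGRESS = 7          # progressing states 1-7
TRAPS = [8, 9, 10]      # trap states
GOAL = "G"              # goal state, reached from state 7

def build_environment(seed=0):
    """Return transitions[state][action] -> next state (actions 1-4)."""
    rng = random.Random(seed)
    transitions = {}
    for s in range(1, N_PROGRESS + 1):
        forward = GOAL if s == N_PROGRESS else s + 1
        # One forward action, one stay action, two actions into trap states.
        outcomes = [forward, s, rng.choice(TRAPS), rng.choice(TRAPS)]
        rng.shuffle(outcomes)  # assumed action labelling, not from the paper
        transitions[s] = dict(enumerate(outcomes, start=1))
    for s in TRAPS:
        # One action back to state 1, three actions into trap states.
        outcomes = [1] + [rng.choice(TRAPS) for _ in range(3)]
        rng.shuffle(outcomes)
        transitions[s] = dict(enumerate(outcomes, start=1))
    return transitions

env = build_environment()
```

With this layout, the shortest path from state 1 to the goal takes 7 actions, and the shortest path from a trap state to state 2 takes 2 actions, matching the minimum escape length quoted for Fig 2A.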

Fig 2.

Behavioral results for episode 1 of blocks 1 and 2.

A. Escape from the trap states: Median number of actions of participants between falling into a trap state and reaching state 2 in episode 1 of block 1 (left) and block 2 (right). Error bars show the 25% and 75% quantiles, and each grey point shows the data of one participant. The grey dashed lines correspond to the minimum number of actions (2) that are needed to escape the trap states. The x-axis shows the number of visits to the trap states; for example, 10 means the 10th time a participant fell from a progressing state into the trap states. Because of between-participant differences, not all participants visited the trap states as often as, e.g., 20 times. The size of the circles indicates the number of participants over which the average is taken. In the 1st episode of block 2 (right), four participants reached the goal state without falling into the trap states; thus, only the data for the other 8 participants is shown. A moving average of length three was applied to the data. B. Average progress of participants each time visiting states 1, 2, 3, and 4 in episode 1 of block 1. We assign a progress value of 1 to good actions (the ones taking participants closer to the goal), 0.5 to neutral actions (the ones making participants stay where they are), and -0.75 to bad actions (the ones taking participants to the trap states); with this assignment, average progress vanishes for random exploration. The size of the circles shows the number of participants over which the average is taken, and error bars show the standard error of the mean. A moving average of length three was applied to the data. C. Average progress of participants each time visiting states 1, 2, 7 (swapped with 3), and 4 in episode 1 of block 2. See S1 Fig (A) for the average progress at the progressing states in the proximity of the goal.
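
The claim that average progress vanishes under random exploration follows from the action counts in Fig 1B (one good, one neutral, and two bad actions per progressing state). A minimal sketch of that arithmetic and of a length-three moving average is given below. The caption does not say how the moving average treats the edges, so the version here, which keeps only full windows, is an assumption, and the function names are hypothetical.

```python
PROGRESS = {"good": 1.0, "neutral": 0.5, "bad": -0.75}

def expected_progress_random():
    """Mean progress when each of the 4 actions is chosen with probability 1/4."""
    # One good, one neutral, and two bad actions per progressing state.
    total = PROGRESS["good"] + PROGRESS["neutral"] + 2 * PROGRESS["bad"]
    return total / 4

def moving_average(values, window=3):
    """Moving average over full windows only (assumed edge handling)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

print(expected_progress_random())         # 0.0
print(moving_average([1, 2, 3, 4, 5]))    # [2.0, 3.0, 4.0]
```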

Fig 3.

Novelty in episode 1 of block 1.

A. The number of state visits (left panel) and novelty (right panel) as a function of time for one representative participant: The number of visits increases rapidly for the trap states and remains 0 for a long time for the states closer to the goal. Novelty of each state is defined as the negative log-probability of observing that state (see Eqs 1 and 2) and, hence, increases for states which are not observed as time passes. The first time participants encounter state 7 (the state before the goal state) is denoted by t*. B. Average (over participants) novelty (color coded) at t*: Novelty of each state is a decreasing function of its distance from the goal state.
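
To make the definition in panel A concrete, the sketch below computes novelty as the negative log-probability of a state. Eqs 1 and 2 are not reproduced in these captions, so the probability estimate used here (a visit count with one pseudo-count per state, divided by elapsed time plus the number of states) is an assumption chosen for illustration and may differ from the paper's equations. The function name and arguments are hypothetical.

```python
import math

def novelty(visit_counts, state, t, n_states):
    """Negative log of an assumed count-based estimate of p(state)."""
    # Assumed estimator: (count + 1) / (t + n_states), i.e. Laplace smoothing.
    p = (visit_counts.get(state, 0) + 1) / (t + n_states)
    return -math.log(p)

counts = {8: 5}  # made-up example: trap state 8 visited 5 times, state 7 never
early = novelty(counts, state=7, t=10, n_states=11)
late = novelty(counts, state=7, t=100, n_states=11)
```

Under this assumed estimator, `late` exceeds `early`: a state that is never observed becomes more novel as time passes, which is the qualitative behavior the caption describes.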

Fig 4.

Surprise as a modulator of the learning rate in episode 1 of block 2.

A. Surprise as a function of time since the start of block 2 for one representative participant: Surprise has very small values most of the time, because the participant has already learned the transitions in the environment during block 1. The surprising transitions are the ones to the swapped states (blue) and the ones from the swapped states (red). B. Maximal log-surprise values (yellow = large surprise) during the 1st episode of block 2, averaged over all participants. The swapped states are marked in red and the states before them in blue. One action from each swapped state is not surprising, i.e., the action leading participants to trap states both before and after the swap. C. Block diagram of the SurNoR algorithm: Information about state s_t and reward r_t at time t is combined with novelty n_t (grey block) and passed on to the world-model (blue block, implementing the model-based branch of SurNoR) and the TD learner (red block, implementing the model-free branch). The surprise value computed by the world-model modulates the learning rate of both the TD learner and the world-model. The output of each block is a pair of Q-values, i.e., Q-values for estimated reward Q_MF,R and Q_MB,R as well as for estimated novelty Q_MF,N and Q_MB,N. The hybrid policy (in purple) combines these values.
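
The last step of panel C, combining the four sets of Q-values into one policy, can be sketched as follows. The caption only states that the hybrid policy combines these values; the weighted sum followed by a softmax used here is an assumed form for illustration, and the weights and the function name `hybrid_policy` are hypothetical.

```python
import math

def hybrid_policy(q_values, weights):
    """Action probabilities from a weighted sum of Q-values (assumed softmax)."""
    n_actions = len(next(iter(q_values.values())))
    combined = [
        sum(weights[name] * q_values[name][a] for name in q_values)
        for a in range(n_actions)
    ]
    top = max(combined)                      # subtract max for numerical stability
    exps = [math.exp(c - top) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up Q-values for the 4 actions available in one state.
q = {
    "MF_R": [0.0, 0.0, 0.0, 0.0],
    "MB_R": [0.0, 0.0, 0.0, 0.0],
    "MF_N": [1.0, 0.0, 0.0, 0.0],
    "MB_N": [2.0, 0.0, 0.0, 0.0],
}
w = {"MF_R": 1.0, "MB_R": 1.0, "MF_N": 0.5, "MB_N": 0.5}
probs = hybrid_policy(q, w)
```

In this made-up example no reward has been found yet, so the novelty Q-values alone tilt the policy toward the first action.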

Fig 5.

Model comparison of model-based (MB, blue bars), model-free (MF, red bars), and hybrid algorithms (Hyb and SurNoR, purple bars).

Exploratory behavior is induced by either optimistic initialization (+OI), uncertainty-seeking (+U), unbiased random action choices (RC), or novelty-seeking (+N); e.g., a model-based algorithm with novelty-seeking is denoted as MB+N. SurNoR and the model-free or hybrid algorithms annotated with ‘+S’ use surprise to modulate the learning rate of the model-free TD learner; SurNoR and all algorithms annotated with ‘+S’ also use surprise modulation during model building (see Methods). A. Difference in log-evidence (with respect to RC) for the algorithms for all episodes of both blocks (left panel), the 1st episode of block 1 (middle), and the 1st episode of block 2 (right panel). High values indicate good performance; differences greater than 3 or 10 are considered significant or strongly significant, respectively (see Methods); a value of 0 corresponds to random action choices (RC). The random initialization of the parameter optimization procedure introduces a source of noise, and the small error bars indicate the standard error of the mean over different runs of optimization (Methods, statistical model analysis). B. The expected posterior model probability [52, 53] given the whole dataset (Methods) with a random-effects assumption on the models. C. Accuracy rate of actions predicted by SurNoR (left scale and purple bars: mean and the standard error of the mean across participants) and the average uncertainty of SurNoR (right scale and dashed grey curve: mean entropy of action choice probabilities and the standard error of the mean across participants).
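
Two quantities in this caption lend themselves to a short sketch: the thresholds on log-evidence differences in panel A and the entropy of action choice probabilities in panel C. The function names are hypothetical, and the use of the natural logarithm for the entropy is an assumption, since the caption does not state the unit.

```python
import math

def evidence_label(delta_log_evidence):
    """Label a log-evidence difference using the thresholds 3 and 10."""
    if delta_log_evidence > 10:
        return "strongly significant"
    if delta_log_evidence > 3:
        return "significant"
    return "not significant"

def choice_entropy(probs):
    """Shannon entropy (in nats, assumed) of action choice probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = choice_entropy([0.25, 0.25, 0.25, 0.25])   # equals log(4)
certain = choice_entropy([1.0, 0.0, 0.0, 0.0])       # equals 0
```

With four actions, the entropy ranges from 0 for a deterministic choice to log(4) for random choice, so lower values in panel C correspond to a more confident model.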

Fig 6.

A. Model-based surprise modulates model-free learning. A1. The learning rate of the model-free branch as a function of the model-based surprise, after fitting parameters to the behavior of all participants (see Eq 9 in Methods). The model-free learning rate for highly surprising transitions is more than 8 times greater than the one for expected transitions. A2. Three modules from the block diagram of Fig 4C. There are two types of interactions between the model-based and the model-free branches of SurNoR: (i) The model-based branch modulates the learning rate of the model-free branch and (ii) the weighted (arrow thickness) outputs of the model-based and the model-free branches influence action selection (hybrid policy). A3. The histogram of surprise values across all trials of 12 participants. The distribution is multimodal with high surprise for the unexpected transitions in the 1st episode of block 2, medium surprise whenever a transition is experienced for the first time, and low surprise for the expected transitions. A4. The relative importance of model-free (MF) compared to model-based (MB) in the weighting scheme of the hybrid policy during different episodes. Vertical axis: dominance of model-free (see Methods). Values larger than one (dashed line) indicate that the model-free branch dominates action selection. Error bars show the standard error of the mean. B. Action choice probability indicates that surprise boosts learning during a single episode. Action choice probabilities of participants (data, grey) are compared with those of SurNoR and Hyb+N at the first time visiting state 7 (B1) or state 3 (B2) in episodes 1 (left) and 2 (right) of block 2. B1. In state 7, action 1 is the good action before the swap, and action 4 is the good action after the swap. Error bars show the standard error of the mean, and the black dashed line corresponds to the action probability under random choice (0.25). In episode 2, SurNoR assigns a significantly higher probability to action 4 than to action 1, while according to the hybrid model without surprise modulation (Hyb+N), the probabilities of action 1 and action 4 are not significantly different. B2. In state 3, action 4 is the good action before the swap, and action 1 is the good action after the swap. Behavioral data and SurNoR show a more rapid re-adaptation to the good action than Hyb+N. Note that only 8 (out of the 12) participants encountered state 7 in the first episode of block 2 before reaching the goal. We therefore limit the data analysis to these 8 participants in B1 but use data of all 12 participants in B2.

Fig 7.

Posterior predictive checks.

A. Average number of actions of all 12 simulated participants for each episode (cf. Fig 1C). B. Median number of actions of simulated participants to escape the trap states at each of their visits in episode 1 of block 1 (left) and block 2 (right) (cf. Fig 2A). C. Average progress of simulated participants each time visiting states 1, 2, 3, and 4 in episode 1 of block 1 (cf. Fig 2B). D. Average progress of simulated participants each time visiting states 1, 2, 7 (swapped with 3), and 4 in episode 1 of block 2 (cf. Fig 2C). See S2 and S3 Figs for two other sets of 12 simulated participants with different random seeds. See S1 Fig (B) for the average progress at the progressing states in the proximity of the goal.

Fig 8.

When data is generated by SurNoR, the true model can be recovered.

We applied our model-selection method to the data of three sets of 12 simulated participants. The left column corresponds to the data shown in Fig 7, and the middle and the right columns correspond to the data shown in S2 and S3 Figs, respectively. We compared the SurNoR model with its strongest competitors: MF+S+N, Hyb+S+U, and Hyb+N (cf. Fig 5). A. Difference in log-evidence with respect to random choice (RC) and B. the expected posterior model probability [52, 53] for the algorithms for all episodes of both blocks given the data of each of the three sets (different columns) of 12 simulated participants (cf. Fig 5A and 5B).

Fig 9.

Grand correlation analysis of normalized ERPs over all 2524 trials of 10 participants.

The dashed lines show confidence intervals. Shaded areas indicate intervals of significant correlations (FDR controlled by 0.1, one-sample t-test). Correlations of ERP with A. Surprise, B. Novelty, C. NPE, D. RPE (computed after excluding the trials from the 1st episode of the 1st block during which RPE is equal to 0) and E. Reward.
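
The caption states that the false discovery rate is controlled at 0.1 but does not name the procedure. A minimal sketch of one common choice, the Benjamini-Hochberg step-up procedure, is shown below as an assumption; the paper's Methods may specify a different variant. The function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values, e.g. one per time point.
print(benjamini_hochberg([0.001, 0.03, 0.04, 0.5]))          # [True, True, True, False]
print(benjamini_hochberg([0.03, 0.045, 0.5, 0.6, 0.9]))      # all False
```

The second call illustrates why an uncorrected p-value below 0.05 may not survive the correction when most of the other tests in the same family are far from significant.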

Fig 10.

ERP variations explained by trial-by-trial and participant-by-participant multivariate linear regression analysis.

Surprise_⊥ (magenta), Novelty_⊥ (dark blue), NPE_⊥ (light blue), R₊ (brown) and R₋ (red) were used as explanatory variables, and the ERP amplitude at each time point was considered as the response variable. A. Encoding power (adjusted R-squared values) averaged over 10 participants (dashed lines show the standard error of the mean) at each time point. Shaded areas and horizontal lines indicate four time intervals (W1, …, W4) of significant encoding power (FDR controlled by 0.1, one-sample t-test, only for the time points after the baseline). The 3rd time interval has been split into two time windows of equal length for the analysis in C. B. Values of the regression coefficients (averaged over participants) for Surprise_⊥, Novelty_⊥, NPE_⊥, R₊, and R₋ as a function of time. Error bars are not shown to simplify the illustration. C. In each of the 5 time windows, the regression coefficients plotted in B have been averaged over time. Error bars show the standard error of the mean (across participants). Asterisks show significantly non-zero values (FDR controlled by 0.1 for each time window, one-sample t-test). The Novelty_⊥ coefficients in the 1st and the last time windows (dot) have p-values of 0.03 and 0.04, respectively, which are not significant after FDR correction. In the second time window, Surprise_⊥, Novelty_⊥, NPE_⊥, and R₊ have significantly positive coefficients.
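
Encoding power in panel A is reported as adjusted R-squared. The standard textbook formula is sketched below for a regression with an intercept and five explanatory variables, as in this figure. The function name and the example numbers are hypothetical, and the captions do not confirm whether the paper's regression includes an intercept.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1 - (1 - r_squared) * (n_samples - 1) / dof

# Made-up example: R^2 = 0.10 from 250 trials and 5 explanatory variables.
value = adjusted_r_squared(0.10, n_samples=250, n_predictors=5)
```

Unlike the raw R-squared, the adjusted value can be negative when the explanatory variables explain less variance than expected by chance, which makes it a fairer measure when averaging over time points with little signal.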
