Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making

doi:10.1371/journal.pcbi.1009070

Fig 4

Surprise as a modulator of the learning rate in episode 1 of block 2.

A. Surprise as a function of time since the start of block 2 for one representative participant: Surprise is very small most of the time, because the participant has already learned the transitions in the environment during block 1. The surprising transitions are those into the swapped states (blue) and those out of the swapped states (red). B. Maximal log-surprise values (yellow = large surprise) during the 1st episode of block 2, averaged over all participants. The swapped states are marked in red and the states preceding them in blue. One action from each swapped state is not surprising, namely the action that leads participants to the trap states both before and after the swap. C. Block diagram of the SurNoR algorithm: Information about the state s_t and reward r_t at time t is combined with novelty n_t (grey block) and passed on to the world-model (blue block, implementing the model-based branch of SurNoR) and the TD-learner (red block, implementing the model-free branch). The surprise value computed by the world-model modulates the learning rate of both the TD-learner and the world-model. The output of each block is a pair of Q-values, i.e., Q-values for estimated reward, Q_MF,R and Q_MB,R, as well as for estimated novelty, Q_MF,N and Q_MB,N. The hybrid policy (in purple) combines these values.

doi: https://doi.org/10.1371/journal.pcbi.1009070.g004
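
The data flow in panel C can be illustrated with a minimal Python sketch. It is not the SurNoR implementation: the caption does not give the functional forms, so the saturating modulation function, the linear weighting of the four Q-values, the softmax action selection, and every name and parameter value (`base_rate`, `max_rate`, `m`, `discount`, `weights`) are assumptions chosen for illustration only.

```python
import math


def modulated_learning_rate(surprise, base_rate=0.1, max_rate=0.9, m=1.0):
    """Map a non-negative surprise value to a learning rate.

    Assumed form: a saturating function of surprise that moves the rate
    from base_rate (no surprise) towards max_rate (very large surprise).
    """
    gamma = m * surprise / (1.0 + m * surprise)  # lies in [0, 1)
    return base_rate + (max_rate - base_rate) * gamma


def td_update(q, reward, q_next, surprise, discount=0.9):
    """One TD step whose learning rate is modulated by surprise."""
    alpha = modulated_learning_rate(surprise)
    return q + alpha * (reward + discount * q_next - q)


def hybrid_policy(q_mf_r, q_mb_r, q_mf_n, q_mb_n, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine four per-action Q-value lists into action probabilities.

    Assumed form: a weighted sum of the model-free and model-based
    Q-values for reward and novelty, followed by a softmax.
    """
    w_mf_r, w_mb_r, w_mf_n, w_mb_n = weights
    logits = [
        w_mf_r * a + w_mb_r * b + w_mf_n * c + w_mb_n * d
        for a, b, c, d in zip(q_mf_r, q_mb_r, q_mf_n, q_mb_n)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A surprising transition produces a larger update than a familiar one.
    print(td_update(q=0.0, reward=1.0, q_next=0.0, surprise=0.0))
    print(td_update(q=0.0, reward=1.0, q_next=0.0, surprise=5.0))
    # Two actions; the second has the higher novelty Q-values.
    print(hybrid_policy([0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 1.0]))
```

In this sketch a transition with near-zero surprise leaves the Q-value almost unchanged, whereas a large surprise (such as a transition into a swapped state) pushes the learning rate towards its upper bound so that the old estimate is overwritten quickly.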