Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making
Fig 4
Surprise as a modulator of the learning rate in episode 1 of block 2.
A. Surprise as a function of time since the start of block 2 for one representative participant: Surprise is very small most of the time, because the participant has already learned the transitions in the environment during block 1. The surprising transitions are the ones to the swapped states (blue) and the ones from the swapped states (red). B. Maximal log-surprise values (yellow = large surprise) during the 1st episode of block 2, averaged over all participants. The swapped states are marked in red and the states before them in blue. One action from each swapped state is not surprising, namely the action that leads participants to trap states both before and after the swap. C. Block diagram of the SurNoR algorithm: Information about state s_t and reward r_t at time t is combined with novelty n_t (grey block) and passed on to the world-model (blue block, implementing the model-based branch of SurNoR) and the TD-learner (red block, implementing the model-free branch). The surprise value computed by the world-model modulates the learning rate of both the TD-learner and the world-model. The output of each block is a pair of Q-values, i.e., Q-values for estimated reward, Q_{MF,R} and Q_{MB,R}, as well as for estimated novelty, Q_{MF,N} and Q_{MB,N}. The hybrid policy (in purple) combines these values.
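The data flow in panel C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the saturating form of the learning rate, the parameter `m`, the weights `betas`, and the softmax combination are all assumptions chosen only to show how a surprise value could scale a learning rate and how four Q-value vectors could be merged into one action distribution.

```python
import math

def modulated_learning_rate(surprise, m=0.1):
    """Map a non-negative surprise value to a learning rate in [0, 1).

    Assumed form: the rate grows with surprise and saturates below 1.
    `m` is a hypothetical sensitivity parameter.
    """
    return m * surprise / (1.0 + m * surprise)

def hybrid_policy(q_mf_r, q_mb_r, q_mf_n, q_mb_n, betas=(1.0, 1.0, 1.0, 1.0)):
    """Combine four Q-value vectors (one entry per action) into action probabilities.

    Assumed form: softmax over a weighted sum, with hypothetical weights `betas`
    for the model-free/model-based reward and novelty Q-values.
    """
    logits = [
        betas[0] * q_mf_r[a] + betas[1] * q_mb_r[a]
        + betas[2] * q_mf_n[a] + betas[3] * q_mb_n[a]
        for a in range(len(q_mf_r))
    ]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Low surprise yields a small learning rate, high surprise a large one.
low_rate = modulated_learning_rate(0.01)
high_rate = modulated_learning_rate(100.0)

# Two actions; the second has higher reward and novelty estimates.
probs = hybrid_policy([0.0, 1.0], [0.0, 0.5], [0.2, 0.4], [0.1, 0.3])
```

Under these assumptions, a familiar transition (low surprise) barely changes the estimates of either branch, whereas a transition to or from a swapped state (high surprise) produces a learning rate close to 1.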