Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making

Classic reinforcement learning (RL) theories cannot explain human behavior in the absence of external reward or when the environment changes. Here, we employ a deep sequential decision-making paradigm with sparse reward and abrupt environmental changes. To explain the behavior of human participants in these environments, we show that RL theories need to include surprise and novelty, each with a distinct role. While novelty drives exploration before the first encounter of a reward, surprise increases the rate of learning of a world-model as well as of model-free action-values. Even though the world-model is available for model-based RL, we find that human decisions are dominated by model-free action choices. The world-model is only marginally used for planning, but it is important to detect surprising events. Our theory predicts human action choices with high probability and allows us to dissociate surprise, novelty, and reward in EEG signals.

The SurNoR algorithm (Alg. A) combines surprise signals with novelty and reward so as to explore the environment, learn its structure, and exploit rewards. A simple block diagram of the algorithm is shown in Fig 4C of the main text. In the SurNoR algorithm, a model-based and a model-free branch interact with each other. The output of each branch is a pair of Q-values, one for estimated novelty and one for estimated reward. The model-based branch updates model-based Q-values using a world-model that is estimated online, while the model-free branch uses a surprise-modulated TD-learner to update the model-free Q-values. Finally, actions are selected following a hybrid policy that combines model-free and model-based Q-values (see [1,2] for similar approaches). In this section, we describe the SurNoR algorithm in detail. For the sake of clarity and coherence, we repeat here some details already explained in the main text.
Formalization of the environment. The state and the action at time $t$ are random variables $S_t$ and $A_t$ which take values in the finite sets $\mathcal{S}$ and $\mathcal{A}$, respectively. In the particular case of our experiment, we have $\mathcal{S} = \{1, ..., 10, G\}$ and $\mathcal{A} = \{1, ..., 4\}$. Taking a Bayesian perspective, we consider the transition probability matrix as another random variable $\Theta$, i.e.,
$$P(S_{t+1} = s' \,|\, S_t = s, A_t = a, \Theta = \theta) = \theta_{s,a}(s'), \quad (1)$$
where the values of $\theta_{s,a}(s')$ for combinations of $s$, $a$, and $s'$ are unknown and need to be estimated from experience. Since our environment is deterministic, except for the switch of two states before the start of block 2, the real transition probabilities are $\theta^{real}_{s,a}(s') = \delta(s', T(s,a))$, where $T(s,a)$ denotes the target state of the transition from state $s$ given action $a$, and the Kronecker $\delta$ is defined as $\delta(x, x') = 1$ if $x = x'$ and zero otherwise; $T(s,a)$ corresponds to the arrows in Fig 1B and Fig 1D in the main text. The target state depends on the block number in our experiment. Note that $T(s,a)$ is unknown to the participants and to the SurNoR algorithm.
Definition of novelty. While a participant moves in the environment, the count $C^{(t)}_s = |\{t' : 1 \le t' \le t \text{ and } s_{t'} = s\}|$ indicates how often state $s$ has been encountered up to time $t$. We assume that at each time $t$, participants are able to estimate the empirical frequency $p^{(t)}_N(s)$ of encountering state $s \in \mathcal{S}$, formally defined as
$$p^{(t)}_N(s) = \frac{C^{(t)}_s + 1}{t + |\mathcal{S}|}, \quad (3)$$
where $|\mathcal{S}|$ is the total number of states (i.e., 11 for our experiment). Note that the participants know the total number of states, due to the pre-experiment introduction. The empirical frequency in Eq 3 is equal to the expected probability of observing state $s$ given $s_{1:t}$ under the assumption of a uniform prior over the probabilities of observing different states: before the start of the experiment, all $|\mathcal{S}|$ states have the same prior probability $p^{(0)}_N(s) = 1/|\mathcal{S}|$. We define the novelty of state $s$ at time $t$ as the negative logarithm of the empirical frequency
$$N^{(t)}(s) = -\log p^{(t)}_N(s). \quad (4)$$
In our algorithm, novelty acts just like an internally generated reward or exploration bonus (see subsection 'Formalizing model-based Q-values'). The main difference between our definition of novelty and most of the previously proposed measures of 'exploration bonus' [3-9] lies in their dependency upon states and actions: while the usual exploration bonus measures are functions of state-action pairs, we define our novelty as a function of states only. Our choice is more consistent with the behavior of participants in our experiment, since a reasonable strategy for participants is to visit the states that they rarely encounter, as opposed to testing all actions in all states. From this perspective, our novelty measure is similar to the exploration bonus proposed by Bellemare et al. (2016) [10].
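For concreteness, the following Python snippet computes the novelty signal of Eqs 3 and 4 from the state counts. It is a minimal sketch, assuming the Laplace-smoothed frequency implied by the uniform prior; the function and variable names are ours, not the paper's.

```python
# Minimal sketch of the novelty signal (Eqs 3 and 4), assuming the
# Laplace-smoothed empirical frequency implied by the uniform prior
# p_N^(0)(s) = 1/|S|. Names are illustrative.
import numpy as np

N_STATES = 11  # |S| in the experiment: states {1, ..., 10, G}

def novelty(counts: np.ndarray, t: int) -> np.ndarray:
    """Novelty N^(t)(s) = -log p_N^(t)(s) for every state s.

    counts[s] is C_s^(t), the number of encounters of state s up to time t.
    With a uniform prior over state frequencies, the expected probability
    of observing s given s_1:t is (C_s^(t) + 1) / (t + |S|).
    """
    p_N = (counts + 1.0) / (t + N_STATES)
    return -np.log(p_N)

# Example: after t = 5 steps, states 0 and 1 were visited, state 10 never.
counts = np.array([3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(novelty(counts, t=5))  # unvisited states have the highest novelty
```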
In three of our alternative algorithms (see Section 'Alternative algorithms') we use a state-of-the-art exploration strategy [8,9] which defines exploration bonus (or internal reward) as a function of the pairs of states and actions. We compare these algorithms with SurNoR (see Fig 5 in the main text).

Model-based branch of SurNoR
The pseudocode for the model-based branch is shown in Alg. B. In this subsection, the details are explained.
World-model. The participants knew that there were 11 states and 4 possible actions in each state. However, they were not aware of the actual transition probability matrix. In particular, they did not know whether the environment was deterministic or stochastic. Therefore, we define a participant's model of the world as an approximation $q$ of the posterior distribution of the transition probability matrix, similar to the approach of [11-13]. In the following, we call $q$ the belief of the participant. We assume that a participant estimates the transition probabilities by a weighted average
$$\hat\theta^{(t)}_{s,a}(s') = \mathbb{E}_{q^{(t)}}\big[\theta_{s,a}(s')\big] = \int \theta_{s,a}(s')\, q^{(t)}(\theta)\, d\theta, \quad (6)$$
where the weighting factor is given by the belief $q^{(t)}$. For convenience, the transition probability $\hat\theta^{(t)}_{s,a}(s')$ is written as $p^{(t)}(s'|s,a)$ in the main text, e.g., in Eq 4 and Eq 3 of the main text.
For exact Bayesian inference, one needs to explicitly specify the generative model which governs the transitions. In particular, the dynamics of $\Theta$ over time should be known, e.g., whether it is fixed, continuously drifting, or experiencing abrupt changes [14-17]. However, rather than making explicit assumptions about the generative model as a starting point for exact Bayesian inference, we work with a general parametric distribution $q^{(t)}$ (see the next part) which is updated by an appropriate learning algorithm after each observation, similar to approaches in machine learning [18,19].
Beliefs as Dirichlet distributions. We assume that the transition probabilities from different state-action pairs are independent of each other, i.e.,
$$q^{(t)}(\theta) = \prod_{(s,a) \in \mathcal{S} \times \mathcal{A}} q^{(t)}(\theta_{s,a}),$$
where $\theta_{s,a}$ is defined as in Eq 1. As a natural choice for a probability distribution over transition probabilities, we consider the belief $q^{(t)}(\theta_{s,a})$ to be a Dirichlet distribution with parameter $\alpha^{(t)}_{s,a}$:
$$q^{(t)}(\theta_{s,a}) = \text{Dir}(\theta_{s,a}; \alpha^{(t)}_{s,a}).$$
As a result, at each time $t$, the belief of participants about their environment can be summarized in the set of parameters $\{\alpha^{(t)}_{s,a}, \forall (s,a) \in \mathcal{S} \times \mathcal{A}\}$. We consider the parameter of the prior belief $q^{(1)}$ (i.e., $\alpha^{(1)}$) to be the same for all transitions, i.e.,
$$\alpha^{(1)}_{s,a}(s') = \epsilon \quad \forall (s,a,s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}, \quad (9)$$
where $\epsilon > 0$ is a free parameter. With this choice of prior, $\hat\theta^{(1)}_{s,a}$ (i.e., the prior estimate of the transition probabilities from the pair of state $s$ and action $a$) is a uniform distribution over states. Furthermore, the free parameter $\epsilon$ expresses how deterministic the transitions are from the point of view of a participant, i.e., smaller values of $\epsilon$ indicate a more deterministic interpretation of the environment.
Using a Dirichlet distribution for the belief $q^{(t)}$ and Eq 6, a participant's estimate of the transition probabilities is
$$\hat\theta^{(t)}_{s,a}(s') = \frac{\alpha^{(t)}_{s,a}(s')}{\sum_{s'' \in \mathcal{S}} \alpha^{(t)}_{s,a}(s'')}.$$
Note that the pseudo-counts $\tilde{C}^{(t)}_{s,a \to s'} = \alpha^{(t)}_{s,a}(s') - \epsilon$ can be interpreted as (surprise-modulated) counts of the observed transitions from the pair $(s,a)$ to $s'$.

Definitions of surprise. We work with the 'Bayes Factor' surprise $S_{BF}$ [14]. Consider the transition initiated at time $t$, i.e., $(S_t = s, A_t = a) \to (S_{t+1} = s')$. The Bayes Factor surprise corresponding to this transition is [14]
$$S^{(t+1)}_{BF} = \frac{\hat\theta^{(1)}_{s,a}(s')}{\hat\theta^{(t)}_{s,a}(s')}.$$
Due to the particular form of the prior $q^{(1)}$ that we chose, $\hat\theta^{(1)}_{s,a}(s') = 1/|\mathcal{S}|$ is constant, so the surprise is fully determined by $\hat\theta^{(t)}_{s,a}(s')$. In the main text, $\hat\theta^{(t)}_{s,a}(s')$ is written as $p^{(t)}(s'|s,a)$, and $\hat\theta^{(1)}_{s,a}(s')$ is written as $p_{reset}(s'|s,a)$.

We note that in the particular case of our behavioral paradigm, the Shannon surprise [20] is just the shifted logarithm of the 'Bayes Factor' surprise, i.e., $S^{(t+1)}_{Sh} = \log S^{(t+1)}_{BF} + \log |\mathcal{S}|$. Furthermore, the state prediction error (SPE) [1] is an increasing function of the 'Bayes Factor' surprise, i.e., $\text{SPE}^{(t+1)} = 1 - \hat\theta^{(t)}_{s,a}(s') = 1 - \big(|\mathcal{S}|\, S^{(t+1)}_{BF}\big)^{-1}$. Hence, the surprise-modulated learning rates in the SurNoR algorithm can alternatively be expressed in terms of $S^{(t+1)}_{Sh}$ or $\text{SPE}^{(t+1)}$.

Surprise-modulated update of the belief. Learning the world-model corresponds to updating the parameters of the Dirichlet distribution after each transition. Consider the transition $(S_t = s, A_t = a) \to (S_{t+1} = s')$ initiated at time $t$, which generates a surprise $S^{(t+1)}_{BF}$ at time $t+1$. The surprise-modulated adaptation rate is defined as [14]
$$\gamma(S, m) = \frac{mS}{1 + mS}, \quad (12)$$
where $m > 0$ is a positive free parameter that controls the sharpness of the transition. With this modulated adaptation rate, the change in a participant's belief is given by an update of the Dirichlet parameters
$$\alpha^{(t+1)}_{s,a}(s'') = (1 - \gamma_{t+1})\, \alpha^{(t)}_{s,a}(s'') + \gamma_{t+1}\, \alpha^{(1)}_{s,a}(s'') + \delta(s'', s'), \quad (13)$$
$$\alpha^{(t+1)}_{\tilde s, \tilde a}(s'') = \alpha^{(t)}_{\tilde s, \tilde a}(s'') \quad \text{for all } (\tilde s, \tilde a) \ne (s, a),$$
where $\gamma_{t+1} = \gamma(S^{(t+1)}_{BF}, m)$. The update rule becomes the same as the one in Eq 6 of the main text if we replace $\alpha^{(t)}_{s,a}(s')$ by $\tilde{C}^{(t)}_{s,a \to s'} + \epsilon$. The update rule expresses the new belief as a mix between two possibilities, represented by the current parameters $\alpha^{(t)}_{s,a}(s'')$ and the prior $\alpha^{(1)}_{s,a}(s'')$, weighted with $1 - \gamma_{t+1}$ and $\gamma_{t+1}$, respectively. In the case of a large surprise, the value of $\gamma_{t+1}$ is close to one, and as a result, the current parameters are forgotten. The update makes a step based on the currently observed transition, expressed by the Kronecker $\delta$ in the first line. The parameters of transitions from state-action pairs different from the current one (i.e., $s$ and $a$) are not changed (second line). The update rule of Eq 13 is called the Variational Surprise Minimizing Learning (VarSMiLe) rule in [14].
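To make the interplay of belief, surprise, and adaptation rate concrete, here is a minimal Python sketch of the surprise computation and the VarSMiLe update of Eq 13, assuming $\gamma(S,m) = mS/(1+mS)$ as above; the values of EPS and M are arbitrary illustrations of the free parameters $\epsilon$ and $m$.

```python
# Minimal sketch of the Bayes Factor surprise and the surprise-modulated
# belief update (Eqs 12-13). Parameter values are illustrative only.
import numpy as np

N_STATES, N_ACTIONS = 11, 4
EPS = 0.1   # epsilon: prior Dirichlet parameter (free parameter)
M = 1.0     # m: modulation parameter of the adaptation rate

# alpha[s, a, s'] are the Dirichlet parameters of the belief q^(t)(theta_{s,a})
alpha_prior = np.full((N_STATES, N_ACTIONS, N_STATES), EPS)
alpha = alpha_prior.copy()

def transition_estimate(alpha, s, a):
    """Posterior-mean estimate theta_hat_{s,a}(s') = alpha / sum(alpha)."""
    return alpha[s, a] / alpha[s, a].sum()

def update_belief(alpha, s, a, s_next):
    """One VarSMiLe-style update after observing (s, a) -> s_next."""
    theta_hat = transition_estimate(alpha, s, a)
    s_bf = (1.0 / N_STATES) / theta_hat[s_next]   # Bayes Factor surprise
    gamma = M * s_bf / (1.0 + M * s_bf)           # adaptation rate (Eq 12)
    # Mix current parameters with the prior, then count the new observation.
    alpha[s, a] = (1 - gamma) * alpha[s, a] + gamma * alpha_prior[s, a]
    alpha[s, a, s_next] += 1.0
    return s_bf

# An expected transition yields low surprise; a change-point yields high surprise.
for _ in range(10):
    update_belief(alpha, s=0, a=0, s_next=1)
print(update_belief(alpha, s=0, a=0, s_next=1))  # small S_BF
print(update_belief(alpha, s=0, a=0, s_next=2))  # large S_BF
```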
Formalizing model-based Q-values. The world-model of the participants is summarized by their beliefs $q^{(t)}(\theta)$ about the transition matrix of the environment. The belief is used for the evaluation of two sets of Q-values [21], one for novelty $N$ and the other for the external reward $R$.
Novelty $N^{(t)}(s)$ of state $s$ at time $t$ (cf. Eq 4) is useful to guide behavior during exploration. Analogous to the common framework in reinforcement learning [21], where information about a reward at state $s$ is propagated by the Bellman equation to states $s' \ne s$, we use a Bellman equation to propagate the novelty of state $s$ to other states $s' \ne s$ by using the model of the world. More specifically, for the model-based branch, we assign to each state-action pair a novelty-based value $Q^{(t)}_{MB,N}(s,a)$, which is an estimate of the accumulated future discounted novelty that can be gained by taking action $a$ in state $s$. The Bellman equation is
$$Q^{(t)}_{MB,N}(s,a) = \sum_{s' \in \mathcal{S}} \hat\theta^{(t)}_{s,a}(s') \Big[ N^{(t)}(s') + \lambda_N \max_{a' \in \mathcal{A}} Q^{(t)}_{MB,N}(s',a') \Big],$$
where $\hat\theta^{(t)}_{s,a}(s')$ are the estimated transition probabilities, and $\lambda_N \in [0,1]$ is a discount factor for novelty. The Bellman equation assigns a value to action $a$ in state $s$ as long as a novel state is likely to be reached within the next few steps, even if the immediately neighboring states are not novel. The discount rate $\lambda_N$ controls the time horizon of 'future novelty'. For $\lambda_N \to 0$, only the novelty of the immediately following state matters; for $\lambda_N \to 1$, the time horizon becomes infinitely long.
Rewards $R(s)$ of states $s \in \mathcal{S}$ guide behavior during exploitation. In the theory of reinforcement learning, reward information is summarized in values $Q^{(t)}_{MB,R}(s,a)$ that are estimates of the accumulated future discounted reward that can be collected when starting at state $s$ with action $a$. The Q-values are given by the Bellman equation
$$Q^{(t)}_{MB,R}(s,a) = \sum_{s' \in \mathcal{S}} \hat\theta^{(t)}_{s,a}(s') \Big[ R(s') + \lambda_R \max_{a' \in \mathcal{A}} Q^{(t)}_{MB,R}(s',a') \Big],$$
where $\lambda_R \in [0,1]$ is the discount factor for reward, which is not necessarily equal to the discount factor for novelty $\lambda_N$. Note that in our environment $R(s) = 0$ at all states except at the goal. Since the scale of the reward is arbitrary, we set $R(s_{Goal}) = 1$. As a result, the reward function is $R(s) = \delta(s, s_{Goal})$.
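Both Bellman equations can be solved on the current world-model by any fixed-point method. The sketch below uses plain value iteration for clarity, whereas SurNoR itself uses a modified Prioritized Sweeping (Alg. D); the function name, the uniform model, and the iteration count are our illustrative choices.

```python
# Minimal sketch: solving the two Bellman equations by value iteration on the
# estimated world-model. SurNoR uses Prioritized Sweeping (Alg. D) instead;
# this fixed-point solver only illustrates the same target quantity.
import numpy as np

def model_based_q(theta_hat, u, lam, n_iter=200):
    """Q(s,a) = sum_s' theta_hat[s,a,s'] * (u[s'] + lam * max_a' Q(s',a')).

    theta_hat: (S, A, S) estimated transition probabilities
    u:         (S,) per-state signal, either novelty N^(t)(s) or reward R(s)
    lam:       discount factor (lambda_N or lambda_R)
    """
    n_states, n_actions, _ = theta_hat.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        v = q.max(axis=1)                        # V(s') = max_a' Q(s', a')
        q = np.einsum("ijk,k->ij", theta_hat, u + lam * v)
    return q

# Usage: reward R(s) = delta(s, s_goal), with the goal as the last state.
n_states, n_actions = 11, 4
theta_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
reward = np.zeros(n_states); reward[-1] = 1.0    # R(s_goal) = 1
q_mb_r = model_based_q(theta_hat, reward, lam=0.9)
```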
The total model-based Q-value is a linear combination of the Q-values for novelty and reward,
$$Q^{(t)}_{MB}(s,a) = Q^{(t)}_{MB,R}(s,a) + \beta_N\, Q^{(t)}_{MB,N}(s,a), \quad (16)$$
where $\beta_N \ge 0$ is a free parameter controlling the trade-off between exploitation and exploration, i.e., between reward-seeking and novelty-seeking behavior. In our model, $\beta_N$ depends on whether participants are in the exploration phase or the exploitation phase. This dependency is simplified as follows: Since novelty is the main drive in the 1st episode of the 1st block, we keep $\beta_N$ fixed at a value $\beta_{N1}$ throughout this episode. However, at the end of the 1st episode of the 1st block, once participants have found the goal and do not need further exploration, we set $\beta_N = 0$ and keep it at zero for all remaining episodes of the 1st block.
Surprise increases rapidly at the first mismatch that participants face in the 1st episode of the 2nd block, when they encounter an unexpected transition. We hypothesize that the unexpected transitions make them realize that something has changed in the environment, that they no longer know a path to the goal state, and that they need to re-explore the environment and search for the goal in the absence of any external reward; hence, we assume that the large surprise signal triggers renewed exploration, and we therefore set $\beta_N = \beta_{N2}$ for the 1st episode of the 2nd block. With the same arguments as for the 1st block, we set $\beta_N$ to zero for the remaining episodes of the 2nd block. $\beta_{N1}$ and $\beta_{N2}$ are free parameters of the model.
We also tested a variant of SurNoR with an additional free parameter $\beta_N = \beta_{N,2-5}$ for the weight of $Q_{MB,N}$ in episodes 2-5 of blocks 1 and 2, but we did not find any significant improvement in the fit (difference in log-evidence = 15 ± 13).
Note that, for model comparison, we use the same assumptions for all other alternative algorithms that seek either novelty or uncertainty; see Section 'Alternative algorithms'.
At the transition from time step $t-1$ to time step $t$, several iterations take place: the algorithm first computes updated value estimates $U_{MB,N}(s,a)$ and $U_{MB,R}(s,a)$, and then updates the model-based Q-values for novelty and for reward with the modified Prioritized Sweeping algorithm (Alg. D).
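Since the full listing of Alg. D is only summarized at the end of this section, the following heavily simplified sketch illustrates just the priority-ordering idea, with the priority modeled on the $\text{Prior}(s) = |U^{(t+1)}(s) - \max_{a} Q^{(t+1)}(s,a)|$ rule quoted there. It is an illustration written for clarity, not efficiency, and not a faithful implementation of Alg. D.

```python
# Hedged sketch of priority-based sweeping: repeatedly back up the state whose
# value estimate would change the most, for at most T_PS iterations. A real
# prioritized-sweeping implementation would maintain a queue of predecessors
# instead of recomputing all candidate backups each iteration.
import numpy as np

def prioritized_sweep(theta_hat, u, lam, t_ps=50):
    """Approximate the Bellman fixed point, highest-priority state first."""
    n_states, n_actions, _ = theta_hat.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(t_ps):
        v = q.max(axis=1)
        # Candidate backed-up values for every state-action pair.
        q_new = np.einsum("ijk,k->ij", theta_hat, u + lam * v)
        prior = np.abs(q_new.max(axis=1) - v)   # priority per state
        s = int(prior.argmax())                 # highest-priority state
        if prior[s] < 1e-8:                     # converged: queue is empty
            break
        q[s] = q_new[s]                         # back up only that state
    return q
```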

SurNoR model-free branch
The pseudocode for the model-free branch is shown in Alg. C. In this subsection, the details are explained. The model-free branch maintains two sets of Q-values, $Q^{(t)}_{MF,R}(s,a)$ and $Q^{(t)}_{MF,N}(s,a)$, as values of the state-action pairs corresponding to the external reward $R$ and novelty $N$, respectively. In contrast to the model-based branch, the model-free Q-values are updated using TD-learning [21,24], for which the model of the world is not directly used; see the paragraph 'Updating model-free Q-values'.
Analogous to the total model-based Q-values, we define the total model-free Q-values as
$$Q^{(t)}_{MF}(s,a) = Q^{(t)}_{MF,R}(s,a) + \beta_N\, Q^{(t)}_{MF,N}(s,a), \quad (18)$$
where $\beta_N \ge 0$ has the same value as the one used in Eq 16.

Reward and novelty prediction errors.
A crucial signal in model-free reinforcement learning is the reward prediction error (RPE), defined as the difference between the expected 'reward' of a state-action pair and the 'reward' actually received [21]. Since we defined two separate sets of Q-values, one for the external reward and one for novelty (which plays the role of an 'internal reward'), we also define two separate corresponding prediction errors. Consider the transition $(S_t = s, A_t = a) \to (S_{t+1} = s')$. The RPE at time $t+1$ is defined as
$$\text{RPE}^{(t+1)} = R(s') + \lambda_R \max_{a' \in \mathcal{A}} Q^{(t)}_{MF,R}(s',a') - Q^{(t)}_{MF,R}(s,a),$$
and similarly, the novelty prediction error (NPE) at time $t+1$ is defined as
$$\text{NPE}^{(t+1)} = N^{(t)}(s') + \lambda_N \max_{a' \in \mathcal{A}} Q^{(t)}_{MF,N}(s',a') - Q^{(t)}_{MF,N}(s,a),$$
where $\lambda_R$ and $\lambda_N$ are the same discount factors as the ones used in the model-based branch.
Eligibility trace. To keep track of the previously chosen state-action pairs, and to include them in the update rule, we use eligibility traces [21,25,26]. To have the most general setting, we define two separate eligibility traces, one for the external reward, $e^{(t)}_R(s,a)$, and one for novelty, $e^{(t)}_N(s,a)$. We initialize the eligibility traces at zero and reset their values to zero at the beginning of each episode. After the transition $(S_t = s, A_t = a) \to (S_{t+1} = s')$, the eligibility traces are updated to
$$e^{(t+1)}_R(\tilde s, \tilde a) = \mu_R \lambda_R\, e^{(t)}_R(\tilde s, \tilde a) + \delta(\tilde s, s)\, \delta(\tilde a, a), \quad (21)$$
$$e^{(t+1)}_N(\tilde s, \tilde a) = \mu_N \lambda_N\, e^{(t)}_N(\tilde s, \tilde a) + \delta(\tilde s, s)\, \delta(\tilde a, a),$$
where $\lambda_R$ and $\lambda_N$ are the discount factors defined above, and $\mu_N \in [0,1]$ and $\mu_R \in [0,1]$ are free parameters expressing how fast eligibility traces decay in time.
Surprise modulation of the model-free learning rate. Usual TD-learning algorithms use a constant (or decreasing in time) learning rate for updating Q-values [21]. However, the model-free branch of SurNoR has a learning rate modulated by the model-based branch. This novel interaction between model-based and model-free modules has not been explored by previous hybrid models in neuroscience, e.g., [1,2]. We define the surprise-modulated model-free learning rate $\rho_{t+1}$ as
$$\rho_{t+1} = \rho_b + \delta\rho\, \gamma(S^{(t+1)}_{BF}, m),$$
where $\gamma(S^{(t+1)}_{BF}, m)$ is the adaptation rate of Eq 12, and $\rho_b$ and $\delta\rho$ are free parameters. With this learning rate, the model-free Q-values are updated as
$$Q^{(t+1)}_{MF,R}(s,a) = Q^{(t)}_{MF,R}(s,a) + \rho_{t+1}\, \text{RPE}^{(t+1)}\, e^{(t+1)}_R(s,a),$$
$$Q^{(t+1)}_{MF,N}(s,a) = Q^{(t)}_{MF,N}(s,a) + \rho_{t+1}\, \text{NPE}^{(t+1)}\, e^{(t+1)}_N(s,a),$$
for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.
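A minimal sketch of one model-free update step follows, combining the RPE, the eligibility trace, and the surprise-modulated learning rate. The additive form $\rho_{t+1} = \rho_b + \delta\rho\,\gamma(\cdot)$ mirrors the reconstruction above, and all parameter values are illustrative; the analogous update for novelty uses the NPE with $\lambda_N$, $\mu_N$, and $e_N$.

```python
# Minimal sketch of surprise-modulated TD(lambda) for the reward Q-values.
# RHO_B and DELTA_RHO stand for the free parameters rho_b and delta-rho;
# all values are illustrative, not fitted.
import numpy as np

N_STATES, N_ACTIONS = 11, 4
LAM_R, MU_R = 0.9, 0.7          # discount and eligibility decay for reward
RHO_B, DELTA_RHO, M = 0.3, 0.5, 1.0

q_mf_r = np.zeros((N_STATES, N_ACTIONS))   # Q_MF,R
e_r = np.zeros((N_STATES, N_ACTIONS))      # eligibility trace for reward

def mf_step(s, a, s_next, r, s_bf):
    """One TD(lambda) step with a surprise-modulated learning rate."""
    global q_mf_r, e_r
    # RPE: target minus current estimate.
    rpe = r + LAM_R * q_mf_r[s_next].max() - q_mf_r[s, a]
    # Eligibility-trace update (Eq 21): decay all traces, mark current pair.
    e_r *= MU_R * LAM_R
    e_r[s, a] += 1.0
    # Surprise-modulated learning rate via the adaptation rate of Eq 12.
    gamma = M * s_bf / (1.0 + M * s_bf)
    rho = RHO_B + DELTA_RHO * gamma
    # Update all state-action pairs in proportion to their eligibility.
    q_mf_r += rho * rpe * e_r

mf_step(s=0, a=0, s_next=10, r=1.0, s_bf=1.0)  # a rewarded step to the goal
```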

Hybrid policy
The policy for action selection is based on a linear combination of Q-values, similar to [1,2]. We use a softmax policy [21] and consider the probability of choosing action $a$ in state $s$ to be
$$\pi(A_t = a \,|\, S_t = s) = \frac{1}{Z(s)} \exp\Big(\beta \big[\omega\, \omega_{scale}\, Q^{(t)}_{MF}(s,a) + (1 - \omega)\, Q^{(t)}_{MB}(s,a)\big]\Big), \quad (24)$$
where $Z(s)$ is the normalization constant (ensuring that $\sum_a \pi(A_t = a | S_t = s) = 1$), $\omega_{scale} \ge 0$ is a free parameter to correct the potentially different scaling of the model-based and model-free values, and $\omega \in [0,1]$ is a free parameter to balance the relative contribution of the model-based and model-free branches to the policy. When $\omega = 1$, the policy is purely model-free (but includes the effect of surprise modulation on the TD learning rate), and when $\omega = 0$, the policy is purely model-based. Note that $\omega_{MF}$ and $\omega_{MB}$ mentioned in the main text are equal to $\omega \times \omega_{scale}$ and $1 - \omega$, respectively. The inverse temperature $\beta \ge 0$ controls the sharpness of the policy (the larger $\beta$, the more deterministic the policy).
As shown by [1], $\omega$ can vary in time. Specific to our experiment, we consider $\omega$ to be piecewise constant in time: (1) $\omega = \omega_{11}$ for the 1st episode of the 1st block, when participants are in the pure exploration phase; (2) $\omega = \omega_{12}$ for the 1st episode of the 2nd block, when the goal is lost; and (3) $\omega = \omega_0$ for the rest of the experiment (i.e., episodes 2 to 5 of both blocks), when participants are in the exploitation phase. Moreover, we allow the value of $\beta$ to be different for the 1st and the 2nd block, $\beta_1$ and $\beta_2$ respectively. By doing so, we allow the model to change its confidence in action selection after observing the sudden change in the environment.
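For illustration, a minimal sketch of the hybrid softmax policy of Eq 24; the values of beta, omega, and omega_scale below are arbitrary, not fitted parameters.

```python
# Minimal sketch of the hybrid softmax policy (Eq 24).
import numpy as np

def hybrid_policy(q_mf, q_mb, beta=5.0, omega=0.7, omega_scale=1.0):
    """pi(a|s) proportional to
    exp(beta * (omega * omega_scale * Q_MF(s,a) + (1 - omega) * Q_MB(s,a)))."""
    logits = beta * (omega * omega_scale * q_mf + (1.0 - omega) * q_mb)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()                  # the Z(s) normalization

q_mf = np.array([0.2, 0.0, 0.5, 0.1])   # total model-free Q-values at state s
q_mb = np.array([0.1, 0.3, 0.4, 0.0])   # total model-based Q-values at state s
p = hybrid_policy(q_mf, q_mb)
action = np.random.choice(4, p=p)       # sample an action from the policy
```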
Note that, for model comparison, we use the same assumptions for all other alternative algorithms that use a hybrid policy; see Section 'Alternative algorithms'.

Summary of free parameters
SurNoR has 18 free parameters. $\epsilon$ is used for the initialization of the belief in Eq 9. $m$ is used for the modulation of the adaptation rate in Eq 12. $\lambda_R$ and $\lambda_N$ are the discount factors used in the definitions and the updates of Q-values. $\beta_1$ and $\beta_2$ are the inverse temperatures controlling the sharpness of the hybrid policy in Eq 24. $\beta_{N1}$ and $\beta_{N2}$ are used for balancing novelty against external reward in Eqs 16 and 18. $T_{PS}$ is used for Prioritized Sweeping in Algorithm D. $\mu_R$ and $\mu_N$ are used for controlling the decay of eligibility traces in Eq 21. $\rho_b$ and $\delta\rho$ determine the surprise-modulated model-free learning rate, and $\omega_{scale}$, $\omega_{11}$, $\omega_{12}$, and $\omega_0$ parametrize the hybrid policy in Eq 24.


Algorithm D gives the pseudocode for the modified version of the Prioritized Sweeping algorithm for one time step at time $t+1$. The algorithm first specifies whether the update is for the internal or the external reward (putting $\lambda = \lambda_R$ for reward and $\lambda = \lambda_N$ for novelty), then sweeps over the state-action pairs $(s,a) \in \mathcal{S} \times \mathcal{A}$, updating Q-values via terms of the form $\hat\theta_{s,a}(s')\,\Delta V$, and finally updates the priority queue with $\text{Prior}(s) = |U^{(t+1)}(s) - \max_{a \in \mathcal{A}} Q^{(t+1)}(s,a)|$ for all $s \in \mathcal{S}$.

(xi) Hyb+S+OI: This algorithm uses MF+OI (but with surprise modulation of the learning rate of the model-free branch) and MB+S+OI in parallel and combines their Q-values (in the same fashion as in SurNoR) in a hybrid policy. Hyb+S+OI has overall 14 free parameters $\{\epsilon, m, \lambda_R, \beta_1, \beta_2, T_{PS}, \mu_R, Q_{R0}, \rho_b, \delta\rho, \omega_{scale}, \omega_{11}, \omega_{12}, \omega_0\}$.

Null model. (xii) RC (Random Choice): According to this algorithm, participants choose actions with a uniform distribution, i.e., each action is selected with a probability equal to $1/|\mathcal{A}| = 0.25$. We used this model as a reference to quantify the effect of our novelty-seeking exploration in the 1st episode of the 1st block. This algorithm does not have any free parameters.
Control modifications of SurNoR. (xiii) Binary Novelty: The two control algorithms mentioned in the main text are exactly the same as SurNoR except for a change in the intrinsic motivation signal that drives exploration. While in the SurNoR algorithm the continuous-valued novelty signal defined in Eq 3 and Eq 4 serves as the intrinsic reward, in the control algorithms the intrinsic reward of state $s$ at time $t$ is binary: in the 1st control algorithm, it is considered to be $-1$ if the count $C^{(t)}_s \ge C_{thr}$ and $0$ otherwise, where $C_{thr}$ is a new free parameter, i.e., the algorithm considers the states that are encountered more than $C_{thr}$ times as bad states and assigns a constant negative reward to them. Similarly, in the 2nd control algorithm, the intrinsic reward of state $s$ at time $t$ is considered to be $-1$ if state $s$ is among the $n$ most frequently encountered states and $0$ otherwise, where $n$ is a new free parameter, i.e., the algorithm considers the $n$ most frequently encountered states as bad states. Therefore, the pseudo-code of the control algorithms is the same as the pseudo-code of SurNoR in Alg. A, but with 2 modifications: in both the model-based and the model-free branch, the intrinsic reward replacing the novelty signal is set to
$$U^{(t)}(s) = \begin{cases} -1 & \text{if } C^{(t)}_s \ge C_{thr} \\ 0 & \text{otherwise} \end{cases} \quad (30)$$
for the 1st algorithm and to
$$U^{(t)}(s) = \begin{cases} -1 & \text{if } s \text{ is among the } n \text{ most frequently encountered states} \\ 0 & \text{otherwise} \end{cases} \quad (31)$$
for the 2nd algorithm.
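For completeness, here is a minimal sketch of the two binary intrinsic-reward signals of Eqs 30 and 31; C_THR and N_TOP stand for the free parameters $C_{thr}$ and $n$, with arbitrary illustrative values.

```python
# Minimal sketch of the binary intrinsic rewards of the two control
# algorithms (Eqs 30-31). Parameter values are illustrative.
import numpy as np

C_THR, N_TOP = 3, 2

def binary_novelty_threshold(counts: np.ndarray) -> np.ndarray:
    """Control 1: -1 for states encountered at least C_thr times, else 0."""
    return np.where(counts >= C_THR, -1.0, 0.0)

def binary_novelty_topn(counts: np.ndarray) -> np.ndarray:
    """Control 2: -1 for the n most frequently encountered states, else 0."""
    reward = np.zeros_like(counts, dtype=float)
    reward[np.argsort(counts)[-N_TOP:]] = -1.0
    return reward

counts = np.array([5, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0])
print(binary_novelty_threshold(counts))  # -1 only for states 0 and 1
print(binary_novelty_topn(counts))       # -1 for the 2 most visited states
```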