Predicting change: Approximate inference under explicit representation of temporal structure in changing environments

In our daily lives timing of our actions plays an essential role when we navigate the complex everyday environment. It is an open question though how the representations of the temporal structure of the world influence our behavior. Here we propose a probabilistic model with an explicit representation of state durations which may provide novel insights in how the brain predicts upcoming changes. We illustrate several properties of the behavioral model using a standard reversal learning design and compare its task performance to standard reinforcement learning models. Furthermore, using experimental data, we demonstrate how the model can be applied to identify participants’ beliefs about the latent temporal task structure. We found that roughly one quarter of participants seem to have learned the latent temporal structure and used it to anticipate changes, whereas the remaining participants’ behavior did not show signs of anticipatory responses, suggesting a lack of precise temporal expectations. We expect that the introduced behavioral model will allow, in future studies, for a systematic investigation of how participants learn the underlying temporal structure of task environments and how these representations shape behavior.


Introduction
Our ability to represent time and to generate complex actions and plans based on this representation are central to all aspects of our behavior.Knowing when to act, precisely, is clearly crucial for our survival [1,2].Thus, it is not surprising that the question of how we perceive and represent time gains an increasing interest in neurosciences [3][4][5].However, in comparison to our understanding of spatial cognition, the neural basis of time perception is still poorly understood [6].Here, we address the question of how the brain, in principle, can use knowledge about a hidden temporal structure of a task when making decisions.
Traditionally, the question how we learn the structure of the world and use these representations for decision making, has been investigated from the perspective of reinforcement learning [7,8].The focus of these investigations has been on how learning is driven by prediction errors [9], defined as a mismatch between expected and observed outcomes of one's actions.More recent studies on human behavior in dynamic environments [10][11][12][13] have demonstrated that the relative precision of one's prior beliefs and current sensory information weights prediction error signals [14][15][16][17][18][19][20][21].These findings indicate that humans update their beliefs about the structure of the world akin to a rational (Bayesian) agent [22,23].This suggests that one can approximately describe human behavior using the methodology of probabilistic inference.
The key advantage of the probabilistic inference framework over the standard reinforcement learning modelling approach is that one can embed the knowledge about the structure of the world and the uncertainty about that knowledge within a generative model that describes the known rules that shape the dynamics of the world.
A potential limitation of previous approaches-both probabilistic and reinforcement learning based-is that they do not take into account the underlying temporal structure of the task.Recently a series of studies have demonstrated the relevance of learned temporal associations on attention, perception, and sensory integration [24][25][26][27].Experimental paradigms employed in this studies utilized observable temporal structure within trial.Similarly, in [28,29] authors demonstrate that learning of the underlying temporal structure of the task is used by human participants (or animal subjects) to anticipate the moment in time at which rules are most likely to change.In contrast to the experimental design which use within trial temporal structure, here participants were exposed to a hidden temporal structure across trials.
Motivated by these findings, we propose a way to extend current probabilistic models of behavior, aimed at describing decision making in dynamic environments, to incorporate an explicit representation of the underlying temporal structure in the form of prior beliefs about state durations.In particular we will focus on the case when the state durations are not directly observable, but inferred across multiple trials using observable changes in action outcomes.The two essential components of the proposed behavioral model are (i) the update of beliefs about states and duration derived using approximate inference under hidden semi-Markov models [30,31] and (ii) the update of beliefs about actions, that is the planning process, cast as an inference problem [32][33][34].
As a test bed for illustrating the key properties of the model and for demonstrating its applicability to experimental studies we adopted a probabilistic reversal learning task [35].This task and its variants have been frequently used in human and animal studies to investigate key properties of flexible behavioral adaptation in dynamic environments [29,[36][37][38].In a typical setup participants first learn to associate a particular choice with a reward, followed by a sequence of reversals of reward contingencies.The interesting question is typically how fast participants adapt their choices to these new contingencies.In the work here, we address the question, whether participants actually use the underlying temporal task structure to predict the moment of the reversal so that behavior is adapted faster as compared to the case when the reversal is unexpected.
Using the reversal learning task, we first demonstrate using simulations that we can link sub-optimal behavior in changing environments to a misrepresentation of the underlying temporal structure of the changes.Subsequently, when fitting the behavioral model to experimental data, we relate the measured behavior of each participant to model parameters that define the beliefs about the temporal structure of reversals.Strikingly, our results suggest a heterogeneous distribution of the task representation across participants where some participants correctly inferred the latent temporal structure of the task, whereas others did not.We discuss potential reasons for this finding and how the presented approach enables systematic investigations of how participants learn the underlying temporal structure of task environments and how these representations shape behavior.

Materials and methods
In what follows we will first introduce the reversal learning task which we used here as a test bed for illustrating behavioral properties of the proposed model and its applicability to experimental data.Afterwards, we will briefly describe a classical reinforcement learning approach for modeling behavior in a reversal learning task, and later introduce the procedure for deriving the probabilistic behavioral model based on a special type of hidden semi-Markov models.Finally, we will describe in detail the analysis steps used for fitting behavioral models to data, and performing model comparison.

Ethics statement
All participants provided written informed consent and were paid on an hourly basis.The Medical Faculty of Leipzig University approved the study.

Reversal learning
A probabilistic reversal learning task is typically structured as follows.An agent is presented with two choices A and B where each choice is associated with a probability of receiving a reward or punishment.For example, initially choice A has high probability p H and choice B low probability p L of getting a reward.Importantly, after several trials the reward contingencies switch, such that choice B now corresponds to the high reward probability choice.Participants are not informed about the switch and they have to infer that a change occurred and adapt their behavior.Here we used a multiple reversals design in which reversals occur several times during the experiment.Furthermore, the reversals occur at predefined trial numbers and are independent of participant's performance.This reversal schedule was successfully used in past studies in healthy as well as patient samples and is well suited to detect inter-individual differences in behavioral adaptation [39,40].
For the model-based analysis, we used a subset of behavioral data-consisting of 22 healthy participants-from a previously published fMRI study [39].In the experimental task participants were deciding between two cards shown on a screen, each showing a different stimulus (a geometric shape, e.g.rectangle, triangle, etc.) as shown in Fig 1A .The reward probabilities associated with the two choice options were anti-correlated on all trials: whenever reward probability of choice A was high (p H = 0.8) the reward probability of choice B was low (p L = 0.2), and vice versa.Note that p H = 1 − p L on all trials.Reward contingencies change as follows: they were stable for the first 55 trials, afterwards they changed four times after 15 or 20 trials, and remained stable for the last 35 trials, see Fig 1B .In total, the experimental block consisted of 160 trials.
The location of each stimulus on the screen (right or left side) was randomized over trials.After each choice the stimulus was highlighted and depicted for 1.5 s minus the reaction time.The feedback in form of reward (won 10 Eurocents) or punishment (lost 10 Eurocents) was shown for 0.5 s.If no response occurred during the decision window, the message "too slow" was presented, and no outcome was delivered.During the inter-trial interval, a fixation cross was presented for a variable duration (jittered and exponentially distributed; range 1-12.5 s).
All 22 participants underwent a training session during which they had the opportunity to learn the statistics of the rewards associated with high and low reward probability choices.The set of stimuli used in the training phase differed from the one used during the testing phase.The participants were instructed that they could either win or lose 10 cents on each trial, and that they will be paid the total amount of money they gained during the testing phase at the end of the experiment.Each participant performed 20 training trials without any reversal of reward contingencies.Before the start of the testing phase participants were told that reward probabilities might change over the course of the experiment.No other information about reversals or the correlation of choices and outcomes was provided.Thus, the participants had no explicitly instructed knowledge about the anti-correlated task structure before the experiment.

Reinforcement learning models
Classical Rescorla-Wagner model.A classical approach to modeling the probabilistic multiple-reversals learning task is using the Rescorla-Wagner (RW) model [41,42].The reasoning of this approach is that agents tend to value those choices which on average lead to more rewarding outcomes in the past.This reasoning can be formalized as follows where c t 2 [A, B] denotes a choice at trial t, o t 2 {−1, 1} denotes the outcome of that choice (loss or gain), and V c t t is the current value associated with each choice.Importantly, the choice values V c t t are updated after each trial proportionally to the prediction error weighted by a constant learning rate α.
Dual update Rescorla-Wagner model.As the experimental design imposes anti-correlation between high and low reward probability (p H = 1 − p L ), an extended version of Eq (1) was proposed in [39,40] which incorporates fictive learning signals [43].The extension is based on the assumption that agents update, in parallel, values of both choices.The value of the executed choice is updated as before, whereas the value of the alternative (non-executed) choice is updated as if the outcome for that choice was the opposite to the observed one.We can formalize this with the following update rules where P denotes the permutation operator such that if c t = A, then Pc t ¼ B (and vice versa).In addition, κ denotes the coupling strength of the fictive prediction error.

Probabilistic model
To derive the probabilistic model of behavior we start by defining a generative model of the task that formalizes an agent's beliefs about the structure and the dynamics of the task environment.The update rules are then obtained using (approximate) Bayesian inference.
The assumption here is that participants learn to represent the latent task structure.This structure consists of hidden states which define two possible configurations of the environment: in one configuration stimulus A corresponds to a high reward choice, in the second, stimulus B corresponds to a high reward choice.Note that the notion of state used here differs from what is typically used in reinforcement learning models, in which states are cues that are associated with values over time.Here, the states are hidden and not directly observable.Furthermore, they capture a context, which defines how two stimuli (option cues) are related to actions and outcomes of those actions.
Importantly, the task environment transits from one state to another in a probabilistic manner.Here we will consider two assumptions about the dynamics of state transition probabilities: (i) the state transition probability is constant and independent of the moment of the previous change, as is the case under hidden Markov models (HMM) [44]; (ii) the state transition probability is time dependent and linked to the moment of the last change, which we will represent using hidden semi-Markov models (HSMM) [31,45].As the HMM corresponds to a special case of HSMM, in which state transition probabilities are constant, it is sufficient to define the behavioral model based on the HSMM assumption.
In what follows we will define the components of the generative model (observation likelihood and state dynamics) and derive the corresponding update rules.
Observation likelihood.The observation likelihood defines the probability of observing reward or punishment in any of the two possible states s t 2 {¬R, R} depending of the choice c t 2 {A, B} that an agent makes at a given trial t.Note that we have used ¬R to denote an initial non-reversal state (e.g., stimulus A is linked to a high reward probability), and R for a reversal state with switched reward contingencies.We define the observation likelihood as where the P again corresponds to the permutation operator (e.g.PA ¼ B).Note that the choice dependent reward probabilities (ρ A and ρ B ) are treated as random variables.Hence they are initially unknown to the agent and learned during the course of the experiment.

State dynamics
To formalize the presence of sequential reversals, that is, transitions from one task configuration to another, we define the state transition probability as where δ denotes probability of transiting between distinct hidden states (e.g. from s t−1 = ¬R to s t = R.This representation of the state transition process corresponds to a standard HMM previously used in reversal learning tasks [29,38,[46][47][48].Note that under this formulation of the state transition matrix the agent implicitly assumes that the interval (the elapsed number of trials) between two reversals follows a geometric distribution.Hence, the probability that any between reversal interval is of length d is given as pðdÞ ¼ ð1 À dÞ dÀ 1 d; where d 2 f1; 2; 3; . ..g ð5Þ with an expected dwelling time in each task configuration (mean interval length) set to m ¼ 1 d .To derive agents with arbitrary beliefs about between reversal intervals we will use a special type of hidden semi-Markov models, the so-called explicit duration hidden Markov models (ED-HMM) [49,50].This will allow us to identify individual differences in the beliefs about the temporal task structure and to investigate what impact these beliefs might have on an agent's performance.

State durations
The ED-HMM embeds the generative model with the representation of state durations, that is, the state dwelling time d t (the time spent in each state, ¬R or R).In other words, an agent will be able to form expectations about the number of trials between consecutive reversals.
Under the ED-HMM we define the state transition matrix as follows where I 2 denotes the 2 × 2 identity matrix and J 2 denotes the 2 × 2 all-ones matrix.The above relations describe a simple deterministic process for which the current state s t remains unchanged as long as d t−1 > 1 and switches to alternative state (e.g. if s t−1 = A then s t = B) with probability one when d t−1 = 1.
The transitions between state durations follow a deterministic countdown (d t = d t−1 − 1) as long as d t−1 > 1; once d t−1 = 1 the subsequent value d t is sampled from the prior over state durations p 0 (d t ), that is, from the beliefs over between reversal interval.We can write this formally in the form of the transition probability as In Fig 2 we illustrate conditional dependencies between states, durations, reward probabilities, and outcomes in the form of a a graphical model.Although, the semi-Markov formalisms allows for defining state dependent prior beliefs p 0 (d t |s t ), here, to reduce model complexity, we have assumed that the priors are independent of the current state, thus we set p 0 (d t |s t ) = p 0 (d t ) for any s t 2 {¬R, R}.
In practice, prior beliefs about state durations p 0 (d t ) can have an arbitrary form, however for the purpose of inferring the participants' representation of between reversal intervals we will use a parametric distribution, specifically the negative binomial NB distribution defined as where δ 2 [0, 1] and r > 0.
The NB distribution has several convenient properties.First, in the case r = 1 we obtain the geometric distribution (see Eq (5)), hence we recover the HMM model described above.Hence, the parameters of the negative binomial distribution will allow us to quantify an agent's specific belief in the regularity of reversals.The lower the variance σ, the more an agent believes that reversals occur at regular intervals.
Importantly, the variance of prior beliefs p 0 (d) has direct effects on the temporal expectation profile (that a reversal will occur at some future trial τ).In Fig 4 we show the dependence of the expected reversal probability δ τ on the variance σ, and under fixed mean μ = 20 of prior beliefs about the between-reversals interval.The expected reversal probability δ τ is defined as where τ > 1, and p(s = ¬R) = 1.Hence, δ τ corresponds to the transition probability (from nonreversal state ¬R into reversal state R) at some future trial τ starting from the non-reversal state s 1 = ¬R.Note that for prior beliefs p 0 (d) with large variance (σ = μ(μ − 1)), we get a constant transition probability, which corresponds to the HMM formulation of the state transition probability.In contrast, for prior beliefs with low variance (σ = μ) one obtains a trial dependent transition probability with values alternating between the low and high probabilities in a periodic manner.This temporal dependence of the transition probability will affect the inference process.The agent will become insensitive to subsequent reversals occurring few trials after the initial reversal, and highly sensitive to reversals occurring twenty to thirty trials after the initial reversal.

Inference
For model inversion we use a variational inference scheme [51][52][53] which allows us to derive the update rules for the inference process akin to the ones presented in Eq (2).After making a choice c t and observing outcome o t at trial t, the agent updates its beliefs over reward probabilities r, states s t , and durations d t following Bayes' rule where we used the following notation pðxÞ ¼ pðxjo tÀ 1:1 ; c tÀ 1:1 Þ, and � pðxÞ ¼ pðxjo t:1 ; c t:1 Þ-for x 2 fr; s t ; d t g-to denote prior and posterior beliefs, respectively, at time step t.
If we express prior beliefs as where B(x; a, b) denotes a beta distribution, we can define the approximate posterior in similar form where we use the following parametrization of the factors of the approximate posterior The prior expectation at the next trial t + 1 depends on the current posterior q(s t , d t ) via the sum product rule The update of the agent's beliefs about reward contingencies for each choice ρ, current state s t , and the number of trials until the next reversal d t is completely defined by the update rules of the sufficient statistics of the posterior, namely parameters θ t , a c t t , and b c t t .We will obtain the update rules for these parameters using variational inference.The parameters of the approximate posterior can be found as minimizers of the variational free energy defined as The posterior beliefs q that minimize the free energy F can be obtained using the following relations qðrÞ / pðrÞe hln pðo t jr;s t ;c t Þi qðs t Þ ; qðs t ; d t Þ / pðs t ; d t Þe hln pðo t jr t ;s t ;c t Þi qðr t Þ : ð16Þ To obtain update rules similar to the delta learning rules of the RW model we simplified the above iterative procedure needed to estimate the posterior parameters.To break the cyclic update we will assume that one can first update beliefs about q(s t , d t ) by setting hln pðo t jr; s t ; c t Þi qðrÞ � ln pðo t js t ; c t Þ and then use the obtained posterior beliefs about states and durations to estimate the reward probabilities qðrÞ.
Using this simplification we obtain the following update rules Note that the conditional posterior q(d t |s t ) corresponds to the conditional prior pðd t js t Þ, as � p d , hence it remains constant during the update of beliefs.However, the prior expectations (at the next time step) about state duration will be linked to the inference process via the state-duration transition matrix (see Eqs ( 6) and ( 7)).Subsequently, we update the beliefs about the choice reward contingencies as 2 , we get the following set of update equations where , and k t ¼ 1À y t y t .Although in form similar to the DU-RW learning rules, a notable difference is that the fictive prediction error is of the same sign as the actual prediction error.The reason for this is that under the generative model the fictive learning signals are a product of the agent's uncertainty about the current state of the world, that is, the current configuration of reward contingencies and not the uncertainty about the anti-correlation of reward contingencies on different choices.This alternative representation could be introduced into the generative model by introducing beliefs about a dependence between high and low reward probabilities.However, we will leave this extension for future work as it increases model complexity and does not contribute to our analysis of how temporal representations influence behavior.

Responses
Here we will describe the response model, which defines the mapping from the agent's beliefs to responses, based on the active inference framework [54].We will demonstrate how a response likelihood that is often used in reinforcement learning models (in the form of a softmax transform) can be derived within this framework.We do this for didactic purposes to show how one can formally relate active inference to a well-known reinforcement learning account.
The core concept of active inference is that agents generate behavior that is most likely to minimize the expected free energy, that is, that tends to maximise, at the same time, the extrinsic and the epistemic value of agents' choices [55].The expected free energy can be defined as ) denotes the utility of future outcomes (reward or punishment), and D KL denotes the Kullback-Leibler divergence between the posterior (conditioned on possible future outcome and action) and prior expectations at the time step t.Note that we have assumed that the behavior is characterized by planning only a single time step into the future.
Importantly, the response model maintains the causal structure of the task, where at trial t an agent first makes a choice c t and only afterward observes an outcome o t .
The first term on the right hand side of Eq ( 21) is typically denoted as the extrinsic value of an action (policy) whereas the second term is denoted as epistemic value or information gain.If we express the utility U as where o t 2 {−1, 1}, and λ > 0, the extrinsic value becomes In practice, for sufficiently large λ the behavior will be fully driven by the extrinsic value, as each independent observation o t carries little information about the underlying hidden states (a sequence of observations is required to identify the precise past moment of a reversal).In other words, each individual choice leads to a small information gain.Thus, we will assume that the expected free energy can be approximated as The optimal action (choice) corresponds to the one that minimizes the expected free energy, thus Still, for describing participants' behavior we have to assume that the action selection process is corrupted by external sources of noise; e.g.mental processes irrelevant for the task at hand.Therefore, we will soften the requirement of minimizing the expected free energy using the softmax transform, which is typically used to define the response likelihood [56,57].The choice probability then becomes where β denotes the response precision and p 0 (c) the response bias.These two parameters are the free parameters of the response model and allow us to capture participants' deviation from optimal behavior when fitting models to the data.Finally, in the case of the DU-RW model, we will use the same form of the response likelihood as above, with the difference that the choice value

Model fitting and model comparison
When fitting models to behavior we have focused our analysis on either the DU-RW or the ED-HMM model, where we have used in both cases a hierarchical prior over free model parameters (see below for details).We will not consider the SU-RW and HMM models in model comparison as they represent special cases of the DU-RW and the ED-HMM model, respectively, which are obtained in the limit of κ !0, in the case of the DU-RW model, and r ! 1 in the case of the ED-HMM model.Hence, we can easily recover these special cases from the estimates of the posterior distributions of the free model parameters, which are summarized in Table 1.
Note that the parameters n A 0 and n B 0 of the ED-HMM model (see Eq (19)), which define the initial number of observations, have been fixed to values which reflect that participants underwent a 20 trials long training session.Hence, setting n A 0 ¼ n B 0 ¼ 10 reflects our assumption that participants selected both options in equal amounts during the training session.
To estimate the posterior distribution over free model parameters we have used the above defined response model as observation likelihood of the behavioral model.Hence, the likelihood of behavior (the sequence of a participant's responses) is defined as denotes the sequence of observations (wins or losses) that the nth participant made, Λ denotes the set of free model parameters, and m 2 {1, 2} denotes the corresponding behavioral model.
To define the prior distribution over model parameters we have used the so-called horseshoe prior as a weakly informative hierarchical prior.If we denote the ith model parameter (where i 2 {1, . .., 6}) of the nth participant as l n i , the horseshoe prior is defined as where C + (x; 0, γ) denotes a half-Cauchy distribution with scale parameter γ.Hence, the full hierarchical prior for the model m can be expressed as Note that τ i acts as a hyper-prior, which plays the role of regulating the prior scale, for the corresponding free parameter, over the whole group.As l n i is a positive definite variable, we have used specific transforms to relate each l n i parameter to the corresponding model parameter (see Table 1).
Table 1.Free parameters, their transform, and interpretation for the two behavioral models.The model parameters are grouped based on whether they shape the inference (learning) or the behavioral responses.Note that the DU-RW and ED-HMM share the same mapping from beliefs into responses (see Eq (22)).T:1 g.However, the exact posterior distribution of the model parameters and their hyper-priors is analytically intractable, hence we have applied the variational mean-field approximation, in which one assumes that the posterior is fully factorized

Model
The full hierarchical model was implemented in the probabilistic programing library PyMC3 [58].The PyMC3 library provides an interface to multiple state-of-the-art inference schemes.In our specific case, for estimating the approximate posterior distribution over model parameters, we have used the PyMC3 implementation of the automatic differentiation variational inference (ADVI) [59].ADVI is a stochastic black-box variational inference scheme which returns an approximate estimate of the posterior distribution in a fully factorized form.
To attribute participants' behavior to a specific behavioral model we have assumed that models can be treated as random effects (variables), that is, the attribution of a model to participants' behavior can differ across participants [60].Furthremore, we have based Bayesian model comparison on the posterior predictive model evidence [61] instead of the full model evidence, that is marginal likelihood.The motivation for basing model comparison on the posterior predictive model evidence comes from the fact that the posterior predictive model evidence is less sensitive to a prior specification of free model parameters: the approximate estimate of the predictive evidence is based on the samples from the posterior distribution which is already constrained by the subset of the behavioral data.This procedure provides for a more robust model comparison and testing results as it resolves some issues arising when using weakly-informative or misspecified prior probabilities [61].The posterior predictive model evidence is defined as a marginal expectation of the subset of behavioral responses C T:k condition on the behavioral responses up to the kth trial (C k:1 ) and the full set of outcomes presented to participants (O).Formally, we can define the posterior predictive model evidence as where qðl n jmÞ ¼ Q i qðl n i jmÞ, and N = 10 4 .Note that to estimate the posterior predictive model evidence we have first estimated the approximate posterior over free model parameters on a reduced data set consisting of the pre-reversal and reversal phases (k = 125).Then, we generated N samples from the posterior distribution to estimate the posterior predictive model evidence on the remaining part of behavioral data C T:k which covers only the post-reversal phase (see Fig 1).
We then used the posterior predictive model evidence as the likelihood of the random effect model proposed in [60].The goal here is to identify the posterior probability that each of the two behavioral models can generate the same sequence of behavioral response as participants.
Here we have defined the random effect model as the following mixture model ; Similar to the role of the hyperpriors on the parameters of the behavioral model, γ plays here the role of a regularization parameter that allows for data driven constraints of the random effects assumption.If the marginal posterior probability over γ shrinks towards zero, this implies that data supports a null hypothesis which states that all models are present in the population with the same frequency and any differences we observe are purely chance driven [60].
As above, the random effects model selection was implemented in PyMC3, where the estimation of the posterior was performed using the PyMC3 implementation of the ADVI procedure.A fully factorized approximate posterior pðp; gjC; OÞ � qðgÞqðpÞ ð31Þ provides us with an estimate of the posterior model probability q(π).The posterior model probability can then be used to identify group level posterior estimate over possible model frequencies in the group of participants and the participant specific posterior model probability.

Simulating behavior
To illustrate how the agent's performance depends on the agent's prior beliefs over the between reversal interval p 0 (d) we will start by comparing two special cases of the ED-HMM based behavioral model using synthetic data.In the first case, we will fix the variance σ of the prior beliefs p 0 (d) to σ = μ(1 − μ), which we named irregular reversal interval (IRI) agent.In the second case, we will fix the variance to σ = μ, which we named regular reversal interval (RRI) agent.In addition, we will compare the performance of the two probabilistic behavioral models to the single (SU) and dual update (DU) Rescorla-Wagner (RW) models introduced in Eqs ( 1) and ( 2), respectively.To illustrate the behavioral difference of different computational models we have modified the experimental setup and introduced two conditions in which reversals occur either at regular or irregular intervals, as shown in Fig 5 .These two conditions allow us to make clear distinction between different behavioral models with respect to their behavior in the presence or absence of regularities.
In all simulated experiments we fixed the number of trials to T = 160 and the number of experimental blocks to n = 1000.Hence, each simulated agent repeated the experiment n times, where at the beginning of each experiment the free model parameters (besides mean and variance) of the ED-HMM based agents (IRI and RRI) were set to the following values In other words, simulated agents have a loose initial knowledge of the underlying reward probabilities, but do not know the initial configuration of the task (whether the environment is initially in the reversal or the no-reversal state).Fig 6 shows the dependence of the agent's performance on its prior beliefs about the between reversal interval in two different environments.We have defined the performance as the fraction of correct choices (the choice associated with the higher reward probability).The irregular environment corresponds to the interval duration drawn from the geometric distribution with mean μ = 20 and variance σ = μ(μ − 1), and the semi-regular environment corresponds to interval durations drawn from the negative binomial distribution with mean μ = 20 and variance σ = μ, as illustrated previously in Fig 5.The IRI agent exhibits stable performance levels independent on the environment or a belief misspecification (incorrect mean interval duration, or incorrect variability around the mean).In contrast, the performance of the RRI model strongly depends on the correctness of the prior beliefs.These results make the rather intuitive point that if one is unfamiliar with the temporal structure of the environment, assigning high uncertainty to the expected state duration (as is the case for the IRI agent) ensures reasonably high levels of performance.However, if the environment changes at regular intervals, it is worthwhile to build an accurate representation of the temporal structure, as the performance levels can drastically increase (in this example from 81% to 87% of median performance levels).
Fig 7 shows the comparison of the performance distribution between four models; the probabilistic agents IRI and RRI, and the two reinforcement learning based agents with single update (SU-RW) and dual update (DU-RW) learning rules.The free model parameters (α and κ, see Table 1) of the SU-RW and DU-RW agents were fixed to those values that maximize the average performance levels in each environment.Importantly, when comparing the performance distribution of the DU-RW agent to the performance distribution of the IRI agent we find similar average performance per trial which is stable across environments.This finding suggests that in spite of subtle differences in learning rules (see Eqs ( 2) and ( 20)) the two agents generate very similar behavior.
To investigate the accuracy of the procedure for model comparison which we described in Model fitting and model comparison, we have simulated behavior of the ED-HMM based and the reinforcement learning based agents on the experimental task, and applied the same model inversion procedure which we used for the analysis of behavioral data below.In Fig 8A we show the average performance per trial for each agent type estimated over n = 1000 repetitions of the experimental task.Fig 8B shows the probability of assigning behavior of each type of agents to the correct model type.The confusion matrix was estimated over n = 100 simulated experimental blocks, where in half of the experimental blocks ED-HMM based agents (IRI and RRI) were used to generate behavioral responses and in the other half of the experimental blocks the reinforcement learning agents (SU-RW and DU-RW) were used to generate behavioral responses.The rather high mixing probability of the confusion matrix suggests that the model comparison will have difficulties distinguishing properly between the models.Nevertheless, adjusting the experimental design to contain larger number of trials, that is, more data points for estimating posterior and model evidence, is likely to make model comparison more accurate.In contrast, the SU-RW agent shows only medium level performance in both environments with a significant increase of performance with irregular reversals.As in Fig 6, the RRI agent performs best, among all agents, when exposed to semi-regular reversals, but has the worst performance, among all agents, when exposed to irregular reversals.https://doi.org/10.1371/journal.pcbi.1006707.g007

Data analysis
Here we will present model comparisons and model fitting based on the behavioral data of 22 participants.As the SU-RW model was shown in previous studies to provide a worse account for behavior than the DU-RW model [40], we do not expect the SU-RW model to explain the behavioral data better than the other models.Moreover, as our simulations indicate that the HMM and DU-RW models provide for comparable response patterns (see Fig 7), we will use only the DU-RW model as a reference model, as this model was used in a previous study based on the same data [39].
To circumvent a potential sensitivity of the model evidence to the specification of the prior distribution for the free model parameters, we have based our analysis on the posterior predictive model evidence (see Model fitting and model comparison for details).As the posterior predictive model evidence is estimated from the posterior distribution, rather than the prior (see Eq (28)), it is more robust to mis-specification of the prior, given a sufficient amount of data in the predictive sample.Hence, to estimate the posterior predictive evidence we have split the behavioral data for each participant into two sets.The first set, containing the initial 125 trials, that is the pre-reversal phase and the reversal phase of the experiment (see Fig 1), we used for estimating posterior distributions of free model parameters (see Table 1).Note that as this set contains all but the last reversal, it should provide sufficient information to constrain the posterior model parameters.The reversal phase of the experiment is the only period which participants can use to shape their beliefs about the time structure of reversals.
The second set, containing the last 35 trials, that is, the post-reversal phase (see Fig 1), we used for Bayesian model comparison and model validation.Critically, as no reversals are present in the post-reversal phase, this phase is especially suited for model selection: If participants believe that a reversal will occur at specific moments during the post-reversal phase this belief should be reflected in their behavior; e.g., they might change their choices in anticipation of a reversal.We next asked the question whether there is some systematic behavioral difference between these six and the remaining eighteen participants.Thus, we compared the averaged responses of participants belonging to the ED-HMM group and the participants belonging to the DU-RW group.In Fig 10 we show the choice probability averaged for each of the two groups.Tellingly, the ED-HMM group (blue line) exhibits a sudden switch toward the alternative choice approximately 20 trials after the start of post-reversal phase.Note that no reversals were induced in this phase.This result suggests that participants belonging to the group for which the ED-HMM model has higher posterior evidence behaved as if they were expecting another reversal 20 trials after the last one.This indicates that they have used the reversal phase to infer the reversal frequency and regularity.
If the ED-HMM model indeed captures this aspect of behavior better than the DU-RW model, we would expect to see a similar trend in the response probabilities estimated from the two behavioral models.For this we estimated the response probability for each participant under each of the two models and averaged responses according to group.In Fig 11 we show the between model comparison of the estimated response probabilities for the two groups of participants.Although the mean response of the DU-RW model also shows a trough towards the alternative choice (when conditioned on the participants in the ED-HMM group), we see a much wider excursion in the mean response probability obtained using the ED-HMM, explaining its higher predictive model evidence for this group of participants.Still, it is important to note that the presence of the trough in the mean response probability of both models suggest that the change of response probability towards the alternative choice was driven by the specific sequence of outcomes that participants observed during this time window of the post-reversal phase.In other words, it seems that the participants that were assigned to the ED-HMM group were sensitive to a short sequence of negative outcomes, as if they were expecting another reversal during the post-reversal phase.
In Fig 12 we show the posterior estimates of the expected reversal probability δ τ for all participants of the ED-HMM group.The expected reversal probability was estimated from State durations and corresponds to the measure illustrated in Fig 4 .Note that for four out of the six participants we find the peak of the reversal probability to fall before τ = 20, that is, before the 20th trial within the post-reversal phase.For the other two we see a rather flat trajectory which suggests that these subjects behave close to the IRI agents.Although these results seem to be in contrast to our previous findings (that these six participants anticipated the change on the 20th trial of the post reversal phase), the expected reversal probability does not correspond to the expected response probability for a specific participant.The relation between these two quantities is non-linear and is also shaped by the sequence of outcomes.8) and ( 9)).Each light blue line corresponds to the trajectory obtained as a single sample from the posterior estimates over δ and r, whereas black lines correspond to the trajectory obtained from the mean posterior parameter estimates μ r and μ δ .https://doi.org/10.1371/journal.pcbi.1006707.g012

Discussion
The ability to represent complex temporal structures and to use these representations in everyday decision-making is one of the key features of human adaptive behavior [62].However, the computational mechanisms that underlie these aspect of behavior are still poorly understood.
Here we proposed a novel behavioral model which utilizes an explicit representation of state durations for making decisions.This model allowed us to investigate how beliefs about the temporal structure of latent states in the environment influence the decision-making process.
As a demonstration of the applicability of the proposed model to behavioral data we have used a reversal learning task, in which reversals followed a pre-specified temporal pattern.We have illustrated several properties of the novel model, using both synthetic data for model validation and experimental data for a proof-of-concept demonstration.
Fitting the model to behavioral data allowed us to infer individual beliefs in temporal regularities and identify participants who rely on the latent reversal schedule used in the experimental task.Interestingly, we found that slightly more than one fourth of the participants showed evidence of using temporal expectations when making choices during the post-reversal phase of the experiment.Although the majority of participants behaved as if they did not use any temporal regularity in the sequence of the reversals, this result is not too surprising because participants were only exposed to five reversals during the whole experiment.We expect that prolonged exposure to reversals would allow a larger number of participants to correctly learn the latent task structure, as it was recently demonstrated in [28].
We anticipate that the proposed model can bring novel insights into our understanding of the interplay between the subjective representation of temporal task structure and decision making.The relevance of accurately predicting the moments of change in our everyday lives is reflected by the fact that humans exhibit a strong bias towards expecting timed changes in various tasks.This behavior can even persist when exposed to a series of unpredictable events, e.g., during gambling [63] or trading on financial markets [64].Hence, it may not be too surprising that a firm prior belief in structured and predictable changes influences performance and may lead to a reduced performance in cases when these beliefs are incorrect (see Figs 6 and 7).Such over-confidence in temporally structured dynamics might also explain why human behavior is found to deviate from the fully-rational Bayesian model on similar reversal learning tasks [48].

Related work
The key component of the proposed behavioral model is its conceptualization as an explicit duration hidden Markov models ED-HMM [30], which involves an explicit representation of the between reversal intervals as the hidden structural variable.This representation results in an anticipation of specific moments of reversals.Such anticipation would be clearly advantageous for an agent, as it enables faster behavioral adaptation in cases when reversals actually do occur in a (semi)regular manner.
The ED-HMM belong to more general group of hidden semi-Markov models which are often applied to the analysis of non-stationary time series [65][66][67][68].In the context of decision making the semi-Markov formalism allows for temporal structuring of behavioral policies [69].Importantly, semi-Markov dynamics was also applied to temporal difference learning to account for dopamine activity in cases when the timing between action and reward is varied between trials [70].
The proposed model builds upon recent approaches to model behavior in changing environments [11,71,72] and can be seen as a direct extension of hidden Markov models (HMM) which were applied to reversal learning tasks before [29,38,[46][47][48].In previous works, the HMM were used to identify the moment of reversal, changes in the beliefs about reversal probability, and the most likely moment in which agents reversed their behavior.This was crucial for understanding the effects of dopamine modulation on the underlying inference and consequently behavior.Although it is out of the scope of the present paper, it is possible to perform backward inference with hidden semi-Markov models (HSMM), hence identifying the most likely moments of reversal in the past.Furthermore, we will explore in the future possible learning rules for parameters of the prior beliefs p 0 (d), similar to the work of [73].Such extension would make the models also suitable for addressing questions related to changes in prior beliefs of state durations.
In recent years, reinforcement learning models have found multiple applications in studies relying on a reversal learning task.For example, the classical Rescorla-Wagner model [74], the dual update extension of the Rescorla-Wagner model (as described in the present paper) [39,40], or models separating the prediction error signals on positive and negative prediction errors [75].As we have shown here a reinforcement learning model generates behavior very similar to a probabilistic counterpart in relatively simple settings of the reversal learning task.Hence, we would expect that additional extensions of the considered dual update RW model could make the behavior of the reinforcement learning even more similar to the probabilistic model introduced here.
Still, we can point out several advantages of probabilistic models of behavior over reinforcement learning models, in the context of decision making under uncertainty and in dynamic environments.The probabilistic modeling approach allows for a principle way of mapping complex knowledge about spatio-temporal task structure into a relatively simple set of learning rules (as demonstrated here).In turn this provides clear functional interpretation of various prediction error signals, and corresponding adaptive learning rates, which are typically difficult to derive or motivate within the context of classical reinforcement learning.Specifically, we would say that prescribing to a probabilistic modeling approach is crucial for understanding interaction between representation of temporal structure and decision making.

Interval timing
The sense of time and time representation at a neuronal level has been the focus of numerous studies in the past [76][77][78][79][80][81][82].Studies investigated how neuronal circuitry might implement a robust internal clock [83], and how such an internal clock can be utilized to estimate time elapsed between consequent events (e.g., between two tones).Although not described here, behavior in such and similar tasks can be modelled using the formalism of the hidden semi-Markov models.Estimating elapsed time between events (here, reversals) can be obtained using a backward inference process which provides an estimate on the most likely moment of the previous reversal.Nevertheless, it is an open question if the neural circuitry that explains behavior related to interval timing, which involves sub-second time scales, can also be useful to understand behavior which requires a temporal reasoning over much longer time scale.This is of specific interest for understanding how the brain makes accurate predictions about the expected moment of reward, and how this information is used to construct timed prediction error signals [84,85].

Variants of the reversal learning task
The reversal learning task has been shown to be especially useful for behavioral phenotyping, e.g. in areas of executive control [86][87][88][89][90]; for quantifying individual proneness towards impulsive and compulsiveness [91]; and defining individual vulnerability towards addiction [92] and other psychiatric disorders [35,93,94].The experimental paradigms associated with a reversal learning task can differ substantially from the design presented here (see Fig 1).For example, the probability of reward and punishment for different choices could be correlated to some degree [47,95], or the reversal schedule can be made dependent on the number of correct choices [96][97][98].It is relatively straightforward to adapt the proposed model to any of these variants.For example, when the reversal schedule depends on the number of previously correct choices, one can make the duration transitions dependent on the belief that the previous choice was correct.This would relate state durations to the beliefs about the number of correct choices since the last reversal.Thus, we believe that in a combination with the proposed model, the reversal learning task opens a wide range of opportunities for linking anomalous beliefs in the time domain to different cognitive disorders.

Relevance for understanding psychiatric disorders
Distortions in the temporal organization of cognition and behavior have been implicated, for a long time, in a wide range of psychiatric conditions, perhaps most prominently in attentiondeficit hyperactivity disorder, autism and schizophrenia [99].Hence, the proposed model might help to integrate findings of patients' aberrant behavior in simple time estimation tasks (e.g., a deficit to reproduce durations [100][101][102] with their known deficits in the decision-making domain [103], such as, in variants of the reversal learning task.
Over the recent years, the question of how people apply their knowledge about the underlying structure of the environment in their decision-making has been widely investigated and discussed.Studies have come to the rather unspecific notion that a reduction in model-based control based on using environmental structure is ubiquitous across different psychiatric symptoms and diagnoses [104,105].We believe that the model at hand could be a starting point for developing a set of tasks particularly appropriate for refining this notion.
For example, Bayesian theoretical accounts have conceptualized addictive behavior as decision-making based on an aberrant model of the world [106].Both animal and human studies suggest that addicted patients might have a specific deficit to infer or simulate statistical regularities in the environment [40,107,108].Such deficits suggest limitations in their ability to infer regularities, which consequently is observed as suboptimal decision-making.
Reversal learning deficits and impairments in model-based control have also been shown in patients suffering from obsessive compulsive disorder (OCD) [109,110].A subgroup of OCD patients suffers from an obsession for symmetry and organization.Similarly, obsessive compulsive personality disorder (OCPD) patients show an exaggerated focus on order and symmetry as well as rule-bound traits.Given this clinical picture, it is interesting to speculate that beliefs about regularities in the environment might differ between OCD/OCPD patients and controls.A reduced ability to infer and represent irregular changes in the environment might lead to suboptimal decision-making and distress in OCD/OCPD patients in the presence of irregular changes.
Finally, recent studies have demonstrated that schizophrenia patients may be characterized by an "over-dominance" of an internal model [111] (e.g., in the framework presented here, overrating regularities in an irregular environment).When such an internal model is detached from the external input from the world (i.e., the true regularity structure), this mismatch might be involved in the development of psychotic states (in paranoia, delusions, hallucinations).

Conclusion
In summary, we have presented a novel probabilistic model for understanding human decision making behavior in changing environments.The proposed model is based on the explicit duration hidden Markov model, which forms a special case of more general hidden semi-Markov models.The presented results, obtained from both simulated behavior and the modelbased analysis of behavioral data, suggest concrete applications of the proposed model in understanding human behavior in changing environments.Most notably, it might be possible to link the quality of the underlying representation of temporal task structure to behavior, thereby opening new directions for cognitive phenotyping.

Fig 1 .
Fig 1. Reversal learning task.A: Exemplary trial sequence of experimental reversal learning task.Participants were instructed to choose the card that they thought would lead to a monetary reward.After they chose one of the two cards, the corresponding card was highlighted and feedback was displayed.The feedback consisted of either a 10 Eurocents coin for a win outcome, or a crossed 10 Eurocents for a loss outcome.B: The time series of the underlying reward probability of one of the two stimuli.The reward probability of the more rewarding stimuli at any time step was set to 0.8 and a punishment probability to 0.2 (and vice versa for the other stimulus).Reward contingencies remained stable for the first 55 trials (pre-reversal phase) and for the last 35 trials (post-reversal phase).In between, reward contingencies changed four times (reversal phase).https://doi.org/10.1371/journal.pcbi.1006707.g001

Fig 2 .
Fig 2. Graphical representation of the generative model and model summary.Left: We illustrate here the conditional dependence of state durations, states, reward probabilities, and outcomes.Note that the choices c t are treated as observables within the generative model.Right: summary of the hidden state variables, observations, and the notation used for defining prior expectations and posterior beliefs over hidden states.https://doi.org/10.1371/journal.pcbi.1006707.g002

Fig 3 .
Fig 3. Specific cases of the negative binomial distribution.We illustrate here the dependence of the shape (and the mode) of the negative binomial distribution for different values of the variance σ.Note that when the variance σ equals its mean μ (blue), the distribution peaks around its expected value.As the variance increases (green) the mode shifts toward zero.The limiting case of the geometric distribution (red) corresponds to r = 1, that is, σ = μ(μ − 1).For all three cases we fixed the mean interval length at μ = 20.https://doi.org/10.1371/journal.pcbi.1006707.g003

Fig 4 .
Fig 4. Expected transition probability at future trial τ.Estimate of the transition probability δ τ (see Eq (9)) at a future trial τ conditioned upon a reversal at τ = 1 and known initial state set at s 1 = ¬R.Each curve corresponds to estimates of the transition probability obtained from prior beliefs p 0 (d) shown in Fig 3. Note that for the low variance case (blue, σ = μ), the temporal profile of transition probability has clearly defined periods of low and high transition probability.As the variance increases (green) the variability of the temporal profile decreases, until it finally becomes constant in the case of the geometrically distributed prior beliefs (red, σ = μ(μ − 1)).For all three cases we fixed the mean interval length at μ = 20.https://doi.org/10.1371/journal.pcbi.1006707.g004

Fig 5 .
Fig 5. Example of a reversal schedule for simulated data.In the condition with irregular reversals, the between reversal intervals were sampled from a geometric distribution (top right plot), whereas in the condition with (semi)regular reversals, the intervals were sampled from the negative binomial distribution with variance σ = 20 (bottom right plot).In both cases the mean interval duration is set to μ = 20.https://doi.org/10.1371/journal.pcbi.1006707.g005

Fig 6 .
Fig 6.Performance as a function of the expected interval duration for ED-HMM based agents.The shaded regions show the performance quartiles estimated over n = 1000 simulated experiments, in two different task environments.Left: In the first environment, reversals occur at irregular intervals.The irregular reversal interval (IRI) agent (red) corresponds to the case where prior beliefs over interval duration follow a geometric distribution (see Eq (5)), with mean μ and variance σ = μ(μ − 1).This agent expects reversals at irregular intervals.The regular reversal agent (RRI) (blue) corresponds to the prior beliefs in the form of the negative binomial distribution, with mean μ and variance σ = μ.This agent expects reversals at (semi)regular intervals.The true distribution of between reversal intervals in both conditions is depicted in Fig 5. Right: Same as left plot but the environment has semi-regular intervals encoded by μ = 20 and σ = 20.As expected, we find that the RRI agent is highly sensitive to the misspecification of the expected interval duration μ and variability of interval durations in different conditions, while the IRI agent shows a much better performance in the irregular environment.The lower row shows the average probability of making a correct choice relative to the point of reversal, denoted by the vertical black dashed line.https://doi.org/10.1371/journal.pcbi.1006707.g006

Fig 7 .
Fig 7. Performance distributions of all four behavioral models in two different environments.Here we compare the performance distributions of four different agents: irregular reversal interval (IRI) agent, regular reversal interval (RRI) agent, single-update Rescorla-Wagner(SU-RW) agent (Eq (1)) and dual-update Rescorla-Wagner (DU-RW) agent (Eq (2)).The free model parameters were fixed as follows (see main text for more details): (IRI) μ = 20, and σ = μ(μ − 1); (RRI) μ = 20, and σ = μ; (SU-RW) α = .25 and κ = 0; (DU-RW) α = .25 and κ = 1.Interestingly, the results show that the DU-RW agent achieves similar performance levels in both environments, with average responses closely following that of the IRI agent (see lower row).In contrast, the SU-RW agent shows only medium level performance in both environments with a significant increase of performance with irregular reversals.As in Fig6, the RRI agent performs best, among all agents, when exposed to semi-regular reversals, but has the worst performance, among all agents, when exposed to irregular reversals.

Fig 8 .
Fig 8. Agents performance on the experimental task and model comparison.A: Per trial performance (probability of making a correct choice) of four different agent types: irregular reversal interval (IRI) agent, regular reversal interval (RRI) agent, single-update Rescorla-Wagner(SU-RW) agent (Eq (1)) and dual-update Rescorla-Wagner (DU-RW) agent (Eq (2)).The free model parameters were fixed as follows: (IRI) μ = 20, and σ = μ(μ − 1); (RRI) μ = 20, and σ = 120; (SU-RW) α = .25 and κ = 0; (DU-RW) α = .25 and κ = 1.The parameters were fixed in a way that all models have comparable median performance levels (around 0.83).B: Confusion matrix showing the probability of assigning simulated behavior of different agents to the two different types of behavioral models.Note that as in Fig 7 the IRI and DU-RW agents generate very similar average responses across the experiment.This suggest a difficulty in distinguishing between these two type of models, which is also reflected in the confusion matrix that shows rather high mixing probability.https://doi.org/10.1371/journal.pcbi.1006707.g008

Fig 9 .Fig 10 .Fig 11 .
Fig 9. Bayesian model comparison.A: Posterior distribution of the frequency of the ED-HMM based model within the studied group of participants.The dashed line denotes the border at which both DU-RW and ED-HMM based models are equally likely.Hence, the probability mass on the right side of the dashed line (where ED-HMM has higher frequency within the population) corresponds to the exceedance probability (ep) of the ED-HMM based model (ep = 0.215).B: the color codes denote posterior model probability for each participant.The lighter the color the better the given model predicts behavior of the given participant during the post-reversal phase.https://doi.org/10.1371/journal.pcbi.1006707.g009

Fig 12 .
Fig 12. Posterior estimate of the expected reversal probability.The participants' specific beliefs about the probability of reversal within the postreversal phase, for all participants for which the ED-HMM model had higher posterior model evidence.The reversal probability is fully characterized by two parameters δ, and r (see Eqs (8) and (9)).Each light blue line corresponds to the trajectory obtained as a single sample from the posterior estimates over δ and r, whereas black lines correspond to the trajectory obtained from the mean posterior parameter estimates μ r and μ δ .
To demonstrate the relation of the above update rules to the ones of the DU-RW model, we transform the shape parameters (a c t t , and b c t t ) of the posterior beta distribution into the mean m t ; ð19Þ where x 2 fc t ; P � c t g.Expressing the expected reward probability μ t as a function of the expected value V t , that is, as