Human online adaptation to changes in prior probability

Optimal sensory decision-making requires the combination of uncertain sensory signals with prior expectations. The effect of prior probability is often described as a shift in the decision criterion. Can observers track sudden changes in probability? To answer this question, we used a change-point detection paradigm that is frequently used to examine behavior in changing environments. In a pair of orientation-categorization tasks, we investigated the effects of changing probabilities on decision-making. In both tasks, category probability was updated using a sample-and-hold procedure: probability was held constant for a period of time before jumping to another probability state that was randomly selected from a predetermined set of probability states. We developed an ideal Bayesian change-point detection model in which the observer marginalizes over both the current run length (i.e., time since last change) and the current category probability. We compared this model to various alternative models that correspond to different strategies—from approximately Bayesian to simple heuristics—that the observers may have adopted to update their beliefs about probabilities. While a number of models provided decent fits to the data, model comparison favored a model in which probability is estimated following an exponential averaging model with a bias towards equal priors, consistent with a conservative bias, and a flexible variant of the Bayesian change-point detection model with incorrect beliefs. We interpret the former as a simpler, more biologically plausible explanation suggesting that the mechanism underlying change of decision criterion is a combination of on-line estimation of prior probability and a stable, long-term equal-probability prior, thus operating at two very different timescales.


Ideal observer model derivation
Our derivation of the Bayesian online change-point detection algorithm for the ideal observer generalizes that of Adams and MacKay [1]. For clarity and ease of reference, we report here the full derivation; only the broad outline is described in the main text.

Bayesian online change-point detection
The Bayesian observer estimates the posterior distribution over the current run length, or time since the last change point, and the state (category probabilities) before the last change point, given the data (category labels) observed so far. We denote the length of the run at the end of trial t by r t . Similarly, we denote with π t and ξ t the current state and the state before the last change point, both measured at the end of trial t. Here, π t represents the probability that, on the subsequent trial, the category will be A (the probability of category B is 1 − π t ). Both π t , ξ t ∈ S π , where S π is a discrete set of possible states. In the experiment, S π = {0.2, 0.35, 0.5, 0.65, 0.8}. We use the notation C (r) t to indicate the set of observations (category labels) associated with the run r t , which is C t−rt+1:t for r t > 0, and ∅ for r t = 0. We use the subscript colon operator C t ′ :t to denote the sub-vector of C (the full sequence of observed categories) with elements from t ′ to t included.
We write the predictive distribution of category by marginalizing over run lengths r t and previous states ξ t , P (C t+1 |C 1:t ) = rt ξt P (C t+1 |r t , ξ t , C (r) t )P (r t , ξ t |C 1:t ). (S1) We assume that, in the case of a change point at the end of trial t, the new state might have Markovian dependence on the previous state, that is π t ∼ P (π t |π t−1 ). This is a generalization of the model of Adams and MacKay [1], in which the distribution parameters were assumed to be independent after change points. In the experiment, P (π t |π t−1 ) = 1 |Sπ|−1 π t = π t−1 . We use A to denote Iverson's bracket which is 1 if the expression A is true, and 0 otherwise [2].

(S4)
Eq S4 is the basis for the iterative Bayesian algorithm, since it allows us to derive the posterior distribution at time t as a function of the posterior distribution and a number of auxiliary variables at time t − 1.
For computational convenience, we rewrite the posterior from Eq S4 as an unnormalized posterior U (r t , ξ t |C 1:t ) = to denote the posterior from the previous trial times the conditional predictive probability for the current category, To compute the unnormalized posterior in Eq S5, we need: • the conditional predictive probability, which we compute in the following Section 1.1.1; • the change-point posterior, which we compute in Section 1.1.2; • the posterior from the previous trial.
We put everything together in Section 1.1.3.

Conditional predictive probability
The posterior over state at the end of trial t − 1, given the last r t−1 trials and the previous state ξ t−1 , is For computational convenience, we denote Ψ (r,π) t ≡ P (π t = π|r t = r, C (r) t ) and we store it in a table. Clearly, Ψ (r,π) t depends only on the length of the run r, the category probability π and the number of times category A occurs during the run. In the algorithm below, Ψ is computed iteratively trialby-trial and values of Ψ are computed only for combinations of run length and number of A categories that occur in the sequence. The conditional predictive probability for observing C t , using Eq S7, is

The change-point posterior
The conditional posterior on the change point (that is, run length) and previous state, P (r t , ξ t |r t−1 , ξ t−1 , C (r) t ), has a sparse representation since it has only two outcomes: the run length either continues to grow (r t = r t−1 + 1 and ξ t = ξ t−1 ) or a change point occurs (r t = 0 and the posterior over ξ t is the posterior over π t−1 computed from C (r) t ). We have The probability of a run length after a change point is where the function H(τ ) is the hazard function, , for τ ≥ 1.
(S11) P gap is the prior over run lengths. In our experimental setup, P gap (g) = 1 gmax−g min +1 g min ≤ g ≤ g max , with g min = 80 and g max = 120. The conditional posterior over the previous state is is the posterior over state given that C t has just been observed, (S13) where in the last step P (r t |r t−1 ) is constant in π t−1 and therefore irrelevant.

Iterative posterior update and boundary conditions
Using Eqs S5 and S9, the iterative posterior update equation becomes which is computed separately for the case r t = 0 and r t > 0 via Eqs S7-S13.
We assume as boundary conditions that a change point just occurred and uniform probability across previous states (S15) Once we have U (r t , ξ t |C 1:t ), we can easily obtain the normalized posterior P (r t , ξ t |C 1:t ) by computing the normalization constant via a discrete summation over run lengths r t and previous states ξ t , (S16) This result together with Eq S1 allows the observer to compute P (C t+1 |C 1:t ).

Task-dependent predictive distributions
Armed with an expression for the observer's posterior distribution over run lengths and previous states, given all trials experienced so far (Eq S16), we can now compute the predictive distributions for the observer's response at trial t for the covert-and overt-criterion tasks.

Covert-criterion task
The probability density of a noisy measurement x t is is the observer's visual measurement noise and σ 2 s is the stimulus variance. The conditional posterior for category C t , after observing x t , is . (S18) We assume that for a given noisy measurement the observer respondsĈ t if that category is more probable, that is, The probability of observing responseĈ t for a given stimulus s t is therefore P (Ĉ t |s t , C 1: which can be easily computed via 1-D numerical integration over a grid of regularly spaced x t using trapezoidal or Simpson's rule [3]. We can also consider an observer model with lapses that occasionally reports the wrong category with probability 0 ≤ λ ≤ 1, where |C| is the number of categories in the task (|C| = 2 in our case), and we assume equal response probability across categories for lapses.

Overt-criterion task
The optimal criterion z opt is the point at which P (C A |x, t) = P (C B |x, t), given the available information at trial t. Specifically, noting that P (C, π t |x) ∝ P (x|C)P (C|π t )P (π t ), we have (1−πt)P (πt) . We assume that µ A and µ B are known exactly from the training session.
The probability that the observer reports criterionẑ t at trial t is where σ 2 a is criterion-placement (adjustment) noise. Note that the likelihood at trial t is based on information gathered through trial t − 1.
We can also consider an observer model with lapses who occasionally reports a criterion uniformly at random with probability 0 ≤ λ ≤ 1,

Algorithm
In the following, we use the notation P (x|y) ∝ f (x, y) to indicate that the user needs to compute f (x, y) and then normalize as follows, Observe new category C t

Compute auxiliary variables
(a) Evaluate predictive probability (Eq S8) Evaluate the predictive probability times posterior probability (Eq S6) (c) Evaluate the posterior probability over state (from Eq S13) 4. Update run length and previous-state posterior (a) Calculate the unnormalized change-point probabilities (Eq S14) Calculate the unnormalized growth probabilities (see Eq S14)

Bookkeeping and predictions
(a) Update sufficient statistics for all r and π 6. Increase trial index t ← t + 1 and return to step 2 For each trial t, the posterior predictive distributions calculated in steps 5b and 5c are used to compute the observer's response probabilities in the covertand overt-criterion tasks, respectively, as described in Section 1.2.

Additional models
We describe here a number of model variants which we did not include in the main text for reasons of space. For all additional models, we report as a model comparison metric the difference in log marginal likelihood (∆LML) with respect to a baseline model (mean ± SEM across subjects). Usually, unless stated otherwise, we take as baseline the best-fitting model described in the main text, Exp bias . Positive values of ∆LML denote a worse-fitting model than baseline.

Bayesian
The main text discusses four Bayesian models. Bayes ideal is the algorithm above, using the precise generative model for our experiment. Bayes r uses the same algorithm, but adds a free parameter for the run-length distribution (and hence the hazard function) assumed by the observer. Bayes π also uses the same algorithm, but adds a parameter for the range of the set of five states assumed by the observer. Bayes β assumes the observer uses a betadistributed prior over states. To implement this observer requires minor modifications of the algorithm. In particular, the beta prior is substituted for the uniform distribution in the initialization steps for P (r 0 , ξ 0 |∅) and Ψ (r 0 ,π 0 ) as well as the update step forΨ (r,π) t in the case where r t = 0. To ensure the robustness of our results we fit two additional suboptimal Bayesian models and compared each model to the winning model (Exp bias ). To capture conservatism as we did in the Exp bias model, we fit a model that took a weighted average between the probability predicted by the ideal observer model and π = 0.5 (Bayes bias ). The weight on the probability computed by an ideal observer was defined by the parameter w with range 0 ≤ w ≤ 1, such that 0 indicated the use of a fixed criterion and 1 the optimal. This model is similar to the Bayes β model described in the main text with a symmetric hyperprior on π, in that both result in conservatism. However, we ran it to ensure that the fits did not change when the parameterization was identical to the Exp bias model. We also fit a three-parameter model, in which the maximum run length r and the hyperparameter β were both free parameters (Bayes r,β ). We chose these parameters because the Bayes r model was the best fitting Bayesian model tested, and the Bayes β model takes into account conservatism, which we observed in our data. Neither of these additional models fit better than the Exp bias (Bayes bias : ∆LML covert = 19.92 ± 5.00, ∆LML overt = 20.49 ± 3.43; Bayes r,β : ∆LML covert = 5.89 ± 2.10, ∆LML overt = 10.54 ± 4.32).

Reinforcement learning -probability updating
The following model (RL prob ) differs from the RL models in the main text in that it updates the category probability (as opposed to updating the decision criterion), which makes it similar to the exponential-averaging model. Similarly to the RL models in the main text (and in contrast with the Exp models), it updates probability according to a delta rule which is applied only after incorrect responses. After each response at trial t, the probability estimate for the next trial is updated using the following delta rule, whereπ A,t is the observer's estimate of the probability for category A on trial t (π B,t = 1−π A,t ), α prob is the learning rate, and C t is the current category label. Thus, the probability estimate is updated when negative feedback is received by taking a small step in the direction of the most recently experienced category. This model has two free parameters (α prob and either σ v or σ a ).

Wilson et al. (2013)
The Wilson et al. model [4,5] was developed as an approximation to the full change-point detection model. Their approximation used a mixture of delta rules, each of which alone is identical to our Exp model with different learning rates. In the main text, we fit a three node model with two free node parameters (l 2 and l 3 ) and the hyperparameter on category probability ν p as a free parameter as well. Here, instead we fit the model with ν p = 2, which was determined based on our experimental design. On average, this model provided a worse fit than the Wilson et al. model presented in the main text (∆LML covert = 28.79 ± 9.91; overt task: ∆LML overt = 4.18 ± 2.94).
3 Comparison of the Bayes r,π,β and the Exp bias models We compared the winning Exp bias from our preliminary model-comparison analysis to the Bayes r,π,β , which allowed for incorrect beliefs and a bias towards equal priors. Because of the complexity of the Bayes r,π,β , we fit both models using maximum likelihood and variational Bayes [6], thus computing the Bayesian information criteria (BIC) and ELBO scores for each observer and model. Each of these model-comparison methods penalizes the model for increased complexity. The maximum-likelihood fits for each model and task are shown in Fig S1A (covert) and Fig S1B (overt). The relative modelcomparison scores are shown in Fig S1C. For both BIC and ELBO, we found that the two models were indistinguishable from one another.

Model comparison with AIC
To ensure the robustness of our model comparison results, in addition to using the log marginal likelihood as a measure of goodness of fit, we calculated the Akaike information criterion (AIC) [7]. Unlike the log marginal likelihood, AIC uses a point estimate and penalizes for complexity by adding a correction for the number of parameters k: AIC = 2k−2 ln(L), whereL is the maximum log likelihood of the dataset. Like the log marginal likelihood, AIC is best interpreted as a relative score. The model comparison results using relative AIC scores (relative to the winning model) are shown in Fig S2. From the plot we see that our results do not change using a different metric (compare with Fig 3 in the main text). Furthermore, the ranks for all models do not change for either task when comparing -0.5 AIC and LML scores (ρ = 1.0, p < 0.0001). Note that for historical reason the AIC scores have an additional factor of two. Higher scores indicate a worse fit. Error bars: 95% C.I.

Model recovery
We performed a model recovery analysis to validate our model-fitting pipeline and ensure that models were identifiable [8]. For this analysis, we generated ten synthetic datasets from each model, observer, and task (1,980 datasets). Parameters for each simulated dataset were determined by sampling with replacement from the posterior over model parameters. We fitted these datasets with all models (17,820 fits), and for each pair of generating and fitting models we calculated the proportion of times each model fit the data best (i.e., had the greatest LML score), producing the confusion matrix in Fig S3. First, the fact that the confusion matrix is mostly diagonal means that most datasets were best fit by their true generating model, suggesting a generally successful recovery. Across both tasks, we found that the true generating model was the bestfitting model for 70.1%±9.0% of simulated datasets (covert: 66.06%±11.36%; overt: 74.0% ± 8.6%; mean and SEM across models). For most simulated datasets, the true generating model was recovered for all models except the Exp model (see diagonal in Fig S3), which was best fit by the Wilson Fig S3B) the RL model simulations were best fit by the Exp bias model. This is potentially due to the fact that observers exhibited a greater amount of conservatism in the covert task. Increased conservatism results in smaller, smoother changes of criterion, which is consistent with what we observed in the RL model (see the third row, third column panel in Fig 2D in the main text), so that data from the RL model are also well fit by the Exp bias model. However, these models were clearly distinguishable in the overt task (Fig S3C), which allows us to rule out the RL model. These results again provide support for the use of tasks, such as our overt task, that allow the researcher to better distinguish between computational models.

Parameter recovery
To determine whether our parameter estimation procedure was biased, we analyzed the parameter recovery performance for the Exp bias model. Specifically, for each observer we created ten synthetic datasets by sampling from the posterior over model parameters and simulating the experiment with the same experimental parameters as the observer experienced. Each synthetic dataset was then fit to the Exp bias model and the best fitting parameters (MAP estimates) were estimated. For each parameter and task, we conducted a paired-samples t-test comparing the average best fitting parameters to the average generating parameters. We did not find a statistically significant difference between the fitted and generating α Exp and w parameters for either task: α Exp (covert: t(10) = 1.14, p = 0.28; overt: t(10) = 1.01, p = 0.34) and w (covert: t(10) = −2.00, p = 0.07; overt: t(10) = 0.46, p = 0.66), suggesting good parameter recovery. While there was no significant difference between the fitted and generating noise parameter (σ v ) in the covert task (t(10) = 2.10, p = 0.06), we found a significant difference in the noise parameter (σ a ) in the overt task (t(10) = −16.53, p = 1.37e − 08). This difference remained significant after correcting for multiple comparisons using the Bonferroni cutoff of p = 0.0083. This result suggests that σ a was overestimated on average. 6 Measurement task

Procedure
During the 'measurement' session, observers completed a two-interval forcedchoice, orientation-discrimination task in which two black ellipses were presented sequentially on a mid-gray background and the observer reported the interval containing the more clockwise ellipse (Fig S4A). This allowed us to measure the observer's sensory uncertainty.

Analysis
A cumulative normal distribution was fit to the orientation-discrimination data (probability of choosing interval one as a function of the orientation difference between the first and second ellipse) using a maximum-likelihood criterion with parameters µ, σ, and λ (the mean, SD, and lapse rate). We define threshold as the underlying measurement SD σ v (correcting for the 2IFC task by dividing by √ 2). Figure S3. Model recovery. For each model, observer, and task, 10 sets of parameters were sampled from the model posterior and used to generate synthetic data (1,980 total simulations). The synthetic datasets were then fit to each model (17,820 fits) and the goodness of fit was judged by computing the LML. The proportion of "wins" (i.e., the number of times the simulated model outperformed the alternative models) is indicated by brightness. Model recovery performance is shown across both tasks (A), the covert task only (B), and the overt task only (C).   Figure S4. Measurement task. A: Trial sequence. Two ellipses were presented sequentially on a mid-gray background. The observer reported the interval containing the more clockwise ellipse. Feedback was provided. B:

Results
The best fitting psychometric function for one observer. The area of each data point is proportional to the number of trials.
not to confound category learning with probability learning. Training was identical to the covert-criterion task (Fig S5A). On each trial (N trials = 200), a black ellipse was presented on a mid-gray background and observers reported the category to which it belonged. Category probability was equal during training and we provided correctness feedback. To determine how well observers learned the category distributions, observers estimated the mean orientation of each category at the end of the training block by rotating an ellipse to match the mean orientation ( Fig S5B). Each category was estimated exactly once.

Results
Observers' estimates of the category means are shown in Fig S5C as Figure S5. Category training. A: Trial sequence. After stimulus offset observers reported the category by key press and received feedback. B: Mean estimation task. After completing the training block, observers rotated an ellipse to estimate the category means. C: Estimation results. Observers' category-mean estimates are shown as a function of the true category mean for each category, observer and task.

Individual model fits
The maximum a posteriori (MAP) model fits for each observer, task, and model are plotted below for all models. Note that the parameter values obtained for the Bayes r,π,β model via maximum likelihood are equivalent to MAP estimates, since for all parameters we used flat priors in the chosen parameterization.