
The authors have declared that no competing interests exist.

Humans can meaningfully report their confidence in a perceptual or cognitive decision. It is widely believed that these reports reflect the Bayesian probability that the decision is correct, but this hypothesis has not been rigorously tested against non-Bayesian alternatives. We use two perceptual categorization tasks in which Bayesian confidence reporting requires subjects to take sensory uncertainty into account in a specific way. We find that subjects do take sensory uncertainty into account when reporting confidence, suggesting that brain areas involved in reporting confidence can access low-level representations of sensory uncertainty, a prerequisite of Bayesian inference. However, behavior is not fully consistent with the Bayesian hypothesis and is better described by simple heuristic models that use uncertainty in a non-Bayesian way. Both conclusions are robust to changes in the uncertainty manipulation, task, response modality, model comparison metric, and additional flexibility in the Bayesian model. Our results suggest that adhering to a rational account of confidence behavior may require incorporating implementational constraints.

Humans are able to report a sense of confidence in decisions that we make. It is widely hypothesized that confidence reflects the computed probability that a decision is accurate; however, this hypothesis has not been fully explored. We use several human behavioral experiments to test a variety of models that may be considered to be distinct hypotheses about the computational underpinnings of confidence. We find that reported confidence does not appear to reflect the probability that a decision is correct, but instead emerges from a heuristic approximation of this probability.

People often have a sense of confidence about their decisions. Such a “feeling of knowing” [

Bayesian decision theory provides a general and often quantitatively accurate account of perceptual decisions in a wide variety of tasks [

Recent work has proposed possible qualitative signatures of Bayesian confidence [

During each session, each subject completed two orientation categorization tasks, Tasks A and B. On each trial, a category was selected randomly and a stimulus orientation was drawn from the corresponding category distribution. In Task A, the two category distributions had different means (±μ_C) and the same standard deviation (σ_C); leftward-tilting stimuli were more likely to be from category 1. Variants of Task A are common in decision-making studies [. In Task B, the two category distributions had the same mean but different standard deviations (σ_1, σ_2); stimuli around the horizontal were more likely to be from category 1. Variants of Task B are less common [


Subjects were highly trained on the categories; during training, we used only the highest-reliability stimuli, and we provided trial-to-trial category correctness feedback. Subjects were then tested with 6 different reliability levels, chosen randomly on each trial. During testing, correctness feedback was withheld to avoid the possibility that confidence simply reflects a learned mapping from stimulus orientation and reliability to the probability of being correct [

Because we are interested in subjects’ intrinsic computation of confidence, we did not instruct or incentivize them to assign probability ranges to each button (e.g., by using a scoring rule [

To ensure that our results were independent of stimulus type, we used two kinds of stimuli. Some subjects saw oriented drifting Gabors; for these subjects, stimulus reliability was manipulated through contrast. Other subjects saw oriented ellipses; for these subjects, stimulus reliability was manipulated through ellipse elongation (

For modeling purposes, we assume that the observer’s internal representation of the stimulus is a noisy measurement x of the true stimulus orientation.

A Bayes-optimal observer uses knowledge of the generative model to make a decision that maximizes the probability of being correct. Here, when the measurement on a given trial is x, the optimal observer chooses the category with the higher posterior probability p(C | x).

In Task A, the optimal decision rule reduces to reporting category 1 whenever the measurement is tilted leftward, independent of sensory uncertainty; in Task B, by contrast, the optimal decision boundaries depend on sensory uncertainty.

From the point of view of the observer, the relevant decision variable in each task is the log posterior ratio, denoted d_A and d_B.
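As an illustration, the two decision variables can be sketched as follows; this is a minimal sketch assuming the category parameters given in Methods (Task A: means ±4°, s.d. 5°; Task B: mean 0°, s.d.s 3° and 12°), with function and variable names ours, not the paper’s:

```python
import math

MU = 4.0                   # |mean| of each Task A category (deg)
SIG_C = 5.0                # s.d. of each Task A category (deg)
SIG_1, SIG_2 = 3.0, 12.0   # Task B category s.d.s (deg)

def log_normpdf(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def d_task_a(x, sigma):
    """Log posterior ratio log[p(C=1|x) / p(C=2|x)] with equal priors.
    Measurement noise variance sigma^2 adds to the category variance."""
    var = sigma ** 2 + SIG_C ** 2
    return log_normpdf(x, -MU, var) - log_normpdf(x, MU, var)

def d_task_b(x, sigma):
    """Task B: same means, different variances; d depends on x^2 and sigma."""
    v1 = sigma ** 2 + SIG_1 ** 2
    v2 = sigma ** 2 + SIG_2 ** 2
    return log_normpdf(x, 0.0, v1) - log_normpdf(x, 0.0, v2)
```

Note that in Task A, d_A > 0 exactly when x < 0, regardless of sigma, whereas in Task B the sign of d_B flips at a |x| threshold that grows with sigma; this is why Task B requires uncertainty for optimal categorization.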

We introduce the Bayesian confidence hypothesis (BCH), stating that confidence reports depend on the internal representation of the stimulus (here, the noisy measurement) only through the posterior probability that the chosen category is correct.

The observer’s decision can be summarized as a mapping from a combination of a measurement and an uncertainty level (x, σ) to a category and confidence response.

Each model corresponds to a different mapping from a measurement and uncertainty level to a category and confidence response. Colors correspond to category and confidence response, as in

At first glance, it seems obvious that sensory uncertainty is relevant to the computation of confidence. However, this is by no means a given; in fact, a prominent proposal is that confidence is based on the distance between the measurement and the decision boundary, without any role for sensory uncertainty [

We also tested heuristic models in which the subject uses their knowledge of their sensory uncertainty but does not compute a posterior distribution over category. We have previously classified such models as probabilistic non-Bayesian.

We derived two additional probabilistic non-Bayesian models, Lin and Quad, from the observation that the Bayesian decision criteria are an approximately linear function of uncertainty in some measurement regimes and approximately quadratic in others. These models are able to produce approximately Bayesian behavior without actually computing a posterior. In Lin and Quad, subjects base their response on decision boundaries that are a linear or a quadratic function of uncertainty.

Each trial consists of the experimentally determined orientation and reliability level and the subject’s category and confidence response (an integer between 1 and 8). This is a very rich data set, which we summarize in

Error bars represent ±1 s.e.m. across 11 subjects. Shaded regions represent ±1 s.e.m. on model fits. (

Recently, a measure of the degree of association between accuracy and confidence, meta-d′, has been proposed [

We used Markov Chain Monte Carlo (MCMC) sampling to fit models to raw individual-subject data. To account for overfitting, we compared models using leave-one-out cross-validated log likelihood scores (LOO) computed with the full posteriors obtained through MCMC [

We first compared Bayes to the Fixed model, in which the observer does not take trial-to-trial sensory uncertainty into account (

Bayes provides a better fit, but both models have large deviations from the data. Left and middle columns: model fits to mean button press as a function of reliability, true category, and task. Error bars represent ±1 s.e.m. across 11 subjects. Shaded regions represent ±1 s.e.m. on model fits, with each model on a separate row. Right column: LOO model comparison. Bars represent individual subject LOO scores for Bayes, relative to Fixed. Negative (leftward) values indicate that, for that subject, Bayes had a higher (better) LOO score than Fixed. Blue lines and shaded regions represent, respectively, medians and 95% CI of bootstrapped mean LOO differences across subjects. These values are equal to the summed LOO differences reported in the text divided by the number of subjects. Although we plot data as a function of the true category here, the model only takes in measurement and reliability as an input; it is not free to treat individual trials from each true category differently.

Although Bayes fits better than Fixed, it still shows systematic deviations from the data, especially at high reliabilities. (Because we fit our models to all of the raw data and because boundary parameters are shared across all reliability levels, the fit to high-reliability trials is constrained by the fit to low-reliability trials).

To see if we could improve Bayes’s fit, we tried a version that included decision noise, i.e., Gaussian noise with s.d. σ_d on the log posterior ratio d. This is almost equivalent to the probability of a response being a logistic (softmax) function of d.

For the rest of the reported fits to behavior, we will only consider this version of Bayes with decision noise.

Orientation Estimation performs worse than Bayes-

In both tasks, Bayes-

Finally, Lin and Quad outperform Bayes-

We summarize the performance of our core models in

Models were fit jointly to Task A and B category and confidence responses. Blue lines and shaded regions represent, respectively, medians and 95% CI of bootstrapped summed LOO differences across subjects. LOO differences for these and other models are shown in

One potential criticism of our fitting procedure is that we assumed a parameterized relationship between reliability and sensory noise.

Suboptimal behavior can be produced by optimal inference using incorrect generative models, a phenomenon known as “model mismatch” [

The Bayesian model assumes that the observer has accurate knowledge of σ_C, σ_1, and σ_2, the standard deviations of the stimulus distributions. We tested a model in which these values were free parameters, rather than fixed to the true values. We would expect these free parameters to improve the fit of Bayes-

Previous models also assumed that subjects had accurate knowledge of their own measurement noise, i.e., the perceptual uncertainty used in the computation of the decision variable. We relaxed this assumption by fitting σ_measurement and σ_inference as two independent functions of reliability [

A recent paper ([

In order to determine whether model rankings were primarily due to differences in one of the two tasks, we fit our models to each task individually. In Task A, Quad fits better than Bayes-

In order to see whether our results were peculiar to combined category and confidence responses, we fit our models to the category choices only. Lin fits better than Bayes-

To confirm that the fitted values of sensory uncertainty in the probabilistic models are meaningful, we treated Task A as an independent experiment to measure subjects’ sensory noise. The category choice data from Task A can be used to determine the four uncertainty parameters. We fit Fixed with a decision boundary of 0° (equivalent to a Bayesian choice model with no prior), using maximum likelihood estimation. We fixed these parameters and used them to fit our models to Task B category and confidence responses. Lin fits better than Bayes-

There has been some recent debate as to whether it is more appropriate to collect choice and confidence with a single motor response (as described above) or with separate responses [

It is possible that subjects behave suboptimally when they have to do multiple tasks in a session; in other words, perhaps one task “corrupts” the other. To explore this possibility, we ran an experiment in which subjects completed Task B only. We chose Task B over Task A for this experiment because Task B has the desirable characteristic that uncertainty is required for optimal categorization. Quad fits better than Bayes-

We also fit only the choice data, and found that Lin fits about as well as Bayes-

None of our model comparison results depend on our choice of metric: in all three experiments, model rankings changed negligibly if we used AIC, BIC, AICc, or WAIC instead of LOO.

Although people can report subjective feelings of confidence, the computations that produce this feeling are not well understood. It has been proposed that confidence is the observer’s computed posterior probability that a decision is correct [

Our first finding is that, like the optimal observer, subjects use knowledge of their sensory uncertainty when reporting confidence in a categorical decision; models in which the observer ignores their sensory uncertainty provide a poor fit to the data (

Our study has several limitations. For instance, because of our short presentation time, we cannot say much about how our results generalize to tasks that require integration of evidence over time [

Like the present study, Aitchison et al. [

Navajas et al. [

Sanders et al. [

In detection and coarse discrimination tasks, Lau, Rahnev, and colleagues report that subjects overestimate their confidence in the periphery and for unattended stimuli. The authors have proposed a signal detection theory model in which high eccentricity or lower attention induces higher noise, and the confidence criterion may not change at all [

Another form of non-Bayesian confidence ratings is the recent proposal that, in confidence judgments, only the “positive evidence” in favor of the chosen option matters, instead of the “balance of evidence” between two options [

What do our findings tell us about the neural basis of confidence? Previous studies have found that neural activity in some brain areas (e.g., human medial temporal lobe [

Our results raise general issues about the status of Bayesian models as descriptions of behavior. First, because it is impossible to exhaustively test all models that might be considered “Bayesian,” we cannot rule out the entire class of models. However, we have tried to alleviate this issue as much as possible by testing a large number of Bayesian models—far more than the number of Bayesian and non-Bayesian models tested in other studies of confidence. Second, Bayesian models are often favored for their generalizability: one can determine the performance-maximizing strategy for any task. Although generalizability indeed makes Bayesian models attractive and powerful, we do not believe that this property should override a bad fit.

One could take two different views of our heuristic model results. The first view is that the heuristics should be taken seriously as principled models [

However, one might still conclude, after examining the fits of the Bayesian model, that the behavior is “approximately Bayesian” rather than “non-Bayesian.” As written, this is a semantic distinction because it relies on one’s definition of “approximate.” However, it can be turned into a more meaningful question: Are the differences between human behavior and Bayesian models accounted for by an unknown principle, such as an ecologically relevant objective function that includes both task performance and biological constraints?

Although there are benefits associated with veridical explicit representations of confidence [

The experiments were approved by the University Committee on Activities Involving Human Subjects of New York University. Informed consent was given by each subject before the experiment.

11 subjects (2 male), aged 20–42, participated in the experiment. Subjects received $10 per 40-60 minute session, plus a completion bonus of $15. All subjects were naïve to the purpose of the experiment. No subjects were fellow scientists.

The stimulus was either a drifting Gabor (Subjects 3, 6, 8, 9, 10, and 11) or an ellipse (Subjects 1, 2, 4, 5, and 7). The Gabor had a peak luminance of 398 cd/m^{2} at 100% contrast, a spatial frequency of 0.5 cycles per degree of visual angle (dva), a speed of 6 cycles per second, a Gaussian envelope with a standard deviation of 1.2 dva, and a randomized starting phase. Each ellipse had a total area of 2.4 dva^{2} and was black (0.01 cd/m^{2}). We varied the contrast of the Gabor and the elongation (eccentricity) of the ellipse.

In Task A, stimulus orientations were drawn from Gaussian distributions with means μ_1 = −4° (category 1) and μ_2 = 4° (category 2) and standard deviations σ_1 = σ_2 = 5°. In Task B, stimulus orientations were drawn from Gaussian distributions with means μ_1 = μ_2 = 0° and standard deviations σ_1 = 3° (category 1) and σ_2 = 12° (category 2) (
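The stimulus-generating process just described can be sketched as follows (a minimal Python sketch using the Methods values; the function name is ours):

```python
import random

def draw_trial(task, rng=random):
    """Return (category, orientation_deg) for one trial; categories equally likely.
    Task A: means ±4 deg, s.d. 5 deg. Task B: mean 0 deg, s.d.s 3 and 12 deg."""
    category = rng.choice([1, 2])
    if task == "A":
        mu = -4.0 if category == 1 else 4.0
        return category, rng.gauss(mu, 5.0)
    else:  # Task B
        sd = 3.0 if category == 1 else 12.0
        return category, rng.gauss(0.0, sd)
```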

Each subject completed 5 sessions. Each session consisted of two parts; the subject did Task A in the first part, followed by Task B in the second part, or vice versa (chosen randomly each session). Each part started with instruction and was followed by alternating blocks of 96 category training trials and 144 testing trials, for a total of three blocks of each type, with a block of 24 confidence training trials immediately after the first category training block. Combining all sessions and both tasks, each subject completed 2880 category training trials, 240 confidence training trials, and 4320 testing trials; we did not analyze category training or confidence training trials.

Subjects were not instructed to use the full range of confidence reports [

Since our models do not include any learning effects, we wanted to ensure that task performance was stable. For all tasks and experiments, we found no evidence that performance changed significantly as a function of the number of trials. For each experiment and task (the 5 lines in

Performance was computed as a moving average over test trials (200 trials wide). Shaded regions represent ±1 s.e.m. over subjects. Performance did not change significantly over the course of each experiment.

The following statistical differences were assessed using repeated-measures ANOVA.

In Task A, there was a significant effect of true category on category choice (F_{1,10} = 285, p < 10^{−6}). There was no main effect of reliability, which took 6 levels of contrast or ellipse elongation, on category choice (F_{5,50} = 0.27), but there was a significant interaction between true category and reliability (F_{5,50} = 59.6, p < 10^{−14}) (

In Task B, there was again a significant effect of true category on category choice (F_{1,10} = 78.3, p < 10^{−4}). There was no main effect of reliability (F_{5,50} = 2.93), but there was a significant interaction between true category and reliability (F_{5,50} = 28, p < 10^{−11}) (

In Task A, there was a significant effect of true category on response (F_{1,10} = 136, p < 10^{−5}). There was no main effect of reliability (F_{5,50} = 0.61), but there was a significant interaction between true category and reliability (F_{5,50} = 58.7, p < 10^{−12}) (

In Task B, there was a significant effect of true category on response (F_{1,10} = 54.2, p < 10^{−5}). There was a significant effect of reliability (F_{5,50} = 4.84), as well as a significant interaction between true category and reliability (F_{5,50} = 29.2, p < 10^{−7}) (

In Task A, there was a main effect of confidence level on the proportion of reports (F_{3,30} = 7.75, p < 10^{−2}); low-confidence reports were more frequent than high-confidence reports. There was no significant effect of true category (F_{1,10} = 0.784) and no significant interaction between confidence level and true category (F_{3,30} = 1.45).

In Task B, there was a main effect of confidence level on the proportion of reports (F_{3,30} = 4.36), no significant effect of true category (F_{1,10} = 0.22), and a significant interaction between confidence level and true category (F_{3,30} = 8.37).

In both tasks, reported confidence had a significant effect on performance (F_{3,30} = 36.9, p < 10^{−2}). Task also had a significant effect on performance (F_{1,10} = 20.1), but there was no significant interaction between confidence and task (F_{3,30} = 0.878).

In Task A (

This control experiment was identical to experiment 1 except for the following modifications:

Subjects first reported choice by pressing one of two buttons with their left hand, and then reported confidence by pressing one of four buttons with their right hand.

Subjects reported confidence in category training blocks, and received correctness feedback after reporting confidence.

There were no confidence training blocks.

In testing blocks, subjects received correctness feedback after each trial.

Subjects completed a total of 3240 testing trials.

8 subjects (0 male), aged 19–23, participated. None were participants in experiment 1, and again, none were fellow scientists.

Drifting Gabors were used; no subjects saw ellipses.

This experiment was identical to experiment 1 except for the following modifications:

Subjects completed blocks of Task B only.

Subjects completed a total of 3240 testing trials.

15 subjects (8 male), aged 19–30, participated. None were participants in experiments 1 or 2.

Drifting Gabors were used; no subjects saw ellipses.

For models (such as our core models) where the relationship between reliability (i.e., contrast or ellipse eccentricity) and noise was parametric, we assumed a power law relationship between reliability c and noise variance σ²: σ²(c) ∝ c^{−β}. We have previously [

For all models except the Bayesian model with additive precision, we assumed additive orientation-dependent noise in the form of a rectified 2-cycle sinusoid, accounting for the finding that measurement noise is higher at non-cardinal orientations [
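The two noise components above can be sketched jointly; this is a hedged illustration, assuming a power law in reliability plus a rectified 2-cycle sinusoid in orientation (the parameter names alpha, beta, and amp are illustrative placeholders, not the paper’s fitted parameters):

```python
import math

def sigma(reliability, orientation_deg, alpha=10.0, beta=1.0, amp=1.0):
    """Measurement noise s.d. as a function of reliability and orientation.
    Power law in reliability; extra noise at oblique orientations."""
    var_rel = alpha * reliability ** (-beta)                          # power law term
    oblique = amp * abs(math.sin(2 * math.radians(orientation_deg)))  # rectified 2-cycle sinusoid
    return math.sqrt(var_rel) + oblique
```

As intended, noise decreases with reliability and peaks at ±45° relative to the cardinal axes.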

We coded all responses as an integer r between 1 and 8, combining category and confidence. On trial i, the measurement distribution is Gaussian with mean s_i and variance σ_i². Because we only use a small range of orientations, we can safely approximate measurement noise as a normal distribution rather than a Von Mises distribution. We find the measurement boundaries b_r corresponding to the response boundaries; the probability of response r is the measurement probability mass between b_{r−1} and b_r. For Task A, b_0 = −∞° and b_8 = ∞°. For Task B, where the response depends on |x|, this quantity is summed over the two symmetric intervals, with b_0 = 0° and b_8 = ∞°.
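The response-probability computation can be sketched as follows (boundary values in the example are hypothetical, not fitted; function names are ours):

```python
import math

def _phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _cdf(x, s, sigma):
    """Measurement CDF at x for a trial with mean s and s.d. sigma."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return _phi((x - s) / sigma)

def response_probs_task_a(s, sigma, b):
    """b: 9 boundaries with b[0] = -inf and b[8] = +inf; returns 8 probabilities,
    each the measurement mass between consecutive boundaries."""
    return [_cdf(b[r + 1], s, sigma) - _cdf(b[r], s, sigma) for r in range(8)]

def response_probs_task_b(s, sigma, b):
    """Task B responses depend on |x|: each response's probability is the mass
    in (b[r], b[r+1]) plus the mirror interval (-b[r+1], -b[r]); b[0] = 0."""
    return [(_cdf(b[r + 1], s, sigma) - _cdf(b[r], s, sigma))
            + (_cdf(-b[r], s, sigma) - _cdf(-b[r + 1], s, sigma))
            for r in range(8)]
```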

To obtain the log likelihood of the dataset, given a model with parameters

d_A and d_B: The log posterior ratio is defined as d = log [p(C = 1 | x, σ) / p(C = 2 | x, σ)].

To get d_A and d_B, we need to find the task-specific expressions for the likelihoods, substituting the category parameters μ_C and σ_C into the Gaussian densities.

The 8 possible category and confidence responses are determined by comparing the log posterior ratio d to a set of boundaries (k_0, k_1, …, k_8). k_4 is equal to the log prior ratio, and k_4 is the only boundary parameter in models of category choice (and not confidence). k_0 is fixed at −∞ and k_8 is fixed at ∞. In all models, the observer chooses category 1 when d > k_4.

Because the decision boundaries are free parameters, our models effectively include a large family of possible cost functions. A different cost function would be equivalent to a rescaling of the confidence boundaries. A notable boundary is k_4, the category decision boundary (i.e., the observer’s prior over category). For us, this boundary (like all other boundaries) is a free parameter.

The posterior probability of category 1 can be written as a logistic function of the log posterior ratio: p(C = 1 | x, σ) = 1 / (1 + e^{−d}).

Levels of strength: The Bayesian model is unique in that it is possible to formulate a principled version with relatively few boundary parameters. In principle, it is possible that such a model could perform better than more flexible models, if those models are overfitting. We formulated several levels of strength of the BCH, with weaker versions having fewer assumptions and more sets of mappings between the posterior probability of being correct and the confidence report (

Solid lines represent the distributions of posterior probabilities for each category and task in the absence of measurement noise and sensory uncertainty. Dashed lines represent confidence criteria, generated from the mean of subject 4’s posterior distribution over parameters. Each model has a different number of sets of mappings between posterior probability and confidence report. In Bayes_{Ultrastrong}, there is one set of mappings. In Bayes_{Strong}, there is one set for Task A, and another for Task B. In Bayes_{Weak}, as in the non-Bayesian models, there is one set for Task A, and one set for each reported category in Task B. Plots were generated from the mean of subject 4’s posterior distribution over parameters as in

Most studies cannot distinguish between the ultrastrong and strong BCH because they test subjects in only one task. Furthermore, the weak BCH is only justifiable in tasks where the categories have different distributions of the posterior probability of being correct; the subject may then rescale their mappings between the posterior and their confidence. Here, one can see that Task B has this feature by observing that, in the bottom row of _{Ultrastrong}, Bayes_{Strong}) corresponding to each of these versions of the BCH.

In Bayes_{Ultrastrong}, the confidence boundaries are symmetric around k_4: k_{4+j} − k_4 = k_4 − k_{4−j} for j = 1, 2, 3. Additionally, in Bayes_{Ultrastrong}, the boundaries are shared across tasks: k_A = k_B. So Bayes_{Ultrastrong} has a total of 4 free boundary parameters: k_1, k_2, k_3, k_4. Bayes_{Ultrastrong} consists of the observer determining the response by comparing d_A and d_B to a single symmetric set of boundaries (

Bayes_{Strong} is identical to Bayes_{Ultrastrong} except that k_A is allowed to differ from k_B. So Bayes_{Strong} has a total of 8 free boundary parameters: k_{1A}, k_{2A}, k_{3A}, k_{4A}, k_{1B}, k_{2B}, k_{3B}, k_{4B}. Bayes_{Strong} consists of the observer determining the response by comparing d_A to a symmetric set of boundaries, and d_B to a different symmetric set of boundaries (

Bayes_{Weak} is identical to Bayes_{Strong} except that symmetry is not enforced for the Task B boundaries. So Bayes_{Weak} has a total of 11 free boundary parameters: k_{1A}, k_{2A}, k_{3A}, k_{4A}, k_{1B}, k_{2B}, k_{3B}, k_{4B}, k_{5B}, k_{6B}, k_{7B}. Bayes_{Weak} consists of the observer comparing d_A to a symmetric set of boundaries, and d_B to a different non-symmetric set of boundaries (

We did not include Bayes_{Strong} and Bayes_{Ultrastrong} in the core models reported in the main text, because Bayes_{Weak} provided a much better fit to the data. Because it was not necessary in the main text to distinguish the three strengths of Bayesian models, we refer to Bayes_{Weak} there simply as Bayes. However, we do include Bayes_{Strong} and Bayes_{Ultrastrong} in our model recovery analysis (described below) and in our supplemental model comparison tables.

Decision boundaries: In the Bayesian models without decision noise, we computed the measurement boundaries that correspond to the confidence boundaries k_r.

In the Bayesian models with decision noise, we assumed Gaussian noise on the decision variable, η_d ∼ N(0, σ_d), where σ_d is a free parameter. We pre-computed 101 evenly spaced draws of η_d and their corresponding probability densities N(η_d; 0, σ_d). We then used linear interpolation to find sets of measurement boundaries for each draw of η_d, and computed the response probabilities as the weighted average according to the densities N(η_d; 0, σ_d).

Like Bayes_{Ultrastrong}, it has 4 free boundary parameters. Although the model is a hybrid Bayesian-heuristic model, not a strictly Bayesian one, we refer to it as Bayes_{Ultrastrong} + precision in

In the Fixed model, the measurement boundaries do not depend on uncertainty: k_r(σ) = m_r.

In Lin and Quad, the boundaries are linear or quadratic functions of uncertainty: k_r(σ) = m_r + b_r σ (Lin) or k_r(σ) = m_r + b_r σ^{2} (Quad).
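The three boundary families can be sketched together (parameter values in the example are illustrative):

```python
def boundary(m_r, b_r, sigma, model="Lin"):
    """Measurement-space decision boundary for response r.
    Fixed: k_r = m_r;  Lin: k_r = m_r + b_r*sigma;  Quad: k_r = m_r + b_r*sigma**2."""
    if model == "Fixed":
        return m_r
    if model == "Lin":
        return m_r + b_r * sigma
    if model == "Quad":
        return m_r + b_r * sigma ** 2
    raise ValueError(model)
```

With b_r = 0, Lin and Quad reduce to Fixed, consistent with the supermodel relationship noted below.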

Lin and Quad are each a supermodel of Fixed. In other words, there are parameter settings where Lin and Quad are equivalent to Fixed (although our model comparison methods ensure that the models are still distinguishable, see “Model recovery” section). Additionally, in Task A, Quad is a supermodel of the Bayesian models without

All neurons have Gaussian tuning curves with variance

On each trial, we get some quantity that is a weighted sum of each neuron’s activity,

Rather than sum over all neurons, we assume an infinite number of neurons uniformly spanning all possible preferred stimuli

Now that we have the mean and variance of

A “full lapse” in which the category report is random, and confidence report is chosen from a distribution over the four levels defined by λ_{1}, the probability of a “very low confidence” response, and λ_{4}, the probability of a “very high confidence” response, with linear interpolation for the two intermediate levels.

A “confidence lapse” λ_{confidence} in which the category report is chosen normally, but the confidence report is chosen from a uniform distribution over the four levels.

A “repeat lapse” λ_{repeat} in which the category and confidence response is simply repeated from the previous trial.

In category choice models, we fit a standard category lapse rate λ, as well as the above “repeat lapse” λ_{repeat}.
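A simplified sketch of how two of these lapse types mix into the model’s response distribution is shown below (the 8-response coding and function names are our assumptions; the “full lapse” with interpolated confidence weights is omitted for brevity):

```python
def apply_lapses(p_model, prev_response, lam_conf, lam_rep):
    """p_model: 8 response probabilities (hypothetically, responses 0..3 for
    category 1, 4..7 for category 2). With probability lam_conf the confidence
    level is redrawn uniformly within the chosen category; with probability
    lam_rep the previous trial's response is repeated; otherwise the model's
    own distribution is used."""
    p_cat1 = sum(p_model[:4])
    p_cat2 = sum(p_model[4:])
    core = 1.0 - lam_conf - lam_rep
    return [core * p_model[r]
            + lam_conf * (p_cat1 if r < 4 else p_cat2) / 4.0
            + lam_rep * (1.0 if r == prev_response else 0.0)
            for r in range(8)]
```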

where c_L and c_H were the values of the lowest and highest reliabilities used. This way, σ_L and σ_H were free parameters that determined the s.d. of the measurement distributions for the lowest and highest reliabilities. In models with a non-parametric relationship between reliability and noise, we instead fit six free parameters (σ_{rel. 1}, …, σ_{rel. 6}) corresponding to each of the six reliability levels.

For models where subjects had incorrect knowledge about their measurement noise, we fit two sets of uncertainty-related parameters. One set was for the generative measurement noise (used in Eqs

All parameters that defined the width of a distribution (e.g., σ_L, σ_H, σ_d, σ_{rel. 1}, …) were sampled in log-space and exponentiated during the computation of the log likelihood. See

Rather than find a maximum likelihood estimate of the parameters, we sampled from the posterior distribution over parameters, which is proportional to the product of the likelihood and the prior.

We sampled from the probability distribution using a Markov Chain Monte Carlo (MCMC) method, slice sampling [

After sampling, we conducted a visual check to confirm that our parameter ranges were sufficiently large. For each model, we plotted the posterior distribution over parameter values for each subject; an example plot is shown in

Lapse rate parameters, which tend to mass around 0, where they are necessarily bounded.

Log noise parameters, which have a large negative range where they are effectively at zero noise.

Upper confidence boundary parameters, which become small for subjects who frequently report “high confidence,” or large for subjects who rarely do.

Each subplot represents a parameter of the model. Each colored histogram represents the sampled posterior distribution for a parameter and a subject in experiment 1, with colors consistent for each subject. The limits of the x-axis indicate the allowable range for each parameter. Black triangles indicate the overall mean parameter value.

Most information criteria (such as AIC, BIC, and AICc) are based on a point estimate of the parameters, typically θ_{MLE}, the maximum-likelihood estimate.
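For reference, the standard definitions of these criteria are (shown as a sketch, not the paper’s own code; k is the number of free parameters, n the number of trials, logL the maximized log likelihood):

```python
import math

def aic(logL, k):
    """Akaike information criterion: 2k - 2*logL."""
    return 2 * k - 2 * logL

def bic(logL, k, n):
    """Bayesian information criterion: k*ln(n) - 2*logL."""
    return k * math.log(n) - 2 * logL

def aicc(logL, k, n):
    """AIC with small-sample correction."""
    return aic(logL, k) + (2 * k * (k + 1)) / (n - k - 1)
```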

WAIC is a more Bayesian approach to information criteria that adds a correction for the effective number of parameters [

Although information criteria are computationally convenient, they are based on asymptotic results and assumptions about the data that may not always hold [. We therefore used leave-one-out cross-validation (LOO), approximated by importance sampling over the posterior samples, in which θ_u is the u-th posterior sample and w_{i,u} is the importance weight of trial i under sample u.
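A minimal sketch of the importance-sampling LOO estimate from posterior samples is below; note this omits the Pareto smoothing used in full PSIS-LOO, and the data layout is our assumption:

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def is_loo(loglik):
    """loglik[s][i]: log likelihood of trial i under posterior sample s.
    Each trial's held-out predictive density p(y_i | y_-i) is approximated by
    the harmonic mean of p(y_i | theta_s) over samples; returns summed LOO."""
    S = len(loglik)
    n = len(loglik[0])
    total = 0.0
    for i in range(n):
        total += -(logsumexp([-loglik[s][i] for s in range(S)]) - math.log(S))
    return total
```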

We determined that our results were not dependent on our choice of metric. We computed AIC, BIC, AICc, WAIC, and LOO for all models in the 8 model groupings, multiplying the information criteria by −1/2 to put them on the same scale as LOO; the information criteria were evaluated at θ_{MLE}. Then we computed Spearman’s rank correlation coefficient for every possible pairwise comparison of model comparison metrics for all model and dataset combinations, producing 80 total values (8 model groupings × 10 possible pairwise comparisons of model comparison metrics). All values were greater than 0.998, indicating that, had we used an information criterion instead of LOO, we would not have changed our conclusions. Furthermore, there are no model groupings in which the identities of the lowest- and highest-ranked models are dependent on the choice of metric. The agreement of these metrics strengthens our confidence in our conclusions.

To confirm that our sample size was large enough to trust our bootstrapped confidence intervals, we bootstrapped our bootstrapping procedure to see how the confidence intervals were affected by the number of subjects


Group level Bayesian model selection: We also used LOO scores to compute two metrics that allow for model heterogeneity across the group. The first metric was “protected exceedance probability,” the posterior probability that one model occurs more frequently than any other model in the set [

In all but one of the 8 model groupings, all three methods of metric aggregation identify the same overall best model. For example, in

Model fits were plotted by bootstrapping synthetic group datasets with the following procedure: For each task, model, and subject, we generated 20 synthetic datasets, each using a different set of parameters sampled, without replacement, from the posterior distribution of parameters. Each synthetic dataset was generated using the same stimuli as the ones presented to the real subject. We randomly selected a number of synthetic datasets equal to the number of subjects to create a synthetic group dataset. For each synthetic group dataset, we computed the mean output (e.g., button press, confidence, performance) per bin. We then repeated this 1,000 times and computed the mean and standard deviation of the mean output per bin across all 1,000 synthetic group datasets, which we then plotted as the shaded regions. Therefore, shaded regions represent the mean ±1 s.e.m. of synthetic group datasets.
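The group-level bootstrap just described can be sketched in simplified form (here per-subject bin means stand in for full synthetic datasets; the function name is ours):

```python
import random

def bootstrap_group_mean(subject_bin_means, n_boot=1000, rng=random):
    """Resample synthetic per-subject bin means with replacement to form
    bootstrap groups, average within each group, and summarize across
    n_boot groups. Returns (mean, s.d. across groups), plotted as mean ± 1 s.e.m."""
    boots = []
    for _ in range(n_boot):
        group = [rng.choice(subject_bin_means) for _ in subject_bin_means]
        boots.append(sum(group) / len(group))
    mean = sum(boots) / n_boot
    sd = (sum((b - mean) ** 2 for b in boots) / n_boot) ** 0.5
    return mean, sd
```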

For plots with orientation on the horizontal axis (e.g.,

We performed a model recovery analysis [

We found that the true generating model was the best-fitting model, on average, in all cases (

Shade represents the difference between the mean AIC score (across datasets) for each fitted model and for the one with the lowest mean AIC score. White squares indicate the model that had the lowest mean AIC score when fitted to data generated from each model. The squares on the diagonal indicate that the true generating model was the best-fitting model, on average, in all cases.

Ideally, we would have evaluated our model recovery fits using LOO, as we evaluated the fits to human data. However, LOO can only be obtained when fitting with MCMC sampling, which takes orders of magnitude longer than fitting with MLE. It would be impossible to fit all 352 synthetic datasets in a reasonable amount of time using the procedure and sampling quality standards described above (i.e., a large number of samples, across multiple converged chains). Furthermore, we do not believe that our model recovery is dependent on how the models are fit and the fits are evaluated; we found that AIC and LOO scores gave us near-identical model rankings for data from real subjects.

Models were fit jointly to Task A and B category and confidence responses. (a) Medians and 95% CI of bootstrapped sums of LOO differences, relative to the best model. Higher values indicate worse fits. (

(TIF)

Models were fit to Task A category and confidence responses. See

(TIF)

Models were fit to Task B category and confidence responses. See

(TIF)

Models were fit jointly to Task A and B category choices. See

(TIF)

Noise parameters were fit to Task A category choices and then fixed during the fitting of Task B category and confidence responses. See

(TIF)

Models were fit jointly to Task A and B category and confidence responses. See

(TIF)

Models were fit to Task B category and confidence responses. See

(TIF)

Models were fit to Task B category choices. See

(TIF)

In the main text, Bayes_{Weak}-

(TIF)

In the main text, Bayes_{Weak}-

(TIF)


Cells indicate medians and 95% CI of bootstrapped summed LOO score differences. A negative median indicates that the model in the corresponding row had a higher score (better fit) than the model in the corresponding column.

(PDF)

See

(PDF)

See

(PDF)

See

(PDF)

See

(PDF)

See

(PDF)

See

(PDF)

See

(PDF)

Each sheet corresponds with the sets of models pictured in

(XLS)

The authors would like to thank Luigi Acerbi for helpful ideas and tools related to model fitting and model comparison. We would also like to thank Luigi Acerbi, Rachel N. Denison, Andra Mihali, A. Emin Orhan, Bas van Opheusden, and Aspen H. Yoo for helpful conversations and comments about the manuscript.