Fig 1.
(A) RTNet architecture. RTNet consists of two modules. The first is a Bayesian neural network whose weights are stochastic, so that a unique feedforward CNN is instantiated at each processing step. Because of these stochastic weights, the network processes the same image at each time step yet generates noisy evidence through the activations of its output layer. The evidence for each choice option is then accumulated by the second module, an evidence accumulator, towards a pre-defined threshold. (B) Task. Subjects performed a digit discrimination task in which they were presented with a noisy image of a handwritten digit between 1 and 8. Subjects decided which digit was presented and then reported their confidence on a scale from 1 to 4. (C) The four experimental conditions. Task difficulty and speed-accuracy trade-off instructions were manipulated in a 2x2 factorial design. Task difficulty was adjusted by adding noise to the images, whereas the speed-accuracy trade-off was manipulated by instructing subjects to focus on either accuracy or speed.
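The accumulation step described in panel A can be illustrated with a minimal sketch. This is not the RTNet implementation: the Bayesian network's stochastic forward pass is replaced here by Gaussian noise around a hypothetical mean activation per choice option, and the function name, `noise_sd`, `threshold`, and `max_steps` are all assumptions made for illustration.

```python
import random

def accumulate_evidence(mean_evidence, noise_sd=1.0, threshold=5.0,
                        max_steps=1000, rng=None):
    """Accumulate noisy per-step evidence until one option reaches threshold.

    mean_evidence: hypothetical mean output activation for each choice option.
    Returns (choice_index, n_steps, accumulated_totals).
    """
    rng = rng or random.Random()
    totals = [0.0] * len(mean_evidence)
    for step in range(1, max_steps + 1):
        # Stand-in for one stochastic forward pass: each option receives
        # a noisy sample of evidence on every processing step.
        for i, mu in enumerate(mean_evidence):
            totals[i] += rng.gauss(mu, noise_sd)
        best = max(range(len(totals)), key=totals.__getitem__)
        if totals[best] >= threshold:
            # The number of steps taken plays the role of response time.
            return best, step, totals
    # No option reached threshold: report the current leader.
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, max_steps, totals

if __name__ == "__main__":
    choice, steps, totals = accumulate_evidence(
        [0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1], rng=random.Random(0))
    print(choice, steps)
```

In a sketch like this, lowering `threshold` yields faster but noisier decisions, which is one plausible way to mimic the speed-focus instructions in panel C.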
Fig 2.
Comparing the ability of confidence strategies to fit human confidence ratings.
We examined the quantitative fits of the seven confidence strategies (PE, BCH, Top2Diff, ProbTop2Diff, ProbAvgRes, Entropy and Softmax). The Top2Diff strategy outperformed all other confidence strategies by an average of at least 11 AIC points. Error bars depict 95% confidence intervals for the difference in AIC scores between Top2Diff and each of the remaining strategies.
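As a rough illustration of the comparison plotted here, the sketch below computes AIC from a log-likelihood using the standard definition (AIC = 2k - 2 ln L) and then the mean per-subject AIC difference between two strategies with a 95% interval. The normal approximation (1.96 standard errors) and the function names are assumptions; the caption does not state how the intervals were obtained.

```python
import math
from statistics import mean, stdev

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def mean_diff_ci(aic_reference, aic_other, z=1.96):
    """Mean per-subject AIC difference (other - reference) with an
    approximate 95% confidence interval (normal approximation, assumed)."""
    diffs = [o - r for r, o in zip(aic_reference, aic_other)]
    m = mean(diffs)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    return m, (m - half_width, m + half_width)

if __name__ == "__main__":
    # Made-up per-subject AIC values, for illustration only.
    reference = [100.0, 102.0, 98.0]
    other = [110.0, 114.0, 106.0]
    print(mean_diff_ci(reference, other))
```

A positive mean difference whose interval excludes zero would indicate that the reference strategy fits better than the comparison strategy.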
Fig 3.
Comparing the ability of confidence strategies to fit patterns of human confidence.
We examined the qualitative fits of the seven confidence strategies. (A) Confidence decreased with task difficulty. All confidence strategies were able to reproduce this qualitative pattern, but the Top2Diff, ProbAvgRes and ProbTop2Diff models provided the closest fits to the data. (B) There was no significant change in confidence between the speed-focus and accuracy-focus conditions. All models incorrectly predicted that confidence should be higher in the accuracy-focus than in the speed-focus condition, but the ProbAvgRes and ProbTop2Diff models provided the closest fits to the data. (C) Confidence increased with task performance for correct trials but decreased with task performance for error trials, giving rise to a folded-X pattern. All strategies except Entropy were able to reproduce this qualitative pattern, but the Top2Diff, ProbAvgRes and ProbTop2Diff models provided the closest fits to the data. Error bars depict SEM. SSE, sum of squared errors (smaller values indicate better fits).
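The SSE values reported in these panels follow the usual definition: the squared differences between observed and model-predicted values, summed. A minimal sketch is given below; the function name is hypothetical, and treating the inputs as condition-level mean confidence values is an assumption about what the figure compares.

```python
def sum_squared_errors(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

if __name__ == "__main__":
    # Made-up mean confidence values for two conditions, for illustration only.
    observed = [3.0, 2.5]
    predicted = [3.5, 2.5]
    print(sum_squared_errors(observed, predicted))
```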
Fig 4.
Stimulus category-wise predictions of average confidence for correct and error trials.
Confidence was uniformly greater for correct trials than for error trials across all eight stimulus categories. In addition, there were small variations in confidence across the eight categories, possibly because some digits are more prone to confusion with others. All confidence strategies correctly predicted that confidence was higher for correct trials than for error trials, but Top2Diff best captured the variations in confidence between the stimulus categories. In particular, all models except Top2Diff over-estimated confidence for error trials in at least two categories, with the PE, Softmax, and Entropy models over-estimating confidence in all stimulus categories. Error bars depict SEM. SSE, sum of squared errors (smaller values indicate better fits).
Fig 5.
Stimulus category-wise model predictions of individual confidence for correct choices.
(A) We computed average confidence for correct trials within each stimulus category separately for each individual subject and plotted these values against the corresponding quantities predicted by each model. All models generated strong and significant correlations with observed confidence, suggesting that they were able to capture stimulus-related variations in confidence. (B) The Top2Diff strategy generated the numerically highest correlations with human confidence. However, these correlations were not significantly higher than those generated by the ProbAvgRes, ProbTop2Diff, Entropy, and PE models. Error bars show 95% confidence intervals. (C) We fit a linear mixed model that quantified how closely each model captured observed category-wise variations in confidence, while controlling for repeated measurements across individuals. AIC scores derived from the linear mixed models showed that the Top2Diff strategy generated the best fits to category-specific confidence for correct trials, with an AIC difference of at least 10 points relative to every other strategy.
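The correlations in panels A and B relate observed category-wise confidence to model-predicted confidence. The sketch below shows one way to compute such a correlation, assuming a Pearson coefficient (the caption does not name the correlation type) and using a hypothetical function name. The linear mixed model in panel C is not reproduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    # Centered cross-product and sums of squares.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    # Made-up category-wise mean confidence for one subject and one model.
    observed = [3.1, 2.8, 3.4, 2.9]
    predicted = [3.0, 2.7, 3.5, 3.0]
    print(pearson_r(observed, predicted))
```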
Fig 6.
Stimulus category-wise model predictions of individual confidence for error trials.
(A) We computed average confidence for error trials within each stimulus category separately for each individual subject and plotted these values against the corresponding quantities predicted by each model. As with correct trials, the models generated strong and significant correlations with observed confidence, suggesting that they were generally able to capture stimulus-related variations in confidence. (B) The Top2Diff strategy generated the numerically highest correlations with human confidence, but these correlations were not significantly higher than those generated by the ProbAvgRes, ProbTop2Diff, PE and BCH models. Error bars show 95% confidence intervals. (C) We fit a linear mixed model that quantified how closely each model captured observed category-wise variations in confidence, while controlling for repeated measurements across individuals. AIC scores from these linear mixed models showed that the Top2Diff strategy generated substantially better fits to category-specific confidence for error trials, with an AIC difference of at least 62 points relative to every other strategy.