Fast and Accurate Learning When Making Discrete Numerical Estimates

Many everyday estimation tasks have an inherently discrete nature, whether the task is counting objects (e.g., a number of paint buckets) or estimating discretized continuous variables (e.g., the number of paint buckets needed to paint a room). While Bayesian inference is often used for modeling estimates made along continuous scales, discrete numerical estimates have not received as much attention, despite their common everyday occurrence. Using two tasks, a numerosity task and an area estimation task, we invoke Bayesian decision theory to characterize how people learn discrete numerical distributions and make numerical estimates. Across three experiments with novel stimulus distributions, we found that participants fell between two common decision functions for converting their uncertain representation into a response: drawing a sample from their posterior distribution and taking the maximum of their posterior distribution. While this was consistent with the decision function found in previous work using continuous estimation tasks, surprisingly, the prior distributions learned by participants in our experiments were much more adaptive: When making continuous estimates, participants have required thousands of trials to learn bimodal priors, but in our tasks participants learned discrete bimodal and even discrete quadrimodal priors within a few hundred trials. This makes discrete numerical estimation tasks good testbeds for investigating how people learn and make estimates.


Individual participant super-model parameters and model fits
This section gives the individual parameters fit to each of the participants across the three experiments. Table 1 gives the parameters for the participants in Experiment 1, Table 2 gives the parameters for the participants in Experiment 2, and Table 3 gives the parameters for the participants in Experiment 3.
In each of the tables, group names correspond to the group names used in the main text and participant numbers correspond to the participant numbers attached to the raw anonymized data. Parameters σ₁ and σ₂ represent the standard deviation parameters of the lognormal distributions from the first and second discrimination sessions respectively. For fitting purposes in Experiments 1 and 2, the mean of σ₁ and σ₂ was used, but for Experiment 3, because the discrimination sessions were different, σ₁ was used for the fitting.
For the super-model, n is the best-fitting number of samples taken from a posterior distribution exponentiated by β; the prior itself is updated after each feedback by a Gaussian kernel with standard deviation ψ. For large values of β the number of samples (n) becomes of little consequence (for example, for β > 2.7, with $X_t = 23$ and σ = 0.22, more than 95 percent of the probability is at the maximum a posteriori after learning for 300 trials). For ψ less than 0.0115, about 95 percent of the mass of the kernel is placed at the location of the feedback i, since $\int_{i-1}^{i+1} \mathcal{N}(\log(j); \log(i), (0.0115)^2)\, dj \approx 0.95$ for the largest value presented in the experiment (i = 32); thus the Dirichlet prior update mechanism is included as a special case.
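To make these mechanics concrete, here is a minimal sketch in Python/NumPy (the function names, the discrete-grid representation, and the averaging-of-n-samples decision rule are our illustration, not the original code; averaging nests the Sample case at n = 1 and, via large β, approaches the Max case):

```python
import numpy as np

def kernel_update(prior, values, feedback, psi):
    """Add a Gaussian kernel in log space, of width psi and centered on
    the feedback value, to the running (unnormalized) prior over the
    discrete response values."""
    kernel = np.exp(-0.5 * ((np.log(values) - np.log(feedback)) / psi) ** 2)
    return prior + kernel / kernel.sum()

def decide(posterior, values, beta, n, rng):
    """Exponentiate the posterior by beta, renormalize, draw n samples,
    and respond with their rounded average."""
    p = posterior ** beta
    p /= p.sum()
    return int(round(rng.choice(values, size=n, p=p).mean()))
```

With β = 1 and n = 1 this reduces to sampling from the posterior; as β grows the exponentiated posterior concentrates on its maximum, so the response approaches the maximum a posteriori regardless of n. As ψ shrinks, essentially all of the kernel mass lands on the feedback value itself, recovering the Dirichlet (count-based) update as a special case.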
For the model comparison, the three-letter model code indicates sequentially the type of prior (Dirichlet or Gaussian kernel), the decision function (Average, Max, or Sample), and the noise model (None, Softmax, or Trembling hand). A comparison of model posterior probabilities that first converts BIC values into Bayes factors [1] and then assumes equal priors for each of the models is shown in Figure 1 for the None models only and in Figure 2 for the full set of models. The parameter ψ was the standard deviation of the Gaussian kernel (if applicable) and β was the exponent applied to the posterior distribution before applying the decision function. Model details are given in the Materials and Methods section of the main text.
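This conversion can be sketched in a few lines (a minimal sketch with our own function name, using the standard exp(−BIC/2) approximation to the marginal likelihood):

```python
import numpy as np

def model_posteriors(bics):
    """Approximate posterior model probabilities from BIC values,
    assuming equal prior probability for each model: exp(-BIC/2) is
    proportional to the marginal likelihood, so ratios of these terms
    are approximate Bayes factors."""
    bics = np.asarray(bics, dtype=float)
    rel = np.exp(-0.5 * (bics - bics.min()))  # shift by the minimum for numerical stability
    return rel / rel.sum()

# e.g. model_posteriors([1002.3, 1000.1, 1010.8]) favors the second model
```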

Model verification
A crucial part of any computational modeling effort is verifying that the model is in fact doing what it is intended to do. This guards not only against errors in the programming of the model, but also against problems in the specification of the model itself [2]. A standard way to do this is by feeding the model data that have been generated through a known process with known parameters. In our case we chose 5 parameter sets (see Table 4) to characterize 5 fictional subjects, and simulated subject responses to 3 sets of stimuli. For each of these fictional data sets we then performed inference with the hierarchical 'super-model' to infer the most likely set of parameters to have generated the data set, and compared them to the generating parameters.
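The same workflow can be illustrated with a self-contained toy example; the generative model below is a deliberately simplified stand-in with a single lognormal noise parameter, not the actual super-model, and all names are ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def simulate(sigma, n_trials, true_value=23):
    """Stand-in generative model: lognormal noisy percepts of a fixed
    stimulus, rounded to discrete responses."""
    percepts = np.exp(np.log(true_value) + sigma * rng.standard_normal(n_trials))
    return np.round(percepts)

def fit(responses, true_value=23):
    """Recover sigma by maximum likelihood on the log responses."""
    def nll(s):
        z = (np.log(responses) - np.log(true_value)) / s
        return np.sum(np.log(s) + 0.5 * z ** 2)
    return minimize_scalar(nll, bounds=(0.01, 2.0), method="bounded").x

for true_sigma in [0.1, 0.22, 0.4]:         # known generating parameters
    recovered = fit(simulate(true_sigma, 300))
    print(true_sigma, round(recovered, 3))  # recovered values should be close
```

The actual verification applies this fit-and-compare step to all super-model parameters simultaneously, which is where trade-offs such as the interaction between n and β noted below can arise.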
Most of the inferences recovered parameter values close to the originals, except for set 14, where the inference process led to erroneous values of n and β (although note that for large values of the softmax β the number of samples from the posterior, n, becomes irrelevant).

A prior or a mapping?
This section addresses whether participants are actually learning a real prior distribution over the responses or whether they are learning a mapping: a set of learned boundaries that convert an impression of magnitude into a discrete response. In previous work, a linear-in-log-space mapping formed the basis of an influential model of numerosity judgments [3]. This type of mapping cannot explain our results: linear mappings, or monotonic transformations of linear mappings, cannot contort themselves to produce multimodal distributions. The response distributions in this experiment can only be produced by a mapping that expands the proportion of responses to 23-25 and 29-31 at the expense of responses 26-28, and thus is not a linear operation. Previous research tested a more general class of mappings with continuous responses by first training participants with a particular likelihood distribution and then observing how responses changed when they were tested with a wider likelihood distribution [4]. As shown in Equation 6 of the main text, if a prior is used then responses should on average be drawn more toward the prior when the width of the likelihood increases. For a mapping, however, there is no additional feedback, so the responses would not be drawn closer to the prior. In [4], the change in responses was consistent with a prior.
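For intuition (a standard conjugate-Gaussian illustration, not Equation 6 of the main text itself): with prior $\mathcal{N}(\mu_0, \sigma_0^2)$ and likelihood $\mathcal{N}(x, \sigma^2)$, the posterior mean is a precision-weighted average, so widening the likelihood shifts estimates toward the prior:

$$\hat{x} = \frac{\sigma_0^2\, x + \sigma^2\, \mu_0}{\sigma_0^2 + \sigma^2} \;\xrightarrow{\;\sigma \to \infty\;}\; \mu_0.$$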
Unfortunately, in discrete numerical estimation the results are not as simple to evaluate. To show why, an example of a mapping is shown in Figure 3. Here a single sample is taken from a lognormal distribution and the region into which the sample falls determines the response. As the width of the likelihood distribution increases, more responses will be made to the numbers 23 and 31, and as the width becomes very large nearly half of the responses will be made to each of these two numbers. Because the extreme responses are squashed into these numbers, the mapping can also predict that the mean response will move closer to the prior mean.
Instead of looking at the mean results when there is a wider likelihood distribution, we performed a quantitative model comparison. We compared the 'super-model' defined in the Materials and Methods section of the main text with a model that instantiates a mapping. Instead of using a prior distribution, the mapping model consisted of a series of thresholds that defined the boundaries between responses. As in Figure 3, the mass of the likelihood distribution falling into each region determined the probability of making that response. Like the 'super-model', the mapping model also used a fixed probability (0.001) of producing a uniform response over the range of responses from 1 to 100.
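A minimal sketch of this mapping rule follows (the function names and the placement of thresholds at log-midpoints are our illustration; in the fitted model the thresholds were free parameters):

```python
import numpy as np
from scipy.stats import norm

def mapping_probs(thresholds, stimulus, sigma, lapse=0.001):
    """Mapping model: each response's probability is the mass of the
    lognormal likelihood (centered on the stimulus, width sigma in log
    space) falling between its thresholds, mixed with a 0.001 uniform
    lapse over the full 1-100 response range."""
    cdf = norm.cdf(np.log(thresholds), loc=np.log(stimulus), scale=sigma)
    cdf = np.concatenate(([0.0], cdf, [1.0]))  # outer regions belong to the extreme responses
    return (1 - lapse) * np.diff(cdf) + lapse / 100

# Responses 23-31 with illustrative thresholds at log-midpoints
responses = np.arange(23, 32)
thresholds = np.exp((np.log(responses[:-1]) + np.log(responses[1:])) / 2)
for sigma in (0.1, 0.3, 1.0):  # as the likelihood widens, mass piles onto 23 and 31
    print(sigma, mapping_probs(thresholds, 27, sigma).round(2))
```

At sigma = 1.0 nearly half of the mass sits on each of 23 and 31, illustrating the squashing described above and why the mapping can also pull the mean response toward the middle of the range.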
For each individual participant in Experiment 3, best-fitting parameters for both models were found for the first estimation block (500 trials) using the likelihood width estimated from a discrimination block with the same display parameters. Then, using the likelihood width estimated from the other, more difficult discrimination block, both models made parameter-free predictions for the more difficult estimation block (200 trials). To prevent incorrect keypresses from influencing the results, responses less than 10 or greater than 100 were ignored in both the fitting and prediction stages. (This completely excluded the third participant of the difficult numerosity group, who made no responses greater than 10 in the more difficult estimation block.) We compared the models in two ways. First, we compared the two models' predictions for all numbers between 10 and 100 and found that thirteen of twenty participants were better fit by the prior distribution model than by the mapping model, most with large differences between the likelihoods of the predictions, demonstrating considerable heterogeneity in participant responses.
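The per-participant comparison in the prediction stage reduces to summing log predicted probabilities over the retained responses; a minimal sketch (our own function name, assuming pred_probs is an array indexed by response value, length 101):

```python
import numpy as np

def prediction_loglik(pred_probs, responses, lo=10, hi=100):
    """Sum the log probability a fitted model assigns to each response
    in the prediction block, ignoring responses outside [lo, hi],
    which are treated as keypress errors."""
    responses = np.asarray(responses)
    kept = responses[(responses >= lo) & (responses <= hi)]
    return np.log(pred_probs[kept]).sum()

# The model with the higher prediction_loglik better predicts that
# participant's difficult-block responses.
```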
It is possible that the mapping model predicted poorly because the training data did not constrain its parameters well: the parameters of the mapping model are more local, and participants very rarely made responses outside of the presented range of values. A follow-up analysis therefore grouped together all predictions and responses at each edge of the presented range with all values outside that range (i.e., 10-23 were grouped together and 29-100 were grouped together) and found that eleven of twenty participants were better fit by the prior model, providing further evidence that a prior was used, though only for some of the participants.
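The grouped comparison pools the predicted mass and the responses at the edges; a sketch under the same indexing assumption (grouping boundaries from the text; function name ours):

```python
import numpy as np

def collapse_edges(pred_probs, responses, lo=23, hi=29):
    """Pool predicted mass for responses 10-lo into the lo bin and for
    hi-100 into the hi bin, and clip responses to match. Index the
    grouped distribution with (response - lo)."""
    grouped = pred_probs[lo:hi + 1].copy()
    grouped[0] = pred_probs[10:lo + 1].sum()  # 10..lo pooled into the low edge
    grouped[-1] = pred_probs[hi:101].sum()    # hi..100 pooled into the high edge
    return grouped, np.clip(responses, lo, hi)
```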
Some of this heterogeneity is potentially due to how the task was structured: because the training and test trials were separated, participants may not have carried over their prior from the first session to the second. In particular, this would explain the behavior of the participant who only produced numbers below ten in the second and more difficult estimation block.

Correlations between model parameters and response time
If participants are performing a form of approximate inference in which effort is traded off against accuracy, we might expect a correlation between the 'super-model' parameters and the average response time of participants. To test this, we correlated average response time with the number of samples n, the posterior distribution exponent β, and the Gaussian kernel width ψ, both within each experiment and across all experiments (Table 5). Participants with an average RT more than three standard deviations away from the mean across participants were excluded, removing one participant from Experiment 1.
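A sketch of this analysis step (our own function name, using SciPy's pearsonr):

```python
import numpy as np
from scipy.stats import pearsonr

def rt_parameter_correlation(mean_rts, param_values):
    """Correlate participants' average response times with a fitted
    parameter, excluding anyone whose mean RT lies more than three
    standard deviations from the group mean."""
    rts = np.asarray(mean_rts, dtype=float)
    params = np.asarray(param_values, dtype=float)
    keep = np.abs(rts - rts.mean()) <= 3 * rts.std()
    return pearsonr(rts[keep], params[keep])  # returns (r, p-value)
```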
We found no correlations between response time and any of the model parameters that reached significance at the p < .05 level, either within individual experiments or across all of the experiments.