Bayesian regression explains how human participants handle parameter uncertainty

Accumulating evidence indicates that the human brain copes with sensory uncertainty in accordance with Bayes’ rule. However, it is unknown how humans make predictions when the generative model of the task at hand is described by uncertain parameters. Here, we tested whether and how humans take parameter uncertainty into account in a regression task. Participants extrapolated a parabola from a limited number of noisy points, shown on a computer screen. The quadratic parameter was drawn from a bimodal prior distribution. We tested whether human observers take full advantage of the given information, including the likelihood of the quadratic parameter value given the observed points and the quadratic parameter’s prior distribution. We compared human performance with Bayesian regression, which is the (Bayes) optimal solution to this problem, and three sub-optimal models, which are simpler to compute. Our results show that, under our specific experimental conditions, humans behave in a way that is consistent with Bayesian regression. Moreover, our results support the hypothesis that humans generate responses in a manner consistent with probability matching rather than Bayesian decision theory.

The human brain copes with sensory uncertainty in accordance with Bayes' rule. However, it is unknown how the brain makes predictions in the presence of parameter uncertainty. Here, we tested whether and how humans take parameter uncertainty into account in a regression task. Participants extrapolated a parabola from a limited number of noisy points, shown on a computer screen. The quadratic parameter was drawn from a prior distribution, unknown to the observers. We tested whether human observers take full advantage of the given information, including the likelihood function of the observed points and the prior distribution of the quadratic parameter. We compared human performance with Bayesian regression, which is the (Bayes) optimal solution to this problem, and three sub-optimal models, namely maximum likelihood regression, prior regression and maximum a posteriori regression, which are simpler to compute. Our results clearly show that humans use Bayesian regression. We further investigated several variants of Bayesian regression models depending on how the generative noise is treated and found that participants act in line with the more sophisticated version.

Bayesian regression | Psychophysics | Parameter uncertainty
The brain evolved in an environment that requires fast decisions to be made based on noisy, ambiguous and sparse sensory information, noisy information processing and noisy effectors. Hence, decisions are typically made under substantial uncertainty. The Bayesian brain hypothesis states that the brain uses the framework of Bayesian probabilistic computation to make optimal decisions in the presence of uncertainty (1)(2)(3). A large body of research has established that many aspects of cognition are indeed well described by Bayesian statistics. These include magnitude estimation (4), color discrimination (5), cue combination (6), cross-modal integration (7,8), integration of prior knowledge (9,10) and motor control (11)(12)(13).
Most of these experimental studies can be cast into the problem of estimating a hidden quantity from sensory input. Much fewer experimental studies have been performed on regression tasks (but see (14) for an overview, and, e.g., (15)(16)(17)). In a regression task, the aim is to learn the mapping from a stimulus x to an output y after having been exposed to a training data set D = {(x_i, y_i)}_{i=1}^N of N associations between stimulus x_i and its corresponding y_i. Since the mapping from x to y can be probabilistic, the aim of regression is to find an expression for p(y|x, D). Classification tasks (such as object recognition) or self-supervised tasks such as estimating the future position of an object from past observations are just a few examples from a long list of regression tasks performed on a daily basis.
The machine learning literature contains many solutions to the regression problem, such as nonlinear regression, support vector machines, Gaussian processes or deep neural networks (see (18) for an introduction). It is unclear, however, how humans perform regression tasks. Most of the machine learning solutions rely on the assumption that the mapping from x to y is parametrized by a set of parameters w, such that the original regression problem of finding p(y|x, D) is replaced by a parameter estimation problem, i.e., finding the best set of parameters w* for the parametrized mapping p(y|x, w*). However, this approach is not Bayesian since no uncertainty over the parameters w is included in the regression model. The Bayesian approach to regression proceeds in two steps (19). First, the posterior distribution over the parameters p(w|D) is computed from the observed data D. Then, this posterior is used to compute the posterior predictive distribution by integrating over the parameters:

p(y|x, D) = ∫ p(y|x, w) p(w|D) dw   [1]

Taking into account the uncertainty over parameters is particularly relevant for predictions when the data set size N is small compared to the number of parameters. Indeed, taking into account the uncertainty helps to generalize to unknown data and thereby alleviates overfitting. Parameter uncertainty also plays a key role in computing the prediction uncertainty.
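The two-step procedure of eq. [1] can be sketched numerically for a one-dimensional parameter: discretize w, weight each candidate's prediction by its posterior mass, and sum. This is a minimal illustration, not the models fitted in this study; the parabolic mapping y = w x², the grid ranges and the flat prior in the usage example are our illustrative assumptions.

```python
import numpy as np

def posterior_predictive(x_star, xs, ys, w_grid, prior, sigma):
    """Return (y_grid, p(y*|x*, D)) by marginalizing over w on a grid."""
    # likelihood of the data under each candidate w (model: y = w * x^2)
    resid = ys[None, :] - w_grid[:, None] * xs[None, :] ** 2
    log_lik = -0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2
    post = prior * np.exp(log_lik - log_lik.max())
    post /= post.sum()                        # discrete posterior p(w|D)
    # each candidate w predicts a Gaussian over y*; mix with posterior weights
    y_grid = np.linspace(-10.0, 10.0, 4001)
    mu = w_grid[:, None] * x_star ** 2
    comp = np.exp(-0.5 * ((y_grid[None, :] - mu) / sigma) ** 2)
    comp /= sigma * np.sqrt(2.0 * np.pi)      # normalized Gaussian densities
    return y_grid, post @ comp                # eq. [1] as a finite sum
```

With noiseless data generated at w = 1 and a flat prior, the predictive mass concentrates around y* = w x*²; with a bimodal prior and large noise, the same sum would produce a bimodal predictive, which is the effect exploited in the experiment below.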
Both of these aspects - overfitting on small data sets and lack of prediction uncertainty - currently limit the power of deep neural network models (20,21). These models have millions of parameters and their performance grows with the number of layers (22,23). To prevent overfitting, training these models requires ever larger and more expensive training sets.

Significance Statement

A central function of the brain is learning a model of the world from sensory inputs. Taking into account the noise and ambiguity of these inputs, the brain works with levels of belief rather than single values. Previous research mostly focused on how the brain represents its beliefs about the outside world. Here, we ask the following question: how do different levels of belief influence predictions made by the brain's model? We consider three methods of doing this, ranging from "not at all" to "totally", and compare them to human performance in a visual experiment. Our results suggest that brains are well aware of their degree of belief when making predictions.
It is interesting to note that classic DNNs do not use weight uncertainty and are therefore limited in their ability to compute the prediction uncertainty. Recently, the idea of computing the probability distribution over weights in DNNs and using it for prediction has gained traction and given rise to the so-called Bayesian Neural Network (BNN), for example (24,25). BNNs promise better performance in the low-data regime. Additionally, BNNs output their prediction uncertainty, which is crucial when the cost of decisions is unequally distributed, as is usually the case in behavioural tasks (26)(27)(28).
Here we ask if humans take advantage of Bayesian regression. To address this question, we conducted the following psychophysical experiment. Participants sat in front of a screen on which 4 points from a noisy parabola were shown. Their task was to correctly extrapolate the parabola, i.e., to find the vertical position of the parabola at a given horizontal location. The quadratic parameters of the parabolas were drawn from a bimodal prior distribution, designed to make the parabolas face either upwards or downwards. After recording the participant's response, we showed the parabola that generated the dots as feedback. This feedback enabled the participants to learn both the prior and the generative model. Because we wanted to test to what extent participants made decisions in accordance with a Bayesian regression strategy, we varied the noise level of the parabola. The rationale is that the higher the noise level, the higher the uncertainty about the correct parameter and, according to Bayesian regression, the more participants should rely on the prior. We found that Bayesian regression indeed explained participants' responses better than maximum likelihood regression and maximum a posteriori regression. The latter two models are widely used by applied statisticians (29)(30)(31)(32)(33)(34)(35)(36)(37) and also in psychophysical modelling (38) but fail to account for parameter uncertainty.

Sessions began with 10 practice trials with virtually no noise (σg = 10^-5), followed by 4 blocks of 50 trials of low noise (σg = 0.03). In session 1, the low noise blocks were followed by 8 blocks of 50 trials of medium noise (σg = 0.1), while in session 2, the low noise blocks were followed by 8 blocks of 50 trials of high noise (σg = 0.4). In total, each subject ran 400 trials per noise level. See Materials and Methods Section E for more details.

A. A novel paradigm to test regression.
In each trial, we chose the parameter w of a parabola y = wx^2 from a fixed prior distribution π(w). Participants were shown 4 dots which were on this parabola, but the parabola itself was not shown. We had 20 such 4-dot stimuli, which we repeated 20 times each. For a certain 4-dot stimulus Dj, the parameter wj was always the same. The x-positions of the 4 stimulus dots were fixed across all trials and were rather close to the vertex of the parabola. The parameter wj was either positive (parabola opening upwards) or negative (parabola opening downwards), with the same probability, i.e., 0.5. The prior distribution π(w) was bimodal with means of ±1 and standard deviations of 0.1 (see Materials and Methods Section A). We did not tell observers about the prior probability distribution. Also unbeknown to the observers, we jittered the dots for each stimulus Dj along the y-axis by adding zero-mean Gaussian noise of a certain level σg. The jitter was the same for all 20 repetitions of a stimulus Dj. A fifth dot was presented to the right of the 4-dot stimulus, always at the same x-position x*. Participants could move the fifth dot up and down along the y-axis by using the up and down arrow keys. Participants were asked to adjust the y-position so that the dot correctly extrapolated the parabola. After the adjustment, we showed the generating parabola and the adjusted point as feedback.
Because we wanted to test to what extent observers rely on prior information, we used low, medium and high values of σg. The rationale is that the higher the noise level, the higher the uncertainty (the less informative the likelihood) and the more participants rely on the prior. Thus, for each participant and each of the 3 noise levels, we ran 20 4-dot stimuli with 20 repetitions, which were randomly ordered (Figure 1, right). The set of responses to a 4-dot stimulus Dj is Rj = {r_1^(j), ..., r_20^(j)}.

B. The regression models. We considered four regression models. Maximum likelihood regression (ML-R) does not take the prior distribution into account and computes only the point estimate of w that maximizes the likelihood p(Dj|w). The maximum a posteriori model (MAP-R) combines the likelihood with the prior distribution to compute the mode of the posterior distribution p(w|Dj). The Bayesian regression (B-R) model M_BR takes the entire posterior distribution into account:

p(y*|x*, Dj, M_BR) = ∫ p(y*|x*, w) p(w|Dj) dw

Note that in the previous three models, we assumed that participants know the generative noise (σg) - an assumption that will be relaxed in Section D. As a null model, we included prior regression (P-R), which replaces the posterior with the prior, i.e., it does not use the likelihood. For more details, see Materials and Methods Section B. To model participants' responses, we also included the internal sources of noise arising during neural processing, decision making and the execution of motor action. To this end, we presented the same noise-free (σg = 0) stimulus 20 times and measured the variability of the responses σm. We then transformed the predictive distribution into a (predicted) response distribution over r by convolving with a Gaussian N(r − y*; 0, σm²) (see Materials and Methods Section C).
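The four rules can be made concrete with a small grid-based sketch of the parabola task. This is our illustration under stated assumptions (x-positions, grid resolution and ranges are not the authors' implementation); it returns the predictive mean of each model at x*.

```python
import numpy as np

# Bimodal prior over the quadratic parameter, as described in the text.
MU_PI, SIGMA_PI = 1.0, 0.1
W = np.linspace(-2.0, 2.0, 4001)          # parameter grid, step 0.001

def log_prior(w):
    a = -0.5 * (w - MU_PI) ** 2 / SIGMA_PI ** 2
    b = -0.5 * (w + MU_PI) ** 2 / SIGMA_PI ** 2
    return np.logaddexp(a, b)             # equal mixture weights, log scale

def fit(xs, ys, sigma_g):
    """Return ML point estimate, MAP point estimate and grid posterior."""
    log_lik = -0.5 * np.sum((ys[None] - W[:, None] * xs ** 2) ** 2, 1) / sigma_g ** 2
    log_post = log_lik + log_prior(W)
    w_ml = W[np.argmax(log_lik)]          # ML-R: likelihood peak only
    w_map = W[np.argmax(log_post)]        # MAP-R: posterior peak only
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                    # B-R keeps the whole posterior
    return w_ml, w_map, post

def predict_mean(x_star, w_ml, w_map, post):
    return {"ML-R": w_ml * x_star ** 2,
            "MAP-R": w_map * x_star ** 2,
            "B-R": (post @ W) * x_star ** 2,
            "P-R": 0.0}                   # prior mean of w is 0 by symmetry
```

At low noise the B-R posterior collapses onto one mode and agrees with ML-R/MAP-R; at very high noise it approaches the symmetric prior, so the B-R predictive mean tends toward 0 while the full predictive distribution becomes bimodal.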
Figure 2 shows the responses of a typical subject along with the model predictions. Both ML-R and MAP-R ignore one of the modes (here, the mode corresponding to a parabola which opens upwards). In addition, the parabola predicted by ML-R is much thinner (i.e., the absolute value of the quadratic parameter is high) than the parabola that the participant responded with, because ML-R ignores the prior but humans likely do not. In the low noise regime (σg = 0.03) (Figure 2 (B)), the participant's unimodal response distribution rules out the null model, which shows a second mode. In the high noise regime (σg = 0.4) (Figure 2 (D)), MAP-R and ML-R fail to account for the fact that the participant distributed their responses across both modes. In the intermediate noise regime (σg = 0.1) (Figure 2 (C)), the participant sometimes gave tightly clustered responses and sometimes distributed them across both modes. Generally, B-R captures the participant's responses best as compared to the other models.

C. B-R outperforms the other models. We showed the performance of a typical participant in Figure 2. Now, we compare model performance across all seven participants. For each of the seven participants individually, we computed the log probability that the model reproduces their responses, i.e., the y-position adjusted by the participant. We summed these log probabilities for all of the 20 stimuli Dj as a measure of the quality of the model. Figure 3 shows these values for each noise level and participant separately (left) and averaged across participants (right). The higher the value, the better the model performance.
For all noise levels, B-R is among the highest performing models. At the lowest and highest noise levels, MAP-R and P-R show performance similar to B-R, respectively. The results are consistent across participants. At low noise levels, MAP-R and B-R perform equally well because the stimulus is unambiguous. In this case, the posterior belief about the quadratic parameter is well approximated by a single value, the MAP. In contrast, at high noise, the prior is more important in determining the posterior belief about the quadratic parameter. In fact, the posterior closely approximates the bimodal prior. For this reason, B-R and P-R perform similarly well. At the medium noise level, the posterior belief is influenced by the prior and likelihood to approximately the same degree. In this case, B-R outperforms the other models because it takes full advantage of the posterior distribution, in particular its bimodality. At medium and high noise levels, even P-R clearly outperforms MAP-R and ML-R because the latter two models predict unimodal responses, i.e., they predict that the parabola either opens up- or downwards depending on the stimulus Dj. Their failure to model responses belonging to the other mode is strongly punished by the log-likelihood measure because the model predicts a close-to-zero probability for these responses.

D. B-R explains the generative noise-dependent increase in response variance.

When participants responded with a position that was far from a mode of the predicted response distribution, the log-likelihood strongly punished the corresponding model. To complement the log-likelihood analysis with a metric that is more sensitive to the overall response distribution, we used the variance. For each stimulus Dj, we computed the variance of the 20 recorded responses. For example, if all responses are located in the upper mode, the variance is close to zero. If they are distributed evenly across both modes, the variance is close to 16, which corresponds to the variance of the prior model P-R. To obtain a single value for each participant and each noise level σg, we averaged the variance values across stimuli j. This value represents the averaged variance of responses to stimuli generated at σg. Next, we compared the average variance with the predicted averaged variance of each model. To this end, we generated a large number of stimuli {Dj}_{j=1}^{1000} from σg and averaged the variance of the predicted response distributions (see Materials and Methods Section D).
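The variance metric itself is simple to state in code. A minimal sketch (our illustration; the per-mode response values ±4 in the usage example are the predictive positions w x*² at x* = 2 with w ≈ ±1):

```python
import numpy as np

def averaged_variance(responses_per_stimulus):
    """Mean over stimuli of the per-stimulus response variance.

    responses_per_stimulus: list of arrays, one array of repeated
    responses per 4-dot stimulus.
    """
    return float(np.mean([np.var(r) for r in responses_per_stimulus]))
```

If all responses sit in one mode the per-stimulus variance is near 0; if they split evenly between modes near +4 and −4 it approaches 16, the variance of the prior predictive.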
Figure 4 (top left) compares the averaged variance of the participants' responses (gray) with the model predictions (color-coded). The B-R variance changes strongly with the noise level. The reason is that the generative noise σg determines the relative contributions of likelihood and prior to the responses; thus, the predicted responses change from unimodal to bimodal in B-R. As a result, the B-R (red) variance values smoothly transition from the MAP-R/ML-R variances at the low noise level (σg = 0.03) to the P-R variance at the high noise level (σg = 0.4). This is not the case for the other models. There is some discrepancy between the data from the participants and the prediction of B-R. At high noise (σg = 0.4), the participants' variance increases more slowly than B-R predicts. This means that participants cluster their responses close to one of the modes more often than predicted.
To explain the discrepancy, we tested two additional variants of B-R: full Bayesian regression (B-Rf) and three-point Bayesian regression (B-R3). Both are based on the idea that participants do not regard the generative noise as constant but rather infer (in B-Rf) or estimate (in B-R3) it on a per-trial basis. When participants assumed that the generative noise was smaller than the one we used to generate the points, their responses were more clustered. As a consequence, the average variance decreased. This happened when the four stimulus points were by chance aligned on a parabola. When we computed the averaged variance with classical B-R, we used the true generative noise, which was, in this case, larger than the estimated noise. Since the noise determines how much participants rely on the prior and likelihood relative to each other, the averaged variance of the prediction was larger than in the data. Both additional variants of B-R consider the noise on a trial-by-trial basis, but they differ in how they do it. While B-R only infers the quadratic parameter w, B-Rf extends the inference to the noise level σg as well. Since the generative model does not specify a prior over the (fixed) hyperparameter noise σg, we used an exponential prior and set its mean to the fixed hyperparameter (see Materials and Methods eq. (9)). B-Rf uses the joint posterior over w and σg to predict further responses. When the 4-dot stimulus closely resembled a parabola, the posterior belief over σg was highly peaked around a value that was smaller than the actual generative noise parameter. Thus, we expected more clustered responses and predicted a lower averaged variance.
B-R3 differs from B-R in that it includes a preprocessing step of estimating the generative noise on a trial-by-trial basis while treating the leftmost or rightmost stimulus point as an outlier. The noise estimate is then used in standard B-R. The easiest approach would be to use the maximum-likelihood estimator of the noise with all four stimulus points. Often, the maximum likelihood estimate was smaller than the fixed hyperparameter, reducing the averaged variance (see Supplementary Information). Data and prediction matched almost perfectly when we allowed for the possibility of disregarding outliers during the estimation (see Materials and Methods Section B).
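The B-R3 preprocessing step can be sketched as follows: fit the parabola by maximum likelihood twice, once without the leftmost and once without the rightmost point, take the RMS residual as the noise estimate each time, and keep the smaller of the two (the exact ML noise formula is our reading of the procedure, not the authors' code; the data in the usage example are illustrative).

```python
import numpy as np

def ml_noise(xs, ys):
    """ML fit of y = w x^2, then RMS residual as the noise estimate."""
    phi = xs ** 2
    w_hat = (phi @ ys) / (phi @ phi)         # closed-form ML fit
    return float(np.sqrt(np.mean((ys - w_hat * phi) ** 2)))

def three_point_noise(xs, ys):
    """Keep the smaller of the two leave-one-end-out noise estimates."""
    order = np.argsort(xs)
    without_left, without_right = order[1:], order[:-1]
    return min(ml_noise(xs[without_left], ys[without_left]),
               ml_noise(xs[without_right], ys[without_right]))
```

When one end point is an outlier but the remaining three points lie on a parabola, the three-point estimate is much smaller than the four-point ML estimate, which is exactly the mechanism that produces the near-zero-variance predictions discussed above.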
Figure 4 (top right and bottom) shows the variance distributions predicted by B-R, B-Rf and B-R3 (red; mean: line). The data (gray; mean: black) are pooled across participants. In the data, there is a portion of almost zero-variance responses at all noise levels. The fraction of these highly clustered responses decreases as the generative noise increases but remains present in all experimental conditions. Only B-R3 captures this aspect of the data. Besides this, the variances associated with both variants of B-R rise more slowly than that of the original B-R, supporting the idea that the generative noise is indeed jointly estimated with the quadratic parameter. The comparison between measured and predicted responses strongly favours B-R. B-R with σg as a fixed hyperparameter overestimates the increase in variance compared to the data. With the two variants of B-R, we show that this discrepancy can be explained by assuming that participants estimate the generative noise on a per-trial basis. The two additional variants of B-R perform slightly worse than the original B-R in terms of the log-likelihood but still better than all other models (see Supplementary Information).
In summary, both the likelihood-based model comparison and the response variance results strongly favour B-R.

Discussion
Participants adjusted a dot such that it fit on a parabola determined by four other dots. We used the variance and the log-likelihood to compare the participants' responses to the predictions of four regression models: ML-R, MAP-R, P-R and B-R. Out of these, B-R best explained the participants' responses across different levels of stimulus noise. In particular, B-R interpolated between the variances of MAP-R at low noise levels and P-R at high noise levels. We found the same trend in the data, although B-R predicted larger values for the variance at high noise levels. We could explain this discrepancy with the assumption that participants estimate the noise level per trial and then apply it. We conclude that in our task, participants are capable of implicitly performing, or at least approximating, a complicated integral over the posterior distribution of the parameter given the data.
The rationale behind our experimental design was to use the simplest possible setup to study whether humans perform Bayesian regression. Our one-dimensional response space was easy to visualize, and the amount of data needed to compare the predictive and empirical distributions was limited. Bayesian regression easily extends beyond the simple low-dimensional problem of a single quadratic nonlinearity to problems with higher-dimensional parameters, such as polynomials of arbitrary degree, and to other basis functions, such as Gaussians. Future research needs to study up to what level of complexity humans perform Bayesian regression. There is some evidence that the brain uses sampling to represent high-dimensional distributions (39). This representation of distributions scales to high dimensions and offers a potential way of evaluating the high-dimensional integral needed for Bayesian regression.
In our analysis, we assumed that participants know the generative model, including the prior over the parameters. Without this assumption, we would have had to account for potential temporal dynamics of learning with a participant-specific, time-dependent prior. For instance, it might take participants a non-negligible amount of time to learn the generative model, or their responses could be influenced by immediately preceding trials. To avoid such complications, we showed the generative parabola after each trial.
Our work was inspired by the growing emphasis on parameter uncertainty in the machine learning community; however, it is important to highlight that function learning and extrapolation have been studied before. The function learning literature has addressed which types of functions humans can learn (40), how batch or sequential data representation affects learning (17), to what extent human behaviour can be modelled by parametric functions (41) and how well humans extrapolate (16). However, to the best of our knowledge, none of these studies has conducted a minimal experiment to establish that humans behave in accordance with Bayesian regression. Our contribution will help to better understand the brain's remarkable ability to learn and generalise from very little data and underpins the power of Bayesian regression as a framework in psychophysical modelling.

Materials and Methods
A. Stimulus generation from the bimodal prior. Here, we describe in detail how stimuli are generated. On the j-th trial, participants are presented with a stimulus consisting of N = 4 points in a 2-dimensional space: Dj = {(x_1, y_1^(j)), ..., (x_4, y_4^(j))}. For each stimulus, we fix the x-values and generate the y-coordinates from a Gaussian generative model with a parabolic non-linearity and the generative parameter w_j:

y_i^(j) = w_j x_i^2 + ε_i,  ε_i ~ N(0, σg²).

The parameter w_j is drawn from a mixed Gaussian prior

π(w) = c N(w; μπ, σπ²) + (1 − c) N(w; −μπ, σπ²),

where the parameter set γ = (μπ, σπ², c) consists of the mean μπ = 1, the standard deviation σπ = 0.1 and the mixing coefficient c = 1/2. We denote the total set of hyperparameters (suppressed for notational clarity), from the prior and the generative probability, by α = (γ, σg²). Each parameter w_j corresponds to a generative parabola. Given this model and given a stimulus Dj, we asked participants to predict the y-component y* at x* = 2, which is equivalent to mentally fitting a parabola to the four stimulus points and estimating the point of intersection with a vertical line at x*. To train participants on the generative model and the prior, we showed participants the generative parabola after each trial. In total, we showed a set of 20 unique stimuli for each of the three noise levels σg ∈ {0.03, 0.1, 0.4}, and each unique stimulus was repeated 20 times. We denote the set of the 20 responses to the j-th stimulus as Rj = {r_1^(j), ..., r_20^(j)}. This amounts to a total of 400 trials per noise level. The order of the stimuli was randomized. Figure 1 shows the set-up.
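The generative procedure above can be sketched in a few lines. This is a hedged illustration of the described process, not the authors' experiment code; the four x-positions are an assumption (the text only states they were fixed and near the vertex).

```python
import numpy as np

rng = np.random.default_rng(0)
MU_PI, SIGMA_PI, C = 1.0, 0.1, 0.5        # prior parameters from the text
XS = np.array([-0.6, -0.2, 0.2, 0.6])     # assumed fixed x-positions

def draw_w():
    """Sample w from the bimodal prior: pick a mode, then add N(0, 0.1^2)."""
    sign = 1.0 if rng.random() < C else -1.0
    return rng.normal(sign * MU_PI, SIGMA_PI)

def make_stimulus(sigma_g):
    """One 4-dot stimulus: parabola y = w x^2 plus y-jitter of level sigma_g."""
    w = draw_w()
    ys = w * XS ** 2 + rng.normal(0.0, sigma_g, size=XS.size)
    return w, ys
```

In the experiment the jitter was frozen per stimulus and reused across its 20 repetitions; the sketch draws fresh jitter on every call.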

B. Regression models.
In each trial, the participants carried out an inference step and a prediction step. During the inference step, they inferred information about the quadratic parameter w_j based on the presented data (i.e., stimulus) Dj. The information they inferred depends on the inference model participants use. The inferred information was then used for a subsequent prediction y*. Therefore, the participants' overall task was to compute the predictive distribution p(y*|x*, Dj, M).

Prior regression (P-R) is our null model. P-R assumes that participants make predictions based on their prior belief but disregard information from the stimulus:

p(y*|x*, Dj, M_PR) = ∫ p(y*|x*, w) π(w) dw

Maximum likelihood regression (ML-R) relies only on the likelihood-maximizing parameter w_ML:

p(y*|x*, Dj, M_ML) = p(y*|x*, w_ML) with w_ML = argmax_w p(Dj|w)

Maximum a posteriori regression (MAP-R) uses the parameter that maximizes the posterior p(w|Dj) = p(Dj|w)π(w)/p(Dj):

p(y*|x*, Dj, M_MAP) = p(y*|x*, w_MAP) with w_MAP = argmax_w p(w|Dj)

Bayesian regression (B-R) uses the entire posterior for making predictions by marginalizing over it:

p(y*|x*, Dj, M_BR) = ∫ p(y*|x*, w) p(w|Dj) dw with p(w|Dj) = p(Dj|w) π(w) p⁻¹(Dj)   [8]

Full Bayesian regression (B-Rf) loosens the assumption that participants treat σg as a hyperparameter. Including σg with an exponential prior π(σg) in Bayesian regression generalizes eq. (8):

p(y*|x*, Dj, M_BRf) = ∫∫ p(y*|x*, w, σg) p(w, σg|Dj) dw dσg   [9]

We chose an exponential prior because it has only one free parameter. By setting the expectation over σg equal to the actual hyperparameter σg, we eliminated this degree of freedom and retained our fitting-free approach (see Supplementary Information). Three-point regression (B-R3) assumes that participants use an estimate σ̂g based on M = 3 points. The posterior predictive is computed exactly as in ordinary B-R (eq. (8)) except for the replacement σg → σ̂g, where σ̂g is computed in a two-step procedure. First, we obtained two maximum likelihood estimates of the generative noise by ignoring the leftmost and the rightmost point, respectively. Second, the minimum of these two was σ̂g. We also tested M = 4 as an alternative but concluded that M = 3 gave a more convincing account of the observed variance (see Supplementary Information Figure S1).
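For the scalar parabola task, the posterior p(w|Dj) used by MAP-R and B-R is available in closed form, because the mixed Gaussian prior is conjugate to the Gaussian likelihood (the derivation is in the supplementary material). A hedged sketch of that update, written for scalar w with features φ_i = x_i²; variable names are ours:

```python
import numpy as np

def posterior_mixture(phi, ys, sigma_g, mu_pi=1.0, sigma_pi=0.1, c=0.5):
    """Conjugate update for y_i ~ N(w*phi_i, sigma_g^2) with a
    two-component Gaussian prior over scalar w.
    Returns (component means, shared posterior std, component weights)."""
    s2 = 1.0 / (1.0 / sigma_pi ** 2 + (phi @ phi) / sigma_g ** 2)  # post. var
    mus, logws = [], []
    for mu0, c0 in [(+mu_pi, c), (-mu_pi, 1.0 - c)]:
        mu = s2 * (mu0 / sigma_pi ** 2 + (phi @ ys) / sigma_g ** 2)  # post. mean
        # component weight from completing the square (data-only terms cancel)
        logws.append(np.log(c0) - 0.5 * mu0 ** 2 / sigma_pi ** 2
                     + 0.5 * mu ** 2 / s2)
        mus.append(mu)
    logws = np.array(logws)
    wts = np.exp(logws - np.logaddexp(logws[0], logws[1]))
    return np.array(mus), np.sqrt(s2), wts
```

At low generative noise the likelihood selects one component (weight near 1); at high noise both components survive with near-equal weight, which is what makes the B-R predictive bimodal.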

C. Participants' internal noise.
To predict the participants' responses r from the regression models' output y*, we had to account for the internal noise of the participants. We did this by showing a noise-free parabola and fitting a Gaussian with variance σm² to each participant's response variability: p(r|y*) = N(r; y*, σm²). As explained in more detail in the Supplementary Information, the predicted response distribution is then

p(r|Dj, x*, M) = ∫ p(r|y*) p(y*|x*, Dj, M) dy*.   [10]

D. Model comparison. We used the log-likelihood and the variance to compare the predicted and empirical response distributions.

Log likelihood. To compute the log likelihood for a model M across all responses at a given noise level σg, we summed the individual log likelihoods of each response r (the log of eq. (10)) across all stimuli Dj.

Variance prediction. As an independent comparison of the data and the predicted response distribution, we used the variance. For each of the 20 stimuli Dj, we obtained a single empirical value from the 20 responses recorded. Hence, at each noise level σg, we have a variance distribution, where high variance values reflect ambiguous and difficult stimuli while low values indicate easy stimuli, prompting participants to give very similar responses across trials. The predicted variance distribution is computed analogously, where we used 1000 stimulus samples and computed the variance analytically (see Supplementary Information). We use this distribution to compute the means and quantiles shown in Fig. 4 (top left) and the color-coded density in the other plots.

Log likelihood for Bayesian Regression variants
In the main text, we introduced two additional variants of Bayesian regression (B-R): B-R3 and full B-Rf. Figure S2 (A) displays the log-likelihood of all variants of B-R (red), prior regression (green) as well as ML-R (blue) and MAP-R (yellow). Out of the three B-R variants, the classical version performs best, then B-Rf and finally B-R3. At σg = 0.1 and σg = 0.4, B-R3 describes the data worse than the prior. However, when we compute the log-likelihood for the entire data set by summing over σg, B-R3 still slightly outperforms the prior, as shown in Figure S2 (B). Note that the additional B-R variants perform better than the original B-R in terms of the variance metric (see main text) but worse in terms of the likelihood. The reason is that the log-likelihood strongly punishes responses far from the prediction, while the variance measure is less sensitive to wrong answers and instead measures the clustering of responses. Based on their trial-by-trial estimation, the additional variants of B-R assume a smaller level of generative noise. This leads to more confident predictions. Hence, when a response does not align with the prediction, it is punished more strongly than in the case of classical B-R.

Derivations of predictive distributions
Section B of the Materials and Methods of the main text formally expresses the inference and prediction steps of maximum likelihood regression (ML-R), maximum a posteriori regression (MAP-R), Bayesian regression (B-R) and prior regression (P-R), as well as the two variants of B-R, full Bayesian regression (B-Rf) and three-point B-R (B-R3). In order to turn these formal expressions into explicit functions of the parameters, a couple of algebraic steps are required. In the following, we report these steps for a more general case than the one presented in the main text. Specifically, we generalize the one-dimensional curvature parameter w to arbitrary dimension p, consider any real nonlinear function φ(x) rather than a simple quadratic, and allow a mixed Gaussian prior without the symmetry imposed in the main text.
A. Notation. First, we highlight the notation used during the derivation as compared to the main text. The fact that the parameter and nonlinear function are now vector-valued is indicated with bold symbols: for example, a polynomial of order p − 1 can be expressed by y(x) = wᵀφ(x) by setting φ(x) = Σ_{k=0}^{p−1} x^k ê_k, where ê_k is a unit vector. In the main text, we denote the j-th stimulus by Dj = (x^(j), y^(j)), where the bold notation indicates that we represent the x and y components of the N = 4 stimulus points as N-dimensional vectors. In what follows, it is unnecessary to distinguish between different stimuli; however, it will be necessary to sum over the set of points that make up a single stimulus. Because of this, we will use the index i to iterate over the points in a stimulus D and omit j. During the derivations, we frequently use multiplications of the form Σ_i φ(x_i)φ(x_i)ᵀ, where the result is a D × D matrix. To avoid confusion, we will always make the sum over the data points explicit with the running index i. We use bold notation for vectors of dimensionality D, such as w and φ. Finally, we use π(w) and ρ(w) as short-hands for the prior and posterior, respectively. With these preliminaries, we are ready to work through the algebra for each of the three models.

B. Maximum likelihood parameters. ML-R uses the likelihood-maximizing parameter w_ML:

p(y*|x*, D_j, M_ML) = p_{σ_g}(y*|x*, w_ML) with w_ML = argmax_w p(D_j|w). [5]

The maximizing parameter is

w_ML = ( ∑_i φ_i φ_i^T )^{-1} ∑_i y_i φ_i. [6]

This follows from eq. (5) because we can find the maximum by setting the log-likelihood gradient to zero:

∇_w log p(D|w) = σ_g^{-2} ∑_i (y_i − w^T φ_i) φ_i = 0. [7]

Solving this for w yields eq. (6).
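As a concrete illustration, here is a minimal least-squares sketch of eq. (6). This is our own code, not the authors' implementation; the default feature map assumes the one-dimensional quadratic φ(x) = x² of the main text.

```python
import numpy as np

def ml_fit(x, y, phi=lambda x: np.stack([x**2], axis=-1)):
    """Closed-form ML estimate, eq. (6):
    w_ML = (sum_i phi_i phi_i^T)^{-1} sum_i y_i phi_i."""
    P = phi(np.asarray(x, float))          # N x D design matrix, rows phi_i^T
    A = P.T @ P                            # sum_i phi_i phi_i^T
    b = P.T @ np.asarray(y, float)         # sum_i y_i phi_i
    return np.linalg.solve(A, b)

# Noise-free points on y = 2 x^2 recover the curvature exactly.
x = np.array([-2.0, -1.0, 1.0, 2.0])
print(ml_fit(x, 2.0 * x**2))   # -> [2.]
```

Passing a richer `phi` (e.g., stacking several powers of x) covers the general p-dimensional case discussed above.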
C. Maximum a posteriori parameters. MAP-R uses the parameter that maximizes the posterior p(w|D_j) = p(D_j|w) π(w) / p(D_j):

p(y*|x*, D_j, M_MAP) = p_{σ_g}(y*|x*, w_MAP) with w_MAP = argmax_w p_{σ_g}(w|D_j). [8]

In order to find the global maximum of the posterior, we explicitly compute the posterior over the parameters w. This can be done in closed form because our mixed Gaussian prior is the conjugate prior of the Gaussian likelihood. Since the posterior is again a Gaussian mixture, there are only two potential candidates for the maximizing parameter, the means of the two components (indicated by the ± symbol):

μ_ρ^(±) = Σ_ρ^(±) ( (Σ_π^(±))^{-1} μ_π^(±) + σ_g^{-2} ∑_i y_i φ_i ), [9]
Σ_ρ^(±) = ( (Σ_π^(±))^{-1} + σ_g^{-2} ∑_i φ_i φ_i^T )^{-1}. [10]

To determine which of the two posterior modes is the global maximum, we numerically evaluate:

w_MAP = argmax_{w ∈ {μ_ρ^(+), μ_ρ^(−)}} ρ(w). [11]

To derive eq. (9), we start from the mixed prior:

π(w) = ∑_± c_π^(±) N(w; μ_π^(±), Σ_π^(±)), [12]

where ± again denotes the two modes. After the presentation of a stimulus D consisting of N points, we show that the posterior, indexed by ρ rather than π, resumes the form of a mixed Gaussian:

ρ(w) = p(D|w) π(w) p^{-1}(D) = p^{-1}(D) ∑_± C^(±)(w) = ∑_± c_ρ^(±) N(w; μ_ρ^(±), Σ_ρ^(±)). [13]

The symbol C^(±) represents the terms in each component that contain the gist of the following algebraic manipulations. To get explicit expressions for the parameters of the posterior, i.e., c_ρ^(±), μ_ρ^(±) and Σ_ρ^(±), we evaluate the product of two Gaussians in each summand individually. To this end, we use completion of the square in the exponent χ, introduced shortly. For a concise notation we temporarily omit the index ±. Copying from the second line in eq. (13), the important terms in each component read:

C = c_π |2πΣ_π|^{-1/2} (2πσ_g^2)^{-N/2} exp(χ), with χ = −(1/2σ_g^2) ∑_i (y_i − w^T φ_i)^2 − (1/2)(w − μ_π)^T Σ_π^{-1} (w − μ_π). [14]

To complete the square, we first group terms in powers of w:

χ = −(1/2)(w − μ_ρ)^T Σ_ρ^{-1} (w − μ_ρ) + (1/2) μ_ρ^T Σ_ρ^{-1} μ_ρ − (1/2) μ_π^T Σ_π^{-1} μ_π − (1/2σ_g^2) ∑_i y_i^2, [15]

where we already reported the result in terms of the posterior mean and variance (eqs. (9) and (10)). Note that only the first term in eq. (15) depends on w. It represents the exponent of one of the (Gaussian) mixture components of the posterior. The remaining terms go into the update of the mixture coefficients c_ρ. Substituting the simplified exponent eq.
(15) back into eq. (14) and multiplying and dividing by |2πΣ_ρ|^{1/2} to expose the normalized Gaussian mixture component, we have:

C = c_π |2πΣ_ρ|^{1/2} |2πΣ_π|^{-1/2} (2πσ_g^2)^{-N/2} exp( (1/2) μ_ρ^T Σ_ρ^{-1} μ_ρ − (1/2) μ_π^T Σ_π^{-1} μ_π − (1/2σ_g^2) ∑_i y_i^2 ) N(w; μ_ρ, Σ_ρ). [16]

Comparing this expression with eq. (13), we identify the (unnormalized) mixing coefficients of the posterior as C without the w-dependent factor:

c̃_ρ = c_π |Σ_ρ|^{1/2} |Σ_π|^{-1/2} exp( (1/2) μ_ρ^T Σ_ρ^{-1} μ_ρ − (1/2) μ_π^T Σ_π^{-1} μ_π ), [17]

where factors common to both components are dropped because they cancel upon normalization. The normalization then follows from the evidence term in eq. (13). Now, labeling the two components of the posterior explicitly with ± again, we compute the evidence:

p(D) ∝ ∑_± c̃_ρ^(±), such that c_ρ^(±) = c̃_ρ^(±) / ∑_± c̃_ρ^(±). [18]

With eqs. (9), (10) and (18) we have explicit expressions for the parameters of the mixed Gaussian that results from combining the data points D with the prior. This enables us to efficiently update the prior: because the mixed Gaussian is a conjugate prior for the Gaussian likelihood, the posterior maintains the form of a mixed Gaussian.

D. Posterior predictive distribution. B-R uses the entire posterior for making predictions by marginalizing over it:

p(y*|x*, D_j, M_BR) = ∫ p_{σ_g}(y*|x*, w) p_{σ_g}(w|D_j) dw, with p_{σ_g}(w|D_j) = p_{σ_g}(D_j|w) π(w) p^{-1}(D_j). [20]

To get an explicit expression for the posterior predictive in Bayesian regression, we have to solve the integral in eq. (20): a posterior-weighted average of forward predictions. The posterior predictive will be evaluated at the location x*, and we use φ* := φ(x*). To reduce notation, the posterior predictive variable is simply denoted as y (instead of y* in the main text).

The algebraic steps to go from posterior to posterior predictive are similar to those of the previous section, where we moved from prior to posterior. Again, the resulting expression is a mixed Gaussian, and we adopt a similar notation as in the previous section, using the subindex β (instead of ρ for the posterior or π for the prior) to indicate the posterior predictive. As before with C, we introduce the symbol B as a shorthand for the relevant terms:

p(y|x*, D, M_BR) = ∑_± ∫ c_ρ^(±) N(y; w^T φ*, σ_g^2) N(w; μ_ρ^(±), Σ_ρ^(±)) dw =: ∑_± B^(±)(y) = ∑_± c_β^(±) N(y; μ_β^(±), σ_β^{2(±)}). [22]

Note that μ_β is a one-dimensional quantity and is therefore not written as a bold symbol. As before, we explicitly evaluate one of the two Gaussian mixture components in y by rearranging the exponent. Omitting the label ± and marking all auxiliary variables with a tilde, we copy one of the two mixture components in eq. (22), whose exponent χ_0 contains terms in both w and y. We first group terms in powers of w and rearrange them by completion of the square to obtain a Gaussian in w that we integrate out (only its normalization constant, a factor of |2πΣ̃|^{1/2}, remains); this step is essentially the same as updating a Gaussian mixture with the single data point d = (x*, y), and the new mean and variance appear as auxiliary parameters (marked with a tilde). Next we turn to the reduced exponent χ_1, group terms in powers of y, and complete the square to identify the mixture component of the posterior predictive, a Gaussian in y with mean and variance

μ_β^(±) = φ*^T μ_ρ^(±), [29]
σ_β^{2(±)} = σ_g^2 + φ*^T Σ_ρ^(±) φ*. [30]

Substituting this back and adding the normalization constant of the Gaussian in y yields a single mixture component. The update of the mixing constants follows from the pre-factors and the reduced exponent:

c_β^(±) ∝ c_ρ^(±), [33]

where the proportionality indicates that the coefficients on the right-hand side must still be normalized to one. Together, eqs. (29), (30) and (33) report how to compute the posterior predictive in eq. (22).
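To make the updates concrete, here is a minimal numerical sketch of eqs. (9), (10), (18), (11), (29) and (30) for the one-dimensional quadratic feature φ(x) = x² of the main text. This is our own illustrative code, not the authors' implementation; the function names and test values are assumptions.

```python
import numpy as np

def posterior_mixture(x, y, sigma_g, mus_pi, vars_pi, cs_pi):
    """Per-mode posterior means/variances (eqs. 9, 10) and normalized
    mixture weights (eqs. 17, 18) for a 1-D Gaussian-mixture prior."""
    phi = x**2                                   # quadratic feature phi(x) = x^2
    a = np.sum(phi**2) / sigma_g**2              # data precision sum_i phi_i^2 / sigma_g^2
    b = np.sum(y * phi) / sigma_g**2             # sum_i y_i phi_i / sigma_g^2
    vars_rho = 1.0 / (1.0 / vars_pi + a)         # eq. (10)
    mus_rho = vars_rho * (mus_pi / vars_pi + b)  # eq. (9)
    # log of eq. (17); terms common to both components are dropped
    log_c = (np.log(cs_pi) + 0.5 * np.log(vars_rho / vars_pi)
             + 0.5 * mus_rho**2 / vars_rho - 0.5 * mus_pi**2 / vars_pi)
    cs_rho = np.exp(log_c - log_c.max())
    return mus_rho, vars_rho, cs_rho / cs_rho.sum()

def map_estimate(mus_rho, vars_rho, cs_rho):
    """MAP-R: evaluate the mixture density at both component means, eq. (11)."""
    dens = lambda w: np.sum(cs_rho * np.exp(-0.5 * (w - mus_rho)**2 / vars_rho)
                            / np.sqrt(2 * np.pi * vars_rho))
    return max(mus_rho, key=dens)

def posterior_predictive(x_star, sigma_g, mus_rho, vars_rho, cs_rho):
    """B-R predictive at x*: mixture means phi* mu_rho (eq. 29), variances
    sigma_g^2 + phi*^2 Sigma_rho (eq. 30); normalized weights carry over (eq. 33)."""
    phi_star = x_star**2
    return phi_star * mus_rho, sigma_g**2 + phi_star**2 * vars_rho, cs_rho

# Noise-free data generated with w = 2 and a bimodal prior at +/-2:
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = 2.0 * x**2
mus, vars_, cs = posterior_mixture(x, y, 0.1, np.array([-2.0, 2.0]),
                                   np.array([0.1, 0.1]), np.array([0.5, 0.5]))
print(round(map_estimate(mus, vars_, cs), 2))   # -> 2.0
```

With informative data the weight of the mode nearer to w_ML dominates, so the MAP estimate and the dominant predictive mode both track the generating parameter.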
E. Prior predictive. As a null model, we tested whether the measured responses could be explained simply in terms of the prior predictive:

p(y*|x*, M_PR) = ∫ p_{σ_g}(y*|x*, w) π(w) dw. [34]

Computing the prior predictive requires exactly the same steps as in the previous section. The only difference is that our starting point is not the posterior but the prior; in terms of the equations, this amounts to replacing all subindices ρ (posterior) with the subindex π (prior) in the previous section.

F. Derivation of full Bayesian regression.
To compute the posterior predictive distribution, we can expand on our previous results for ordinary B-R, namely the fact that we know how to compute

p(y*|x*, D_j, M_BR) = ∫ p(y*|x*, w, σ'_g) p(w|D_j) dw [35]

for any fixed non-zero value of σ'_g. To extend to full B-R, we make two modifications in eq. (35). First, we replace the posterior by the joint posterior:

p_{σ'_g}(w|D_j) = p_{σ'_g}(D_j|w) π(w) p^{-1}(D_j) → p_{σ'_g}(D_j|w) π(w) π(σ'_g) p^{-1}(D_j) = p(w, σ'_g|D_j). [36]

Note that the normalization constant is not the same on the LHS and RHS of the equation. The only change is the factor of π(σ'_g) and the fact that we no longer have a closed-form expression for the normalization. Second, we integrate over σ'_g in eq. (35). Because we lost the closed-form expression for the normalization constant, we use proportionality instead of equality:

p(y*|x*, D_j, M_BRf) = ∫∫ p(y*|x*, w, σ'_g) p(w, σ'_g|D_j) dw dσ'_g [37]
= ∫∫ p(y*|x*, w, σ'_g) p_{σ'_g}(D_j|w) π(w) π(σ'_g) p^{-1}(D_j) dw dσ'_g [38]
∝ ∫ π(σ'_g) p(y*|x*, D_j, M_BR) dσ'_g, [39]

where we dropped the normalization constant because it is numerically more convenient to normalize only the final expression, i.e., the posterior predictive distribution. Note that we owe the simplicity of this extension to the fact that we chose a factorized prior π(w, σ'_g) = π(w) π(σ'_g). When considering sequential inference, this choice has the clear downside that the prior is not conjugate to the Gaussian likelihood. However, this does not come to bear in our work because we consider only one inference step upon seeing the stimulus.
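The double integral over w and σ'_g has no closed form, but it can be approximated on a grid by accumulating the unnormalized joint and normalizing only at the end. The following is an illustrative sketch of this idea (our own code, not the authors' implementation), assuming the one-dimensional quadratic feature φ(x) = x² and the exponential noise prior of eq. (40); grid ranges are arbitrary choices.

```python
import numpy as np

def full_br_predictive(y_grid, x_star, x, y, sigma_bar,
                       mus_pi, vars_pi, cs_pi,
                       w_grid=np.linspace(-5, 5, 401),
                       s_grid=np.linspace(0.01, 1.2, 60)):
    """Grid-based sketch of the joint marginalization: accumulate
    p(y*|x*,w,s) * p_s(D|w) * pi(w) * pi(s) over w and s, and normalize
    only the final expression on the y grid."""
    phi, phi_star = x**2, x_star**2              # quadratic feature
    # bimodal Gaussian prior over w (eq. 12)
    pi_w = sum(c * np.exp(-0.5 * (w_grid - m)**2 / v) / np.sqrt(2 * np.pi * v)
               for c, m, v in zip(cs_pi, mus_pi, vars_pi))
    pi_s = np.exp(-s_grid / sigma_bar) / sigma_bar   # exponential prior, eq. (40)
    # squared residuals of every candidate w against the N stimulus points
    resid2 = np.sum((y[None, :] - w_grid[:, None] * phi[None, :])**2, axis=1)
    dens = np.zeros_like(y_grid)
    for s, ps in zip(s_grid, pi_s):
        lik = np.exp(-0.5 * resid2 / s**2) / (np.sqrt(2 * np.pi) * s)**len(x)
        pred = (np.exp(-0.5 * (y_grid[:, None] - w_grid[None, :] * phi_star)**2 / s**2)
                / np.sqrt(2 * np.pi * s**2))
        dens += ps * (pred @ (lik * pi_w))       # unnormalized joint contribution
    dy = y_grid[1] - y_grid[0]
    return dens / (dens.sum() * dy)              # normalize the final expression
```

The default `s_grid` spans roughly 12 prior means for sigma_bar = 0.1; it should be widened for larger noise levels.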
We choose π(σ'_g) to be an exponential distribution and use the data-generating hyperparameter σ_g to fix its expectation, E[σ'_g] = σ_g, which yields

π(σ'_g) = σ_g^{-1} exp(−σ'_g / σ_g). [40]

G. Bayesian regression with three- and four-point estimates of the generative noise. To estimate the generative noise from M ∈ {3, 4} points, we set the gradient of the log-likelihood with respect to the noise parameter to zero. The closed-form solution for the estimate of the generative noise is:

σ̂_g^2 = (1/M) ∑_{i=1}^M (y_i − w_ML^T φ_i)^2. [41]

For the B-R3 method, we compute two such estimates by ignoring the left-most and the right-most stimulus point, respectively, and then take the minimum value as input for Bayesian regression. For the B-R4 method used to generate fig. S3 we used M = 4 to obtain a single noise estimate.
While both methods B-R3 and B-R4 are heuristics, we opted for B-R3 because it better captures the empirical variance distribution, as confirmed by fig. S3.
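The noise-estimation heuristic can be sketched as follows. This is our own code, assuming the one-parameter quadratic model of the main text and that the stimulus points are ordered by x; it returns the standard deviation rather than the variance of eq. (41).

```python
import numpy as np

def noise_estimate(x, y):
    """ML noise estimate from M points (eq. 41): least-squares fit of the
    curvature w, then the root of the mean squared residual."""
    phi = x**2
    w = np.sum(y * phi) / np.sum(phi**2)     # 1-D ML fit, eq. (6)
    return np.sqrt(np.mean((y - w * phi)**2))

def br3_noise(x, y):
    """B-R3 heuristic: two 3-point estimates, dropping the left-most and
    the right-most point respectively, then take the minimum."""
    return min(noise_estimate(x[1:], y[1:]), noise_estimate(x[:-1], y[:-1]))

x = np.array([-2.0, -1.0, 1.0, 2.0])
print(br3_noise(x, 2.0 * x**2))   # noiseless points -> 0.0
```

Using `noise_estimate` on all four points directly corresponds to the B-R4 variant.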
H. Bayesian regression scales with generative noise. Here, we examine the limiting cases σ_g → 0 and σ_g → ∞. We show that all three models (ML-R, MAP-R and B-R) make the same prediction in the former case, and that the posterior predictive and the prior predictive coincide in the latter case. Both statements follow from examining the limiting cases of the posterior; it is not necessary to work with the posterior predictive.
Small generative noise. Based on the data D = {(x_i, y_i)}_{i=1}^N, ML-R makes a prediction based on w_ML (eq. (6)). Note that this prediction is independent of σ_g. The MAP prediction is given by eqs. (9) and (10): as σ_g → 0, the data term σ_g^{-2} ∑_i y_i φ_i dominates the prior term, so μ_ρ^(±) → w_ML. This makes intuitive sense: if the data is highly informative, we do not rely on the prior. The same holds true for Bayesian regression, because the bimodal posterior becomes a delta function around the ML solution.
We already established that the mean associated with each of the two modes converges to the maximum likelihood solution, i.e., μ_ρ^(±) → w_ML. We also know that the mixture coefficients c_ρ^(±) always sum to one. Now, we only have to verify that the covariance matrices of both modes converge to zero:

Σ_ρ^(±) = ( (Σ_π^(±))^{-1} + σ_g^{-2} ∑_i φ_i φ_i^T )^{-1} → 0 as σ_g → 0. [42]

With a delta-peaked posterior, the B-R prediction collapses into the ML prediction as well:

p(y|x, D, M_BR) = ∫ δ(w − w_ML) p(y|x, w) dw = p(y|x, w_ML). [43]

Therefore all three methods make the same prediction.
Large noise limit. In the limit σ_g → ∞, the data carries no information and the posterior (indicated by ρ) converges to the prior (indicated by π); B-R then makes the same prediction as a prediction based on the prior. It is straightforward to show that the mean and covariance of each mode converge to the prior's mean and covariance:

μ_ρ^(±) → μ_π^(±), [44]
Σ_ρ^(±) → Σ_π^(±). [45]

As a final step, we have to verify that the coefficients of the modes do not change. The coefficient update (eq. (18)) can be expressed by an update factor f^(±) between the normalized prior coefficient c_π^(±) and the unnormalized posterior coefficient c̃_ρ^(±):

c̃_ρ^(±) = f^(±) c_π^(±). [46]

If the update factor is independent of the original mode, i.e., independent of the label ±, both mixture components receive the same update, and since the final coefficients are normalized, c_ρ^(±) → c_π^(±).
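Both limits can be checked numerically with the one-dimensional posterior expressions of eqs. (9) and (10). This is our own illustrative check, not part of the analysis pipeline; the data and prior values are arbitrary.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = 1.5 * x**2                      # noise-free data generated with w = 1.5
phi = x**2
w_ml = np.sum(y * phi) / np.sum(phi**2)          # ML solution, eq. (6)
mus_pi = np.array([-2.0, 2.0])                   # bimodal prior modes
vars_pi = np.array([0.5, 0.5])

def posterior_modes(sigma_g):
    """Per-mode posterior mean and variance, eqs. (9) and (10), in 1-D."""
    a = np.sum(phi**2) / sigma_g**2
    b = np.sum(y * phi) / sigma_g**2
    v = 1.0 / (1.0 / vars_pi + a)
    return v * (mus_pi / vars_pi + b), v

mu0, v0 = posterior_modes(1e-4)     # sigma_g -> 0: data dominates
muI, vI = posterior_modes(1e4)      # sigma_g -> inf: prior dominates
print(np.allclose(mu0, w_ml), np.allclose(v0, 0.0, atol=1e-6))                    # -> True True
print(np.allclose(muI, mus_pi, atol=1e-3), np.allclose(vI, vars_pi, atol=1e-3))   # -> True True
```

In the small-noise limit both modes collapse onto w_ML with vanishing variance; in the large-noise limit the posterior returns to the prior, as derived above.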

Fig. 1. Experimental protocol. (A): procedure of a single trial. First, a fixation dot was presented for 1 s before the 4-dot stimulus appeared. Observers then had unlimited time to adjust the fifth dot with the up and down arrow keys. They then pressed the space bar to confirm the final position of the adjustable dot. After the response, the generative parabola was shown for 1 s as feedback. (B): experimental protocol. The different regression models make different predictions as to the extent to which participants use the generative probability p(y|x, w) and the prior π(w).

Fig. 2. Example responses from a representative participant for different levels of generative noise σ_g. (A) A sample stimulus with the high noise level (σ_g = 0.4; blue dots), as shown to the participant. Contours indicate the equiprobable responses predicted by ML-R, MAP-R and B-R (not shown to the participant). The black dots show the measured responses. (B-D) The predicted response distribution (color-coded) vs. experimentally measured responses (gray) at x = 2. As σ_g increases, the data becomes less informative. Consequently, and in accordance with B-R, the response distribution becomes more bimodal. In the bottom-right plot, the mode of the ML-R predictive distribution (blue) lies at high negative values, outside the plotting range.

Fig. 3. The log-likelihoods (A) for all participants individually and (B) pooled across participants. The SEM is smaller than the markers. At σ_g = 0.03, ML-R and P-R show weak performance, while MAP-R and B-R perform best and are indistinguishable within the errors. At σ_g = 0.1, the MAP-R performance drops relative to B-R. Finally, at σ_g = 0.4, B-R and P-R perform equally well while all other methods fail; the noise in the stimulus is so large that the posterior closely resembles the prior.

Fig. 4. (A) Variance predicted by the models and averaged across participants (color-coded) and measured variance (gray) for individual participants. B-R is the only model that captures the upwards trend in the data. The shading represents the 25% and 75% quantiles, computed from the distribution of variance across 1000 unique stimuli (and averaged over participants). While the errors are large enough to encompass all data points at σ_g = 0.1, B-R fails at σ_g = 0.4. The errors of the other models are too small to show. (B, C, D) Variance of the response distribution for a set of stimuli {D_j}_j, pooled across participants (gray dots, mean: black line) and predicted (red linear shading, mean: red line). Each of the 7 participants responded to 20 stimuli, yielding 120 gray points per generative noise σ_g. Some stimuli always elicit an almost zero variance. As σ_g increases, the distribution of variances moves upwards. The large spread of variances shows that participants distribute individual responses anywhere from very clustered in one of the modes to equally spread out across both modes, depending on the specific stimulus. We simulated 1000 stimuli per σ_g to obtain the predicted variance distribution. (B) Standard B-R captures the upwards trend in the data but fails to account for the small variance at σ_g = 0.4. (C) B-Rf cannot account for the almost zero variances. However, as σ_g increases, it predicts that the response variances fall into two clusters, i.e., a second lobe appears below the mean. Likewise, the gray dots seem non-homogeneously distributed, but we do not have enough data to test this quantitatively. (D) B-R3 accounts for the almost zero variance because fitting σ_g on each trial occasionally yields very high certainty and unimodal responses. It also fits the mean responses well.

Fig. S1. (A): The 3rd participant's log-likelihood landscape for σ_m. (B) Measured responses (orange histogram) of the participant and the response function (blue).
Fig. S2. Log-likelihood for all models. A poor fit to the data yields a more negative log-likelihood; a good fit results in less negative values. (A) Variants of B-R (red) perform less well than the original B-R. (B) The total log-likelihood summed over the three levels of generative noise σ_g and averaged across participants. ML-R and MAP-R are not shown because their poor performance is visually evident from plot (A).