Bayesian comparison of explicit and implicit causal inference strategies in multisensory heading perception

The precision of multisensory perception improves when cues arising from the same cause are integrated, such as visual and vestibular heading cues for an observer moving through a stationary environment. In order to determine how the cues should be processed, the brain must infer the causal relationship underlying the multisensory cues. In heading perception, however, it is unclear whether observers follow the Bayesian strategy, a simpler non-Bayesian heuristic, or even perform causal inference at all. We developed an efficient and robust computational framework to perform Bayesian model comparison of causal inference strategies, which incorporates a number of alternative assumptions about the observers. With this framework, we investigated whether human observers’ performance in an explicit cause attribution and an implicit heading discrimination task can be modeled as a causal inference process. In the explicit causal inference task, all subjects accounted for cue disparity when reporting judgments of common cause, although not necessarily all in a Bayesian fashion. By contrast, but in agreement with previous findings, data from the heading discrimination task only could not rule out that several of the same observers were adopting a forced-fusion strategy, whereby cues are integrated regardless of disparity. Only when we combined evidence from both tasks we were able to rule out forced-fusion in the heading discrimination task. Crucially, findings were robust across a number of variants of models and analyses. Our results demonstrate that our proposed computational framework allows researchers to ask complex questions within a rigorous Bayesian framework that accounts for parameter and model uncertainty.

1 Cookbook for causal inference observers We describe here a fairly general recipe for building an observer model for causal inference in multisensory perception. We consider the most common case of two sensory modalities (see [1] for work on three modalities). Stimuli take value on some one-dimensional physical continuum, such as location or heading direction. 1 The observer model is designed to apply to three types of tasks: • Unisensory estimation/discrimination: The observer is presented with one stimulus from either modality, and is asked to report the value of the stimulus (or how the stimulus compares to a given reference).
• Bisensory estimation/discrimination: The observer is presented with two stimuli from different modalities, and is asked to report the value of either one, or of both (or how one of the stimuli, or both, compare to a given reference). Also referred to as implicit (causal) inference.
• (Bisensory) unity judgement: The observer is presented with two stimuli from different modalities, and is asked whether they were perceived as having the same value/source. Also referred to as explicit (causal) inference.
Depending on the experimental setup, the bisensory estimation/discrimination and unity judgment tasks might be performed in the same trial (a 'dual task' setup; see for example [2,3]). Our construction makes the following assumptions: • When two stimuli are presented in the same trial, the observer follows a 'causal inference strategy' to decide whether the stimuli belong to a common cause (C = 1) or not (C = 2).
• Conditioned on a given causal scenario (C = 1 or C = 2), or in the unisensory task, the observer performs the estimation/discrimination task according to Bayesian inference.
• When responding, the observer might exhibit additional suboptimalities, such as lapsing and cue switching.
A specific observer model is built by picking four model components (also called model factors): (1) a sensory noise model; (2) a prior over stimuli; (3) a causal inference strategy; and (4) additional sources of suboptimality.

Pick a sensory noise model
For each modality 'mod', pick a sensory noise model for the observer. The common assumption is a Gaussian measurement noise distribution of the form p(x mod |s mod ) = N x mod |s mod , σ 2 (s mod ) , where x mod is the noisy measurement, s mod the stimulus value, N x|µ, σ 2 is a normal distribution with mean µ and variance σ 2 , and the function σ 2 (s mod ) > 0 encodes how the variance of measurement noise depends on the stimulus value, which is a feature of the sensory domain. Common shapes could be a constant noise, or alternatively noise that grows proportionally to |s mod | (∼ Weber's law). The are no constraints on the shape of σ 2 (s mod ) besides positivity and, arguably, continuity. Eq. S1 is suitable for unbounded stimulus domains, or circular domains (such as orientation, or heading direction) with small angles, which effectively behave as unbounded domains. For an actually circular stimulus domain, we replace Eq. S1 with a wrapped normal distribution which, for σ(s mod ) < 360 • , is very well approximated by only three components k = −1, 0, 1. An alternative to Eq. S2 is to use a von Mises (i.e., circular normal) distribution; we show in Section 3 that the two choices are essentially equivalent.

Measurement distribution and likelihoods
We use Eqs. S1 (or S2) both for the sensory measurement noise distribution, that is the generative process of measurement x mod for a given stimulus s mod in the experiment, and for the observer's sensory likelihood used in the inference process of the posterior probability over s mod for a given measurement x mod . According to Bayes' rule, for the example of a unisensory stimulus, the latter takes the form p(s mod |x mod ) = p(x mod |s mod )p prior (s mod ) p(x mod |s mod )p prior (s mod )ds mod (S3) where p prior (s mod ) is the prior over unisensory stimuli (see Section 1.2), and here p(x mod |s mod ) is the likelihood.
Previous computational work has modified the equation of the measurement distribution by including terms, such as a scaling factor in front of x mod , not reflected in the likelihood. This form of model mismatch has the effect of introducing explicit biases in subjects' percepts. 2 The rationale for this adhoc modification of the measurement distribution is that such biases are observed experimentally, see for example [4,5] in the case of heading estimation. In our construction, instead, we follow the common practice in Bayesian psychophysics of assuming that biases in the observers' performance emerge implicitly and normatively from the interaction between statistics of the stimuli (i.e., priors) and precision of the sensory apparatuses (i.e., likelihoods) [6,7]. Recent theoretical work has shown that this might agree with encoding of stimuli in neural populations [8,9]. In particular, as demonstrated in these studies, priors will generally induce 'attractive' biases, whereas stimulus-dependent noise distributions (and, thus, likelihoods) can induce both 'attractive' and 'repulsive' biases. For this reason, we do not force biases by hand in the formulation of the sensory noise distribution, but this should not be mistaken for a lack of biases in the performance of our observer models.
The fact that we use the same expressions (and parameters) for both the sensory measurement distribution and the likelihood is equivalent to saying that observers implicitly know their own noise model (that is, how noise changes as function of other parameters of the task, such as reliability and stimulus eccentricity). This modeling choice is motivated both by experimental work that shows trial-to-trial reweighing of multisensory cues [10][11][12] and by theoretical reasons, in that models in which noise (e.g., variance of measurement distributions) and beliefs about noise (e.g., 'variance' in the likelihoods) are decoupled may suffer from a lack of identifiability, unless the experiment is designed to avoid such issues [13].

Pick a prior over stimuli
The observer will have a prior over stimuli in the unisensory and bisensory conditions. A common choice for the prior is an independent, identical Gaussian prior across modalities for stimuli s 1 and s 2 , where µ prior is the mean of the prior (which might represent a global bias, often assumed to be zero), and σ 2 prior represents the width of the prior (the wider the prior, the lesser its influence on behavior). The same prior is then applied to the common cause scenario and unisensory cases, This simple prior induces a 'compression' or 'regression to the mean' bias as observed in many psychophysical experiments [14]. Another possibility is that the observer develops a prior (approximately) based on the empirical distribution of stimuli presented in the experiment, which may differ from Eqs. S4 and S5.

Pick a causal inference strategy
The causal inference strategy defines how the observer decides on the hypotheses C = 1 and C = 2 when presented with two stimuli. In general, the causal inference strategy may or may not be Bayesian, can be deterministic or stochastic, and might dictate to combine the two causal scenarios (e.g., by performing a weighted average of C = 1 and C = 2). This strategy also determines what the observer would report in an explicit, unity-judgment task.

Bayesian causal inference strategies
A Bayesian strategy will compute the posterior probabilities of the two causal scenarios, given the two noisy measurements x 1 , x 2 , as follows, where p(C) represents the prior belief of a common or separate cause, with p(C = 1) = 1−p(C = 2) ≡ p c . While p c should typically stem from the statistics of the task, it is general practice to keep it as a free parameter of any Bayesian model, since subjects tend to exhibit a wide spectrum of beliefs about the probability of a common cause (see Fig 2 in [15]).
Different variants of Bayesian observers will use the posterior over causal scenarios differently to respond to estimation/discrimination task. Typical models are Bayesian model averaging (average the estimates of C = 1 and C = 2, weighted by their posterior probability), Bayesian model selection (pick the estimate of either C = 1 or C = 2, based on which one has the larger posterior probability), or Bayesian probability matching (pick either scenario stochastically, with probability equal to their posterior probability).
For the unity judgment task, the standard Bayesian strategy is to respond with the scenario (C = 1 or C = 2) with highest posterior probability. Another possibility is posterior probability matching, that is pick either scenario stochastically, with probability equal to their posterior.

Non-Bayesian causal inference strategies
The main feature of a non-Bayesian strategy is that it does not compute a posterior distribution over causal scenarios, but uses instead (usually simpler) heuristics as a decision rule to whether C = 1 or C = 2.
A typical heuristic of this kind stipulates that C = 1 whenever the two noisy measurements x 1 , x 2 are closer in value than some criterion κ, that is |x 1 − x 2 | < κ. If κ is fixed for all experimental conditions, we call this a fixed-criterion causal inference strategy [16]. If κ is allowed to change for different experimental conditions, and in particular as a function of stimulus reliability, then the decision rule becomes 'probabilistic', that is uncertainty-dependent [17].
A fixed-criterion strategy that discards reliability information might seem to clash with the assumption that observers know the stimulus reliability when combining cues. However, there is neural evidence that sensory integration (that is forced fusion, with reliability-dependent weighing) and causal inference happen in different brain areas [18]. For this reason, it is not obvious that reliability information would be automatically available to higher areas, or that it would be used in the correct way. Fixed-criterion models represent a valid 'null' alternative for a class of models in which reliability information is unavailable (or corrupted) at the causal inference stage.

Non-causal inference strategies
Extreme cases of causal inference strategies are observers that do not quite perform causal inference at all.
In this case, an observer might use a forced fusion strategy that always combines cues (C ≡ 1), or, alternatively, a forced segregation strategy that always segregates them (C ≡ 2). Mathematically, these strategies can be considered as limiting cases of previously presented causal inference strategies. For example, forced fusion is equivalent to a Bayesian causal inference strategy with p c → 1, or a fixedcriterion strategy with κ → ∞. Analogously, forced segregation is equivalent to a Bayesian strategy with p c → 0, or a fixed-criterion strategy with κ → 0.
As a generalization of forced fusion/segregation, we can consider a stochastic fusion strategy that on each trial has probability η of deciding C = 1, and C = 2 otherwise, where η might depend on the experimental condition.

Pick other sources of suboptimality
Experimental subjects will often exhibit additional sources of variability, which might be included explicitly in the model. Here we consider lapses and cue switching.
A common feature of many psychophysical models is a lapse rate, that is the probability λ that the observer gives a completely random response (typically, uniform over the range of possible responses) [19].
Another form of error for multisensory perception experiments is that the observer switches modality, that is in a bisensory estimation/discrimination task they respond about the wrong modality with switching rates ρ 1→2 and ρ 2→1 , respectively for responding with the second modality when asked about the first, and vice versa. Note that the switching rate can be used to implement suboptimal strategies such as cue capture, whereby all responses are absorbed by a single modality: pick the 'forced segregation' causal inference strategy, then set, say, ρ 2→1 = 1 and ρ 1→2 = 0, if responses are supposed to be captured by the first modality. Similarly, by picking 'forced segregation' with nonzero ρ 1→2 and ρ 2→1 , one can implement a switching strategy observer [20].

Observer model factors
In this section we describe details of the factors used to build the observer models in the paper.

Sensory noise
For a given modality mod ∈ {vis, vest}, the measurement noise distribution follows Eq. S1. Note that for a visual stimulus the measurement distribution and the variance in Eq. S1 also depend on the visual coherence level c vis in the trial, such that σ 2 (s vis ) ≡ σ 2 (s vis , c vis ), but in the following we omit this dependence to simplify the notation.
For the variance we consider two possible models, where σ 2 0 modality is the base variance and w mod is related to the Weber fraction near 0 • . In fact, for small values of s mod , Eq. S7 reduces to σ 2 (s mod ) ≈ σ 2 0 mod 1 + w 2 mod s 2 mod , which is a generalized Weber's law. 3 The broad shape of the chosen periodic formula for the eccentriticy-dependent noise model, which peaks at ±90 • , derives from empirical results in a visuo-vestibular task with the same apparatus with human and monkey subjects (see Fig 2 in [4]; see also [22]). We note that our noise shape differs from that adopted in other works (with different setups), which used a sinusoidal with twice the frequency that peaks at ±45 • , ±135 • [23,24]. Since in our setup the heading directions were restricted to the ±45 • range (with most directions in the ±25 • range), the exact shape of periodicity is largely irrelevant, but understanding differences in noise models may be important for experiments with wider heading direction ranges.
For the paper, we implemented the measurement distribution (and, thus, the stimulus likelihood in the inference process) as a mixture of three wrapped Gaussians (Eq. S2). However, we found that, due to the limited range of directions in our experiment, a single Gaussian was sufficient. Note that our choice of using Gaussians rather than von Mises (circular normal) distributions yields no loss of generality in practice, as we demonstrate in Section 3.
All constant noise models have four parameters (σ 0vest , and a separate σ 0vis for each visual coherence level, low, medium and high). Eccentricity-dependent models have two additional parameters, w vest and w vis (the latter is common to all visual stimuli, to prevent overfitting).

Prior
For unisensory trials, we assume that observers have a unimodal symmetric prior over heading directions, peaked at 0 • (the exact shape is irrelevant). Due to the form of the decision rule for the left/right discrimination task, such prior has no influence over the observer's response, which only depends on whether the noisy measurement falls to the left or to the right of straight ahead.
For bisensory trials (both unity judgment and inertial discrimination tasks), we consider two alternative models for priors. The empirical prior consists of an approximation of the actual prior used in the experiment, that is where S is the discrete set of pairs of visual and vestibular headings in the experiment. The two equations consider respectively only diagonal elements (equal heading directions, C = 1) or off-diagonal elements (different directions, C = 2) of Fig 1B in the main text. The approximation here is given by the two Gaussian distributions (defined on the discrete set), which impose additional shrinkage for the mean of the stimuli (governed by σ 2 prior ) and for the disparity (governed by ∆ 2 prior ). For σ 2 prior , ∆ 2 prior → ∞, Eq. S8 converges to the distributions of directions used in the experiment for C = 1 and C = 2.
Alternatively, we consider an independent prior, that is which assumes observers build a single prior over heading directions which is applied independently to both modalities [25]. The first integral is a formal way to impose s ≡ s vis = s vest . We note that a continuous approximation of Eq. S8 may seem more realistic than the adopted discrete distribution of directions. However, an observer model with a correlated, continuous prior is computationally intractable since evaluation of the log likelihood involves a non-analytical four-dimensional integral, which increases the computational burden by an order of magnitude. As a sanity check, we implemented observers that use a continuous approximation of Eq. S8 and verified on a subset of observers and models that results of model fits and model predictions were indeed nearly identical to the discrete case.
Independent prior models have one parameter σ prior for the width of the prior over headings. Empirical prior models have an additional parameter ∆ prior for the width of the prior over disparities.

Causal inference strategy
The basic causal inference strategies: Bayesian, fixed-criterion and fusion are described in the main text. We report here some additional definitions and derivations.
All integrals in this section are in the [−90 • , 90 • ] range, unless noted otherwise. The rationale of such integration range for our experiment is that subjects were informed that the movement was forward (either left or right of straight-forward). Moreover, due to the relatively narrow range of stimuli used in our experiment, we found with preliminary analyses that beliefs more than 90 • away from straight-ahead had negligible influence on left/right decisions. In the more general case of stimuli distributed along the full circle, the integration range should go to ±180 • . For a non-circular dimension, appropriate empirical bounds should be chosen (e.g., the width of the projection screen for a localization task).

Posterior probability of causal structure
For a Bayesian observer, the posterior probability of common cause is where Pr(C = 1) ≡ p c , the prior probability of a common cause, is a free parameter of the model. Then where the likelihoods are defined by Eq. S1, the prior is defined by Eqs. S8 and S9, and Pr(c vis ) = 1 3 . For the independent prior case we can further simplify whereas the solution for the empirical prior is similar, but with a sum over the discrete stimuli such that s vis = s vest . Conversely, the posterior probability of separate causes is which for the independent prior becomes that is the product of two one-dimensional integrals. For the empirical prior Eq. S11 does not simplify, but becomes a discrete sum over S (see Eq. S8).
Posterior probability of left/right discrimination (C = 1) In bisensory inertial discrimination trials the observer may implicitly contemplate two scenarios: that there is only one common cause (C = 1), or that there are two distinct causes (C = 2). We consider inference in the two separate scenarios, and then see how the observer can combine them. For C = 1, the observer's posterior probability density over over the inertial heading direction is which for the independent prior becomes p (s vest |x vis , x vest , c vis , C = 1) ∝ p(x vest |s vest )p(x vis |s vis = s vest , c vis )N s vest |0, σ 2 prior and the solution is similar for the empirical prior, constraining s vest to take only the discrete values used in the experiment for C = 1.

Posterior probability of left/right discrimination (C = 2)
For C = 2, the observer's posterior over inertial heading is which for the independent prior can be further simplified as whereas for the empirical prior the integral in Eq. S13 becomes a sum over discrete pairs of heading directions used in the experiment.

Posterior probability of left/right discrimination (C unknown)
If the causal structure is unknown, a Bayesian observer that follows a 'model averaging' strategy marginalizes over possible causal structures (here, C = 1 and C = 2) [25]. The observer's posterior probability density over the inertial heading direction is where v k (x vis , x vest , c vis ), for k = 1, 2, are the posterior causal weights assigned by the observer to the two causal structures, with v 2 (x vis , x vest , c vis ) = 1 − v 1 (x vis , x vest ) and 0 ≤ v 1 (x vis , x vest , c vis ) ≤ 1. For a Bayesian observer, the causal weights are equal to the posterior probabilities (Eq. S14); in the main text we describe other models.

Suboptimalities
For all our observer models, we considered a lapse rate λ. Due to the format of our bisensory discrimination data (i.e., only inertial left/right responses), which limits the identifiability of switching models, we did not consider a switching rate, leaving that to future work.

Model parameters
All models except stochastic fusion have five parameters θ default by default: three visual base noise parameters σ 0vis (c high ), σ 0vis (c med ), and σ 0vis (c low ); a vestibular base noise parameter σ 0vest ; and a lapse rate λ.

Comparison between wrapped normal and von Mises noise
In the presentation of our general causal inference observer model, and in the manuscript, we assumed that measurement noise distributions took the shape of (wrapped) normals (see Eqs. S1 and S2). Moreover, for wrapped normals, we advocated that three mixture components (k = 0, ±1) are sufficient. Our modeling proposal differs from the typical choice of using von Mises (circular normal) distributions for circular variables (see for example [23,24]). Here we test whether our choice is sensible and generally applicable, by asking whether there is a practical difference between using von Mises and wrapped normals, for experiments with stimuli over the entire circular domain. First, we note that, qualitatively, the von Mises and wrapped normals have very similar properties. They are both bell-shaped distributions over the circle, and they are both related to the normal distribution. von Mises distributions are the maximum-entropy distributions over the circle, so theoretically more appealing, but on the other hand wrapped normals, especially as a mixture of three Gaussians (one at the mean, the other two at ±360 • from the mean), have computational advantages. It remains to be established whether these distributions differ quantitatively in an empirically meaningful way. In the following analyses, we always consider wrapped normals approximated with three mixture components.

Theoretical comparison
To answer this question theoretically, we assess the difference between the two noise distributions by computing the Kullback-Leibler (KL) divergence between a von Mises distribution with a given concentration parameter κ and the best approximating wrapped normal (this construction assumes that the true underlying distribution is a von Mises, but the results are similar after inverting the role). The KL-divergence represents the expected difference in log likelihood between the two noise models per trial (assuming the data were generated from a von Mises). Thus, the inverse of the KL-divergence can be taken as a ballpark of the minimum number of samples required to empirically see a difference between the two models (that is, one point of log likelihood of difference summed over trials). We call this quantity the identifiability threshold.
As expected, for large values of κ (when the von Mises converges to a normal distribution) and for small values of κ (when the von Mises converges to a uniform distribution over the circle), the identifiability threshold between wrapped normal and von Mises is way over 10 3 , and even 10 4 , meaning that several thousand trials would be needed to distinguish the two models (assuming no other confounding elements). However, there is a range of values of κ, around ≈ 50-60 (that is, a circular SD of ≈ 7 • ), in which the identifiability threshold drops to ≈ 60-100. This analysis tells that, at least in some cases, the models could be distinguished within a large but feasible amount of trials. Whether the two noise models can be distinguished in practice is an empirical question, since in real data differences in the noise models will be obfuscated by other details. Moreover, it is possible that neither model is the true one (but they could be both equally good at approximating the true model). Finally, subjects' typical parameters might reside in ranges in which the two distributions are not empirically distinguishable.

Empirical comparison
To answer this question empirically, we took the data from a recent paper on causal inference in multisensory heading estimation [24]. For all subjects (17 datasets between Experiment 1 and 2, 400-600 trials per dataset), we fit the unisensory data (four conditions: one visual and three inertial) using the basic modeling framework described in the section "Analyses of Unisensory Data" of [24]. One minor difference with their analysis is that, as a principled way of dealing with outliers, we added for each subject a lapse rate parameter, shared across conditions (instead of discarding data points more than three standard deviations away from the mean). The lapse rate represents the probability of a completely random response (e.g., due to a missed stimulus, or a mistake in the response).
Crucially, we considered two models, one in which the noise model is a von Mises (as per [24]), and another one in which the noise model is a wrapped Gaussian (implemented as a mixture of Gaussians with three components). We fitted each dataset to both models via maximum-likelihood estimation. For the optimization, we used MATLAB's fmincon function with 100 random restarts, plus one starting point represented by the maximum-likelihood solution reported in [24, S2 Table]. Since both models have the same number of parameters (and, moreover, all parameters have the same meaning), we can directly compare differences in log likelihood without the need to account for model complexity. Across subjects, we found a difference of log likelihood of 0.13±0.18 (mean ± S.E.M.), which is negligible evidence in favor of the von Mises distribution. In fact, most of the evidence comes from a single subject; otherwise, eight subjects slightly favor the wrapped normal, and other eight slightly favor the von Mises. These results show that the two models are practically indistinguishable in real continuous estimation data. Note that this would be even more so with our data, since we have only discrete (binary) responses.
In conclusion, these analyses support our choice of using (wrapped) normals as an equivalent alternative to von Mises distributions, and suggest that wrapped normals, approximated via three mixture components, could be used more generally as a valid computational alternative to von Mises distributions.

Computational details
We describe in this section a number of computational and algorithmic details.

Integrals
Due to lack of analytical solutions, we computed all one-dimensional and two-dimensional integrals numerically, via either Simpson's or trapezoidal rule with a equi-spaced grid on the integration domain [26]. We had two types of integrals: integrals over x vis , x vest for marginalization over the noisy stimuli, and integrals over s vis and/or s vest for computation of the observer's decision rule (Eqs. S10, S11, S12 and S13).
For marginalization over noisy measurement x vis and x vest , we used a regular 401 × 401 grid for which we adjusted the range of integration in each modality to up to 5 SD from the mean of the noisy measurement distribution (or ±180 • , whichever was smaller). For large noise, we used wrapped normal distributions, which turned out to have little effect due to our setup.
For computation of the decision rule, we assumed that observers believed, due to the experimental setup and task instructions, that the movement direction would be forward, so limited to the ±90 • range. We adjusted the integration grid spacing ∆s (hence the number of grid points) adaptively for each parameter vector θ, defining and we rounded ∆s to the lowest exact fraction of the form 1 m , with m ∈ N and 1 ≤ m ≤ 8. The above heuristic afforded fast and accurate evaluation of the integrals, since the grid spacing was calibrated to be smaller than the length scale of the involved distributions (measurement noise and prior).
Finally, we note that we tried other standard numerical integration methods which were ineffective. Gauss-Hermite quadrature [26] led to large numerical errors because the integrand is discontinuous and bounded, a very bad fit for a polynomial. Global adaptive quadrature methods (such as quad in MAT-LAB, and other custom-made implementations) were simply too slow, even when reducing the requested precision. We coded all two-dimensional numerical integrals in C (via mex files in MATLAB) for maximal performance.

Optimization
For optimization of the log likelihood (maximum-likelihood estimation), we used Bayesian Adaptive Direct Search (BADS [27]; https://github.com/lacerbi/bads). BADS follows a mesh adaptive direct search (MADS) procedure that alternates poll steps and search steps. In the poll step, points are evaluated on a (random) mesh by taking one step in one coordinate direction at a time, until an improvement is found or all directions have been tried. The step size is doubled in case of success, halved otherwise. In the search step, a Gaussian process is fit to a (local) subset of the points evaluated so far. Points to evaluate during the search are iteratively chosen by maximizing the predicted improvement (with respect to the current optimum) over a set of candidate points. Adherence to the MADS framework guarrantees convergence to a (local) stationary point of a noiseless function under general conditions [28]. The basic scheme is enhanced with heuristics to accelerate the poll step, to update the Gaussian process hyperparameters, to generate a good set of candidate points in the search step, and to deal robustly with noisy functions. See [27] for details.
For each optimization run, we initialized our algorithm by randomly choosing a point inside a hypercube of plausible parameter values in parameter space. We refined the output of each BADS run with a run of patternsearch (MATLAB). To avoid local optima, for each optimization problem we performed 150 independent restarts of the whole procedure and picked the highest log likelihood value.
As a heuristic diagnostic of global convergence, we computed by bootstrap the value of the global optimum we would have found had we only used n r restarts, with 1 ≤ n r ≤ 150. We define the 'estimated regret' as the difference between the actual best value of the log likelihood found and the bootstrapped optimum. For each optimization problem, we computed the minimum value n * r for which the probability of having an estimated regret less than 1 was 99% (n * r ≡ ∞ if such n r does not exist). The rationale is that if the optimization landscape presents a large number of local optima, and new substantially improved optima keep being found with increasing n r , the bootstrapped estimated regret would keep changing with n r , and n * r would be 150 or ∞. For almost all optimization problems, we found n * r 150. This suggests that the number of restarts was large enough; although no optimization procedure in a non-convex setting can guarantee convergence to a global optimum in a finite time without further assumptions.

Markov Chain Monte Carlo (MCMC) sampling
As a complementary approach to maximum-likelihood model fitting, for each dataset and model we calculated the posterior distribution of the parameters via MCMC (see main text).
We used a custom-written sampling algorithm that combines slice sampling [29] with adaptive direction sampling [30]. 4 Slice sampling is a flexible MCMC method that, in contrast with the common Metropolis-Hastings transition operator, requires very little tuning in the choice of length scale. Adaptive direction sampling is an ensemble MCMC method that shares information between several dependent chains (also called 'walkers' [31]) in order to speed up mixing and exploration of the state space. For each ensemble we used 2(p + 1) walkers, where p is the number of parameters of the model. Walkers were initialized to a neighborhood of the best local optima found by the optimization algorithm. Each ensamble was run for 10 4 to 2.5 · 10 4 burn-in steps that were discarded, after which we collected 5 · 10 3 to 10 4 samples per ensemble.
At each step, our method iteratively selects one walker in the ensemble and first attempts an independent Metropolis update. The proposal distribution for the independent Metropolis is a variational mixture of Gaussians [32] fitted to a fraction of the samples obtained during burn-in via the vbgmm toolbox for MATLAB. 5 Note that the proposal distribution is fixed at the end of burn-in and does not change thereafter, ensuring that the Markov property is not affected (although non-Markovian adaptive MCMC methods could be applied; see [33]). After the Metropolis step, the method randomly applies with probability 1/3 one of three Markov transition operators to the active walker: coordinate-wise slice sampling [29], parallel-direction slice sampling [34], and adaptive-direction slice sampling [29,30]. We also fit a variational Gaussian mixture model to the last third of the samples at the end of the burnin period, and we used the variational mixture as a proposal distribution for an independent Metropolis step which was attempted at every step.
For each dataset and model, we ran three independent ensembles. We visually checked for convergence the marginal pdfs and distribution of log likelihoods of the three sampled chains. For all parameters, we computed Gelman and Rubin's potential scale reduction statistic R and effective sample size n eff [35] using Simo Särkkä and Aki Vehtari's psrf function for MATLAB. 6 For each dataset and model, we looked at the largest R (R max ) and smallest n eff (n effmin ) across parameters. Large values of R indicate convergence problems whereas values close to 1 suggest convergence. n eff is an estimate of the actual number of independent samples in the chains; a few hundred independent samples are sufficient for a coarse approximation of the posterior [35]. Longer chains were run when suspicion of a convergence problem arose from any of these methods. Samples from independent ensembles were then combined (thinned, if necessary), yielding 1.5 · 10 4 posterior samples per dataset and model. In the end, average R max (across datasets and models) was ∼ 1.002 (range: [1.000 − 1.035]), suggesting good convergence. Average n effmin was ∼ 8881 (range: [483 − 15059]), suggesting that we had obtained a reasonable approximation of the posteriors.

Pareto smoothed importance sampling diagnostics
As our main metric of model comparison we computed the Bayesian leave-one-out cross-validation score (LOO) via Pareto-smoothed importance sampling (PSIS; [36,37]); see Methods in the main text.
For a given trial 1 ≤ i ≤ N trials , with N trials the total number of trials, the PSIS approximation may fail if the leave-one-out posterior differs too much from the full posterior. As a natural diagnostic, PSIS also returns for each trial the exponent k i of the fitted Pareto distribution. If k i > 0.5 the variance of the raw importance ratios distribution does not exist, and for k i > 1 also the mean does not exist. In the latter case, the variance of the PSIS estimate is still finite but may be large. In practice, Vehtari et al. suggest to double-check trials with k i > 0.7 [37].
Across all our models and datasets, we found 2382 trials out of 1137100 with k i > 0.7 (0.21%). We examined the problematic trials, finding that the issue was in almost all cases the discontinuity of the observer's decision rule. For all problematic trials the LOO i scores were compatible with the values found for non-problematic trials, suggesting that the variance of the PSIS estimate was still within an acceptable range. We verified on a subset of subjects that the introduction a softmax with small spatial constant on the decision rule would remove the discontinuity and the problems with Pareto fitting, without significantly affecting the LOO i itself.

Model validation and recovery
We performed sanity checks and unit tests to verify the integrity of our code.
To test the implementation of our models, for a given observer (given model and parameter vector θ) we tested the data simulation code (functions that simulate responses; used e.g. to generate figures) against the log likelihood code (functions that compute the log likelihood of the data). For a number of subjects and models we verified that, at the maximum-likelihood solution, the log likelihood of the data approximated via simulation (by computing the probability of the responses via simple Monte Carlo) was ∼ equal to the log likelihood of the data computed numerically. This ensured that our simulation code matched the log likelihood code, being a sanity check for both. We performed a model recovery analysis to validate the correctness of our analysis pipeline, and assess our ability to distinguish models of interest using all tasks ('joint fits'); see e.g. [13,38]. For computational tractability, we restricted our analysis to six observer models: the most likely four models for each different causal inference strategy (to verify our ability to distinguish between strategies), and, for the most likely model, its variants along the prior and noise factors (to verify whether we can distinguish models along those axes). Thus, we consider the following models: Fix-X-E, Bay-X-E, Bay/FFu-X-I, Fix/FFu-C-I, Fix-X-I, Fix-C-E (see main text for a description). We generated synthetic datasets from each of these six models, for all three tasks jointly, using the same sets of stimuli that were originally displayed to the 11 subjects. For each subject, we took four randomly chosen posterior parameter vectors obtained via MCMC sampling (as described in Section 4.3), so as to ensure that the statistics of the simulated responses were similar to those of the subjects. Following this procedure, we generated 264 datasets in total (6 generating models × 11 subjects × 4 posterior samples). We then fit all 6 models to each synthetic dataset, yielding 1584 fitting problems. For computational tractability, we only performed maximum likelihood estimation (see Section 4.2, with 50 restarts), as opposed to MCMC sampling, whose cost would be prohibitive for this number of fits. The analysis was otherwise exactly the same as that used for fitting the subject data. We then computed the fraction of times that a model was the 'best fitting' model for a given generating model, according to AICc (considering that AICc approximates LOO in the limit of large data).

Fitted models
Generating models

Fraction recovered
Model recovery analysis. Each square represents the fraction of datasets that were 'best' fitted from a model (columns), for a given generating model (rows), according to the AICc score. Bright shades of gray correspond to larger fractions. The bright diagonal indicates that the true generating model was, on average, the best-fitting model in all cases, leading to a successful model recovery.
We found that the true generating model was recovered correctly in 89.4% of the datasets on average (see above). This finding means that our models are distinguishable in a realistic setting, and at the same time validates the model fitting pipeline (as it would be unlikely to obtain a successful recovery in the presence of a substantial coding error). Since our model recovery method differs from the procedure used on subject data in the comparison metric (AICc via maximum-likelihood estimation, rather than LOO via MCMC), we verified on subject data that AICc and LOO scores were highly correlated across subjects [39]. The Spearman's rank correlation coefficient between the two metrics was larger than 0.99 for each of the sixteen models in the joint fits, providing strong evidence that results of our model recovery analysis would also transfer to the framework used for the subject data.

Absolute goodness of fit
In this section we describe a general method to compute absolute goodness of fit, largely based on the approach of [40]. 7

Computing the absolute goodness of fit
Let X be a dataset of discrete categorical data grouped in M independent batches with K classes each, such that X jk is the number of observations for the j-th batch and the k-th class. We define N j = k X jk the number of observations for the j-th batch.
We assume that observations are ordered and independent, such that the distribution of observations in each batch j is the product of N j categorical distributions with parameters p j = (p j1 , . . . , p jK ) (frequencies), such that the probability of the data is with unknown vectors of frequencies p j .
We assume that we have a model of interest q that predicts frequencies q jk for the observations, with k q jk = 1 for 1 ≤ j ≤ M . As a reference, we consider the chance model q 0 with frequencies q 0 jk = 1/K. We define the absolute goodness of fit of q as where KL (p||q) is the Kullback-Leibler divergence (also known as relative entropy) between a 'true' distribution p and an 'approximating' distribution q. Importantly, g (q) = 0 when a model performs at chance, and g (q) ≤ 1, with g (q) = 1 only when the model matches the true distribution of the data. In other words, g (q) represents the fractional information gain over chance. Note that g (q) can be negative, in the unfortunate case that a model performs worse than chance.
As another important reference, we recommend to also compute the absolute goodness of fit g (q) of the histogram modelq, with frequencies defined from the empirical frequencies across batches as q jk = M l=1 X lk /N , for 1 ≤ j ≤ M and N = j N j . A comparison between g (q) and g (q) is informative of how better the current model is than a simple histogram of categorical observations collapsed across batches. In some circumstances, the chance model can be a straw model, whereas the histogram model may represent a more sensible reference point.
In order to estimate Eq. S15, we need to compute the relative entropy KL (p||q) between the data and a given distribution q, where the first term is the (negative) entropy of the data, and the second term is called the cross-entropy between p and q. We will show in the following sections that the negative cross-entropy is approximated by the cross-validated log likelihood of the data, LL CV (q). Combining Eq. S15 with our estimates of Eq. S16, we obtain . (S17) which is the negative log likelihood of the model, −LL(q).
Note that typically we also need to estimate the model parameters, and computing Eq. S23 on the same dataset used to estimate parameters will yield a biased estimate of the log likelihood (see e.g., [42]). Shen and Ma suggest to obtain an independent estimate of the log likelihood of the model via crossvalidation, LL CV [40]. According to their method, model parameters are estimated on half of the data, and the log likelihood of the model (and also the entropy of the data) is evaluated with the other half of the data. As an improvement over their method, we advocate to estimate the expected log likelihood via leave-one-out (LOO) cross-validation score obtained via MCMC [37]. This will produce an unbiased estimator of the expected log likelihood, and allows to use all the available data to obtain a more robust estimate of the relative entropy.
In conclusion, our estimate for the cross-entropy is H(p, q) = −LL CV (q), with LL CV (q) computed as the LOO score of the model, and it corresponds to Eq. 19 in [40].

LOO scores for all models
In this section we report tables of LOO scores for all models and subjects, which were used to perform group Bayesian Model Selection, the model comparison technique adopted in the main text. For each subject, LOO scores are shown relative to the LOO of the model with highest mean LOO across subject, which is printed in boldface. Models are ranked according to average LOO. Summing (equivalently, averaging) LOO scores across subjects is a simple 'fixed-effect' model comparison analysis, in which all subjects are believed to belong to the same model. Results of the fixed-effect analysis differ in details from the group Bayesian Model Selection, but the overall qualitative findings are analogous.