Brain Systems for Probabilistic and Dynamic Prediction: Computational Specificity and Integration

Using computational modelling and neuroimaging, we show that two distinct brain systems use different algorithms to make parallel predictions about the environment. These predictions are then optimally combined to control behaviour.

Before going into details about the models, it is important to make clear the intended scope of the models presented here. To do this, we draw the reader's attention to the distinction between algorithmic models (the type used in this paper) and mechanistic models. By an algorithmic model we mean a model that describes the computations performed by the brain whilst remaining agnostic about the neural mechanisms underlying those computations. A mechanistic or biophysical model would also explain how computations are performed by neurons, in terms of action potentials and synaptic transmission; we do not extend the models in this paper to the mechanistic level.
The purpose of algorithmic models, such as those used in this study, is to make quantitative predictions about behaviour and brain function by capturing certain aspects of how human participants perform the task. For example, we hypothesised (in accordance with Bayesian logic) that when participants had access to two sources of information about the space invader's trajectory end-point, they should use precision-weighting (see main paper) to combine the sources.
To test this hypothesis, we constructed a model that used precision-weighting, and compared it to alternative models which did not use precision-weighting to see which model most closely approximated human participants' behaviour.
However, we are not suggesting that the equations described below are necessarily the same ones used by the brain to perform the task. In our model, we computed the posterior mean and standard deviation using equations for the product of Gaussians, but the brain may arrive at the same posterior distribution via different calculations - for example, by simply multiplying probability density functions across a spatial map. The question of interest is only whether or not the posterior distribution is a weighted combination of the two sources.
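As a concrete illustration of this point, the following Python sketch (ours, not the authors' MATLAB code; all numerical values are arbitrary) shows that multiplying two Gaussian probability density functions point-by-point across a spatial grid recovers the same posterior mean and standard deviation as the closed-form product-of-Gaussians equations:

```python
import numpy as np

def combine_gaussians(mu_s, sd_s, mu_d, sd_d):
    """Closed-form posterior from the product-of-Gaussians equations."""
    prec_s, prec_d = 1.0 / sd_s**2, 1.0 / sd_d**2
    mu = (mu_s * prec_s + mu_d * prec_d) / (prec_s + prec_d)
    sd = (1.0 / (prec_s + prec_d)) ** 0.5
    return mu, sd

def combine_on_grid(mu_s, sd_s, mu_d, sd_d, x):
    """Same posterior, obtained by multiplying the two probability density
    functions point-by-point across a spatial map and taking expectations."""
    pdf = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / s
    post = pdf(mu_s, sd_s) * pdf(mu_d, sd_d)
    post /= post.sum()                          # normalise over the grid
    mu = (x * post).sum()                       # posterior mean
    sd = (((x - mu) ** 2 * post).sum()) ** 0.5  # posterior standard deviation
    return mu, sd

x = np.linspace(-20.0, 20.0, 4001)
mu_a, sd_a = combine_gaussians(0.0, 2.0, 5.0, 1.0)   # analytic route
mu_g, sd_g = combine_on_grid(0.0, 2.0, 5.0, 1.0, x)  # spatial-map route
```

The two routes agree to within the grid's discretisation error, which is the sense in which the question of interest is independent of the particular calculation used.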
Throughout the modelling work in this paper, we designed models to capture specific aspects of behaviour and/or test specific hypotheses; the models are not supposed to be a complete account of brain function beyond the specific questions they were designed to address.

The models
We constructed our model of the participants' behaviour in several stages.
Firstly, and most importantly, the premise of the fMRI experiment was that participants use precision-weighting to decide how much to rely on each of the two computational strategies (statistical and dynamic modelling). To test this hypothesis, we compared predicted end-points from a precision-weighted model to those from non-precision-weighted alternatives. Secondly, we describe how participants could learn the statistical distribution of end-points. This model is important in that it gives an estimate (or at least an upper bound) of how well participants could estimate the end-points' distribution on each trial. Because we needed to sample statistical distributions with different levels of precision, occasionally (every 20-40 trials) the ground-truth statistical distribution of end-points moved to a new position in space or changed its variance. Participants could therefore never know the true statistics of the environment, but had to learn them from previously observed landing points.
To account for subjects' incomplete knowledge of the statistical distribution, we built a Bayesian ideal observer that returned estimates of the current mean and variance of the underlying statistical distribution of end-points, conditional on the landing points witnessed by the participant. Although we present this model last, all references to the statistical model in this section and the main paper refer to what our Bayesian learner would believe the underlying statistical distribution to be, rather than to the actual parameters of the generative distribution which only a clairvoyant subject would know.
All the models described in this supplement were fit using a numerical grid approach in MATLAB. To our knowledge there are no known analytical solutions for the particular models proposed, and furthermore, with the small number of parameters involved, the full posterior could be calculated on a relatively fine grid in just a few minutes on a laptop computer. However, note that although the full posterior was discretized this does not mean that the estimates themselves are discretized, as they are calculated by taking expectations over the posterior.
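To illustrate the last point, the following Python sketch (our own; the original fitting was done in MATLAB, and the simulated data and grid ranges here are arbitrary) fits two illustrative free parameters (a response bias and response noise, analogous to the b and k described below) by evaluating the likelihood on a parameter grid and then taking expectations over the discretised posterior, so the resulting estimates are continuous even though the posterior itself is gridded:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data: per-trial model predictions, plus responses with bias and noise
mu_sd = rng.normal(0.0, 5.0, size=200)               # model-predicted end-points
x_r = mu_sd + 1.5 + rng.normal(0.0, 2.0, size=200)   # responses: bias b=1.5, noise k=2

# evaluate the joint log-likelihood of all trials on a (b, k) grid
b_grid = np.linspace(-5.0, 5.0, 101)
k_grid = np.linspace(0.2, 6.0, 101)
B, K = np.meshgrid(b_grid, k_grid, indexing="ij")
resid = x_r[None, None, :] - mu_sd[None, None, :] - B[..., None]
loglik = np.sum(-0.5 * (resid / K[..., None]) ** 2 - np.log(K[..., None]), axis=-1)

# posterior (flat prior) and its expectations: the parameter estimates are
# not restricted to grid points, despite the discretised posterior
post = np.exp(loglik - loglik.max())
post /= post.sum()
b_hat = (B * post).sum()
k_hat = (K * post).sum()
```

With 200 simulated trials the recovered values fall close to the generating bias and noise, which is the property the grid approach relies on.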
All the model fits reported in this supplement were conducted only on the data from the fMRI session of the experiment (not the 350 training trials conducted outside the scanner). The first 20 trials of the fMRI session were practice runs in which the trajectory variance was set to zero; these trials were excluded from the model fitting presented here, so all model fits are based on 200 trials from the fMRI session.
1: Do participants use precision-weighting to combine sources of information?
The premise of the fMRI experiment was that participants should shift between two computational strategies (use of the statistical model and use of dynamic trajectory extrapolation) to achieve the same end (i.e. end-point prediction).
We hypothesised that as participants adjusted the weighting given to the two strategies, the brain areas involved in each computational system should change their level of activity, upregulating their activity when the computational strategy they computed was more behaviourally relevant and downregulating it when the alternative strategy was more relevant. Using this logic, we sought to identify brain systems involved in the two computational strategies as those which upregulated their activity when one of the strategies gave a more precise prediction.
To justify this approach in the fMRI analysis, it is essential to show that participants do actually adjust how much they rely on each source - historical/statistical and dynamic/trajectory - based on the relative precision of end-point predictions afforded by each model. To pre-empt the results, precision-weighting did indeed prove to be a good model of how participants performed the task, and hence we were able to identify brain areas involved in each strategy by looking for fMRI voxels whose activity varied parametrically with the precision of each strategy's prediction.
We performed a formal model comparison to determine whether precision-weighting or some other strategy was used. Let the landing point of the space invader be denoted by x (only the x-coordinate varied from trial to trial). Then x_s is the landing point predicted by the statistical model alone, x_d is the landing point predicted using the dynamic model from the currently observed trajectory, and x_r is the participant's response. We constructed models in which responses were generated from two Gaussians: the statistical distribution p(x_s) ∼ N(µ_s, σ_s), and the distribution of end-point estimates given the current observed trajectory, p(x_d) ∼ N(µ_d, σ_d). These two distributions were combined in different ways in different models to give a combined prediction x_sd with mean µ_sd. To evaluate the goodness of fit of each method for generating µ_sd, we found the overall model log likelihood for each human participant under the assumption that responses x_r were generated from µ_sd (where µ_sd was defined using maximum-likelihood values of the free parameters in the model) with Gaussian noise, such that p(x_r|µ_sd) ∼ N(µ_sd, k²). The response variance k² was fit to the data for each participant. Further details of model evaluation are given below in the 'Model comparison' section.

Models for combining the two predictions
We hypothesised that participants would approximate the Bayes-optimal solution: precision-weighting. In other words, the two sources of information would be combined into a single prediction, taking into account their variance, so that the source with the least variance (highest precision) is given more weight.
This is a plausible solution because it makes the best use of the available information, and also because it would be the natural result of combining two distributions across a single spatial map - simply by multiplying the probabilities given by each of the two distributions at each point in space.
However, one alternative would be to combine the two information sources without taking into account their relative precision (e.g. always predicting a landing point halfway between the independent predictions of the two models). This could hypothetically occur, for example, if participants did not have a useful estimate of the precision of each source. If participants used this unweighted strategy, we would expect both computational brain systems to be equally active on all trials.
A second alternative would be for participants to switch between the two strategies, using only the strategy with the highest precision on any given trial. This might hypothetically occur if the two predictions were made by separate brain systems and for some reason could not be combined. This strategy would imply that the computational systems in the brain should switch between 'on' and 'off' states rather than varying their activity parametrically with precision.
The models are described below, and also summarised in Figure 1.

Weighted combination. The predicted end-point µ_sd lies between the end-points predicted by the historical distribution and the observed trajectory alone; the relative weighting given to each distribution (historical and trajectory) is proportional to its precision (precision is defined as the inverse of variance).
Non-weighted combination. The predicted end-point lay part-way between the estimates from the two separate predictions; the combined prediction was always a fixed proportion of the way between the two separate predictions, which did not vary from trial to trial and did not depend on the relative precisions of the two separate predictions.
Weighted non-combination. The predicted end-point was identical to either the prediction based on the statistical model or the mean prediction based on the dynamic model, depending on which had the highest precision; predictions were not combined within a trial.

Mixing factor M. The mixing factor M determined the relative weight given to the dynamic model vs. the statistical model. Although we modelled how the variance of each prediction (dynamic and statistical) changed from trial to trial, we did not want to assume that equal values of σ_s and σ_d would be given equal weight; by fitting M we allowed for participants being relatively better at dynamic than statistical modelling (and vice versa), and by fitting M for each individual we allowed the relative weight given to each strategy to vary across participants.
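The three candidate rules for generating µ_sd can be sketched as follows (an illustrative Python implementation of our own; the exact way M enters each rule, shown here as a multiplier on the dynamic model's precision or weight, is an assumption):

```python
def predict_mu_sd(mu_s, sd_s, mu_d, sd_d, model, M=1.0):
    """Combined end-point prediction mu_sd under the three candidate models.
    M is the mixing factor weighting the dynamic model against the statistical
    one; how M enters each rule is an illustrative assumption."""
    prec_s, prec_d = 1.0 / sd_s**2, 1.0 / sd_d**2
    if model == "weighted_combination":        # precision-weighted average
        w_d = M * prec_d / (M * prec_d + prec_s)
    elif model == "unweighted_combination":    # fixed mix, ignores precisions
        w_d = M / (M + 1.0)
    elif model == "weighted_noncombination":   # pick the more precise source
        w_d = 1.0 if M * prec_d > prec_s else 0.0
    else:
        raise ValueError(model)
    return w_d * mu_d + (1.0 - w_d) * mu_s

# with sd_s = 2 and sd_d = 1, the dynamic model is four times as precise:
print(predict_mu_sd(0.0, 2.0, 5.0, 1.0, "weighted_combination"))     # 4.0
print(predict_mu_sd(0.0, 2.0, 5.0, 1.0, "unweighted_combination"))   # 2.5
print(predict_mu_sd(0.0, 2.0, 5.0, 1.0, "weighted_noncombination"))  # 5.0
```

Note how only the weighted-combination rule moves µ_sd smoothly with the trial-wise precisions, which is what makes the three models behaviourally distinguishable.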

Free parameters and input distributions
Response bias b. We included a bias constant b, which was fit from the data. This allowed for a tendency to consistently respond to the left or right of the best-fit end-point, and was included on the basis of the informal observation that some participants did indeed tend to show a linear bias in responses.
Response noise k. We modelled participants' responses as generated from the combined prediction µ_sd with additive Gaussian noise, such that p(x_r|µ_sd) ∼ N(µ_sd, k²); the response noise k was fit to the data for each participant.

Model comparison
To determine which model gave the best description of the behaviour of human participants, we calculated the log likelihood ratio between models (since all models had the same number of free parameters, there was no need to correct for model complexity).
For each model, we calculated the probability of human participants' responses x_r, given that x_r ∼ N(µ_sd, k²) (where the bold font denotes vector values for x_r and µ_sd, as these variables take different values on each trial). Using Bayes' theorem, we calculated the likelihood of each model, given the data and the parameter set M, k, b, as p(x_r ∼ N(µ_sd, k²) | x_r) ∝ p(x_r | x_r ∼ N(µ_sd, k²)).
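A minimal Python sketch of this likelihood comparison (our own illustration with arbitrary values; in the real analysis k is the maximum-likelihood fit for each participant):

```python
import numpy as np

def log_likelihood(x_r, mu_sd, k):
    """Gaussian log-likelihood of the response vector x_r given the per-trial
    predictions mu_sd and response noise k."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * k**2)
                        - 0.5 * ((x_r - mu_sd) / k) ** 2))

# log likelihood ratio between two candidate models; with equal parameter
# counts (and equal k) the constant terms cancel, so no complexity
# correction is needed
x_r  = np.array([1.0, 2.0, 0.5, 1.5])   # illustrative responses
mu_A = np.array([1.1, 1.8, 0.6, 1.4])   # model A's trial-wise predictions
mu_B = np.zeros(4)                      # model B's trial-wise predictions
logLR = log_likelihood(x_r, mu_A, 1.0) - log_likelihood(x_r, mu_B, 1.0)
```

A positive logLR favours model A; summing the per-participant logLR values gives the overall group-level comparison reported in the Results.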

Results
Over the group of subjects, the model which gave the best fit to participants' performance was the weighted combination model, followed by the non-weighted combination model (overall logLR for the weighted vs. non-weighted combination model was 105; the range of logLR across individual participants was -0.8 to 12.9, with a mean of 4.8), then the weighted non-combination model (overall logLR for the weighted combination vs. weighted non-combination model was 363; range across individual participants 7.9 to 30.6, mean 16.5). Log likelihood ratios for the different models are shown in Supplementary Table S1 and Supplementary Figure S1.

BIC analysis
We used the free parameters M and b to fit aspects of participants' behaviour about which we did not have strong a priori predictions (M was included because we were not sure whether some participants would give more weight to the dynamic or statistical model than would be expected under optimal precision-weighting) or which reflected observations from inspection of the data (b was included because many subjects seemed to show a constant bias to respond to the left or right of the optimal end-point). However, to ensure that these additional parameters did not unduly favour any model, we repeated the model comparison using the Bayesian Information Criterion (BIC), with versions of each model in which M and b were either fixed or free. Where b was fixed, its value was set to zero. Where M was fixed, its value was set to reflect optimal performance: in the weighted combination model, M was set to 1, the value it would take if participants used optimal precision-weighting; in the non-weighted combination model, M was set to match the ratio of precisions across the whole experiment. This value was obtained using the formula for the mean of the product of two Gaussians; hence M_OPT was set to the average weighting obtained over the whole set of 200 trials:

M_OPT = (1/200) Σ_t σ_s²(t) / (σ_s²(t) + σ_d²(t))

where t denotes trial number. For the weighted non-combination model, M was set to 1. Overall, the BIC analysis supports our conclusion that participants used a precision-weighting strategy, but provides little evidence for an effect of the number of free parameters on model fits.
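For reference, the BIC penalty used in such an analysis can be sketched as follows (the standard formula, not the authors' code; numerical values are illustrative):

```python
import numpy as np

def bic(loglik, n_free_params, n_trials):
    """Bayesian Information Criterion (lower is better): each free parameter
    adds a penalty of log(n_trials)."""
    return n_free_params * np.log(n_trials) - 2.0 * loglik

# with 200 trials, freeing one extra parameter must improve the fit by
# 0.5 * log(200), about 2.65 log-likelihood units, just to break even
break_even = 0.5 * np.log(200)
```

This quantifies the sense in which freeing M and b had little effect on the model ranking: the likelihood differences between models were far larger than the per-parameter penalty.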

2: How could a spatial prior constrain trajectory extrapolation?
In the previous section we used behavioural evidence to test whether participants used precision-weighting to combine two predictions. In that section we used a simplified model in which a trajectory of a certain shape (quadratic) was fit to the observed data points using least-squares regression, without reference to the statistical distribution of end-points over many trials. This resulted in a probability density function over end-points given the trajectory, which was combined (in different ways for the different models) with a probability density function over end-points given the statistical model, to give a combined prediction. In other words, the dynamic and statistical predictions were calculated separately and combined in the reference frame of the trajectory end-point's coordinates.
However, we do not wish to assert that in the brain the two sources of information were combined only at the end of the trajectory, or that this combination necessarily occurs in the spatial reference frame of end-points' coordinates. It seems equally possible that the statistical model acts as a prior over possible trajectories, constraining a process of estimating the current trajectory which unfolds as each new data point from the trajectrory is observed.
The simple model used in the previous section was useful in that it was computationally undemanding and hence could be run many times to determine the values of the free parameters M, k and b. Furthermore, a simple model with two predictive distributions over end-point location (one from the trajectory and one from the prior) could be used to test different hypotheses about how the predictions were combined (unweighted combination, weighted non-combination, etc.).
Conveniently, in terms of end-point predictions, this simple model is in fact exactly equivalent to a more complex model in which the statistical model constrains the fit of the trajectory parameters throughout the trajectory-estimation process; this more closely reflects how we hypothesise the brain might combine the two models, and is described in more detail below.

Dynamic integrative model
We hypothesised that people predicted the space invaders' trajectories by constructing a dynamic model of the trajectory which was influenced by prior beliefs about the trajectory's eventual end point based on the statistical model. We define a dynamic model in this case as a model of the velocity and acceleration of the space invader given its position. More generally a dynamic model could be defined as one which describes how the value of a parameter changes over time. A dynamic model could be described mathematically using a set of differential equations.
To model participants' predictions of the space invader's trajectory, we constructed a dynamic model in which the equation of motion of the space invader was estimated, and some sets of parameters were considered a priori more likely because they gave rise to end-points which would be more likely given the statistical model.
To simulate the estimation process in the real task, the model was given each data point from the trajectory sequentially, and was updated online as each data point was observed. Essentially the model, which is described in detail below, was a quadratic Kalman filter for the curved trajectory of the space invader.

Generative/'ground truth' form of trajectories
The actual trajectory ('ground truth trajectory') of the space invader on each trial was defined by a constant-acceleration equation in the horizontal dimension and a constant velocity in the vertical dimension. That is:

d²x/dt² = q,    dy/dt = p

In fact the velocity in y (the value of p) was fixed across all trials, but the acceleration in x (the value of q) varied from trial to trial, as did the trajectory start point x_0. Solving these equations gives a general solution (putting x in terms of y) in the form of a quadratic curve:

x = ay² + by + x_0     (7)

In fact, b was always zero in the ground-truth trajectory, but we did not assume participants knew this; we allowed the model to estimate trajectories with different values of b.
So that participants could not predict the trajectory end-point statically from the start point, a new pair of values for a and x 0 were selected on each trial.
The trajectories were generated such that their end-points followed a Gaussian distribution (the statistical model which participants estimated) and obviously this constrained the joint choice of a and x 0 , but neither parameter individually could be predicted in advance, or used individually to predict the trajectory's end-point.
The data points presented to the participant were generated from the underlying trajectory by adding Gaussian random noise, in the horizontal dimension, to each observed data point x i independently. The purpose of including the noise was to create perceptual uncertainty (participants were told that their "radar signal" was noisy).
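The generative process for a single trial can be sketched as follows (a Python illustration of our own; all parameter values are arbitrary):

```python
import numpy as np

def make_trajectory(a, x0, n_points=20, y_max=10.0, noise_sd=0.3, rng=None):
    """One trial's trajectory: x = a*y^2 + x0 (b is always zero in the ground
    truth), observed with independent horizontal Gaussian noise on each point."""
    rng = rng if rng is not None else np.random.default_rng()
    y = np.linspace(0.0, y_max, n_points)   # constant velocity in y
    x_true = a * y**2 + x0                  # constant acceleration in x
    x_obs = x_true + rng.normal(0.0, noise_sd, size=n_points)
    return y, x_true, x_obs

y, x_true, x_obs = make_trajectory(a=0.05, x0=-2.0, rng=np.random.default_rng(1))
end_point = x_true[-1]   # landing point: 0.05 * 10^2 - 2 = 3.0
```

Sampling a and x_0 jointly so that end_point follows the desired Gaussian, while neither parameter alone predicts it, reproduces the constraint described above.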
Note that although in our model we parameterized the trajectory in terms of x and y rather than x, y and t, there is a unique mapping between each quadratic curve of the form x = ay² + x_0 and each pair of differential equations, given that p is fixed. In other words, fitting a quadratic curve is simply a re-parameterization and should not be taken to imply that participants estimate a static curve rather than a dynamic equation of motion.

Model
We constructed a Bayesian model that estimated the trajectory online, updating its estimates as each data point from that trajectory was observed. The model assumed that data points were generated following the equation:

x_i = ay_i² + by_i + x_0 + E_i

where E_i is Gaussian noise such that E_i ∼ N(0, σ²) and the subscript i denotes observed data points within a trajectory.
. . . in other words, the model 'knew' that the space invader had a constant acceleration in x and that noise was Gaussian. The only difference between the form of trajectory estimated by the model and form of the 'ground truth' trajectory was that the model did not know that the coefficient of the linear term (b) was always zero.
We used a numerical (grid) method to fit the values of the model's free parameters after each data point was observed. The model operated on a 4-dimensional state space for a, b, x_0 and σ. After each data point was observed, the probability that each parameter set {a, b, x_0, σ} was the correct one was updated using Bayes' rule:

p(a, b, x_0, σ | x_i) ∝ p(x_i | a, b, x_0, σ) p(a, b, x_0, σ)

where the likelihood p(x_i | a, b, x_0, σ) was calculated by finding the value of x_i that would be expected given the set of values a, b, x_0, using the equation:

x̂_i = ay_i² + by_i + x_0

and the probability of the observed data point x_i given a, b, x_0 and σ was then calculated using the probability density function for a Gaussian distribution (with mean x̂_i and standard deviation σ).
The prior p(a, b, x 0 , σ), for data points i > 1, was simply the posterior probability of those parameter values from the previous data point.
For data point i = 1, the prior over a, b, x_0, σ was initialised according to participants' prior beliefs about the statistical distribution of trajectory end-points.
Trajectory end-points followed a Gaussian distribution (see Section 3 below for more details of how this prior was calculated on a trial-to-trial basis). Each set of parameters for the underlying trajectory (a, b, x_0) would predict a certain end-point x_end|a,b,x_0. To calculate a prior over the trajectory parameters a, b, x_0 rather than the spatial parameter x_end, the probability of the end-point x_end|a,b,x_0 given the prior was calculated using the probability density function for a Gaussian distribution, and this probability was assigned to the parameter combination {a, b, x_0}.

Qualitatively, it can be seen that the greatest difference between models using and ignoring the statistical model actually occurs at the start of the trajectory.
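The grid-based sequential update described above can be sketched as follows (a Python simplification of our own: σ is fixed at its true value rather than gridded, and the grid ranges, prior parameters and trajectory values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# parameter grids (the full model also maintains a grid dimension over sigma)
a_grid  = np.linspace(-0.1, 0.1, 41)
b_grid  = np.linspace(-1.0, 1.0, 21)
x0_grid = np.linspace(-5.0, 5.0, 41)
A, B, X0 = np.meshgrid(a_grid, b_grid, x0_grid, indexing="ij")
sigma, y_end = 0.3, 10.0

# prior over (a, b, x0) induced by a Gaussian statistical model of end-points
x_end = A * y_end**2 + B * y_end + X0
mu_prior, sd_prior = 1.0, 1.0
log_post = -0.5 * ((x_end - mu_prior) / sd_prior) ** 2

# noisy observations from a ground-truth trajectory x = 0.05*y^2 - 2
# (true end-point 3.0), observed only over the first half of the flight
y_obs = np.linspace(0.5, 5.0, 10)
x_obs = 0.05 * y_obs**2 - 2.0 + rng.normal(0.0, sigma, y_obs.size)

for yi, xi in zip(y_obs, x_obs):
    pred = A * yi**2 + B * yi + X0                 # expected x per parameter set
    log_post += -0.5 * ((xi - pred) / sigma) ** 2  # Bayes' rule, in log space
    log_post -= log_post.max()                     # numerical stability

post = np.exp(log_post)
post /= post.sum()
x_end_hat = (x_end * post).sum()   # posterior-mean prediction of the landing point
```

Even from half a trajectory, the posterior-mean end-point lies between the prior mean and the true landing point, illustrating how the statistical prior constrains extrapolation early in the trajectory.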
When there is relatively little information from the observed trajectory, the prior over end-points based on the statistical model already constrains the possible trajectory considerably. This is evident from the fact that, in the presence of a prior, estimated trajectories ended in roughly the right position even after only a few data points had been observed.

The equivalence between the sequential model and the simplified model of Section 1 can be written in terms of the probabilities of x_end after each data point:

p(x_end | x_1 . . . x_p) ∝ p(x_end) p(x_1 | x_end) . . . p(x_p | x_end)     (12)

In the full model (this section) we estimate the PDF over x_end (or equivalently, over a, b, x_0, σ) sequentially for each data point, using the PDF after data point x_(i-1) as the prior at data point x_i, and using the statistical model as the prior at the first data point (or equivalently, as the PDF for an initial 'data point' x_0). But as can be seen from equation 12, this is exactly equivalent to fitting a PDF over x_end, or equivalently over a, b, x_0, σ, to all the data points x_1 . . . x_p simultaneously, and then combining it with the prior based on the statistical model - as we did in the simplified model in Section 1.

Modelling participants' estimate of the statistical distribution of end-points
We hypothesised that participants used a priori knowledge about the statistical distribution of trajectory end-points to resolve uncertainty when the observed trajectory was noisy. However, because the underlying (generative) distribution of end-points changed periodically, a model in which participants used the true generative end-points' distribution was not realistic - how could they know what that distribution was on trials just after it had changed? Yet the change in the true prior was an essential part of the design, because we needed to sample distributions with different variances.
To give a more realistic estimate of participants' statistical model of the end-points' distribution on any given trial, we created a Bayesian 'computer participant'. This 'computer participant' was a Bayesian ideal observer, and hence its beliefs about the historical distribution on each trial represented the most accurate estimate an observer could have, given the data points observed.
We did not have direct access to participants' estimates of the historical distribution, because we only measured their predictions in the presence of both the historical distribution and the trajectory. Hence, we could not readily measure whether the Bayesian learner gave a good fit to participants' actual learning; we can only say that the present model gives an upper bound on the accuracy and precision of the statistical model, which must be a better estimate of participants' real beliefs than simply using the actual (generative/ground-truth) distribution, which only a clairvoyant subject would know. Note, however, that the precise method of learning is not critical to the main argument (that participants use precision-weighting); the purpose of this model is only to improve our estimate of participants' beliefs in order to test whether precision-weighting was used.

Bayesian computer participant
Our Bayesian learner operated as follows: at the start of each trial it had a model (a prior) of the end-points' distribution, based on its experience of the environment's statistics before that trial. When a new end-point was observed, this prior was updated using Bayes' theorem. The Bayesian learner learned entirely from the true end-point on each trial, which was shown to participants at the end of the trial as feedback. Therefore the Bayesian learner could in principle operate without any access to dynamic estimates of the trajectories' shapes.

Generative model
The Bayesian learner was provided with an accurate model of the shape of the generative distribution of end-points and the fact that its parameters could jump to new values; we simply assumed that participants had learned the overall structure of the environment during the training session of 1 hour / 350 trials. However, the parameters of the distribution and frequency of jumps were modelled as free parameters.
Trajectory end points were drawn from a Gaussian distribution with unknown mean µ(t) and variance σ²(t); these are the underlying values of which µ_s(t) and σ²_s(t) are estimators. For clarity we omit the subscript s in this entire section, and the subscript t now denotes trial number. Hence:

x_t ∼ N(µ(t), σ²(t))

where x denotes the trajectory end-point, previously called x_end. The mean and variance of this Gaussian could 'jump' independently to completely new values; following a jump, all values of the jumped parameter (mean or variance) were considered equally likely. Let J_µ and J_σ be binary variables representing whether a jump occurred in µ or σ respectively. J_µ and J_σ follow binomial distributions with probabilities α_µ and α_σ respectively; these probabilities are free parameters in the model (i.e. they are inferred from the data).
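A grid-based learner of this kind can be sketched as follows (a Python simplification of our own: the jump probabilities α_µ and α_σ are fixed here rather than inferred, and the grid ranges and simulated end-points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

mu_grid = np.linspace(-10.0, 10.0, 81)
sd_grid = np.linspace(0.5, 5.0, 46)
MU, SD = np.meshgrid(mu_grid, sd_grid, indexing="ij")
alpha_mu, alpha_sd = 0.05, 0.05   # jump probabilities (fixed here for brevity;
                                  # they are free parameters in the real model)

def jump_transition(post):
    """Mix the belief with uniform distributions to allow for independent
    jumps in the mean and in the standard deviation."""
    u_mu = np.full((post.shape[0], 1), 1.0 / post.shape[0])
    u_sd = np.full((1, post.shape[1]), 1.0 / post.shape[1])
    p_mu = post.sum(axis=1, keepdims=True)   # marginal belief about mu
    p_sd = post.sum(axis=0, keepdims=True)   # marginal belief about sd
    return ((1 - alpha_mu) * (1 - alpha_sd) * post
            + alpha_mu * (1 - alpha_sd) * u_mu * p_sd
            + (1 - alpha_mu) * alpha_sd * p_mu * u_sd
            + alpha_mu * alpha_sd * u_mu * u_sd)

def observe(post, x):
    """One trial: apply the jump transition, then Bayes' rule on the end-point."""
    lik = np.exp(-0.5 * ((x - MU) / SD) ** 2) / SD
    post = jump_transition(post) * lik
    return post / post.sum()

post = np.full(MU.shape, 1.0 / MU.size)    # uniform initial belief
for x in rng.normal(2.0, 1.5, size=60):    # end-points drawn from N(2, 1.5^2)
    post = observe(post, x)

i, j = np.unravel_index(np.argmax(post), post.shape)
mu_hat, sd_hat = mu_grid[i], sd_grid[j]    # joint maximum a posteriori estimates
```

Because the jump transition continually leaks probability back towards uniform, the learner weights recent end-points more heavily, so its estimates can track jumps in the generative distribution.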
At each trial t, the estimates of µ_t and σ_t used in the fMRI modelling were the joint maximum-likelihood values of {µ_t, σ_t}, calculated after marginalising over α_µ and α_σ.