
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JGM MRF. Performed the experiments: JGM. Analyzed the data: JGM. Wrote the paper: JGM PNS. Supplied the intuitions: MRF. Supplied the concepts: JGM.

Sensory processing in the brain includes three key operations: multisensory integration—the task of combining cues into a single estimate of a common underlying stimulus; coordinate transformations—the change of reference frame for a stimulus (e.g., retinotopic to body-centered) effected through knowledge about an intervening variable (e.g., gaze position); and the incorporation of prior information. Statistically optimal sensory processing requires that each of these operations maintains the correct posterior distribution over the stimulus. Elements of this optimality have been demonstrated in many behavioral contexts in humans and other animals, suggesting that the neural computations are indeed optimal. That the relationships between sensory modalities are complex and plastic further suggests that these computations are learned—but how? We provide a principled answer, by treating the acquisition of these mappings as a case of density estimation, a well-studied problem in machine learning and statistics, in which the distribution of observed data is modeled in terms of a set of fixed parameters and a set of latent variables. In our case, the observed data are unisensory-population activities, the fixed parameters are synaptic connections, and the latent variables are multisensory-population activities. In particular, we train a restricted Boltzmann machine with the biologically plausible contrastive-divergence rule to learn a range of neural computations not previously demonstrated under a single approach: optimal integration; encoding of priors; hierarchical integration of cues; learning when not to integrate; and coordinate transformation. The model makes testable predictions about the nature of multisensory representations.

Over the first few years of their lives, humans (and other animals) appear to learn how to combine signals from multiple sense modalities: when to “integrate” them into a single percept, as with visual and proprioceptive information about one's body; when

The brain often receives information about the same feature of the same object from multiple sources; e.g., in a visually guided reach, both vision and proprioception provide information about hand location. Were both signals infinitely precise, one could simply be ignored; but fidelity is limited by irrelevant inputs, intrinsic neural noise, and the spatial precisions of the transducers, so there are better and worse ways to use them. The best will not throw away any information—in Bayesian terms, the posterior probability over the stimulus given the activities of the integrating neurons will match the corresponding posterior given the input signals. Encoding in the integrating neurons the entire posterior for each stimulus, and not merely the best point estimate, is crucial because this distribution contains information about the confidence of the estimate, which is required for optimal computation with the stimulus estimate

Psychophysical evidence suggests that animals—and therefore their brains—are indeed integrating multisensory inputs in such an “optimal” manner. Human subjects appear to choose actions based on the peak of the optimal posterior over the stimulus, given a variety of multisensory inputs

A plausible neural model of multisensory integration, then, must learn without supervision how to optimally combine signals from two or more input populations, as well as

Here we show that the task of integration can be reformulated as

A network that has learned to perform the integration task will transmit to downstream neurons (

(A) The model and example data. World-driven data are generated according to the directed graphical model boxed in the lower right: On each trial, a hand location

We begin by examining the ability of our model to perform optimal multisensory integration, in the sense just described. We use our “standard” network, with a visible layer of 1,800 Poisson units, comprising two 30×30 input populations, and a hidden layer of half that number of Bernoulli units. We trained and tested this network on separate datasets, with stimuli chosen uniformly in the 2D space of joint angles (see

We first show that the hidden layer successfully encodes the optimal-posterior mean. For a fixed stimulus location,

The four ellipses in each plot correspond to the covariances of four different estimates of the stimulus: the MAP estimate of the stimulus using only the visual input population (magenta), the MAP estimate using the proprioceptive input population (orange), the MAP estimate using both populations (i.e., the true posterior mean, which is the optimal estimate; black), and the estimate based on decoding the hidden layer (“RBM-based estimate”; green). (The color conventions are the same throughout the paper.) Each ellipse bounds the

We can quantify how much these imperfections detract from the overall optimality of the model. Since the MAP estimate is the unique minimizer of the average (over all stimuli) mean square error, the

We next show that the hidden layer also encodes the optimal-posterior covariance. The posterior covariance represents the uncertainty on a single trial about the true stimulus location, given the specific spike counts on this trial. Since

The posterior precision (inverse covariance) on each trial is a
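For an idealized population of independent Poisson neurons with dense, translation-invariant Gaussian tuning and a flat prior, this relationship is simple: the posterior is Gaussian with mean at the population's center of mass and precision proportional to the total spike count. A toy one-dimensional sketch (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D population of independent Poisson neurons with Gaussian tuning
# curves (a toy analogue of one input population).
centers = np.linspace(-40.0, 40.0, 81)   # preferred stimuli
sigma_tc = 10.0                          # tuning-curve width
gain = 5.0                               # expected peak spike count
stim = 3.0

rates = gain * np.exp(-(stim - centers) ** 2 / (2 * sigma_tc ** 2))
counts = rng.poisson(rates)

# For dense, translation-invariant Gaussian tuning and a flat prior,
# the posterior over the stimulus is Gaussian with
#   mean     = sum_i r_i c_i / sum_i r_i   (center of mass)
#   variance = sigma_tc^2 / sum_i r_i,
# i.e., the posterior precision is linear in the total spike count.
total = counts.sum()
post_mean = (counts * centers).sum() / total
post_var = sigma_tc ** 2 / total
```

Higher gains thus yield more spikes and a proportionally tighter posterior, which is why the gains control the reliability of each input population.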

(A) Reconstruction of the input total spike counts,

How do these values translate into the quantity we really care about, the posterior covariance, and by implication the posterior distribution itself? To quantify this, we employ the standard measure of similarity for distributions, the KL divergence. Since the true posterior is Gaussian, and since the RBM encodes the (nearly) correct mean and variance of
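The KL divergence between two multivariate Gaussians has a closed form, so the comparison reduces to plugging in the decoded and optimal means and covariances. A small helper, with illustrative numbers:

```python
import numpy as np

def gauss_kl(mu0, cov0, mu1, cov1):
    """KL divergence KL(N(mu0, cov0) || N(mu1, cov1)), d-dimensional."""
    d = len(mu0)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# Identical Gaussians have zero divergence...
mu = np.zeros(2)
cov = np.array([[5.0, 1.0], [1.0, 50.0]])
assert abs(gauss_kl(mu, cov, mu, cov)) < 1e-12

# ...and the divergence grows as the second distribution's mean drifts.
kl_small = gauss_kl(mu, cov, mu + 0.1, cov)
kl_large = gauss_kl(mu, cov, mu + 1.0, cov)
assert 0 < kl_small < kl_large
```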

What constitutes a proper point of comparison? Consider a fixed computation of the covariance which uses

Finally, we directly demonstrate the fidelity of the entire model posterior,

This quantity is

Networks with different numbers of hidden units (see legend;

We now relate our model to some familiar results from psychophysical investigations of multisensory integration. In the foregoing simulations, the input populations were driven by the same stimulus. The most common experimental probe of integration, however, is to examine the effects of a small, fixed discrepancy between two modalities—with, e.g., prism goggles or virtual feedback

To replicate these experiments, the trained network from

After training, the model was tested on data that differ from its training distribution. (A) Discrepant-input data:

Another way of measuring the machine's generalization is to test its performance under

Finally, we examine machine performance under

So far we trained on stimuli that were chosen uniformly in joint space, so that the posterior mean is simply the peak of the likelihood given the inputs,

For simplicity, we chose the prior
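With a Gaussian likelihood and Gaussian prior, the conjugate update makes the predicted effect concrete: precisions add, and the posterior mean is pulled from the likelihood peak toward the prior mean. A sketch with illustrative numbers:

```python
import numpy as np

# Likelihood from the (integrated) sensory evidence, and a Gaussian
# prior over the stimulus (values are illustrative).
mu_like, var_like = 10.0, 4.0
mu_prior, var_prior = 0.0, 9.0

# Conjugate update: precisions add; the mean is precision-weighted.
prec_post = 1.0 / var_like + 1.0 / var_prior
var_post = 1.0 / prec_post
mu_post = var_post * (mu_like / var_like + mu_prior / var_prior)

# The posterior mean is pulled from the likelihood peak toward the
# prior mean, and the posterior is tighter than the likelihood alone.
assert mu_prior < mu_post < mu_like
assert var_post < var_like
```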

This is precisely what we see for the RBM-based estimate in

(A,B) Learning a prior. The network was trained on population codes of an underlying stimulus that was drawn from a Gaussian (rather than uniform, as in the previous figures) prior. This makes the MAP estimate tighter (cf. the black ellipses here and in

We have been supposing the model to correspond to a multisensory area that combines proprioception of the (say) right hand with vision. When not looking at the right hand, then, the populations ought to be independent; and an appropriate model should be able to learn this even more complicated dataset, in which the two populations have a common source on only some subset of the total set of examples. This is another well-known problem in psychophysics and computational neuroscience (see e.g.

A slightly simpler model includes among the input data an explicit signal as to whether the input populations are coupled, in our case by dedicating one neuron to reporting it. This model is shown in

A plausible neural model of multisensory integration will be

(A,B) A “hierarchical” network, in which a third modality must be integrated with the integrated estimate from the first stage—which is just the original model. (A) Data generation (bottom), population coding (middle), and network architecture (cf.

Again we focus on the error statistics of the posterior mean (

We consider now another, seemingly different, computational problem studied in the sensorimotor literature, coordinate transformation (sometimes called “sensory combination”
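For intuition, consider a linearized one-dimensional sketch in which the transformed quantity is the sum of two noisy estimates, e.g. an eye-centered target location plus a gaze angle (the numbers are illustrative; the model below uses a nonlinear transformation). Unlike integration, combination accumulates uncertainty:

```python
# Eye-centered target estimate and gaze-angle estimate (illustrative),
# each with its own uncertainty (variances in deg^2).
mu_eye, var_eye = 12.0, 4.0      # target relative to fixation
mu_gaze, var_gaze = -5.0, 2.0    # gaze relative to the head

# For an (approximately) additive transformation, the head-centered
# estimate is the sum, and, in contrast to integration, the variances
# of the independent noise sources add rather than shrink:
mu_head = mu_eye + mu_gaze
var_head = var_eye + var_gaze

assert mu_head == 7.0
assert var_head > max(var_eye, var_gaze)
```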

In this model, the proprioceptive population is responsible for a larger space than either of the other two variables, a consequence of our choice to sample in the space of the latter (see

We now examine some of the properties of the hidden units, especially those that electrophysiologists have focused on in multisensory neurons in rhesus macaques.

(A) Tuning curves for the multisensory-integration model/data (

Interestingly, the two-dimensional tuning for joint angles (left column) is multimodal for many cells—although also highly structured, as apparent from comparison of tuning for the trained (upper row) and untrained (lower row) networks. Although multimodal tuning has been found in multisensory areas, for example, area VIP (see

We therefore restrict attention to the 1D submanifold of joint space indicated by the black slash through the 2D tuning curves, corresponding to an arc in the visual space, since tuning over this range was reported in

Investigation of multisensory tuning properties has a longer history for coordinate transformations. Here, especially in Area 5d, MIP, and VIP, neurons have been reported to encode objects in reference frames intermediate between eye and body ("partially shifting receptive fields"), i.e., the receptive field moves with the eye, but not completely; and eye position modulates the amplitude of the tuning curve ("gain fields")

More recently, Andersen and colleagues

We reproduce that analysis on our model here. In

Nevertheless, we emphasize that correspondence between model and data in

We have demonstrated a neural-network model of multisensory integration that achieves a number of desirable objectives that have not been captured before in a single model: learning

Our approach is based on two central ideas. The first, following

The second central idea is that this information-retention criterion will be satisfied by the hidden or “latent” variables,

With the network implementation of latent-variable density estimation, we have demonstrated how all three of these learning problems—optimal integration, the integration of prior information, and coordinate transformations—can be solved by multisensory neural circuits. We have previously argued that these three operations are exactly those required for planning multisensory-guided reaching movements

One example of a statistical feature that is constant across trials is the prior distribution of the stimulus, which the network therefore learns to encode in its weights. Whether prior distributions in the brain are encoded in synaptic weights

An interesting consequence of the present formulation is that it renders the gains random variables (see e.g.

The question of whether cortical circuits learn to encode any posterior covariance information at all, as opposed to merely the point estimate that psychophysical experiments elicit, is itself a crucial, open one. Of course, in theory one can always compute a posterior over the stimulus given some population activities

That fewer units are used to represent the same information (half as many in our simple integration model; see

Multisensory integration was first considered from the standpoint of information theory and unsupervised learning in

More recent models of multisensory integration or cross-modal transformation neglect some combination of the desiderata listed in the introduction. Basis-function networks with attractor dynamics

Finally, many authors have either anticipated

Notation is standard: capital letters for random variables, lowercase for their realizations; boldfaced font for vectors, italic for scalars.

Throughout, we work with the example case of integrating two-dimensional (2D) proprioceptive and visual signals of hand location, but the model maps straightforwardly onto any pair of co-varying sensory signals. These two signals report elbow and shoulder joint angles (

Each training vector consists of a set of spike counts,

To avoid clipping effects at the edges, the space spanned by this grid of

The prior over the stimulus is either uniform or Gaussian in the space of joint angles. (Implementation of the Gaussian prior is detailed in

The priors over the gains,
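Putting the generative pieces together, one training vector can be sketched as follows: draw a stimulus from its prior, draw an independent gain for each population, and sample Poisson spike counts from Gaussian tuning curves on a 30×30 grid. This is a simplified sketch (in the actual model the two populations encode the stimulus in different spaces, and the parameter values here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Preferred stimuli on a 30x30 grid over a unit square (illustrative).
grid = np.linspace(0.0, 1.0, 30)
cx, cy = np.meshgrid(grid, grid)
sigma_tc = 0.08                           # tuning-curve width

stim = rng.uniform(0.1, 0.9, size=2)      # stimulus from a uniform prior

def population_counts(gain):
    """Poisson spike counts for one 30x30 population at this stimulus."""
    d2 = (cx - stim[0]) ** 2 + (cy - stim[1]) ** 2
    rates = gain * np.exp(-d2 / (2 * sigma_tc ** 2))
    return rng.poisson(rates).ravel()

# Independent gains for the two populations, then concatenate their
# counts into one 1,800-dimensional visible vector.
g_prop, g_vis = rng.uniform(2.0, 8.0, size=2)
vector = np.concatenate([population_counts(g_prop),
                         population_counts(g_vis)])
assert vector.shape == (1800,)
```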

To show that the model works, we must compare two posterior distributions over the stimulus: the posterior conditioned on the input data, and the posterior conditioned on the hidden units. The resulting localization variances are on the order of mm^{2} for the two populations, satisfying the requirement. These values are also comparable to empirical values for visual and proprioceptive localization variances from human psychophysics, 5 mm^{2} and 50 mm^{2}, respectively.

Now, whereas a Gaussian posterior requires a flat or Gaussian prior, such a prior in prop space will induce an irregular prior in

(See

The nonlinearity (cosine) in the 1D “coordinate-transformation model” (

The addition of a non-flat prior (

The neural circuit for sensory integration was modeled as a restricted Boltzmann machine, a two-layer, undirected, generative model with no intralayer connections and full interlayer connections (
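The absence of intralayer connections is what makes inference tractable: given the visible layer, the hidden units are conditionally independent Bernoulli variables with logistic activation. A miniature sketch (dimensions and parameters are illustrative; the model's visible units are Poisson, but the hidden conditional has the same logistic form):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A miniature RBM: n_vis visible units, n_hid Bernoulli hidden units,
# full interlayer weights W, and no intralayer connections.
n_vis, n_hid = 6, 3
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_hid = np.zeros(n_hid)

v = rng.poisson(2.0, size=n_vis).astype(float)   # a visible count vector

# With no hidden-hidden connections, the hidden units are conditionally
# independent given the visible layer, so they can be sampled in one shot:
p_hid = sigmoid(b_hid + v @ W)
h = (rng.random(n_hid) < p_hid).astype(float)

assert p_hid.shape == (n_hid,)
assert set(h) <= {0.0, 1.0}
```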

During RBM training

Weights and biases were initialized randomly, after which the networks were trained on batches of 40,000 vectors, with weight changes made after computing statistics on mini-batches of 40 vectors apiece. One cycle through all 1000 mini-batches constitutes an “epoch,” and learning was repeated on a batch for 15 epochs, after which the learning rates were lowered by a factor of
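A single CD-1 weight update on one mini-batch can be sketched as follows. For brevity this toy version uses Bernoulli visible units and omits the bias updates (unlike the trained networks, whose visible units are Poisson):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy Bernoulli-Bernoulli RBM and a mini-batch of 40 data vectors.
n_vis, n_hid, batch = 8, 4, 40
W = 0.01 * rng.standard_normal((n_vis, n_hid))
lr = 0.05

V0 = (rng.random((batch, n_vis)) < 0.3).astype(float)

# Positive phase: hidden probabilities driven by the data.
P_h0 = sigmoid(V0 @ W)
H0 = (rng.random((batch, n_hid)) < P_h0).astype(float)

# Negative phase: one Gibbs step (reconstruct, then re-infer hiddens).
P_v1 = sigmoid(H0 @ W.T)
V1 = (rng.random((batch, n_vis)) < P_v1).astype(float)
P_h1 = sigmoid(V1 @ W)

# Contrastive-divergence update: data correlations minus reconstruction
# correlations, averaged over the mini-batch.
dW = (V0.T @ P_h0 - V1.T @ P_h1) / batch
W += lr * dW
```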

After training, learning was turned off, and the network was tested on a fresh batch of 40,000 data vectors (

For each trial, decoding the hidden vector consists of estimating from it the mean and covariance of the optimal posterior
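One simple decoding scheme (an illustration under assumptions, not necessarily the exact procedure) propagates the hidden vector back down to the visible layer and reads the stimulus off the reconstructed population activity by center of mass. Here the down-pass output is stood in for by an idealized activity profile:

```python
import numpy as np

# Illustrative stand-in for the down-pass output: mean visible activity
# of one 30-unit 1D population, peaked near the encoded stimulus.
n_grid = 30
centers = np.linspace(0.0, 1.0, n_grid)
stim = 0.62
recon = np.exp(-(centers - stim) ** 2 / (2 * 0.08 ** 2))

# Center-of-mass readout of the reconstructed population activity.
decoded = (recon * centers).sum() / recon.sum()
assert abs(decoded - stim) < 0.01
```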


Base code for training a deep belief network with contrastive divergence was taken from Salakhutdinov and Hinton
