Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception

Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, in which an auditory stimulus such as “ba” dubbed onto a visual stimulus such as “ga” produces the illusion of hearing “da”. Bayesian models of multisensory perception suggest that both the enhancement and the illusion can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the reliability of each sensory cue). To date, however, no study has accounted for how each of these stages contributes to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and fusion stages simultaneously by varying both the temporal offset (binding) and the auditory and visual signal-to-noise ratios (fusion). We fit two Bayesian models to the behavioural data and show that both can account for the enhancement effect in congruent audiovisual speech as well as for the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced-fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.

Prior
Let the stimulus S have a 2D Gaussian prior distribution with zero mean and variance parameters specifying the variance along the diagonal x = y (σ²_d) and orthogonal to that diagonal (σ²_o). By a 45 degree rotation of a normal distribution with the diagonal covariance matrix

\[
\Sigma = \begin{pmatrix} \sigma^2_d & 0 \\ 0 & \sigma^2_o \end{pmatrix},
\]

we can parametrize our prior by using the rotation matrix

\[
R = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}
\]

and the relation \(\Sigma_p = R \Sigma R^\top\). This yields a Gaussian with mean μ_p = 0 and covariance

\[
\Sigma_p = \frac{1}{2} \begin{pmatrix} \sigma^2_d + \sigma^2_o & \sigma^2_d - \sigma^2_o \\ \sigma^2_d - \sigma^2_o & \sigma^2_d + \sigma^2_o \end{pmatrix}.
\]
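The rotated prior covariance can be checked numerically. The sketch below (NumPy; the variance values are illustrative, not taken from the study) builds Σ_p by rotating the axis-aligned covariance by 45 degrees and compares it against the closed form:

```python
import numpy as np

# Illustrative parameter values only.
s2_d, s2_o = 4.0, 1.0  # variance along the diagonal x = y, and orthogonal to it

# 45-degree rotation matrix taking the first axis onto the diagonal x = y.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Rotate the axis-aligned covariance diag(s2_d, s2_o) onto the diagonal.
Sigma_p = R @ np.diag([s2_d, s2_o]) @ R.T

# Closed form: 0.5 * [[s2_d + s2_o, s2_d - s2_o], [s2_d - s2_o, s2_d + s2_o]]
expected = 0.5 * np.array([[s2_d + s2_o, s2_d - s2_o],
                           [s2_d - s2_o, s2_d + s2_o]])
assert np.allclose(Sigma_p, expected)
```

A larger σ²_d than σ²_o yields a positive off-diagonal term, i.e. the prior correlates the auditory and visual dimensions along the diagonal.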

Likelihood
Let the likelihood be a 2D Gaussian with mean \(\mu_l = (\mu_A, \mu_V)^\top\) and diagonal covariance

\[
\Sigma_l = \begin{pmatrix} \sigma^2_A & 0 \\ 0 & \sigma^2_V \end{pmatrix},
\]

where the two dimensions correspond to the auditory and visual cues.

Posterior
The posterior distribution follows from the result that the product of two multivariate Gaussian densities is itself a multivariate Gaussian, scaled by a constant, i.e.

\[
\mathcal{N}(x; \mu_p, \Sigma_p)\,\mathcal{N}(x; \mu_l, \Sigma_l) = c\,\mathcal{N}(x; \hat{\mu}, \hat{\Sigma}),
\]

where

\[
\hat{\Sigma} = \left(\Sigma_p^{-1} + \Sigma_l^{-1}\right)^{-1}, \tag{7a}
\]
\[
\hat{\mu} = \hat{\Sigma}\left(\Sigma_p^{-1}\mu_p + \Sigma_l^{-1}\mu_l\right). \tag{7b}
\]
Furthermore, it can be shown that \(c = \mathcal{N}(\mu_p; \mu_l, \Sigma_p + \Sigma_l)\), and since we know that the posterior distribution integrates to 1 (as does a multivariate normal distribution), we can conclude that c = P(A) and drop that term from our further calculations.
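The product-of-Gaussians identity can be verified pointwise. The following sketch (NumPy; all parameter values are illustrative) computes the posterior mean and covariance from the summed precisions and checks that prior times likelihood equals the scaled posterior density at arbitrary points:

```python
import numpy as np

def gauss2_pdf(x, mu, Sigma):
    """Density of a 2D Gaussian at point x."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (
        2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# Illustrative prior and likelihood parameters.
mu_p  = np.zeros(2)
Sig_p = 0.5 * np.array([[5.0, 3.0], [3.0, 5.0]])   # rotated prior covariance
mu_l  = np.array([1.0, -0.5])                       # (mu_A, mu_V)
Sig_l = np.diag([0.8, 1.2])                         # diag(s2_A, s2_V)

# Posterior precision is the sum of prior and likelihood precisions.
Sig_post = np.linalg.inv(np.linalg.inv(Sig_p) + np.linalg.inv(Sig_l))
mu_post = Sig_post @ (np.linalg.solve(Sig_p, mu_p) + np.linalg.solve(Sig_l, mu_l))

# The scaling constant: density of one mean under the sum of covariances.
c = gauss2_pdf(mu_p, mu_l, Sig_p + Sig_l)

# Check the identity pointwise at a few arbitrary points.
for x in [np.array([0.3, 0.1]), np.array([-1.0, 2.0])]:
    lhs = gauss2_pdf(x, mu_p, Sig_p) * gauss2_pdf(x, mu_l, Sig_l)
    rhs = c * gauss2_pdf(x, mu_post, Sig_post)
    assert np.isclose(lhs, rhs)
```

Because c does not depend on x, normalizing the product to integrate to 1 recovers the posterior Gaussian, as stated above.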
Inserting our parametrizations of the prior and likelihood covariance matrices into Equation (7a), we get

\[
\hat{\Sigma}^{-1} = \begin{pmatrix} \dfrac{1}{\sigma^2_A} + \dfrac{\sigma^2_d + \sigma^2_o}{2\sigma^2_d\sigma^2_o} & \dfrac{\sigma^2_o - \sigma^2_d}{2\sigma^2_d\sigma^2_o} \\[2ex] \dfrac{\sigma^2_o - \sigma^2_d}{2\sigma^2_d\sigma^2_o} & \dfrac{1}{\sigma^2_V} + \dfrac{\sigma^2_d + \sigma^2_o}{2\sigma^2_d\sigma^2_o} \end{pmatrix}.
\]

Similarly, since μ_p = 0, Equation (7b) gives

\[
\hat{\mu} = \hat{\Sigma} \begin{pmatrix} \mu_A/\sigma^2_A \\ \mu_V/\sigma^2_V \end{pmatrix}.
\]

Now, we let the variance on the diagonal go to infinity, σ²_d → ∞, effectively shaping the prior into a ridge along the diagonal. This yields

\[
\hat{\Sigma}^{-1} = \begin{pmatrix} \dfrac{1}{\sigma^2_A} + \dfrac{1}{2\sigma^2_o} & -\dfrac{1}{2\sigma^2_o} \\[2ex] -\dfrac{1}{2\sigma^2_o} & \dfrac{1}{\sigma^2_V} + \dfrac{1}{2\sigma^2_o} \end{pmatrix}.
\]

Since we are only interested in the auditory responses, we find the marginal posterior distribution with respect to the auditory response, which corresponds to the first dimension in our model.
As the marginal distribution of a 2D Gaussian is a univariate Gaussian with mean and variance corresponding to the mean and variance of that dimension in the bivariate distribution, we get

\[
\hat{\mu}_A = \frac{(\sigma^2_V + 2\sigma^2_o)\,\mu_A + \sigma^2_A\,\mu_V}{\sigma^2_A + \sigma^2_V + 2\sigma^2_o}, \tag{14a}
\]
\[
\hat{\sigma}^2_A = \frac{\sigma^2_A\,(\sigma^2_V + 2\sigma^2_o)}{\sigma^2_A + \sigma^2_V + 2\sigma^2_o}. \tag{14b}
\]

Thus, μ̂_A is a weighted mean of μ_A and μ_V, with non-negative weights summing to 1 that depend on the relative variances of the respective dimensions (the prior variance and the auditory and visual variances in the likelihood).
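The closed-form marginal in Equations (14a) and (14b) can be cross-checked against the full 2D matrix computation. The sketch below (NumPy; parameter values are illustrative only) implements both routes and confirms they agree:

```python
import numpy as np

def auditory_posterior(mu_A, mu_V, s2_A, s2_V, s2_o):
    """Closed-form marginal posterior for the auditory dimension
    (Equations 14a and 14b), assuming the ridge prior (s2_d -> infinity)."""
    denom = s2_A + s2_V + 2 * s2_o
    mu_hat = ((s2_V + 2 * s2_o) * mu_A + s2_A * mu_V) / denom
    s2_hat = s2_A * (s2_V + 2 * s2_o) / denom
    return mu_hat, s2_hat

# Illustrative values.
mu_A, mu_V, s2_A, s2_V, s2_o = 1.0, -0.5, 0.8, 1.2, 0.6

# Full 2D route: limiting posterior precision, then invert and marginalize.
Lam = np.array([[1/s2_A + 1/(2*s2_o), -1/(2*s2_o)],
                [-1/(2*s2_o),          1/s2_V + 1/(2*s2_o)]])
Sig = np.linalg.inv(Lam)
mu = Sig @ np.array([mu_A / s2_A, mu_V / s2_V])

# The marginal of the first dimension matches the closed form.
mu_hat, s2_hat = auditory_posterior(mu_A, mu_V, s2_A, s2_V, s2_o)
assert np.isclose(mu_hat, mu[0])
assert np.isclose(s2_hat, Sig[0, 0])
```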

Similarly, σ̂²_A is a scaling of σ²_A by the same scaling constant as that applied to μ_A in Equation (14a), and since that scaling constant lies on the interval [0, 1], we can conclude that 0 ≤ σ̂²_A ≤ σ²_A. In the strong integration case, we would have σ²_o = 0; that is, a delta spike along the main diagonal. In that case, Equations (14a) and (14b) reduce to, respectively,

\[
\hat{\mu}_A = \frac{\sigma^2_V\,\mu_A + \sigma^2_A\,\mu_V}{\sigma^2_A + \sigma^2_V},
\]
\[
\hat{\sigma}^2_A = \frac{\sigma^2_A\,\sigma^2_V}{\sigma^2_A + \sigma^2_V},
\]

which is exactly the maximum likelihood estimate as described by Ernst & Banks (2002) and Andersen (2015).
On the other hand, if we let σ²_o → ∞ we get a flat prior, which should yield no integration of the auditory and visual cues (because there is no prior assumption that the cues are related). It is easy to see that our model fulfills this property, since

\[
\lim_{\sigma^2_o \to \infty} \hat{\mu}_A = \mu_A, \qquad \lim_{\sigma^2_o \to \infty} \hat{\sigma}^2_A = \sigma^2_A.
\]

In the intermediate cases where σ²_o is nonzero and finite, it will shift μ̂_A towards μ_A compared to the maximum likelihood model, with small shifts for small σ²_o and bigger shifts for big σ²_o. This corresponds intuitively to the notion that a narrow joint prior along the diagonal yields stronger integration (i.e. a higher visual influence on the auditory percept), whereas a flatter prior corresponds to a weaker assumption of a common cause of the cues and thus gives the visual information a lower weight in computing the posterior auditory representation. Similarly, large σ²_o will shift σ̂²_A towards σ²_A, whereas small values will shift it towards the maximum likelihood estimate.
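Both limiting cases can be demonstrated numerically. The sketch below (NumPy; illustrative parameter values) recovers the maximum likelihood fusion at σ²_o = 0, the unintegrated auditory estimate for very large σ²_o, and the monotonic shift in between:

```python
import numpy as np

def auditory_posterior(mu_A, mu_V, s2_A, s2_V, s2_o):
    """Equations (14a) and (14b): posterior auditory mean and variance."""
    denom = s2_A + s2_V + 2 * s2_o
    return (((s2_V + 2 * s2_o) * mu_A + s2_A * mu_V) / denom,
            s2_A * (s2_V + 2 * s2_o) / denom)

mu_A, mu_V, s2_A, s2_V = 1.0, -0.5, 0.8, 1.2  # illustrative values

# Strong integration (s2_o = 0): maximum likelihood fusion (Ernst & Banks).
mu0, v0 = auditory_posterior(mu_A, mu_V, s2_A, s2_V, 0.0)
assert np.isclose(mu0, (s2_V * mu_A + s2_A * mu_V) / (s2_A + s2_V))
assert np.isclose(v0, s2_A * s2_V / (s2_A + s2_V))

# Flat prior (s2_o -> infinity): no integration, the auditory cue alone.
mu_inf, v_inf = auditory_posterior(mu_A, mu_V, s2_A, s2_V, 1e12)
assert np.isclose(mu_inf, mu_A) and np.isclose(v_inf, s2_A)

# Intermediate s2_o values lie strictly between the two extremes, closer
# to mu_A than the maximum likelihood estimate is.
mids = [auditory_posterior(mu_A, mu_V, s2_A, s2_V, o)[0] for o in (0.1, 1.0, 10.0)]
assert all(abs(m - mu_A) < abs(mu0 - mu_A) for m in mids)
```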