To integrate or not to integrate: Temporal dynamics of hierarchical Bayesian causal inference

To form a percept of the environment, the brain needs to solve the binding problem—inferring whether signals come from a common cause and are integrated or come from independent causes and are segregated. Behaviourally, humans solve this problem near-optimally as predicted by Bayesian causal inference; but the neural mechanisms remain unclear. Combining Bayesian modelling, electroencephalography (EEG), and multivariate decoding in an audiovisual spatial localisation task, we show that the brain accomplishes Bayesian causal inference by dynamically encoding multiple spatial estimates. Initially, auditory and visual signal locations are estimated independently; next, an estimate is formed that combines information from vision and audition. Yet, it is only from 200 ms onwards that the brain integrates audiovisual signals weighted by their bottom-up sensory reliabilities and top-down task relevance into spatial priority maps that guide behavioural responses. As predicted by Bayesian causal inference, these spatial priority maps take into account the brain’s uncertainty about the world’s causal structure and flexibly arbitrate between sensory integration and segregation. The dynamic evolution of perceptual estimates thus reflects the hierarchical nature of Bayesian causal inference, a statistical computation, which is crucial for effective interactions with the environment.


Introduction
In our natural environment, our senses are exposed to a barrage of sensory signals: the sight of a rapidly approaching truck, its looming motor noise, the smell of traffic fumes. How the brain effortlessly merges these signals into a seamless percept of the environment remains unclear. The brain faces two fundamental computational challenges: First, it needs to solve the 'binding' or 'causal inference' problem, i.e., deciding whether signals come from a common cause and thus should be integrated or instead be treated independently [1,2]. Second, when there is a common cause, the brain should integrate signals taking into account their uncertainties [3,4].
Hierarchical Bayesian causal inference provides a rational strategy to arbitrate between sensory integration and segregation in perception [2]. Bayesian causal inference explicitly models the potential causal structures that could have generated the sensory signals, i.e., whether signals come from common or independent sources. In line with Helmholtz's notion of 'unconscious inference', the brain is then thought to invert this generative model during perception [5]. In case of a common signal source, signals are integrated weighted in proportion to their relative sensory reliabilities (i.e., forced fusion [3,4,6-10]). In case of independent sources, they are processed independently (i.e., full segregation [11,12]). In a particular instance, however, the brain does not know the world's causal structure that gave rise to the sensory signals. To account for this causal uncertainty, a final estimate (e.g., object's location) is obtained by averaging the estimates under the two causal structures (i.e., common versus independent source models) weighted by each causal structure's posterior probability, a strategy referred to as model averaging (for other decisional strategies, see [13]).
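The model-averaging strategy described above can be illustrated with a minimal sketch. It assumes the standard Gaussian formulation of the causal inference model: Gaussian sensory noise, a Gaussian spatial prior centred on `mu_p`, and a prior probability `p_common` of a common source. The function name and all parameter values are hypothetical and chosen for illustration only; they are not the fitted parameters of this study.

```python
import math


def bci_spatial_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p, p_common,
                         report, mu_p=0.0):
    """Return (posterior of common source, fused, segregated, model-averaged)."""
    var_a, var_v, var_p = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2

    # Likelihood of the two internal signals under a common source (C = 1).
    d1 = var_v * var_a + var_v * var_p + var_a * var_p
    q1 = ((x_v - x_a) ** 2 * var_p + (x_v - mu_p) ** 2 * var_a
          + (x_a - mu_p) ** 2 * var_v) / d1
    like_c1 = math.exp(-0.5 * q1) / (2 * math.pi * math.sqrt(d1))

    # Likelihood under independent sources (C = 2).
    d2 = (var_v + var_p) * (var_a + var_p)
    q2 = ((x_v - mu_p) ** 2 / (var_v + var_p)
          + (x_a - mu_p) ** 2 / (var_a + var_p))
    like_c2 = math.exp(-0.5 * q2) / (2 * math.pi * math.sqrt(d2))

    # Posterior probability of a common source (Bayes' rule).
    post_c1 = like_c1 * p_common / (like_c1 * p_common
                                    + like_c2 * (1 - p_common))

    # Forced fusion: reliability-weighted average of both signals and the prior.
    fused = ((x_v / var_v + x_a / var_a + mu_p / var_p)
             / (1 / var_v + 1 / var_a + 1 / var_p))
    # Full segregation: only the task-relevant signal (plus prior) is used.
    seg_a = (x_a / var_a + mu_p / var_p) / (1 / var_a + 1 / var_p)
    seg_v = (x_v / var_v + mu_p / var_p) / (1 / var_v + 1 / var_p)
    segregated = seg_a if report == "A" else seg_v

    # Model averaging: weight both estimates by the causal posterior.
    averaged = post_c1 * fused + (1 - post_c1) * segregated
    return post_c1, fused, segregated, averaged


# Hypothetical parameters (degrees): noisy audition, reliable vision.
small = bci_spatial_estimate(-3.3, 3.3, 6.0, 2.0, 15.0, 0.5, report="A")
large = bci_spatial_estimate(-10.0, 10.0, 6.0, 2.0, 15.0, 0.5, report="A")
```

With these illustrative values, the posterior probability of a common source is higher for the small than for the large disparity, so the auditory-report estimate is pulled towards the visual location mainly at small disparity.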
A large body of psychophysics research has demonstrated that human observers combine sensory signals near-optimally as predicted by Bayesian causal inference [2,13-16]. Most prominently, when locating events in the environment, observers gracefully transition between sensory integration and segregation as a function of audiovisual spatial disparity [12]. For small spatial disparities, they integrate signals weighted by their reliabilities, leading to crossmodal spatial biases [17]; for larger spatial disparities, audiovisual interactions are attenuated. A recent functional MRI (fMRI) study showed how Bayesian causal inference is accomplished within the cortical hierarchy [14,16]: While early auditory and visual areas represented the signals on the basis that they were generated by independent sources (i.e., full segregation), the posterior parietal cortex integrated sensory signals into one unified percept (i.e., forced fusion). Only at the top of the cortical hierarchy, in anterior parietal cortex, the uncertainty about the world's causal structure was taken into account and signals were integrated into a spatial estimate consistent with Bayesian causal inference.
The organisation of Bayesian causal inference across the cortical hierarchy raises the critical question of how these neural computations unfold dynamically over time within a trial. How does the brain merge spatial information that is initially coded in different reference frames and representational formats? Whereas the brain is likely to recurrently update all spatial estimates by passing messages forwards and backwards across the cortical hierarchy [18][19][20], the unisensory estimates may to some extent precede the computation of the Bayesian causal inference estimate.
To characterise the neural dynamics of Bayesian causal inference, we presented human observers with auditory, visual, and audiovisual signals that varied in their spatial disparity in an auditory and visual spatial localisation task while recording their neural activity with electroencephalography (EEG). First, we employed cross-sensory decoding and temporal generalisation matrices [21] of the unisensory auditory and visual signal trials to characterise the emergence and the temporal stability of spatial representations across the senses. Second, combining psychophysics, EEG, and Bayesian modelling, we temporally resolved the evolution of unisensory segregation, forced fusion, and Bayesian causal inference in multisensory perception.

Results
To determine the computational principles that govern multisensory perception, we presented 13 participants with synchronous audiovisual spatial signals (i.e., white noise burst and Gaussian cloud of dots) that varied in their audiovisual spatial disparity and visual reliability (Fig 1A and 1B). On each trial, participants reported their perceived location of either the auditory or the visual signal. In addition, we included unisensory auditory and visual signal trials under auditory or visual report, respectively.
Combining psychophysics, EEG, and computational modelling, we addressed two questions: First, we investigated when and how human observers form spatial representations from unisensory visual or auditory inputs, which generalise across the two sensory modalities. Second, we studied the computational principles and neural dynamics that mediate the integration of audiovisual signals into spatial representations that take into account the observer's uncertainty about the world's causal structure consistent with Bayesian causal inference.

Shared and distinct neural representations of space across vision and audition-Unisensory auditory and visual conditions
Behavioural results. Participants were able to locate unisensory auditory and visual signals reliably, as indicated by a significant Pearson correlation between participants' location responses and the true signal source location for the unisensory auditory (across subjects mean ± SEM: 0.88 ± 0.05), visual high reliability (VR+; across subjects mean ± SEM: 0.998 ± 0.19), and visual low reliability (VR−; across subjects mean ± SEM: 0.91 ± 0.05) conditions. As expected, observers were significantly less accurate when locating the sound than when locating the visual stimuli for both levels of visual reliability (VR+ versus A: t(12) = 8.83, p < 0.0001; VR− versus A: t(12) …).
EEG results. Multivariate decoding of EEG activity patterns revealed how the brain dynamically encodes the location of unisensory auditory or visual signals. The decoding accuracy was expressed as the Pearson correlation coefficient between the true and the decoded stimulus locations and entered into so-called temporal generalisation matrices that illustrate the stability of EEG activity patterns encoding spatial location across time [21]. If a support vector regression (SVR) model trained on EEG activity patterns at time t can correctly decode the stimulus location not only at time t but also at other time points, then the stimulus location is encoded in EEG activity patterns that are relatively stable across time (for further details about this temporal generalisation approach, see [21]). If an SVR model cannot successfully generalise to EEG activity patterns at other time points, spatial locations are encoded in transient EEG activity patterns that differ across time.
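The logic of a temporal generalisation matrix can be sketched on simulated data. Note the assumptions: the sketch swaps the SVR for a much simpler covariance-based linear decoder, uses a single train/test split rather than any particular cross-validation scheme, and simulates four hypothetical 'channels' whose location code is absent at time 0, transient at time 1, and sustained over times 2 and 3. None of these choices are taken from the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_decoder(patterns, locations):
    """Linear decoder: one weight per channel = covariance with location."""
    n = len(locations)
    mean_loc = sum(locations) / n
    weights = []
    for ch in range(len(patterns[0])):
        mean_ch = sum(p[ch] for p in patterns) / n
        weights.append(sum((p[ch] - mean_ch) * (loc - mean_loc)
                           for p, loc in zip(patterns, locations)) / n)
    return weights


def temporal_generalisation(train, train_loc, test, test_loc):
    """matrix[t_train][t_test] = decoding accuracy (Pearson r)."""
    n_times = len(train[0])
    matrix = []
    for t_train in range(n_times):
        weights = fit_decoder([trial[t_train] for trial in train], train_loc)
        row = []
        for t_test in range(n_times):
            decoded = [sum(w * v for w, v in zip(weights, trial[t_test]))
                       for trial in test]
            row.append(pearson(test_loc, decoded))
        matrix.append(row)
    return matrix


def simulate(n_trials, rng):
    """Simulated data[trial][time][channel] with hypothetical spatial codes."""
    gains = [[0, 0, 0, 0],   # time 0: no location information
             [1, -1, 0, 0],  # time 1: transient code
             [0, 0, 1, 1],   # time 2: sustained code ...
             [0, 0, 1, 1]]   # time 3: ... unchanged
    locations = [rng.choice([-10.0, -3.3, 3.3, 10.0]) for _ in range(n_trials)]
    data = [[[g * loc + rng.gauss(0.0, 2.0) for g in gain] for gain in gains]
            for loc in locations]
    return data, locations


rng = random.Random(0)
train, train_loc = simulate(200, rng)
test, test_loc = simulate(200, rng)
tg = temporal_generalisation(train, train_loc, test, test_loc)
```

In this toy example, the decoder trained at time 2 generalises to time 3 (sustained code), whereas the decoder trained at time 1 does not generalise to time 2 (transient code). Training on data from one modality and testing on the other would yield the cross-sensory variant of the same matrix.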
For visual stimuli, spatial locations were successfully (i.e., significantly better than chance) decoded from EEG activity patterns from 60 ms onwards (Fig 2, …). Later (i.e., from about 250 ms), the location of the visual stimulus was encoded in a more sustained activity pattern (see S2 Fig), leading to successful cross-temporal generalisation from 300 ms to 700 ms post stimulus (i.e., significantly better than chance decoding accuracy is present far off the diagonal).
Fig 1. (A) In a 4 × 4 × 2 × 2 factorial design, the experiment manipulated (1) the location of the visual ('V') signal (−10˚, −3.3˚, 3.3˚, and 10˚), (2) the location of the auditory ('A') signal (−10˚, −3.3˚, 3.3˚, and 10˚), (3) the reliability of the visual signal (high VR+ versus low VR−, as defined by the spread of the visual cloud), and (4) task relevance (auditory versus visual report). In addition, we included unisensory auditory and visual VR+ and VR− trials. The greyscale codes the spatial disparity between the auditory and visual locations for each AV condition (i.e., darker greyscale = larger spatial disparity). (B) Time course of an example trial. (C) Behavioural AV weight index w AV computed from behavioural responses (left) and from the predictions of the Bayesian causal inference model (right; across-participants circular mean ± 68% CI and individual w AV represented by filled/empty circles, n = 13). The AV weight index w AV is shown as a function of (1) visual reliability: high [VR+] versus low [VR−]; (2) task relevance: auditory versus visual report; and (3) AV spatial disparity: small (≤6.6˚; D−) versus large (>6.6˚; D+). The data used to make this figure are available in file S1 Data. AV, audiovisual; D+, high disparity; D−, low disparity; VR+, high visual reliability; VR−, low visual reliability.
By contrast, auditory spatial representations could be decoded significantly better than chance from about 95 ms onwards (see Fig 2, …). … on EEG activity patterns can decode auditory spatial location significantly better than chance also from EEG activity patterns across other time points, even as late as 700 ms post stimulus (significant cluster encircled by thin grey lines in Fig 2). This temporal generalisation profile indicates that auditory spatial locations were encoded in EEG activity patterns that were relatively stable across time later from 200 ms onwards. Visual inspection of the EEG topographies shows that auditory spatial location is encoded at these later processing stages in sustained activity patterns that correspond to the long latency auditory P2 component (see S3 Fig) [22][23][24].
Fig 2. Temporal generalisation matrices within and across auditory and visual senses. Each temporal generalisation matrix shows the decoding accuracy for each training (y-axis) and testing (x-axis) time point. We factorially manipulated the training data (auditory versus visual stimulation) and testing data (auditory versus visual stimulation). Decoding accuracy is quantified by the Pearson correlation between the true and the decoded locations of the auditory (or visual) stimulus. The grey line along the diagonal indicates where the training time is equal to the testing time (i.e., the time-resolved decoding accuracies). Horizontal and vertical grey lines indicate the stimulus onset. The thin grey lines encircle clusters with decoding accuracies that were significantly better than chance at p < 0.05 corrected for multiple comparisons. The thick grey lines encircle the clusters with decoding accuracies that were significantly better than chance jointly for both (1) auditory-to-visual and (2) visual-to-auditory cross-temporal generalisation at p < 0.05 corrected for multiple comparisons. The data used to make this figure are available in file S1 Data.
In addition to temporal generalisation within each sensory modality, we also investigated the extent to which the SVR decoding model generalised across sensory modalities throughout poststimulus time. Whereas earlier neural representations were more specific to each particular sensory modality, the SVR model was able to generalise significantly better than chance from audition to vision and vice versa from 160 to 360 ms (Fig 2, upper left and lower right quadrant, areas encircled by thick grey line indicate significant generalisation across sensory modalities). This cross-sensory generalisation across visual- and auditory-evoked EEG activity patterns suggests that at those stages (i.e., 160 ms to 360 ms), the brain forms spatial representations that are relatively stable and rely on neural generators that may be partly shared across sensory modalities. By contrast, the spatial representations encoded in very early (<160 ms) EEG activity patterns did not enable successful cross-sensory generalisation, suggesting that they are modality-specific. These statistically significant cross-sensory generalisation results are also illustrated by the EEG topographies evoked by unisensory auditory and visual signals (see S2B and S3B Figs). From 200 ms to 400 ms post stimulus, auditory and visual stimuli elicit centro-posterior dominant topographies that depend on the stimulus location to some extent similarly in vision and audition. Although these results may point towards partly overlapping neural generators and representations, potentially in parietal cortices that encode location both in audition and vision, it is important to emphasise that different configurations of neural generators can in principle elicit similar EEG scalp topographies.

Computational principles of audiovisual integration: GLM-based w AV and Bayesian modelling analysis-Audiovisual conditions
Combining psychophysics, multivariate EEG pattern decoding, and computational modelling, we next investigated the computational principles and neural dynamics underlying audiovisual integration of spatial representations using a general linear model (GLM)-based w AV and a Bayesian modelling analysis. As shown in Fig 3, both analyses were applied to the spatial estimates that were either reported by participants (i.e., behaviour, Fig 3B left) or decoded from EEG activity patterns (i.e., neural, Fig 3B right; see Fig 3 legend).
The GLM-based w AV analysis quantifies the influence of the true auditory and true visual location on (1) the reported or (2) EEG decoded auditory and visual spatial estimates in terms of an audiovisual weight index w AV.
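As a minimal sketch of this index, the example below regresses spatial estimates on the true visual and auditory locations and converts the two regression coefficients into an angle with the four-quadrant inverse tangent, so that 90˚ indicates purely visual and 0˚ purely auditory influence. The function names and the simulated estimates are hypothetical; in the study, such a regression is computed separately for each of the eight conditions.

```python
import math


def fit_av_betas(s_v, s_a, estimates):
    """Least-squares fit of: estimate = beta_v * S_V + beta_a * S_A + const."""
    n = len(estimates)
    m_v, m_a, m_e = sum(s_v) / n, sum(s_a) / n, sum(estimates) / n
    c_v = [v - m_v for v in s_v]
    c_a = [a - m_a for a in s_a]
    c_e = [e - m_e for e in estimates]
    s_vv = sum(v * v for v in c_v)
    s_aa = sum(a * a for a in c_a)
    s_va = sum(v * a for v, a in zip(c_v, c_a))
    s_ve = sum(v * e for v, e in zip(c_v, c_e))
    s_ae = sum(a * e for a, e in zip(c_a, c_e))
    # Solve the 2 x 2 normal equations on mean-centred variables.
    det = s_vv * s_aa - s_va ** 2
    beta_v = (s_ve * s_aa - s_ae * s_va) / det
    beta_a = (s_ae * s_vv - s_ve * s_va) / det
    return beta_v, beta_a


def w_av_degrees(beta_v, beta_a):
    """Audiovisual weight index: four-quadrant inverse tangent, in degrees."""
    return math.degrees(math.atan2(beta_v, beta_a))


# All 4 x 4 combinations of visual and auditory locations (degrees).
locations = [-10.0, -3.3, 3.3, 10.0]
s_v = [v for v in locations for a in locations]
s_a = [a for v in locations for a in locations]

# Hypothetical observer whose estimates are dominated by vision.
visual_dominant = [0.8 * v + 0.2 * a for v, a in zip(s_v, s_a)]
w_example = w_av_degrees(*fit_av_betas(s_v, s_a, visual_dominant))  # about 76
```

Because the index is an angle, summary statistics across participants are computed with circular rather than linear statistics.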
The Bayesian modelling analysis formally assessed the extent to which (1) the full-segregation model(s) (Fig 3C, encircled in light blue, red or green), (2) the forced-fusion model (Fig 3C, yellow), and (3) the Bayesian causal inference model (i.e., using model averaging as decision function, encircled in dark blue; see supporting material S1 Table for other decision functions) can account for the spatial estimates reported by observers (i.e., behaviour) or decoded from EEG activity patterns (i.e., neural).
Behavioural results. In a GLM-based w AV analysis, the behavioural audiovisual weight index w AV shows that observers integrated audiovisual signals weighted by their sensory … The audiovisual weight index w AV was close to 90˚ (i.e., pure visual influence) when the visual signal needed to be reported (Fig 1C, dark lines), but it shifted towards 0˚ when the auditory signal was task relevant (Fig 1C, grey lines). In other words, we observed a significant main effect of task relevance on behavioural w AV (p = 0.0002). Observers flexibly adjusted the weights they assigned to auditory and visual signals in the integration process as a function of task relevance, giving more emphasis to the sensory modality that needed to be reported. The main effect of task relevance on w AV is inconsistent with classical forced-fusion models, in which audiovisual signals are integrated into one single unified percept irrespective of task relevance of the sensory modalities. Under forced fusion, even in the case of audiovisual spatial disparity, the observer would perceive the auditory and visual signals at the same location. Instead, the effect indicates that observers maintain separate auditory and visual spatial estimates for an audiovisual spatially disparate stimulus.
Consistent with Bayesian causal inference, the difference in w AV between auditory and visual report significantly increased for large (>6.6˚) relative to small (≤6.6˚) spatial disparities (i.e., significant interaction between task relevance and spatial disparity: p = 0.0002). In other words, audiovisual integration and cross-modal spatial biasing broke down when auditory and visual signals were far apart and likely to be caused by independent sources. This attenuation of audiovisual interactions for large relative to small spatial disparities (i.e., interaction between task relevance and disparity) is the characteristic profile of Bayesian causal inference (see model predictions for w AV in Fig 1C right).
Moreover, we observed significant two-way interactions between visual reliability and spatial disparity (p = 0.0014) and between visual reliability and task relevance (p = 0.0002). The effect of high versus low visual reliability was stronger when the two signals were close in space and the auditory (i.e., less reliable) signal needed to be reported. For auditory report conditions, the influence of the visual signal on the audiovisual spatial representation is stronger for high visual reliability and small disparity trials (Fig 1C, difference between dashed and solid grey line for the small spatial disparity condition). Again, this interaction is expected for Bayesian causal inference, because the spatial estimate furnished by the forced-fusion model receives a stronger weight in Bayesian causal inference for low-spatial-disparity trials, when it is likely that the two signals come from a common source.
Fig 3. (A) The GLM-based w AV and Bayesian modelling analysis were performed on auditory ('A') and visual ('V') spatial estimates that were indicated by participants as behavioural localisation responses (left, 'Behaviour') or decoded from participants' EEG activity patterns (right, 'Neural'). The neural spatial estimates were obtained by training an SVR model on ERP activity patterns at each time point of the AV congruent trials to learn the mapping from EEG pattern to external spatial locations (black diagonal line). This learnt mapping was then used to decode the spatial location from the ERP activity patterns of the spatially congruent and incongruent AV conditions (coloured arrows). (B) Distributions of spatial localisation responses (left, Behaviour: S Resp) and decoded spatial estimates (right, Neural: S Dec) were computed for each of the 64 conditions of the 4 (visual stimulus location) × 4 (auditory stimulus location) × 2 (visual reliability) × 2 (task relevance) factorial design. (C) Left: In the GLM-based w AV analysis, the perceived (or decoded at each time point) spatial estimates were predicted by the true visual and auditory spatial locations (S V1..8, S A1..8) for each of the eight conditions in the 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6˚ versus >6.6˚) factorial design. As a summary index, we defined the relative audiovisual weight (w AV) as the four-quadrant inverse tangent of the visual (β V1..8) and auditory (β A1..8) parameter estimates for each of the eight conditions in each regression model. Right: In the Bayesian modelling analysis, we fitted the following models to observers' behavioural and neural spatial estimates: SegA (green, for EEG only), SegV (red, for EEG only), SegV,A (light blue), 'forced fusion' ('Fusion', yellow), and BCI model (with model averaging, dark blue). We performed Bayesian model selection at the group level and computed the protected exceedance probability that one model is better than any of the other candidate models above and beyond chance [25]. (D) Left: Based on previous studies [14,16], we hypothesised that the w AV profile with an interaction between task relevance (i.e., visual versus auditory report) and spatial disparity that is characteristic for BCI would emerge relatively late. Right: Likewise, we expected the different models to dominate the EEG activity patterns to some extent sequentially: first the unisensory segregation model (SegV, SegA), followed by the forced-fusion model ('Fusion'), and finally the BCI estimate. The fading of colours indicates that we did not have specific hypotheses for those times. AV, audiovisual; BCI, Bayesian causal inference; D+, high disparity; D−, low disparity; EEG, electroencephalography; ERP, event-related potential; GLM, general linear model; S Dec, Spatial estimate decoded; SegA, unisensory auditory segregation; SegV, unisensory visual segregation; SegV,A, audiovisual full-segregation; S Resp, spatial estimate responded; stim, stimulus; SVR, support vector regression; VR+, high visual reliability; VR−, low visual reliability.
Consistent with the profile of the audiovisual weight index w AV, formal Bayesian model comparison showed that the Bayesian causal inference model outperformed the full-segregation and forced-fusion models (85.6% ± 0.3% variance explained, protected exceedance probability > 0.99; Table 1). Fig 1C (right) shows the profile of the audiovisual weight index w AV that is predicted by the Bayesian causal inference model fitted to the observer's behavioural localisation responses. It illustrates that Bayesian causal inference inherently accounts for effects of task relevance (or reported modality) and the interaction between task relevance and spatial disparity by combining the forced-fusion estimate with the task-relevant full-segregation estimate weighted by the posterior probability of common and independent sources. Conversely, the interaction between reliability and spatial disparity arises because the forced-fusion model component, which integrates signals weighted by their reliabilities, is more dominant for small spatial disparities.
In summary, our audiovisual weight index w AV and Bayesian modelling analysis of observers' perceived/reported locations provided convergent evidence that human observers integrate audiovisual spatial signals weighted by their relative reliabilities at small spatial disparities. Yet, they mostly segregate audiovisual signals at large spatial disparities, when it is unlikely that signals come from a common source.
EEG results-Temporal dynamics of audiovisual integration. To characterise the neural dynamics underlying integration of audiovisual signals into spatial representations, we applied the GLM-based w AV and the Bayesian modelling analysis to the 'spatial estimates' that were decoded from EEG activity patterns at each time point (see Fig 3B right). Because both the GLM-based w AV and the Bayesian modelling analysis require reliable spatial estimates, we report and interpret results limited to the time window from 55 ms to 700 ms post stimulus (Fig 4, S4 Fig), during which the location of congruent audiovisual stimuli could be decoded better than chance from EEG activity patterns (p < 0.001).
The GLM-based analysis of audiovisual weight index w AV investigated the effects of visual reliability, task relevance, and spatial disparity on the audiovisual neural weight index w AV that quantifies the influence of auditory and visual signals on the spatial representations decoded from EEG activity patterns. Our results show that sensory reliability significantly influenced the neural w AV from 65 to 510 ms. As expected, the spatial representations were more strongly influenced by the true visual signal location when the visual signal was more reliable (i.e., significant main effect of visual reliability, Fig 4A, Table 2). Moreover, consistent with our behavioural findings, we also observed a significant main effect of task relevance between 190 and 700 ms (Fig 4B, Table 2). As expected, the decoded location was more strongly influenced by the visual signal when the visual modality was task relevant. We also observed a significant interaction between task relevance and spatial disparity from 310 to 440 ms and 510 to 590 ms. As discussed in the context of the behavioural results, this interaction is the profile that is characteristic for Bayesian causal inference: the brain integrates sensory signals at low spatial disparity (i.e., small difference for auditory versus visual report) but computes different spatial estimates for auditory and visual signals at large spatial disparities (see Fig 4D, Table 2). In addition to these key findings, we also observed a brief but pronounced significant main effect of spatial disparity on w AV at about 55-130 ms. Whereas a sound attracted the decoded spatial location at small spatial disparity (i.e., w AV is shifted below 90˚, Fig 4C, solid line), the decoded location is shifted away from the sound location (i.e., a repulsive effect) at large spatial disparity (i.e., w AV values above 90˚, Fig 4C, dashed line).
Moreover, in this early time window, which coincides with the visual-evoked N100 response, the decoded spatial estimate was overall dominated by the visual stimulus location (i.e., w AV was close to 90˚ for both small and large disparity). The effect of disparity may indicate that early multisensory processing is already influenced by a spatial window of integration (Fig 4C, Table 2). Auditory stimuli affected the decoded spatial representations mainly when they were close in space with the visual signal. However, because spatial disparity was inherently correlated with the eccentricity of the audiovisual signals by virtue of our factorial and spatially balanced design, these two effects cannot be fully dissociated. While signals were presented parafoveally or peripherally for small-disparity trials, they were always presented in the periphery for large-disparity trials.
Fig 4. EEG results for GLM-based w AV and Bayesian modelling analysis. The neural audiovisual weight index w AV (across-participants' circular mean ± 68% CI; n = 13). Neural w AV as a function of time is shown for (A) visual reliability: VR+ versus VR−; (B) task relevance: auditory ('A') versus visual ('V') report; (C) audiovisual spatial disparity: small (≤6.6˚; D−) versus large (>6.6˚; D+); (D) the interaction between task relevance and disparity. Shaded grey areas indicate the time windows during which the main effect of (A) visual reliability, (B) task relevance, (C) audiovisual spatial disparity, or (D) the interaction between task relevance and disparity on w AV was statistically significant at p < 0.05 corrected for multiple comparisons across time. (E) Time course of the circular-circular correlation (across-participants' mean after Fisher z-transformation ± 68% CI; n = 13) between the neural and the behavioural audiovisual weight index w AV. Shaded grey areas indicate significant correlation at p < 0.05 corrected for multiple comparisons across time. (F) Time course of the protected exceedance probabilities [25] of the five models of the Bayesian modelling analysis: SegA (green), SegV (red), SegV,A (light blue), 'forced fusion' ('Fusion', yellow), and BCI model (with model averaging, dark blue). The early time window until 55 ms (delimited by black vertical line on all plots) is shaded in white, because the decoding accuracy was not greater than chance for audiovisual congruent trials; hence, the neural weight index w AV and Bayesian model fits are not interpretable in this window. The data used to make this figure are available in file S1 Data. BCI, Bayesian causal inference; D+, high disparity; D−, low disparity; EEG, electroencephalography; GLM, general linear model; SegA, unisensory auditory segregation; SegV, unisensory visual segregation; SegV,A, audiovisual full-segregation; VR+, high visual reliability; VR−, low visual reliability. https://doi.org/10.1371/journal.pbio.3000210.g004
Table 2. Statistical significance of main, interaction, and simple main effects for the behavioural and neural audiovisual weight indices (w AV) ('model-free' approach).
For completeness, we also observed a significant interaction between spatial disparity and visual reliability between 55 and 135 ms and between 170 and 235 ms (Table 2). This interaction results from a larger spatial window of integration for stimuli with low versus high visual reliability. In other words, it is easier to determine whether two signals come from different sources when the visual input is reliable, leading to a smaller window of integration.
Finally, we asked whether the neural audiovisual weights were related to the audiovisual weights that observers applied at the behavioural level. Hence, we computed the correlation between the values of the behavioural and neural weight indices w AV separately for each time point. The Fisher z-transformed correlation coefficient fluctuated around chance level until about 100 ms. From 100 ms onwards, it progressively increased over time until it peaked and reached a plateau at about 350 ms (R = 0.72). As expected, this coincides with the time window during which we observed a significant interaction between task relevance and spatial disparity, i.e., the profile characteristic for Bayesian causal inference. After 500 ms, the correlation slowly decreased towards the end of the trial. A cluster permutation test confirmed that the correlation between neural and behavioural weight indices w AV was significantly better than chance, revealing two significant clusters between 175 and 550 ms (p = 0.0012) and 575 and 665 ms (p = 0.013). These results indicate that the neural representations expressed in EEG activity patterns are critical for guiding observers' responses.
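Because w AV is an angle, its behavioural and neural values are compared with a circular-circular correlation. The sketch below assumes one common definition of that coefficient (the form usually attributed to Jammalamadaka and SenGupta) and averages correlations across participants after a Fisher z-transformation. The study does not state which circular coefficient was used, and the eight w AV values per vector are hypothetical illustrations, not data from the study.

```python
import math


def circular_mean(angles):
    """Circular mean direction of angles given in radians."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))


def circ_corr(alpha, beta):
    """Circular-circular correlation between two samples of angles (radians)."""
    mean_a, mean_b = circular_mean(alpha), circular_mean(beta)
    dev_a = [math.sin(a - mean_a) for a in alpha]
    dev_b = [math.sin(b - mean_b) for b in beta]
    num = sum(p * q for p, q in zip(dev_a, dev_b))
    den = math.sqrt(sum(p * p for p in dev_a) * sum(q * q for q in dev_b))
    return num / den


def mean_correlation(rs):
    """Average correlations via Fisher z (atanh), then back-transform (tanh)."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


# Hypothetical w AV values (degrees) for the eight conditions of one observer.
behavioural = [math.radians(d) for d in (85, 80, 70, 55, 88, 84, 40, 20)]
neural = [math.radians(d) for d in (80, 78, 72, 60, 86, 80, 50, 35)]
r_example = circ_corr(behavioural, neural)
```

Repeating `circ_corr` at every time point of the decoded w AV, and averaging across participants with `mean_correlation`, would give a time course of the kind described above.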
In the EEG Bayesian modelling analysis, we fitted five models to the spatial estimates decoded from EEG activity patterns separately for each time point: (1) 'full-segregation audiovisual', (2) 'forced-fusion', (3) the 'Bayesian causal inference', (4) the 'segregation auditory', and (5) the 'segregation visual' models (Fig 3C). The segregation visual and segregation auditory models incorporate the hypothesis that neural generators may represent only the visual (or only the auditory) location irrespective of whether the visual (or auditory) location needs to be reported. In other words, they model a purely unisensory visual (or auditory) source. By contrast, the full-segregation audiovisual model embodies the hypothesis that a neural source represents the task-relevant location, i.e., the auditory location for auditory report and the visual location for visual report.
At the random-effects group level, Bayesian model comparison revealed a sequential pattern of protected exceedance probabilities across time (Fig 4F): initially, the 'segregation visual' model dominated until about 100 ms post stimulus. This converges with our w AV analysis showing that spatial representations decoded from early EEG activity patterns are dominated by the location of the visual signal (i.e., w AV is close to 90˚). From 100 to about 200 ms, the forced-fusion model outperformed the other models, indicating that spatial estimates are now influenced jointly by the locations of auditory and visual signals irrespective of their spatial disparity or task relevance. Again, this mirrors our w AV results in which we observed a significant effect of reliability on w AV early (i.e., as expected for forced fusion), whereas the effect of task relevance arose later and became prominent from 250 ms onwards.
Hence, both w AV and Bayesian modelling analyses suggest that in this early time window, audiovisual signals are predominantly integrated weighted by their reliability into a unified spatial representation irrespective of task relevance, as predicted by forced-fusion models. From about 200 ms onwards, the protected exceedance probability of the Bayesian causal inference model progressively increased, peaking with an exceedance probability of >0.85 at about 350 ms followed by a plateau until 500 ms. Thus, in line with the w AV results, audiovisual interactions consistent with Bayesian causal inference emerge relatively late, at about 350 ms post stimulus.

Discussion
Integrating information from vision and audition into a coherent representation of the space around us is critical for effective interactions with the environment. This EEG study temporally resolved the neural dynamics that enable the brain to flexibly integrate auditory and visual signals into spatial representations in line with the predictions of Bayesian causal inference.
Auditory and visual senses code spatial location in different reference frames and representational formats [26]. Vision provides spatial information in eye-centred and audition in head-centred reference frames [27,28]. Furthermore, spatial location is directly coded in the retinotopic organisation in primary visual cortex [29], whereas spatial location in audition is computed from sound latency and amplitude differences between the ears, starting in the brainstem [27]. In auditory cortices of primates, spatial location is thought to be represented by neuronal populations with broad tuning functions [30,31]. In order to merge spatial information from vision and audition, the brain thus needs to establish coordinate mappings and/or transform spatial information into partially shared 'hybrid' reference frames, as previously suggested by neurophysiological recordings in nonhuman primates [30,32]. In the first step, we therefore investigated the neural dynamics of spatial representations encoded in EEG activity patterns separately for unisensory auditory and visual signals using the method of temporal generalisation matrices [21]. In vision, spatial location was encoded initially at 60 ms in transient neural activity associated with the early P1 and N1 components and then turned into temporally more stable representations from 200 ms and particularly from 350 ms (Fig 2, upper right quadrant, S2 Fig). In audition, spatial location was encoded by relatively stable EEG activity from 95 ms and particularly from 250 ms, which is associated with the auditory long latency P2 component [22][23][24]. Activity patterns encoding spatial location generalised not only across time but also across sensory modalities between 160 and 360 ms. As indicated in Fig 2, SVR models trained on visual-evoked responses generalised to auditory-evoked responses and vice versa (upper left and lower right quadrant, significant cross-sensory generalisation encircled by thick grey line).
These results suggest that unisensory auditory and visual spatial locations are initially represented by transient and modality-specific activity patterns. Later, at about 200 ms, they are transformed into temporally more stable representations that may rely on neural sources in frontoparietal cortices that are at least to some extent shared between auditory and visual modalities [22,33,34].
Next, we asked when and how the human brain combines spatial information from vision and audition into a coherent representation of space. The brain should integrate sensory signals only when they come from a common event but should segregate signals from independent events [1,2,12]. To investigate how the brain arbitrates between sensory integration and segregation, we presented observers with synchronous audiovisual signals that varied in their spatial disparity across trials. On each trial, observers reported either the auditory or the visual location. Our results show that a concurrent yet spatially disparate visual signal biased observers' perceived sound location towards the visual location, a phenomenon termed the spatial ventriloquist illusion [17,35]. Consistent with reliability-weighted integration, this audiovisual spatial bias was significantly stronger when the visual signal was more reliable (Fig 1C left, grey solid versus dashed lines). Furthermore, observers reported different locations for auditory and visual signals, and this difference was even greater for large- relative to small-spatial-disparity trials. This significant interaction between spatial disparity and task relevance indicates that human observers arbitrate between sensory integration and segregation depending on the probabilities of different causal structures of the world that can be inferred from audiovisual spatial disparity.
Using EEG, we then investigated how the brain forms neural spatial representations dynamically post stimulus. Our analysis of the neural audiovisual weight index w AV shows that the spatial estimates decoded from EEG activity patterns are initially dominated by visual inputs (i.e., w AV close to 90˚). This visual dominance is most likely explained by the retinotopic representation of visual space that facilitates EEG decoding of space leading to visual predominance (for further discussion, see the Methods section). From about 65 ms onwards, visual reliability significantly influenced w AV (Fig 4A): as expected, the location of the visual signal exerted a stronger influence on the spatial estimate decoded from EEG activity patterns when the visual signal was reliable rather than unreliable. By contrast, the signal's task relevance influenced the audiovisual weight index only later, from about 190 ms (Fig 4B). Thus, visual reliability as a bottom-up stimulus-bound factor impacted the sensory weighting in audiovisual integration prior to top-down effects of task relevance. We observed a significant interaction between task relevance and spatial disparity as the characteristic profile for Bayesian causal inference from about 310 ms: the difference in w AV between auditory and visual report was significantly greater for large- than for small-disparity trials (Fig 4D, Table 2). Thus, spatial disparity determined the influence of task-irrelevant signals on the spatial representations encoded in EEG activity from about 310 ms onwards. A task-irrelevant signal influenced the spatial representations mainly when auditory and visual signals were close in space and hence likely to come from a common event, but it had minimal influence when they were far apart in space.
Collectively, our statistical analysis of the audiovisual weight index revealed a sequential emergence of visual dominance, reliability weighting (from about 100 ms), effects of task relevance (from about 200 ms), and finally the interaction between task relevance and spatial disparity (from about 310 ms, Fig 4A-4D).
This multistage process was also mirrored in the time course of exceedance probabilities furnished by our formal Bayesian model comparison: The unisensory visual segregation (SegV) model was the winning model for the first 100 ms, thereby modelling the early visual dominance. The audiovisual forced-fusion model embodying reliability-weighted integration dominated the time interval of 100-250 ms. Finally, the Bayesian causal inference model that enables the arbitration between sensory integration and segregation depending on spatial disparity outperformed all other models from 350 ms onwards. Hence, both our Bayesian modelling analysis and our w AV analysis showed that the hierarchical structure of Bayesian causal inference is reflected in the neural dynamics of spatial representations decoded from EEG. The Bayesian causal inference model also outperformed the audiovisual full-segregation (SegV,A) model that enables the representation of the location of the task-relevant stimulus unaffected by the location of the task-irrelevant stimulus. Instead, our Bayesian modelling analysis confirmed that from 350 ms onwards, the brain integrates audiovisual signals weighted by their bottom-up reliability and top-down task relevance into spatial priority maps [36,37] that take into account the probabilities of the different causal structures consistent with Bayesian causal inference. The spatial priority maps were behaviourally relevant for guiding spatial orienting and actions, as indicated by the correlation between the neural and behavioural audiovisual weight indices, which progressively increased from 100 ms and culminated at about 300-400 ms. Two recent studies have also demonstrated such a temporal evolution of Bayesian causal inference in an audiovisual temporal numerosity judgement task [38] and an audiovisual rate categorisation task [39].
The timing and the parietal-dominant topographies of the AV potentials (see S2 and S3 Figs) that form the basis for our spatial decoding (and hence for w AV and Bayesian modelling analyses) closely match the P3b component (i.e., a subcomponent of the classical P300). Although it is thought that the P3b relies on neural generators located mainly in parietal cortices [40,41], its specific functional role remains controversial [42]. Given its sensitivity to stimulus probability [43][44][45] and discriminability [46] as well as task context [42,47,48], it was proposed to reflect neural processes involved in transforming sensory evidence into decisions and actions [49]. Most recent research has suggested that the P3b may sustain processes of evidence accumulation [50] that are influenced by observers' prior [51], incoming evidence (i.e., likelihood [52]), and observers' belief updating [53]. Likewise, our supplementary time-frequency analyses revealed that alpha/beta power, which has previously been associated with the generation of the P3b component [54], depended on bottom-up visual reliability between 200 and 400 ms and top-down task relevance between 350 and 550 ms post stimulus (see S5 Fig, S2 Table and S1 Text), thereby mimicking the temporal evolution of bottom-up and top-down influences observed in our main w AV and Bayesian modelling analysis.
Yet, our main analysis took a different approach. Rather than focusing on the effects of visual reliability, task relevance/attention, and spatial disparity directly on event-related potentials (ERPs) or time-frequency power, the w AV analysis investigated how these manipulations affect the spatial representations encoded in EEG activity patterns, and the Bayesian modelling analysis accommodated those effects directly in the computations of Bayesian causal inference. Along similar lines, two recent fMRI studies characterised the computations involved in integrating audiovisual spatial inputs across the cortical hierarchy [14,16]: whereas low level auditory and visual areas predominantly encoded the unisensory auditory or visual locations (i.e., full-segregation model) [55][56][57][58][59][60][61][62][63][64], higher-order visual areas and posterior parietal cortices combined audiovisual signals weighted by their sensory reliabilities (i.e., forced-fusion model) [65][66][67][68]. Only at the top of the hierarchy, in anterior parietal cortices, did the brain integrate sensory signals consistent with Bayesian causal inference. Thus, the temporal evolution of Bayesian causal inference observed in our current EEG study mirrored its organisation across the cortical hierarchy observed in fMRI.
Fusing the results from EEG and fMRI studies (see caveats in the Methods section) thus suggests that Bayesian causal inference in multisensory perception relies on dynamic encoding of multiple spatial estimates across the cortical hierarchy. During early processing, multisensory perception is dominated by full-segregation models associated with activity in low-level sensory areas. Later audiovisual interactions that are governed by forced-fusion principles rely on posterior parietal areas. Finally, Bayesian causal inference estimates are formed in anterior parietal areas. Yet, although our results suggest that full segregation, forced fusion, and Bayesian causal inference dominate EEG activity patterns at different latencies, they do not imply a strictly feed-forward architecture. Instead, we propose that the brain concurrently accumulates evidence about the different spatial estimates and the underlying causal structure (i.e., common versus independent sources) most likely via multiple feedback loops across the cortical hierarchy [18,19]. Only after 350 ms is a final perceptual estimate formed in anterior parietal cortices that takes into account the uncertainty about the world's causal structure and combines audiovisual signals into spatial priority maps as predicted by Bayesian causal inference.

Participants
Sixteen right-handed participants participated in the experiment; three of those participants did not complete the entire experiment: two participants were excluded based on eye tracking results from the first day (the inclusion criterion was less than 10% of trials rejected because of eye blinks or saccades; see the Eye movement recording and analysis section for details), and one participant withdrew from the experiment. The remaining 13 participants (7 females, mean age = 22.1 years; SD = 3.0) completed the 3-day experiment and are thus included in the analysis. All participants had no history of neurological or psychiatric illnesses, had normal or corrected-to-normal vision, and had normal hearing.

Ethics statement
All participants gave informed written consent to participate in the experiment. The study was approved by the research ethics committee of the University of Birmingham (approval number: ERN_11_0470AP4) and was conducted in accordance with the principles outlined in the Declaration of Helsinki.

Stimuli
The visual ('V') stimulus was a cloud of 20 white dots (diameter = 0.43˚ visual angle, stimulus duration: 50 ms) sampled from a bivariate Gaussian distribution with a vertical standard deviation of 2˚ and a horizontal standard deviation of 2˚ or 14˚ visual angle presented on a dark grey background (67% contrast). Participants were told that the 20 dots were generated by one underlying source in the centre of the cloud. The visual cloud of dots was presented at one of four possible locations along the azimuth (i.e., −10˚, −3.3˚, 3.3˚, or 10˚).
The auditory ('A') stimulus was a 50-ms-long burst of white noise with a 5-ms on/off ramp. Each auditory stimulus was delivered at a 75-dB sound pressure level through one of four pairs of two vertically aligned loudspeakers placed above and below the monitor at four positions along the azimuth (i.e., −10˚, −3.3˚, 3.3˚, or 10˚). The volumes of the 2 × 4 speakers were carefully calibrated across and within each pair to ensure that participants perceived the sounds as emanating from the horizontal midline of the monitor.

Experimental design and procedure
In a spatial ventriloquist paradigm, participants were presented with synchronous, spatially congruent or disparate visual and auditory signals (Fig 1A and 1B). On each trial, visual and auditory locations were independently sampled from four possible locations along the azimuth (i.e., −10˚, −3.3˚, 3.3˚, or 10˚), leading to four levels of spatial disparity (i.e., 0˚, 6.6˚, 13.3˚, or 20˚; as indicated by the greyscale in Fig 1A). In addition, we manipulated the reliability of the visual signal by setting the horizontal standard deviation of the Gaussian cloud to 2˚ (high reliability) or 14˚ (low reliability) visual angle. In an intersensory selective-attention paradigm, participants reported either their auditory or visual perceived signal location and ignored signals in the other modality. For the visual modality, they were asked to determine the location of the centre of the visual cloud of dots. Hence, the 4 × 4 × 2 × 2 factorial design manipulated (1) the location of the visual stimulus (−10˚, −3.3˚, 3.3˚, 10˚; i.e., the mean of the Gaussian), (2) the location of the auditory stimulus (−10˚, −3.3˚, 3.3˚, 10˚), (3) the reliability of the visual signal (2˚, 14˚; SD of the Gaussian), and (4) task relevance (auditory-/visual-selective report), resulting in 64 conditions (Fig 1A). To characterise the computational principles of multisensory integration, we reorganised these conditions into a 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6˚ versus >6.6˚) factorial design for the statistical analysis of the behavioural and EEG data. In addition, we included 4 (locations: −10˚, −3.3˚, 3.3˚, or 10˚) × 2 (visual reliability: high, low) unisensory visual conditions and 4 (locations: −10˚, −3.3˚, 3.3˚, or 10˚) unisensory auditory conditions. We did not manipulate auditory reliability, because the reliability of auditory spatial information is limited in any case.
Furthermore, the manipulation of visual reliability is sufficient to determine reliability-weighted integration as a computational principle and arbitrate between the different multisensory integration models (see Bayesian modelling analysis section).
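The factorial structure described above can be enumerated in a few lines. The following is a minimal sketch rather than the code used in the study. It assumes that the four azimuths are equally spaced, so that the nominal ±3.3˚ is treated as ±10/3˚ and the '≤6.6˚ versus >6.6˚' split corresponds to at most one location step versus more than one. All variable names are illustrative.

```python
from itertools import product

# Assumption: the four azimuths are equally spaced, so the nominal +/-3.3 deg
# is treated as +/-10/3 deg and disparity is counted in steps of 20/3 deg.
STEP = 20.0 / 3.0
LOCATIONS = [-10.0 + i * STEP for i in range(4)]
VISUAL_SD = [2.0, 14.0]            # horizontal SD in deg: high vs low reliability
REPORT = ["auditory", "visual"]    # task relevance

conditions = []
for (iv, v), (ia, a), sd, report in product(
        enumerate(LOCATIONS), enumerate(LOCATIONS), VISUAL_SD, REPORT):
    steps = abs(iv - ia)
    conditions.append({
        "v_loc": v, "a_loc": a, "v_sd": sd, "report": report,
        "disparity": steps * STEP,
        # Nominal split from the text: <=6.6 deg (0 or 1 step) versus >6.6 deg.
        "disparity_bin": "small" if steps <= 1 else "large",
    })

n_small = sum(c["disparity_bin"] == "small" for c in conditions)
n_large = len(conditions) - n_small
print(len(conditions), n_small, n_large)
```

Under these assumptions the small-disparity bin pools more conditions than the large-disparity bin, because neighbouring and identical locations are more frequent than distant ones among the 16 location pairs.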
On each trial, synchronous audiovisual, unisensory visual, or unisensory auditory signals were presented for 50 ms, followed by a response cue 1,000 ms after stimulus onset (Fig 1B). The response was cued by a central pure tone (1,000 Hz) and a blue colour change of the fixation cross presented in synchrony for 100 ms. Participants were instructed to withhold their response and avoid blinking until the presentation of the cue. They fixated on a central cross throughout the entire experiment. The next stimulus was presented after a variable response interval of 2.6-3.1 s.
Stimuli and conditions were presented in a pseudo-randomised fashion. The stimulus type (bisensory versus unisensory) and task relevance (auditory versus visual) were held constant within a run of 128 trials. This yielded four run types: audiovisual with auditory report, audiovisual with visual report, auditory with auditory report, and visual with visual report. The task relevance of the sensory modality in a given run was displayed to the participant at the beginning of the run. Furthermore, across runs we counterbalanced the response hand (i.e., left versus right hand) to partly dissociate spatial processing from motor responses. The order of the runs was counterbalanced across participants. All conditions within a run were presented an equal number of times. Each participant completed 60 runs, leading to 7,680 trials in total (3,840 auditory and 3,840 visual localisation tasks; i.e., 96 trials for each of the 76 conditions were included in total, apart from the four unisensory auditory conditions that included 192 trials). The runs were performed across 3 days with 20 runs per day. Each day started with a brief practice run.
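As a quick consistency check, the trial counts given above can be reproduced from the design. This is only an arithmetic sketch with illustrative variable names; all numbers are taken from the text.

```python
# Counts taken from the text; the variable names are illustrative.
TRIALS_PER_RUN = 128
N_RUNS = 60
N_AUDIOVISUAL = 4 * 4 * 2 * 2   # V location x A location x reliability x report
N_UNISENSORY_V = 4 * 2          # location x reliability
N_UNISENSORY_A = 4              # location only

n_conditions = N_AUDIOVISUAL + N_UNISENSORY_V + N_UNISENSORY_A
total_from_runs = TRIALS_PER_RUN * N_RUNS
# 96 trials per condition, except 192 for each unisensory auditory condition.
total_from_conditions = (96 * (N_AUDIOVISUAL + N_UNISENSORY_V)
                         + 192 * N_UNISENSORY_A)

print(n_conditions, total_from_runs, total_from_conditions)
```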

Experimental setup
Stimuli were presented using Psychtoolbox version 3.0.11 [69] (http://psychtoolbox.org/) under MATLAB R2014a (MathWorks) on a desktop PC running Windows 7. Visual stimuli were presented via a gamma-corrected 30" LCD monitor with a resolution of 2,560 × 1,600 pixels at a frame rate of 60 Hz. Auditory stimuli were presented at a sampling rate of 44.1 kHz via eight external speakers (Multimedia) and an ASUS Xonar DSX sound card. Exact audiovisual onset timing was confirmed by recording visual and auditory signals concurrently with a photodiode and a microphone. Participants rested their head on a chin rest at a distance of 475 mm from the monitor and at a height that matched participants' ears to the horizontal midline of the monitor. Participants responded by pressing one of four response buttons on a USB keypad with their index, middle, ring, and little finger, respectively.

Eye movement recording and analysis
To address potential concerns that results were confounded by eye movements, we recorded participants' eye movements. Eye recordings were calibrated in the recommended field of view (32˚ horizontally and 24˚ vertically) for the EyeLink 1000 Plus system with the desktop mount at a sampling rate of 2,000 Hz. Eye position data were parsed online into events (saccade, fixation, eye blink) using the EyeLink 1000 Plus software. The 'cognitive configuration' was used for saccade detection (velocity threshold = 30˚/sec, acceleration threshold = 8,000˚/sec², motion threshold = 0.15˚) with an additional criterion of radial amplitude larger than 1˚. Individual trials were rejected if saccades or eye blinks were detected from −100 to 700 ms post stimulus.

Behavioural data analysis
Participants' stimulus localisation accuracy was assessed as the Pearson correlation between their location responses and the true signal source location separately for unisensory auditory, visual high reliability, and visual low reliability conditions. To confirm whether localisation accuracy in vision exceeded performance in audition in both visual reliabilities, we performed Monte-Carlo permutation tests. Specifically, we entered the subject-specific Fisher z-transformed Pearson correlation differences between vision and audition (i.e., visual-auditory) separately for the two visual reliability levels into a Monte-Carlo permutation test at the group level based on the one-sample t statistic with 5,000 permutations [70].
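The group-level test described above can be sketched as a sign-flip permutation test on Fisher z-transformed correlation differences. This sketch rests on several assumptions: random sign flipping is one common way to implement a one-sample permutation test and may differ in detail from the implementation in [70]; the test is taken to be two-sided; and the correlation values below are hypothetical placeholders, not data from the study.

```python
import math
import random
from statistics import mean, stdev

def one_sample_t(x):
    """One-sample t statistic against zero."""
    return mean(x) / (stdev(x) / math.sqrt(len(x)))

def sign_flip_permutation_test(diffs, n_perm=5000, seed=0):
    """Two-sided Monte-Carlo test: randomly flip the sign of each
    participant's difference and recompute the t statistic."""
    rng = random.Random(seed)
    observed = one_sample_t(diffs)
    n_extreme = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(one_sample_t(flipped)) >= abs(observed):
            n_extreme += 1
    # Add one to numerator and denominator so that p is never exactly zero.
    return observed, (n_extreme + 1) / (n_perm + 1)

# Hypothetical per-participant localisation accuracies (Pearson r), 13 participants.
r_visual = [0.95, 0.93, 0.97, 0.91, 0.96, 0.94, 0.92, 0.98, 0.90, 0.95, 0.93, 0.96, 0.94]
r_auditory = [0.70, 0.62, 0.75, 0.58, 0.66, 0.71, 0.60, 0.68, 0.55, 0.73, 0.64, 0.69, 0.61]

# Fisher z-transform (atanh) each correlation, then take visual minus auditory.
diffs = [math.atanh(rv) - math.atanh(ra) for rv, ra in zip(r_visual, r_auditory)]
t_obs, p_value = sign_flip_permutation_test(diffs)
print(round(t_obs, 2), p_value)
```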

EEG data acquisition
Continuous EEG signals were recorded from 64 channels using Ag/AgCl active electrodes arranged in a 10-20 layout (ActiCap, Brain Products GmbH, Gilching, Germany) at a sampling rate of 1,000 Hz, referenced at FCz. Channel impedances were kept below 10 kΩ.

EEG preprocessing
Preprocessing was performed with the FieldTrip toolbox [71] (http://www.fieldtriptoolbox.org/). For the decoding analysis, raw data were high-pass filtered at 0.1 Hz, re-referenced to average reference, and low-pass filtered at 120 Hz. Trials were extracted with a 100-ms prestimulus and 700-ms poststimulus period and baseline corrected by subtracting the average value of the interval between −100 and 0 ms from the time course. Trials were then temporally smoothed with a 20-ms moving window and downsampled to 200 Hz (note that a 20-ms moving average is comparable to a finite impulse response [FIR] filter with a cutoff frequency of 50 Hz). Trials containing artefacts were rejected based on visual inspection. Furthermore, trials were rejected if (1) they included eye blinks, (2) they included saccades, (3) the distance between eye fixation and the central fixation cross exceeded 2˚, (4) participants responded prior to the response cue, or (5) there was no response. For ERPs (S2 and S3 Figs), the preprocessing was identical to the decoding analysis, except that a 45-Hz low-pass filter was applied without additional temporal smoothing with a temporal moving window. Grand average ERPs were computed by averaging all trials for each condition first within each participant and then across participants.
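The epoch-level steps (baseline correction, 20-ms smoothing, downsampling to 200 Hz) can be illustrated for a single channel. This is a simplified sketch in plain Python and not the FieldTrip pipeline used in the study. It assumes that the epoch contains one sample per millisecond from −100 to 700 ms inclusive, that the moving window is roughly centred and shortened at the epoch edges, and that downsampling keeps every fifth sample. Function names are illustrative.

```python
import math

FS_IN, FS_OUT = 1000, 200            # sampling rates in Hz
T_START_MS, T_END_MS = -100, 700     # epoch limits relative to stimulus onset

def baseline_correct(epoch, n_baseline):
    """Subtract the mean of the first n_baseline samples (-100 to 0 ms)."""
    base = sum(epoch[:n_baseline]) / n_baseline
    return [x - base for x in epoch]

def moving_average(epoch, window):
    """Moving average over `window` samples; windows are truncated at the edges."""
    n = len(epoch)
    half = window // 2
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + window)
        out.append(sum(epoch[lo:hi]) / (hi - lo))
    return out

def preprocess(epoch):
    n_baseline = (0 - T_START_MS) * FS_IN // 1000   # 100 samples
    window = 20 * FS_IN // 1000                     # 20 ms = 20 samples
    x = baseline_correct(epoch, n_baseline)
    x = moving_average(x, window)
    return x[::FS_IN // FS_OUT]                     # keep every 5th sample

# Toy single-channel epoch: 10-Hz oscillation on top of a constant offset.
times_ms = range(T_START_MS, T_END_MS + 1)
epoch = [5.0 + math.sin(2 * math.pi * 10 * t / 1000) for t in times_ms]
smoothed = preprocess(epoch)
print(len(epoch), len(smoothed))
```

With these assumptions an 801-sample epoch is reduced to 161 samples, i.e., one sample every 5 ms, which matches the number of time points used in the neural regression models.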

EEG multivariate pattern analysis
For the multivariate pattern analyses, we computed ERPs by averaging over sets of eight randomly assigned individual trials from the same condition. To characterise the temporal dynamics of the spatial representations, we trained linear SVR models (LIBSVM [72], https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to learn the mapping from ERP activity patterns of the (1) unisensory auditory (for auditory decoding), (2) unisensory visual (for visual decoding), or (3) audiovisual congruent conditions (for audiovisual decoding) to external spatial locations separately for each time point (every 5 ms) over the course of the trial (S2, S3 and S4 Figs). All SVR models were trained and evaluated in a 12-fold stratified cross-validation (12 ERPs/fold) procedure with default hyperparameters (C = 1, ε = 0.001). The specific training and generalisation procedures were adjusted to the scientific questions (see the Shared and distinct neural representations of space across vision and audition section and the GLM analysis of audiovisual weight index w AV section for details).
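The averaging of single trials into ERPs prior to decoding can be sketched as follows. This illustrates only the averaging step, not the SVR decoding itself. It assumes that each trial is given as a flat list of channel values at one time point and that trials which do not fill a complete set of eight are discarded. The function name and the simulated data are hypothetical.

```python
import random

def average_in_sets(trials, set_size=8, seed=0):
    """Randomly assign trials of one condition to sets of `set_size`
    and return one averaged pattern (ERP) per set."""
    rng = random.Random(seed)
    order = list(range(len(trials)))
    rng.shuffle(order)
    n_sets = len(order) // set_size          # incomplete sets are dropped
    erps = []
    for s in range(n_sets):
        members = [trials[i] for i in order[s * set_size:(s + 1) * set_size]]
        # Average channel by channel across the members of the set.
        erps.append([sum(values) / set_size for values in zip(*members)])
    return erps

# Hypothetical data: 96 trials of one condition, 64 channels, one time point.
rng = random.Random(42)
trials = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(96)]
erps = average_in_sets(trials)
print(len(erps), len(erps[0]))
```

Averaging eight trials reduces the noise standard deviation of each pattern by a factor of about √8 at the cost of fewer training examples, which is the usual trade-off behind this step.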

Overview of behavioural and EEG analysis
Combining psychophysics, computational modelling, and EEG, we addressed two questions: First, focusing selectively on the unisensory auditory and unisensory visual conditions, we investigated when spatial representations are formed that generalise across auditory and visual modalities. Second, focusing on the audiovisual conditions, we investigated when and how human observers integrate audiovisual signals into spatial representations that take into account the observer's uncertainty about the world's causal structure consistent with Bayesian causal inference. In the following sections, we will describe the analysis approaches to address these two questions in turn.

Shared and distinct neural representations of space across vision and audition
First, we investigated how the brain forms spatial representations in either audition or vision using the so-called temporal generalisation method [21]. Here, the SVR model is trained at time point t to learn the mapping from, e.g., unisensory visual (or auditory) ERP pattern to external stimulus location. This learnt mapping is then used to predict spatial locations from unisensory visual (or auditory) ERP activity patterns across all other time points. Training and generalisation were applied separately to unisensory auditory and visual ERPs. To match the number of trials for auditory and visual conditions, we applied this analysis to the visual ERPs pooled over the two levels of visual reliability. The decoding accuracy as quantified by the Pearson correlation coefficient between the true and decoded stimulus locations is entered into a training time × generalisation time matrix. The generalisation ability across time illustrates the similarity of EEG activity patterns relevant for encoding features (i.e., here: spatial location) and has been proposed to assess the stability of neural representations [21]. In other words, if stimulus location is encoded in EEG activity patterns that are stable (or shared) across time, then an SVR model trained at time point t will be able to correctly decode stimulus location from EEG activity patterns at other time points. By contrast, if stimulus location is represented by transient or distinct EEG activity patterns across time, then an SVR model trained at time point t will not be able to decode stimulus location from EEG activity patterns at other time points. 
Hence, entering Pearson correlation coefficients as a measure for decoding accuracy for all combinations of training and test time into a temporal generalisation matrix has been argued to provide insights into the stability of neural representations whereby the spread of significant decoding accuracy to off-diagonal elements of the matrix indicates temporal generalisability or stability [21].
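The logic of the temporal generalisation matrix can be sketched on simulated data. To keep the example self-contained, it deliberately replaces the SVR with a much simpler linear readout whose channel weights are the covariance between channel activity and stimulus location; this stand-in is an assumption made for illustration only and is not the decoder used in the study. It also uses a single train/test split instead of cross-validation. The simulated data contain two channels and three time points: location is encoded in channel 0 at the first time point (a transient code) and in channel 1 at the second and third time points (a stable code).

```python
import random

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def covariance_weights(patterns, y):
    """Stand-in decoder: one weight per channel, equal to the covariance
    between that channel and the stimulus location across trials."""
    n = len(y)
    my = sum(y) / n
    n_ch = len(patterns[0])
    means = [sum(p[c] for p in patterns) / n for c in range(n_ch)]
    return [sum((p[c] - means[c]) * (t - my) for p, t in zip(patterns, y)) / n
            for c in range(n_ch)]

def temporal_generalisation(train_x, train_y, test_x, test_y):
    """Rows: training time; columns: generalisation (test) time."""
    n_times = len(train_x[0])
    matrix = []
    for t_train in range(n_times):
        w = covariance_weights([trial[t_train] for trial in train_x], train_y)
        row = []
        for t_test in range(n_times):
            pred = [sum(wc * xc for wc, xc in zip(w, trial[t_test]))
                    for trial in test_x]
            row.append(pearson(pred, test_y))
        matrix.append(row)
    return matrix

def simulate(n_trials, topographies, rng, noise_sd=2.0):
    """x[trial][time][channel] = gain * location + Gaussian noise."""
    locations = [-10.0, -10.0 / 3, 10.0 / 3, 10.0]
    y = [rng.choice(locations) for _ in range(n_trials)]
    x = [[[g * loc + rng.gauss(0.0, noise_sd) for g in topo]
          for topo in topographies] for loc in y]
    return x, y

TOPOGRAPHIES = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]  # transient, then stable code
rng = random.Random(1)
train_x, train_y = simulate(200, TOPOGRAPHIES, rng)
test_x, test_y = simulate(200, TOPOGRAPHIES, rng)
tg_matrix = temporal_generalisation(train_x, train_y, test_x, test_y)
for row in tg_matrix:
    print([round(r, 2) for r in row])
```

In this toy example, decoding accuracy spreads to the off-diagonal elements only between the second and third time points, where the same channel carries the spatial information, whereas the readout trained at the first time point generalises poorly to the later ones.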
Second, to examine whether and when neural representations are formed that are shared across vision and audition, we generalised to ERP activity patterns across time not only from the same sensory modality but also from the other sensory modality (i.e., from vision to audition and vice versa). This cross-sensory generalisation reveals neural representations that are shared across sensory modalities.
To assess whether decoding accuracies were better than chance, we entered the subject-specific matrices of the Fisher z-transformed Pearson correlation coefficients into a between-subjects Monte-Carlo permutation test using the one-sample t statistic with 5,000 permutations ([70], as implemented in the FieldTrip toolbox). To correct for multiple comparisons within the two-dimensional (i.e., time × time) data, cluster-level inference was used based on the maximum of the summed t values within each cluster ('maxsum') with a cluster-defining threshold of p < 0.05, and a two-tailed p-value was computed.

Computational principles of audiovisual integration: GLM-based analysis of audiovisual weight index w AV and Bayesian modelling analysis
To characterise how human observers integrate auditory and visual signals into spatial representations at the behavioural and neural levels, we developed a GLM-based analysis of an audiovisual weight index w AV and a Bayesian modelling analysis that we applied to both (1) the reported auditory and visual spatial estimates (i.e., participants' behavioural localisation responses) and (2) the neural spatial estimates decoded from EEG activity pattern evoked by audiovisual stimuli (see Fig 3 and [14,16]).

GLM analysis of audiovisual weight index w AV
SVR to decode spatial estimates from audiovisual EEG activity pattern. The neural spatial estimates were obtained by training a SVR model on the audiovisual congruent trials to learn the mapping from audiovisual ERP activity pattern at time t to external stimulus locations. This learnt mapping at time t was then used to decode the stimulus location from the ERP activity patterns of the spatially congruent and incongruent audiovisual conditions at time t (see Fig 3A and 3B right). These training and generalisation steps were repeated across all times t to obtain distributions of neural (i.e., decoded) spatial estimates for all 64 conditions for every time point t.
Regression model and computation of behavioural and neural audiovisual weight index w AV. In the 'GLM-based' analysis approach, we quantified the influence of the location of the auditory and visual stimuli on the reported (behavioural) or decoded (neural) spatial estimates using a linear regression model (see Fig 3C left). In this regression model, the reported (or decoded) spatial locations were predicted by the true auditory and visual stimulus locations for each of the eight conditions in the 2 (visual reliability: high versus low) × 2 (task relevance: auditory versus visual report) × 2 (spatial disparity: ≤6.6˚ versus >6.6˚) factorial design (Fig 1A).
Hence, the regression model included 16 regressors in total-i.e., 8 (conditions) × 2 (true auditory or visual spatial locations). We computed one behavioural regression model for participants' reported locations. Further, we computed 161 neural regression models for the spatial locations decoded from EEG activity pattern across time-i.e., one neural regression model for every 5-ms interval, leading to time courses of auditory (ß A ) and visual (ß V ) parameter estimates.
In each regression model, the auditory (ß A ) and visual (ß V ) parameter estimates quantified the influence of auditory and visual stimulus locations on the reported (or decoded) stimulus location for a particular condition. A positive ß V (or ß A ) indicates that the true visual (or auditory) location has a positive weight and hence an attractive effect on the reported or decoded location (e.g., it is shifted towards the true visual location; see Fig 3C left for an example). A negative ß V (or ß A ) indicates that the true visual (or auditory) location has a negative weight and hence a repulsive effect on the reported or decoded location (e.g., it is shifted away from the true visual location). The auditory and visual parameter estimates need to be interpreted together. To obtain a summary index, we computed the relative audiovisual weight (w AV ) as the four-quadrant inverse tangent of the visual (ß V ) and auditory (ß A ) parameter estimates for each of the eight conditions in each regression model (see Fig 3C left). The angles in radians are then converted to degrees: w AV = atan2(ß V , ß A ) × 180˚/π. The four-quadrant inverse tangent was used to map each combination of (positive or negative) visual (ß V ) and auditory (ß A ) parameters uniquely to a value in the closed interval [−π, π], which was then transformed into degrees. If the reported/decoded estimate is dominated purely and positively by the visual signal (i.e., ß A = 0, ß V > 0), then w AV is 90˚. For pure (and positive) auditory dominance, it is 0˚ (i.e., ß A > 0, ß V = 0). Furthermore, if the visual signal has an attractive influence (i.e., it attracts the perceived location towards the visual location) but the auditory signal has a repulsive influence (i.e., it shifts the perceived location away from the auditory location) on perceived/decoded location (i.e., ß A < 0, ß V > 0), then w AV is >90˚ (e.g., Fig 4C, high-disparity condition).
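The computation of the audiovisual weight index can be sketched for a single condition. This is a minimal illustration under stated assumptions: the regression is fitted separately for one condition with two regressors and no intercept, whereas the analysis above uses one model with 16 regressors; the function names are illustrative; and the 'reported' locations are simulated with known weights rather than taken from data.

```python
import math
from itertools import product

def fit_two_regressors(a_loc, v_loc, y):
    """Least-squares fit of y = beta_a * a_loc + beta_v * v_loc
    (no intercept), solved via the 2 x 2 normal equations."""
    saa = sum(a * a for a in a_loc)
    svv = sum(v * v for v in v_loc)
    sav = sum(a * v for a, v in zip(a_loc, v_loc))
    say = sum(a * t for a, t in zip(a_loc, y))
    svy = sum(v * t for v, t in zip(v_loc, y))
    det = saa * svv - sav * sav
    if det == 0:
        raise ValueError("auditory and visual locations are collinear")
    beta_a = (say * svv - svy * sav) / det
    beta_v = (svy * saa - say * sav) / det
    return beta_a, beta_v

def w_av(beta_v, beta_a):
    """Relative audiovisual weight in degrees: four-quadrant inverse
    tangent of the visual and auditory parameter estimates."""
    return math.degrees(math.atan2(beta_v, beta_a))

# Simulated example: spatial estimates that weight vision 0.7 and audition 0.3.
locations = [-10.0, -10.0 / 3, 10.0 / 3, 10.0]
pairs = list(product(locations, locations))          # (auditory, visual)
a_loc = [a for a, _ in pairs]
v_loc = [v for _, v in pairs]
estimates = [0.3 * a + 0.7 * v for a, v in pairs]

beta_a, beta_v = fit_two_regressors(a_loc, v_loc, estimates)
print(round(beta_a, 3), round(beta_v, 3), round(w_av(beta_v, beta_a), 1))
```

The index only depends on the ratio and signs of the two parameter estimates, so it describes the relative influence of the two modalities and not the overall gain of the spatial estimate.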
We obtained one $w_{AV}$ for each of the eight conditions at the behavioural level and one $w_{AV}$ for each of the eight conditions and each time point (every 5 ms) at the neural level. The neural $w_{AV}$ time courses were temporally smoothed using a 20-ms moving average filter.

Statistical analysis of circular indices $w_{AV}$ for behavioural and neural data.
We performed the statistics on the behavioural and neural audiovisual weight indices using a 2 (auditory versus visual report) × 2 (high versus low visual reliability) × 2 (large versus small spatial disparity) factorial design based on the likelihood ratio statistics for circular measures (LRTS) [73]. Similar to an analysis of variance for linear data, LRTS computes the difference in log-likelihood functions between the full model, which allows differences in the mean locations of circular measures between conditions (i.e., main and interaction effects), and the reduced null model, which does not model any mean differences between conditions. LRTS were computed separately for the main effects (i.e., reliability, task relevance, spatial disparity) and interactions.
To avoid parametric assumptions, we evaluated the main effects of visual reliability, task relevance, spatial disparity, and their interactions in the factorial design using randomisation tests (5,000 randomisations). To account for the within-subject repeated-measures design at the second random-effects level, randomisations were performed within each participant. For the main effects of visual reliability, task relevance, and spatial disparity, $w_{AV}$ values were permuted within the levels of the nontested factors. For tests of the two-way interactions (e.g., spatial disparity × task relevance), we permuted the simple main effects of the two factors of interest within the levels of the third factor [74]. For tests of the three-way interaction, values were freely permuted across all conditions [75]. These statistical tests were performed once for behavioural $w_{AV}$ and independently for each time point between 55 and 700 ms (i.e., 130 tests) for neural $w_{AV}$ (see below for multiple comparison correction across time points).
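The logic of the within-participant randomisation for a main effect can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the test statistic here is the absolute angular distance between two circular means, which stands in for the LRTS, and all function names are hypothetical. Each pair holds the two $w_{AV}$ values (in degrees) of one participant for the two levels of the tested factor within one cell of the nontested factors, so that swapping within a pair respects the permutation restrictions described above.

```python
import math
import random


def circ_mean_deg(angles_deg):
    """Circular mean of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def abs_circ_diff_deg(pairs):
    """Absolute angular distance (degrees) between the circular means of two levels."""
    m1 = circ_mean_deg([p[0] for p in pairs])
    m2 = circ_mean_deg([p[1] for p in pairs])
    d = math.radians(m1 - m2)
    return abs(math.degrees(math.atan2(math.sin(d), math.cos(d))))


def randomisation_test(pairs, n_rand=5000, rng=random):
    """Randomly swap the two levels within each pair and compare with the observed statistic."""
    observed = abs_circ_diff_deg(pairs)
    n_extreme = 0
    for _ in range(n_rand):
        shuffled = [(b, a) if rng.random() < 0.5 else (a, b) for a, b in pairs]
        if abs_circ_diff_deg(shuffled) >= observed:
            n_extreme += 1
    # The observed labelling counts as one of the possible randomisations.
    return observed, (n_extreme + 1) / (n_rand + 1)
```

Swapping only within pairs keeps the data of each participant and each cell of the nontested factors together, which is what makes the null distribution appropriate for a repeated-measures design.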
To assess the similarity between behavioural and neural audiovisual weight ($w_{AV}$) indices, we computed the circular correlation coefficient (as implemented in the CircStat toolbox [76]) between the eight behavioural (i.e., constant across time) and eight neural (i.e., variable across time) $w_{AV}$ from our 2 (high versus low visual reliability) × 2 (auditory versus visual report) × 2 (large versus small spatial disparity) factorial design separately for each time point.
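For readers unfamiliar with circular statistics, a circular-circular correlation between two sets of angles can be sketched as below. The formula is the commonly used coefficient based on sines of deviations from the circular means; we assume, but have not verified here, that it matches the toolbox routine used in the study. The function names are hypothetical, and angles are taken in degrees to match $w_{AV}$.

```python
import math


def circ_mean_rad(angles_rad):
    """Circular mean (radians) of a sequence of angles in radians."""
    s = sum(math.sin(a) for a in angles_rad)
    c = sum(math.cos(a) for a in angles_rad)
    return math.atan2(s, c)


def circular_correlation(alpha_deg, beta_deg):
    """Circular correlation coefficient between two equally long lists of angles (degrees)."""
    alpha = [math.radians(a) for a in alpha_deg]
    beta = [math.radians(b) for b in beta_deg]
    a_bar, b_bar = circ_mean_rad(alpha), circ_mean_rad(beta)
    # Sines of the deviations from the respective circular means.
    sa = [math.sin(a - a_bar) for a in alpha]
    sb = [math.sin(b - b_bar) for b in beta]
    num = sum(x * y for x, y in zip(sa, sb))
    den = math.sqrt(sum(x * x for x in sa) * sum(y * y for y in sb))
    return num / den
```

Applied to the present design, `alpha_deg` would hold the eight behavioural $w_{AV}$ values and `beta_deg` the eight neural $w_{AV}$ values at one time point, with the computation repeated for every time point.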
Unless otherwise stated, results are reported at p < 0.05. To correct for multiple comparisons across time, cluster-level inference was used based on the maximum of the summed LRTS values within each cluster ('maxsum') with an uncorrected cluster-defining threshold of p < 0.05 (as implemented in the FieldTrip toolbox).
For plotting circular means of $w_{AV}$ (Fig 1C for behavioural $w_{AV}$, Fig 4A-4D for neural $w_{AV}$), we computed the means' confidence intervals (as implemented in the CircStat toolbox [76]).

Bayesian modelling analysis
Description of Bayesian models and decision strategies. Next, we fitted the full-segregation model(s), the forced-fusion model, and the Bayesian causal inference model to the spatial estimates that were reported by observers (i.e., behavioural response distribution, Fig 3B left) or decoded from ERP activity patterns at time t (i.e., neural spatial estimate distribution, Fig 3B right). Using Bayesian model comparison, we then assessed which of these models best explains the behavioural or neural spatial estimates.
In the following, we will first describe the Bayesian causal inference model, from which we will then derive the forced-fusion and full-segregation models as special cases (details can be found in [2,13-15]).
Briefly, the generative model of Bayesian causal inference (see Fig 3C right) assumes that common (C = 1) or independent (C = 2) causes are sampled from a binomial distribution defined by the common-cause prior $p_{common}$. For a common source, the 'true' location $S_{AV}$ is drawn from the spatial prior distribution $N(\mu_{AV}, \sigma_P)$. For two independent causes, the 'true' auditory ($S_A$) and visual ($S_V$) locations are drawn independently from this spatial prior distribution. For the spatial prior distribution, we assumed a central bias (i.e., $\mu_{AV} = 0$). We introduced sensory noise by drawing $x_A$ and $x_V$ independently from normal distributions centred on the true auditory (or visual) locations with variances $\sigma_A^2$ (or $\sigma_V^2$). Thus, the generative model included the following free parameters: the common-source prior $p_{common}$, the spatial prior variance $\sigma_P^2$, the auditory variance $\sigma_A^2$, and the two visual variances $\sigma_V^2$ corresponding to the two visual reliability levels.
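The generative model described above can be made concrete by sampling a single trial from it. The following Python sketch is illustrative only; the function and argument names are hypothetical, the standard deviations (rather than variances) are passed as arguments, and the prior mean defaults to 0 in line with the central bias assumed in the text.

```python
import random


def sample_trial(p_common, sigma_p, sigma_a, sigma_v, mu_prior=0.0, rng=random):
    """Draw the causal structure, source locations, and noisy sensory signals for one trial."""
    # Causal structure: common cause (C = 1) with probability p_common, else C = 2.
    c = 1 if rng.random() < p_common else 2
    if c == 1:
        # One source location shared by both modalities.
        s_a = s_v = rng.gauss(mu_prior, sigma_p)
    else:
        # Two source locations drawn independently from the same spatial prior.
        s_a = rng.gauss(mu_prior, sigma_p)
        s_v = rng.gauss(mu_prior, sigma_p)
    # Internal sensory signals: true locations corrupted by independent Gaussian noise.
    x_a = rng.gauss(s_a, sigma_a)
    x_v = rng.gauss(s_v, sigma_v)
    return c, s_a, s_v, x_a, x_v
```

In the experiment itself the source locations are set by the experimenter rather than sampled; the sketch only illustrates the assumptions the observer model makes about how signals arise.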
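The posterior probability of the underlying causal structure can be inferred by combining the common-source prior with the sensory evidence according to Bayes' rule:

$$P(C = 1 \mid x_A, x_V) = \frac{P(x_A, x_V \mid C = 1)\, p_{common}}{P(x_A, x_V)}$$

In the case of a common source (C = 1), the optimal estimate of the audiovisual location is a reliability-weighted average of the auditory and visual percepts and the spatial prior (referred to as the forced-fusion spatial estimate):

$$\hat{S}_{AV,C=1} = \frac{\dfrac{x_A}{\sigma_A^2} + \dfrac{x_V}{\sigma_V^2} + \dfrac{\mu_{AV}}{\sigma_P^2}}{\dfrac{1}{\sigma_A^2} + \dfrac{1}{\sigma_V^2} + \dfrac{1}{\sigma_P^2}}$$

where $\mu_{AV}$ and $\sigma_P^2$ denote the mean and variance of the spatial prior.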
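In the case of independent sources (C = 2), the auditory and visual stimulus locations (for the auditory and visual location report, respectively) are estimated independently (referred to as unisensory auditory or visual segregation estimates):

$$\hat{S}_{A,C=2} = \frac{\dfrac{x_A}{\sigma_A^2} + \dfrac{\mu_{AV}}{\sigma_P^2}}{\dfrac{1}{\sigma_A^2} + \dfrac{1}{\sigma_P^2}}, \qquad \hat{S}_{V,C=2} = \frac{\dfrac{x_V}{\sigma_V^2} + \dfrac{\mu_{AV}}{\sigma_P^2}}{\dfrac{1}{\sigma_V^2} + \dfrac{1}{\sigma_P^2}}$$

where $\mu_{AV}$ and $\sigma_P^2$ denote the mean and variance of the spatial prior. To provide a final estimate of the auditory and visual locations, the brain can combine the estimates from the two causal structures using various decision functions such as 'model averaging', 'model selection', and 'probability matching' [13].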
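According to the 'model averaging' strategy, the brain combines the integrated forced-fusion spatial estimate with the segregated, task-relevant unisensory (i.e., either auditory or visual) spatial estimates weighted in proportion to the posterior probability of the underlying causal structures:

$$\hat{S}_A = P(C = 1 \mid x_A, x_V)\, \hat{S}_{AV,C=1} + \bigl(1 - P(C = 1 \mid x_A, x_V)\bigr)\, \hat{S}_{A,C=2}$$

$$\hat{S}_V = P(C = 1 \mid x_A, x_V)\, \hat{S}_{AV,C=1} + \bigl(1 - P(C = 1 \mid x_A, x_V)\bigr)\, \hat{S}_{V,C=2}$$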
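According to the 'model selection' strategy, the brain reports the spatial estimate selectively from the more likely causal structure (Eq 8 only shown for $\hat{S}_A$):

$$\hat{S}_A = \begin{cases} \hat{S}_{AV,C=1} & \text{if } P(C = 1 \mid x_A, x_V) > 0.5 \\ \hat{S}_{A,C=2} & \text{otherwise} \end{cases} \tag{8}$$

According to 'probability matching', the brain reports the spatial estimate of one causal structure stochastically selected in proportion to the posterior probability of this causal structure (Eq 9 only shown for $\hat{S}_A$):

$$\hat{S}_A = \begin{cases} \hat{S}_{AV,C=1} & \text{if } P(C = 1 \mid x_A, x_V) > \xi, \quad \xi \sim U(0, 1) \\ \hat{S}_{A,C=2} & \text{otherwise} \end{cases} \tag{9}$$

where $\xi$ is drawn anew from the uniform distribution on each trial. Thus, Bayesian causal inference formally requires three spatial estimates ($\hat{S}_{AV,C=1}$, $\hat{S}_{A,C=2}$, $\hat{S}_{V,C=2}$), which are combined into a final Bayesian causal inference estimate ($\hat{S}_A$ or $\hat{S}_V$, depending on which sensory modality is task relevant) according to one of the three decision functions.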
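The computations described above, from the internal signals to the final estimate under each decision function, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the closed-form Gaussian likelihoods of the signals under each causal structure are the standard expressions from the Bayesian causal inference literature cited above [2,13], which this section does not spell out.

```python
import math
import random


def bci_estimates(x_a, x_v, p_common, sigma_p, sigma_a, sigma_v, mu_prior=0.0):
    """Return P(C=1 | x_a, x_v) and the fused and segregated spatial estimates."""
    va, vv, vp = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2
    # Likelihood of both signals under a common cause (source location integrated out).
    d1 = va * vv + va * vp + vv * vp
    q1 = ((x_v - x_a) ** 2 * vp + (x_v - mu_prior) ** 2 * va
          + (x_a - mu_prior) ** 2 * vv) / d1
    like_c1 = math.exp(-0.5 * q1) / (2 * math.pi * math.sqrt(d1))
    # Likelihood under independent causes: product of two unisensory marginals.
    q2 = (x_a - mu_prior) ** 2 / (va + vp) + (x_v - mu_prior) ** 2 / (vv + vp)
    like_c2 = math.exp(-0.5 * q2) / (2 * math.pi * math.sqrt((va + vp) * (vv + vp)))
    # Bayes' rule for the causal structure.
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
    # Reliability-weighted estimates for each causal structure.
    s_av = (x_a / va + x_v / vv + mu_prior / vp) / (1 / va + 1 / vv + 1 / vp)
    s_a_seg = (x_a / va + mu_prior / vp) / (1 / va + 1 / vp)
    s_v_seg = (x_v / vv + mu_prior / vp) / (1 / vv + 1 / vp)
    return post_c1, s_av, s_a_seg, s_v_seg


def combine(post_c1, s_fused, s_seg, strategy, rng=random):
    """Final estimate for the task-relevant modality under one decision function."""
    if strategy == "averaging":
        return post_c1 * s_fused + (1 - post_c1) * s_seg
    if strategy == "selection":
        return s_fused if post_c1 > 0.5 else s_seg
    if strategy == "matching":
        return s_fused if post_c1 > rng.random() else s_seg
    raise ValueError(f"unknown strategy: {strategy}")
```

For an auditory report, `combine` would be called with the auditory segregation estimate as `s_seg`; for a visual report, with the visual one. Setting `p_common` to 1 or 0 reduces the model to the forced-fusion or full-segregation special cases.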
In the main paper, we present behavioural results using 'model averaging' as the decision function, which was associated with the highest model evidence and exceedance probability at the group level. S1 Table shows the model evidence, exceedance probabilities, and parameters for Bayesian causal inference across the three decision strategies for the behavioural data.
At the behavioural level, we evaluated whether and how participants integrate auditory and visual stimuli by comparing (1) the Bayesian causal inference model (i.e., with model averaging; Table 1), (2) the forced-fusion model that integrates auditory and visual signals in a mandatory fashion (i.e., formally, the BCI model with a fixed $p_{common} = 1$; Fig 3C, encircled in yellow), and (3) the full-segregation model that estimates stimulus location independently for vision and audition (i.e., formally, the BCI model with a fixed $p_{common} = 0$; Fig 3C, SegV,A encircled in light blue). This SegV,A model assumes that observers report $\hat{S}_{A,C=2}$ when they are asked to report the auditory location and $\hat{S}_{V,C=2}$ when they are asked to report the visual location. In short, the SegV,A model reads out the spatial estimate from the task-relevant unisensory segregation model.
At the neural level, we may also conceive of a neural source (or brain region) that represents $\hat{S}_{V,C=2}$, irrespective of which sensory modality needs to be reported (i.e., Fig 3C, SegV model, encircled in red). For instance, primary visual cortices may be considered predominantly unisensory with selective representations of the visual location even if the observer needs to report the auditory stimulus location. Likewise, we included a model that selectively represents the auditory location (i.e., Fig 3C, unisensory auditory segregation [SegA] model, encircled in green). By contrast, the full-segregation audiovisual model (i.e., SegV,A, encircled in light blue) can be thought of as a neural source (or brain area) that encodes the task-relevant estimate computed in a full-segregation model. It differs from the Bayesian causal inference model by not allowing for any audiovisual interactions or biases irrespective of the probabilities of the world's causal structure (i.e., operationally manipulated by spatial disparity in the current experiment).
At the behavioural level, the unisensory SegV and SegA models are not useful, because we would expect observers to follow instructions and report their auditory estimate for the auditory report conditions and their visual estimate for the visual report conditions. In other words, it does not seem reasonable to fit the unisensory SegV and SegA models jointly to visual and auditory localisation responses at the behavioural level. By contrast, at the neural level, spatial estimates decoded from EEG activity patterns may potentially reflect neural representations that are formed by 'predominantly unisensory' neural generators (e.g., primary visual cortex), particularly in early processing phases. Hence, we estimated and compared three models for the behavioural localisation reports and five models for the spatial estimates decoded from EEG activity patterns.
Model fitting to behavioural and neural spatial estimates and Bayesian model comparison. We fitted each model individually to participants' behavioural localisation responses (or spatial estimates decoded from EEG activity patterns at time t) based on the predicted distributions of the spatial estimates (i.e., $p(\hat{S} \mid S_A, S_V)$; we use $\hat{S}$ as a variable to refer generically to any spatial estimate) for each combination of auditory ($S_A$) and visual ($S_V$) source location. These predicted distributions marginalise over the internal sensory inputs (i.e., $x_A$, $x_V$) that are unknown to the experimenter (see [2] for further explanation). More specifically, we fitted (1) the Bayesian causal inference model based on $p(\hat{S}_A \mid S_A, S_V)$ for auditory report conditions and $p(\hat{S}_V \mid S_A, S_V)$ for visual report conditions, (2) the forced-fusion model based on $p(\hat{S}_{AV,C=1} \mid S_A, S_V)$, and (3) the SegV,A model based on $p(\hat{S}_{A,C=2} \mid S_A, S_V)$ for auditory report conditions and $p(\hat{S}_{V,C=2} \mid S_A, S_V)$ for visual report conditions. At the neural level, we also fitted the SegV model based on $p(\hat{S}_{V,C=2} \mid S_A, S_V)$ and the SegA model based on $p(\hat{S}_{A,C=2} \mid S_A, S_V)$ to the spatial estimates decoded from EEG activity patterns across both visual and auditory report conditions.
To marginalise over the internal variables $x_A$ and $x_V$ that are not accessible to the experimenter, the predicted distributions were generated by simulating $x_A$ and $x_V$ 10,000 times for each of the 64 conditions and inferring the different sorts of spatial estimate $\hat{S}$ from Eq 3-9. To link any of those $p(\hat{S} \mid S_A, S_V)$ to participants' auditory and visual discrete localisation responses at the behavioural level, we assumed that participants selected the button that is closest to $\hat{S}$ and binned the $\hat{S}$ accordingly into a histogram (with four bins corresponding to the four buttons). Thus, we obtained a histogram of predicted localisation responses for each of those five models separately for each condition and individually for each participant. Based on these histograms, we computed the probability of a participant's counts of localisation responses using the multinomial distribution (see [2]). This gives the likelihood of the model given participants' response data. Assuming independence of conditions, we summed the log likelihoods across conditions.
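The simulation-based likelihood described above can be sketched for a single condition as follows. This is an illustrative sketch, not the code used in the study. For brevity it uses the forced-fusion estimate as the spatial estimate; the other models would substitute their own estimator. The function names are hypothetical, the response buttons are assumed to correspond to the four stimulus locations, and the small probability floor that guards against taking the logarithm of zero is our own assumption.

```python
import math
import random

LOCATIONS = (-10.0, -3.3, 3.3, 10.0)  # assumed button locations in degrees


def fused_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p, mu_prior=0.0):
    """Reliability-weighted average of both signals and the spatial prior."""
    va, vv, vp = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2
    return (x_a / va + x_v / vv + mu_prior / vp) / (1 / va + 1 / vv + 1 / vp)


def predicted_response_probs(s_a, s_v, sigma_a, sigma_v, sigma_p,
                             n_sim=10000, rng=random):
    """Simulate internal signals and bin each estimate to the nearest button."""
    counts = [0] * len(LOCATIONS)
    for _ in range(n_sim):
        x_a = rng.gauss(s_a, sigma_a)
        x_v = rng.gauss(s_v, sigma_v)
        s_hat = fused_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p)
        nearest = min(range(len(LOCATIONS)), key=lambda i: abs(LOCATIONS[i] - s_hat))
        counts[nearest] += 1
    return [c / n_sim for c in counts]


def multinomial_log_likelihood(observed_counts, probs, floor=1e-6):
    """Log probability of the observed response counts given predicted probabilities."""
    n = sum(observed_counts)
    ll = math.lgamma(n + 1)  # log n!
    for k, p in zip(observed_counts, probs):
        ll += k * math.log(max(p, floor)) - math.lgamma(k + 1)
    return ll
```

The log likelihood of a model for one participant would then be the sum of `multinomial_log_likelihood` over all conditions, each evaluated with that condition's predicted probabilities and observed response counts.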
At the neural level, we first binned the spatial estimates decoded from each ERP activity pattern at each time point based on their distance from the four true locations (i.e., −10°, −3.3°, 3.3°, or 10°) into four spatial bins before fitting the models to those discretised spatial estimates.
To obtain maximum-likelihood estimates for the parameters of the models ($p_{common}$, $\sigma_P$, $\sigma_A$, and $\sigma_{V1}$ and $\sigma_{V2}$ for the two levels of visual reliability; formally, the forced-fusion and segregation models assume $p_{common} = 1$ or $p_{common} = 0$, respectively), we used a nonlinear simplex optimisation algorithm as implemented in MATLAB's fminsearch function (MATLAB R2016a). This optimisation algorithm was initialised with the parameter setting that obtained the highest log likelihood in a prior grid search.
The model fit for behavioural and neural data (i.e., at each time point) was assessed by the coefficient of determination $R^2$ [77], defined as

$$R^2 = 1 - \exp\left(-\frac{2}{n}\bigl(l(\hat{\beta}) - l(0)\bigr)\right)$$

where $l(\hat{\beta})$ and $l(0)$ denote the log likelihoods of the fitted and the null model, respectively, and n is the number of data points. For the null model, we assumed that an observer randomly chooses one of the four response options; i.e., we assumed a discrete uniform distribution with a probability of 0.25. Because, in our case, the Bayesian causal inference model's responses were discretised to relate them to the four discrete response options, the coefficient of determination was scaled (i.e., divided) by the maximum coefficient (see [77]), defined as

$$\max(R^2) = 1 - \exp\left(\frac{2}{n}\, l(0)\right)$$

To identify the optimal model for explaining participants' data (i.e., localisation responses at the behavioural level or spatial estimates decoded from EEG activity patterns at the neural level), we compared the candidate models using the Bayesian information criterion (BIC) as an approximation to the model evidence [78]:

$$\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$$
where $\hat{L}$ denotes the likelihood, n the number of data points (i.e., EEG activity patterns summed over conditions at a time point t), and k the number of parameters. The BIC depends on both model complexity and model fit. We performed Bayesian model selection [25] at the group (i.e., random-effects) level as implemented in SPM8 [79] to obtain the protected exceedance probability that one model is better than any of the other candidate models above and beyond chance.

Assumptions and caveats of EEG decoding analyses. The EEG activity patterns measured across 64 scalp electrodes represent a superposition of activity generated by potentially multiple neural sources located, for instance, in auditory, visual, and higher-order association areas. The extent to which auditory or visual information can be decoded from EEG activity patterns therefore depends inherently on how information is neurally encoded by the 'neural generators' in source space and on how these neural activities are expressed and superposed in sensor space (i.e., as measured by scalp electrodes). For example, visual space is retinotopically encoded, whereas auditory space is represented by broadly tuned neuronal populations (i.e., opponent channel coding model [31,80]), a rate-based code [30,81], or spike latency and pattern [82,83]. These differences in the encoding of auditory and visual space may contribute to the visual bias we observed for the audiovisual weight index $w_{AV}$ in early processing (Fig 4A-4D) and to the dominance of the SegV model in the time course of exceedance probabilities (Fig 4F). Furthermore, particularly at later stages, scalp EEG patterns likely rely on the superposition of activity from multiple neural generators, so that 'decodability' will also depend on how source activities combine and project to the scalp (e.g., source orientation).
Given the inverse problem involved in inferring sources from EEG topographies, recent studies suggested combining information from fMRI and EEG activity patterns via representational similarity analyses [84,85]. Although we also pursue this approach informally in the Discussion section, where we merge information from a previous fMRI study that used the same ventriloquist paradigm and analyses with our current EEG results, we recognise the limitations of such an fMRI and EEG fusion approach. For instance, different features encoded in neural activity may be expressed in the BOLD response and in EEG scalp topographies [86].
Finally, we trained the SVR model on the audiovisual congruent conditions pooled over task relevance and visual reliability to ensure that the decoder was based on activity patterns generated by sources related to auditory, visual, and audiovisual integration processes and that the effects of task relevance or reliability on the audiovisual weight index $w_{AV}$ cannot be attributed to differences in the decoding model (see [65] for a related discussion).

]) are shown for the main effects of visual reliability (row 1), task relevance (row 2), spatial disparity (row 3), and the visual reliability × task relevance interaction (row 4) at three selected electrodes (i.e., Fz = left; Pz = middle; Oz = right columns). For each effect, we show the power for the difference (or interaction) and the individual conditions coded in different colours as indicated for each row. Grey shaded areas indicate the time windows where at least one electrode was part of the significant cluster after correcting for multiple comparisons across time (i.e., −200 ms to 700 ms), frequency (i.e., 4-30 Hz), and topography. (B) Topographies of the t values averaged across the significant time windows of the corresponding effects. Electrodes marked with black stars were part of the significant cluster (corrected across topography × time × frequency). (TIF)

S1 Table. Model parameters (across-subjects mean ± SEM) and fit indices of the BCI models with different decision functions. Model averaging (BCI avg), model selection (BCI sel), and probability matching (BCI match). BCI, Bayesian causal inference; PEP, protected exceedance probability; $R^2$, coefficient of determination; relBIC group, group-level relative Bayesian information criterion [25]. (DOCX)

S2 Table. Time-frequency results. Significant effects are shown for overall relative to baseline, main effect of VR, and main effect of task relevance ('Task'), and the interaction between VR and task relevance is shown across rows. Columns of the table indicate the approximate time windows that the significant cluster spanned. All p-values are reported at the cluster level, corrected for multiple comparisons over time × topography × frequency. VR, visual reliability. (DOCX)