Speech Cues Contribute to Audiovisual Spatial Integration

Speech is the most important form of human communication but ambient sounds and competing talkers often degrade its acoustics. Fortunately the brain can use visual information, especially its highly precise spatial information, to improve speech comprehension in noisy environments. Previous studies have demonstrated that audiovisual integration depends strongly on spatiotemporal factors. However, some integrative phenomena such as McGurk interference persist even with gross spatial disparities, suggesting that spatial alignment is not necessary for robust integration of audiovisual place-of-articulation cues. It is therefore unclear how speech-cues interact with audiovisual spatial integration mechanisms. Here, we combine two well established psychophysical phenomena, the McGurk effect and the ventriloquist's illusion, to explore this dependency. Our results demonstrate that conflicting spatial cues may not interfere with audiovisual integration of speech, but conflicting speech-cues can impede integration in space. This suggests a direct but asymmetrical influence between ventral ‘what’ and dorsal ‘where’ pathways.


Introduction
Our brains continually integrate information from multiple sensory systems to improve perception [1,2,3,4,5]. For instance, watching a speaker's lip movements significantly enhances speech intelligibility [3,4], especially when speech is degraded by reverberations or competing talkers (e.g. at a cocktail party [6]). Furthermore, the brain can use visual information to refine unreliable auditory spatial estimates [7,8,9,10]. Previous research has clearly demonstrated the general importance of low-level stimulus attributes, such as spatial and temporal coincidence, in these integrative processes [11,12,13,14,15]. Specifically, the likelihood of integration decreases with increasing spatiotemporal disparity. Although space and time are widely accepted as important factors in integration, not all integrative processes require strict spatial alignment. For instance, the McGurk illusion [16] (e.g. when subjects are presented with an auditory /aba/ and visual / aga/ they typically report hearing /ada/) persists even with large spatial differences [17]. Thus, it seems that spatial information does not influence higher order processing of speech stimuli under all circumstances. However we do not know, conversely, whether speech cues influence basic multisensory processing of space. In this study, we directly test the hypothesis that phonetically incongruent audiovisual speech affects the integration of spatial information by measuring the effect of phonetically congruent and incongruent stimuli (i.e. McGurk pairs [16]) on the ventriloquist's illusion [18].
The ventriloquist's illusion provides a powerful assay of audiovisual spatial integration. Subjects often experience the illusion when presented with spatially disparate audiovisual stimuli (e.g. [15]). Typically, the perceived location of the sound is captured by the visual cues; however for a range of spatial disparities, this capture may succeed on some trials and fail on others, even with physically identical stimuli. The illusion can thus be harnessed as a direct index for ongoing spatial integration of sight and sound in the absence of stimulus confounds. Although studies have demonstrated that spatial integration, as measured by the ventriloquist's illusion, is susceptible to high order cognitive variables such as the ''cognitive compellingness'' of the stimulus set used [19], it is not clear how relatively high-order phonetic cues specifically affect these integrative processes.
In this study, we explored the relationship between audiovisual integration of spatial and phonetic cues. To do this, we used well described audiovisual illusions as metrics for integration in each domain: the ventriloquist's illusion in space [18] and McGurk interference for speech related cues [16]. Furthermore, to control and adequately sample acoustic space, we simulated it with subject specific head related transfer functions (HRTFs). We hypothesized that integration of speech cues, such as the place of articulation important for McGurk interference, would operate independently of audiovisual spatial relationships; in contrast, we hypothesized that speech cues would have a significant impact on audiovisual spatial integration.

Subjects and Ethics Statement
Fifteen healthy subjects (10 women, ages 19-26 yrs., mean 22 yrs.; 5 participated in an HRTF Validation Experiment, see below for details) gave written consent according to procedures approved by the University of California, Davis Institutional Review Board (UCD IRB) and were paid for their participation in the UCD IRB approved experiment described here. Participants learned English as an infant, had self-reported good hearing, and normal or corrected-to-normal vision. Four subjects (3 female) were excused from the study prior to data collection due to technical difficulties in estimating their head related transfer functions. Consequently, the reported findings include the remaining 11 (7 female) subjects.

Stimuli
Four consonants (/b/, /g/, /k/, /p/) were paired with /a/ to produce four vowel-consonant-vowel (VCV) speech tokens. These VCVs were spoken by a female actress with voice training and recorded using a digital camcorder and remote microphone. Video was acquired at 29.97 Hz and audio at 48 kHz. Videos were subsequently resampled at 60 Hz, converted to black and white, and luminance-normalized using in-house scripts. A single instance of each VCV was used in the experiment. These tokens were selected to maximize the temporal alignment (,5 ms offset at speech and consonant onset, well within the duration of a single frame of the 60 Hz video) and match the pitch and timbre of phonetically congruent and incongruent pairs. Sounds were presented dichotically at ,70 dB through ER-4B headphones (www.etymotic.com). All auditory stimuli were digitally filtered to compensate for the specific frequency response of the sound playback and recording equipment. White noise was presented diotically at approximately 54 dB to mask any transients introduced during speech filtering (see below). This resulted in an approximate speech-to-noise ratio of +16 dB.

Head related transfer function (HRTF) estimation
Individualized head related transfer functions (HRTFs) were used to create a virtual acoustic environment for each subject. Importantly, these were deliberately acquired in a reverberant room to improve sound externalization and prevent sounds from appearing ''in the head'', a problem that has limited the use of HRTFs in studies of spatial hearing [20]. Thus, each subject's location-specific transfer functions contained both aspects of the subject's HRTF and the room impulse response. Although describing these estimates as a pure HRTF is technically inaccurate, we use this notation rather than its proper notation (binaural room impulse response or BRIR) for clarity of presentation.
HRTFs were estimated by presenting 3 s of white noise from a Tannoy Precision 6D (www.tannoy.com) free-field monitor located approximately 2 m from the subject and recording the signal at the entrance of the subject's ear canals using AuSIM inner ear microphones (www.ausim3d.com) at a rate of 96 kHz. Subject-and location-specific transfer functions were calculated in MATLAB (www.mathworks.com) by dividing the Fast Fourier Transform (FFT) of the recording in each ear by the FFT of the original white noise sample at each location. The procedure was repeated at locations ranging from 236u (left of midline) +36u (right of midline) in 6u increments. Auditory speech stimuli were positioned in virtual space by filtering VCVs with a truncated (80 ms) time representation of the HRTF called the head related impulse response (HRIR). This resulted in robust externalization, minimized unwanted anomalies (e.g. ''front-back confusion''), and maintained the discrete spatial nature of the signal (see Figure 2).

Head related transfer function (HRTF) validation
HRTFs for ten healthy individuals (6 female, mean age 24, 5 participated in main experiment) were acquired as described above and the accuracy of perceived auditory space was assessed using a pointing task. Subjects sat in a comfortable, rotatable chair equipped with a custom headrest and laser pointer suspended directly above the subject's head. The subject was instructed to fixate on a marking located straight ahead (0u) during sound presentation. A single vowel-consonant-vowel (VCV) was then presented at one of thirteen virtual locations ranging from 236u to +36u in random order. Following sound offset, the subject oriented her body towards the perceived azimuthal location by rotating in the chair. The location of the laser pointer was recorded by the experimenter and converted to azimuthal angle in MATLAB. The subject returned to the starting position and the procedure was repeated for a total of twenty-six trials (two per location).

Task Procedures
Subjects sat in a comfortable chair in a dimly lit room and placed their chin in a chinrest to prevent head movements. Visual stimuli were presented on a Dell FPW2407 (www.dell.com) situated approximately 53 cm from the subject's brow (Figure 1). At this distance, the visual speech stimuli filled the center 6u of visual space (63u from midline). Auditory stimuli were paired with three possible types of videos, resulting in three conditions: Still Face, auditory stimuli paired with a motionless face; Congruent, auditory stimuli paired with their phonetically congruent videos; and Incongruent, auditory stimuli paired with a phonetically Incongruent video to create McGurk interference.
The main experiment lasted 1 hr and was divided into two halves. In each half, subjects were presented with stimuli from one of two sets; Set A (auditory /aba/ and /aga/) or Set B (auditory / apa/ and /aka/). Stimuli were presented in the Still Face, Congruent, and Incongruent (auditory /aba/ or /aga/ paired with visual /aga/ or /aba/ for Set A and auditory /apa/ or /aka/ paired with visual /aka/ or /apa/ for Set B) conditions. Each half consisted of three, 7 min sessions; each session consisted of 156 trials, for a total of 6 trials at each location, VCV, and condition. Locations, conditions, and VCVs were presented in pseudorandom order so that no single VCV, location, or condition was presented on more than three consecutive trials. Individual trials lasted approximately 2.6 s and consisted of three segments ( Figure 1A). The first was a 500 ms fixation period following the loading of the first frame of the video at trial onset, designed to minimize potential transient attentional artifacts. This buffer was followed by a 1.0 s stimulus presentation period, which was in turn followed by an approximately 1.0 s response window. Subjects completed three sessions comprised of stimuli from either Set A or B before beginning the second set. Set order was counterbalanced across subjects.
Subjects were instructed to respond to two separate questions for each trial. First, subjects reported if they perceived the sound to originate from the speaker's mouth (Same-Location) or somewhere else (Different-Location) by pressing their left index and middle fingers, respectively. Importantly, the experimenter repeatedly emphasized that the task was to compare the locations of what they heard and saw and that the presence or absence of mouth movement or whether or not the mouth movements matched what was heard was not part of the task. This emphasis helped ensure that any effects of condition are likely to be conservative. Additionally, subjects were instructed to maintain fixation throughout the experiment. The experimenter monitored for gross eye movements and any head movement via a remote camera placed in the testing room. Second, subjects reported which VCV they heard by pressing one of four buttons with their right index, middle, ring, and little fingers. These buttons were mapped to /aba/ or /aka/, /ada/ or /apa/, /aga/ or /ata/, and /other/ for Set A and Set B respectively. Although not explicitly presented, /ada/ and /ata/ were included as response options because subjects commonly report these percepts for auditory /b/ and /p/ paired with visual /g/ and /k/, respectively [16]. The / other/ category was included to allow for any unexpected response types. Subjects were instructed to press the /other/ button only if they could not unambiguously classify what they heard as one of the other three options. For each subject, speech sounds corresponding to the /other/ classification were noted by the experimenter during a short debriefing session.
Prior to beginning either set of stimuli, subjects completed a training session to learn the response mapping and task. This typically took 6-9 min and never exceeded 12 min of training. During these training sessions, each VCV was presented from 230u, 0u, or +30u degrees in the Still Face, Congruent, and Incongruent conditions. Subjects advanced to the main experiment once they responded Same-Location for a majority of sounds presented at 0u and Different-Location for most sounds presented at 230u or +30u. Prior to the start of each session, the experimenter placed a still image of the speaker's mouth in the center of the screen and looped a single auditory VCV located at 0u. The subject's head was maneuvered until the subject perceived the sound to originate from the Same-Location as the video. Care was taken to align the stimuli in both azimuth and elevation and yielded robust alignments (see Results). Stimulus presentation and response acquisition was coordinated with Neurobehavioral Systems' Presentation Software (www.neurobs.com).

Corrected McGurk Interference Rate
Despite numerous studies employing McGurk interference, there is no clear consensus on how best to quantify McGurk interference rates. Here we report a ''Corrected'' McGurk interference rate that controls for highly confusable auditory signals in the absence of visual information. Specifically, the percentage of responses classified as McGurk Interference (e.g. / ada/ or /other/ responses for B/G audiovisual pairs and /ata/ or /other/ for K/P audiovisual pairs) in the Still Face condition was subtracted from the percentage of these responses in the Incongruent (McGurk) condition. That is, we quantified how much more likely a response category indicative of McGurk interference is when auditory stimuli are paired with visual Incongruent versus Still Face videos. Since VCV identities were generally more confused in the Still Face condition than in the Congruent condition (i.e. errors in reported VCV identity were generally higher in the Still Face condition, see Table 1), this measure resulted in a conservative estimate of McGurk interference and prevented artificially inflated McGurk interference rates due to ambiguous auditory stimuli.

Psychometric Analysis
Psychometric analysis was performed by fitting the %Same-Location vs. Spatial-Disparity (Spatial-Location collapsed across side) for each subject and condition with a sigmoid curve [Y = 1/(1 + exp(-A*(X-B)))] with MATLAB's 'fit' function. Parameter A is indicative of the slope of the function while B is the threshold while X and Y represent the independent (here, spatial disparity (u)) and dependent (% Same-Location) variables, respectively. Parameter estimates were included in separate one-factor ANOVAs to assess any main effect of stimulus Condition [21].

Statistical Methods
All statistical tests were performed in STATISTICA version 8.0 (www.statsoft.com). All reported p-values have been Greenhouse-Geisser corrected for non-sphericity. Post-hoc tests were performed using Fisher's least significant difference (LSD). Unless otherwise noted, results are reported as mean 6 standard error of the mean (SEM).

HRTF validation and head alignment
HRTFs are notoriously difficult to estimate well and are often plagued by acoustic artifacts [20]. As a result, we conducted a validation experiment with ten individuals (five of whom participated in the main experiment; see Methods for details) to demonstrate that our HRTF protocol captures cues important in azimuthal sound. Figure 2 plots reported angles as a function of presented angle.
Responses were remarkably precise both within and between subjects and were well explained by a linear polynomial (y = 1.25*X+0.176, R 2 = 0.99). Thus, our HRTF estimation routine provides a precise (although with slightly overestimated slope by this measure) representation of acoustic space for all subjects.
An additional challenge to using HRTFs in the current study is that slight changes in head position result in audiovisual stimuli falling out of spatial registry. For instance, a slight rotation of the midsaggital plane to the right will result in auditory space shifting to the right while visual stimuli remain anchored in space. To verify that our efforts to align auditory and visual space were robust across subjects (see Methods), we fit the % Same-Location vs. Location function (collapsed across conditions) with a Gaussian function in MATLAB for each subject. The means of these Gaussian distributions were then subjected to a t-test. A mean of 0u would indicate perfect alignment while any non-zero value would indicate a systematic misalignment. Although individual subject peaks occasionally deviated from 0u (maximum of 6u, median 0.135u), there was no systematic relationship in the direction or magnitude of this shift (p = 0.58; mean = 20.0761.21u degrees). Together, these data suggest that our protocol captures cues important in sound localization and ensures robust audiovisual alignment.

Speech Classification and McGurk Interference
On average, subjects were able to correctly identify auditory VCVs on 83.5063.07% of trials in the Still Face condition (see Table 1).
Identification performance improved to 91.5262.34% in the Congruent condition (p,0.001; mean difference [d -] = 8.026 1.74%), suggesting that subjects used the visual information to aid in VCV identification. Ten of the eleven subjects experienced McGurk interference and tended to respond /ada/ for B/G, /ata / for P/K, /other/ (typically /abga/) for G/B, and /other/ (typically /apka/) for K/P audiovisual pairs (see Table 1). The remaining subject did not reliably experience McGurk interference with any of the four McGurk pairs but was still included in the analysis.
The (uncorrected) percentage of McGurk type responses is plotted as a function of condition and audiovisual pair in Figure 3A and Corrected McGurk Interference rates are plotted as a function of location in Figure 3B. Corrected McGurk Interference rates were collapsed across side and included in a two-factor [Spatial-Disparity x VCV] repeated-measure ANOVA. The main effect of VCV was insignificant (p = 0.23; g 2 p = 0.14) and did not interact with Spatial-Disparity (p = 0.22; g 2 p = 0.12). Finally, the main effect of Spatial-Disparity was insignificant (p = 0.54; g 2 p = 0.07; 60.0267.38% for 0u, 59.2367.22% for 6u, 58.1567.53 for 12u, 56.8467.24% for 18u; 56.6867.97% for 24u; 60.8667.39% for 30u; 56.8767.63% for 36u). Together, these data suggest that McGurk interference rates were consistent across audiovisual pairs and, more importantly, that speech cues are integrated independently of audiovisual spatial attributes, as suggested by previous studies (e.g. [17,22,23], but see [24]).

Conflicting Speech Cues Inhibit Spatial Integration
In contrast to the spatial insensitivity of McGurk interference, the ventriloquist's illusion is highly dependent on audiovisual spatial attributes (Figure 4). To determine the contribution of higher-order speech related cues to audiovisual spatial integration, we measured the percentage Same-Location responses with sounds paired with their phonetically Congruent or Incongruent videos. Critically, both Congruent and Incongruent stimuli maintained their temporal registry to within the temporal precision of the video (see Methods for more details). Thus any differences observed between conditions are unlikely due to differential temporal alignment.
The percentage of Same-Location responses is plotted as a function of location in Figure 4. As audiovisual spatial disparity increased in either direction, the likelihood of a Same-Location response decreased monotonically, resulting in a smooth function symmetric about 0u. Despite spatial precision on the order of 3-5u during a spatial pointing task (see Figure 2), audiovisual spatial comparisons tended to be less precise in the Same-or Different-Location task. For instance, in the Still Face condition (light grey squares) subjects responded Same-Location on approximately 25% of trials with an 18u spatial disparity. In contrast, subjects participating in the pointing task never confused a sound presented at 18u with the 0u location. Despite this nuance, subjects were more likely to respond Same-Location in the Congruent condition than in the Still Face condition (p,0.001; 50.9563.74% for Congruent and 37.9562.52% for Still Face; d -= 13.0062.23%).
We interpret this increase in the percentage of Same-Location responses as the ventriloquist's illusion [22].
To determine whether phonetic cues contribute to audiovisual integration in space, we included the percentage of Same-Location responses, collapsed across side, in a three-factor [Spatial-Disparity Congruent, Incongruent . Still Face) were significant while the main effect of VCV fell below significance after Greenhouse-Geisser correction (p,0.05). More importantly, VCV did not interact with any other factor with or without Greenhouse-Geisser correction (p.0.13), suggesting that all speech tokens were spatially equivalent. Instead, only Spatial-Disparity and Condition significantly interacted (p,0.006; g 2 p = 0.34). Post-hoc tests revealed a significantly greater likelihood of responding Same-Location in the Congruent condition relative to the Still Face condition at all non-zero spatial disparities. In contrast, the percentage of Same-Location responses was significantly different between Incongruent and Still Face conditions at all audiovisual disparities except 6u (p = 0.20) and 0u (p = 0.18). Most importantly, there was a significant difference between Congruent and Incongruent conditions at the smallest measured disparity of 6u. Specifically, the percentage of Same-Location responses decreased by 11.0363.94%, virtually abolishing the ventriloquist's illusion (p = 0.016). Thus, conflicting speech related cues significantly attenuated audiovisual integration in space, but only at the smallest measured disparity. There are several mechanisms through which the ventriloquist's illusion can manifest in its own right and be modified by conflicting phonetic cues: by 1) decreasing auditory spatial sensitivity (i.e. a change in the slope of a psychometric function), 2) shifting the dynamic range (i.e. a change in threshold of a psychometric function), or 3) both a change in dynamic range and spatial sensitivity. To further clarify how these potential mechanisms contribute to audiovisual spatial integration, we compared slope and threshold parameter estimates of a logistic curve fit to each subject's %Same-Location vs. Spatial-Disparity function for each condition (see Methods for details

Discussion
The brain uses information from multiple modalities to construct a statistically optimal representation of our world. Particularly important for human communication and the focus of this study is audiovisual integration. Our results provide new evidence about these integrative mechanisms by suggesting that higher order speech cues can guide audiovisual spatial integration, while the converse is not necessarily true. Specifically, although integrative processing of speech related cues operates independently of spatial information (e.g. McGurk interference), conflicting speech cues can affect the likelihood of spatial integration when their lower-level information might erroneously drive integration. These observations may be a direct consequence of asymmetrical information sharing between processing streams in the brain. For instance, existing evidence suggests that both visual and auditory information is processed along dorsal 'where' and ventral 'what' pathways involved in processing object location and identity, respectively [25,26,27]. Our data suggest that information sharing between these pathways may be asymmetric.
McGurk interference, a classic instance of audiovisual integration, served as a key element of our design. In agreement with prior reports [17,22,23,28] (but see [24]), we show that the McGurk illusion depends on conflicting speech-related cues (here, place of articulation) but is utterly independent of spatial disparity. This suggests that processing of speech-cues, likely along the ventral 'what' pathway, does not necessarily have access to auditory or visual spatial information. We would not, however, argue that such object-related integrative processes are strictly automatic or isolated from ongoing cognition and perception. In fact, recent evidence suggests that top-down cognitive factors, such as attention, can affect the likelihood of integrating audiovisual objects. For instance, McGurk interference can break down when subjects are engaged in an attentionally demanding task [29,30]. Together, this evidence shows that while audiovisual integration of high-level cues may benefit from supramodal cognitive resources, it proceeds without regard for spatial representations. Interestingly, as discussed below, the converse is not true: high-level cues can in fact affect spatial integration.
Our results show that speech cues are relevant for audiovisual spatial integration. Specifically, when audiovisual speech with virtually identical spatiotemporal properties is manipulated to create phonetically Congruent and Incongruent (McGurk) audiovisual pairs, visual spatial capture (i.e. the ventriloquist's illusion) is consistently attenuated in the Incongruent condition relative to its Congruent counterpart. Interestingly, this inhibitory effect only occurs when relatively low-level audiovisual spatial information nearly coincides. These data suggest that spatial processing, likely along the dorsal 'where' pathway, has access to higher order information from the ventral 'what' pathway. Although the neural mechanism remains elusive, perhaps the ventral stream encodes an error signal that inhibits spatial integration when object identities fail to match, but is only compelling enough to override highly confusable signals.
Importantly, the data do not suggest that conflicting phonetic cues cause global changes in auditory spatial processing, as a more detailed psychometric analysis of the Congruent and Incongruent conditions revealed no consistent changes in auditory spatial sensitivity (slope of the function) or the dynamic range of our subjects' auditory spatial sensitivity. Instead, this behaviorally consequential phenomenon seems only to prevent signals that are highly confusable according to spatiotemporal proximity, as described originally by Stein and Meredith [12], from being erroneously integrated.
Together our data combined with accumulating evidence from recent studies [29,30] suggest that multisensory integration is not governed solely by low-level spatial and temporal properties of the stimuli, as might be suggested by early neural models of integration [12]. Instead, we argue that the brain's effort to integrate information of common origin is often influenced by additional, abstract stimulus properties, such as speech cues and ongoing cognitive demands, such as attention [29,30]. Future studies might further explore the relationship between these newly appreciated contributors under more realistic conditions. Perhaps, in contrast to the current findings, spatial cues contribute to the integration of speech cues but under more challenging circumstances, for instance with multiple competing talkers. To our knowledge, such competitive audiovisual speech tasks are rare (see [31] for a recent example) and have so far not exploited McGurk interference to dissect the underlying neural mechanisms. This could be due in part to the inherent difficulty in exploring McGurk interference with peripheral visual stimuli, a likely necessity in such experiments. Specifically, McGurk interference is known to diminish rapidly as visual stimuli move from foveal to peripheral locations [28], probably due to spatial smoothing of the stimuli in the visual periphery [32]. Although these experiments may prove difficult, they would undoubtedly provide a deeper understanding of complex, real-world multisensory integration.
In contrast to our findings, a recent report by Colin et al. (2001) found that spatial integration operates independently of phonetic identity [22]. However, this apparent contradiction is easily reconciled when one considers the details of both experiments. Colin et al. recorded the likelihood of responding Same-Location from 680u in 20u increments. In contrast, we recorded responses at a much higher spatial resolution, and, more importantly, at disparities smaller than 20u. In the current report, we also fail to find differences in audiovisual spatial integration between Congruent and Incongruent conditions at disparities larger than 6u and would have arrived at identical conclusions had we not used such high spatial sampling. Once these factors are taken into account, the two independent studies corroborate one another despite reaching qualitatively different conclusions.
In sum, audiovisual integration is vital for day-to-day navigation and communication and is strongly driven by bottom-up spatial and temporal evidence. However, we demonstrate here that some integrative phenomena, such as McGurk interference, operate independently of spatial processing. In contrast, conflicting speech cues can impact audiovisual integration in space. These data suggest that under some circumstances, information is shared asymmetrically between dorsal 'where' and ventral 'what' processing streams. Future studies might explore how other high-level cognitive and perceptual properties affect this balance of influence during audiovisual integration of speech in noisy environments.