Figure 1.
Models of Functional Coupling between Voice and Face Modules Operating during Unimodal Speaker Recognition (after Associative Learning)
(A) Person identity recognition models [12,16].
(B) Adaptation of (A) with reciprocal interactions between modules.
(C) Adaptation of (B) with direct reciprocal interactions between sensory modules. The plain arrow indicates a bottom-up signal that affords speaker recognition. Dotted arrows indicate predictive signals (black, informing voice module). Hypotheses related to each model are indicated in tables. Hypoth: hypothesis, Assoc: associative learning, Recog: recognition.
Figure 2.
(Left) Two unimodal auditory sessions (Part 1 and Part 3), during which participants recognized either a target voice (tv) or a target ring tone (tr) among different voices or ring tones, were carried out before and after participants learned to associate the auditory stimuli with corresponding videos or written names. The two types of learning (Part 2) were performed by separate groups of participants. One group (n = 14) learned voice-face and ring tone-cell phone associations (sound-video group), while the other group (n = 15) learned voice-name and ring tone-brand name associations (sound-name group). At the end of the protocol, all participants took part in a face area localizer experiment, in which they passively viewed pictures of different faces and objects presented in blocks.
(Right) Detail of the learning protocol. Participants were asked to match a voice or a ring tone with a picture or a name and responded by key press. Feedback on the correct association (videos in the sound-video group; the sound together with the name in the sound-name group) was given immediately after each trial, so that participants progressively learned the correct associations.
Figure 3.
fMRI Activity in Response to Voice Recognition
The surface rendering shows responses during voice recognition compared with ring tone recognition (n = 29). Purple: after both voice-face and voice-name association learning (p < 0.001, uncorrected); red: after voice-face more than after voice-name learning (p < 0.001, uncorrected). A coronal section through the FFA shows the overlap of the crossmodal activation by voices (red) with the responses to faces presented visually during the face localizer experiment (yellow). Plots of signal change in response to voice recognition contrasted with ring tone recognition are displayed for each responsive brain region. Error bars represent the 95% confidence interval of the mean.
Figure 4.
The Impact of Voice-Face Associative Learning on Functional Connectivity
All areas shown in red in Figure 3 were included in functional connectivity analyses assessed by means of PPI. In addition, the TVA was included as the entry point into the voice recognition network. PPIs probed changes in functional connectivity across regions during voice recognition resulting from learning of voice-face associations. All five regions served both as probes and as targets, and the results are presented in double-entry tables. The colours in the boxes indicate the level of statistical significance associated with the hypothesis of enhanced connectivity: dark grey for p < 0.001 (uncorrected), light grey for p < 0.01, white for nonsignificant, and dark for autocorrelation. Numbers indicate the coordinates (x, y, z in MNI template) of the voxel where maximal correlation was found. All 14 participants who had learned faces in response to voices were included in the analysis.
(A) Before voice-face association > after.
(B) After voice-face association > before. Figures below tables illustrate the impact of learning on functional connectivity. Enhanced connectivity is represented as dark grey links.
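The idea behind a PPI test can be illustrated with a minimal sketch: it asks whether the regression slope linking a probe (seed) region to a target region changes with the psychological context. The example below is a simplified stand-in and not the analysis pipeline used for this figure; the function names `slope` and `coupling_change`, the binary `context` coding, and the time courses are all hypothetical, and the GLM-based interaction regressor and haemodynamic modelling used in standard PPI implementations are omitted.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y regressed on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def coupling_change(seed, target, context):
    """Seed-to-target slope in context 1 minus the slope in context 0.

    A positive value means the target follows the seed more closely
    in context 1 (e.g. after learning) than in context 0 (before).
    """
    on = [(s, t) for s, t, c in zip(seed, target, context) if c == 1]
    off = [(s, t) for s, t, c in zip(seed, target, context) if c == 0]
    slope_on = slope([s for s, _ in on], [t for _, t in on])
    slope_off = slope([s for s, _ in off], [t for _, t in off])
    return slope_on - slope_off


# Made-up time courses for illustration only (not data from the study).
rng = random.Random(0)
context = [(i // 20) % 2 for i in range(200)]          # alternating blocks of 20 scans
seed = [rng.gauss(0, 1) for _ in range(200)]           # hypothetical probe region signal
target = [(0.2 + 0.6 * c) * s + rng.gauss(0, 0.1)      # coupling is stronger when c == 1
          for s, c in zip(seed, context)]

print(coupling_change(seed, target, context))
```

With these simulated signals the printed value should lie near the built-in coupling increase of 0.6, since the target was constructed to follow the seed more strongly in context 1.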
Figure 5.
Recognition Scores for both Groups for Voice and Ring Tone Recognition before and after Learning
Repeated-measures ANOVA revealed a significant crossmodal learning effect in both groups for voice recognition (F[1,27] = 28, p < 0.0001) and a condition (voice recognition before, voice recognition after learning) by group (sound-video, sound-name group) interaction (F[1,27] = 6, p < 0.018), reflecting a larger learning effect for voices in the face group than in the name group. For ring tone recognition, there was no corresponding condition (ring tone recognition before, ring tone recognition after learning) by group (sound-video, sound-name group) interaction (F[1,27] = 0.4, p < 0.6). There was a significant effect of stimulus type before (F[1,27] = 16, p < 0.0001) and after learning (F[1,27] = 33, p < 0.0001), indicating that voice recognition was overall more difficult than ring tone recognition (post-hoc paired t-tests: before learning t = 5.8, p < 0.0001; after learning t = 3.3, p < 0.003). All p-values are two-tailed. Error bars represent the 95% confidence interval of the mean.
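For readers who want to see where the 27 error degrees of freedom come from, the condition-by-group interaction in a 2 x 2 mixed design (two groups, two sessions) can be sketched as a pooled-variance t-test on per-participant difference scores, with F = t² and error df = n1 + n2 - 2. With the group sizes reported for this experiment (n = 14 and n = 15), this gives F[1,27]. The function name `interaction_f` and all scores below are hypothetical and for illustration only; they are not the study's data or analysis code.

```python
import random
from statistics import mean, variance


def interaction_f(before_a, after_a, before_b, after_b):
    """Group x condition interaction F for a 2 x 2 mixed design.

    Computed as the squared pooled-variance t statistic on the
    per-participant (after - before) difference scores.
    Returns (F, error degrees of freedom).
    """
    diff_a = [post - pre for pre, post in zip(before_a, after_a)]
    diff_b = [post - pre for pre, post in zip(before_b, after_b)]
    n_a, n_b = len(diff_a), len(diff_b)
    df_error = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(diff_a)
                  + (n_b - 1) * variance(diff_b)) / df_error
    t = (mean(diff_a) - mean(diff_b)) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    return t * t, df_error


# Made-up recognition scores (percent correct); NOT the study's data.
rng = random.Random(1)
before_video = [rng.uniform(50, 70) for _ in range(14)]   # sound-video group, n = 14
after_video = [b + rng.uniform(5, 25) for b in before_video]
before_name = [rng.uniform(50, 70) for _ in range(15)]    # sound-name group, n = 15
after_name = [b + rng.uniform(0, 15) for b in before_name]

f_value, df_error = interaction_f(before_video, after_video, before_name, after_name)
print(f"F[1,{df_error}] = {f_value:.2f}")
```

The equivalence holds only for this two-group, two-session layout; designs with more levels need a full ANOVA.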
Figure 6.
Behavioural Results and fMRI Activity for the Learning Part of the Experiment
(A, B) Behavioural measures corresponding to learning are displayed in plots for response correctness (A) and for response time (B). There was no difference between groups regarding correctness. The within-group difference in correctness between voice and ring tone matching was significant in the voice-name group only (paired t-test, p < 0.03). Voice-name matching yielded longer response times than voice-face matching, as revealed by a condition (voice, ring tone) by group (sound-video, sound-name) interaction (ANOVA, F[1,27] = 6, p < 0.01). The difference in response time to voices and ring tones was significant in the voice-name group only (paired t-test, t = 3.3, p < 0.009). All p-values are two-tailed. Error bars represent the 95% confidence interval of the mean.
(C, D) Brain regions involved during voice-face and voice-name learning were analyzed separately (event-related) from the sessions involving auditory recognition. Activity in the anterior temporal cortex, which is classically involved in multimodal person recognition, was observed during both voice-face and voice-name matching (relative to ring tone-cell phone and ring tone-name matching) when compared with their respective control tasks, in which the associated stimuli (faces and names, cell phones and brand names) were simply categorized instead of matched with the preceding sounds (C). Activity in the same region of the anterior temporal cortex correlated parametrically with response speed in the group performing a voice-face association (n = 14; p < 0.001, uncorrected) (D) and in the group performing a voice-name association (n = 14; p < 0.01, uncorrected). When both groups were analyzed together, parametric modulation with response speed was significant (n = 28; p < 0.001, uncorrected).