The authors have declared that no competing interests exist.
Conceived and designed the experiments: VGS SRS. Performed the experiments: VGS. Analyzed the data: VGS SRS. Contributed reagents/materials/analysis tools: SRS. Wrote the manuscript: VGS SRS.
The perception of emotions is often suggested to be multimodal in nature, and bimodal presentation of emotional stimuli can lead to superior emotion recognition compared to unimodal (auditory or visual) presentation. In previous studies, contrastive aftereffects in emotion perception caused by perceptual adaptation have been shown for faces and for auditory affective vocalizations, when adaptors were of the same modality. By contrast, crossmodal aftereffects in the perception of emotional vocalizations have not yet been demonstrated. In three experiments, we investigated the influence of emotional voice adaptors as well as dynamic facial video adaptors on the perception of emotion-ambiguous voices morphed on an angry-to-happy continuum. Contrastive aftereffects were found for unimodal (voice) adaptation conditions, in that test voices were perceived as happier after adaptation to angry voices, and vice versa. Bimodal (voice + dynamic face) adaptors tended to elicit larger contrastive aftereffects. Importantly, crossmodal (dynamic face) adaptors also elicited substantial aftereffects in male, but not in female, participants. Our results (1) support the idea of contrastive processing of emotions, (2) show for the first time crossmodal adaptation effects under certain conditions, consistent with the idea that emotion processing is multimodal in nature, and (3) suggest gender differences in the sensory integration of facial and vocal emotional stimuli.
The perception of emotional states is crucial for adequate social interaction. Emotions are expressed in the face, but also in the voice (e.g., [
An impressive source of evidence for the perceptual integration of facial movements and speech is the so called McGurk effect [
One of the first studies to report audio-visual integration in emotion perception was by de Gelder and Vroomen [
Here we use perceptual adaptation as a tool to investigate bimodal and crossmodal perception of vocal emotion. In general, adaptation to a certain stimulus quality diminishes the response of specific neurons sensitive to that quality, thus enhancing sensitivity to change, and often eliciting “contrastive” aftereffects in perception. For instance, prolonged viewing of a moving adaptor stimulus elicits a prominent motion aftereffect, such that a static stimulus is perceived as moving in a direction opposite to the adaptor [
Of particular relevance for the present study, Bestelmeyer et al. [
The aim of the present study was to extend recent findings [
In the present study, we therefore co-recorded facial and vocal expressions of emotion, to ensure that visual and auditory representations of the stimuli represented the same underlying events. This allowed us to test the impact of unimodal (auditory), bimodal (audio-visual), and crossmodal (visual only) adaptors on the perception of emotion in the voice. A series of three experiments was conducted which were identical in experimental design, and which only differed in adaptor modality. Note that since “own-gender bias effects” have been previously reported for various aspects of face and voice perception (e.g., [
All three experiments in this paper were carried out in accordance with the Declaration of Helsinki, and were approved by the Ethics Committee of the University of Jena. All listeners gave written informed consent and received a payment of € 5 or course credit.
Twenty-four listeners (12 female) between the ages of 19 and 30 years (
High-quality audio recordings of four male (mAK, mJN, mSB, mUA) and four female (fDK, fEM, fMV, fSM) native German speakers were obtained in a quiet, semianechoic room using a Sennheiser MD 421-II microphone with a pop shield and a Zoom H4n audio interface (16-bit resolution, 44.1 or 48 kHz sampling rate; upsampled to 48 kHz using Adobe Audition due to synchronization issues). All but one speaker (fSM) were amateur actors. Videos were recorded simultaneously. Among a set of utterances, the relevant ones were four consonant-vowel-consonant-vowel (CVCV) syllables: /baka/, /bapa/, /boko/, and /bopo/. After a short general instruction, we recorded emotional utterances in three blocks in a fixed sequence, starting with neutral and followed by angry and happy conditions. Each utterance was auditioned by the session manager and repeated several times by the speaker. For the emotional utterances, the session manager first read a short text describing a situation in which people typically react with hot anger or great pleasure, in order to induce an angry or happy mood. Each utterance was repeated several times until the session manager was satisfied with the facial and vocal emotion expressed. Speakers were encouraged to take breaks at self-determined points in time. Still water was provided.
To select the most convincing emotional utterances, recordings were evaluated by twelve raters (6 female;
Several raters stated via questionnaire that they knew some speakers by sight (mSB,
Selected Stimuli |||| Response |||||||
Speaker | Emotion | CVCV | ACC | ANG | DIS | HAP | SUR | NEU | SAD | FEA | miss
---|---|---|---|---|---|---|---|---|---|---|---
fDK | ANG | baka | 0.75 | 9 (5.78) | 2 (3.50) | 1 (7.00) | |||||
bapa* | 0.92 | 11 (5.55) | 1 (5.00) | ||||||||
boko* | 0.92 | 11 (6.27) | 1 | ||||||||
bopo | 0.92 | 11 (4.18) | 1 (6.00) | ||||||||
HAP | baka | 1.00 | 12 (3.83) | ||||||||
bapa* | 0.92 | 11 (4.09) | 1 (6.00) |
boko* | 0.67 | 8 (4.38) | 4 (5.00) | ||||||||
bopo1 | |||||||||||
NEU | baka | 0.83 | 10 (1.20) | 2 (2.50) | |||||||
bapa* | 1.00 | 12 (1.50) | |||||||||
boko* | 0.50 | 6 (1.33) | 6 (3.67) | ||||||||
bopo | 0.58 | 7 (1.00) | 5 (4.40) | ||||||||
fMV | ANG | baka | 0.92 | 11 (5.91) | 1 (6.00) | ||||||
bapa* | 0.83 | 10 (5.40) | 1 (8.00) | 1 (4.00) | |||||||
boko* | 0.67 | 8 (4.63) | 3 (3.00) | 1 (1.00) | |||||||
bopo | 0.92 | 11 (5.36) | 1 (4.00) | ||||||||
HAP | baka | 0.58 | 2 (4.50) | 7 (4.00) | 3 (5.33) | ||||||
bapa* | 0.50 | 3 (2.67) | 6 (4.50) | 3 (5.67) | |||||||
boko* | 0.42 | 1 (1.00) | 1 (1.00) | 5 (6.20) | 5 (5.00) | ||||||
bopo | 0.58 | 7 (4.43) | 4 (6.00) | 1 | |||||||
NEU | baka | 0.75 | 1 (4.00) | 9 (1.78) | 2 (3.50) | ||||||
bapa* | 0.92 | 11 (1.64) | 1 (2.00) | ||||||||
boko* | 0.75 | 1 (2.00) | 9 (1.22) | 2 (4.50) | |||||||
bopo | 0.67 | 1 (5.00) | 8 (1.63) | 3 (4.33) | |||||||
mAK | ANG | baka | 0.58 | 7 (5.29) | 3 (4.33) | 2 (4.50) | |||||
bapa* | 0.92 | 11 (4.00) | 1 (5.00) | ||||||||
boko* | 0.92 | 11 (4.82) | 1 (4.00) | ||||||||
bopo | 1.00 | 12 (5.25) | |||||||||
HAP | baka | 0.58 | 7 (5.00) | 5 (5.00) | |||||||
bapa* | 0.83 | 1 (5.00) | 10 (4.80) | 1 (8.00) | |||||||
boko* | 0.75 | 2 (3.50) | 9 (4.56) | 1 (6.00) |
bopo | 0.58 | 7 (5.14) | 5 (4.60) |
NEU | baka | 1.00 | 12 (1.42) | ||||||||
bapa* | 1.00 | 12 (1.42) | |||||||||
boko* | 0.92 | 11 (1.64) | 1 (3.00) | ||||||||
bopo | 0.67 | 1 (1.00) | 8 (1.25) | 3 (3.00) | |||||||
mUA | ANG | baka | 0.92 | 11 (4.45) | 1 (6.00) | ||||||
bapa* | 1.00 | 12 (4.33) | 1 (1.00) | ||||||||
boko* | 0.83 | 10 (5.20) | 1 (4.00) | ||||||||
bopo | 0.83 | 11 (4.36) | 1 (5.00) | ||||||||
HAP | baka | 0.50 | 6 (4.83) | 6 (5.33) | |||||||
bapa* | 0.67 | 8 (4.25) | 3 (5.33) | 1 (1.00) | |||||||
boko* | 0.33 | 4 (4.00) | 8 (5.00) | ||||||||
bopo | 0.33 | 1 (5.00) | 1 (3.00) | 4 (5.50) | 6 (5.17) | ||||||
NEU | baka | 0.92 | 11 (1.36) | 1 (2.00) | |||||||
bapa* | 0.83 | 10 (1.40) | 2 (4.00) | ||||||||
boko* | 0.67 | 8 (1.13) | 4 (4.00) | ||||||||
bopo | 0.83 | 10 (1.50) | 2 (4.50) |
The number of responses (12 in total, including any misses) and the mean intensity rating (in parentheses, measured on an 8-point scale from “1 - not intense” to “8 - very intense”) are given for each response category, i.e. angry (ANG), disgust (DIS), happy (HAP), surprised (SUR), neutral (NEU), sad (SAD), fearful (FEA). CVCV syllables /baka/ and /bopo/ were used for test stimulus generation; /bapa/ and /boko/ served as adaptor stimuli (marked with an asterisk). 1) Due to missing recordings, no ratings were available for /bopo/ utterances by female speaker fDK.
Finally, stimuli of four speakers (fDK, fMV, mAK, mUA) were chosen for the adaptation experiments, based on overall voice classification rates. However, female speaker fMV was selected instead of fEM, because
For each utterance (per speaker and emotion), we selected the recording with the highest classification rate among three repetitions. In case of ambiguity, the recording with the highest (or, for neutral utterances, lowest) intensity rating was chosen. The proportion of correct classifications for the finally selected stimuli was satisfactory (
Test voices were emotion-ambiguous resynthesized voices, resulting from an interpolation of angry and happy CVCVs (/baka/ and /bopo/). We used TANDEM-STRAIGHT [
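The core idea behind such emotion morphing can be sketched as a weighted blend of framewise acoustic parameters between the two endpoint recordings. The following is an illustrative simplification only, not the actual TANDEM-STRAIGHT procedure (which also handles time alignment and spectral-envelope interpolation); the F0 values are hypothetical:

```python
def morph_tracks(angry, happy, w_happy):
    """Linearly interpolate framewise acoustic parameters (e.g., F0 in Hz)
    between an angry and a happy utterance. w_happy = 0.0 reproduces the
    angry track, w_happy = 1.0 the happy track; intermediate weights yield
    emotion-ambiguous values. Illustrative sketch only."""
    assert len(angry) == len(happy)  # assumes time-aligned frames
    return [(1.0 - w_happy) * a + w_happy * h for a, h in zip(angry, happy)]

# Hypothetical F0 tracks (Hz) for three time-aligned frames:
f0_angry = [220.0, 240.0, 230.0]
f0_happy = [260.0, 280.0, 300.0]
ambiguous = morph_tracks(f0_angry, f0_happy, 0.5)  # 50/50 morph
```

In the real resynthesis pipeline, this kind of interpolation is applied jointly to several parameter layers before the morphed voice is regenerated.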
Listeners had to classify 40 emotion-ambiguous test voices along eight morph continua (4 identities x 2 CVCVs x 5 ML), which were presented after adaptation to angry, happy, or neutral vocalizations of two different speakers (both either male or female). To minimize low-level adaptation effects, adaptors containing /o/ vowels (/boko/) were combined with test voices containing /a/ vowels (/baka/), and vice versa, i.e. /bapa/ adaptors were combined with /bopo/ test voices. This design produced 240 trials with unique adaptor-test combinations (40 test voices x 2 adaptor speakers x 3 adaptor emotions). Each of the 240 trials was presented twice, resulting in 480 experimental trials. To maximize adaptation effects, these trials were presented in six blocks of 80 trials each, within which adaptor emotion was kept constant. Within each block, trial order was randomized. The order of blocks was counterbalanced across listeners, using a balanced Latin square (e.g., [
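The factorial trial structure described above can be enumerated as follows (a minimal sketch assuming the counts given in the text; identity and speaker labels are placeholders, not the actual stimulus codes):

```python
from itertools import product
import random

identities = ["id1", "id2", "id3", "id4"]      # placeholder test-voice identities
morph_levels = [1, 2, 3, 4, 5]
test_cvcvs = ["baka", "bopo"]
adaptor_for = {"baka": "boko", "bopo": "bapa"}  # vowel-incongruent pairing
adaptor_speakers = ["sp1", "sp2"]               # placeholders, same gender
adaptor_emotions = ["angry", "happy", "neutral"]

# 40 test voices x 2 adaptor speakers x 3 adaptor emotions = 240 unique trials
unique_trials = [
    dict(identity=i, cvcv=c, ml=m, adaptor=adaptor_for[c], speaker=s, emotion=e)
    for i, c, m, s, e in product(identities, test_cvcvs, morph_levels,
                                 adaptor_speakers, adaptor_emotions)
]

# Each unique trial presented twice -> 480 trials, arranged in six blocks
# of 80 with constant adaptor emotion (two blocks per emotion), with trial
# order randomized within each block.
blocks = []
for emotion in adaptor_emotions:
    trials = [t for t in unique_trials if t["emotion"] == emotion] * 2
    random.shuffle(trials)
    blocks.append(trials[:80])
    blocks.append(trials[80:])
```

Block order would then be counterbalanced across listeners via a balanced Latin square, which this sketch omits.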
All instructions were presented in writing on a computer screen, to minimize interference from the experimenter’s voice. After a short practice block consisting of twelve trials with stimuli not used thereafter, listeners had the opportunity to ask questions in case of remaining confusion. Each trial started with a red fixation cross in the center of a black computer screen (500 ms), marking the upcoming adaptor stimulus. The fixation cross remained on the screen while the adaptor stimulus (
The general trial design and timing was equivalent in all three Experiments (Exp. 1 – Exp. 3). Experiments differed in adaptor modality only. Note: The person displayed has provided written informed consent for publication of this image.
We performed analyses of variance (ANOVAs), using epsilon corrections for heterogeneity of covariances [
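For readers unfamiliar with such sphericity corrections, the Greenhouse-Geisser epsilon can be obtained from the covariance matrix of the repeated-measures conditions; corrected degrees of freedom are epsilon times the nominal dfs. The sketch below implements the standard textbook formula and is not the authors' analysis code:

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser epsilon from a k x k covariance matrix S of the
    k repeated-measures conditions. epsilon = 1 under perfect sphericity;
    the lower bound is 1/(k - 1)."""
    k = S.shape[0]
    C = np.eye(k) - np.ones((k, k)) / k  # double-centering matrix
    Sc = C @ S @ C                       # centered covariance
    return np.trace(Sc) ** 2 / ((k - 1) * np.sum(Sc ** 2))

# A spherical (e.g., identity) covariance yields epsilon = 1:
print(gg_epsilon(np.eye(4)))             # -> 1.0
```

In practice, both F-value lookup and the reported dfs use the epsilon-corrected degrees of freedom whenever the sphericity assumption is violated.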
An initial 3 x 2 x 5 x 2 x 2 ANOVA on the proportion of happy responses with the factors adaptor emotion (AEmo), test gender (TG), morph level (ML), and between subject factors listener gender (LG) and adaptor gender (AG), did not reveal any effects or interactions involving LG (all ps ≥ .086). We therefore performed an equivalent ANOVA, but without factor listener gender (for a summary of effects, refer to
Experiment 1 |||||| Experiment 2 |||||
Effect | df1, df2 | F | p | ηp² | ε | Effect | df1, df2 | F | p | ηp² | ε
---|---|---|---|---|---|---|---|---|---|---|---
AEmo | 2,44 | 12.785 | < .001*** | .368 | AEmo | 2,44 | 33.027 | < .001*** | .600 | | | | |
ML | 4,88 | 162.347 | < .001*** | .881 | .561 | ML | 4,88 | 84.197 | < .001*** | .793 | .457 | |||
AEmo*TG | 2,44 | 5.280 | .009** | .194 | AEmo*AG | 2,44 | 4.793 | .013* | .179 | |||||
AEmo*TG*AG | 2,44 | 4.136 | .023* | .158 | AEmo*TG*AG | 2,44 | 8.701 | .001** | .283 | |||||
AEmo*ML*AG | 8,176 | 2.446 | .016* | .100 | ||||||||||
AEmo*TG*ML*AG | 8,176 | 2.115 | .037* | .088 | ||||||||||
AEmo | 2,22 | 8.467 | .002** | .435 | AEmo | 2,22 | 24.422 | < .001*** | .689 | |||||
ML | 4,44 | 96.779 | < .001*** | .898 | .574 | |||||||||
AEmo*TG | 2,22 | 0.784 | .469 | .067 | AEmo*TG | 2,22 | 1.279 | .298 | .104 | |||||
AEmo*ML | 8,88 | 0.575 | .796 | .050 | ||||||||||
AEmo*TG*ML | 8,88 | 1.133 | .349 | .093 | ||||||||||
AEmo | 2,22 | 6.546 | .006** | .373 | AEmo | 2,22 | 9.759 | .001** | .470 | |||||
ML | 4,44 | 68.649 | < .001*** | .862 | .539 | |||||||||
AEmo*TG | 2,22 | 11.023 | < .001*** | .501 | AEmo*TG | 2,22 | 10.293 | .003** | .483 | .740 | ||||
AEmo*ML | 8,88 | 2.548 | .015* | .188 | ||||||||||
AEmo*TG*ML | 8,88 | 2.302 | .027* | .173 | ||||||||||
AEmo | 2,22 | 13.531 | < .001*** | .552 | AEmo | 2,22 | 14.350 | < .001*** | .566 | |||||
ML | 4,44 | 57.170 | < .001*** | .839 | ||||||||||
AEmo*ML | 8,88 | 2.332 | .025* | .175 | ||||||||||
AEmo | 2,22 | 1.085 | .355 | .090 | AEmo | 2,22 | .611 | .552 | .053 | |||||
ML | 4,44 | 23.916 | < .001*** | .685 | .589 | |||||||||
AEmo*ML | 8,88 | 2.523 | .016* | .187 |
Summary of results from the overall ANOVAs on the proportion of “happy”-responses with the factors adaptor emotion (AEmo, 3), test gender (TG, 2), morph level (ML, 5), and between subject factor adaptor gender (AG, 2), as well as a summary of results of post-hoc ANOVAs performed to follow up significant interactions of Experiments 1 and 2. Note: Epsilon corrections (
The prominent main effect of ML,
These main effects were further qualified by several interactions involving AG and TG (refer to
For male adaptors, both main effects of ML and AEmo survived (
Mean proportions of “happy”-responses to morphed test voices in Experiment 1 (unimodal adaptation, A-C) and Experiment 2 (bimodal adaptation, D-F), depending on morph level and adaptor emotion. (A, D) Male adaptation condition, collapsed across test voice gender. (B, E) Female adaptation condition, with female test voices. (C, F) Female adaptation condition, with male test voices.
For female adaptors, while both main effects of ML and AEmo also survived, they were qualified by the three-way interaction AEmo x TG x ML,
Overall, the results of Experiment 1 corroborate and extend recent reports of high-level aftereffects of adaptation to vocal expression [
Twenty-four new listeners (12 female) between the ages of 18 and 35 years (
Test stimuli were the same 40 synthesized test voices as used in Experiment 1 (see Section 1.1.4). Adaptor stimuli were video recordings that had been captured simultaneously to the voice recordings and that were synchronized with the auditory adaptor stimuli of Experiment 1. Videos displayed the same four speakers while articulating /bapa/ and /boko/ in angry, happy, or neutral expression.
Design and procedure were as in Experiment 1 (see Section 1.1.5), with the only difference that adaptors were bimodal videos.
Statistical analyses were performed in analogy to Experiment 1. Errors of omission were excluded (omissions averaged 0.48% of experimental trials).
As an initial ANOVA on the proportion of happy responses with the same factors as in Experiment 1 again did not reveal any effects or interactions involving listener gender (all ps ≥ .062), we performed an equivalent ANOVA without listener gender (for a summary of effects, refer to
For male adaptors, the main effect of AEmo survived (
For female adaptors, the main effect of AEmo also survived, but was qualified by a significant interaction of AEmo x TG (
Taken together, Experiment 2 demonstrated substantial aftereffects of adaptation to bimodal expressive videos on the perception of vocal emotion. The pattern of observed effects comprised fewer interactions, but was generally similar to the effects of adaptation to unimodal voice adaptors in Experiment 1 (
Twenty-four new listeners (12 female) between the ages of 19 and 34 years (
Test stimuli were the same 40 synthesized test voices as used in Experiments 1 and 2 (see Section 1.1.4). Adaptor stimuli were the videos used in Experiment 2, but this time presented without sound, i.e. participants adapted to silent videos of emotionally articulating faces.
Design and procedure were the same as in Experiment 1, with the only difference that adaptors were silently articulating videos (crossmodal adaptation).
Statistical analyses were performed in analogy to Experiments 1 and 2. Errors of omission (in total 1.17% of all experimental trials) were excluded.
The initial ANOVA on the proportion of happy responses, with the same factors as in Experiments 1 and 2, revealed a main effect of ML,
In contrast to Experiments 1 and 2, the main effect of AEmo was only marginally significant,
Analyzed | Effect | df1, df2 | F | p | ηp² | ε
---|---|---|---|---|---|---
Both LGs | AEmo | 2,40 | 2.489 | .096† | .111 | |
ML | 4,80 | 147.740 | < .001*** | .881 | .701 | |
TG x ML | 4,80 | 6.038 | .001** | .232 | .842 | |
AEmo x LG | 2,40 | 3.277 | .048* | .141 | ||
LG = male | AEmo | 2,22 | 5.724 | .010** | .342 | |
ML | 4,44 | 63.646 | <.001*** | .853 | .674 | |
TG x ML | 4,88 | 4.766 | .003** | .302 | ||
LG = female | AEmo | 2,22 | 0.125 | .883 | .011 | |
ML | 4,44 | 87.071 | <.001*** | .888 | .422 | |
TG x ML | 4,88 | 2.376 | .095† | .178 | .683 |
Summary of results from the ANOVAs on the proportion of “happy”-responses with the factors adaptor emotion (AEmo, 3), test gender (TG, 2), morph level (ML, 5), and between subject factors listener gender (LG, 2) and adaptor gender (AG, 2), as well as a summary of results of post-hoc ANOVAs performed per listener gender. Note: Epsilon corrections (
Mean proportions of “happy”-responses to morphed test voices, depending on morph level and adaptor emotion. (A) Male listeners showed crossmodal adaptation effects, whereas (B) female listeners did not.
In contrast to Experiments 1 and 2, Experiment 3 revealed differences between female and male listeners. Specifically, whereas female participants did not show any crossmodal adaptation effect, substantial aftereffects from adaptation to crossmodal silent videos were found in male participants. A more detailed discussion of these findings will be provided in the general discussion.
In order to directly compare aftereffects across the three experiments, we calculated the magnitude of adaptation aftereffects for each experimental condition by subtracting the proportion of happy responses in the happy adaptation condition from the proportion of happy responses in the angry adaptation condition. We then computed a 2 x 5 x 2 x 3 x 2 ANOVA with test gender (TG) and morph level (ML) as within-subject factors, and adaptor gender (AG), adaptor modality (AMod; unimodal, bimodal, and crossmodal, corresponding to Experiments 1, 2, and 3), and listener gender (LG) as between-subjects factors.
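The aftereffect magnitude defined above reduces to a simple difference of response proportions per condition; a minimal sketch with hypothetical cell means:

```python
def adaptation_magnitude(p_happy_after_angry, p_happy_after_happy):
    """Contrastive aftereffect magnitude: proportion of 'happy' responses
    after angry adaptors minus the proportion after happy adaptors.
    Positive values indicate a contrastive aftereffect."""
    return p_happy_after_angry - p_happy_after_happy

# Hypothetical cell means for one condition (e.g., one morph level):
magnitude = adaptation_magnitude(0.62, 0.48)  # ~0.14, i.e. a 14-point aftereffect
```

These per-condition differences then enter the between/within ANOVA described in the text.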
For a summary of effects, refer to
There was, however, a strong trend for an interaction of adaptor modality and listener gender,
Analyzed | Effect | df1, df2 | F | p | ηp² | ε
---|---|---|---|---|---|---
Both LGs | AEmo | 2,40 | 2.489 | .096† | .111 | |
ML | 4,80 | 147.740 | <.001*** | .881 | .701 | |
TG x ML | 4,80 | 6.038 | .001** | .232 | .842 | |
AEmo x LG | 2,40 | 3.277 | .048* | .141 | ||
LG = male | AEmo | 2,22 | 5.724 | .010** | .342 | |
ML | 4,44 | 63.646 | <.001*** | .853 | .674 | |
TG x ML | 4,88 | 4.766 | .003** | .302 | ||
LG = female | AEmo | 2,22 | 0.125 | .883 | .011 |
ML | 4,44 | 87.071 | <.001*** | .888 | .422 | |
TG x ML | 4,88 | 2.376 | .095† | .178 | .683 |
Summary of results from the ANOVAs on the magnitude of the adaptation effect, computed as difference between angry and happy adaptation condition with the factors test gender (TG, 2), morph level (ML, 5), and between subject factors adaptor gender (AG, 2), adaptor modality (AMod, 3), and listener gender (LG, 2). Note: Epsilon corrections (
(A) The interaction of adaptor modality and listener gender shows that female listeners exhibited numerically enhanced bimodal adaptation effects, and no crossmodal adaptation effects; for male listeners, adaptation effects were similar across adaptor modalities. (B) The interaction of test voice gender and morph level reflects larger adaptation effects for more ambiguous morph levels, specifically for female test voices. (C) The interaction of adaptor gender and test voice gender reflects larger adaptation effects for gender-congruent adaptor-test combinations for female adaptors, a pattern that was clear for Experiments 1 and 2, but not for Experiment 3. Note: The magnitude of adaptation effects was calculated by subtracting the percentages of “happy”-responses in the happy adaptation condition from the percentages of “happy”-responses in the angry adaptation condition.
There was also a main effect of TG,
Further results from the ANOVA across experiments are briefly reported for the sake of completeness; these partly confirmed findings from the analyses of the individual experiments and did not interact with adaptor modality. An interaction of ML x TG,
In sum, the magnitude of adaptation effects on the perception of vocal emotion (computed as the difference between angry and happy adaptation conditions) showed a different pattern for female and male listeners. In male listeners, we found a similar adaptation effect of ~10% across adaptor modalities. By contrast, in female listeners, there was no adaptation effect at all in the crossmodal (silent video) adaptation condition, whereas bimodal adaptation tended to elicit larger effects than unimodal adaptation (although the latter difference was not significant, possibly due to limited statistical power as a result of the between-subjects design).
To probe the multimodal nature of emotion perception, we conducted a series of three experiments that assessed the influence of perceptual adaptation to different adaptor modalities on the perception of vocal emotional expressions. We used unimodal voice adaptors (Experiment 1), bimodal face-voice video adaptors (Experiment 2), or the same video adaptors without sound as crossmodal adaptors (Experiment 3).
We demonstrated contrastive aftereffects of adaptation to happy or angry voices, such that test voices morphed on an angry-to-happy continuum were perceived as happier after prior adaptation to angry voices, and vice versa. These results confirm and extend those by Bestelmeyer et al. [
Another observation, although the relevant effect failed to reach statistical significance, was that bimodal adaptors elicited numerically larger aftereffects on vocal emotion perception when compared to unimodal adaptors. At a broad level, such a finding could be in line with the idea that emotional expressions from faces and voices are processed in a multimodal manner [
Finally, irrespective of listener gender, we also obtained some differences related to the gender of adaptor and test stimuli. First, for female test voices, adaptation effects were larger at emotion-ambiguous morph levels, whereas for male test voices adaptation effects were similar across the entire morph continuum (
While the present study has revealed a number of novel and clear findings, several limitations should also be noted. First, because our study involved a limited number of speakers, utterance types, and emotional expressions, it remains to be determined whether our results generalize to other situations. We note, however, that one other study reported similar unimodal vocal emotion aftereffects for angry-to-fearful test voice continua, and with only /a/ vowel utterances [
To conclude, the present series of experiments confirms recent findings of contrastive aftereffects in vocal emotion perception caused by adaptation. Here we provide the first evidence for crossmodal aftereffects in emotion perception, elicited by silent videos showing dynamic facial expressions of equivalent emotional events. Overall, our results provide strong support for the idea that the perception of emotions is multimodal in nature. Moreover, we observed prominent gender differences, which we attribute to crossmodal processing and possibly to bimodal face-voice integration; both aspects warrant further research.
We gratefully acknowledge the contributions by all speakers and listeners. We also thank several student assistants in our research unit (Lisa Blatz, Lisa Büchner, Achim Hötzel, Beatrice Jost, Katrin Lehmann, Julia Lietzke, Sarah Matthiess, Katharina Merhof, Finn Pauls, and Marlene Suhr) for various contributions relating to speaker and listener recruitment, stimulus preparation, and for assisting in data acquisition.