Adaptation Aftereffects in Vocal Emotion Perception Elicited by Expressive Faces and Voices

The perception of emotions is often suggested to be multimodal in nature, and bimodal as compared to unimodal (auditory or visual) presentation of emotional stimuli can lead to superior emotion recognition. In previous studies, contrastive aftereffects in emotion perception caused by perceptual adaptation have been shown for faces and for auditory affective vocalizations, when adaptors were of the same modality. By contrast, crossmodal aftereffects in the perception of emotional vocalizations have not yet been demonstrated. In three experiments, we investigated the influence of emotional voice adaptors as well as dynamic facial video adaptors on the perception of emotion-ambiguous voices morphed on an angry-to-happy continuum. Contrastive aftereffects were found for unimodal (voice) adaptation conditions, in that test voices were perceived as happier after adaptation to angry voices, and vice versa. Bimodal (voice + dynamic face) adaptors tended to elicit larger contrastive aftereffects. Importantly, crossmodal (dynamic face) adaptors also elicited substantial aftereffects in male, but not in female, participants. Our results (1) support the idea of contrastive processing of emotions, (2) show for the first time crossmodal adaptation effects under certain conditions, consistent with the idea that emotion processing is multimodal in nature, and (3) suggest gender differences in the sensory integration of facial and vocal emotional stimuli.


Introduction
The perception of emotional states is crucial for adequate social interaction. Emotions are expressed in the face, but also in the voice (e.g., [1]), in gesture (e.g., [2]), and in body movement (e.g., [2][3][4]). Although the majority of empirical studies have investigated emotion perception in one modality only, many researchers now think that emotions are perceived in a multimodal manner [5]. Evidence supporting this idea includes reports on brain-damaged patients who showed comparable impairments in processing specific emotions from faces and voices (e.g., [6,7], but see also [8]).
An impressive source of evidence for the perceptual integration of facial movements and speech is the so-called McGurk effect [9], which shows that simultaneous presentation of an auditory vocalization with non-matching facial speech can alter the perceived utterance (e.g., the presentation of an auditory /baba/ with a face simultaneously articulating /gaga/ typically leads to a "fused" percept of /dada/). A possible neurophysiological correlate of this effect has been described in studies showing activation of auditory cortex when participants watched silent facial speech, in the absence of any auditory stimulus [10]. However, crossmodal processing is much less well investigated for paralinguistic social signals, including person identity and emotional expression (for a recent overview, see [11]).
One of the first studies to report audio-visual integration in emotion perception was by de Gelder and Vroomen [12], who showed that the presentation of sad or happy voices with an emotion-ambiguous test face biased perceived facial emotion in the direction of the simultaneously presented tone of voice, even when the voices were to be ignored. Similar findings of facilitated processing of emotion-congruent audio-visual emotional signals were reported by others [13,14]. More recently, evidence from magnetoencephalography has suggested that the posterior superior temporal sulcus area may be involved in the early perceptual integration of facial and vocal emotion (e.g., [15], but see also [16], for a relevant neuroimaging study). As a limitation, most of these studies used static expressive faces, even though facial motion is known to support emotion recognition (e.g., [17]). Moreover, audio-visual integration typically benefits from temporal synchrony of visual and auditory stimuli, which may be important for attributing stimuli from both modalities to the same underlying event [10]. Evidence from automatic pattern recognition also suggests superior performance when visual and auditory information is integrated at an early featural level [18].
Here we use perceptual adaptation as a tool to investigate bimodal and crossmodal perception of vocal emotion. In general, adaptation to a certain stimulus quality diminishes the response of specific neurons sensitive to that quality, thus enhancing sensitivity to change and often eliciting "contrastive" aftereffects in perception. For instance, prolonged viewing of a moving adaptor stimulus elicits a prominent motion aftereffect, such that a static stimulus is perceived as moving in the direction opposite to the adaptor [19]. More recently, contrastive adaptation aftereffects have been reported not only for low-level stimulus qualities, but also for complex visual stimuli such as faces, including facial identity [20], face gender [21], facial age [22], and facial expression [21]. In the auditory domain, similar contrastive adaptation aftereffects have been reported for the perception of voice gender [23,24], vocal age [25], voice identity [26,27], and vocal affect [28].
Of particular relevance for the present study, Bestelmeyer et al. [28] presented the first report of auditory adaptation in vocal affect perception. In that study, adaptation to angry vocalizations (single /a/-vowels) caused emotion-ambiguous voices (morphed on an angry-to-fearful continuum) to be perceived as more fearful, and vice versa. A second experiment found equivalent aftereffects for natural and caricatured adaptor voices, which was interpreted to indicate that the aftereffects are not exclusively due to low-level adaptation, but rather may depend on higher-level perception of the affective category of the adaptor. While Bestelmeyer et al. [28] studied unimodal voice adaptation only, Fox and Barton [29] investigated the influence of different emotional adaptor types (faces, visual non-faces, words, and sounds) on facial emotion categorization, using angry-to-fearful facial expression morphs as static test faces. Importantly, while strong and significant aftereffects were elicited by emotional faces, auditory adaptation to emotional sounds did not elicit significant aftereffects. It may be noteworthy that, compared to same-person combinations of adaptor and test faces, adaptor faces from different individuals caused somewhat smaller (though still significant) aftereffects on emotion perception. This could suggest a degree of identity-specific representation of facial expressions.
The aim of the present study was to extend recent findings [28] of contrastive aftereffects in the perception of vocal affect. Importantly, we compared a condition of unimodal auditory (voice) adaptation with two conditions that have not been studied before. Specifically, we investigated the degree to which bimodal (face-voice) and crossmodal (face) adaptation conditions would also cause aftereffects on the perception of emotion in test voices. A study by Collignon et al. [14] showed audio-visual integration in emotional processing, as fear and disgust categorization was faster and more accurate for bimodal as compared to unimodal (auditory or visual) stimulus presentation. Accordingly, we expected bimodal adaptation to elicit larger adaptation effects when compared to a standard unimodal adaptation condition. In addition, although crossmodal aftereffects of voice-face adaptation have been found to be absent in a study that investigated the perception of facial expressions [29], we considered the possibility that crossmodal face-voice aftereffects might be demonstrated under more favorable conditions, in which both visual and auditory stimuli exhibit a high degree of temporal congruence and represent the same underlying dynamic event. Such conditions should contribute to efficient multisensory processing [10].
In the present study, we therefore co-recorded facial and vocal expressions of emotion, to ensure that visual and auditory representations of the stimuli represented the same underlying events. This allowed us to test the impact of unimodal (auditory), bimodal (audio-visual), and crossmodal (visual only) adaptors on the perception of emotion in the voice. A series of three experiments was conducted which were identical in experimental design, and which only differed in adaptor modality. Note that since "own-gender bias effects" have been previously reported for various aspects of face and voice perception (e.g., [30][31][32]), we analyzed gender effects at the level of both listeners and experimental stimuli.

Experiment 1 -Unimodal Adaptation
Method
Ethics Statement. All three experiments in this paper were carried out in accordance with the Declaration of Helsinki and were approved by the Ethics Committee of the University of Jena. All listeners gave written informed consent and received a payment of €5 or course credit.
Listeners. Twenty-four listeners (12 female) between the ages of 19 and 30 years (M = 22.4, SD = 2.7) contributed data. None reported hearing disorders. The data of two additional listeners were excluded due to hardware problems.
Recording Procedure and Speaker Selection. High-quality audio recordings of four male (mAK, mJN, mSB, mUA) and four female (fDK, fEM, fMV, fSM) native German speakers were obtained in a quiet and semi-anechoic room, using a Sennheiser MD 421-II microphone with pop protection and a Zoom H4n audio interface (16-bit resolution, 44.1 or 48 kHz sampling rate; recordings were upsampled to 48 kHz in Adobe Audition due to synchronization issues). All but one speaker (fSM) were amateur actors. Videos were recorded simultaneously. Among a set of utterances, the relevant ones were the four consonant-vowel-consonant-vowel (CVCV) syllables /baka/, /bapa/, /boko/, and /bopo/. After a short general instruction, we recorded emotional utterances in three blocks in a fixed sequence, starting with the neutral and followed by the angry and happy conditions. For the emotional utterances, the session manager first read a short text describing a situation in which people typically react with hot anger or great pleasure, in order to induce an angry or happy mood. Each utterance was auditioned by the session manager and repeated several times by the speaker, until the session manager was satisfied with the facial and vocal emotion expressed. Speakers were encouraged to take breaks at self-determined points in time. Still water was provided.
To select the most convincing emotional utterances, recordings were evaluated by twelve raters (6 female; M = 22.7 years, SD = 2.2). A total of 282 voice recordings (8 speakers x 4 CVCVs x 3 emotional conditions x 3 repetitions, minus 6: due to an error in the recording procedure, we did not record /boko/ of male speaker mSB in neutral expression and /bopo/ of female speaker fDK in happy expression) were presented in randomized order. Listeners performed a 7-alternative forced-choice (7-AFC) task with response options for neutral and six basic emotions (angry, happy, sad, disgust, fearful, surprised), followed by a rating of the perceived intensity of the same stimulus on an 8-point scale from '1 - gar nicht intensiv (neutral)' to '8 - sehr intensiv' ['1 - not intense at all (neutral)' to '8 - very intense']. For the original classification data of the emotional stimuli of all eight speakers, please refer to Table S1.
Several raters stated via questionnaire that they knew some speakers by sight (mSB, N = 6; fSM, N = 8; fDK, N = 1; fEM, N = 1). To avoid interference from familiarity in the perception of emotions, we therefore excluded speakers fSM and mSB. Ratings of the voices of the remaining six speakers were generally comparable. Overall, angry stimuli received the highest correct classification rates (77%), followed by neutral (67%) and happy (44%) stimuli. Note that some misclassifications likely occurred as a consequence of the experimental design: since listeners were explicitly given seven response options, they presumably expected disgusted, surprised, and sad stimuli to appear among the utterances. In fact, Table 1 suggests a clear pattern in which misclassified happy utterances tended to be perceived as surprised, and misclassified neutral utterances tended to be perceived as sad.
Finally, stimuli of four speakers (fDK, fMV, mAK, mUA) were chosen for the adaptation experiments, based primarily on overall voice classification rates. However, female speaker fMV was selected instead of fEM, because the facial emotional expression of fEM was judged by the authors and five additional raters to be poor. Stimuli of speakers fEM and mJN were used for practice trials.

Stimuli
Preparation. For each utterance (per speaker and emotion), we selected the recording with the highest classification rate among the three repetitions. In case of ambiguity, the recording with the highest (or, for neutral utterances, lowest) intensity rating was chosen. The proportion of correct classifications for the finally selected stimuli was satisfactory (M = .767, SEM = .020). Male and female listeners did not differ in their judgments of the voices (ps ≥ .109), and there were no differences between stimuli used for adaptor and test voices (ps ≥ .191). A 3 x 2 ANOVA with the factors emotion category and speaker gender revealed a main effect of emotion, F(2,22) = 8.917, p = .001, ηp² = .448, with correct classification rates of .874 ± .029, .625 ± .055, and .802 ± .035 for angry, happy, and neutral stimuli, respectively. There was also a two-way interaction of emotion x speaker gender, F(2,22) = 4.823, p = .018, ηp² = .305. No significant differences between speaker genders were observed for either angry or neutral stimuli, Ts(11) ≤ 1.890, ps ≥ .085. A small difference for stimuli of the happy category, T(11) = 2.286, p = .043 (Ms = .676 ± .061 and .573 ± .058 for female and male speakers, respectively), reflected the fact that male stimuli were slightly more often categorized as "surprised" (see Table 1), a relatively common misclassification that might relate both to the design of the rating and to the fact that no surprised voices were presented. Differences between speaker genders disappeared, T(11) = 0.965, p = .406, when happy and surprised responses were combined into one category. Overall, stimuli of different categories were highly discriminable, with almost no overlap of angry, happy, and neutral classifications (see Table 1), and with only small speaker gender differences for happy voices. Classification rates and intensity ratings per stimulus and response category can be found in Table 1.
A /bopo/ of fDK in happy intonation, missing from the original recording, was generated by replacing the second /b/ plosive (i.e., closure duration and plosive release) of a happy /bobo/ recording with the /p/ plosive of a happy /bapa/. The second author and five additional raters could not perceive any modification or peculiarity in the resulting stimulus. Each utterance was saved in a single file (.wav, 48 kHz, mono) and intensity was scaled to 70 dB RMS using Praat [33]. A silent phase of 50 ms was added at the beginning and end of the stimuli used to morph test voices. Adaptors in Experiment 1 were voice recordings of /bapa/ and /boko/ in neutral, angry, and happy vocal expression. We added silent phases of 12 video frames (~480 ms) both before voice onset and after voice offset. This was done to keep the timing of unimodal adaptors comparable to that of the bimodal and crossmodal adaptors (used in Experiments 2 and 3, respectively).
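The intensity normalization step can be illustrated with a short sketch. This is a minimal reimplementation rather than the Praat code itself; it assumes Praat's convention of expressing RMS level in dB relative to the auditory reference pressure of 2 x 10⁻⁵ Pa.

```python
import numpy as np

def scale_intensity(samples, target_db=70.0, ref=2e-5):
    """Scale a waveform so that its RMS level equals target_db
    (dB re 2e-5 Pa), analogous to Praat's intensity scaling."""
    rms = np.sqrt(np.mean(samples ** 2))
    target_rms = ref * 10 ** (target_db / 20.0)
    return samples * (target_rms / rms)
```

Applied to each utterance file, this equates loudness across speakers and emotions before morphing.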
Voice Morphing. Test voices were emotion-ambiguous resynthesized voices, resulting from an interpolation of angry and happy CVCVs (/baka/ and /bopo/). We used morphing based on TANDEM-STRAIGHT [34] to create test voices with increasing "happy" proportions along the angry-to-happy morph continuum. A test voice of morph level x (MLx) refers to an interpolation between x% of the happy and (100−x)% of the angry voice recording, with x ∈ {20, 35, 50, 65, 80}. We generated 40 test voices along eight morph continua (4 speakers x 2 CVCVs x 5 ML). Morphing requires the manual mapping of corresponding time and frequency anchors in the spectrograms; for a more detailed description, please refer to Kawahara et al. [35]. In short, we set time anchors at key features of the utterances (i.e., onset and offset; initial burst of the consonants; beginning, middle, and end of formant transitions; stable phase of the vowels). We mapped the time anchors in Praat, which allows convenient inspection of waveform and spectrogram, and then transferred them to TANDEM-STRAIGHT. At the time anchor positions, frequency anchors were then assigned at the center frequencies of three to four formants, where detectable.
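The morph-level arithmetic can be made concrete with a small sketch. TANDEM-STRAIGHT interpolates several parameter tracks (e.g., F0, spectral envelope, aperiodicity) after time-frequency alignment; the sketch below shows only the basic weighting scheme for a morph level x, with the parameter arrays serving as hypothetical placeholders for aligned tracks.

```python
import numpy as np

def morph_weights(morph_level):
    """MLx mixes x% of the happy recording with (100 - x)% of the angry one.
    Returns (angry_weight, happy_weight)."""
    w_happy = morph_level / 100.0
    return 1.0 - w_happy, w_happy

def interpolate_params(angry_track, happy_track, morph_level):
    """Linear interpolation of two time-aligned parameter tracks
    (illustrative only; real STRAIGHT morphing operates on several
    aligned representations at once)."""
    w_angry, w_happy = morph_weights(morph_level)
    return w_angry * np.asarray(angry_track) + w_happy * np.asarray(happy_track)
```

For example, ML65 weights the happy recording's parameters by 0.65 and the angry recording's by 0.35.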

Design and Procedure
Listeners had to classify 40 emotion-ambiguous test voices along eight morph continua (4 identities x 2 CVCVs x 5 ML), which were presented after adaptation to angry, happy, or neutral vocalizations of two different speakers (both either male or female). To minimize low-level adaptation effects, adaptors containing /o/-vowels (/boko/) were combined with test voices containing /a/-vowels (/baka/), and vice versa.
Note to Table 1: The number of responses (in total 12, including any misses) and the mean intensity rating (in parentheses; 8-point scale from "1 - not intense" to "8 - very intense") are given for each response category, i.e., angry (ANG), disgust (DIS), happy (HAP), surprised (SUR), neutral (NEU), sad (SAD), and fearful (FEA). CVCV syllables /baka/ and /bopo/ were used for test stimulus generation; /bapa/ and /boko/ served as adaptor stimuli (marked with an asterisk).
To maximize adaptation effects, trials were presented in six blocks of 80 trials each, within which the adaptor emotion was kept constant. Within each block, trial order was randomized; the order of blocks was counterbalanced across listeners using a balanced Latin square (e.g., [36]). To summarize, a 3 (adaptor emotion, AEmo) x 2 (test gender, TG) x 5 (morph level, ML) x 2 (adaptor gender, AG) x 2 (listener gender, LG) design was used, with both AG and LG as between-subjects factors. All instructions were presented in writing on a computer screen, to minimize interference from the experimenter's voice. After a short practice block of twelve trials with stimuli not used thereafter, listeners had the opportunity to ask questions in case of remaining confusion. Each trial started with a red fixation cross in the center of a black computer screen (500 ms), marking the upcoming adaptor stimulus. The fixation cross remained on the screen while the adaptor stimulus (M = 5010 ms, SD = 284; consisting of three identical adaptor voices, each with pre- and post-adaptor silence periods of ~480 ms, see Section 2.1.3) was presented.
Subsequently, a green fixation cross replaced the red one for 500 ms, marking the upcoming test voice (M = 796 ms ± 37). Listeners were instructed to listen attentively to the adaptor, and to perform an angry-happy 2-AFC classification of the test voice by pressing the response keys "k" or "s", respectively, on a standard German computer keyboard. After test voice offset, the green fixation cross was replaced by a green question mark, and responses were recorded for 2000 ms from stimulus offset. If no response was entered (error of omission), a 500 ms screen prompted a faster response ("Bitte reagieren Sie schneller" ["Please respond faster"]); otherwise, a black screen was shown. Each block consisted of 80 randomized trials (2 adaptor voice identities x 2 TG x 2 test voice identities x 2 CVCVs x 5 ML). Individual breaks were allowed after blocks of 40 trials (see Figure 1 for the general trial design).
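The balanced Latin square used to counterbalance block order can be generated with a Williams design, in which each condition precedes every other condition equally often. The sketch below (valid for an even number of conditions, as with the six blocks here) is a generic illustration, not the script actually used in the experiment.

```python
def balanced_latin_square(n):
    """Williams design for an even number n of conditions:
    returns n orderings in which each condition immediately
    precedes every other condition exactly once."""
    if n % 2:
        raise ValueError("this simple construction requires an even n")
    # first row: 0, 1, n-1, 2, n-2, 3, ...
    first = [0]
    k, toggle = 1, True
    while len(first) < n:
        if toggle:
            first.append(k)
        else:
            first.append(n - k)
            k += 1
        toggle = not toggle
    # each subsequent row shifts every condition index by one
    return [[(c + i) % n for c in first] for i in range(n)]
```

For six blocks, six listeners suffice to cycle once through all counterbalanced orders.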

Statistical Analysis
We performed analyses of variance (ANOVAs), using epsilon corrections for heterogeneity of covariances [37] where appropriate. Errors of omission (no key press; 1.05% of all experimental trials) were excluded from the analyses.

Results and Discussion
An initial 3 x 2 x 5 x 2 x 2 ANOVA on the proportion of happy responses, with the factors adaptor emotion (AEmo), test gender (TG), and morph level (ML), and the between-subjects factors listener gender (LG) and adaptor gender (AG), did not reveal any effects or interactions involving LG (all ps ≥ .086). We therefore performed an equivalent ANOVA without the factor listener gender (for a summary of effects, refer to Table 2).
Overall, the results of Experiment 1 corroborate and extend recent reports of high-level aftereffects of adaptation to vocal expression [28]. We found that emotion-ambiguous voices (on an angry-to-happy continuum) were perceived as happier after adaptation to angry voices, and as angrier after adaptation to happy voices. Although these effects were independent of listener gender, more subtle modulations of the adaptation aftereffects were caused by adaptor voice gender and other experimental variables. A more detailed discussion of these findings will be provided in the general discussion.

Experiment 2 - Bimodal Adaptation

2.1.2: Stimuli.
Adaptor stimuli were video recordings that had been captured simultaneously with the voice recordings and that were synchronized with the auditory adaptor stimuli of Experiment 1. Videos displayed the same four speakers articulating /bapa/ and /boko/ in angry, happy, or neutral expression.

2.1.3: Design and Procedure.
Design and procedure were as in Experiment 1 (see Section 1.1.5), with the only difference that adaptors were bimodal videos.

2.1.4: Statistical Analysis.
Statistical analyses were performed in analogy to Experiment 1. Errors of omission (0.48% of all experimental trials) were excluded.

2.2: Results and Discussion
As an initial ANOVA on the proportion of happy responses, with the same factors as in Experiment 1, again did not reveal any effects or interactions involving listener gender (all ps ≥ .062), we performed an equivalent ANOVA without listener gender (for a summary of effects, refer to Table 2). Unsurprisingly, a strong main effect of ML, F(4,88) = 84.197, p < .001, εHF = .457, ηp² = .793, with a prominent linear trend, F(1,22) = 111.687, p < .001, ηp² = .835, was again found. In addition, the ANOVA revealed a prominent main effect of AEmo, F(2,44) = 33.027, p < .001, ηp² = .600, reflecting a contrastive pattern of aftereffects for the angry, neutral, and happy adaptation conditions (Ms = .577 ± .025, .511 ± .028, and .440 ± .021, respectively), which was further qualified by a two-way interaction with adaptor gender, and by an additional three-way interaction involving both adaptor and test gender (Table 2). To investigate the nature of this three-way interaction, we analyzed data separately for each adaptor gender, by means of two separate 3 x 2 ANOVAs with the factors AEmo x TG. For male adaptors, the main effect of AEmo survived (Figure 2D), whereas the interaction AEmo x TG was not significant (ps ≥ .298). With respect to the main effect of AEmo, all pairwise comparisons between the means of the angry, neutral, and happy adaptation conditions (Ms = .603 ± .043, .533 ± .048, and .419 ± .033, respectively) were significant, |Ts(11)| ≥ 2.511, ps ≤ .029, with the largest difference between angry and happy adaptation, T(11) = 8.115, p < .001.
For female adaptors, the main effect of AEmo also survived, but was qualified by a significant interaction of AEmo x TG (Table 2). For female test voices alone, the effect of AEmo was significant; pairwise comparisons between the means of the angry, neutral, and happy adaptation conditions (Ms = .570 ± .044, .443 ± .050, and .409 ± .037, respectively) were significant for angry compared to both neutral and happy adaptation, Ts(11) ≥ 4.831, ps ≤ .001, but not for neutral compared to happy adaptation, T(11) = 0.931, p = .372. By contrast, no significant effect of AEmo was observed for male test voices (Figures 2E and 2F, respectively).
Taken together, Experiment 2 demonstrated substantial aftereffects of adaptation to bimodal expressive videos on the perception of vocal emotion. The pattern of observed effects comprised fewer interactions, but was generally similar to the effects of adaptation to unimodal voice adaptors in Experiment 1 (Figures 2A-2C). A visual inspection of the results also suggests that bimodal adaptors were somewhat more efficient than unimodal adaptors in causing aftereffects in vocal emotion perception. The effects of bimodal adaptation were again not significantly modulated by listener gender, but were modulated by adaptor and test voice gender. A more detailed discussion of these findings will be provided in the general discussion.

Experiment 3 - Crossmodal Adaptation

3.1.2: Stimuli.
Adaptor stimuli were the videos used in Experiment 2, but this time presented without sound, i.e., participants adapted to videos of silently articulating emotional faces.

3.1.3: Design and Procedure.
Design and procedure were the same as in Experiment 1, with the only difference that adaptors were silently articulating videos (crossmodal adaptation).

3.1.4: Statistical Analysis.
Statistical analyses were performed in analogy to Experiments 1 and 2. Errors of omission (in total 1.17% of all experimental trials) were excluded.
3.2: Results and Discussion
In contrast to Experiments 1 and 2, Experiment 3 revealed differences between female and male listeners. Specifically, whereas female participants did not show any crossmodal adaptation effect, substantial aftereffects from adaptation to crossmodal silent videos were found in male participants. A more detailed discussion of these findings will be provided in the general discussion.

4.1: Statistical Analysis
In order to directly compare aftereffects across the three experiments, we calculated the magnitude of the adaptation aftereffect for each experimental condition by subtracting the proportion of happy responses in the happy adaptation condition from the proportion of happy responses in the angry adaptation condition. We then computed a 2 x 5 x 2 x 3 x 2 ANOVA with test gender (TG) and morph level (ML) as within-subjects factors, and adaptor gender (AG), adaptor modality (AMod; unimodal, bimodal, and crossmodal, corresponding to Experiments 1, 2, and 3), and listener gender (LG) as between-subjects factors.
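The aftereffect magnitude described above is a simple difference of response proportions; the sketch below illustrates the computation across the five morph levels, using hypothetical proportions chosen only for illustration.

```python
import numpy as np

# hypothetical proportions of "happy" responses per morph level (20-80% happy)
p_happy_after_angry = np.array([0.15, 0.35, 0.60, 0.80, 0.92])
p_happy_after_happy = np.array([0.08, 0.22, 0.45, 0.70, 0.88])

# contrastive aftereffect: positive values mean test voices sounded
# happier after angry adaptors than after happy adaptors
magnitude = p_happy_after_angry - p_happy_after_happy

print(magnitude)         # per-morph-level aftereffect
print(magnitude.mean())  # overall magnitude for this condition
```

Computed per listener and condition, these difference scores then serve as the dependent variable of the across-experiments ANOVA.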
Further results from the ANOVA across experiments are briefly reported for the sake of completeness; these in part confirmed findings from the analyses of the individual experiments, and did not interact with adaptor modality. An interaction of ML x TG, F(4,240) = 2.781, p = .027, ηp² = .044, revealed that increased aftereffects at more ambiguous morph levels were found for female test voices, F(4,284) = 3.801, p = .006, εHF = .928, ηp² = .051, but not for male test voices, F(4,284) = 0.501, p = .735, ηp² = .007 (Figure 4B). In sum, the magnitude of adaptation effects on the perception of vocal emotion (computed as the difference between the angry and happy adaptation conditions) showed a different pattern for female and male listeners. In male listeners, we found a similar adaptation effect of ~10% across adaptor modalities. By contrast, in female listeners, there was no adaptation effect at all in the crossmodal (silent video) adaptation condition, whereas bimodal adaptation tended to elicit larger effects compared to unimodal adaptation (although the latter difference was not significant, possibly due to limited statistical power as a result of the between-subjects design).

General Discussion
To probe the multimodal nature of emotion perception, we conducted a series of three experiments that assessed the influence of perceptual adaptation with different adaptor modalities on the perception of vocal emotional expressions. We used unimodal voice adaptors (Experiment 1), bimodal face-voice video adaptors (Experiment 2), or the same video adaptors without sound as crossmodal adaptors (Experiment 3).
We demonstrated contrastive aftereffects of adaptation to happy or angry voices, such that test voices morphed on an angry-to-happy continuum were perceived as happier after prior adaptation to angry voices, and vice versa. These results confirm and extend those of Bestelmeyer et al. [28], who had first reported similar effects using tokens of the vowel /a/ morphed on an angry-to-fearful continuum. Our novel findings of crossmodal aftereffects (although these were only clear for male listeners) may seem at variance with an earlier study [29] that did not find evidence for crossmodal aftereffects on emotion perception in a face perception task. One possible reason for the present successful demonstration of crossmodal aftereffects could be that we used dynamic video adaptors which not only represented emotional expressions, but also represented the same underlying actions as the unimodal voice adaptors. This idea considers that crossmodal processing depends both on congruent temporal information [10] and on higher-level factors such as audio-visual stimulus congruency, both of which may contribute to the "unity assumption" [38]. Note that the present data alone do not exclude the possibility that crossmodal adaptation in emotion perception is unidirectional, particularly considering that Fox and Barton [29] did not find an effect of emotional auditory adaptors on static face perception (but see [12] for the explicit suggestion of mandatory bidirectional links between faces and voices in emotion perception). Thus, further research is required to determine whether crossmodal adaptation by voice adaptors on the perception of dynamic facial emotions can be demonstrated.
Another observation, although the relevant effect failed to reach statistical significance, was that bimodal adaptors elicited numerically larger aftereffects on vocal emotion perception than unimodal adaptors. At a broad level, such a finding would be in line with the idea that emotional expressions from faces and voices are processed in a multimodal manner [5,6]. We also note that the impact of adaptor modality was strong for female but not for male listeners, and that bimodal adaptation appeared to increase aftereffects somewhat more for female than for male listeners. While this finding clearly requires replication, it might be tentatively related to reports from spoken word perception, according to which women integrate emotional information from prosodic and semantic sources more efficiently than men [39,40]. Although the vast literature on emotion processing is often taken to suggest that women process emotional signals more effectively, and may tend to show more empathy-related responses [41], it has also been proposed that emotional signals provide more behaviorally relevant cues for men [42], and that men might be more efficient at emotion regulation under some conditions [43]. These extensive reviews of sex differences in emotion processing generally indicate differences in the relevant neural networks, but have also revealed a host of conflicting results. In the present study, crossmodal adaptation effects on vocal emotion perception were absent in women, while such effects were prominent in men. Although the precise mechanisms underlying this difference remain unclear, one possibility is that women (but not men) depend on simultaneous bimodal stimulation for face-voice processing to occur.
Finally, irrespective of listener gender, we also obtained some differences related to the gender of the adaptor and test stimuli. First, for female test voices, adaptation effects were larger at emotion-ambiguous morph levels, whereas for male test voices adaptation effects were similar across the entire morph continuum (Figure 4B). Second, we observed significantly larger adaptation effects overall for gender-congruent adaptor-test combinations (Figure 4C). We note that this gender-congruency effect on vocal emotion adaptation was observed particularly for female adaptors in Experiments 1 and 2, i.e., when adaptors contained voices. Although the original female happy voices (used as adaptors and for test voice morphing) were classified somewhat better than the male happy voices (since the latter elicited slightly more "surprised" responses), this difference cannot explain the absence of aftereffects elicited by female adaptors on male test voices in both the unimodal and bimodal conditions (Figures 2C and 2F). Moreover, male adaptors elicited similar aftereffects irrespective of test voice gender. This finding could therefore indicate that the acoustic cues conveying emotional expression [1] are not entirely independent of speaker gender, although this conclusion is limited by the small number of voices used in the present study. Note that both gender-specific and gender-independent contributions to aftereffects have previously been described for the perception of age from both faces [22] and voices [27]. Note also that our effects of adaptor-test gender congruency do not appear to reflect an effect of congruency in the identity of adaptor and test speakers, since identity-congruent adaptor-test combinations clearly did not yield larger aftereffects than identity-incongruent combinations.
This could be a relevant contrast with a face adaptation study by Fox and Barton [29], who found reduced expression aftereffects for identity-incongruent adaptor-test combinations. Although the reasons for these different outcomes are not completely clear, we note that while facial identity is easily perceived even from brief stimuli, the difficulty of perceiving voice identity from brief auditory samples (e.g., [31]) could be a factor in the absence of identity-congruency effects in the present study.
While the present study has revealed a number of novel and clear findings, several limitations should also be noted. First, because our study involved a limited number of speakers, utterance types, and emotional expressions, it remains to be determined whether our results generalize to other situations. We note, however, that one other study reported similar unimodal vocal emotion aftereffects for angry-to-fearful test voice continua, using only /a/ vowel utterances [28]. A degree of variability in our results might also be attributed to stimulus properties, such as differences in the emotional expressiveness of individual stimuli. For instance, not all raters perceived emotionally "neutral" stimuli as "neutral" (Table 1), and this could have contributed to the finding that the neutral adaptation condition did not always generate classifications exactly midway between those generated by adaptation to angry and happy adaptors. However, it should also be kept in mind that these observations could reflect a degree of individual differences between raters, who can often exhibit different "category boundaries" (cf. [21], for further discussion).
To conclude, the present series of experiments confirms recent findings of contrastive aftereffects in vocal emotion perception caused by adaptation. Here we provide the first evidence for crossmodal aftereffects in emotion perception, elicited by silent videos showing dynamic facial expressions of equivalent emotional events. Overall, our results provide strong support for the idea that the perception of emotions is multimodal in nature. Moreover, we also observed prominent gender differences, which we attribute to crossmodal processing and possibly to bimodal face-voice integration; both aspects warrant further research.

Table S1. Classification data of emotional stimuli of eight speakers in the rating experiment. Classification data (percentages) for the angry, happy, and neutral voice recordings of eight speakers (4 female), and mean classification accuracy (ACC). Speakers fSM and mSB were excluded due to listener reports of familiarity. Speakers fDK, fMV, mAK, and mUA were chosen for the adaptation experiments. Note: Percentages marked with an asterisk are based on N = 108 ratings; for all others, N = 144. (DOCX)