Cross-cultural emotion recognition and evaluation of Radboud faces database with an Indian sample

Emotional databases are important tools to study emotion recognition and their effects on various cognitive processes. Since a well-standardized, large-scale emotional expression database is not available in India, we evaluated the Radboud Faces Database (RaFD), a freely available database of emotional facial expressions of adult Caucasian models, with an Indian sample. Using the pictures from RaFD, we investigated the similarities and differences in self-reported ratings on emotion recognition accuracy as well as on the parameters of valence, clarity, genuineness, intensity and arousal of emotional expression, following the same rating procedure as used for the validation of RaFD. We also systematically evaluated the universality hypothesis of emotion perception by analyzing differences in accuracy and ratings for different emotional parameters across Indian and Dutch participants. As the original Radboud database lacked arousal ratings, we added arousal as an emotional parameter along with all other parameters. The results show that the overall accuracy of emotional expression recognition by Indian participants was high and very similar to that of Dutch participants. However, there were significant cross-cultural differences in the classification of emotion categories and their corresponding parameters. Indians rated certain expressions as comparatively more genuine, higher in valence, and less intense than the original Radboud ratings. The misclassifications/confusions for specific emotional categories differed across the two cultures, indicating subtle but significant differences between the cultures. In addition to understanding the nature of facial emotion recognition, this study also evaluates and enables the use of RaFD within the Indian population.


Introduction
In everyday social interactions, our decisions and actions are influenced by the facial expressions of the person with whom we communicate. Such emotional influences are studied using photographs of human models expressing distinct emotions. Unlike schematic faces, human facial stimuli offer the possibility of portraying a wider range of emotional expressions, such as disgust, fear, contempt, surprise, and sadness, at different levels of intensity. In addition, facial [...] supports group interactions and facilitates interdependence amongst its members [23,24]. These fundamental differences between the two types of cultures contribute to differences in general psycho-social attitudes [25][26][27][28] and may also contribute to differences in the perception or rating of emotional expressions as a function of whether the faces belong to in-group members (from the raters' own region) or out-group members (from a region other than their own) [24,29]. For example, people in individualistic cultures (Americans) are more comfortable displaying negative expressions than those from collectivistic cultures (Costa Rica) [23,24]. A culture is not solely individualistic or collectivistic; nonetheless, Asian cultures in general are considered more collectivistic than Western cultures in certain aspects [26]. India is not a purely collectivistic culture but rather shows features of both collectivism and individualism [24,25,27,28]. For Indian participants, no measures of out-group facial emotion ratings have been evaluated in the context of individualism or collectivism. Given this background, we also wanted to check whether the rating and agreement rate data from Indian raters for Radboud emotional faces (out-group) can be understood within the context of individualism or collectivism.
Considering the above arguments, the motivation for this study was threefold. First, we aimed to test the hypothesis of universality of emotion recognition in an Indian sample population from Allahabad with a full-fledged emotional database from another culture. At a broader level, we expected that there would be differences in subtle measures of emotion recognition such as intensity and clarity, especially given that the faces belong to out-group members [16,30]. Second, we wanted to evaluate whether differences in emotion recognition between Indian (out-group) and Dutch (in-group) raters follow those already reported for individualistic or collectivistic cultures [21,22,24,29,31]. As Indians are reported to be a relatively more collectivistic culture than the Dutch [24], it could be expected that they would differ in agreement ratings for negative emotions in comparison to positive emotions, for out-group members. Third, we also aimed to validate the database for studies on emotions across cultures. To achieve this, we selected the freely available Radboud Faces Database (RaFD) [5]. It offers ready-to-use colour pictures of Caucasian face stimuli of adults and children in three gaze directions and eight expressions: neutral, happy, angry, disgust, contempt, fear, surprise, and sad. All images were created according to FACS guidelines and have been evaluated with ratings on the following parameters: valence, intensity, clarity, genuineness and correct identification of the expression [5]. For this study we selected only the adult facial expressions with frontal view and straight gaze direction.
The methodology of the current study is similar to that used by the developers of RaFD [5], so that emotion rating and recognition differences between the two cultures (Indian and Dutch) can be compared. In addition to the emotion categories and parameters originally used for RaFD, we also rated the database on the 'arousal' parameter, which is not available for RaFD. Emotions can be understood in terms of two parameters, namely valence and arousal [32,33], where valence can be positive or negative (pleasant/unpleasant) and arousal represents the intensity of emotion felt by the participant (calm/intense). Multiple studies suggest that emotion-cognition interactions are influenced more by the arousal value of emotions than by valence [34,35]. Given that the arousal of an expression plays a significant role in emotion processing [36][37][38], having arousal ratings for this database would help cross-cultural experimental studies to control for arousal values.

Participants
Forty naïve observers (age range: 18-35 years, 25 females) with normal or corrected-to-normal vision provided informed written consent and participated in the experiment. All experimental protocols were approved by the Institutional Ethics Review Board of University of Allahabad.

Apparatus
The stimuli were presented using E-Prime 2.0 Professional software [39] on a Samsung PC running Windows (1024 x 768, 85 Hz), and the data were analyzed in Matlab [40] and R [41].

Stimuli
Only the front-facing, straight-gaze adult models from RaFD were used in this experiment. We used only seven expressions, namely happy, angry, sad, surprise, disgust, neutral, and fear. We did not include the 'contempt' expression, as it was the least accurately rated expression in the Radboud ratings. Moreover, the low accuracy rates for the 'contempt' expression have been attributed to variations in the facial features representing contempt across different cultures and regions [42,43].
A total of 39 models (19 females), each depicting the seven above-mentioned expressions (273 images), were divided into two experimental sets (Set-1 and Set-2) of 19 and 20 models respectively; otherwise the experiment would have exceeded two hours, which was not feasible for a single participant. Set-1 had 133 images and Set-2 had 140 images, and the two sets were rated by two different groups of participants. The assignment of models to the two sets was random. All expressions from each model were presented within one set only. Twenty participants rated each set. Each image was rated only once by a participant, giving us twenty unique ratings for each image.
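The set construction described above can be illustrated with a minimal Python sketch. The model identifiers, the fixed random seed and the helper names (`split_models`, `images_for`) are hypothetical and are not part of the original procedure; the sketch only checks that a 19/20 split of 39 models with seven expressions each yields 133 and 140 images.

```python
import random

# The seven expressions used in this study.
EXPRESSIONS = ["happy", "angry", "sad", "surprise", "disgust", "neutral", "fear"]


def split_models(model_ids, set1_size, seed=0):
    """Randomly assign models to two sets (the seed is an illustrative choice)."""
    rng = random.Random(seed)
    shuffled = list(model_ids)
    rng.shuffle(shuffled)
    return shuffled[:set1_size], shuffled[set1_size:]


def images_for(models):
    """All (model, expression) pairs; every expression of a model stays in one set."""
    return [(model, expression) for model in models for expression in EXPRESSIONS]


# Hypothetical identifiers for the 39 adult models.
model_ids = [f"model_{i:02d}" for i in range(1, 40)]
set1, set2 = split_models(model_ids, set1_size=19)

print(len(images_for(set1)), len(images_for(set2)))  # 133 140
```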

Procedure
Each experimental set had two rating blocks presented sequentially to all participants, namely an attractiveness rating block and an emotion rating block. Each trial began with the presentation of an image at the center of the screen, with the task question above the image and the rating scale below it (Fig 1). Each image remained on the screen until the participant rated it. The participants entered their responses using a keyboard. For each model, all emotional expressions were presented sequentially, followed by the next model and all its expressions. Model image order was randomized for both blocks.
Participants rated attractiveness on a 5-point Likert-like scale (1-unattractive, 5-attractive). Only the neutral expression of each model was used in this task, so that the participants became familiar with the images for a given set. The same models were used in the emotion-rating task. The first question was an emotion categorization task, in which the participants were instructed to report the intended expression of the image on a 7-point nominal scale (1-happy; 2-surprise; 3-disgust; 4-neutral; 5-sad; 6-angry; 7-fearful). The participants were instructed to choose the label that best described the expression. This emotion categorization task differed from the original task [5] in two ways. First, we did not include the contempt expression in the task, for the reasons mentioned above. Second, we did not have the 'others' option used in the original study [5]. Most of the 'others' responses among Dutch raters in the original article [5] were for the contempt expression and, since we dropped the 'contempt' expression, we dropped the 'others' option as well [44]. Apart from these two differences, all other rating scales were similar to those of the original task [5]. After the emotion categorization task, participants rated the valence of the expression (negative to positive), clarity of the expression (unclear to clear), genuineness (false to genuine), intensity (weak to strong), and arousal (calm to excited) on a 5-point Likert-type scale (1 to 5), one after the other for the same model. As mentioned previously, one of our objectives was to test the universality hypothesis of emotion recognition; to achieve that, we requested and obtained the original classification and rating data of Dutch participants [5] from the authors.

Attractiveness rating
On a scale of 1-5, the mean attractiveness ratings for male and female adult models were not significantly different, t(37) = -1.158, p = .254, CI = [-0.81, 0.22]. Since we did not have individual attractiveness ratings from the Dutch participants, we could not statistically compare the attractiveness ratings of the two populations. However, the mean ratings (mean ± standard deviation) of the Indian and Dutch samples [5] for male (Indian = 2.13 ± 0.77, Dutch = 2.10 ± 0.58) and female (Indian = 2.42 ± 0.82, Dutch = 2.36 ± 0.53) adult models were similar across the two cultures (Fig 2).

Expression agreement analysis
We evaluated the agreement rates, that is, the percentage of instances in which an emotion was correctly categorized as the intended expression (Fig 3). The overall agreement rate (mean ± standard deviation) across all emotion categories was 83.9% ± 15.7% (median = 85%). Agreement rates for individual pictures by Indian and Dutch raters are provided in the supporting information (S1 Table). A one-way repeated-measures (RM) ANOVA on arcsine-transformed agreement rates for the seven expressions was performed. Mauchly's test showed a significant deviation from sphericity for expression, W(6) = 0.26, p < .001 (ε = 0.68), so Greenhouse-Geisser corrected values were used. The analysis showed a significant effect of expression, F(4.08, 155.04) = 25.08, p < .0001, ηp² = 0.39. Post-hoc Tukey-Kramer analysis showed that the agreement rates were significantly higher (all p < .001, 1.17 ≤ Cohen's d ≤ 1.93) for the happy expression (M = 97.9%, SD = 3.2%) compared to all other expressions. The agreement rates for neutral, sad, surprise and disgust were not significantly different from each other (Fig 3). The agreement rates for angry (M = 71.5%, SD = 9.8%) and fear (M = 71.9%, SD = 12.8%) were the lowest and differed significantly from all other expressions (all p < 0.01; all d ≈ 1.9). In contrast, for the Dutch ratings, the lowest agreement was observed for contempt (M = 50%, SD = 15%) and the second lowest for disgust (M = 77.3%, SD = 11.1%), while the highest agreement rate was for the happy expression (M = 98%, SD = 3%). Since we did not include the 'contempt' expression, we could not compare the two datasets for this particular expression. A two-way RM ANOVA on agreement rates comparing the ratings from the Dutch and the Indian participants was performed with expression (7 expressions: happy, surprise, disgust, neutral, angry, sad, and fear) and culture (2 cultures: Indian and Dutch) as within-subjects factors.
Mauchly's test showed significant deviations from sphericity for the expression factor, W(6) = 0.13, p < .001. Greenhouse-Geisser corrections were applied to the expression factor (ε = 0.60). There was a significant main effect of expression, F(3.6, 136.8) = 25.90, p < .001, ηp² = 0.40, and of culture, F(1, 38) = 29.90, p < .001, ηp² = 0.44. The interaction between expression and culture was also significant, F(6, 228) = 17.14, p < .001, ηp² = 0.31. Post-hoc Tukey-Kramer analysis showed no significant difference (all p > .30) between the agreement rates of Indian and Dutch raters for happy, surprise, disgust and sad expressions. However, agreement rates were significantly lower among Indian raters than among Dutch raters for angry (p < .01, d = 0.79), neutral (p = .045, d = 0.75) and fearful (p < .01, d = 1.36) expressions. As mentioned previously, angry and fear were the expressions with the lowest agreement rates among Indian raters. There were a few negative expressions (e.g., fear, angry) for which there was a lack of consensus among raters and the agreement rates were low (~70%). Fig 4 shows a three-dimensional plot of the mean percentage of expressions chosen by the participants as a function of the expressions intended by the models. This plot also represents a confusion matrix, that is, how often an intended expression (of a model in RaFD) was confused with any other expression in this forced-choice paradigm. The confusion matrix shows that intended fear was confused with surprise (10%), intended surprise was confused with fear (9%), and intended disgust was confused with angry (8%). Such categorization errors were also reported for the Dutch raters (see Fig 4, [5]). Indian raters categorized intended angry as sad (14%) and disgust (8%), while intended neutral was classified as sad (8%). Visual inspection indicates that Indian raters misclassified angry and neutral expressions more often than Dutch raters.
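As an illustration of how agreement rates and the confusion matrix relate, the following Python sketch tabulates forced-choice responses by intended expression. The trial data are invented, the function names are hypothetical, and the arcsine transform is assumed to be the common arcsin(√p) form, which the text does not specify.

```python
import math
from collections import Counter, defaultdict

EXPRESSIONS = ["happy", "surprise", "disgust", "neutral", "sad", "angry", "fear"]


def confusion_matrix(trials):
    """Percentage of each chosen label per intended expression.

    trials: iterable of (intended, chosen) pairs from a forced-choice task.
    """
    counts = defaultdict(Counter)
    for intended, chosen in trials:
        counts[intended][chosen] += 1
    matrix = {}
    for intended, row in counts.items():
        total = sum(row.values())
        matrix[intended] = {c: 100.0 * row[c] / total for c in EXPRESSIONS}
    return matrix


def agreement_rates(matrix):
    """Agreement rate = diagonal of the confusion matrix (chosen == intended)."""
    return {intended: row[intended] for intended, row in matrix.items()}


def arcsine_transform(percent):
    """Assumed variance-stabilising transform: arcsin(sqrt(p)), with p in [0, 1]."""
    return math.asin(math.sqrt(percent / 100.0))


# Invented toy responses: 10 ratings of an intended-fear image set and 10 of happy.
trials = (
    [("fear", "fear")] * 8
    + [("fear", "surprise"), ("fear", "angry")]
    + [("happy", "happy")] * 10
)
matrix = confusion_matrix(trials)
rates = agreement_rates(matrix)
transformed = {label: arcsine_transform(rate) for label, rate in rates.items()}
print(rates)
```

Off-diagonal entries of `matrix` correspond to the misclassification percentages discussed above, for example intended fear chosen as surprise.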

Unbiased hit rate analysis
To control for response bias (that is, to check that the response key for a given emotion is used only for that emotion), we conducted an unbiased hit rate analysis [45] on the confusion matrix evaluated above, as reported by [5]. Low unbiased hit rates indicate that stimuli from a given category are not classified correctly.
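A minimal sketch of an unbiased hit rate computation is given below. It assumes that [45] refers to the commonly used definition in which the squared number of hits for a category is divided by the product of the number of stimuli of that category and the number of times that response was chosen. The counts and the function name are invented for illustration.

```python
def unbiased_hit_rates(counts, labels):
    """Unbiased hit rate per category from a confusion matrix of raw counts.

    counts[i][j] = number of times stimuli of category i received response j.
    Hu_i = hits_i**2 / (row_total_i * column_total_i)
    """
    rates = {}
    for i, label in enumerate(labels):
        hits = counts[i][i]
        row_total = sum(counts[i])                 # stimuli of category i shown
        col_total = sum(row[i] for row in counts)  # times response i was chosen
        if row_total and col_total:
            rates[label] = (hits * hits) / (row_total * col_total)
        else:
            rates[label] = 0.0
    return rates


# Invented two-category example.
labels = ["fear", "surprise"]
counts = [
    [8, 2],  # intended fear: 8 "fear" responses, 2 "surprise" responses
    [1, 9],  # intended surprise: 1 "fear" response, 9 "surprise" responses
]
hu = unbiased_hit_rates(counts, labels)
print(hu)
```

In this toy case the raw hit rate for surprise is 0.9, but its unbiased hit rate is lower because the "surprise" response was also given to two fear stimuli.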
A two-way RM ANOVA on arcsine-transformed unbiased hit rates with expression and gender of the models as factors was performed with Greenhouse-Geisser correction for the expression factor (ε = 0.76), as Mauchly's test showed a significant deviation from sphericity for this factor (W(6) = 0.36, p = .009). Unlike the Radboud analysis [5], there was no significant effect of gender of the model, F(1, 39) = 0.87, p = .36 [...]
[...] intensity (Fig 5B), clarity (Fig 5C), and genuineness (Fig 5D) for Indian and Dutch raters and arousal (Fig 5E) for only Indian raters, across six expressions. To study the difference in rating parameters as a function of emotional facial expressions across the two cultures (Indian and Dutch), we performed a three-way mixed ANOVA (2 x 2 x 7) for each individual rating parameter (valence, intensity, clarity, genuineness) with culture (Indian and Dutch) and expression (happy, surprise, disgust, neutral, angry, fear and sad) as within-subjects factors and gender of the image (male and female) as a between-subjects factor. For the arousal parameter, since no corresponding ratings were available from the Dutch participants, culture was not included as a factor in the analysis.
Arousal analysis. Arousal analysis was conducted for the Indian rating data only, as no corresponding ratings were available for the Dutch population. A two-way mixed ANOVA with gender of the models (2) as a between-subjects factor and expression (7) as a within-subjects factor was performed.

Correlation analysis
We also calculated Pearson's correlations (Table 1) between the different parameters and found a significant, high correlation between intensity and clarity, r(270) = 0.67, p < .001. A similarly high correlation between intensity and clarity was also reported for Dutch raters. In the Dutch ratings [5], genuineness showed low correlations with intensity (r = 0.10) and clarity (r = 0.24). In contrast, among Indian raters genuineness was significantly correlated with clarity, r(270) = 0.66, p < .001, and intensity, r(270) = 0.58, p < .001. Happy and neutral expressions were rated as most genuine; happy was also the only expression rated as positive in valence. Neutral was rated as neutral in valence (M = 3.2, SD = 0.18), and surprise was rated close to neutral (M = 3.01, SD = 0.21).
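For readers who wish to reproduce this kind of analysis, a minimal sketch of Pearson's r and the associated t statistic (with n - 2 degrees of freedom) follows. The per-image mean ratings are invented and the function names are hypothetical; the sketch does not reproduce the values in Table 1.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented per-image mean ratings on the 1 to 5 scale.
intensity = [2.0, 3.0, 4.0, 5.0, 3.5]
clarity = [2.5, 3.0, 4.5, 4.0, 3.5]

r = pearson_r(intensity, clarity)
print(round(r, 3), round(t_statistic(r, len(intensity)), 3))
```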

Inter-rater reliability index
In our study, multiple models display the same emotions, and each participant rates all the expressions of any particular model. Thus, there is only one response from each participant per emotion parameter. This gives each face 20 independent responses. In order to measure the consistency of ratings from these 20 different participants, the intra-class correlation coefficient (ICC) can be calculated. This analysis accounts for measurement/judgment errors made by the raters. ICC is also termed an inter-rater reliability index or reliability coefficient [5,46]. [...] Table 2 and the values were similar to those from Dutch raters. As mentioned by Langner et al. [5], we also could not parse out between-rater variance, as we were not able to calculate the higher indices ICC(2, 1) and ICC(2, k) because different sets of participants rated different models. The values of
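The following Python sketch shows how one-way random-effects intra-class correlations can be computed from a table of ratings. It assumes the ICC(1, 1) and ICC(1, k) forms, which do not require the same raters for every image; this section does not state which indices were reported, so the choice is an assumption, and the ratings are invented.

```python
def icc_one_way(ratings):
    """One-way random-effects ICCs from a list of targets, each with k ratings.

    Returns (ICC(1, 1), ICC(1, k)):
      ICC(1, 1) = (MSB - MSW) / (MSB + (k - 1) * MSW)
      ICC(1, k) = (MSB - MSW) / MSB
    where MSB and MSW are the between-target and within-target mean squares.
    """
    n = len(ratings)       # number of rated images (targets)
    k = len(ratings[0])    # ratings per image
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Invented example: 4 images, 3 ratings each (the study had 20 ratings per image).
toy = [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 5, 5]]
icc_1, icc_k = icc_one_way(toy)
print(round(icc_1, 3), round(icc_k, 3))
```

The average-rating index ICC(1, k) is at least as large as the single-rating index when agreement is positive, since averaging over raters reduces rating noise.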

Gender specific emotion analysis
Many studies have shown that male faces tend to be rated as angrier than female faces and, similarly, that female faces tend to be rated as happier than male faces. On the other hand, cross-cultural studies have also shown that participants rate faces from a different culture/race as angrier [47,48]. But this trend may depend upon whether a particular culture is individualistic or collectivistic. We wanted to test whether there are any cross-cultural differences in emotion recognition as a function of the gender of the models, for raters from a collectivistic society like India (this study) and an individualistic society like the Netherlands rating the same Caucasian models' faces. For this analysis we used only the valence and intensity parameters of happy and angry expressions.
To test whether male models are rated as angrier while female models are rated as happier, we first performed Wilcoxon rank-sum tests with Bonferroni correction (α = 0.01) between male and female images within the Indian and Dutch rating data separately, for the intensity and valence parameters. The tests showed no significant difference in ratings of the 'valence' and 'intensity' parameters between male and female images posing happy and angry expressions for either population (p > .05). These results do not support gender-based expression ratings [47,48] for happy and angry expressions, at least with the simple emotion perception and rating task used in this study.
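A comparison of this kind can be sketched in Python as follows. The sketch implements a two-sided Wilcoxon rank-sum test with the normal approximation, without tie or continuity correction, which may differ from the implementation used in the original analysis. The ratings are invented, the function names are hypothetical, and the threshold simply reuses the corrected α = 0.01 stated in the text.

```python
import math

ALPHA = 0.01  # corrected significance threshold as stated in the text


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation). Returns (z, p)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
    return z, p


# Invented mean intensity ratings of angry images for male and female models.
male = [3.9, 4.1, 3.7, 4.0, 3.8, 4.2]
female = [3.8, 4.0, 3.9, 4.1, 3.6, 4.3]
z, p = rank_sum_test(male, female)
print("significant" if p < ALPHA else "not significant")
```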

Discussion
Using within-culture and cross-cultural analysis, this study has addressed a few of the central questions in facial emotion recognition in the context of culture: a) the universality of emotion recognition across cultures, b) the contribution of specific features of emotional faces (other than agreement rates) in evaluating universality, c) differences in emotion ratings by individualistic and collectivistic societies, and d) validation of the Radboud database so that it can be used appropriately with Indian participants in future studies. To achieve this, we evaluated the emotion categorization accuracy (agreement rates) and the valence, intensity, clarity, genuineness and arousal judgments of Indian raters, employing a design and methodology similar to those of the original study [5], and compared them with the already available performance and ratings of Dutch raters. This also enabled us to validate the Radboud database, so that it can be used in the Indian context for the Indian population.
Are emotions classified universally across different cultures? An important way to test universality is to investigate the recognition accuracy (agreement rate measurements) across different facial expressions. We observed that the overall agreement rate of Indian participants across all emotion categories (88%) was comparable (88%) to that of Dutch participants. When this overall agreement data was divided into individual emotion categories, happy faces were recognized most accurately in comparison to all other expressions. This was evident from significantly higher values for mean agreement rates and unbiased hit rates as well as for the parameters of valence, intensity, clarity and genuineness for the happy expression. A more critical way to understand universality in emotion recognition performance is through the confusion matrix or unbiased hit rate analysis. The Indian ratings again show that less confusion was observed for happy emotion recognition than for all other emotional categories. This is also in line with the ratings by Dutch participants and implies that 'happy' can be considered the least ambiguous expression in the Radboud database across both cultures. Angry and fear had the lowest agreement rates and high misclassifications among Indian raters. However, with the Dutch raters, contempt and disgust had the lowest agreement rates and were highly misclassified. It should be noted that we did not use the contempt expression in our study. For other expressions (happy, neutral, sad and surprise), the mean agreement rates were comparable between Dutch and Indian raters (see S1 Table).
The above results for the happy expression are supported by other database validation and cross-cultural studies. The main differences in misclassifications were seen with high-arousal negative expressions for both cultures, consistent with findings reported in cross-cultural emotion studies [3,30,49,50]. Thus, as far as recognition accuracy is concerned, the happy expression can be said to be recognized more universally. The observations are in line with the literature suggesting that the universality hypothesis may hold true [30,49,51], but only for specific emotion categories among cultural groups. A closer look at misclassification errors indicates significant differences across the two cultures, suggesting cultural differences and thus arguing against strict universality.
In recent years, many studies have emphasized measuring various parameters that affect the quality of perceived emotion. These include valence, intensity, clarity, genuineness and arousal, as well as others (such as trustworthiness). These parameters help in establishing the strength of emotion processing and also serve as controls in experimental studies where parameters such as valence and arousal play differential roles, for example in studies on emotion-attention interactions [32][33][34][35][36][37][38]. The results obtained by comparing specific emotional parameters across the two cultures revealed that all parameters (except clarity of emotion) were significantly different between cultures and showed significant interactions between culture and expression. Specifically, for the valence and genuineness parameters, Indians rated specific emotions (happy, angry, sad, surprise, disgust and fear) as more positive or negative and more genuine than the Dutch participants did. In contrast, the intensity of a few emotions (surprise and angry) was rated higher by Dutch participants than by Indians. The expression with the most similar ratings for valence, intensity and clarity was the happy expression. Even for the happy expression, Indian raters rated it as more genuine and more arousing. These results indicate that even though the participants are able to correctly classify/categorize emotions across cultures, there are significant parameter-specific differences in emotion perception that do not fully support the universality hypothesis between the two cultures.
As discussed in the introduction, a significant factor that may contribute to such cultural differences may reside within the behavior of societies, that is, whether a particular culture is more individualistic or collectivistic [21,24,26]. Very few studies have evaluated this aspect in the context of the perception of emotional faces rated by Indian participants. The current study compares ratings of out-group faces (Caucasian) by Indian raters and of in-group faces by Dutch raters. For example, the agreement rates of Indian participants for negative emotions like anger and fear were significantly lower than those reported for Dutch participants (Fig 3). Additionally, high misclassification errors were observed with Indian raters for negative emotions. This is consistent with studies that show better accuracy in individualistic compared to collectivistic societies [22,29,30]. Low accuracy for out-group negative emotions in a collectivistic society has been attributed to the fact that these emotions are discouraged in the context of interdependence and group formation [24,29]. Further, the mean intensity ratings of Indian participants were in general lower than those of Dutch participants for most emotions (Fig 5B), but only the ratings for angry and surprise showed statistical significance. This is partly consistent with results showing higher intensity ratings in individualistic compared to collectivistic societies [21,31].
The results from the current study and the inferences made on the basis of individualistic and collectivistic societies are in contrast to an earlier study [18], which compared valence ratings across three cultures (Indian, Japanese and American), with models posing seven basic emotions who were also from these three cultures. That study did not observe biases in valence ratings depending upon whether a culture is individualistic or collectivistic. One reason could be that the models posing emotions in the Indian and Japanese cultures did not follow the FACS guidelines. The failure to follow FACS guidelines was used to explain the significantly higher accuracy rates for American models posing different emotions compared to the models from the other two cultures [18], and may have resulted in the lack of differences in valence ratings across cultures. Further, the highest accuracy was observed for participants performing emotion recognition on faces from the same culture, referred to as the 'in-group advantage' [17,43]. This in-group advantage is known to be reduced as a function of geographical proximity and cross-cultural interactions amongst the cultural groups tested [22,52,53,54]. This is evident in our study as well; for the expressions with a significant difference between the two cultures, the agreement rates were higher in the Dutch than in the Indian data-set.
It has been argued that the in-group advantage disappears when models use a standardized protocol like FACS [2] for portraying emotions [22]. It is suggested that an emotion stimulus database developed following FACS notation augments 'stimulus equivalence', but at the same time, due to differences in the intensity with which different models portray emotions and due to physiognomic differences (encoder effect), decoding would be affected to different extents across cultures, leading to cross-cultural variations (decoder effect) in emotion recognition [22]. This is also confirmed in our study by the misclassification analysis, which showed pronounced differences across the two cultures. The misclassification errors of Indian raters were higher for all the expressions except happy and sad. On the other hand, misclassification errors by Dutch raters were relatively fewer and restricted to surprise, fear, disgust and contempt. These results raise questions about the universality of emotion perception and point to subtle cross-cultural differences in emotion perception.
Pertaining to gender differences in emotional expressions, within Indian and Dutch raters, we did not observe any significant difference in valence or intensity ratings between male and female face models posing happy and angry emotions. However, we did observe a significant effect of model gender on the clarity and genuineness parameters, but this was limited to negative valence emotions only (fear, sad and anger). Moreover, there was no significant interaction between culture and gender for any of the four parameters tested. These results do not support (at least with Indians rating the Radboud faces) the idea that males from other cultures are rated as angrier, while females are rated as happier [47,48].
The current study is at the core of an existing debate on the universality of emotion recognition across different cultures. There is literature in support of [12,15,51] and against [14,18,52] the universality theory. However, a general consensus is that the basic emotions (happy, angry, sad, disgust, fear, surprise) are recognized reliably and above chance level across cultures, but that the accuracy varies across cultures [18,21,53]. This difference in accuracy has been attributed to various factors such as subtle differences in the expression style of different facial emotions across cultures [19], the familiarity of an emotion within a culture [52] or the frequency of occurrence of an emotion in a cultural group [30]. Our study reports that, out of all the emotions used, recognition of the happy expression is most consistent across the two cultures, while significant cultural differences exist for other emotional categories and for more specific features of emotional faces, arguing against the universality of emotion perception.

Conclusion
In conclusion, we would like to emphasize the importance of validation of image databases for studies on emotion processing in different cultures. This study facilitates the use of an established database like the Radboud database in India. While advocating the use of a cross-cultural database, caution must be exercised since not all expressions are classified or rated in the same manner compared to the original ratings. From a theoretical perspective, the study not only indicates cross-cultural differences in emotion classification but also demonstrates the presence of subtle differences in emotion perception even when an emotion is accurately categorized, raising questions about the universality of emotion perception.
Supporting information S1