Implicit measurement of emotional experience and its dynamics

Although many studies have revealed that emotions and their dynamics have a profound impact on cognition and behavior, it has proven difficult to measure emotions unobtrusively. In the current study, our objective was to distinguish the different experiences elicited by audiovisual stimuli designed to evoke predominantly happy, sad, fear, and disgust emotions, using electroencephalography (EEG) and a multivariate approach. We show that we were able to classify these emotional experiences well above chance level. Importantly, we retained all the information (frequency and topography) present in the data. This allowed us to interpret the differences between emotional experiences in terms of component psychological processes, such as attention and arousal, that are known to be associated with the observed activation patterns. In addition, we illustrate how this method of classifying emotional experiences can be applied on a moment-by-moment basis in order to track dynamic changes in the emotional response over time. Our approach may be of value in many contexts in which the experience of a given stimulus or situation changes over time, ranging from clinical to consumption settings.


Appendix B. Confirmation of elicited emotions by videos
In order to show that the videos we selected indeed predominantly elicited the experience of happy, sad, fear, and disgust (and not predominantly other emotions we did not inquire about), we ran an MTurk study with a free-response format. A total of 262 MTurkers each viewed four videos, with each video belonging to one of the four emotion categories (20 videos in total, with the number of views per video varying from 50 to 56). For each video, we asked participants which emotion they predominantly experienced while viewing it. Before analyzing the results, two independent raters classified the MTurk responses as valid or invalid, since some responses were clearly not valid (e.g., 'big snake', 'black guy speaking', 'teeth'). Interrater agreement on the validity of responses was good (Cohen's Kappa = .59). The raters agreed that 54 of the 273 unique responses should be considered invalid, and only the responses that both raters judged invalid were deleted.
Examples of disagreement between the two raters on the validity of responses were 'crying', 'comradery', and 'ugly'. One participant was removed because he/she did not comment on any video, two participants were removed because they replied 'bored' to all videos, and 23 were removed because they gave only one valid response across the four videos they watched (which calls the responder's motivation into question). Single invalid responses were removed as well. The final analysis consisted of 933 responses (mean N across videos = 47, SD = 2.9, minimum N = 39 (The Green Mile), maximum N = 51 (Trainspotting)).
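As a brief illustration of the interrater-agreement statistic reported above, Cohen's Kappa can be computed from the two raters' validity codings as sketched below. The codings shown are hypothetical and only illustrate the computation; they are not the actual ratings.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical validity codings (1 = valid, 0 = invalid) given by the two
# independent raters to the same set of unique free-text responses.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0]
rater_b = [1, 1, 0, 0, 0, 1, 1, 1]

# Cohen's Kappa corrects the raw proportion of agreement for chance agreement.
print(cohen_kappa_score(rater_a, rater_b))
```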
In order to analyze how many respondents answered with a label that corresponded to the emotion the video was hypothesized to predominantly elicit, we searched the responses for the presence of the following words or word stems (see Supplementary Tables 2 and 3).

Supplementary Table 2. Responses with synonyms for emotion category labels
Emotion category label    Corresponding words and word stems
Happy                     happ*, joy*
Sad                       sad*, grief, tearful
Fear                      fear, afraid, fright*, scar*
Disgust                   disgust
In Supplementary Table 4, we present the percentage of respondents that answered with (synonyms for) the emotion category labels happy, sad, fear, and disgust, and in Supplementary Table 5 we included the broader synonyms for the emotion category labels.
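As an illustration of this word-stem search, the sketch below matches free-text responses against the stems of Supplementary Table 2 (a trailing * is taken to mean "any continuation of the stem"). The function names, and the example responses, are illustrative assumptions; the paper does not specify how the search was implemented.

```python
import re

# Word stems per emotion category, following Supplementary Table 2.
STEMS = {
    "happy":   ["happ*", "joy*"],
    "sad":     ["sad*", "grief", "tearful"],
    "fear":    ["fear", "afraid", "fright*", "scar*"],
    "disgust": ["disgust"],
}

def matches(response, patterns):
    """Return True if the free-text response contains any of the stems."""
    for p in patterns:
        # Turn a stem such as 'happ*' into a word-boundary regular expression.
        regex = r"\b" + re.escape(p.rstrip("*")) + (r"\w*" if p.endswith("*") else r"\b")
        if re.search(regex, response.lower()):
            return True
    return False

def label_percentage(responses, target_emotion):
    """Percentage of responses to a video that match its target emotion."""
    hits = sum(matches(r, STEMS[target_emotion]) for r in responses)
    return 100.0 * hits / len(responses)

# Hypothetical usage with a handful of responses to a 'fear' video:
print(label_percentage(["scared", "frightened", "bored", "afraid of heights"], "fear"))
```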

In the first part of the analysis, we investigated how accurately we could classify the emotional experience that was elicited by viewing a variety of videos, based on the patterns of frequency distributions in the EEG data. This part of the analysis consisted of three stages.
The first stage was feature selection. Using a subset of the observations (drawn from all videos and all participants), we selected the features (i.e., electrodes and frequencies) that were most informative in distinguishing the emotions, in order to reduce the dimensionality of the data and to provide insight into the differentiation of emotional responses. We started by randomly selecting two of the five videos for each emotion, for each participant. Using these two videos per emotion, we created contrasts between emotions to find the most informative features. For each participant, and for each electrode and frequency, we determined the mean activity across the two videos for each emotion.
From this mean, for each emotion, we subtracted the mean of the other three emotions, combined across their six videos. This yields a value that indicates how distinctive that electrode and frequency are, for a given participant, in separating one emotional response from the average of the other three emotional responses. In order to detect similarities in these distinctions across participants, we computed, for every emotion, one-sample t-tests across participants for every electrode and frequency. Finally, in order to identify the most distinctive and thus most informative features, we selected for every emotion the electrodes and frequencies with the 10% highest t-values (see Appendix E for similar results with 5% and 20% of the features: robustness check 2). We used the most informative features from all emotions combined in the subsequent stages.
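A minimal sketch of this feature-selection stage is given below. It assumes the band-power data are already arranged in an array of shape (participants, emotions, selected videos, electrodes, frequencies); the array layout and function name are illustrative, not taken from the original analysis code (which was written in Matlab).

```python
import numpy as np
from scipy.stats import ttest_1samp

def select_features(power, top_fraction=0.10):
    """power: array of shape (n_subj, n_emo, n_videos_selected, n_elec, n_freq)
    holding band power for the two videos per emotion set aside for feature selection.
    Returns a boolean mask over (n_elec, n_freq) marking the selected features."""
    n_subj, n_emo, _, n_elec, n_freq = power.shape
    # Mean activity per participant, emotion, electrode and frequency (across the two videos).
    emo_mean = power.mean(axis=2)                                       # (n_subj, n_emo, n_elec, n_freq)
    selected = np.zeros((n_elec, n_freq), dtype=bool)
    for emo in range(n_emo):
        others = [e for e in range(n_emo) if e != emo]
        # Contrast: one emotion minus the average of the other three, per participant.
        contrast = emo_mean[:, emo] - emo_mean[:, others].mean(axis=1)  # (n_subj, n_elec, n_freq)
        # One-sample t-test across participants for every electrode/frequency.
        t_vals, _ = ttest_1samp(contrast, popmean=0.0, axis=0)          # (n_elec, n_freq)
        # Keep the 10% of features with the highest t-values for this emotion.
        cutoff = np.quantile(t_vals, 1.0 - top_fraction)
        selected |= t_vals >= cutoff
    # Union of the informative features across all four emotions.
    return selected
```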
Second, in order to associate the emotion labels with patterns in the EEG data, we used classification models in the form of SVMs with a linear kernel function. Having identified the most informative features, we used only these electrodes and frequencies from the EEG data as features in the SVM models for the remaining observations that were not used for feature selection (i.e., three videos for each emotion and participant). After dividing the remaining observations into 10 folds with approximately equal representation of the emotions, the training stage employed 9 of the 10 folds (i.e., 9/10 of these remaining observations). Since we were interested not only in how well the emotions can be classified in general, but also in a comparison of the specific emotions one-by-one, we computed six two-class classification models for the six pairwise combinations of the four emotions (happy-sad, happy-fear, happy-disgust, sad-fear, sad-disgust, fear-disgust). In addition, a multi-class classification model was estimated containing all four emotion categories. We approached the multi-class classification problem using the Error-Correcting Output Codes (ECOC) framework (Dietterich and Bakiri, 1995), in which multi-class learning is reduced to multiple binary (SVM) learners. We used an ECOC classifier as implemented in Matlab (R2016b; Statistics and Machine Learning Toolbox) with a one-versus-all coding design when classifying all four emotions.
All this resulted in training seven classifiers to associate the emotion labels with patterns in the EEG data. We did not intend to classify a 'neutral response' from the neutral videos that were presented between emotion blocks, but including the neutral category in the classifiers yielded similar results (see Appendix F for robustness check 3).
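The sketch below illustrates how the seven classifiers could be set up. It uses Python and scikit-learn as a stand-in for the Matlab SVM/ECOC implementation used in the paper, so the function names and the OneVsRestClassifier wrapper (a one-versus-all coding design) are analogues rather than the original code.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

EMOTIONS = ["happy", "sad", "fear", "disgust"]

def train_classifiers(X_train, y_train):
    """X_train: (n_obs, n_selected_features) EEG features from the training folds.
    y_train: array of emotion labels. Returns the six pairwise SVMs plus one
    multi-class model with a one-versus-all coding design."""
    models = {}
    # Six two-class linear SVMs, one per pair of emotions.
    for emo_a, emo_b in combinations(EMOTIONS, 2):
        mask = np.isin(y_train, [emo_a, emo_b])
        models[(emo_a, emo_b)] = SVC(kernel="linear").fit(X_train[mask], y_train[mask])
    # One multi-class model covering all four emotion categories.
    models["all"] = OneVsRestClassifier(SVC(kernel="linear")).fit(X_train, y_train)
    return models
```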
Third, we tested the trained classifiers on the remaining, left-out data. We iterated stages two and three ten times, once for each possibility of leaving out one of the ten folds. We then evaluated the ability of the classification models to generalize the distinction between emotions to new data by calculating the percentage of accurately predicted emotion labels across observations from all ten folds (i.e., the out-of-sample generalization accuracy).
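Continuing the scikit-learn sketch above, the ten-fold evaluation loop could look roughly as follows; StratifiedKFold keeps the emotions approximately equally represented in each fold. Again, this is an illustrative Python analogue rather than the original Matlab pipeline, and it reuses the train_classifiers function from the previous sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def out_of_sample_accuracy(X, y, n_folds=10):
    """Ten-fold cross-validation of the multi-class model; returns the percentage
    of correctly predicted emotion labels across all held-out folds."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    predictions, truth = [], []
    for train_idx, test_idx in skf.split(X, y):
        models = train_classifiers(X[train_idx], y[train_idx])
        predictions.extend(models["all"].predict(X[test_idx]))
        truth.extend(y[test_idx])
    return 100.0 * accuracy_score(truth, predictions)
```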
An important quality of the analysis design is that the same data are not used twice, i.e., both for reduction of the data and for training/testing of the models. This means, however, that the specific subset of observations used in the different analysis stages influences which features are selected as the most informative ones in the first stage, and ultimately also the accuracy of the SVM models in the final stage. Since we did not want to select the most informative features based on specific videos, we decided to randomly select a subset of videos for feature selection.

Appendix D. Robustness check 1: Similar duration of analyzed segments across emotions
In order to empirically test whether classification was potentially biased by the difference in duration of the videos across emotion conditions (and thus a possibly systematically different signal-to-noise ratio in the EEG data), we conducted a robustness check. We repeated the complete analysis using only the data from the final 22 seconds of each video (i.e., the duration of the shortest video), thereby eliminating the effect of duration differences across emotion conditions. We found that the resulting classifiers were still able to generalize the distinction between all four emotional experiences to new data well above chance level, even when only the final 22 seconds of data were taken into account for all videos.
Hence, we can rule out that different video durations across emotions biased the classification results. We can also conclude from these results that there is indeed emotion-specific information in the extra length of the videos from the sad and happy categories, since it is especially this distinction that is predicted less accurately from the final 22 seconds than from the complete data (i.e., a decrease of approximately 10 percentage points).
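For concreteness, restricting each trial to its final 22 seconds amounts to a simple slice over the time axis; the sketch below assumes the data for one trial are stored as a NumPy array of shape (channels, samples) with a known sampling rate, both of which are illustrative assumptions.

```python
import numpy as np

def final_segment(epoch, sfreq, seconds=22.0):
    """Keep only the last `seconds` of an epoch of shape (n_channels, n_samples)."""
    n_keep = int(round(seconds * sfreq))
    return epoch[:, -n_keep:]

# Hypothetical usage: a 64-channel, 60-second trial sampled at 512 Hz.
trial = np.random.randn(64, 512 * 60)
print(final_segment(trial, sfreq=512).shape)   # (64, 11264)
```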

Appendix F. Robustness check 3: Classifying the neutral videos viewed in between emotion blocks
Including 'neutral' as an emotion category yielded out-of-sample generalization accuracies that were in a similar range to those of the classifiers described in the main manuscript. This can be regarded as another robustness check, since the neutral trials were not presented sequentially in one block but were instead interspersed between the other emotion blocks; nevertheless, the classifiers were able to recognize these separate neutral trials as a category distinct from the trials belonging to the other emotion categories.

Appendix G. Repetition of the analysis with different observations in the different stages to rule out a selection bias
An important quality of the analysis design is that the same data are not used twice in the different analysis stages, but this also means that the specific subset of observations used in each stage influences the outcome. In the first stage, the specific subset of observations influences which features are selected as the most informative ones, and in the final stage it influences the out-of-sample generalization accuracy of the SVM models. Since we did not want to select the most informative features based on specific videos, we decided to randomly select a subset of videos for feature selection. For each of the 37 participants, we randomly selected two of the five videos per emotion for feature selection and repeated the complete analysis with different random selections (500 repetitions in total), so that the reported results do not depend on one specific selection.

Appendix H. Results of the manipulation check
Because we wanted to verify the videos' effectiveness in eliciting the specific emotional responses in our participants (i.e., a manipulation check), we asked participants to complete a questionnaire about the previously viewed videos after we finished the EEG data collection. For each video, participants indicated the extent to which they felt happy, sad, fear, and disgust during the video, on a scale from one (not felt at all, e.g., not at all happy) to five (felt extremely, e.g., extremely happy).
As expected, the MANOVA indicated an interaction between video label (i.e., the target emotion) and the emotion participants reported they actually experienced, using Pillai's trace (V = 0.98, F(9, 28) = 129.89, p < .001). The follow-up ANOVAs showed that the ratings of a specific experienced emotion were higher for the videos that targeted this particular emotional response than for videos that did not target this emotional response. Based on the results of the ANOVAs and the ICCs, we conclude that the emotional response a video targeted to elicit is indeed the emotion that participants predominantly experienced while viewing it. These results suggest that the EEG activity averaged across the duration of the videos is representative of a happy, sad, fear, and disgust response, respectively, and that we can use these data to functionally localize specific emotion-related activity patterns.
In the subsequent figures, we present how often the features were selected, and thus used in training and testing the classifiers, taking as criterion the 10% highest t-values per emotion, expressed as a percentage across the 500 repetitions. The colors represent the percentage of repetitions in which the specific frequencies (listed below the different scalp maps) and electrodes (across the heads) were selected as the features with the 10% highest t-values for the happy contrast.
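A brief sketch of how such selection frequencies can be tallied is given below. It reuses the select_features function from the earlier feature-selection sketch and assumes 500 repetitions, each with a fresh random draw of two videos per emotion; for simplicity it tallies the union of features across emotions, whereas per-emotion maps (e.g., for the happy contrast) would apply the same idea to each contrast separately. Everything here is illustrative rather than the original Matlab pipeline.

```python
import numpy as np

def selection_frequency(power_per_repetition, top_fraction=0.10):
    """power_per_repetition: iterable yielding, for each repetition, the
    (n_subj, n_emotions, 2, n_elec, n_freq) band-power array for the two
    randomly drawn videos per emotion. Returns, per electrode and frequency,
    the percentage of repetitions in which that feature was selected."""
    counts, n_reps = None, 0
    for power in power_per_repetition:
        mask = select_features(power, top_fraction)       # boolean (n_elec, n_freq)
        counts = mask.astype(int) if counts is None else counts + mask
        n_reps += 1
    return 100.0 * counts / n_reps
```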