Automatic detection of expressed emotion from Five-Minute Speech Samples: Challenges and opportunities

Research into clinical applications of speech-based emotion recognition (SER) technologies has been steadily increasing over the past few years. One such potential application is the automatic recognition of expressed emotion (EE) components within family environments. The identification of EE is highly important, as its components have been linked with a range of adverse life events. Manual coding of EE requires time-consuming specialist training, amplifying the need for automated approaches. Herein we describe an automated machine learning approach for determining the degree of warmth, a key component of EE, from acoustic and text natural language features. Our dataset of 52 interviews is drawn from recordings, collected over 20 years ago, of a nationally representative birth cohort of British twin children, and was manually coded for EE by two researchers (inter-rater reliability 0.84–0.90). We demonstrate that the degree of warmth can be predicted with an F1-score of 64.7% despite working with audio recordings of highly variable quality. These promising results suggest that machine learning may be able to assist in the coding of EE in the near future.


Introduction
Speech-based emotion recognition (SER) technologies and systems have been steadily increasing in prominence in the speech processing literature over the last two decades [1,2]. Typical SER approaches focus on one of two tasks: (i) the recognition of discrete emotions, typically the six 'basic' emotions identified by Ekman [3]; and (ii) continuous emotion recognition along a dimensional representation, typically arousal and valence [4]. SER applications can provide feedback within supportive technologies in healthcare systems [5,6]. For example, SER-influenced applications have been proposed in early-diagnosis settings. Recently, speech markers have been used to assist in the inference of Attachment Condition in school-age children [7]. This approach used an SER-style approach to recognise whether children were emotionally secure or insecure. Similarly, the tracking of emotional engagement can be used to assess the negative impact of dementia on communication [8].
The work presented herein is also based around the concept of automatic speech-based emotion recognition within a clinical application. Specifically, we present the analysis of Expressed Emotion (EE), a family environment concept based on caregivers speaking freely about the relative/family member in their care [9,10]. We focus on the automatic recognition of EE ratings from speech samples of caregivers talking about their 5-year-old twins. To the best of the authors' knowledge, this is the first time an automated approach for determining EE in such settings has been attempted. Adding to the challenges of this work is the quality of the audio recordings: they were originally made 20 years ago on audio cassette tapes. Despite the associated challenges, we are able to demonstrate that the degree of warmth, a component of EE, can be predicted with a 61.5% F1-score.
The rest of the paper is organised as follows: Section 2 presents the relevant background from the fields of psychiatry and psychology, Section 3 describes the set-up, data, and methods of this study, Section 4 contains the results, and concluding remarks are provided in Section 5.

Background
In the field of psychiatry, EE refers to the attitudes of caregivers towards their child and comprises criticism, hostility, and/or emotional over-involvement, as well as the degree of warmth shown. For over five decades, levels of EE within families have been studied by psychologists and psychiatrists to determine which adults with mental illness are likely to have the poorest outcomes [11,12,13]. EE was originally measured through in-depth face-to-face interviews but, due to time constraints, has subsequently been assessed through brief samples of caregivers speaking freely about their child. These interactions are known as the Five-Minute Speech Sample (FMSS) [14].
Coding of EE from these easy-to-collect speech samples focuses on the emotions that are apparent when the caregiver speaks about their child, drawing on both what is said and the tone of voice. This coding can contain clinically useful information. For example, EE rated from maternal speech samples plays a causal role in the development of antisocial behavioural problems in children [15] and subsequent serious mental illnesses [16]. Other studies have shown that ratings of negative emotions from parents' speech predict the onset and course of other mental health problems in children, including anxiety, depression, and attention-deficit hyperactivity disorder [17], underlining its usefulness as an early predictor of youth mental health difficulties. However, this promising prediction method is rarely used; the coding of speech is labour-intensive and requires highly trained raters. Moreover, human rating has potentially limited reproducibility, as it can be prone to drift and unconscious biases. Automating the assessment of EE could dramatically impact clinical practice, providing clinicians with an important guide to the likelihood that a young person will develop mental health problems and permitting them to target preventive interventions effectively and reduce the incidence of mental disorders.

Data
The interviews from the E-Risk study were recorded on cassette tapes using consumer-grade equipment available at the time. The cassettes were maintained in storage for 20 years and may have degraded during this time. The audio quality is highly variable, with frequent inaudible passages and white noise. The tapes required digitisation, which was carried out by an external contractor using professional equipment. Although the interviews follow a loose structure based on the use of standard prompt questions, they contain passages of overlapping speech. Furthermore, there is often background chatter, interruptions by young children, and other ambient noises. To enable an analysis of linguistic content and the training of an automatic speech recogniser (ASR), the interviews had to be transcribed. The significant cost of professional transcription and the small project budget limited the amount of material we were able to transcribe. Furthermore, the low quality of the audio complicated the process, and the resulting transcripts contained numerous alignment errors, inaccurate segmentation, incorrect time-stamping, missing passages, and some incorrectly rendered words. The final sample contained 37 transcribed interviews coded by human raters for EE, a small proportion of the total.

Methods
Our experiment aimed to classify interview samples for the level of warmth expressed by mothers towards their twins. We sought to assess and compare the efficacy of different combinations of features: acoustic-only, text-only, and both together. We adopted a strategy to increase the amount of available data. Using the Audacity audio editor (https://www.audacityteam.org/), we aligned the audio and transcripts and manually tagged all utterances to indicate the speaker, which twin was being referred to, and the content type of the utterance, following the loose interview structure. For example, the tag int-both-support indicates the interviewer asking about the level of support the mother received during pregnancy, relating to both twins. The tag mum-t1-away indicates a mother's description of her feelings when the elder twin (t1) is away from her. The equivalent tag for the younger twin is mum-t2-away. Our tagging schema included 38 distinct tags (19 interviewer prompt tags and 19 mother utterance tags). This tagging enabled us to double the size of the dataset by splitting each audio sample into utterances for the elder and younger twins. Thus, the dataset used for our experiments contained 74 samples. The EE ratings for each twin were used as target labels for classification.
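The per-twin splitting described above can be sketched as follows. This is an illustrative helper of our own, not the project's actual tooling; only the tag format (speaker-twin-content, e.g. mum-t1-away) comes from the text.

```python
# Hypothetical sketch of splitting a tagged interview into per-twin
# sub-samples. Tag names follow the examples in the text; the helper
# functions themselves are assumptions for illustration.

def parse_tag(tag: str) -> dict:
    """Split a tag like 'mum-t1-away' into speaker, twin, and content."""
    speaker, twin, content = tag.split("-", 2)
    return {"speaker": speaker, "twin": twin, "content": content}

def utterances_for_twin(tagged_utterances, twin: str):
    """Keep only mother utterances that refer to the given twin
    (or to both twins), yielding one sub-sample per twin."""
    keep = []
    for tag, text in tagged_utterances:
        fields = parse_tag(tag)
        if fields["speaker"] == "mum" and fields["twin"] in (twin, "both"):
            keep.append(text)
    return keep
```

Applying `utterances_for_twin` once per twin to each of the 37 interviews would yield the 74 samples used in the experiments.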
The original human-based coding of warmth from the E-Risk study consisted of 6 ordinal classes ranging from 0 to 5. However, the distribution of classes in the available dataset was imbalanced. We therefore merged classes into a 3-class schema to provide a more balanced distribution for training our models. The distribution of classes in the final dataset is shown in Figure 1. The resulting distribution, while not perfectly balanced, was more even than the original coding.
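Such a merge amounts to a simple label mapping. The grouping below (0–1 low, 2–3 moderate, 4–5 high) is an assumed example for illustration; the actual 6-to-3 correspondence used in the study is the one shown in Figure 1.

```python
# Illustrative sketch only: the exact 6-to-3 merge is given in Figure 1
# of the paper; this particular grouping is an assumption.
MERGE = {0: "low", 1: "low",
         2: "moderate", 3: "moderate",
         4: "high", 5: "high"}

def merge_warmth(labels):
    """Map the original 0-5 ordinal warmth codes onto 3 classes."""
    return [MERGE[x] for x in labels]
```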

Results
We used four different machine learning classifiers (with default parameters, except for increasing the number of iterations) to predict the degree of warmth for the individual twins: Logistic Regression (LR), Linear Support Vector Classifier (Lin-SVC), Random Forest (RF), and K-Nearest Neighbours (KNN). We used the Scikit-learn [18] library in Python to train the classifiers. To evaluate the performance of the classifiers, we ran classification tasks five times using stratified 5-fold cross-validation with shuffling. The final metric was the F1-score of the classifiers averaged over the runs.
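The paper does not spell out how the F1-score is averaged; the sketch below assumes macro averaging over the three classes, with the mean and standard deviation then taken over the repeated cross-validation runs. Both helpers are our own illustrative code, not the study's scripts.

```python
# Minimal sketch of the evaluation metric: macro-averaged F1 per run,
# then mean and standard deviation over the 5 repeated runs.
# Macro averaging is an assumption; the paper does not state it.

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

def mean_std(run_scores):
    """Mean and (population) standard deviation over repeated runs."""
    m = sum(run_scores) / len(run_scores)
    var = sum((x - m) ** 2 for x in run_scores) / len(run_scores)
    return m, var ** 0.5
```

In practice the per-fold scores would come from Scikit-learn's stratified cross-validation, with one `macro_f1` value per run fed into `mean_std`.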

Acoustic features
We extracted different frame-based acoustic features from the caregivers' audio segments using the OPENSMILE toolkit [19]. We took the average and standard deviation of the frame-level features to build fixed-length feature sets for training. These feature sets comprised the IS10-CMP, IS13-CMP, and IS16-CMP sets, introduced by the authors of OPENSMILE at various Interspeech paralinguistic challenges [22,23,24], and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [25].
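The mean/standard-deviation pooling step described above can be sketched in a few lines. The OpenSMILE extraction itself is omitted here; `frames` stands in for its frame-level output, and the helper name is ours.

```python
import numpy as np

# Sketch of the functionals step: frame-level acoustic features
# (one row per frame) are pooled into a fixed-length vector by taking
# the per-dimension mean and standard deviation and concatenating them.

def pool_functionals(frames: np.ndarray) -> np.ndarray:
    """(n_frames, n_features) -> (2 * n_features,) mean/std vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

This yields the same vector length for every segment regardless of its duration, which is what the classifiers require.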
Table 1 shows the average F1-score and standard deviation of the four classifiers on the different acoustic features. The results varied across combinations of features and classifiers. The RF classifier using IS10-CMP features and the LR classifier with eGeMAPS both achieved the best average F1-score of 59.9%, with the former showing the lower standard deviation (2.2% vs. 4.4%). KNN on both IS13-CMP and IS16-CMP scored lowest (F1=38.9%). Across all classifiers, IS13-CMP and IS16-CMP features had statistically significantly lower F1-scores than IS10-CMP features (p < 0.05). In addition, the results obtained by the RF classifier were significantly better than those of the other classifiers.
Text features
We preprocessed the transcripts using spaCy [31], tokenising the text, removing punctuation and lingering whitespace, and lowercasing all tokens. For the pre-trained language models, which are limited to sequences of 512 tokens, the text was divided into chunks of 512 tokens and passed to the models using a sliding window approach with 50% overlap; the average and standard deviation of the last three layers were computed to create features for classification. For TF-IDF and pre-trained language models, we trialled both the original transcripts with punctuation (reported in results with the suffix "-pun") and the pre-processed texts. Word embedding models used the pre-processed text only. For the embedding models, unknown words (words that did not appear in the model) were ignored. The final embedding representation for each transcript was the mean of the embeddings of all known word tokens in the transcript. In all runs, only caregiver utterances were retained, as this was the content used for the manual coding of warmth in the transcripts. We used the same classification models as for the acoustic features.
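The 50%-overlap sliding window over a long token sequence can be sketched as below. The function is an illustrative assumption of ours; the window size of 512 and the 50% overlap come from the text.

```python
# Sketch of the sliding-window chunking used to feed transcripts longer
# than 512 tokens to pre-trained language models, with 50% overlap
# between consecutive windows.

def sliding_chunks(tokens, size=512, overlap=0.5):
    """Split a token list into windows of `size` tokens whose starts
    advance by size * (1 - overlap) tokens."""
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

Each chunk would then be passed through the model, and the per-chunk hidden states pooled (mean and standard deviation of the last three layers) into one feature vector per transcript.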
Table 2 shows the average F1-score and standard deviation of the four classifiers on the different text features. The best average F1-score was achieved by the Lin-SVC classifier using the roberta-pun features (F1=53.7%), while the lowest performance was obtained with RF and bert features (F1=36.5%). The bert features also had significantly lower F1-scores (p < 0.05) than the roberta-pun features across all classifiers. However, the differences between the classifiers' performance were not statistically significant.

Combining acoustic and text features
The manual coding of warmth (and EE in general) relies on both interview content and voice features. We therefore sought to assess the use of both modalities in the classification task, using a combination of acoustic and text features to train the models. Since the IS10-CMP acoustic features yielded the best results on all classifiers, we combined these features with each of the text features, with the exception of BERT, due to its significantly lower performance in the text-only task. The average F1-scores of the classifiers on the combined features are shown in Table 3. The overall figures across different classifiers and features showed an increase in performance compared to both the acoustic-only and text-only experiments. The best performance was obtained by the Lin-SVC classifier using TF-IDF-PUN features (F1=61.5%). The second-best performance was obtained by the same classifier using TF-IDF features on the pre-processed text (F1=61%). A t-test showed no significant differences between the F1-scores across all features. However, the Lin-SVC and LR classifiers performed significantly better than KNN and RF.
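One straightforward way to combine the two modalities is early fusion: standardise each feature block and concatenate. The paper does not detail its normalisation scheme (a limitation it acknowledges), so the per-block z-scoring below is an assumption for illustration, not the study's method.

```python
import numpy as np

# Sketch of simple early fusion: z-score each modality's feature matrix
# column-wise, then concatenate row-wise. The normalisation choice here
# is an assumption; the paper does not specify one.

def zscore(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Column-wise standardisation with a small epsilon for stability."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

def fuse(acoustic: np.ndarray, text: np.ndarray) -> np.ndarray:
    """Concatenate standardised acoustic and text features per sample."""
    return np.hstack([zscore(acoustic), zscore(text)])
```

Without some such normalisation, the block with the larger numeric range can dominate distance-based classifiers like KNN and Lin-SVC.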
Figure 2 shows the Receiver Operating Characteristic (ROC) curves for a representative run of the best-performing classifier (Lin-SVC) on IS10-CMP combined with TF-IDF-PUN features. Figure 3 shows the corresponding confusion matrix (CM). The best ROC curve (AUC of 0.7487) was obtained for high warmth (the minority class). The highest accuracy was also achieved for this class (67%, or 6 out of 9), followed by the low class (the majority class) with an accuracy of around 61% (20/33). The most challenging class to predict was moderate, which had the highest degree of confusion (with low warmth) and an accuracy of 53%.

Conclusions
This study has highlighted some of the significant challenges of working with a dataset of real-world audio. First, the low quality of audio recorded on cassette tapes with low-grade equipment greatly complicated the task of transcription and significantly increased the investment required to make the dataset usable. Second, the cost of transcription and a restricted project budget limited the number of transcripts we could obtain. Manually tagging the utterances in the interviews proved an effective way of increasing the amount of data and enabled us to split interviews into two parts, one for each twin. Third, the small amount of available data imposed limitations on the choice of classification models we could use. It was not feasible to trial state-of-the-art models that require much larger amounts of data, such as deep neural networks. Finally, due to class imbalance, it was necessary to merge the classes of our target variable. Our experiments should, therefore, be seen as a proxy for the task of predicting true warmth scores as coded in the E-Risk study.
A further limitation relates to the normalisation when combining features. Despite these limitations, our results tentatively indicate that combining acoustic and text features is optimal when trying to predict the levels of caregiver warmth expressed in Five-Minute Speech Samples. This promising result suggests that machine learning classifiers may eventually be an adequate substitute for the manual coding of warmth, and EE more generally, by human raters.
In future work, we intend to prioritise the expansion of the dataset with additional transcriptions.A larger dataset would open up the possibility of using more sophisticated classification models.Ultimately, however, we aim to develop an approach based on automatic speech recognition in order to alleviate the burden of manual transcription.
We conclude by making recommendations for researchers faced with the challenges of working with real-world speech samples. Investing time in the manual preparation of data, such as tagging or other annotations, can help mitigate the effects of limited and low-quality data. We suggest adapting experiments to the limitations of the data by using a variety of established and recent feature extraction and machine learning methods. Finally, in initial experiments, consider using a combination of a reduced number of classes and conventional machine learning approaches, as this helps keep model complexity low when only small amounts of data are available.

Cohort Study
The Environmental Risk (E-Risk) Longitudinal Twin Study tracks the development of a nationally representative birth cohort of 2,232 British twin children born in England and Wales in 1994-1995. They have been comprehensively assessed during home visits at ages 5, 7, 10, 12 and 18 years (with 93% retention). The Joint South London and Maudsley and Institute of Psychiatry Research Ethics Committee approved each phase of the study. When the children were 5 years old, speech samples of approximately five minutes were audio-recorded from caregivers (almost exclusively mothers) in their homes to elicit expressed emotion about each child. Trained interviewers asked caregivers to describe each of their children ("For the next 5 minutes, I would like you to describe [child] to me; what is [child] like?"). The caregiver was encouraged to talk freely but, if they found this difficult, a series of semi-structured probes was used (e.g., "In what ways would you like [child] to be different?"). These speech samples were coded by two trained raters according to manualised guidelines with high inter-rater reliability (r=0.84-0.90). Ratings included the degree of dissatisfaction/negativity and the degree of warmth that the caregiver expressed towards each child (0=none to 5=high).

Figure 1: Distribution of caregiver warmth classes in the 3-way schema with corresponding 6-way schema numerical classes.

Figure 2: Receiver operating characteristic curves of the three-way Lin-SVC classification using IS10-CMP acoustic features combined with TF-IDF-PUN text features.

Figure 3: Confusion matrix of the three-way Lin-SVC classification using IS10-CMP acoustic features combined with TF-IDF-PUN text features.

Table 1: Average F1-score and standard deviation (5 runs, 5-fold cross-validation) of the four classifiers using different acoustic-only features.

Table 2: Average F1-score and standard deviation (5 runs, 5-fold cross-validation) of the four classifiers using different text-only features.

Table 3: Average F1-score and standard deviation (5 runs, 5-fold cross-validation) of the four classifiers using IS10-CMP acoustic features combined with different text features (Comb. Feat.).