Fig 1.
Summary of how we investigate the five worlds in this study.
Fig 2.
Methods used to go beyond the state of the art in each investigated world: beyond the Closed World (bottom left), both real and distractor labels are used; beyond the Clean World (upper left), six types of noise are applied at four signal-to-noise ratios (SNRs); beyond the Small World (upper middle), four data groups with different training sizes, two feature sets, and two models are optimised through 3-fold speaker-independent cross-validation (CV) in 16 experiments; beyond the One World (bottom middle), machine classification and human perception results are assessed through a one-to-one comparison of the confusion matrices (CMs); in the Fuzzy World (right), the confusion patterns of the perception and classification experiments are evaluated.
Fig 3.
Emotions used in the perception and ML experiments.
Of the 10 emotions, 6 are ‘real’ (framed and blue), with audio files used in all perceptual and ML experiments, and 4 are ‘distractors’ (italics and green), with audio files used only to train the ML models; ‘basic’ emotions are capitalised; the inner ellipse indicates no arousal connotations.
Fig 4.
Frequencies from 0 to 8 kHz (the range most important for speech) and amplitudes from -40 to 40 dB are shown for the artificial (brown, pink, white) and real-life (bell, rain, train station) noises. All samples are 10 s long and root-mean-square (RMS) normalised.
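The RMS normalisation mentioned for the noise samples, and the mixing of noise into speech at a chosen SNR, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the target RMS of 0.1, the 16 kHz sampling rate, the synthetic sine "speech", and the 10 dB SNR are all assumptions made for the example.

```python
import math
import random


def rms(signal):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def rms_normalise(signal, target_rms=0.1):
    """Scale a signal so that its RMS equals target_rms (assumed value)."""
    current = rms(signal)
    if current == 0:
        raise ValueError("cannot normalise a silent signal")
    gain = target_rms / current
    return [x * gain for x in signal]


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise ratio equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # SNR (dB) = 20 * log10(rms_speech / rms_noise), solved for the noise RMS.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 220 Hz sine as "speech" and Gaussian white noise.
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
white = rms_normalise([random.gauss(0.0, 1.0) for _ in range(16000)])
noisy = mix_at_snr(speech, white, snr_db=10)
```

Because the noise gain is derived from the RMS of both signals, the same noise file can be reused at several SNRs by changing only `snr_db`.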
Fig 5.
Main components of the ML workflow: Data groups (A, B, C, and D) represented according to the diverse sizes of their training set; Feature sets (ComParE and wav2vec2); and ML models (SVM and MLP).
Fig 6.
Distribution of real samples and distractors.
Partitioning across the three sets (training, development, test) and data groups (A, B, C, D) is indicated. The distribution of speakers is: Training (A = 2, B and C = 6 each, D = 16); Development and Test (A, B, C, D = 2 each).
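A speaker-disjoint partition with the speaker counts given for data group D (16 training, 2 development, 2 test) could be built along the following lines. This is only a sketch under assumptions: the speaker IDs, the random assignment, the seed, and the helper names `split_speakers` and `partition_samples` are hypothetical, and the source does not state how speakers were actually assigned.

```python
import random


def split_speakers(speaker_ids, n_train, n_dev, n_test, seed=0):
    """Assign speakers to disjoint training, development, and test sets."""
    needed = n_train + n_dev + n_test
    if len(set(speaker_ids)) < needed:
        raise ValueError("not enough distinct speakers for the requested split")
    shuffled = sorted(set(speaker_ids))
    random.Random(seed).shuffle(shuffled)
    train = set(shuffled[:n_train])
    dev = set(shuffled[n_train:n_train + n_dev])
    test = set(shuffled[n_train + n_dev:needed])
    return train, dev, test


def partition_samples(samples, train, dev, test):
    """Route (speaker_id, item) pairs to the set that owns the speaker."""
    parts = {"training": [], "development": [], "test": []}
    for speaker, item in samples:
        if speaker in train:
            parts["training"].append(item)
        elif speaker in dev:
            parts["development"].append(item)
        elif speaker in test:
            parts["test"].append(item)
    return parts


# Hypothetical speaker IDs; counts follow data group D in the caption.
speakers = [f"spk{i:02d}" for i in range(20)]
train, dev, test = split_speakers(speakers, n_train=16, n_dev=2, n_test=2)

# Two made-up samples per speaker, identified by a placeholder file name.
samples = [(s, f"{s}_utt{k}.wav") for s in speakers for k in range(2)]
parts = partition_samples(samples, train, dev, test)
```

Splitting on speaker IDs rather than on individual samples is what keeps the evaluation speaker-independent: no voice heard in training appears in development or test.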
Table 1.
Perceptual results for clean and noisified conditions.
Table 2.
Sums of ‘perceived as’ (hits and false alarms).
Table 3.
Confusion matrix for the perception of emotions by 132 listeners.
Fig 7.
Non-Metric Multi-Dimensional Scaling (NMDS).
The two-dimensional solutions represent (a) listeners’ perception and (b) automatic classification of hot anger (HO), panicked fear (PA), irritation (IR), depressed sadness (SA), elation (EL), and pleasure (PL) in the clean and rain-noise conditions. Kruskal’s stress for perception in (a): Clean (.115); Rain noise (.036); for classification in (b): Clean (.150); Rain noise (.114). In the bottom-left panel, the x-axis is mirrored so that the dimensions are displayed similarly for perception and classification.
Table 4.
ML results considering all conditions together.
Table 5.
ML performance excluding the distractors from the training set.
Table 6.
Confusion matrix for the classification of data group D with MLP and wav2vec2 features in clean and rain conditions.
Fig 8.
Recall per class and UAR (in %) for human perception and the MLP classifier.
The MLP is trained on data group D with wav2vec2 features. Results are given on EXP-2 considering all SNRs together for the noisy conditions. Mean across conditions (μ) is also given.
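Recall per class and unweighted average recall (UAR) can both be read off a confusion matrix, as in the sketch below. It assumes rows are true classes and columns are predicted (or perceived) classes; the counts are made up for illustration and are not taken from the study.

```python
def recall_per_class(cm):
    """Recall for each class: diagonal count divided by the row total."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def uar(cm):
    """Unweighted average recall: the plain mean of the per-class recalls."""
    recalls = recall_per_class(cm)
    return sum(recalls) / len(recalls)


# Illustrative 3-class confusion matrix (rows: true, columns: predicted).
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 5, 5],
]
recalls = recall_per_class(cm)  # 0.8, 0.6, 0.5
score = uar(cm)                 # about 0.633
```

Because every class contributes equally to the mean, UAR is not inflated by a classifier that favours the most frequent class.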