Perception and classification of emotions in nonsense speech: Humans versus machines

doi:10.1371/journal.pone.0281079

Fig 1.

The five worlds.

Summary of how we investigate the five worlds in this study.

Fig 2.

Overview of the study.

The methods used to go beyond the state of the art in each investigated world are illustrated. Beyond the Closed World (bottom left), both real and distractor labels are used. Beyond the Clean World (upper left), 6 types of noise are applied at 4 SNRs. Beyond the Small World (upper middle), four data groups with different training sizes, two feature sets, and two models are optimised through 3-fold speaker-independent cross-validation (CV) in 16 experiments. Beyond the One World (bottom middle), classification by machines and perception by humans are assessed through a one-to-one comparison of the confusion matrices (CMs). In the Fuzzy World (right), the confusion patterns of the perception and classification experiments are evaluated.
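The caption mentions 3-fold speaker-independent cross-validation. As a minimal sketch of what "speaker-independent" means in practice, the following assigns whole speakers to folds so that no speaker appears in both the training and the held-out part of a split. The function names, the greedy balancing strategy, and the toy speaker list are assumptions for illustration; the caption does not describe how the folds were actually built.

```python
from collections import defaultdict


def speaker_independent_folds(speaker_ids, n_folds=3):
    """Assign sample indices to folds so that no speaker spans two folds."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption): largest speakers first,
    # each one placed into the currently smallest fold.
    for spk in sorted(by_speaker, key=lambda s: (-len(by_speaker[s]), s)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_speaker[spk])
    return folds


def cv_splits(speaker_ids, n_folds=3):
    """Yield (train_indices, held_out_indices) for each fold."""
    folds = speaker_independent_folds(speaker_ids, n_folds)
    for k in range(n_folds):
        held_out = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, held_out


# Hypothetical toy data: 6 speakers with 2 samples each.
speakers = ["s1", "s1", "s2", "s2", "s3", "s3",
            "s4", "s4", "s5", "s5", "s6", "s6"]
splits = list(cv_splits(speakers))
```

Splitting by speaker rather than by sample prevents a model from scoring well merely by recognising a voice it has already seen during training.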

Fig 3.

Emotions used in the perception and ML experiments.

Of the 10 emotions, 6 are ‘real’: their audio files were used in all perceptual and ML experiments (framed, blue). The other 4 are ‘distractors’: their audio files were used only to train the ML models (italics, green). ‘Basic’ emotions are capitalised; the inner ellipse indicates no arousal connotations.

Fig 4.

Spectral distribution.

Frequencies from 0 to 8 kHz (the range most important for speech) and amplitudes from -40 to 40 dB are shown for the artificial (brown, pink, white) and the real-life (bell, rain, train station) noises. All samples are 10 s long and Root Mean Square (RMS) normalised.
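To illustrate how a noise sample can be added to speech at a chosen signal-to-noise ratio (SNR), here is a minimal sketch based on the standard RMS definition of SNR. The captions do not state the mixing procedure or the SNR values used in the study, so the function names, the 10 dB value, and the synthetic signals below are all assumptions.

```python
import math
import random


def rms(x):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 20*log10(rms(speech)/rms(scaled noise))
    equals snr_db, then add it to the speech sample by sample."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


random.seed(0)
# Synthetic stand-ins (hypothetical): a 220 Hz tone as "speech", Gaussian white noise.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

mixed, gain = mix_at_snr(speech, noise, snr_db=10)  # 10 dB is an assumed value
achieved = 20 * math.log10(rms(speech) / rms([gain * n for n in noise]))
```

Here `achieved` recovers the requested SNR up to floating-point error, which is a quick sanity check on the scaling.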

Fig 5.

Experimental design.

Main components of the ML workflow: Data groups (A, B, C, and D) represented according to the diverse sizes of their training set; Feature sets (ComParE and wav2vec2); and ML models (SVM and MLP).
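The three components listed in this caption combine into 4 × 2 × 2 = 16 configurations, which matches the 16 experiments mentioned in the overview of the study. A small sketch that enumerates the grid follows; the dictionary keys are illustrative names, not identifiers from the study.

```python
from itertools import product

data_groups = ["A", "B", "C", "D"]
feature_sets = ["ComParE", "wav2vec2"]
models = ["SVM", "MLP"]

# One dict per experiment; the key names are hypothetical.
experiments = [
    {"data_group": g, "features": f, "model": m}
    for g, f, m in product(data_groups, feature_sets, models)
]

print(len(experiments))  # 16
```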

Fig 6.

Distribution of real samples and distractors.

Partitioning across the three sets (training, development, test) and the four data groups (A, B, C, D) is indicated. The distribution of speakers is: Training (A = 2, B and C = 6 each, D = 16); Development and Test (A, B, C, D = 2 each).

Table 1.

Perceptual results for clean and noisified conditions.

Table 2.

Sums of ‘perceived as’ (hits and false alarms).

Table 3.

Confusion matrix for the perception of emotions by 132 listeners.

Fig 7.

Non-Metric Multi-Dimensional Scaling (NMDS).

The two-dimensional solutions represent (a) listeners’ perception and (b) automatic classification of hot anger (HO), panicked fear (PA), irritation (IR), depressed sadness (SA), elation (EL), and pleasure (PL), in clean conditions and in rain noise. Kruskal’s stress for perception in (a): clean (.115), rain noise (.036); for classification in (b): clean (.150), rain noise (.114). In the bottom-left panel, the x-axis is mirrored so that the dimensions are displayed similarly for perception and classification.
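For orientation, Kruskal's stress is commonly reported as stress-1, shown below. The caption does not say which variant was computed, so treating the reported values as stress-1 is an assumption. Here $d_{ij}$ is the distance between emotions $i$ and $j$ in the fitted two-dimensional configuration, and $\hat{d}_{ij}$ is the corresponding disparity obtained by monotone regression on the observed dissimilarities.

```latex
\text{Stress-1} = \sqrt{\frac{\sum_{i<j} \left( d_{ij} - \hat{d}_{ij} \right)^{2}}{\sum_{i<j} d_{ij}^{2}}}
```

Lower values indicate a closer fit between the low-dimensional configuration and the rank order of the dissimilarities.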

Table 4.

ML results considering all conditions together.

Table 5.

ML performance excluding the distractors from the training set.

Table 6.

Confusion matrix for the classification of data group D with MLP and wav2vec2 features in clean and rain conditions.

Fig 8.

Recall per class and UAR (in %) for human perception and the MLP classifier.

The MLP is trained on data group D with wav2vec2 features. Results are given on EXP-2 considering all SNRs together for the noisy conditions. Mean across conditions (μ) is also given.
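Recall per class and UAR (unweighted average recall) can both be read off a confusion matrix: recall is the diagonal entry divided by its row sum, and UAR is the plain mean of the per-class recalls. The sketch below assumes rows are true classes and columns are assigned classes; the three-class matrix is made-up toy data, not taken from the study's tables.

```python
def recall_per_class(cm, labels):
    """cm[i][j] = number of items of true class i assigned to class j."""
    recalls = {}
    for i, (label, row) in enumerate(zip(labels, cm)):
        total = sum(row)
        recalls[label] = row[i] / total if total else 0.0
    return recalls


def uar(cm, labels):
    """Unweighted average recall: mean of per-class recalls,
    regardless of how many items each class has."""
    recalls = recall_per_class(cm, labels)
    return sum(recalls.values()) / len(recalls)


# Toy data (hypothetical counts for three of the emotion labels).
labels = ["HO", "PA", "IR"]
cm = [
    [8, 1, 1],  # true HO
    [2, 6, 2],  # true PA
    [0, 5, 5],  # true IR
]

per_class = recall_per_class(cm, labels)  # HO 0.8, PA 0.6, IR 0.5
score = uar(cm, labels)
```

Because every class contributes equally to the mean, UAR is not inflated by good performance on an over-represented class, which makes it suitable for comparing human and machine results class by class.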
