
Speech emotion recognition using fine-tuned Wav2vec2.0 and neural controlled differential equations classifier

  • Ni Wang,

    Roles Data curation, Formal analysis, Investigation, Software, Visualization, Writing – original draft

    Affiliation College of Mathematics and Statistics, Chongqing University, Chongqing, China

  • Danyu Yang

    Roles Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Validation, Writing – review & editing

    danyuyang@cqu.edu.cn

    Affiliation College of Mathematics and Statistics, Chongqing University, Chongqing, China

Abstract

Speech emotion recognition (SER) has long been a popular yet challenging task with broad applications in areas such as social media communication and medical diagnostics. Because SER datasets typically have small data volumes and high complexity, effectively integrating and modeling audio data remains a significant challenge in this field. To address this, we propose a model architecture that combines fine-tuned Wav2vec2.0 with Neural Controlled Differential Equations (NCDEs): first, we use a fine-tuned Wav2vec2.0 model to extract rich contextual information; then we model the resulting high-dimensional time-series features with a Neural Controlled Differential Equation classifier. We parameterise the vector field as an MLP and update the model's hidden state by solving the controlled differential equation. We conducted SER experiments on the IEMOCAP dataset. Our model achieves a weighted accuracy (WA) of 73.37% and an unweighted accuracy (UA) of 74.18%. It also converges very quickly, reaching good accuracy after just one epoch of training, and exhibits excellent stability: the standard deviation of WA is 0.45% and that of UA is 0.39%.

Introduction

In the field of artificial intelligence, emotion recognition has emerged as a crucial research direction within natural language processing and computer vision. It has increasingly demonstrated vast application potential in systems for human-computer interaction, emotion monitoring, and intelligent services [1]. Among these technologies, speech-based emotion recognition stands out as a significant branch: because it can intuitively and naturally understand and respond to human emotions, it has broad application prospects in fields such as social media analysis, intelligent customer service, and assistive diagnosis [2].

The key to speech emotion recognition (SER) lies in extracting, from limited, small-scale datasets, features that represent complex emotional expressions [3]. Research currently proceeds along two main lines: acoustic features and linguistic content. Using deep learning to process the textual content of speech is a relatively mature research field [4]. Although speech-to-text technology has advanced considerably, textual content alone is not yet an effective basis for SER, because it discards many fundamental acoustic carriers of emotion such as pitch frequency, volume, and formants [5]. Another critical aspect is that temporal information in speech plays an important role in emotion recognition. Temporal information refers to changes in the speech signal over time, including dynamic features such as rhythm variations and pitch fluctuations, which help models capture subtle changes in emotional expression that are less apparent in static acoustic features. Accordingly, traditional pipelines that extract low-level descriptors (LLDs) and then apply manual operations such as statistical pooling have not yielded satisfactory experimental results [6]. Directly extracting time-series data from the speech signal is therefore a better approach to these challenges.

Currently, there are many methods for directly extracting features from speech, such as spectrograms, Mel-spectrograms, and MFCCs [7–9]. However, traditional extraction methods rely on intricate feature engineering, requiring manual design and selection of speech features. Wav2vec2.0 [10], an advanced end-to-end speech representation model based on Transformers, learns rich representations from large-scale unlabeled speech data through pre-training. These features not only encompass the physical attributes of speech but also implicitly contain information about the speaker's emotional state. As a result, Wav2vec2.0 has demonstrated excellent performance in tasks such as speech recognition [11] and speech emotion recognition [12]. [13,14] proposed fine-tuned models based on wav2vec that achieve good results in emotion recognition tasks, and [15,16] have recently used the representations extracted by the model as features for emotion recognition. [17] achieved a UA of 74.3% through data augmentation and a fine-tuned Wav2vec2.0 model. We replicated their model, replacing the downstream model with average pooling over the time dimension, and obtained a UA of 71.7% and a WA of 71.96%; the gap likely arises because we did not use any data augmentation and trained directly on the features extracted by the upstream model. However, data augmentation usually consumes resources and manpower, so our end-to-end pipeline is more portable.

To effectively model time-series data for SER, various research methods have been proposed. [18,19] introduced Hidden Markov Models; [20,21] utilized Support Vector Machines. In addition, various deep neural network models have been applied to this task with outstanding experimental results [22–24]. On the IEMOCAP dataset, [22] proposed a novel weighted time-pooling strategy and achieved 63.5% WA and 58.8% UA; [23] used a higher-complexity combined convolution-LSTM model to achieve 68.8% WA and 59.4% UA; and [24] achieved 70.6% WA with a TDNN-LSTM model. The Neural Controlled Differential Equations (NCDEs) model [25], a novel deep learning approach, is particularly suitable for handling data with continuous time-series characteristics, such as speech signals. The core idea of NCDEs is to model the evolution of dynamic systems through controlled differential equations. By combining the high-dimensional features extracted by fine-tuned Wav2vec2.0 with the dynamic modeling capabilities of the NCDEs model, our work achieves state-of-the-art accuracy in SER.

The main contributions of this paper are as follows: (1) We propose a novel model architecture that combines fine-tuned Wav2vec2.0 with Neural Controlled Differential Equations (NCDEs) for speech emotion recognition. Compared with similar baseline models, the NCDEs model demonstrates significant advantages in SER on the IEMOCAP dataset, including high accuracy, exceptional stability, and fast convergence. (2) The NCDEs model automatically learns the dynamic changes in speech signals without manual design or selection of specific features, avoiding human intervention. This is particularly valuable for handling highly complex and nuanced emotional expressions. Our code is available at https://github.com/Ni-W/NCDE-wav2vec2.

1 Related work

1.1 Fine-tuned wav2vec2.0 model

Wav2vec2.0[10], developed by Facebook AI Research, builds on the original wav2vec model by introducing a more complex self-supervised mechanism. This allows the model to learn speech representations without labeled data by predicting missing parts of the input sequence. By incorporating the Transformer mechanism, the end-to-end wav2vec2.0 model avoids the need for manual feature design and selection, enabling it to more effectively capture subtle variations in speech data, thereby improving the accuracy of speech recognition and processing. The wav2vec2.0 model has demonstrated excellent performance in tasks such as automatic speech recognition and speaker verification[10,12,26,27].

Recently, pre-trained wav2vec2.0 models have been fine-tuned on target datasets and then used as feature extractors to obtain features better suited to specific tasks [28,29]. Inspired by this approach, we use the fine-tuned wav2vec2.0 model proposed by Li-Wei Chen et al. [17], P-TAPT (Pseudo-label Task Adaptive Pretraining), as the upstream model to extract features suitable for SER. P-TAPT fine-tunes the model by adding a position-wise linear head and running a k-means clustering algorithm, overcoming the limitations of fixed frequency perception scales and resolutions. This approach provides more descriptive, higher-level feature representations, enhancing the model's ability to capture rapidly changing signals and detailed features, and it has achieved good accuracy on the IEMOCAP dataset. The features extracted by P-TAPT contain rich contextual information and emotional representations, making them very valuable for understanding language content and recognizing emotion.

1.2 Neural controlled differential equations

Neural Controlled Differential Equations (NCDEs), an innovative mathematical tool in modern data science, were first proposed by Patrick Kidger et al. in 2020 [25]. NCDEs are commonly used to address issues related to irregularly sampled data. In the speech emotion recognition experiments of this study, audio recordings have a fixed sampling rate, yet speech signals exhibit rich dynamic variations along the time dimension. The advantage of NCDEs lies in their ability to capture the intrinsic dynamic properties of signals, such as pitch fluctuations and tempo variations. This enables the model to extract hidden, high-level information within speech features even under fixed sampling conditions, thereby facilitating emotion recognition. On one hand, controlled differential equations (CDEs) extend ordinary differential equations (ODEs) with the concept of a control path, providing a mathematical description of external interventions in a system; NCDEs extend this framework by integrating neural networks, enhancing the model's ability to approximate the complex dynamics of nonlinear systems. On the other hand, rough path theory supports the universality of NCDEs. Rough path theory [30], proposed by Terry Lyons in 1998, studies highly irregular paths, such as paths without bounded variation, which arise in settings such as financial data [31] and experimental physics [32]. By enriching the path with higher-order information, it enables the definition of integrals against more irregular paths.

NCDEs can be viewed as a continuous generalization of RNN models, interpreting input streams as evolving in continuous time. This allows NCDEs to exhibit outstanding advantages in handling complex time-series data, as demonstrated on datasets like CharacterTrajectories [33] and the PhysioNet 2019 Challenge [34]. In [25], NCDEs substantially improved convergence speed and model accuracy in speech command recognition tasks [35]. NCDEs naturally model the temporal dependencies of data, enabling the model to capture more subtle changes. Therefore, compared with traditional mathematical models and machine learning methods, NCDEs are more efficient and accurate at handling continuous time-series data.

2 Proposed approach

In this section, we first introduce the relevant definitions and theoretical foundations of NCDEs. Next, we present the neural network model we have constructed. The overall design of the model is shown in Fig 1.

2.1 Definition of NCDEs

We first review the definition of the solution to controlled differential equations (CDEs). Let $T > 0$ and let $v, w \in \mathbb{N}$. Let $x \colon [0,T] \to \mathbb{R}^v$ be a continuous path of bounded variation. Let $f \colon \mathbb{R}^w \to \mathbb{R}^{w \times v}$ be Lipschitz continuous. Let $y_0 \in \mathbb{R}^w$.

A continuous path $y \colon [0,T] \to \mathbb{R}^w$ is said to solve a controlled differential equation, controlled by $x$, if

$$y_t = y_0 + \int_0^t f(y_s)\,\mathrm{d}x_s \quad \text{for } t \in (0, T],$$

where the integral is a Riemann–Stieltjes integral.

Let $f_\theta \colon \mathbb{R}^w \to \mathbb{R}^{w \times v}$ be a neural network depending on parameters $\theta$. Let $\zeta_\theta \colon \mathbb{R}^v \to \mathbb{R}^w$ be a neural network depending on parameters $\theta$ that maps the initial observation to the initial hidden state. And let $\ell_\theta \colon \mathbb{R}^w \to \mathbb{R}^c$ be a neural network that maps the terminal value of the solution to the target space. Neural networks are typically composed of Lipschitz functions, so a neural network satisfies the Lipschitz condition. $f_\theta$, $\zeta_\theta$ and $\ell_\theta$ will often be parameterised as MLPs.

Define a Neural Controlled Differential Equation:

$$y_0 = \zeta_\theta(x_0), \qquad y_t = y_0 + \int_0^t f_\theta(y_s)\,\mathrm{d}x_s \quad \text{for } t \in (0, T].$$

The hidden state $y$ varies according to $x$, much as the hidden state of an RNN varies with its input sequence.

In the NCDEs model, the input data stream serves as the driving path $x$, and the output is the image under $\ell_\theta$ of the terminal value of the solution to the controlled differential equation.
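To make the controlled integral concrete, here is a minimal first-order (Euler) discretization in plain NumPy. The function `solve_cde` and the constant vector field are illustrative toys of our own, not the paper's model; for a constant field $f(y) = A$ the increments telescope, so the terminal value equals $y_0 + A(x_T - x_0)$ exactly and can be checked.

```python
import numpy as np

def solve_cde(f, y0, xs):
    """Euler discretization of y_t = y_0 + ∫ f(y_s) dx_s.

    f maps y (shape (w,)) to a (w, v) matrix; xs is the sampled control
    path, shape (T, v). Each step adds the increment f(y) @ Δx.
    """
    y = np.array(y0, dtype=float)
    for k in range(len(xs) - 1):
        y = y + f(y) @ (xs[k + 1] - xs[k])   # Riemann–Stieltjes increment
    return y

# Toy constant vector field (w = 3, v = 2) and a random control path.
A = np.array([[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]])
xs = np.cumsum(np.random.default_rng(0).normal(size=(20, 2)), axis=0)
y0 = np.zeros(3)
y = solve_cde(lambda y: A, y0, xs)           # equals y0 + A @ (xs[-1] - xs[0])
```

In the full model $f$ is the neural vector field $f_\theta$ and the Euler steps are replaced by an adaptive differential equation solver.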

2.2 Universal approximation with NCDEs

Based on Theorem C.25 in Appendix C.2.1 of [36], CDE can represent general functions on data streams. Accordingly, the NCDEs model also possesses universality.

Any continuous function on a compact subset of the set of continuous bounded variation paths can be uniformly approximated by the output of the NCDEs model.

In brief, given a compact set of data streams, any continuous function on this compact set can be uniformly approximated by NCDEs under the topology of uniform convergence. The proof of this theorem relies on the boundedness of interpolation and the definition of the topology on time series, as well as the classical universal approximation theorems for neural networks and for signatures. This theorem theoretically ensures that the NCDEs model is a universal approximator, so the class of functions NCDEs can approximate is sufficiently rich.

2.3 NCDEs classifier

The NCDEs classifier is implemented in JAX. JAX leverages the XLA (Accelerated Linear Algebra) compiler to optimize code into efficient low-level instructions, providing a high-performance and user-friendly platform for deep learning tasks. The goal is to achieve more accurate predictions for speech emotion recognition. The model is constructed as follows:

2.3.1 Transposed convolution module.

Using the fine-tuned wav2vec2.0 model, audio files are converted into data streams. Although the extracted features are rich, their dimension is as high as 768. Training directly at this dimension would waste expensive computational resources and could exhaust memory. Furthermore, our experiments show that reducing the feature dimension before training significantly shortens training time (nearly sixfold for NCDEs and nearly twofold for the other baseline models). Therefore, given limited computational resources, and to save time, reduce computational complexity, and minimize the risk of overfitting, we reduce the feature dimension for all models with a 1D ConvTranspose layer before training. This operation also improves the accuracy of the NCDE classifier. Through convolution operations, the network captures local patterns and correlations in the input data, retaining the important information in the data stream and yielding a more efficient and practical model.
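As a sketch of this dimension-reduction step: with kernel size 1, a (transposed) 1D convolution over channels reduces to a per-frame linear projection, shown below in NumPy. The kernel size, the target dimension 64, and the random weights are illustrative assumptions, not the paper's actual layer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_in, C_out = 256, 768, 64          # C_out = 64 is an assumed target dimension

# A kernel-size-1 1D (transposed) convolution over channels is a
# per-frame linear map, i.e. one weight matrix shared across time.
W = rng.normal(scale=C_in ** -0.5, size=(C_in, C_out))

def reduce_channels(feats: np.ndarray) -> np.ndarray:
    """Project (T, 768) wav2vec2.0 features down to (T, C_out) per frame."""
    return feats @ W

x = rng.normal(size=(T, C_in))          # stand-in for one feature segment
z = reduce_channels(x)                  # shape (256, 64)
```

A larger kernel would additionally mix neighbouring frames, which is what lets the convolution capture local temporal patterns as described above.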

2.3.2 NCDE module.

First, a continuous control path is constructed using cubic interpolation [37]. This interpolation meets the theoretical requirements for the universal approximation capability of NCDEs and enables the solver to run quickly. Second, an MLP maps the data to the initial hidden state of the CDE; it effectively captures the nonlinear characteristics of the input data and provides a starting point for the subsequent dynamic evolution. Third, the core of this module is a neural vector field $f_\theta$, which consists of several MLPs with dropout layers; $f_\theta$ updates the model's state at each time step. The linear layers transform the data to a high-dimensional space of dimension 220, while the dropout layers effectively prevent overfitting, ensuring good generalization. Finally, the model solves the controlled differential equation with a differential equation solver from the JAX Diffrax library.

2.3.3 Output module.

A linear layer maps the terminal value of the solution of the differential equation to the prediction space, and a Softmax activation function then outputs the probability distribution over the classes. Finally, batch normalization is applied to further enhance the stability of the training process.
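The output mapping amounts to a linear layer followed by softmax, sketched below in NumPy with toy random weights (hidden size 40 and four emotion classes; the batch-normalization step described above is omitted here).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(40, 4)), np.zeros(4)  # toy weights: hidden 40 → 4 classes

def classify(terminal_state):
    """Map the CDE solution's terminal value to class probabilities."""
    return softmax(terminal_state @ W + b)

probs = classify(rng.normal(size=40))          # four probabilities summing to 1
```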

3 Experiments and analysis

3.1 Data

The IEMOCAP (interactive emotional dyadic motion capture) database was produced by the University of Southern California [38]. It includes five sessions, each containing dialogues between two individuals along with the corresponding labeled speech transcripts (at the phoneme and word levels). We did not use the transcripts for training, only the speech data; this removes the manual text-processing step and greatly saves manpower and material resources. Each session includes both male and female voices to eliminate gender bias. The IEMOCAP dataset is divided into scripted and improvised dialogues, effectively minimizing the impact of the speakers' own emotions on the speech. We used all the speech files for training, which is a common approach [39,40]. The dataset comprises nine emotion categories, but following previous work in SER [6,17], this experiment only considers four: Angry, Happy/Excited, Neutral, and Sad, where Happy and Excited are merged into a single category. The distribution of these categories in the dataset is as follows: Neutral (48.8%), Happy/Excited (12.3%), Sad (26.9%), and Angry (12%).

3.2 Feature extraction

Li-Wei Chen et al. [17] fine-tuned wav2vec2.0 specifically for speech emotion recognition on the IEMOCAP dataset, resulting in richer feature representations and substantially improved model performance. We use the upstream part of their fine-tuned wav2vec2.0 model, P-TAPT, as the feature extractor; the model and parameters are obtained from [17]. For the IEMOCAP dataset, we extract speech features by segment, with a selected time dimension of 256. Speech segments shorter than 256 are padded with zeros, and from longer segments we extract a central segment of length 256, since the middle of a recording typically contains the most informative content. With this scheme, only 16.1% of the data went unused. We thus obtained a dataset standardized along the time dimension, comprising 5531 feature segments from the five sessions. For validation, we select one session at a time while training on the remaining four, performing five-fold cross validation.
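The pad-or-crop preprocessing described above can be sketched as follows; the function name `fix_length` is ours, and the choice of padding at the end rather than symmetrically is an assumption.

```python
import numpy as np

def fix_length(feats: np.ndarray, target: int = 256) -> np.ndarray:
    """Standardize the time axis (axis 0) of a (T, d) feature segment.

    Segments shorter than `target` are zero-padded; longer segments are
    replaced by a central crop of length `target`.
    """
    t = feats.shape[0]
    if t < target:                       # zero-pad short segments at the end
        pad = np.zeros((target - t,) + feats.shape[1:], dtype=feats.dtype)
        return np.concatenate([feats, pad], axis=0)
    start = (t - target) // 2            # central segment of long recordings
    return feats[start:start + target]
```

For example, a 100-frame segment becomes 100 real frames followed by 156 zero frames, and a 300-frame segment keeps frames 22 through 277.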

3.3 Experimental setup

For each model in the experiment, training ends after a uniform 15 epochs. The hyperparameters of each model were optimized individually, and cross-entropy loss was used for optimization. We report two common metrics for classification performance: weighted accuracy (WA) and unweighted accuracy (UA). For each session, the experiment was repeated five times, and the average accuracy and standard deviation are reported.
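The two metrics can be computed as below. The formulas follow the usual SER convention (WA: overall accuracy over all samples; UA: unweighted mean of per-class recalls), which the text does not spell out explicitly.

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """Weighted accuracy (overall accuracy) and unweighted accuracy
    (mean per-class recall), following the common SER convention."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

# Toy labels: class 0 recall is 2/3, class 1 recall is 1.
wa, ua = wa_ua([0, 0, 0, 1], [0, 0, 1, 1])   # wa = 0.75, ua ≈ 0.833
```

On imbalanced data such as IEMOCAP the two can diverge noticeably, since UA weights every class equally regardless of its frequency.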

We selected a series of neural network models related to NCDEs as baselines: GRU-Δt [25], GRU-D [41], ODE-RNN [42], SigModel [43], and Average-Pooling [17]. Among these, GRU-Δt, GRU-D, and ODE-RNN were proposed by [25] as baselines for comparison with NCDEs. GRU-Δt is a GRU that additionally takes the time difference between observations as input; GRU-D builds on GRU-Δt by incorporating learned exponential decay between observations; ODE-RNN extends GRU-Δt by applying learned Neural Ordinary Differential Equations [36] to the hidden state between observations. Both SigModel and NCDEs are theoretically grounded in rough path theory. SigModel computes the third-order signature to capture higher-order features of the data and then classifies with an MLP; we refer to this architecture as SigModel. The signature, a core concept of rough path theory, is composed of the iterated integrals of a path and is a special case of NCDEs. We implemented this model using the signatory library in PyTorch. We also replicated the downstream model of [17] within our framework, which we call the Average-Pooling model: the features extracted by P-TAPT are averaged across the time dimension and then classified with an MLP. Possibly because we do not use any of the data augmentation techniques mentioned in [17], the accuracy obtained by this replication is 71.96% (WA) and 71.70% (UA). For our model, after extracting features with the upstream model of [17], we replace the downstream model with the NCDEs model and obtain 73.37% (WA) and 74.19% (UA). Moreover, our model is more stable and converges faster.

3.4 Parameter selection

The choice of hyperparameters is crucial for model training. Taking the speech command recognition experiment in [25] as a reference, we performed a grid search to determine the optimal parameters. The batch size started at 256 and was gradually decreased (128, 64, 32); the dropout rate started at 0.4 and was decreased by 0.05 each time (0.35, 0.3, 0.25, 0.2); the hidden layer size started at 40 and was adjusted (20, 80, 160). For the learning rate, we tested a series of values: 0.001, 0.002, 0.005, 0.01, 0.016, etc. We found that a lower learning rate achieved higher classification accuracy: although each training epoch took slightly longer than with a higher learning rate, accuracy improved by 1%–2%, and the lower learning rate also showed advantages in model stability. Through these comparisons, we settled on the following hyperparameters: batch size 32, dropout rate 0.3, hidden layer size 40, and learning rate 0.001.
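The search described above amounts to exhaustive evaluation over a small grid, sketched below; the `evaluate` callback stands in for one training-plus-validation run and is a hypothetical placeholder, as is the exact grid.

```python
from itertools import product

# Hypothetical grid mirroring the values listed in the text.
GRID = {
    "batch_size": [256, 128, 64, 32],
    "dropout": [0.4, 0.35, 0.3, 0.25, 0.2],
    "hidden_size": [40, 20, 80, 160],
    "lr": [0.001, 0.002, 0.005, 0.01, 0.016],
}

def grid_search(evaluate):
    """Return the configuration with the highest validation accuracy."""
    best_cfg, best_acc = None, -1.0
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        acc = evaluate(cfg)              # train + validate one configuration
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```

In practice each `evaluate` call is expensive, which is why the text describes varying one hyperparameter at a time from a reference setting rather than exhausting the full product.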

3.5 The impact of data class imbalance

The confusion matrix heatmap in Fig 2 clearly shows that the IEMOCAP dataset exhibits class imbalance. Class 3 (neutral) has the highest number of correctly classified samples (225.40, averaged over five runs) and relatively few misclassifications, indicating that the model performs best on this category. This is likely because Class 3 (neutral) has the largest proportion of samples, making the model more inclined to learn its features during training. In contrast, Class 2 (sad) has fewer correctly classified samples (164.40) and a significant number of samples misclassified as Class 3 (neutral) (48.00). This may be due to the smaller sample size of Class 2 (sad), which makes it harder for the model to extract sufficient features. The imbalance reflected in the heatmap aligns with the characteristics of the IEMOCAP dataset and demonstrates that class imbalance indeed poses challenges for training. During training, the model prioritizes features of the majority classes, so at test time it tends to misclassify samples of other classes into these majority classes. Additionally, the loss contribution of the majority class dominates the overall loss, while minority-class samples contribute less to gradient updates; the model therefore struggles to adjust its parameters optimally for minority classes, reducing their classification accuracy and limiting generalization. This highlights one reason why training models on the IEMOCAP dataset remains challenging.
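The per-class counts discussed here come from a confusion matrix, which can be computed as below; the mapping of indices 0 and 1 to specific emotions is our assumption, chosen to match the Class 2 = sad and Class 3 = neutral labels used above.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=4):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example: one "sad" (class 2) sample misclassified as "neutral" (class 3).
cm = confusion_matrix([3, 3, 2, 2, 0], [3, 3, 3, 2, 0])
```

Averaging such matrices over the five repeated runs yields the fractional counts (e.g. 225.40) reported in the text.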

3.6 Results and discussion

As Table 1 shows, our model achieves the highest WA and UA among all tested models: 73.37% and 74.19%, respectively. Compared with average pooling over the time dimension, these are improvements of 1.41% and 2.49%. In addition, our model is very stable, with standard deviations of 0.45% and 0.39%, the smallest among all models. SigModel also shows good stability: as a special case of NCDEs, the signature-based model has standard deviations below 1%. This indicates that models grounded in rough path theory can indeed learn rich contextual features and yield more stable models. A likely reason is that such models account for the continuous evolution of the speech signal over time, with the hidden state also changing continuously with the input; this makes the model insensitive to the small perturbations commonly found in the data and lets it adapt more smoothly to different emotional expressions, leading to a significant improvement in stability.

Figs 3 and 4 present the WA and UA, respectively, achieved by each model in the five sessions. The NCDEs classifier clearly performs best in every session. In addition, every model performs better on Sessions 2 and 4 than on the other sessions. This may be related to the uneven distribution of the datasets themselves: the training sets for these two sessions are also larger.

Fig 3. A line chart of the WA accuracy achieved by each baseline model in 5 sessions.

https://doi.org/10.1371/journal.pone.0318297.g003

Fig 4. A line chart of the UA accuracy achieved by each baseline model in 5 sessions.

https://doi.org/10.1371/journal.pone.0318297.g004

Fig 5 shows the UA for each of the first 15 training epochs. Our model converges very quickly: it reaches a UA of 73.37% in the first epoch and maintains high accuracy in every subsequent epoch. The situation for WA is similar, reaching 72.88% in the first epoch. This lets our model greatly save computing resources and improve efficiency, giving it practical application value. The accuracies of the other baseline models are often very low in the first few epochs.

Overall, our proposed model, NCDEs Classifier combined with fine-tuned wav2vec2.0, has an outstanding performance in the speech emotion classification task on the IEMOCAP dataset. Our model has not added any data augmentation operations and has good portability. At the same time, the model has high accuracy and a fast convergence speed. The fluctuation of the model is very small, indicating high stability.

Conclusion

In the task of speech emotion recognition, unlike existing methods, this paper proposes a model based on Neural Controlled Differential Equations under rough path theory. Specifically, we update the model’s hidden state by solving controlled differential equations, which has outstanding advantages in handling complex time-series data. Experiments on the IEMOCAP dataset further demonstrate that our proposed model exhibits superior performance and great practical value. In future work, we plan to implement it on other datasets, such as the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, to further verify the portability of our model.

References

  1. Ayata D, Yaslan Y, Kamasak ME. Emotion recognition from multimodal physiological signals for emotion aware healthcare systems. J Med Biol Eng. 2020 Apr;40:149–57.
  2. Swain M, Routray A, Kabisatpathy P. Databases, features and classifiers for speech emotion recognition: a review. Int J Speech Technol. 2018;21(1):93–120.
  3. Tahon M, Devillers L. Towards a small set of robust acoustic features for emotion recognition: challenges. IEEE/ACM Trans Audio Speech Lang Process. 2015 Oct;24(1):16–28.
  4. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems; 2015 Dec 7–12; Montreal, Canada. Curran Associates; 2015. p. 28.
  5. Jin Q, Li C, Chen S, Wu H. Speech emotion recognition with acoustic and lexical features. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing; 2015 Apr 19–24; Brisbane, Australia. IEEE; 2015. p. 4749–53.
  6. Wang B, Liakata M, Ni H, Lyons T, Nevado-Holgado AJ, Saunders K. A path signature approach for speech emotion recognition. In: Interspeech; 2019 Sep 15–19; Graz, Austria. ISCA; 2019. p. 1661–5.
  7. Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process. 1980 Aug;28(4):357–66.
  8. Junqua JC, Haton JP. Robustness in automatic speech recognition: fundamentals and applications. Berlin: Springer; 2012.
  9. Yenigalla P, Kumar A, Tripathi S, Singh C, Kar S, Vepa J. Speech emotion recognition using spectrogram & phoneme embedding. In: Interspeech; 2018 Sep 2–6; Hyderabad, India. ISCA; 2018. p. 3688–92.
  10. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems; 2020 Dec 6–12; Vancouver, Canada. Curran Associates; 2020;33. p. 12449–60.
  11. Yi C, Wang J, Cheng N, Zhou S, Xu B. Transfer ability of monolingual wav2vec 2.0 for low-resource speech recognition. In: 2021 International Joint Conference on Neural Networks; 2021 Jul 18–22; Shenzhen, China. IEEE; 2021. p. 1–6.
  12. Pepino L, Riera P, Ferrer L. Emotion recognition from speech using wav2vec 2.0 embeddings. In: Interspeech; 2021 Aug 30–Sep 3; Brno, Czech Republic. ISCA; 2021. p. 3400–4.
  13. Baevski A, Schneider S, Auli M. vq-wav2vec: self-supervised learning of discrete speech representations. In: International Conference on Learning Representations; 2019 May 6–9; New Orleans, LA, USA. ICLR; 2019.
  14. Gao Y, Chu C, Kawahara T. Two-stage finetuning of wav2vec 2.0 for speech emotion recognition with ASR and gender pretraining. In: Interspeech; 2023 Aug 20–24; Dublin, Ireland. ISCA; 2023. p. 3637–41.
  15. Siriwardhana S, Reis A, Weerasekera R, Nanayakkara S. Jointly fine-tuning "BERT-Like" self supervised models to improve multimodal speech emotion recognition. In: Interspeech; 2020 Oct 25–29; Shanghai, China. ISCA; 2020. p. 3755–9.
  16. Macary M, Tahon M, Estève Y, Rousseau A. On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition. In: 2021 IEEE Spoken Language Technology Workshop; 2021 Jan 19–22; Shenzhen, China. IEEE; 2021. p. 373–80.
  17. Chen LW, Rudnicky A. Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition. In: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing; 2023 Jun 4–10; Rhodes, Greece. IEEE; 2023. p. 1–5.
  18. Schuller B, Rigoll G, Lang M. Hidden Markov model-based speech emotion recognition. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing; 2003 Apr 6–10; Hong Kong, China. IEEE; 2003. p. II–1.
  19. Sato N, Obuchi Y. Emotion recognition using mel-frequency cepstral coefficients. Inf Media Technol. 2007;2(3):835–48.
  20. Shen P, Changjun Z, Chen X. Automatic speech emotion recognition using support vector machine. In: 2011 International Conference on Electronic & Mechanical Engineering and Information Technology; 2011 Aug 12–14; Harbin, China. IEEE; 2011. p. 621–5.
  21. Rao KS, Koolagudi SG, Vempada RR. Emotion recognition from speech using global and local prosodic features. Int J Speech Technol. 2012;16(2):143–60.
  22. Mirsamadi S, Barsoum E, Zhang C. Automatic speech emotion recognition using recurrent neural networks with local attention. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing; 2017 Mar 5–9; New Orleans, LA, USA. IEEE; 2017. p. 2227–31.
  23. Satt A, Rozenberg S, Hoory R. Efficient emotion recognition from speech using deep learning on spectrograms. In: Interspeech; 2017 Aug 20–24; Stockholm, Sweden. ISCA; 2017. p. 1089–93.
  24. Sarma M, Ghahremani P, Povey D, Goel NK, Sarma KK, Dehak N. Emotion identification from raw speech signals using DNNs. In: Interspeech; 2018 Sep 2–6; Hyderabad, India. ISCA; 2018. p. 3097–101.
  25. Kidger P, Morrill J, Foster J, Lyons T. Neural controlled differential equations for irregular time series. In: Advances in Neural Information Processing Systems; 2020 Dec 6–12; Vancouver, Canada. Curran Associates; 2020;33. p. 6696–707.
  26. Zhao Z, Wang Y, Wang Y. Multi-level fusion of wav2vec 2.0 and BERT for multimodal emotion recognition. In: Interspeech; 2022 Sep 18–22; Incheon, South Korea. ISCA; 2022. p. 4725–29.
  27. Fan Z, Li M, Zhou S, Xu B. Exploring wav2vec 2.0 on speaker verification and language identification. In: Interspeech; 2021 Aug 30–Sep 3; Brno, Czech Republic. ISCA; 2021. p. 1509–13.
  28. Becerra H, Ragano A, Hines A. Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction. In: Interspeech; 2022 Sep 18–22; Incheon, South Korea. ISCA; 2022. p. 4088–92.
  29. Xia Y, Chen LW, Rudnicky A, Stern RM. Temporal context in speech emotion recognition. In: Interspeech; 2021 Aug 30–Sep 3; Brno, Czech Republic. ISCA; 2021. p. 3370–74.
  30. Lyons TJ. Differential equations driven by rough signals. Rev Mat Iberoam. 1998;14(2):215–310.
  31. Allan AL, Cuchiero C, Liu C, Prömel DJ. Model-free portfolio theory: a rough path approach. Math Financ. 2023;33(3):709–65. pmid:38505114
  32. Friz P, Gassiat P, Lyons T. Physical Brownian motion in a magnetic field as a rough path. Trans Amer Math Soc. 2015;367(11):7939–55.
  33. Bagnall A, Dau HA, Lines J, Flynn M, Large J, Bostrom A. The UEA multivariate time series classification archive. arXiv preprint arXiv:1811.00075. 2018.
  34. Reyna MA, Josef CS, Jeter R, Shashikumar SP, Westover MB, Nemati S, et al. Early prediction of sepsis from clinical data: the PhysioNet/Computing in cardiology challenge 2019. Crit Care Med. 2020;48(2):210–7. pmid:31939789
  35. Warden P. Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. 2018.
  36. Kidger P. On neural differential equations [dissertation]. Oxford (UK): University of Oxford; 2021.
  37. Maeland E. On the comparison of interpolation methods. IEEE Trans Med Imaging. 1988;7(3):213–7. pmid:18230471
  38. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval. 2008 Dec;42:335–59.
  39. Issa D, Fatih Demirci M, Yazici A. Speech emotion recognition with deep convolutional neural networks. Biomed Signal Process Control. 2020;59:101894.
  40. Antoniou N, Katsamanis A, Giannakopoulos T, Narayanan S. Designing and evaluating speech emotion recognition systems: a reality check case study with IEMOCAP. In: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing; 2023 Jun 4–10; Rhodes Island, Greece. IEEE; 2023. p. 1–5.
  41. Che Z, Purushotham S, Cho K, Sontag D, Liu Y. Recurrent neural networks for multivariate time series with missing values. Sci Rep. 2018 Apr 17;8:6085.
  42. Rubanova Y, Chen R, Duvenaud D. Latent ordinary differential equations for irregularly-sampled time series. Adv Neural Inf Process Syst. 2019;32.
  43. Fermanian A. Learning time-dependent data with the signature transform [dissertation]. Paris: Sorbonne Université; 2021.