
Low-resource MobileBERT for emotion recognition in imbalanced text datasets mitigating challenges with limited resources

  • Muhammad Hussain,

    Roles Data curation, Methodology, Writing – original draft

    Affiliation College of Information Engineering, Yangzhou University, Yangzhou, PR China

  • Caikou Chen ,

    Roles Conceptualization, Formal analysis, Writing – review & editing

    yzcck@126.com

    Affiliation College of Information Engineering, Yangzhou University, Yangzhou, PR China

  • Sami S. Albouq,

    Roles Formal analysis, Investigation, Resources

    Affiliation Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia

  • Khlood Shinan,

    Roles Data curation, Investigation, Software

    Affiliation Department of Computers, College of Engineering and Computers in Al-Lith, Umm Al-Qura University, Makkah, Saudi Arabia

  • Fatmah Alanazi,

    Roles Methodology, Validation, Visualization

    Affiliation Computer Science Department, College of Computer and Information Sciences, Imam Muhammad Bin Saud University, Riyadh, Saudi Arabia

  • Muhammad Waseem Iqbal,

    Roles Investigation, Project administration, Supervision

    Affiliation Department of Software Engineering, Superior University Lahore, Lahore, Pakistan

  • M. Usman Ashraf

    Roles Resources, Software, Validation

    Affiliation Department of Computer Science, GC Women University Sialkot, Sialkot, Pakistan

Abstract

Modern dialogue systems rely on emotion recognition in conversation (ERC) as a core element enabling empathetic and human-like interactions. However, the weak correlation between emotions and semantics poses significant challenges: semantically similar utterances can express different emotions depending on the context or speaker. To tackle this challenge, we propose a novel loss function, Focal Weighted Loss (FWL), combined with adversarial training and the compact language model MobileBERT. Our proposed loss function handles imbalanced emotion classification through Focal Weighted Loss and adversarial training without requiring large batch sizes or extensive computational resources. We evaluate our approach on four text emotion recognition benchmark datasets, MELD, EmoryNLP, DailyDialog, and IEMOCAP, where it demonstrates competitive performance. Extensive experiments on these benchmarks validate the effectiveness of FWL with adversarial training, enabling more human-like interactions on digital platforms. Our approach shows the potential to deliver performance comparable to large language models under limited resource constraints.

1 Introduction

Emotions are a fundamental aspect of human communication and decision-making. In the expanding universe of online communication, social media, and the corporate world, the ability to accurately identify emotions within text has become a critical area of interest, bridging the gap between human nuances and digital interpretation. However, capturing the contextual semantics of personal experiences described in one’s utterance is challenging. Understanding and recognising these emotions can significantly enhance chatbot responsiveness, refine healthcare communication, and improve the analysis of social media [1] content for emotion, sentiment, and opinion mining. This, in turn, can lead to increased customer satisfaction and higher conversion rates. Emotion recognition in conversation has attracted significant attention from both academia and the corporate world [2, 3] due to the growing usage and popularity of social media and social networking platforms such as Facebook, Twitter, and YouTube [4]. Emotions expressed during conversation are dynamic in nature and can be influenced by a variety of factors [5], including speaker dependence and the surrounding conversational environment [6].

However, the text emotion recognition task faces several challenges. Depending on the emotional context, similar utterances may exhibit entirely different emotional attributes, while distinguishing conversation texts that share similar emotional attributes is also extremely difficult [7]. As a result, significant efforts have been made to construct distinctive utterance representations along two lines: model creation and representation learning. Representing the first line, [4] designs recurrent models that track the dialogue history for the emotion classification task. Representation learning approaches mostly use supervised contrastive learning [8] to learn utterance representations for emotion recognition; [5] proposes a prototypical contrastive learning technique to address the class imbalance problem and achieves the best performance compared to previous work, but still struggles to improve performance on minority classes. The findings demonstrate that similar emotions, such as excited and happy, are frequently misclassified as each other, and supervised contrastive learning still has difficulty effectively differentiating similar text emotions. As for multimodal emotion recognition, while facial and speech emotion recognition methods have advanced significantly, text-based emotion recognition still demands further exploration and research [9]. Text emotion recognition is particularly challenging because the way emotions are expressed and the meanings of utterances vary with the topic discussed, as well as with the implicit knowledge shared between participants. Emotional dialogues around specific topics carry certain language patterns, affecting not only the utterance’s meaning but also the particular emotions conveyed by specific expressions.

Online text emotion recognition predicts events, opinions, and attitudes from social media and digital platforms, providing valuable insights into public emotions and preferences [10, 11]. Emotion recognition enables proactive responses to potential situations by analysing conversations and identifying emotions in real time. Unlike sentiment analysis, which focuses on determining the sentiment of a text as a whole, emotion recognition aims to accurately identify the emotions expressed by each individual in a conversation. Accurately recognising emotions in conversation is difficult because it demands a deep understanding of contextual cues, such as tone, nuance, and subtlety, to capture the emotions expressed by each speaker. There are various significant applications of emotion recognition, including social media monitoring, customer service, and political campaign analysis. For instance, commercial and government entities can gain valuable insights into public opinion and sentiment on various topics by analysing text data from social media platforms, online forums, and customer feedback. This process involves extracting and analysing emotions, sentiments, and public opinions to understand public attitudes towards different issues [12, 13]. By leveraging emotion recognition and sentiment analysis, businesses can gain a deeper understanding of their customers’ behaviours and attitudes to make better decisions in marketing, product development, and customer service.

However, deep learning and transformer-based emotion recognition models are computationally complex and expensive to deploy on low-resource devices such as mobile phones. Although deep learning and transformer models report interesting findings, these approaches may not be feasible for computation-constrained devices. In this paper, we propose a loss function combined with the low-complexity MobileBERT model that works with limited computational resources without introducing unnecessary complexity. In emotion recognition in conversation, utterances such as "No." and “Yeah, me too.” (Fig 1) can convey different emotions in different contexts.

Fig 1. Emotion recognition in conversation: "No." and "Yeah, me too."

https://doi.org/10.1371/journal.pone.0312867.g001

To the best of our knowledge, we are the first to use Focal Weighted Loss with adversarial training for the emotion recognition task, significantly enhancing the robustness and sensitivity of MobileBERT in recognising nuanced emotional expressions.

  • Novel Integration of Focal Weighted Loss with Adversarial Training: We implement the combination of Focal Weighted Loss with adversarial training for emotion recognition, leading to improved robustness and accuracy in MobileBERT.
  • Enhanced Emotion Recognition with Limited Resources: We demonstrated the effectiveness of our approach in achieving state-of-the-art results on four benchmark datasets (MELD, EmoryNLP, DailyDialog, and IEMOCAP) with limited computational resources.
  • Improved MobileBERT Performance: Our technique significantly enhanced MobileBERT’s performance on emotion recognition tasks, achieving accuracies of 63% on MELD, 43% on EmoryNLP, 62% on DailyDialog, and 63% on IEMOCAP.
  • Effective Management of Class Imbalance: We leveraged Focal Weighted Loss to mitigate class imbalance issues, ensuring that MobileBERT learnt to recognise emotions from minority classes more effectively.
  • Adversarial Training for Robustness: We implemented adversarial training to improve MobileBERT’s ability to handle perturbations and nuances in emotional expressions, leading to more reliable emotion recognition.

2 Related work

2.1 Emotion recognition

Emotion recognition is not a single field but an interdisciplinary area of research, with contributions from computer vision, natural language processing, and psychological cognitive science. It has applications in domains such as opinion mining, social media, medical and healthcare chatbots, human-computer interaction, and customer service. Research has explored emotion recognition in various forms of text data, including movie dialogues, instant messaging services, tweets [14], public comments, and news headlines [15]. However, emotion recognition in conversational scenes presents a unique challenge, differing significantly from the analysis of shorter texts like tweets or social media posts. In context-independent shorter texts, emotions are derived solely from the words used, whereas conversational texts are context-dependent, with the conversation history and speaker turns influencing the emotions expressed by other speakers [16]. Advancements in emotion recognition rely on comprehensive datasets and on addressing specific research challenges. Multimodal emotion recognition, combining text, speech, and facial expressions, enhances accuracy and reliability [17]. These fusions highlight emotion recognition in conversation and its dynamic role in developing empathetic communication and human-machine interactions.

Most existing approaches to emotion recognition in conversation fall into graph-based, sequence-based, knowledge-enhanced, and transformer-based approaches. Graph-based methods such as DialogGCN [18] and RGAT [19] construct a graph over utterance nodes; ConGCN [20] represents speakers and utterances as nodes in a unified graph built over the whole ERC dataset, while DAG-ERC [21] utilises a directed acyclic graph to represent the intrinsic structure within a conversation. Knowledge-enhanced models utilise external knowledge [22–24]. Sequence-based approaches treat contextual information as a sequence of utterances: ICON [25] and HiGRU [26] use gated recurrent units to capture context information, and DialogueRNN [4] uses recurrent neural networks to represent conversation dynamics. Multi-turn reasoning modules [27] analyse the ERC problem from a cognitive perspective. CoMPM [28] uses the context and the speaker’s memory with pre-trained language models, leveraging the language model’s ability to learn and track contextual information; other methods utilise external knowledge from ATOMIC [29]. Transformer-based approaches treat each utterance as an independent sentence, regardless of its context dependence and speaker information, transforming the problem into simple sentence classification so that pre-trained models [30–32] can be fine-tuned directly. One line adopts BERT [33] to extract features from utterances, followed by a transformer structure for modelling context; speaker dependence is modelled through the auxiliary task of judging whether two utterances come from the same speaker. COSMIC [24] exploits RoBERTa as the feature extractor for each utterance and models the contextual dependency with an RNN. In addition, the commonsense knowledge transformer COMET [34] is incorporated to introduce world knowledge. Supervised contrastive methods focus on ERC and apply supervised contrastive loss as an additional optimisation goal; to make full use of label information, [35] extends self-supervised training so that samples sharing a label are gathered in the embedding space while samples of different categories are pushed apart.

2.2 Evolution of pre-trained transformer-based language models

The seminal work on pre-training introduced the transformer architecture for language comprehension [36], using a two-stage training approach: self-supervised pre-training on a generic text corpus followed by fine-tuning on application-specific data [37]. Initially, bi-directional RNNs were pre-trained for contextualised token representations. However, RNN-based models face limitations in long-term modelling and scaling due to their recurrent connections. In contrast, transformer-based models have revolutionised the field through parallel processing and deep model development: BERT [30], ELECTRA [38], and GPT [39] have achieved state-of-the-art performance in several downstream natural language processing tasks. Various studies have applied them to ERC, including transfer learning for ERC [40], utterance-level dialogue understanding [41], and contextualised emotion tagging [42]; despite their potential, transformer-based models have not yet fully addressed emotions. Pre-trained language models like BERT [43] have significantly advanced NLP, including emotion recognition, sentiment analysis, text summarisation, and named entity recognition, but the computational and memory requirements of large language models limit their deployment in resource-constrained environments. To address this issue, we utilise the low-resource MobileBERT model for text emotion recognition, which offers a more efficient solution for resource-constrained settings. Building on the efficient MobileBERT architecture, our model integrates Focal Weighted Loss and adversarial training to achieve robust emotion recognition in conversations. By combining these techniques, we enhance the model’s ability to capture subtle emotional cues and generalise well to unseen data, ultimately improving the reliability and performance of ERC systems.

3 Methodology

3.1 Problem statement

Given a collection of speakers S, a set of emotion labels E, and a specific conversation C, the goal of our research is to determine the emotional state of the speakers in the conversation. The conversational turns follow a predetermined order [(s1, u1), (s2, u2), (s3, u3), …, (sn, un)], where si denotes a speaker from the set S and ui is the utterance of speaker si during the ith turn. Our research focuses on the dynamics of emotion recognition in conversation in real time. The model is set up to predict the emotion label, with Focal Weighted Loss enhancing emotion recognition performance. We propose the Focal Weighted Loss function, weighted according to prediction certainty, to fine-tune the model to focus better on emotionally difficult minority classes. The model predicts the emotion label yt for the current turn t using contextual and sequential information from the previous turns [(s1, u1), (s2, u2), (s3, u3), …, (st, ut)]. In addition, we take advantage of adversarial training to improve the model’s capability to identify intricate expressions and to adapt to challenging inputs.
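As a minimal illustrative sketch (the exact serialisation of turns into model input is our assumption, not specified in the paper), the history [(s1, u1), …, (st, ut)] can be flattened into a single string for the encoder:

```python
def build_context_input(turns, t, max_history=3):
    """turns: list of (speaker, utterance) pairs; t: 0-based index of the
    current turn. Returns one string encoding the recent context plus the
    utterance to classify."""
    start = max(0, t - max_history)
    history = [f"{s}: {u}" for s, u in turns[start:t]]
    current = f"{turns[t][0]}: {turns[t][1]}"
    # "</s>" is used here as a generic separator token (an assumption).
    return " </s> ".join(history + [current])

turns = [("A", "I got the job!"), ("B", "Yeah, me too."), ("A", "No.")]
print(build_context_input(turns, 2))
```

The same utterance ("No.") thus receives a different encoder input, and potentially a different predicted emotion, depending on the turns that precede it.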

Algorithm: Training Process for emotion recognition with Focal Weighted Loss and adversarial training

Dtrain: the training dataset

R: the total number of epochs

M: MobileBERT for sequence classification

E: emotion labels

α, γ: parameters for focal loss, adjusted dynamically

ε: perturbation magnitude for adversarial training

Outputs:

M’: the optimally trained model

Cj: category centers for each label

Steps:

1. Initialization:

 Initialize model M with MobileBERT.

 Initialize parameters α and γ.

2. Training Loop:

  for k = 0 to R do:

  for each batch (X,y) in Dtrain

1. Forward Pass:

  Compute logits: logits = M(X)

  Calculate probabilities using softmax on logits pt

2. Focal Weighted Loss Calculation:

  FL(pt) = − αt(y) ⋅ (1 − pt)^γ ⋅ log(pt)

  αt (y) is dynamically adjusted based on the misclassification rate of class y or its representation in the batch.

  γ is adjusted based on the overall performance or specific training epoch.

  Combine the per-example losses to form the batch loss: Lorig = (1/|B|) Σ FL(pt)

3. Adversarial Training:

  Generate adversarial examples: Xadv = X + ε ⋅ sign(∇X L(M(X), y))

  Compute logits for adversarial examples: logitsadv = M(Xadv)

  where M represents the model, and Xadv are the adversarial examples

  Compute the loss for adversarial examples: Ladv = FL(logitsadv, Y)

  where Y are the true labels

4. Total Loss Calculation:

  Compute the total loss by combining the original and adversarial losses: Ltotal = Lorig + Ladv

  where Lorig is the loss computed using the original data.

5. Backpropagation and Optimization:

   Ltotal.backward()

  Update model parameters using an optimizer:

   optimizer.step()

   optimizer.zero_grad()

  End of batch processing.

  End of epoch processing.

3. Return:

 Return the optimally trained model M′ and category centers Cj.

Dynamic Alpha Adjustment: Adjust α based on ongoing training insights. For underrepresented classes, or those with higher misclassification rates, increase α to give more weight to their errors.

Gamma Tuning: Adapt γ to make the model more sensitive to misclassified examples as training progresses, potentially increasing γ to focus more on harder cases.
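The dynamic alpha adjustment can be sketched as follows; the exact update rule (inverse-frequency scaling boosted by each class's error rate) is our assumption, since the text states only the principle:

```python
def dynamic_alpha(class_counts, error_rates, base=1.0):
    """Assumed update rule: alpha grows for rare classes (inverse-frequency
    factor) and for classes with high misclassification rates."""
    total = sum(class_counts)
    alphas = []
    for n, err in zip(class_counts, error_rates):
        rarity = total / (len(class_counts) * n)    # > 1 for minority classes
        alphas.append(base * rarity * (1.0 + err))  # boost for high error
    return alphas

# A majority class with low error gets a smaller alpha than a
# minority class with high error:
alphas = dynamic_alpha([900, 100], [0.05, 0.40])
assert alphas[1] > alphas[0]
```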

3.2 Low resource MobileBERT with focal weighted loss

The MobileBERT model is a compact and efficient thin version of the large language model BERT_LARGE [44], designed for computationally constrained mobile devices with optimised performance and processing speed in real-time applications. MobileBERT can be fine-tuned for a wide range of language processing tasks while offering good performance, and it is 4.3× smaller and 5.5× faster than the BERT_BASE model. We boost the capabilities of the MobileBERT model by integrating Focal Weighted Loss, a modified loss function that builds upon traditional cross-entropy loss [45]. Unlike standard cross-entropy loss, Focal Weighted Loss incorporates a modulating factor through which the loss contribution from each class is dynamically adjusted based on its classification difficulty. This strategic approach allows the model to focus on difficult and minority cases and effectively address class imbalance, particularly in datasets with skewed emotional expression distributions. By integrating Focal Weighted Loss with the MobileBERT model, we significantly improve the model’s ability to recognise and distinguish between various emotional states in text. The model can capture emotional nuances well, understand context more effectively, and gain a more comprehensive understanding of emotional content, leading to enhanced overall performance and more accurate recognition of emotional states.

FWL(pt) = − αt ⋅ (1 − pt)^γ ⋅ log(pt) + λ ⋅ CWL    (1)

Where:

pt is the model’s estimated probability for class t; αt adjusts the importance of each class, addressing class imbalance; (1 − pt)^γ is the focusing term, which reduces the loss contribution from easy examples and allows the model to focus on harder, misclassified examples; log(pt) is the log loss, which penalises incorrect predictions; and λ ⋅ CWL is the class-weighted loss term, which adds additional class-weighted penalties to further address class imbalance. By integrating these elements, the FWL function effectively balances the attention given to minority classes and hard-to-classify examples, resulting in improved performance and robustness in low-resource text emotion recognition tasks.
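Eq (1) can be evaluated numerically to see the focusing behaviour; this sketch assumes λ = 0 (no class-weighted term) by default:

```python
import math

def focal_weighted_loss(p_t, alpha_t, gamma=2.0, lam=0.0, cwl=0.0):
    """FWL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) + lam * cwl,
    following Eq (1); lam/cwl default to zero here."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t) + lam * cwl

# The focusing term (1 - p_t)**gamma down-weights easy, confident examples:
easy = focal_weighted_loss(0.9, alpha_t=1.0)  # high probability on true class
hard = focal_weighted_loss(0.1, alpha_t=1.0)  # low probability on true class
assert hard > easy
```

With gamma = 0 the focusing term vanishes and the expression reduces to αt-weighted cross-entropy, which is why FWL is described as building on cross-entropy loss.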

3.3 Contextual adversarial training (CAT)

We employ contextual adversarial training [46, 47] to enhance the robustness of our low-resource MobileBERT model against adversarial attacks. Unlike traditional adversarial training methods that add perturbations to context-free layers, the contextual adversarial training (CAT) strategy applies adversarial perturbations to the context-aware network structure in a multi-channel way. This technique allows us to obtain diverse context features and improves the model’s robustness to contextual perturbations.

Let (u, y) denote the mini-batch input sampled from distribution D, where u represents the input features and y the corresponding labels. The context-aware model outputs probabilities p(y|u; θ), where θ denotes the model parameters. During each training step, we introduce contextual adversarial perturbations rc-adv, computed against the current model parameters θ, and incorporate them into the context-aware hidden layers. The perturbations are generated using the following linear approximation under an Lq norm constraint with radius ϵ:

rc-adv = ϵ ⋅ g / ‖g‖q, where g = ∇u J(θ, u, y)    (2)

Here ∇u J(θ, u, y) is the gradient of the loss function J with respect to the input u, which indicates how the loss changes with small changes in the input.
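A minimal numeric sketch of the linear approximation in Eq (2), assuming the L2 instance of the Lq constraint: the gradient is rescaled so the perturbation has magnitude ϵ regardless of gradient scale.

```python
import math

def contextual_perturbation(grad, epsilon=0.1):
    """Rescale the loss gradient g to radius epsilon under the L2 norm,
    mimicking r = epsilon * g / ||g||_2 from Eq (2)."""
    norm = math.sqrt(sum(g * g for g in grad)) + 1e-12  # avoid divide-by-zero
    return [epsilon * g / norm for g in grad]

r = contextual_perturbation([3.0, 4.0], epsilon=0.1)
# The perturbation keeps the gradient's direction but has norm epsilon.
assert abs(math.sqrt(sum(x * x for x in r)) - 0.1) < 1e-9
```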

3.4 Adversarial training procedure

Loss function.

We implement the Focal Weighted Loss, which combines weighted categorical cross-entropy and focal loss to handle class imbalance:

FWL(p, y) = (1 − w) ⋅ CE(p, y) + w ⋅ FL(p, y)    (3)

Here, p denotes the model predictions; CE(p, y) is the cross-entropy loss, which measures the difference between the predicted probabilities p and the true labels y and is effective for general classification tasks; FL(p, y) is the focal loss, which increases the importance of correcting misclassified examples by down-weighting easy examples and focusing on hard-to-classify ones; and the focal weight w determines the contribution of the focal loss to the overall loss, ranging from 0 to 1. This loss is particularly beneficial in tasks with class imbalance, such as low-resource text emotion recognition.
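The combination described above can be sketched numerically; the convex form (1 − w)·CE + w·FL, with w the focal weight, is our assumption, consistent with the statement that the focal weight ranges from 0 to 1:

```python
import math

def fwl(p, alpha=1.0, gamma=2.0, focal_weight=0.5):
    """Mix of cross-entropy and focal loss for the true-class probability p,
    assuming the convex combination (1 - w) * CE + w * FL."""
    ce = -math.log(p)                               # CE(p, y) for the true class
    fl = -alpha * (1.0 - p) ** gamma * math.log(p)  # FL(p, y)
    return (1.0 - focal_weight) * ce + focal_weight * fl

# With focal_weight = 0 the loss reduces to plain cross-entropy:
assert abs(fwl(0.7, focal_weight=0.0) - (-math.log(0.7))) < 1e-12
```

Varying `focal_weight` between 0 and 1 interpolates between the general-purpose cross-entropy term and the hard-example-focused focal term.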

3.5 Adversarial perturbation generation (FGSM attack)

We compute the gradient of the loss with respect to the input embeddings and adjust the embeddings by a small perturbation of magnitude ϵ in the direction of the gradient sign:

Xadv = X + ϵ ⋅ sign(∇X J(θ, X, y))    (4)
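A minimal sketch of this FGSM step, applied here to a plain list standing in for an embedding vector:

```python
def fgsm_perturb(x, grad, epsilon=0.01):
    """X_adv = X + epsilon * sign(grad_X J), element-wise, as in Eq (4)."""
    sign = lambda g: (g > 0) - (g < 0)  # -1, 0, or 1
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Each coordinate moves by exactly +/- epsilon (or stays put where the
# gradient is zero), regardless of the gradient's magnitude:
x_adv = fgsm_perturb([0.5, -0.2, 0.0], [2.0, -3.0, 0.0], epsilon=0.01)
```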

4 Experiments

4.1 Model setting

We implement our proposed loss function on the MobileBERT model using the PyTorch framework for text emotion recognition on the MELD, EmoryNLP, DailyDialog and IEMOCAP datasets. We use a pre-trained MobileBERT model to extract word embeddings from the text transcriptions. We fine-tune the pre-trained MobileBERT model on each dataset using the Adam optimizer with a learning rate of 2e-5 and train for 4, 8, 10, and 12 epochs with batch sizes of 32 and 64. We utilise Focal Weighted Loss and adversarial training to address class imbalance issues in the datasets. Our model achieves performance comparable to large language models while requiring fewer resources, achieving robust and near-optimal results on the four text emotion recognition datasets. Fig 2 illustrates our model architecture and workflow.

4.2 Model training

We trained the MobileBERT model on a Windows 10 PC with 32 GB of RAM, an RTX 3060 6 GB dedicated GPU, and a Core i7 2.60 GHz processor using Python 3.8.2. We employed the Adam optimizer with a learning rate of 2e-5 and trained the model for 12 epochs with a batch size of 64. We utilised the pre-trained MobileBERT model and fine-tuned it on our datasets, MELD, EmoryNLP, DailyDialog and IEMOCAP.

4.3 Datasets

MELD [48] is a multi-party conversation corpus collected from the TV show Friends, containing around 1,400 conversations and 13,000 utterances with several speakers per dialogue. Each utterance is labelled with one of seven emotion classes (Fig 3): neutral, joy, surprise, anger, sadness, disgust, and fear. We use its text modality for emotion recognition.

EmoryNLP [49] is a multiparty conversation corpus also collected from Friends, but it differs from MELD in the choice of scenes and emotion labels. The dataset comprises 97 episodes, 897 scenes, and 12,606 utterances, each annotated with one of seven emotion labels (Fig 4): neutral, joyful, scared, mad, peaceful, powerful, and sad.

DailyDialog [50] is a human-written multi-turn dialogue corpus reflecting daily life, consisting of conversations written by English learners and covering various everyday topics. Each utterance is annotated with an emotion label (Fig 5) from seven classes: neutral, anger, disgust, fear, happiness, sadness, and surprise.

IEMOCAP [51], the Interactive Emotional Dyadic Motion Capture database, consists of 151 recorded dialogue videos with two speakers per session, for a total of 302 videos and ten unique speakers. Each dialogue utterance is annotated with one of six emotion labels (Fig 6): frustrated, neutral, anger, sadness, excitement, and happiness.

4.4 Evaluation metrics

We observe class imbalances in the four benchmark datasets summarised in Table 1: MELD, EmoryNLP, DailyDialog, and IEMOCAP. To address this, we adopt appropriate evaluation metrics (Table 2) for each dataset. Specifically, we use the weighted F1 score for MELD, EmoryNLP, and IEMOCAP, consistent with previous research, to account for class imbalance. For the DailyDialog dataset, we report the micro-averaged F1 score, excluding the dominant neutral class, as is common practice. This enables a fair assessment of our model’s performance and facilitates comparison with previous works.
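The two metrics can be sketched from per-class true-positive/false-positive/false-negative counts; the counts below are invented purely for illustration:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def weighted_f1(counts):
    """counts: {label: (tp, fp, fn)}. Per-class F1 weighted by support (tp + fn)."""
    support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
    total = sum(support.values())
    return sum(f1(*counts[c]) * support[c] / total for c in counts)

def micro_f1(counts, exclude=()):
    """Pool counts over classes; e.g. exclude neutral, as done for DailyDialog."""
    tp = sum(t for c, (t, f, n) in counts.items() if c not in exclude)
    fp = sum(f for c, (t, f, n) in counts.items() if c not in exclude)
    fn = sum(n for c, (t, f, n) in counts.items() if c not in exclude)
    return f1(tp, fp, fn)

counts = {"joy": (8, 2, 2), "anger": (3, 1, 2), "neutral": (50, 5, 5)}
score = micro_f1(counts, exclude={"neutral"})
```

Excluding the majority neutral class prevents it from dominating the pooled counts, which is why the micro F1 convention is used for DailyDialog.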

Table 1. Statistics of the MELD, EmoryNLP, DailyDialog, and IEMOCAP datasets.

https://doi.org/10.1371/journal.pone.0312867.t001

5 Results and discussion

We evaluate the proposed technique against state-of-the-art text-based emotion recognition approaches in conversation, with results presented in Table 3. By integrating our MobileBERT model with Focal Weighted Loss and adversarial training, we achieve strong performance, outperforming existing methods and demonstrating the effectiveness of our approach, particularly when combined with the efficient MobileBERT model.

Table 3. FWL+AT is Focal Weighted Loss with adversarial training.

https://doi.org/10.1371/journal.pone.0312867.t003

5.1 Ablation study

To comprehensively evaluate the effectiveness of Focal Weighted Loss (FWL) and adversarial training (AT) on our low-resource MobileBERT model for text emotion recognition, we conducted a thorough ablation analysis. We fine-tuned the pre-trained MobileBERT-based encoder on the MELD, EmoryNLP, DailyDialog and IEMOCAP datasets.

We conduct ablation studies to evaluate our key contribution: MobileBERT with Focal Weighted Loss and adversarial training. When we remove the proposed FWL with adversarial training and replace it with a simple focal loss, emotion recognition performance degrades.

5.2 Combining focal weighted loss and adversarial training

By integrating Focal Weighted Loss with adversarial training, we obtain a powerful combination that effectively exploits the strengths of both approaches. Our results demonstrate that this combination delivers the best performance of the low-resource MobileBERT model on all four imbalanced text emotion recognition benchmarks.

The strength of our approach lies in its complementary components: Focal Weighted Loss effectively addresses class imbalance, while adversarial training enhances robustness against adversarial perturbations. By integrating these two strategies, we achieve the best overall performance, showcasing the potential of this approach for real-world applications.

5.3 Hyperparameter tuning

We conducted a comprehensive hyperparameter tuning investigation to determine how various hyperparameters influence the model’s performance. Our findings revealed that the right combination of optimiser, learning rate, and batch size significantly improves the model’s accuracy in emotion recognition. After tuning, we found that the Adam optimiser with a learning rate of 2e-5 and a batch size of 64 produces the best performance. Figs 7–10 show the confusion matrices of the MobileBERT model integrated with Focal Weighted Loss and adversarial training, which achieves the best overall performance across the MELD, EmoryNLP, DailyDialog, and IEMOCAP text emotion recognition datasets.

6 Conclusion and limitations

In this paper, we have proposed a novel loss function, Focal Weighted Loss (FWL), designed to address class imbalance in text emotion recognition tasks. By combining FWL with adversarial training under low-resource settings, we demonstrate that the compact MobileBERT model achieves superior performance on imbalanced datasets and is less sensitive to training batch size, reducing the need for large computational resources. We conducted our experiments on four widely used imbalanced text benchmarks: MELD, EmoryNLP, DailyDialog, and IEMOCAP. Our proposed method performs well compared to large language models, demonstrating superior efficiency and achieving strong results without extensive computational resources, making it feasible for real-world applications where computation is limited.

6.1 Limitations

Our method has limitations. The model does not perform well on minority classes, which can limit its effectiveness in scenarios where emotions are imbalanced or vary in intensity. Further research is needed to compare our approach with other low-resource models and to evaluate its performance on additional imbalanced datasets. We also plan to extend the model to handle multi-modal inputs, which could provide a more comprehensive understanding of emotions.

References

  1. Yu F., Guo J., Wu Z., and Dai X., “Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation,” Mar. 2024, Accessed: Jun. 08, 2024. [Online]. https://arxiv.org/abs/2403.20289v1
  2. “Emotionflow: Capture the Dialogue Level Emotion Transitions,” Accessed: Oct. 07, 2023. [Online]. https://ieeexplore.ieee.org/abstract/document/9746464
  3. Zhu L., Pergola G., Gui L., Zhou D., and He Y., “Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection,” Accessed: Oct. 07, 2023. [Online]. http://github.com/something678/TodKat
  4. Majumder N., Poria S., Hazarika D., Mihalcea R., and Gelbukh A., “DialogueRNN: An Attentive RNN for Emotion Detection in Conversations,” 2019. [Online]. www.aaai.org
  5. Song X., Huang L., Xue H., and Hu S., “Supervised Prototypical Contrastive Learning for Emotion Recognition in Conversation,” Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5197–5206, 2022.
  6. Poria S., Majumder N., Mihalcea R., and Hovy E., “Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances,” IEEE Access, vol. 7, pp. 100943–100953, May 2019.
  7. Ong D. et al., “Is Discourse Role Important for Emotion Recognition in Conversation?,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11121–11129, Jun. 2022.
  8. Khosla P. et al., “Supervised Contrastive Learning,” Apr. 2020, Accessed: Jun. 09, 2024. [Online]. https://arxiv.org/abs/2004.11362v5
  9. Mohammad F. et al., “Text Augmentation-Based Model for Emotion Recognition Using Transformers,” Computers, Materials & Continua, vol. 76, no. 3, pp. 3523–3547, Oct. 2023.
  10. Shi M., Wang M., Luo Y., et al., “Research on the Sentiment Analysis and Evolutionary Mechanism of Sudden Network Public Opinion Based on SNA-ARIMA Model with Text Mining,” Journal of Electrical Systems, vol. 20, no. 2, pp. 1961–1972, Apr. 2024.
  11. Islam M. S. et al., “Challenges and future in deep learning for sentiment analysis: a comprehensive review and a proposed novel hybrid approach,” Artif Intell Rev, vol. 57, no. 3, pp. 1–79, Mar. 2024.
  12. Md Suhaimin M. S., Ahmad Hijazi M. H., Moung E. G., Nohuddin P. N. E., Chua S., and Coenen F., “Social media sentiment analysis and opinion mining in public security: Taxonomy, trend analysis, issues and future directions,” 2023.
  13. Aguilar-Moreno J. A., Palos-Sanchez P. R., and Pozo-Barajas R., “Sentiment analysis to support business decision-making. A bibliometric study,” 2024.
  14. Pant S. et al., “Korean Drama Scene Transcript Dataset for Emotion Recognition in Conversations,” IEEE Access, vol. 10, pp. 119221–119231, 2022.
  15. Taj S., Shaikh B. B., and Fatemah Meghji A., “Sentiment analysis of news articles: A lexicon based approach,” 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies, iCoMET 2019, Mar. 2019.
  16. 16. Phan D. A., Matsumoto Y., and Shindo H., “Autoencoder for Semisupervised Multiple Emotion Detection of Conversation Transcripts,” IEEE Trans Affect Comput, vol. 12, no. 03, pp. 682–691, Jul. 2021,
  17. 17. Lian H., Lu C., Li S., Zhao Y., Tang C., and Zong Y., “A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face,” Entropy 2023, Vol. 25, Page 1440, vol. 25, no. 10, p. 1440, Oct. 2023, pmid:37895561
  18. 18. D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation,” EMNLP-IJCNLP 2019–2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 154–164, Aug. 2019.
  19. 19. T. Ishiwatari, Y. Yasuda, T. Miyazaki, and J. Goto, “Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations,” in EMNLP 2020–2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2020.
  20. 20. D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, and G. Zhou, “Modeling both context- And speaker-sensitive dependence for emotion detection in multi-speaker conversations,” in IJCAI International Joint Conference on Artificial Intelligence, 2019.
  21. 21. W. Shen, S. Wu, Y. Yang, and X. Quan, “Directed acyclic graph network for conversational emotion recognition,” in ACL-IJCNLP 2021—59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 1551–1560, 2021.
  22. 22. L. Zhu, G. Pergola, L. Gui, D. Zhou, and Y. He, “Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection,” ACL-IJCNLP 2021—59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 1571–1582, 2021.
  23. 23. D. Zhang, X. Chen, S. Xu, and B. Xu, “Knowledge Aware Emotion Recognition in Textual Conversations via Multi-Task Incremental Transformer,” in COLING 2020—28th International Conference on Computational Linguistics, Proceedings of the Conference, 2020.
  24. 24. Ghosal D., Majumder N., Gelbukh A., Mihalcea R., and Poria S., “COSMIC: COmmonSense knowledge for eMotion identification in conversations,” in Findings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020, pp. 2470–2481, 2020.
  25. 25. D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, “ICoN: Interactive conversational memory network for multimodal emotion detection,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 2018.
  26. 26. W. Jiao, H. Yang, I. King, and M. R. Lyu, “HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition,” in NAACL HLT 2019–2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies—Proceedings of the Conference, 2019.
  27. 27. D. Hu, L. Wei, and X. Huai, “DialogueCRN: Contextual reasoning networks for emotion recognition in conversations,” in ACL-IJCNLP 2021—59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2021.
  28. 28. J. Lee and W. Lee, “CoMPM: Context Modeling with Speaker’s Pre-trained Memory Tracking for Emotion Recognition in Conversation,” in NAACL 2022–2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, pp. 5669–5679, Aug. 2021.
  29. 29. M. Sap et al., “ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning,” 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, pp. 3027–3035, Oct. 2018.
  30. 30. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL HLT 2019–2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies—Proceedings of the Conference, 2019.
  31. 31. Li S., Yan H., and Qiu X., “Contrast and Generation Make BART a Good Dialogue Emotion Recognizer,” Dec. 2021, [Online]. http://arxiv.org/abs/2112.11202
  32. 32. Liu Y. et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” Jul. 2019, [Online]. http://arxiv.org/abs/1907.11692
  33. 33. J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “HiTrans: A Transformer-Based Context- and Speaker-Sensitive Model for Emotion Detection in Conversations,” COLING 2020—28th International Conference on Computational Linguistics, Proceedings of the Conference, pp. 4190–4200, 2020.
  34. 34. A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi, “COMET: Commonsense Transformers for Automatic Knowledge Graph Construction,” ACL 2019—57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pp. 4762–4779, Jun. 2019.
  35. 35. Khosla P. et al., “Supervised Contrastive Learning,” Adv Neural Inf Process Syst, vol. 33, pp. 18661–18673, 2020, Accessed: Jun. 30, 2024. [Online]. https://t.ly/supcon
  36. 36. H. Yang and J. Shen, “Emotion Dynamics Modeling via BERT,” Proceedings of the International Joint Conference on Neural Networks, vol. 2021-July, Apr. 2021.
  37. 37. M. E. Peters et al., “Deep contextualized word representations,” in NAACL HLT 2018–2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies—Proceedings of the Conference, 2018.
  38. 38. K. Clark, M. T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: PRE-TRAINING TEXT ENCODERS AS DISCRIMINATORS RATHER THAN GENERATORS,” in 8th International Conference on Learning Representations, ICLR 2020, 2020.
  39. 39. Alec R., Jeffrey W., Rewon C., David L., Dario A., and Ilya S., “Language Models are Unsupervised Multitask Learners | Enhanced Reader,” OpenAI Blog, vol. 1, no. 8, 2019.
  40. 40. Hazarika D., Poria S., Zimmermann R., and Mihalcea R., “Conversational transfer learning for emotion recognition,” Information Fusion, vol. 65, 2021,
  41. 41. Ghosal D., Majumder N., Mihalcea R., and Poria S., “Exploring the Role of Context in Utterance-level Emotion, Act and Intent Classification in Conversations: An Empirical Study,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021.
  42. 42. Y. Wang, J. Zhang, J. Ma, S. Wang, and J. Xiao, “Contextualized emotion recognition in conversation as sequence tagging,” in SIGDIAL 2020—21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, Proceedings of the Conference, 2020.
  43. 43. Wang H., Li J., Wu H., Hovy E., and Sun Y., “Pre-Trained Language Models and Their Applications,” 2023.
  44. 44. Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices,” Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170, Apr. 2020.
  45. 45. Zhang Z. and Sabuncu M. R., “Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels.”
  46. 46. I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, Dec. 2014, Accessed: May 29, 2024. [Online]. https://arxiv.org/abs/1412.6572v3
  47. 47. Miyato T., Maeda S. I., Koyama M., and Ishii S., “Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning,” IEEE Trans Pattern Anal Mach Intell, vol. 41, no. 8, 2019, pmid:30040630
  48. 48. S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion recognition in conversations,” in ACL 2019—57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2020.
  49. 49. Zahiri S. M. and Choi J. D., “Emotion Detection on TV Show Transcripts with Sequence-Based Convolutional Neural Networks.” [Online]. www.aaai.org
  50. 50. Li Y., Su H., Shen X., Li W., Cao Z., and Niu S., “DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset,” Oct. 2017, Accessed: Apr. 26, 2024. [Online]. https://arxiv.org/abs/1710.03957v1
  51. 51. Busso C. et al., “IEMOCAP: Interactive emotional dyadic motion capture database,” Lang Resour Eval, vol. 42, no. 4, 2008,
  52. 52. P. Zhong, D. Wang, and C. Miao, “Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations,” EMNLP-IJCNLP 2019–2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 165–176, 2019.
  53. 53. W. Shen, J. Chen, X. Quan, and Z. Xie, “DialogXL: All-in-One XLNet for Multi-Party Conversation Emotion Recognition,” 35th AAAI Conference on Artificial Intelligence, AAAI 2021, vol. 15, pp. 13789–13797, Dec. 2020.
  54. 54. B. Lee and Y. S. Choi, “Graph Based Network with Contextualized Representations of Turns in Dialogue,” in EMNLP 2021–2021 Conference on Empirical Methods in Natural Language Processing, Proceedings, 2021.
  55. 55. Li J., Lin Z., Fu P., Si Q., and Wang W., “A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation,” Dec. 2020, Accessed: Sep. 22, 2024. [Online]. https://arxiv.org/abs/2012.14781v1
  56. 56. “Bimodal Speech Emotion Recognition Using Pre-Trained Language Models | Request PDF.” Accessed: Sep. 22, 2024. [Online]. https://www.researchgate.net/publication/337781233_Bimodal_Speech_Emotion_Recognition_Using_Pre-Trained_Language_Models