Abstract
Conversational agents, commonly referred to as chatbots, have become an integral part of many important applications, such as customer support and virtual assistants in enterprise-level solutions. Despite their widespread use, current solutions predominantly rely on single-modality input, primarily text, which limits their ability to fully understand users' emotions and respond accordingly. The increasing demand for multimodal inputs in conversation, such as text, audio, and video, highlights the need for a more comprehensive approach. Existing conversational agents face significant challenges in generating emotionally aware responses, as they lack the ability to effectively handle emotion embeddings, leading to limitations in emotional accuracy and contextual appropriateness. Addressing these challenges, this research work presents a novel multimodal approach that incorporates features from the text, audio, and visual modalities. The proposed EAC-Agent employs a transformer-based sequence-to-sequence model together with pre-trained embeddings such as GloVe. Self- and cross-modal attention over text, audio, and visual features is used to generate more emotionally intelligent responses. EAC-Agent is validated through comparison with existing techniques on two benchmark datasets. The obtained results demonstrate superior performance in emotion classification and response generation. The proposed model achieves an accuracy of 76.27% on IEMOCAP and 67.57% on MELD for emotion recognition from multimodal user inputs. In addition, the emotion-aware response generation module shows clear improvements, with perplexity values of 39.01 and 42.30, BLEU scores of 0.31 and 0.30, and ROUGE-L scores of 0.45 and 0.44 on IEMOCAP and MELD, respectively. EAC-Agent demonstrates clear superiority over existing models and holds great promise for applications in customer service, healthcare, and other areas requiring empathetic and contextually appropriate interactions.
Citation: Jamil S, Ali T, Nawaz A, Shahid A (2026) EAC-Agent: A deep learning framework for multimodal emotion-aware conversational agent with contextual response generation. PLoS One 21(4): e0346770. https://doi.org/10.1371/journal.pone.0346770
Editor: Shuai Liu, Hunan Normal University, CHINA
Received: July 29, 2025; Accepted: March 24, 2026; Published: April 17, 2026
Copyright: © 2026 Jamil et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All source code and pre-processing scripts used in this study are publicly available at: https://github.com/ShahidJamil20/EAC-Agent and https://github.com/ShahidJamil20/EAC-Preprocess. The raw datasets used in this study are third-party resources. Access to the IEMOCAP dataset requires submitting a request at https://sail.usc.edu/iemocap/, after which the data can be downloaded upon approval. The MELD dataset is available at https://affective-meld.github.io/. Due to licensing restrictions, the raw datasets cannot be redistributed by the authors. However, the minimal processed datasets used in this study, including multimodal feature files, are publicly available at https://doi.org/10.6084/m9.figshare.31310113 and are sufficient to reproduce the results reported in this study.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Chatbots, or conversational agents (CAs), are digital systems that replicate human interaction through Natural Language Processing (NLP) [1]. Due to their interactivity and user-friendly design, many people prefer them over traditional static Frequently Asked Questions (FAQ) systems [2]. Over the years, CAs have become increasingly important in numerous areas as a result of ongoing advances in AI. Compared with human agents, conversational systems offer several benefits: they are available 24/7, they can engage with thousands of users simultaneously [3], and they provide a more tailored experience based on individual user data. As shown in Fig 1, these systems are used in many domains, especially in healthcare, where they support patients in psychological treatment and provide medical information [4]. Similarly, in business they handle customer service smoothly, enhance service quality, and effectively minimize costs by handling large numbers of customer queries [5]. In education, CAs contribute to personalized tutoring and support interactive learning [6]. In entertainment, they excel at storytelling and gaming.
The figure summarizes the key benefits of conversational agents in enhancing user interaction, personalization, and accessibility.
However, many existing conversational systems still struggle to accurately interpret the emotional context of user queries and respond in an empathetic manner, which can negatively affect user experience [7,8]. Initially designed for simple conversations and entertainment, such systems now encounter significant challenges related to emotional intelligence in responses. CAs may struggle to produce fully satisfactory responses when they lack awareness of a user's emotional state. Conversely, when these systems are aware of the user's emotional state, they are in a stronger position to satisfy users by generating more empathetically relevant responses [9]. For example, to improve communication, an effective conversational system should be able to detect when a user is angry and respond empathetically. Traditional text-based systems, however, struggle to accurately identify a user's emotional state, as textual input alone cannot effectively capture emotional cues such as facial expressions and voice tone.
Multimodal approaches are better positioned to understand user emotions because they use the text, audio, image, and video modalities [10]. This, in turn, enables conversational systems to generate more empathetic responses. For example, images are well suited to detecting the facial expression of a user in a happy state, whereas textual data are more suitable than images for identifying disgust. Consequently, both happy and disgust emotions can be better identified when cues from these modalities are combined [11]. Users are now more satisfied with CAs during conversation, largely because multimodal approaches enable better emotion detection [12]. CAs play an especially vital role when a user is in a frustrated state [13]; the clearest example is customer service, where an emotion-aware response can help calm a distressed customer. According to an existing research study, 40–45% of user queries arrive in a non-neutral emotional state and therefore need an empathetic response [14], so that the user feels as if they are talking to a sympathetic human.
In this work, emotion embeddings refer to dense latent vector representations learned by an emotion classifier that encode the user’s emotional state. These embeddings are not pre-softmax hidden activations or static class embeddings, but conditioning vectors derived from multimodal features. They are explicitly used to condition the response generation process, enabling the model to produce emotionally aligned responses.
While modern CAs utilize techniques such as sentiment analysis and machine learning to detect emotional content, there is still substantial room for improvement in their emotional intelligence [15]. Recent research employs deep learning techniques, enabling such agents to generate human-like responses [16]. Transformer-based models like the Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT) excel at understanding and generating context-aware responses using attention mechanisms, building on earlier Seq2Seq models with attention such as LSTM- and GRU-based architectures [17]. First, dialogue models such as DialoGPT build on the Transformer architecture introduced by Google and have been trained on datasets that contain actual conversations among real human beings [18]. Second, memory networks such as MANN can keep track of conversation history using an external memory, which allows the model to access and retrieve conversation history to generate contextually aware and accurate responses. In addition to these two technologies, research on memory networks and pointer networks has improved the capacity of CAs to incorporate context and increased the breadth and depth of their responses.
Growing interest has emerged in developing multimodal affective conversational agents, driven by the development of new technologies. The proposed model introduces an innovative method for building multimodal affective conversational agents by creating a fusion network that combines text, image, and audio features to significantly improve the accuracy of emotion detection during engagement with users. The research work uses a semi-transformer-based model, which is divided into four main components. The first component computes fixed-size vectors for the text, visual, and audio inputs. Positional embeddings are then added to these vectors so that contextual information from each modality is not lost. For text we use GloVe, for images we use Convolutional Neural Networks (CNN), and for audio we use Mel-Frequency Cepstral Coefficients (MFCC). The second component is the fusion network, where we use an early fusion approach by combining features from all three modalities. The third component is the emotion classifier, to which the single fused feature vector is fed and which generates the emotion embeddings. In the fourth and final component, both the generated embeddings and the user's query are forwarded to the sequence-to-sequence encoder with attention to generate the emotion-embedded response at the decoder side. Together, these components help make CAs more human-like.
The summarized contribution of this research work is as follows:
- This study addresses the limitations of unimodal emotion detection methods by proposing a multimodal approach that integrates text, visual, and audio features, offering a more comprehensive solution for emotion recognition during conversations.
- This study introduces a novel early fusion approach that integrates features extracted from the text, visual, and audio modalities, yielding higher emotion-detection accuracy.
- The proposed model embeds unique emotional features into generated responses, setting a new standard for multimodal emotion detection and response generation in conversational systems.
- The experimental evaluation indicates that the proposed model attains overall accuracies of 76.27% and 67.57% and weighted F1 scores of 76.36% and 67.50%, outperforming baseline models on the benchmark IEMOCAP and MELD datasets, respectively.
The remaining sections of the paper are organized as follows. In Related Work section, the literature in the relevant field is reviewed, while the Methodology section is all about the proposed methodology. In the Results and Discussion section, we reported the effectiveness of the proposed approach by setting up the experiments resulting in excellent results. Finally, the Conclusion section provides the summary of the research work and outlines future work.
Related work
During the last several years, many researchers have contributed to the field of multimodal emotion-aware conversational agents, advancing techniques for emotion detection and response generation through the integration of text, visual, and audio modalities; a summary is provided in Table 1.
Recent studies emphasize that relying on a single modality often limits a model’s ability to capture the full context of human interactions. Multimodal learning addresses this limitation by combining information from multiple sources, such as text, audio, and visual cues, allowing models to learn richer and more meaningful representations. By leveraging the complementary strengths of different modalities, deep learning–based fusion approaches have shown improved performance in tasks that require contextual understanding and affective awareness. These advances are particularly relevant for emotion-aware conversational agents, where integrating multimodal information can lead to more contextually appropriate and emotionally aligned response generation [27].
The emotion recognition system introduced by Ghosal et al. employs graph convolution neural network-based methods [28]. Their model uses both the intra-speaker (speaker’s own speech) and inter-speaker (the other person’s speech) sequential and contextual dependence of the utterance and uses that to classify the utterance as belonging to one of the previously labeled emotion classes. In the experiments performed, Ghosal et al. used only textual content and observed statistically significant improvements over the current leading classification methods. The authors observed that short utterances often contain multiple meanings and associations that can only be understood through the use of multimodal data to provide context for understanding (e.g., ‘ok. yes.’). When evaluated on the Multimodal Emotion Lines Dataset (MELD), the authors’ method produced an F1 measure of 58.10%.
Ho et al. proposed an approach to recognizing emotion in speech that utilizes a combination of modalities [29]. The approach uses a Recurrent Neural Network (RNN) together with self-attention and a multi-head attention mechanism. First, they extract MFCC features from audio and use the BERT model to obtain features from textual data. These features are then fed into the RNN, where a self-attention layer is also applied. The features are fused using a multi-head attention mechanism to predict the emotion from these modalities. The F1 score on the MELD dataset is computed separately for text alone and for text plus speech: 59.98% for text and 60.59% for the multimodal setting, demonstrating the benefit of the multimodal approach.
The study referred to in [30] examined deep learning methods for determining emotions from combinations of different data types (text, video, and audio). The study identified text as the most appropriate modality for generating emotion-integrated responses. Two primary fusion approaches were considered: early fusion and late fusion. The experimental results clearly demonstrate that both methods enable the system to exploit the emotion present in multimodal conversations to generate effective, emotionally embedded responses. Incorporating these methods improved the accuracy of the system by 8%. The effectiveness of the responses generated with an attention mechanism was used to measure response accuracy.
Emotion detection in conversations has emerged as an increasingly important topic within the NLP community. The distinctive aspect of the approach in [31] is its use of an additional source of information (audio) to obtain further contextual information about each participant in a conversation, within an emotion detection model based on GNNs (Graph Neural Networks). Knowing "who" is speaking and "how" the person is talking plays a major role in the overall effectiveness of the results produced by utilizing both of these sources. Owing to the use of both forms of data, the results of their evaluation on the MELD (Multimodal EmotionLines Dataset) were noticeably above those of related models. It is worth mentioning, however, that their relatively low weighted F1 score of 55% can likely be attributed to the complexity of the Emory NLP dataset used for comparison.
The dynamics of multimodal sentiment and emotion analysis highlight advances in techniques that go beyond simple concatenation of inputs or static attention strategies. One such technique combines fuzzy-cognition-based dynamic fusion networks (Fcdnet) with contextual relevance via fuzzy cognitive reasoning, allowing an adaptive balance of modality contributions and a more effective representation of the semantic and affective meaning across text, audio, and visual input. The emphasis on the cognitive value of each of the three modalities, as well as their complementary relationships, enhances feature interactions and therefore increases performance on multimodal sentiment benchmarks, indicating an exciting possibility for emotion-aware conversational systems that interpret nuanced emotional contexts via multiple data sources [32].
Several recent studies have focused on improving multimodal emotion recognition in conversations by combining information from text, audio, and visual signals. Ma et al. [33] propose a transformer-based model with self-distillation to strengthen cross-modal feature learning and achieve better results on the IEMOCAP and MELD datasets. Hu et al. [34] introduce MMGCN, which uses graph convolution networks to model interactions between speakers and different modalities. Mao et al. [35] present DialogueTRM, a hierarchical transformer model designed to capture emotional changes and long-range context in conversations. Similarly, Hu et al. [36] propose MM-DFN, which dynamically integrates multimodal features for more effective emotion recognition. All these approaches are evaluated on the widely used IEMOCAP and MELD datasets and represent important contributions to multimodal conversational emotion analysis.
In addition to widely used benchmarks such as IEMOCAP and MELD, prior studies have introduced a range of multimodal datasets to facilitate research in conversational emotion analysis and affective computing. The Multimodal Emotion Intensity and Sentiment Dataset (MEISD) goes beyond traditional emotion recognition methods because it explicitly models emotion intensity by combining textual, acoustic and visual representations [37]. CMU-MOSEI and CMU-MOSI Datasets are much larger scale multimodal annotated datasets that include both sentiment and emotion recognition, enabling analysis of multiple affective expression types based on spoken language, facial expression and vocal characteristics [24]. Other datasets such as Emory NLP [38], DailyDialog [39], and the OMG Emotion [40] help researchers learn and understand how emotional dynamics occur within the context of conversation or multimedia communication. Although there are many multimodal datasets currently available, IEMOCAP and MELD are still the most frequently utilized benchmarks for creating emotion responsive conversational agent systems, particularly when researchers compare their results against previously established baseline models [37].
A review of the literature on emotion detection and response generation for conversational agents suggests that while many of these techniques achieve good accuracy and efficiency, they also have a number of limitations, as summarized in Table 1. The implications of this study emphasize the critical need to develop advanced emotion-aware multimodal conversational agents capable of real-time recognition of, and response to, human emotion. Existing methods, including unimodal and some multimodal approaches, often struggle with challenges such as incomplete emotional context, modality misalignment, and the need for precise feature integration. The proposed model addresses these limitations by leveraging deep learning techniques, such as early fusion networks and attention mechanisms, to effectively combine the text, audio, and visual modalities.
Methodology
This section discusses the core methodology of EAC-Agent. Fig 2 shows the architecture and the complete workflow of the proposed research work, which is composed of four important components. Feature extraction is the very first phase in the sequence, then fusing the features from each modality. The next component is about emotion embeddings and the last one is responsible for text-based response generation.
The figure presents the overall workflow of the proposed system, including multimodal feature extraction, self- and cross-attention-based fusion, and emotion-aware response generation.
Preprocessing and feature extraction
In the preprocessing stage, we employed a GloVe-based embedding strategy for text, a Gaussian Mixture Model (GMM) for audio, and a Vision Transformer (ViT) for video. Feature extraction, in turn, enables the identification and quantification of relevant patterns within the raw data that are in most cases not immediately apparent for emotion detection. One of the basic purposes of extracting the key features from each modality is to reduce the data dimensionality. Text preprocessing is the key step that formats the textual data so that feature extraction can be performed more efficiently and, as a result, the performance of the model improves. In text preprocessing, we first perform tokenization, then cleaning, normalization, lemmatization, and stemming. Finally, a padding and truncation step is performed so that every sequence has a fixed length L.
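As an illustrative sketch (not the authors' code), the padding and truncation step can be expressed as follows; the pad token id and the target length used here are hypothetical values:

```python
# Hypothetical pad token id; the paper does not specify one.
PAD_ID = 0

def pad_or_truncate(token_ids, L):
    """Return a sequence of exactly L token ids:
    truncate if longer than L, pad with PAD_ID if shorter."""
    if len(token_ids) >= L:
        return token_ids[:L]
    return token_ids + [PAD_ID] * (L - len(token_ids))
```

After this step, every utterance in a batch has the same length L, which is what allows fixed-size embedding matrices to be applied downstream.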
Text embeddings transform the textual data into numerical vectors while preserving the semantic relationships between words, which is ideal for the models to process. For embedding we use GloVe, which maps each word w_i to a dense vector v_i ∈ R^d. The co-occurrence matrix M_ij counts the total number of times words w_i and w_j appear together within a context window; GloVe learns the word vectors such that:

v_i^T ṽ_j + b_i + b̃_j ≈ log M_ij

where ṽ_j is the context vector of word w_j and b_i, b̃_j are bias terms.
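The co-occurrence counting that underlies GloVe can be sketched as follows; this minimal pure-Python version (our illustration, not the GloVe reference implementation) counts word pairs within a symmetric context window:

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count co-occurrences M[w_i][w_j] of word pairs appearing
    within a symmetric context window of the given size."""
    M = defaultdict(lambda: defaultdict(int))
    for i, wi in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                     # skip the word itself
                M[wi][tokens[j]] += 1
    return M
```

In practice one would use pre-trained GloVe vectors rather than training from scratch; this sketch only shows where the matrix M_ij comes from.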
To extract features from audio, we perform audio preprocessing. In this process, we first remove background noise from the audio using spectral gating. After that, the audio signal x(t) is segmented into overlapping frames of length N with hop size H:

x_f(n) = x(fH + n) · w(n),  0 ≤ n ≤ N − 1

where:
- x(t) is the original audio signal
- w(n) is the window function (e.g., a Hann or Hamming window)
- H is the hop size (in samples)
- f is the frame index
- N is the frame length (in samples)

For a signal of length T samples, the number of frames F is:

F = ⌊(T − N) / H⌋ + 1
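The framing step above can be sketched in NumPy as follows; the frame length and hop size defaults are illustrative values, not parameters reported in the paper:

```python
import numpy as np

def frame_signal(x, N=400, H=160):
    """Segment signal x into overlapping windowed frames:
    x_f(n) = x(f*H + n) * w(n), with F = floor((T - N)/H) + 1 frames."""
    T = len(x)
    F = (T - N) // H + 1                    # number of complete frames
    w = np.hanning(N)                       # Hann window
    frames = np.stack([x[f * H : f * H + N] * w for f in range(F)])
    return frames                           # shape (F, N)
```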
For each frame, MFCCs are extracted with the help of the Short-Time Fourier Transform (STFT):

X_f(k) = Σ_{n=0}^{N−1} x_f(n) e^{−j2πkn/N},  k = 0, …, N − 1

The mel-filterbank log-energies of the magnitude spectrum |X_f(k)| are then decorrelated with a discrete cosine transform to obtain the cepstral coefficients.
The resultant MFCC feature vector of frame f is m_f ∈ R^D. A GMM with G components is then used for feature modelling:

p(m) = Σ_{g=1}^{G} w_g · N(m; μ_g, Σ_g)

where w_g is the mixture weight, μ_g ∈ R^D is the mean vector, and Σ_g is the D × D covariance matrix of the gth Gaussian component. To represent the audio signal as a fixed-length embedding, the means and variances of the GMM components are computed and concatenated. The final audio embedding e is given by:

e = [μ_1, …, μ_G, σ_1², …, σ_G²]

where μ_g and σ_g² are the mean and variance of the gth Gaussian component. This results in a fixed-length representation that captures the underlying structure of the audio signal.
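A minimal sketch of the GMM-based fixed-length audio embedding, using scikit-learn's GaussianMixture as a stand-in for the paper's feature-modelling step (the component count G and input dimensions are assumed values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_audio_embedding(mfcc_frames, G=4, seed=0):
    """Fit a G-component diagonal GMM to per-frame MFCC vectors (shape (F, D))
    and return a fixed-length embedding: concatenated component means and
    variances, i.e. a vector of length 2 * G * D."""
    gmm = GaussianMixture(n_components=G, covariance_type="diag",
                          random_state=seed)
    gmm.fit(mfcc_frames)
    # means_ has shape (G, D); covariances_ (diag) has shape (G, D).
    return np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])
```

Whatever the number of audio frames F, the output length depends only on G and D, which is exactly the fixed-length property the paper requires.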
Video preprocessing begins with frame extraction, where a video sequence is divided into T frames, each of size H × W × C. Each frame I_t is split into non-overlapping patches of size P × P, resulting in n = HW / P² patches per frame. These patches are flattened into vectors x_{t,i} ∈ R^{P²·C} for the ith patch in frame I_t. The patches are then linearly embedded into fixed-length vectors using a learnable weight matrix E ∈ R^{d × P²·C} and bias b ∈ R^d:

z_{t,i} = E x_{t,i} + b

Positional encodings e_i, generated using sinusoidal functions, are added to retain spatial information:

z′_{t,i} = z_{t,i} + e_i

resulting in the final input for the Vision Transformer (ViT).
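The patch extraction, linear embedding, and sinusoidal positional encoding described above can be sketched in NumPy as follows; the patch size, embedding dimension, and random projection weights are illustrative placeholders for learned parameters:

```python
import numpy as np

def patch_embed(frame, P=16, d=64, seed=0):
    """Split an H x W x C frame into non-overlapping P x P patches, flatten
    each to length P*P*C, linearly project to d dims, and add sinusoidal
    positional encodings. Returns an (n, d) array, n = (H//P) * (W//P)."""
    H, W, C = frame.shape
    n = (H // P) * (W // P)
    patches = (frame[:H // P * P, :W // P * P]
               .reshape(H // P, P, W // P, P, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n, P * P * C))
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(P * P * C, d)) * 0.02   # stand-in for learned weights
    b = np.zeros(d)
    z = patches @ E + b
    # Sinusoidal positional encodings over patch index and channel.
    pos = np.arange(n)[:, None] / (10000 ** (np.arange(d)[None, :] / d))
    z += np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    return z
```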
The embedded patches are processed by the Vision Transformer, which consists of multi-head self-attention and feed-forward layers. The self-attention mechanism computes attention weights using queries Q, keys K, and values V:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

The output is passed through a feed-forward network with ReLU activation:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

The sequence of patch embeddings is aggregated into a single video-level representation using the class token z_cls, whose final-layer state is taken as the video embedding. This embedding captures both spatial and temporal characteristics, enabling downstream tasks such as classification and action recognition.
Fig 3 presents the process adopted for deriving audio features in the proposed framework. The input speech signal is initially preprocessed to suppress background noise and remove unwanted distortions, yielding a refined audio signal. From this processed signal, MFCCs are computed to capture relevant spectral patterns associated with human speech perception. These coefficients are subsequently summarized using simple statistical measures, such as their average and dispersion, to obtain a fixed-length representation that can be effectively utilized for emotion-aware modeling.
The figure illustrates the steps involved in noise reduction, MFCC computation, and statistical modeling for generating acoustic representations.
Fig 4 outlines the procedure used to extract visual features within the proposed framework. Each input image is first segmented into a set of uniform patches, which are subsequently converted into vector representations and augmented with positional information to retain spatial order. These patch embeddings are then fed into a Transformer encoder, where self-attention mechanisms and feed-forward layers model interactions across different regions of the image. The final output is produced through an MLP-based projection, yielding compact visual representations suitable for subsequent analysis or prediction tasks.
The figure illustrates the process of patch extraction, embedding, and self-attention-based feature learning for visual representation.
Feature fusion using self and cross-attention
Given the embeddings from each modality (i.e., text, audio, and visual), our aim is to fuse information both temporally and across modalities. Let us denote these as t_i, a_i, and v_i: the feature embeddings at the ith utterance for the text, audio, and video modalities, respectively.
Self-attention within modalities
Self-attention is applied independently within each modality across its temporal utterance sequence. This mechanism captures intra-modal temporal dependencies. For each modality m ∈ {t, a, v} with utterance-level feature sequence X_m, the self-attention mechanism computes:

SelfAttn(X_m) = softmax(Q_m K_m^T / √d_k) V_m

where the query, key, and value matrices are computed as:

Q_m = X_m W_Q^m,  K_m = X_m W_K^m,  V_m = X_m W_V^m

with W_Q^m, W_K^m, and W_V^m being learnable parameters.
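The intra-modal self-attention computation can be sketched in NumPy as follows (an illustration of the standard scaled dot-product formulation, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one modality's utterance
    sequence X of shape (T, d). Returns an array of shape (T, d_v)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(dk))   # (T, T) attention weights
    return A @ V
```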
Cross-modal attention
Self-attention mechanisms are effective at modeling temporal relationships within individual modalities; however, they are insufficient for capturing interactions between different modalities. To overcome this, we introduce a directional cross-modal attention strategy that allows each modality to incorporate relevant cues from the others in a selective manner. Let t_i, a_i, and v_i represent the self-attended embeddings of the text, audio, and visual modalities, respectively, at utterance index i. The cross-modal attention process is defined directionally as m1 ← m2, indicating that modality m1 receives contextual information from modality m2, which serves as the source. For each ordered modality pair (m1, m2), the cross-attended representation is computed using scaled dot-product attention:

CrossAttn(x_i^{m1}, x_i^{m2}) = softmax(Q_{m1} K_{m2}^T / √d_k) V_{m2}

where Q_{m1} denotes the query generated from the target modality m1, K_{m2} denotes the key generated from the source modality m2, V_{m2} denotes the value from the source modality, and d_k is the dimensionality of the attention subspace used for scaling. The queries, keys, and values are obtained through linear projections as follows:

Q_{m1} = x_i^{m1} W_Q^{m1m2},  K_{m2} = x_i^{m2} W_K^{m1m2},  V_{m2} = x_i^{m2} W_V^{m1m2}

where x_i^{m1} and x_i^{m2} represent the target and source modality embeddings at utterance i, respectively, and W_Q^{m1m2}, W_K^{m1m2}, and W_V^{m1m2} are learnable projection matrices specific to the modality pair.
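The directional cross-modal attention m1 ← m2 can be sketched analogously: queries come from the target modality while keys and values come from the source modality (again an illustrative NumPy sketch, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_tgt, X_src, Wq, Wk, Wv):
    """Directional cross-modal attention m1 <- m2: queries are projected from
    the target modality X_tgt (T_tgt, d); keys and values from the source
    modality X_src (T_src, d). Returns (T_tgt, d_v)."""
    Q, K, V = X_tgt @ Wq, X_src @ Wk, X_src @ Wv
    dk = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V
```

Note that the output has the target modality's sequence length: each target utterance is re-expressed as a weighted mixture of source-modality values.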
Text-guided cross-modal attention
To integrate acoustic and visual emotional cues into textual representations, the text modality attends to both the audio and visual modalities:

t_i^{←a} = CrossAttn(t_i, a_i),  t_i^{←v} = CrossAttn(t_i, v_i)

where t_i^{←a} and t_i^{←v} denote the cross-attended textual representations conditioned on audio and visual information, respectively, and CrossAttn(·,·) takes its queries from the first argument and its keys and values from the second.
Audio-guided cross-modal attention
To allow acoustic representations to be guided by semantic and visual context, the audio modality attends to the text and visual modalities:

a_i^{←t} = CrossAttn(a_i, t_i),  a_i^{←v} = CrossAttn(a_i, v_i)

where a_i^{←t} and a_i^{←v} represent audio features enhanced by textual and visual information, respectively.
Visual-guided cross-modal attention
To align visual emotional expressions with linguistic and acoustic signals, the visual modality attends to the text and audio modalities:

v_i^{←t} = CrossAttn(v_i, t_i),  v_i^{←a} = CrossAttn(v_i, a_i)

where v_i^{←t} and v_i^{←a} denote visual representations augmented with textual and acoustic cues, respectively.
In the proposed design, each cross-modal attention unit is equipped with projection layers tailored to specific modality pairs, allowing the model to learn detailed alignment relationships between heterogeneous inputs. By structuring the attention in a directional manner, information from one modality is incorporated into another in a controlled and selective way. The outputs of this cross-modal interaction are then combined with the corresponding self-attended representations, producing a unified feature representation that supports multimodal emotion comprehension and response generation.
Fused representation
The final fused representation F_i at each time step i is obtained by concatenating the per-modality representations after their self-attended and cross-attended outputs have been combined:

F_i = [t̂_i ; â_i ; v̂_i]

where t̂_i, â_i, and v̂_i denote the combined text, audio, and visual representations and [· ; ·] denotes concatenation.
This representation captures both temporal dependencies within modalities and cross-modal interactions, enabling comprehensive multimodal emotion understanding.
Fig 5 describes how attention mechanisms are used to combine information from multiple modalities. For each individual modality (text, audio, or video) self-attention is applied to learn internal relationships among its constituent elements. In contrast, cross-attention operates across modalities, allowing features from one stream to be informed by representations from the others. Through this coordinated interaction, the model is able to fuse complementary cues from different sources, resulting in representations that better reflect both contextual dependencies and emotional nuances.
The figure illustrates the interaction between self-attention and cross-attention for multimodal feature fusion.
Response generation
Following the multi-modal fusion stage, we obtain a sequence of fused representations F = (F_1, …, F_T), where each F_i captures both temporal and cross-modal dependencies at utterance i. The response generation module utilizes this enriched multi-modal context to generate emotionally aware, contextually appropriate responses.

We employ a Transformer decoder to autoregressively generate the target response y = (y_1, …, y_J), where y_j is the token at position j and V denotes the vocabulary. The decoder attends to the fused representations F via encoder-decoder cross-attention:

c_j = Attention(q_j, K_F, V_F)
where:

q_j = h_j W_Q,  K_F = F W_K,  V_F = F W_V

with W_Q, W_K, and W_V being learnable parameters. Here, Attention(·) denotes a standard scaled dot-product attention mechanism that aligns the decoder hidden state with the fused multimodal context. Specifically, the decoder hidden representation h_j at generation step j is projected to a query vector q_j, while the fused multimodal representations F are projected to key and value matrices K_F and V_F. The attention weights are computed by measuring the similarity between the query and each key vector, normalised using the softmax function, and are subsequently used to compute a weighted sum over the value vectors. This operation enables the decoder to selectively attend to the most relevant multimodal emotional context when generating each response token.
The token probability distribution is computed as:

P(y_j | y_{<j}, F) = softmax(h_j W_o + b_o)

where W_o ∈ R^{d × |V|} and b_o ∈ R^{|V|} are learnable output projection parameters.
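The output projection and softmax over the vocabulary can be sketched as follows (illustrative NumPy code; W_o and b_o stand in for the learned output parameters):

```python
import numpy as np

def token_distribution(h, Wo, bo):
    """Project a decoder hidden state h (dim d) to vocabulary logits and
    apply a numerically stable softmax: P(y_j | y_<j, F) = softmax(h Wo + bo).
    Returns a probability vector of length |V|."""
    logits = h @ Wo + bo
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

At inference time the next token is drawn from (or chosen as the argmax of) this distribution, and the chosen token is fed back into the decoder for step j + 1.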
Results and discussion
In this section, we examine the results of the experiments conducted to measure the performance of the proposed model. The evaluations show that the proposed model outperforms the baseline models.
Dataset
There are several datasets available, as already discussed in the literature review. We have chosen two well-known multi-modal datasets, the Multimodal EmotionLines Dataset (MELD) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, since they are widely used benchmarks in the literature [23,33–36]. Table 2 describes the complete details of each dataset along with its availability source. The first dataset, IEMOCAP, includes text, audio, and video modalities totalling approximately 12 hours of data. It contains conversations between ten different actors, and each conversation is annotated with one of the six emotions used in the dataset. The second dataset, MELD, is also a multimodal dataset containing text, audio, and video data. It includes 13,000 utterances and 1,400 dialogues from the Friends TV series, with each utterance annotated with one of the seven emotions used in the dataset. Further information about the pre-processed multimodal features used in this study is provided in supporting files S1 File (IEMOCAP) and S2 File (MELD).
Table 3 presents the complete statistics of the two datasets, listing the total number of conversations and utterances in each. The datasets are split into training and testing sets in an 80:20 ratio. The utterances in IEMOCAP are organized into five sessions, with the first four sessions designated for training and the final session reserved for testing. Each utterance is assigned one of the emotion labels.
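The session-based IEMOCAP split described above can be sketched as follows; the utterance records are hypothetical placeholders rather than actual corpus entries:

```python
# Hypothetical utterance records tagged with their IEMOCAP session number
# (5 sessions, 4 toy utterances each).
utterances = [
    {"session": s, "text": f"utt-{s}-{i}"}
    for s in range(1, 6)
    for i in range(4)
]

# Sessions 1-4 form the training set; session 5 is held out for testing,
# which yields the 80:20 split used in the paper.
train = [u for u in utterances if u["session"] <= 4]
test = [u for u in utterances if u["session"] == 5]

print(len(train), len(test))  # 16 4
```

Splitting by session rather than by random utterance keeps each speaker pair entirely in one partition, which avoids speaker leakage between training and testing.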
Performance evaluation measure
We employ four standard classification metrics, namely precision, recall, F1 score, and accuracy, to evaluate the effectiveness of the model. Given the numbers of true positives ($TP$), false positives ($FP$), true negatives ($TN$), and false negatives ($FN$), they are computed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}.$$
In addition, classifier effectiveness across decision thresholds is characterised in terms of:
- $CP(i\,|\,\text{positive})$, the cumulative proportion of positive class samples
- $CP(i\,|\,\text{negative})$, the cumulative proportion of negative class samples
- $i$, the classification threshold
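A minimal sketch of the four classification measures, computed one-vs-rest for a single class using a toy label set rather than the actual model outputs:

```python
def classification_metrics(y_true, y_pred, positive):
    """Precision, recall, F1, and accuracy, treating one label as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(y_true)
    return precision, recall, f1, accuracy

# Toy emotion labels (not from the actual test set).
y_true = ["sad", "joy", "sad", "neutral", "sad"]
y_pred = ["sad", "sad", "sad", "neutral", "joy"]
p, r, f1, acc = classification_metrics(y_true, y_pred, positive="sad")
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.67 0.67 0.67 0.6
```

For the multi-class setting, such per-class scores are typically combined into the weighted F1 (w-F1) reported in the results tables by averaging over classes with weights proportional to class support.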
Baseline methods
The study compares the proposed model with the following baseline approaches:
- Baseline 1: SDT [33] presents a transformer-based model with self-distillation, enabling knowledge transfer from both hard and soft labels to each modality. Its training objective combines a task loss with a distillation loss:
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{distill}},$$
where $\lambda$ balances the loss terms.
- Baseline 2: MMGCN [34] constructs a conversation graph $G = (V, E)$ with multimodal nodes and edges, applying graph convolution:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops and $\tilde{D}$ is the corresponding degree matrix.
- Baseline 3: DialogueTRM [35] uses hierarchical transformers with multi-grained fusion:
$$h = \sum_{m \in \{t,\,a,\,v\}} \alpha_m h_m,$$
where $\alpha_m$ are learnable modality weights.
- Baseline 4: MM-DFN [36] implements dynamic fusion via a gating mechanism of the form:
$$h_t = g_t \odot \tilde{h}_t + (1 - g_t) \odot h_{t-1},$$
where $g_t$ is the gating vector at time $t$.
Results
We compared our proposed fusion model with the four baseline fusion models. Tables 4 and 5 report the performance of the baseline methods and EAC-Agent on the IEMOCAP and MELD datasets, respectively. The results show that our model outperforms the state-of-the-art in overall accuracy and weighted F1 (w-F1) score on both IEMOCAP and MELD. Among the baselines, SDT performs best, with the remaining three models trailing it. In both Tables 4 and 5, the last row reports the overall accuracy and w-F1.
Fig 6 shows that the proposed fusion strategy significantly outperforms other fusion methods. The results suggest that direct fusion via Add or Concatenation is suboptimal. The proposed method enhances performance by first filtering irrelevant information at the unimodal level, then dynamically assigning weights across modalities at the multimodal level, resulting in more effective multimodal representation fusion.
The figure compares the classification performance of various fusion strategies on the IEMOCAP and MELD datasets.
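A rough numerical sketch of the two-level idea described above, unimodal filtering followed by dynamic cross-modal weighting, using random stand-ins for the learned gates and features; this is an illustrative approximation, not the actual EAC-Agent implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d = 8
# Toy unimodal utterance representations.
modalities = {
    "text": rng.normal(size=d),
    "audio": rng.normal(size=d),
    "video": rng.normal(size=d),
}

# Unimodal level: a sigmoid gate suppresses irrelevant dimensions
# within each modality before fusion.
def gate(h, W):
    g = 1.0 / (1.0 + np.exp(-(h @ W)))  # element-wise gate in (0, 1)
    return g * h

gate_weights = {m: rng.normal(size=(d, d)) for m in modalities}
filtered = {m: gate(h, gate_weights[m]) for m, h in modalities.items()}

# Multimodal level: scalar relevance scores -> softmax weights across
# modalities, then a weighted sum produces the fused representation.
w = rng.normal(size=d)  # shared scoring vector (assumption)
scores = np.array([filtered[m] @ w for m in filtered])
alpha = softmax(scores)
fused = sum(a * filtered[m] for a, m in zip(alpha, filtered))
print(fused.shape)  # (8,)
```

Unlike plain Add or Concatenation, the softmax weights let the fusion step emphasise whichever modality is most informative for the current utterance.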
The confusion matrices in Fig 7 show that the proposed model performs well on both the IEMOCAP and MELD datasets, especially for common emotions such as Neutral, Sad, and Joy. Some confusion can be seen between closely related emotions, such as Happy and Excited or Angry and Frustrated in IEMOCAP, as well as between Surprise and Neutral in MELD, which reflects the similarity in their emotional expressions.
The figure illustrates the classification performance of the proposed model on the IEMOCAP and MELD datasets.
To evaluate the generated responses and their embedded emotion, we use Perplexity, BLEU-4, and ROUGE-L scores, as reported in Tables 6 and 7. Note that the image and audio modalities are included in these evaluations, although they are not commonly used with BLEU and ROUGE metrics. Since BLEU-4 and ROUGE-L are text-based evaluation metrics, the audio and image modalities are included as input conditioning signals that guide the generation of textual responses. These modalities provide complementary emotional and contextual cues, such as vocal tone and facial expressions, that influence response generation but are not directly evaluated. The reported scores therefore reflect the quality of the generated text under different conditioning settings, allowing analysis of how each modality contributes to response coherence and emotional alignment.
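Of the reported metrics, perplexity is the exponential of the average negative log-likelihood the model assigns to each reference token; a minimal sketch with assumed per-token probabilities (not values from the trained model):

```python
import math

# Hypothetical probabilities a model assigned to the tokens of one
# reference response.
token_probs = [0.4, 0.25, 0.1, 0.3, 0.2]

# Perplexity = exp of the mean negative log-likelihood per token;
# lower values indicate the model finds the reference less surprising.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(round(perplexity, 2))  # 4.41
```

Equivalently, perplexity is the inverse geometric mean of the token probabilities, so the reported values of 39.01 and 42.30 correspond to an average per-token probability of roughly 1/39 and 1/42.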
Ablation study
To understand the importance and contribution of each modality to correct emotion detection, we conducted experiments under different ablation settings.
The ablation analyses for both the IEMOCAP and MELD datasets appear in Table 8, which quantifies how each modality, individually and in combination, contributes to overall performance. The text-only modality performs best among the single modalities on both datasets, attaining an accuracy of 66.42% on IEMOCAP and 66.82% on MELD. This suggests that linguistic content provides the strongest and most reliable signal for emotion detection, since text carries explicit semantic and contextual information, whereas acoustic cues such as pitch and intensity convey emotional intensity but can be ambiguous about the speaker’s intention. The visual-only modality is the weakest, particularly on IEMOCAP (42.45% accuracy), likely due to speaker-to-speaker variability in facial expressions, occluded features, pose changes during recording, and misaligned visual frames; this clearly indicates that visual information alone is insufficient to infer emotional intent.
When modalities are combined, a consistent performance improvement is observed. The text+audio configuration provides the most substantial gain among all bimodal combinations, achieving 74.45% accuracy on IEMOCAP and 69.78% on MELD. This improvement highlights the complementary relationship between semantic information from text and prosodic cues from audio, which together enable more robust emotion discrimination. The text+visual combination also improves performance compared to single modalities, although its gains are slightly lower than text+audio, suggesting that visual cues are beneficial but less stable than acoustic features. The audio+visual combination yields comparatively limited improvement and performs substantially worse than text-inclusive configurations, particularly on MELD (49.12% accuracy). This observation underscores the importance of linguistic grounding in emotion understanding, as non-textual modalities alone lack sufficient contextual information to accurately infer emotional intent. Finally, the proposed EAC-Agent, which integrates all three modalities, achieves the best overall performance across both datasets, with accuracies of 76.27% on IEMOCAP and 67.57% on MELD. This confirms that jointly modeling text, audio, and visual information allows the system to capture complementary emotional cues more effectively than any individual or partial modality combination.
Conclusion
In this work, we introduced a modern conversational agent (CA) that allows users to express their emotions through multiple modalities: text, audio, and visual input. A novel deep learning approach is proposed to fuse these modalities effectively. As a result, the CA is better positioned to engage users by generating empathetic responses, so that users feel they are talking to a sincere companion. The experiments show that users were more satisfied after a session with the proposed CA and stayed engaged longer than with a traditional CA. The proposed work offers a framework for developing intelligent CAs, contributing to both NLP and Human-Computer Interaction (HCI). In the future, this research could serve as a baseline for extension to different languages and for incorporating the cultural context of a region. Moreover, it can play a vital role in domains such as customer service and mental health.
Supporting information
S1 File. iemocap multimodal features.pkl.
Pre-processed multimodal feature dataset derived from the IEMOCAP corpus, publicly available at https://figshare.com/ndownloader/files/61784017.
https://doi.org/10.1371/journal.pone.0346770.s001
(DOCX)
S2 File. meld multimodal features.pkl.
Pre-processed multimodal feature dataset derived from the MELD corpus, publicly available at https://figshare.com/ndownloader/files/61784020.
https://doi.org/10.1371/journal.pone.0346770.s002
(DOCX)
References
- 1.
Shrivastava N, Tewari P, Sujatha S, Bogireddy SR, Varshney N, Sharma V. Natural Language Processing for Conversational AI: Chatbots and Virtual Assistants. In: 2025 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), 2025. 1–6. https://doi.org/10.1109/iatmsi64286.2025.10984818
- 2.
Ranoliya BR, Raghuwanshi N, Singh S. Chatbot for university related FAQs. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE; 2017. 1525–30.
- 3. Ngai EWT, Lee MCM, Luo M, Chan PSL, Liang T. An intelligent knowledge-based chatbot for customer service. Electronic Commerce Research and Applications. 2021;50:101098.
- 4. Lin Z, Wang Y, Zhou Y, Du F, Yang Y. MLM-EOE: Automatic Depression Detection via Sentimental Annotation and Multi-Expert Ensemble. IEEE Trans Affective Comput. 2025;16(4):2842–58.
- 5. Misischia CV, Poecze F, Strauss C. Chatbots in customer service: Their relevance and impact on service quality. Procedia Computer Science. 2022;201:421–8.
- 6. Okonkwo CW, Ade-Ibijola A. Chatbots applications in education: A systematic review. Computers and Education: Artificial Intelligence. 2021;2:100033.
- 7. Wang T, Hou B, Li J, Shi P, Zhang B, Snoussi H. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering. Advanced Intelligent Systems. 2023;5(4).
- 8. Meng T, Shou Y, Ai W, Du J, Liu H, Li K. A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition. Neurocomputing. 2024;569:127109.
- 9.
Welivita A, Yeh C-H, Pu P. Empathetic Response Generation for Distress Support. In: Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, 2023. 632–44. https://doi.org/10.18653/v1/2023.sigdial-1.59
- 10. Junchi M, Chaudhry HN, Kulsoom F, Guihua Y, Khan SU, Biswas S, et al. MULTICAUSENET temporal attention for multimodal emotion cause pair extraction. Sci Rep. 2025;15(1):19372. pmid:40461499
- 11. Chaudhry HN, Kulsoom F, Ullah Khan Z, Aman M, Khan SU, Albanyan A. TASCI: transformers for aspect-based sentiment analysis with contextual intent integration. PeerJ Comput Sci. 2025;11:e2760. pmid:40567804
- 12.
Das D. Classifying emotional utterances by employing multi-modal speech emotion recognition. In: Proceedings of the Workshop on Speech and Music Processing, 2021. 1–13.
- 13. Jiang H, Chen X, Miao D, Zhang H, Qin X, Du S, et al. 3WD-DRT: A three-way decision enhanced dynamic routing transformer for cost-sensitive multimodal sentiment analysis. Information Sciences. 2026;725:122704.
- 14.
Christensen S, Johnsrud S, Ruocco M, Ramampiaro H. Context-aware sequence-to-sequence models for conversational systems. arXiv preprint. 2018.
- 15. Saffaryazdi N, Gunasekaran TS, Loveys K, Broadbent E, Billinghurst M. Empathetic Conversational Agents: Utilizing Neural and Physiological Signals for Enhanced Empathetic Interactions. International Journal of Human–Computer Interaction. 2025;42(6):4555–79.
- 16. Xu GJW, Pan S, Sun PZH, Guo K, Park SH, Yan F, et al. Human-Factors-in-Aviation-Loop: Multimodal Deep Learning for Pilot Situation Awareness Analysis Using Gaze Position and Flight Control Data. IEEE Trans Intell Transport Syst. 2025;26(6):8065–77.
- 17. Wang T, Li J, Wu H-N, Li C, Snoussi H, Wu Y. ResLNet: deep residual LSTM network with longer input for action recognition. Front Comput Sci. 2022;16(6).
- 18. Chaudhry HN, Kulsoom F, Ullah Khan Z. (GRAVITY) Graph-Based Reasoning With Attention and Visual Information Using Transformers for Yielding Answers. IEEE Access. 2025;13:160411–37.
- 19. Chang Y-C, Hsing Y-C. Emotion-infused deep neural network for emotionally resonant conversation. Applied Soft Computing. 2021;113:107861.
- 20. Liu W, Chen X, Miao D, Zhang H, Qin X, Du S, et al. SEAD-MGFE-Net: Schrödinger equation-based adaptive dropout multi-granular feature enhancement network for conversational aspect-based sentiment quadruple analysis. Information Sciences. 2026;723:122684.
- 21.
Dinan E, Logacheva V, Malykh V, Miller A, Shuster K, Urbanek J, et al. The Second Conversational Intelligence Challenge (ConvAI2). The Springer Series on Challenges in Machine Learning. Springer International Publishing. 2019. p. 187–208. https://doi.org/10.1007/978-3-030-29135-8_7
- 22. Dileep Kumar MJ, Sukesh Rao M, Narendra KC. Multimodal Emotion Recognition: A Comprehensive Survey of Datasets, Methods, and Applications. IEEE Access. 2025;13:201067–97.
- 23.
Popovic I, Culibrk D, Mirkovic M, Vukmirovic S. Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations. In: Proceedings of the 1st International Workshop on Multimodal Conversational AI, 2020. 31–8. https://doi.org/10.1145/3423325.3423737
- 24. Lian H, Lu C, Li S, Zhao Y, Tang C, Zong Y. A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. Entropy (Basel). 2023;25(10):1440. pmid:37895561
- 25.
Zadeh AB, Liang PP, Vanbriesen J, Poria S, Tong E, Morency LP. Multimodal language analysis in the wild: CMU-MOSEI dataset and benchmark. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. 2236–46.
- 26. Ramaswamy MPA, Palaniswamy S. Multimodal emotion recognition: A comprehensive review, trends, and challenges. WIREs Data Min & Knowl. 2024;14(6).
- 27. Shou Y, Meng T, Ai W, Fu F, Yin N, Li K. A Comprehensive Survey on Multi-modal Conversational Emotion Recognition with Deep Learning. ACM Trans Inf Syst. 2026;44(2):1–48.
- 28.
Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. 154–64. https://doi.org/10.18653/v1/d19-1015
- 29. Ho N-H, Yang H-J, Kim S-H, Lee G. Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network. IEEE Access. 2020;8:61672–86.
- 30.
Bhangdia Y, Bhansali R, Chaudhari N, Chandnani D, Dhore ML. Speech Emotion Recognition and Sentiment Analysis based Therapist Bot. In: 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), 2021. 96–101. https://doi.org/10.1109/icirca51532.2021.9544671
- 31.
Zhang D, Wu L, Sun C, Li S, Zhu Q, Zhou G. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019. 5415–21. https://doi.org/10.24963/ijcai.2019/752
- 32. Liu S, Luo Z, Fu W. Fcdnet: Fuzzy Cognition-Based Dynamic Fusion Network for Multimodal Sentiment Analysis. IEEE Trans Fuzzy Syst. 2025;33(1):3–14.
- 33. Ma H, Wang J, Lin H, Zhang B, Zhang Y, Xu B. A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Trans Multimedia. 2024;26:776–88.
- 34.
Hu J, Liu Y, Zhao J, Jin Q. MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. arXiv preprint. 2021.
- 35.
Mao Y, Liu G, Wang X, Gao W, Li X. DialogueTRM: Exploring Multi-Modal Emotional Dynamics in a Conversation. In: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021. 2694–704. https://doi.org/10.18653/v1/2021.findings-emnlp.229
- 36.
Hu D, Hou X, Wei L, Jiang L, Mo Y. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. 7037–41. https://doi.org/10.1109/icassp43922.2022.9747397
- 37.
Firdaus M, Chauhan H, Ekbal A, Bhattacharyya P. MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations. In: Proceedings of the 28th International Conference on Computational Linguistics (COLING), 2020. 4441–53.
- 38.
Zahiri S, Choi JD. EmoryNLP: A multi-party, multi-modal dataset for emotion recognition in conversations. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. 527–41.
- 39.
Li Y, Su H, Shen X, Li W, Cao Z, Niu S. DailyDialog: A manually labelled multi-turn dialogue dataset. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017. 986–95.
- 40.
Barros P, Churamani N, Lakomkin E, Siqueira H, Sutherland A, Wermter S. The OMG-Emotion Behavior Dataset. In: 2018 International Joint Conference on Neural Networks (IJCNN), 2018. https://doi.org/10.1109/ijcnn.2018.8489099