
HGLER: Hierarchical heterogeneous graph networks for enhanced multimodal emotion recognition in conversations

  • Qingping Zhou

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    csu2013zqp@gmail.com

Affiliation School of Software, Hunan College of Information, Changsha, Hunan Province, China

Abstract

This research proposes a new Emotion Recognition in Conversation (ERC) model, Hierarchical Graph Learning for Emotion Recognition (HGLER), built to go beyond existing approaches that struggle to capture long-distance context and interactions across different data types. Rather than simply mixing different kinds of information, as traditional methods do, HGLER uses a dual-graph technique in which conversations are represented in two complementary ways: one graph captures how the parts of the conversation relate to one another, and the other enhances learning across data types. This dual-graph design exploits the strengths of each modality while tracking their interactions. HGLER was applied to two widely used multimodal datasets, IEMOCAP and MELD, which combine text, images, and sound, to evaluate how well the model understands emotions in conversations. Standard preprocessing was applied for consistency, and the datasets were split into training, validation, and test sets following previous work. On IEMOCAP, HGLER posted an F1-score of 96.36% and an accuracy of 96.28%; on MELD, it posted an F1-score of 96.82% and an accuracy of 93.68%, surpassing several state-of-the-art methods. The model also showed strong convergence, generalization, and training stability. However, slight fluctuations in the validation loss suggest that model stability and generalization can still be improved. These results validate that hierarchical graph-based learning performs well in multimodal ERC and promises to enhance emotional understanding in conversational AI systems.

1. Introduction

Greater emphasis on human-computer interaction (HCI) has emerged recently because of the rapid development of natural language processing (NLP) technology [1]. A key component of HCI [2] is the capacity to respond to conversation content and to engage in emotional communication with machines, as shown in studies [3,4]. A wide variety of applications are quickly adopting emotion recognition in conversation (ERC), including medical diagnosis, opinion mining, dialogue generation, recommendation systems, and more. Thus, ERC has become an essential tool for facilitating better and more tailored interactions.

For conversation systems to correctly identify the emotions expressed in utterances, contextual knowledge plays a crucial role. Consequently, contextual modeling has been the subject of a great deal of research. One example is HRED-A [5], which uses an attention mechanism in conjunction with a recursive encoder-decoder design to draw attention to certain segments of the discourse. By incorporating external knowledge from knowledge bases into the Transformer model, KET [6] enhances emotion perception. DialogueRNN [7] makes use of three kinds of gated recurrent units (GRUs) during the conversation: one that updates the speaker's state, one that retrieves contextual information, and one that alters the global emotional state. For emotion detection tasks, DialogueGCN [8] builds a graph structure that uses data about the speaker and the conversation sequence to combine general and speaker-specific information. Nevertheless, these methods depend solely on textual input to identify emotions in conversation. When people engage in person, they usually use a mix of text, body language, facial expressions, and gestures in their dialogues [9]. Emotion identification models that rely on a single modality are unable to keep up with the ever-increasing performance requirements of real-world applications.

Multimodal emotion recognition leverages the complementary and resilient properties of multiple modalities, in contrast to single-modal emotion recognition [10]. By integrating textual, auditory, and visual elements, BC-LSTM [11] enhances emotion identification through the utilization of multimodal information. DialogueCRN [12] incorporates perceptual and cognitive phases to extract and integrate emotional characteristics, using a multi-turn reasoning module that mimics human cognition by cycling between intuitive retrieval and conscious reasoning. MFN [13] incorporates features from several modalities through multiview fusion; however, it cannot model contextual information in sequential data. In contrast, ICON [14] captures the speaker-interlocutor contextual connection using a GRU-based multi-hop memory network and an attention module that focuses on crucial emotional signals. Despite including several modalities, these techniques struggle to capture conversational context over long distances.

Much recent research has utilized Graph Neural Networks (GNNs) [15] for identifying emotions in conversations, inspired by their high performance in relational modeling [16]. MMGCN [17] constructs a graph for the task and employs Graph Convolutional Networks to blend multimodal inputs and model long-distance effects; however, it does not account for distinctions among modalities. ConGCN [18] resolves this problem with a multi-structured network that captures speaker linkages. During graph convolution, ConGCN processes node characteristics together with neighbor information and the contextual elements the framework detects. COGMEN [19] builds a directed graph structure that incorporates speaker relationships across three modality inputs; its RGCN tracks relational dependencies both between different speakers and within each speaker, leading to strong results.

The techniques mentioned above fail to address modal heterogeneity because they simply fuse attributes from all three modalities while disregarding between-modality relationships. To advance standard graph construction for this issue, the research in [20] developed three separate graph structures, each preserving just two modalities. The system employs attention-based mechanisms in these graphs to detect information within each modality while tracking multi-modal correspondences. GraphMFT argues that restricting each graph to two modalities reduces heterogeneity, which improves tri-modal fusion. GraphCFC models multimodal conversations as directed graphs that include contextual utterances, and resolves heterogeneity between modalities through PairCC. Understanding the network structure remains critical because local nodes can carry information from within one modality or between modalities. Effective interaction between modalities suffers when contextual information from one modality directly conflicts with interaction input from another during multimodal fusion [21]. Existing multimodal ERC models often have difficulty with (1) understanding long-range context (e.g., DialogueRNN, MFN), (2) handling interactions between heterogeneous modalities (e.g., GraphCFC, MMGCN), and (3) over-smoothing in GNN-based structures. Our proposed HGLER model introduces the following contributions to address these issues:

  • A dual heterogeneous graph structure is designed to separately learn utterance-level context and cross-modal interactions, addressing challenges in long-distance and multi-modality learning.
  • Use of RGCN, Graph Transformer, and GraphSAGE modules to handle heterogeneous relations and dynamically focus on relevant context.
  • An adaptive residual connection module to mitigate over-smoothing in deep GNN layers.
  • These innovations allow HGLER to outperform state-of-the-art models (e.g., MM-Refiner, STAGE) by 13–20% in F1-score on benchmark datasets IEMOCAP and MELD.
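The adaptive residual connection mentioned above can be illustrated with a minimal numpy sketch. This is not the paper's exact module: the gate `alpha` is a learnable parameter in practice but a fixed scalar here, and the GNN layer output is mocked.

```python
import numpy as np

def adaptive_residual(h_prev, h_new, alpha):
    """Blend a GNN layer's output with its input to curb over-smoothing.

    In the real model alpha would be learned; here it is a fixed scalar
    chosen purely for illustration.
    """
    return alpha * h_new + (1.0 - alpha) * h_prev

# Toy node features before and after a (mocked) GNN layer.
h_prev = np.array([[1.0, 0.0], [0.0, 1.0]])
h_new = np.array([[0.5, 0.5], [0.5, 0.5]])  # over-smoothed: nodes collapse together

h_out = adaptive_residual(h_prev, h_new, alpha=0.6)
print(h_out)  # nodes stay distinguishable because part of the input is kept
```

Because part of each node's original signal is retained, deep stacks of such layers are less prone to collapsing all node representations into one.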

The remainder of this paper is structured as follows: Section 2 reviews related work on ERC and graph convolutional networks. Section 3 introduces the proposed HGLER architecture and its learning mechanisms. Section 4 describes the datasets, experimental settings, and evaluation metrics, and presents the results with comparisons to existing techniques. Finally, Section 5 summarizes the work and discusses future research directions.

2. Related works

2.1. Emotion recognition in conversation

ERC requires analyzing conversation data to evaluate the emotional tone of each utterance against a set of designated emotions, given the available logs and speaker profiles. Current ERC models fall into two operational modes depending on the number of modalities they integrate. During early development, the field favored unimodal emotion recognition because it offered practical advantages and a straightforward design, relying solely on single-input data. The hierarchical processing model of HiGRU [22] breaks dialogues into separate levels, from utterances up to conversations, and applies distinct encoding operations at each level. The extraction of emotional signals relies on emotion-related information together with mental processes and social connectivity information. CSMIC [23] unifies its operations with a fundamental knowledge base system. DialogueGCN creates a dialogue graph in which each utterance is a node, with edges showing how emotions are connected and how the conversation evolves. All of these techniques, however, rely on unimodal emotion identification.

Multimodal emotion detection has grown popular in recent years because it offers a more complete and precise emotional understanding of discourse in interactions that blend voice, text, and facial expressions. A full understanding of communicated emotion requires uniting auditory, textual, and visual inputs. BC-LSTM [24] unites different modalities through feature-level fusion but fails to capture their actual interaction patterns. The CMN system [25] employs memory networks to track past dialogue context through emotion recognition modules that extract and connect the data. ICON [26] incorporates emotional data from multiple sources into an interconnected global memory system, enhancing context awareness. MFN [27] addresses multimodal fusion through its modality integration and feature alignment mechanism, while DialogueCRN [28] uses perceptual and cognitive processing stages to mimic human emotional cognition, alternating between conscious analytical operations and instinctive memory retrieval to identify important emotional triggers.

DialogueRNN [7] emulates the emotional flow within a conversation using an RNN-based architecture. It tracks the speakers' emotional states and the dynamics of the surrounding context to extract beneficial emotional components. Many of these models, however, struggle to properly represent long-range contextual dependencies or inter-modal interactions. Graph-based models have filled this gap. MM-DFN [29] is one such example; it combines a dynamic fusion module with a graph structure to integrate multimodal contextual data. DIMMN [30] captures the changing patterns of multimodal input and dynamically simulates emotional fluctuations throughout conversations. Despite these efforts, obstacles remain in conversational emotion identification when it comes to capturing complex contextual relationships and cross-modal interactions.

2.2. Graph convolutional networks

One kind of deep learning model that excels in processing graph-structured data is the Graph Convolutional Network (GCN). GCNs are constructed to process data in non-Euclidean domains, such as graphs, in contrast to conventional Convolutional Neural Networks (CNNs) that operate on data in Euclidean spaces, like images. Social network analysis and recommendation systems are two areas that can greatly benefit from GCNs due to their local connectedness and weight sharing [31]. Significant GCN variants that have evolved rapidly include Graph Attention Networks (GAT) [32], Relational Graph Convolutional Networks (RGCN) [20], Graph Transformer [33], and GraphSAGE [34]. Such variants have made their way into the Emotion Recognition in Conversations (ERC) challenge due to their improved modeling capabilities. One example is DialogueGCN [8], which uses GCNs to record the interdependence between statements in a conversation. MMGCN [35] uses GCNs to integrate various modalities and model long-range dependencies by building graphs with speaker dependency information. COGMEN [36] builds a directed graph using speaker connections after merging information from three modalities. It then employs a Transformer architecture to characterize contextual relationships.

Recent advancements in multimodal emotion detection have introduced several innovative methods that serve as a solid basis for comparison with our proposed HGLER model. HiMul-LGG is a hierarchical framework with local and global graphs that learns how disparate data types influence one another and incorporates decision fusion to reconcile differences between modalities [21]. SMIN combines a semi-supervised learning algorithm with cross-modal memory units to improve low-resource performance [37]. GraphCFC attempts to reduce modality disparity by forming pairwise modality graphs in which contrastive fusion preserves the substantial interactions. EMT applies a transformer-style model to EEG signals, showing how attention can enhance emotion recognition [38]. M3Net features a two-channel graph technique with mirror attention, enabling nodes to capture both the ordering and fine details of the different data types [39]. Finally, MM-DFN applies switching attention and intelligent gating to integrate emotional cues from various sources in conversations [29]. These works demonstrate that graph- and attention-based approaches to Emotion Recognition in Conversation (ERC) are diverse, and they provide strong baselines for measuring the effectiveness of hierarchical graph learning methods such as HGLER.

GraphMFT [17] takes an alternative approach by constructing graph structures with at most two modalities at a time, utilizing GCNs to capture inter-node interactions and to handle variability across modalities. M3Net [39] achieves state-of-the-art performance in ERC tasks by using two concurrent graph propagation channels to thoroughly investigate the intricate connections between modalities and contextual data. GCNs and their variants have proved highly effective at modeling intra- and inter-modal interactions, making them a strong framework for understanding the complex dynamics of emotion in communication.

3. Materials and methods

This section explains the proposed approach in detail. The overall framework presented in Fig 1 includes unimodal feature extraction, followed by heterogeneous graph construction and learning, feature integration, and finally emotion prediction. Fig 1 shows a learning model that integrates text, audio, and images through several types of graph learning combined with deep neural networks. In the initial processing stage, a separate feature extraction method operates on each input stream: Sentence Transformers generate rich text representations, MFCCs extract vocal and emotional information from audio signals, and DenseNet-121 analyzes video frames for visual features. The extracted features form a combined representation aligned over time across text, audio, and visual material. The model maintains sequential order through intra-modal graph connections within each input stream and uses cross-modal connections to link input types, capturing temporal and contextual associations across data formats. The network design uses Relational Graph Convolutional Networks (RGCN) to handle different connection types, Graph Transformers to weight the importance of information, and GraphSAGE to aggregate details from neighboring nodes and generalize to new information. The deep neural network applies batch normalization to the enhanced features, then ReLU-activated dense layers with dropout regularization, ending in classification through a final softmax layer. The system combines the benefits of multimodal learning and graph-based reasoning to create accurate decision systems for tasks such as emotion recognition and multi-source sentiment analysis.

3.1. Multimodal feature extraction

The initial step involves deriving multiple feature types that represent the speaker directly from video information. These features encompass information from various sources such as audio, visual cues, and textual transcripts. After extracting multimodal features from the gathered data, the HGLER model analyzes the conversation to predict emotional states. The data extraction procedures detail how each data type is processed for analysis by the HGLER model.

3.1.1. Sentence transformer.

Natural language processing (NLP) faces an essential challenge when attempting to extract semantic meaning from text across various languages. Sentence embedding models vectorize complete sentences into dense fixed-length representations that capture their semantic meaning. Such vectorization is important for many applications, including semantic similarity search, paraphrase detection, question answering, and document clustering. BERT succeeded in monolingual settings yet proved inefficient for such tasks and showed limited cross-lingual adaptability. Research teams developed MiniLM as a lightweight, powerful model that maintains performance while lowering computational requirements. The paraphrase-multilingual-MiniLM-L12-v2 model [40] is part of the Microsoft-developed MiniLM family. This version tailors its performance toward providing quality sentence embeddings across more than 50 languages. The model is specifically trained to detect semantic relationships and paraphrases, making it well suited for multilingual scenarios. This 12-layer Transformer encoder features a hidden size of 384 and 6 attention heads, a reduced dimensionality compared to BERT-base (hidden size 768, 12 heads). As a result, it functions effectively for real-time operations and runs efficiently on limited hardware [41].

The core functionality of this model depends on the Transformer encoder, which establishes relations among all tokens within a sentence through self-attention operations. Given an input sequence X = (x_1, ..., x_n), each token is embedded and passed through multiple layers of self-attention and feedforward transformations. The self-attention mechanism calculates a weighted sum of the representations of all other tokens for each token, thereby encoding the contextual meaning of a word based on its surroundings. The resulting hidden states for all tokens are denoted H = (h_1, ..., h_n), where each h_i represents the contextual embedding of the token at position i. To convert token-level embeddings into a sentence-level embedding, the model employs a mean pooling strategy. This method computes the average of all token embeddings from the final layer to generate a fixed-size vector representation of the entire sentence. Mathematically, if h_1, ..., h_n are the token embeddings, the sentence embedding s is derived as:

s = (1/n) Σ_{i=1}^{n} h_i (1)

This pooling ensures that every token contributes equally to the final embedding while preserving semantic cohesion.
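The mean pooling of Eq. (1) reduces to a single average over the token axis. The sketch below uses random vectors in place of actual MiniLM hidden states, purely to show the operation:

```python
import numpy as np

# Stand-in for final-layer token embeddings h_1..h_n (n=4 tokens, d=3 dims).
# A real pipeline would obtain these from the MiniLM encoder.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))

# Eq. (1): the sentence embedding is the mean of the token embeddings.
s = H.mean(axis=0)
print(s.shape)  # (3,) -- fixed-size regardless of sentence length
```

In production, a masked mean (ignoring padding tokens) is typically used, but the core idea is the same.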

3.1.2. MFCC (Mel-Frequency Cepstral Coefficients).

In speech and audio processing systems, raw waveform signals must be transformed into useful representations that capture human speech patterns. The most widely used method for this purpose is Mel-Frequency Cepstral Coefficients (MFCC). MFCCs represent the power spectrum of an audio signal as compact, low-dimensional features based on the perception-motivated Mel frequency scale. MFCC extraction starts with pre-emphasis, which boosts the energy of the signal's higher frequencies. Speech signals concentrate energy at lower frequencies, which pre-emphasis compensates for in this initial processing step. The step applies a high-pass filter that subtracts a scaled previous sample from the current one [42]:

y(t) = x(t) − α · x(t−1) (2)

where α is typically around 0.95. This process improves the signal-to-noise ratio and emphasizes high-frequency components. After pre-emphasis, the signal is cut into overlapping frames, typically 20 to 40 milliseconds long. Short-time analysis works well because speech, although non-stationary, remains relatively consistent over brief periods. At a 16 kHz sampling rate, a 25-ms frame contains exactly 400 samples. To maintain continuity between adjacent frames, a 10-millisecond frame shift is used as standard practice. Each frame is then multiplied by a window function, commonly a Hamming window, to prepare it for frequency analysis; this minimizes the spectral leakage that would otherwise distort subsequent frequency analysis.
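The pre-emphasis, framing, and windowing steps can be sketched in a few lines of numpy. The frame length and hop size below follow the 25-ms / 10-ms values stated above; the input is random noise standing in for real audio:

```python
import numpy as np

SR = 16_000                      # sampling rate (Hz)
FRAME_LEN = int(0.025 * SR)      # 25 ms -> 400 samples
HOP = int(0.010 * SR)            # 10 ms frame shift -> 160 samples

def preemphasize(x, alpha=0.95):
    # Eq. (2): y[t] = x[t] - alpha * x[t-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)   # Hamming window applied per frame

x = np.random.default_rng(1).normal(size=SR)  # 1 second of toy audio
frames = frame_signal(preemphasize(x))
print(frames.shape)  # (98, 400): 98 overlapping 25-ms frames
```

Each row of `frames` is then ready for the FFT and Mel filter bank stages described next.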

The Fast Fourier Transform (FFT) converts each windowed frame from the time domain to the frequency domain, producing a spectrum that indicates how signal energy is distributed across frequencies. To match auditory perception, the spectral data pass through a Mel filter bank. A logarithm is then applied to the filter-bank energies to mimic human perception of sound intensity [43], since human loudness perception is logarithmic rather than linear. This yields a set of log Mel-spectrum values describing the energy distribution within perceptually relevant frequency bands. The log-transformed values then undergo a Discrete Cosine Transform (DCT). The DCT decorrelates the log filter-bank coefficients and compacts their energy, much as principal component analysis simplifies data. The lowest-order coefficients form the final MFCC feature vector used per frame. These coefficients describe the spectral envelope, the logarithm of the power spectrum, providing a powerful summary of spectral shape. Machine learning algorithms find MFCCs exceptionally helpful for processing audio signals because of their low-dimensional time-series form. Dynamic coefficients, the first and second derivatives of the static coefficients (delta and delta-delta), measure the dynamic characteristics of speech signals. Combining MFCCs with these time-derived features gives models access to both spectral and temporal information, improving accuracy and reliability in tasks such as emotion detection or speaker identification.

3.1.3. DenseNet121.

DenseNet-121 is a convolutional neural network from the DenseNet family distinguished by dense connections among all network layers [44]. DenseNet establishes a feed-forward design in which each layer receives the feature maps of all preceding layers. Formally, the output of the l-th layer is defined as:

x_l = H_l([x_0, x_1, ..., x_{l−1}]) (3)

where [x_0, x_1, ..., x_{l−1}] denotes the concatenation of the feature maps produced by layers 0 through l−1, and H_l is a composite transformation consisting of batch normalization, ReLU activation, and convolution. DenseNet links every layer to each preceding one through direct concatenation, whereas ResNet merges outputs through summation. This design improves gradient propagation, as shown in Fig 2, and reuses previously computed features to reduce the overall parameter count [45].
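The dense connectivity of Eq. (3) can be demonstrated with a toy numpy sketch. Here each H_l is reduced to a simple linear map rather than the full BN-ReLU-Conv block, so only the concatenation pattern is illustrated:

```python
import numpy as np

def dense_block(x0, layers):
    """Illustrative dense connectivity (Eq. 3): each layer's input is the
    concatenation of all earlier feature maps. H_l is a toy linear map
    here, not the real BN-ReLU-Conv composite."""
    feats = [x0]
    for W in layers:
        x_in = np.concatenate(feats, axis=-1)   # [x_0, ..., x_{l-1}]
        feats.append(x_in @ W)                  # H_l(...)
    return np.concatenate(feats, axis=-1)

growth = 4                      # each layer contributes `growth` channels
x0 = np.ones((2, 8))            # 2 samples, 8 input channels
rng = np.random.default_rng(2)
layers = [rng.normal(size=(8 + i * growth, growth)) for i in range(3)]
out = dense_block(x0, layers)
print(out.shape)  # (2, 20): 8 input channels + 3 layers * 4 new channels
```

The linear channel growth (8 + 3·4 = 20) is exactly why DenseNet can stay parameter-efficient: new layers add a small number of channels instead of transforming the full feature width.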

Fig 2. DenseNet121 architecture for visual feature extraction.

https://doi.org/10.1371/journal.pone.0330632.g002

p(c | x_v) = softmax(W x_v + b) (4)

This function estimates the likelihood of a given input image belonging to a specific class. For the vision modality, we utilize fully connected networks for encoding, as described below:

h_v = ReLU(W_v x_v + b_v) (5)

3.2. Heterogeneous graph construction and learning

The multimodal emotion recognition system uses a graph G = (V, E, R) to represent a dialogue of m utterances. The directed graph design captures both the temporal order of utterances and the interactive relationships among modalities. Here V is the set of nodes (utterance components), E is the set of directed source-to-destination edges, and R is the set of relation types. Each utterance is split into three modality-specific nodes: acoustic, visual, and textual. The model represents each node with a feature vector. Arranging the 3m nodes by time and modality yields the dialogue representation for m utterances [20]:

V = {v_1^a, v_1^v, v_1^t, ..., v_m^a, v_m^v, v_m^t}, |V| = 3m (6)

Intra-modal edges link only nodes belonging to the same modality. These connections are constructed with a sliding-window technique: each node connects to a fixed number of preceding and succeeding utterances of the same modality. This structure improves dialogue analysis because it captures time-based relationships within the local interaction context [46]:

E_intra = {(v_i^τ, v_j^τ) : |i − j| ≤ w, τ ∈ {a, v, t}} (7)

where w is the window size.

Inter-modal edges connect nodes of different modalities within a single utterance. With this structure, the model can combine acoustic and visual signals with textual information; for example, a bidirectional link exists between a textual node and the acoustic node of the same utterance:

E_inter = {(v_i^τ1, v_i^τ2) : τ1 ≠ τ2, τ1, τ2 ∈ {a, v, t}} (8)

The optimization of node representations together with their relations within a graph constitutes graph learning, which enhances downstream tasks such as classification or regression. Multimodal emotion recognition requires building a directed graph that captures the timing and connections among the different types of expressions, then applying a learning method to develop useful node representations that capture both local and global context. The graph includes nodes representing single-modality elements of utterances (text, audio, or visual) along with edges that link these components either by temporal order or by modality overlap. The essential aspect of graph learning is propagating information through the structure to improve the expressive ability of node features [38].
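The edge sets of Eqs. (7) and (8) can be built with straightforward index arithmetic. The sketch below is a simplified, undirected illustration (the paper's graph is directed and relation-typed), with nodes identified by (utterance index, modality) pairs:

```python
# Illustrative edge construction: intra-modal edges via a sliding
# window (Eq. 7) and inter-modal edges within each utterance (Eq. 8).
from itertools import combinations

MODALITIES = ("a", "v", "t")

def build_edges(m, window=2):
    intra, inter = [], []
    for tau in MODALITIES:
        for i in range(m):
            for j in range(i + 1, min(i + window + 1, m)):
                intra.append(((i, tau), (j, tau)))   # same modality, |i-j| <= w
    for i in range(m):
        for t1, t2 in combinations(MODALITIES, 2):
            inter.append(((i, t1), (i, t2)))         # same utterance, cross-modal
    return intra, inter

intra, inter = build_edges(m=4, window=2)
print(len(intra), len(inter))  # 15 intra-modal, 12 inter-modal edges
```

For a real model, each edge would additionally carry a relation label from R (e.g., past/future direction, modality pair) so that relation-aware layers such as RGCN can treat them differently.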

The system applies message-passing methods that allow nodes to update their representations by aggregating information from neighboring nodes. How information is weighted and combined depends heavily on the edge type and direction. Each edge type within the temporal graph carries a different level of semantic significance, and the model adapts by adjusting either edge weights or attention mechanisms. Node-level multimodal fusion happens through inter-modal edges, which encourage signals from different modalities to converge in the resulting representation. The learning process uses graph neural networks (GNNs), among which GCNs, GATs, and R-GCNs provide different implementation frameworks. Such designs extend standard deep learning techniques to graph domains by developing irregular-structure variants of convolution and attention. The network learns by repeatedly refining its node representations through rules that combine information from neighboring nodes. During backpropagation, the emotion labels assigned to each node refine the node features and reveal the importance of the connections between them. We construct two heterogeneous graph structures: (1) the Context Graph, where nodes represent utterances from the same modality (text/audio/visual) and edges reflect temporal adjacency via a sliding-window mechanism; and (2) the Cross-modal Graph, where nodes represent the tri-modal components of a single utterance and edges connect different modalities (e.g., text-to-audio). Each graph is processed by an independent GNN (RGCN and Graph Transformer), producing intermediate embeddings h_c and h_x. These are concatenated and passed to a dense layer to produce the fused utterance representation.
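The dual-graph processing and dense fusion described above can be sketched as follows. The GNN branches are reduced to a toy mean-aggregation layer (standing in for the RGCN and Graph Transformer, which are far richer), and adjacencies and weights are random:

```python
import numpy as np

def gnn_layer(H, A):
    """Toy message passing: mean-aggregate neighbor features, then ReLU.
    A stand-in for the RGCN / Graph Transformer branches, not the
    paper's actual layers."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    return np.maximum((A @ H) / deg, 0.0)

rng = np.random.default_rng(3)
n, d = 6, 8
H = rng.normal(size=(n, d))                         # shared node features
A_ctx = (rng.random((n, n)) > 0.5).astype(float)    # Context Graph adjacency
A_x = (rng.random((n, n)) > 0.5).astype(float)      # Cross-modal Graph adjacency

h_c = gnn_layer(H, A_ctx)                           # context embedding
h_x = gnn_layer(H, A_x)                             # cross-modal embedding
W = rng.normal(size=(2 * d, d))
fused = np.concatenate([h_c, h_x], axis=-1) @ W     # dense fusion layer
print(fused.shape)  # (6, 8)
```

The key structural point is that the two graphs are learned independently and only merged at the fusion layer, so context modeling and cross-modal interaction do not interfere with each other during propagation.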

Training methods in dynamic or adaptive graph learning allow graph structures to evolve based on findings made during the learning process. The model learns to derive and adjust connectivity between nodes by comparing node similarity or task-relevant correlation. This makes the network more flexible, allowing it to discover hidden relationships that traditional graph construction methods cannot see. We use auxiliary loss functions and graph-level regularization strategies to achieve smoothness, sparsity, and modularity in the learned structures. To manage differences between data types, we create three pairwise groups (text-audio, text-visual, and audio-visual), learn from each, and combine them using attention-weighted integration to create strong multimodal embeddings with little redundancy.
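The attention-weighted integration of the three pairwise embeddings can be sketched as a learned softmax weighting. The scoring vector `q` and the embeddings are random stand-ins here (in the real model both would be learned), so only the weighting mechanism is shown:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
# Hypothetical embeddings from the text-audio, text-visual, and
# audio-visual pairwise graphs.
h_ta, h_tv, h_av = (rng.normal(size=8) for _ in range(3))

# Attention scores against an assumed learned scoring vector q.
q = rng.normal(size=8)
scores = np.array([h @ q for h in (h_ta, h_tv, h_av)])
weights = softmax(scores)                       # sums to 1 across the pairs

fused = weights[0] * h_ta + weights[1] * h_tv + weights[2] * h_av
print(fused.shape)  # (8,)
```

Because the weights are normalized, a pair whose embedding aligns poorly with the scoring vector is automatically down-weighted rather than contaminating the fused representation.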

3.3. Emotion classification (DNN Classifier)

Transfer learning enabled the emotion classification stage of this method. The approach takes advantage of existing model features by stacking new specialized layers onto a pre-trained model to address the new task. The system preserves established representations while developing new capabilities for the present dataset. The transfer learning approach replaces the standard fully connected layers with ones designed for the new task, with the pre-trained backbone serving as the feature extractor. Before feeding the facial images (sized 224 × 224 × 3) into the model [47-49], preprocessing and data augmentation techniques were applied to improve generalization. As part of the proposed final architecture, illustrated in Fig 1, a series of layers was appended to the pre-trained base. These included a batch normalization layer to standardize input distributions, followed by a flattening layer that transforms multi-dimensional feature maps into a one-dimensional vector. This step is followed by two dense layers with ReLU activation, each with 512 units, designed to increase the network's learning capacity. Between these layers, an additional batch normalization layer was inserted to stabilize training and accelerate convergence. A dropout layer with a rate of 0.2, which randomly disables a fraction of neurons during training, was added to mitigate overfitting. Finally, the model ends with a dense layer of six neurons and a softmax activation function, corresponding to the six emotion classes.
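The classifier head described above can be sketched as a plain numpy forward pass. Batch normalization and dropout are omitted (they behave as near-identities at inference), and the weights are random stand-ins, so this only illustrates the layer shapes and the softmax output:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
x = rng.normal(size=(2, 64))                  # flattened fused features, batch of 2
W1 = rng.normal(size=(64, 512))               # first 512-unit dense layer
W2 = rng.normal(size=(512, 512))              # second 512-unit dense layer
W_out = rng.normal(size=(512, 6))             # output layer: six emotion classes

h = relu(relu(x @ W1) @ W2)
probs = softmax(h @ W_out)
print(probs.shape)  # (2, 6); each row is a probability distribution
```

The softmax output lets the model be trained with categorical cross-entropy, with the predicted emotion taken as the argmax over the six class probabilities.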

4. Experimental results and discussion

This section elaborates on the datasets and experimental approach, which includes detailed descriptions of the datasets used and evaluation metrics, along with comparative baseline analysis and individual component effectiveness assessment for the proposed framework.

4.1. Datasets

The proposed approach was evaluated through experiments on the IEMOCAP and MELD benchmark datasets [50]. Table 1 lists the statistics of both datasets. The IEMOCAP dataset provides approximately 12 hours of content combining video, audio recordings, facial expression data, and speech transcripts. It includes interactions between ten actors across 151 dialogues, for a total of 7,433 utterances. The conversations consist of scripted and improvised dialogue designed to elicit natural emotional reactions from participants, covering emotional states such as happiness, sadness, anger, and neutral. The MELD dataset is a recognized benchmark for evaluating multimodal emotion recognition. It consists of more than 1,400 dialogues and about 13,000 utterances derived from the TV show Friends. The dataset contains multi-speaker conversations whose utterances are classified into seven emotional categories: anger, disgust, fear, joy, neutral, sadness, and surprise [51].

Table 1. Sample distribution of datasets used for experiments.

https://doi.org/10.1371/journal.pone.0330632.t001

4.2. Experimental settings

The experiments were run on an NVIDIA GeForce RTX 4090 GPU under Ubuntu. The implementation used the PyTorch 2.2.1 deep learning framework with Python 3.11.8. Adam was used as the optimizer, and the number of GNN heads was set to 5. The IEMOCAP experiments used a learning rate of 0.00008, a dropout rate of 0.3, a regularization parameter of 0.00001, and a batch size of 32. The MELD training kept a batch size of 32, with the learning rate set to 0.0004 and the regularization term set to 0.000005. Performance on both datasets was evaluated with accuracy and the weighted-average F1-score.
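For quick reference, the reported hyperparameters can be summarized in a small configuration fragment. The values restate those above; the key names are our own convention, not the authors' code.

```python
# Restatement of the training settings reported in Sec. 4.2
# (key names are illustrative, not from the original implementation).
TRAIN_CONFIG = {
    "IEMOCAP": {"optimizer": "Adam", "lr": 8e-5, "dropout": 0.3,
                "weight_decay": 1e-5, "batch_size": 32, "gnn_heads": 5},
    "MELD":    {"optimizer": "Adam", "lr": 4e-4,
                "weight_decay": 5e-6, "batch_size": 32, "gnn_heads": 5},
}
```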

4.3. Recent models for comparison

We tested the performance and robustness of our HGLER model against multiple recent state-of-the-art multimodal emotion recognition models: MAG-BERT, MISA, Self-MM, MELIN, EMOFormer, MulT, MMIM, STAGE, ModalityDrop, MM-Refiner, and MTAG. The multimodal adaptation gate in MAG-BERT extends BERT by automatically adjusting how much weight each modality receives, which helps it cope with noisy or missing information. MISA (Modality-Invariant and -Specific representations) splits its learning into two parts: one learns representations specific to each modality, while the other learns common features shared across modalities, allowing gradual multimodal processing. Self-MM applies self-supervised learning to align modality features, which improves robustness to distributional shifts.

MELIN uses memory-boosted hierarchical networks to model long-term connections between different types of content and conversation patterns [52]. EMOFormer uses transformers to attain high contextual understanding while supporting both temporal and cross-modal attention for emotion detection [53]. MulT uses cross-modal transformers to model interaction patterns between modalities, which improves both understanding and accuracy. MMIM uses a multimodal interaction memory to capture important temporal relationships between information streams [37]. STAGE (Selective Transfer and Aggregation Graph Emotion model) automatically selects emotional signals from the different modalities and aggregates them through attention-based graph reasoning. ModalityDrop trains the model to perform better by removing modality information during training in a way that parallels real-life data loss conditions [54]. MM-Refiner achieves higher performance by integrating information from multiple data sources, recovering missed information over successive refinement rounds. MTAG (Modal-Temporal Attention Graph) frames emotion recognition as a multi-task graph learning problem that optimizes the identification of shared emotional indicators across tasks and dialogues [55].

4.4. Comparative results of recent models and proposed HGLER

Table 2 presents a detailed comparison of the proposed HGLER model with several state-of-the-art methods on the IEMOCAP and MELD datasets using four evaluation metrics: precision, recall, F1-score, and accuracy. The results clearly demonstrate the superior performance of HGLER over all compared models. On IEMOCAP, HGLER performed best, with 96.25% precision, 95.39% recall, a 96.36% F1-score, and 96.28% accuracy. These results surpass established models such as ModalityDrop (F1-score 83.21%, accuracy 71.69%) and MM-Refiner (F1-score 73.08%, accuracy 80.30%). The strong IEMOCAP performance demonstrates HGLER's ability to detect emotional expressions throughout multimodal conversations.

Table 2. Performance Comparison of proposed HGLER with recent models on IEMOCAP and MELD.

https://doi.org/10.1371/journal.pone.0330632.t002

On the MELD dataset, HGLER achieved 94.56% precision, 97.38% recall, a 96.82% F1-score, and 93.68% accuracy. It outperformed MM-Refiner and STAGE, which produced F1-scores of 78.07% and 77.89% and accuracies of 77.90% and 78.10%, respectively. HGLER shows an exceptional ability to detect the varied emotion-related cues in MELD, which yields its superior recall. These findings demonstrate the effectiveness of HGLER for emotion recognition across multimodal datasets: it outperforms multiple sophisticated models based on transformer fusion, graph modeling, and modality interaction, demonstrating its capability in complex dialogue emotion recognition.
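The weighted-average F1-score used throughout these comparisons can be computed directly from a confusion matrix. The following is a self-contained sketch of the metric, not the authors' evaluation code:

```python
def weighted_f1(confusion):
    """Weighted-average F1 from a confusion matrix (rows = true, cols = predicted).

    Each class's F1 is weighted by its support (number of true samples),
    which is how the weighted-average F1 reported in the tables is defined.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    score = 0.0
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += sum(confusion[c]) / total * f1   # weight by class support
    return score

# Toy 2-class example: class 0 has F1 = 0.8, class 1 has F1 = 6/7.
toy = weighted_f1([[2, 1], [0, 3]])
```

With equal supports (3 samples per class) the result is simply the mean of the two per-class F1 scores.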

4.5. Performance comparison of HGLER against different settings

Table 3 reports the results of HGLER under different modality combinations on the IEMOCAP and MELD datasets. Text-only input outperformed the audio-only and video-only settings on both IEMOCAP and MELD, achieving F1-scores of 0.71 and 0.65, respectively. Unimodal audio achieved better results than unimodal video, because voice cues carry more emotional content than visual cues on their own. Combining text with audio was more effective for emotion recognition than the audio-video and video-text combinations on both datasets. The largest improvement came from using all modalities together: in the trimodal configuration, the model reached a 0.96 F1-score with 96.2% accuracy on IEMOCAP and a 0.96 F1-score with 93.6% accuracy on MELD. This improvement reflects the complementary information delivered by the different input modalities, which enables better identification of emotional context. HGLER produces its most precise results when textual data is integrated with all available modalities during processing.

Table 3. Performance Comparison of proposed HGLER in different modalities on IEMOCAP and MELD.

https://doi.org/10.1371/journal.pone.0330632.t003

4.6. Ablation experiments

Table 4 evaluates the contribution of the RGCN, Graph Transformer, and SAGE modules in HGLER on the IEMOCAP and MELD datasets. The results show that both the individual modules and their combination contribute to the framework's final output. Used on its own (Row 1), the Graph Transformer obtained F1-scores of 0.74 and 0.61 on IEMOCAP and MELD, respectively. SAGE alone performed better, reaching F1-scores of 0.79 on IEMOCAP and 0.70 on MELD, which indicates a stronger grasp of graph-based semantics. Row 3 shows the performance when RGCN is removed while the Graph Transformer and the other modules are retained. When RGCN, Graph Transformer, and SAGE were used together (Row 4), the model achieved F1-scores of 0.96, with accuracies of 96.2% (IEMOCAP) and 93.6% (MELD). Combining these complementary components yields a marked performance improvement and produces richer, more detailed graph representations. These results show that jointly applying the multi-graph learning modules in HGLER is key to detecting emotions across different types of data.

Table 4. Impact of different components of graph modules on the performance of proposed HGLER on IEMOCAP and MELD.

https://doi.org/10.1371/journal.pone.0330632.t004

Fig 3 illustrates how HGLER's performance varies with the values of M and N on the IEMOCAP and MELD datasets. On IEMOCAP, setting M and N to 8 achieves the maximum accuracy of 93.65% and the maximum F1-score of 91.15%. The model recognizes emotions considerably better when 8 contextual nodes are available before and after a given utterance. Performance begins to decrease beyond this level because extended contextual data increases both noise and redundancy. The right plot, for the MELD dataset, shows the same behavior, peaking at M = N = 8 with 89.10% accuracy and a 92.40% F1-score. On MELD, the F1-score drops quickly past the optimal setting while accuracy stays stable, suggesting MELD is more sensitive to the quantity of contextual information and requires a particular window size to achieve optimal results. These findings confirm that tuning the numbers of past and future nodes is a critical factor for optimal performance in dialogue-graph emotion recognition.
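The past-M/future-N context window can be sketched as a simple edge-construction routine over a dialogue of T utterances. This is an illustrative sketch of the windowing idea, not the authors' graph-building code:

```python
def context_edges(num_utterances, M, N):
    """Directed edges linking each utterance to its past-M and future-N context.

    Returns (source, target) pairs where each context utterance j points to
    the target utterance i it helps classify. With M = N = 8 (the optimal
    setting reported for both datasets), each utterance receives edges from
    up to 16 neighbouring nodes.
    """
    edges = []
    for i in range(num_utterances):
        lo, hi = max(0, i - M), min(num_utterances, i + N + 1)
        for j in range(lo, hi):
            if j != i:
                edges.append((j, i))   # context node j -> target utterance i
    return edges

# Toy dialogue of 3 utterances with a 1-past / 1-future window.
toy_edges = context_edges(3, M=1, N=1)
```

Widening M and N adds more distant utterances as neighbors, which is exactly where the noise and redundancy discussed above come from.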

Fig 3. The impact of the past M and future N nodes on the performance of the HGLER model (Left) IEMOCAP (Right) MELD.

https://doi.org/10.1371/journal.pone.0330632.g003

HGLER represents an innovative solution to emotion detection across multiple data sources through a structure that unites hierarchical graph learning with context-aware representation. The model performs well on standard benchmarks, including IEMOCAP and MELD, through its ability to unify audio, video, and text data with the RGCN, Graph Transformer, and SAGE techniques. The model is particularly good at understanding how different parts of a conversation relate to each other and to the overall context, which yields high accuracy and F1-scores, especially at the optimal numbers of past and future nodes. However, even with these strengths, HGLER has some drawbacks, such as higher computational complexity from its multiple graph modules and sensitivity to hyperparameter settings like the number of context nodes. Additionally, the model may struggle with noisy or incomplete modality data in real-world scenarios, which can affect its generalizability outside controlled datasets.

4.7. Performance evaluation of proposed HGLER model

Fig 4 shows how the HGLER model performed on the IEMOCAP dataset, including its classification accuracy and loss over 50 training epochs. In the classification accuracy plot (left), training accuracy increases rapidly in the first few epochs and comes quite close to 1.0, showing that the model learned the important patterns in the training data. The validation accuracy grows more slowly and levels off around 0.8. The separation between the training and validation accuracy curves indicates overfitting, since the model performed much better on training data than on new validation data. This gap means that although the model captures complex relations in the training set, it generalizes less well to unseen data. In the classification loss plot (right), the training loss drops sharply during the early stages of training, indicating effective optimization and error minimization, and then continues to decline smoothly toward a minimum, a sign that the model converged strongly on the training set. The validation loss is more erratic: it initially follows the downward trend of the training loss but soon starts fluctuating, ending at a much higher value than the training loss. This variation in validation loss, together with the accuracy gap, is further evidence of overfitting; the model appears to have memorized patterns in the training data instead of learning generic patterns applicable to new observations. Overall, HGLER fits the training data well but has a somewhat limited ability to generalize to new data. Regularization methods such as dropout, weight decay, or early stopping could lessen the overfitting, and strategies such as cross-validation or data augmentation may add robustness and make performance more consistent across a variety of samples.

Fig 5 shows the performance and loss of the HGLER model on the MELD dataset after 50 epochs of training. In the classification accuracy plot (left), training accuracy increases very rapidly over the first few epochs and approaches 1.0, meaning the model learned the training data very well with strong optimization. The validation accuracy grows gradually, stabilizes around 0.85, and then declines slightly in the later epochs. This late decline indicates possible overfitting, in which the model begins memorizing the training data rather than learning general patterns, weakening its performance on unseen data. The classification loss plot (right) follows the same trend: training loss falls sharply in the early epochs and then decreases to near zero as the model tunes its parameters. The validation loss also declines sharply at first, with pronounced oscillations around the middle epochs. Somewhat surprisingly, the validation loss continues to decrease gradually and dips below the training loss at around 20 epochs. This behavior suggests the model extracts meaningful features as training proceeds, while the mid-range instability may stem from sensitivity to particular samples or batch inconsistencies. Overall, these results suggest that HGLER captures the features of the MELD dataset during training, but the decline in validation accuracy in the latter stages and the fluctuations in loss point to possible overfitting or unequal learning across the different emotion categories. Addressing these issues may involve regularization methods, early stopping, and additional fine-tuning to produce more stable and generalized performance on unseen data.

To address overfitting, dropout (rate = 0.2) and weight decay were incorporated during model training. Despite these efforts, some performance gap between training and validation remains, which we aim to reduce in future work through techniques such as early stopping, data augmentation, and more robust cross-validation.
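Early stopping, one of the mitigations mentioned above, can be sketched as a small patience-based monitor on the validation loss. This is a generic illustration, not the authors' training loop; the patience and delta values are hypothetical.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Toy validation-loss trace: improves for two epochs, then plateaus.
monitor = EarlyStopping(patience=3)
decisions = [monitor.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
```

Called once per epoch, the monitor trips after three consecutive epochs without improvement, which is when the erratic validation-loss behavior described above would halt training.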

The confusion matrix of the proposed HGLER model on the IEMOCAP dataset is given in Fig 6, comparing predicted and actual emotions across the six categories listed in Table 1: happy, sad, neutral, angry, excited, and frustrated. The model predicts highly accurately, with large counts along the diagonal, especially for neutral (275), excited (260), and sad (215), indicating good recognition of these emotions. Misclassifications are minimal and mostly occur between semantically similar emotions, such as frustrated occasionally being mistaken for neutral or angry. The HGLER model shows strong emotion recognition and generalization on this benchmark dataset, achieving 96.23% overall accuracy.

Experiments on the IEMOCAP and MELD datasets also revealed some weaknesses of the HGLER model. Although the model showed impressive learning on the training set, reaching almost perfect accuracy, the gap between training and validation performance reflects vulnerability to overfitting. This gap is particularly prominent on IEMOCAP, suggesting the model may have memorized details of the training data rather than learning broader features that transfer to new, unseen examples. The fluctuations observed in the validation loss curves for both datasets indicate that the optimization process may not be fully stable, possibly due to variation across data batches or differing sensitivity to the emotional categories. These oscillations imply that the learning process is not entirely smooth, which can compromise generalization. In addition, the slight deterioration in validation accuracy in the final epochs on MELD raises questions about consistently capturing complex emotional transitions. To resolve these problems, stronger regularization techniques such as dropout or weight decay will play a significant role, together with data augmentation for better generalization. Future work may also examine more advanced optimization techniques and cross-validation procedures to obtain more stable and generalized learning across emotional expressions.

5. Conclusion and future work

In this paper, we introduced a new system called HGLER for recognizing emotions from multiple types of information, built to effectively understand emotional signals in conversations. The model incorporates both local and global dependency learning mechanisms to explain utterance-level emotions across modalities. In particular, HGLER is designed to exploit hierarchical graph learning to capture contextual relationships without losing intermodal feature interactions. This design allows the model to extract richer representations of utterances, thereby improving classification performance. Careful evaluation on two standard datasets, IEMOCAP and MELD, demonstrated that HGLER is effective, providing higher accuracy and faster convergence than traditional approaches. Moreover, a close look at how the model learns showed that it adapts well to complicated conversations, although small fluctuations in validation loss indicate areas that could still be improved. In future work, we intend to explore advanced regularization techniques and data augmentation strategies to further enhance model generalization and stability. Additionally, incorporating external knowledge sources, such as semantic graphs and affective lexicons, could improve the model's ability to distinguish subtle emotional expressions, addressing the challenges of overlapping emotion categories and enhancing real-world applicability.

References

  1. Akram A. An empirical study of AI-generated text detection tools. Adv Machine Learn Artif Intell. 2023;4(2):44–55.
  2. Akram A, Khan I, Rashid J, Saddique M, Idrees M, Ghadi YY, et al. Enhanced steganalysis for color images using curvelet features and support vector machine. CMC. 2024;78(1):1311–28.
  3. Qin L, Che W, Li Y, Ni M, Liu T. DCR-Net: a deep co-interactive relation network for joint dialog act recognition and sentiment classification. AAAI. 2020;34(05):8665–72.
  4. Song Z, Zheng X, Liu L, Xu M, Huang X. Generating responses with a specific emotion in dialog. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics; 2019.
  5. Zhou H, Huang M, Zhang T, Zhu X, Liu B. Emotional chatting machine: emotional conversation generation with internal and external memory. AAAI. 2018;32(1).
  6. Sordoni A, Bengio Y, Vahabi H, Lioma C, Grue Simonsen J, Nie JY. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. New York, NY, USA: Association for Computing Machinery; 2015: 553–62.
  7. Majumder N, Poria S, Hazarika D, Mihalcea R, Gelbukh A, Cambria E. DialogueRNN: an attentive RNN for emotion detection in conversations. AAAI. 2019;33(01):6818–25.
  8. Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A. DialogueGCN: a graph convolutional neural network for emotion recognition in conversation. 2019.
  9. Geetha AV, Mala T, Priyanka D, Uma E. Multimodal emotion recognition with deep learning: advancements, challenges, and future directions. Information Fusion. 2024;105:102218.
  10. Odusami M, Maskeliūnas R, Damaševičius R, Misra S. Machine learning with multimodal neuroimaging data to classify stages of Alzheimer’s disease: a systematic review and meta-analysis. Cogn Neurodyn. 2024;18(3):775–94. pmid:38826669
  11. Murthy JS, Siddesh GM. Multimedia video analytics using deep hybrid fusion algorithm. Multimed Tools Appl. 2024;84(14):14167–85.
  12. Xu F, Sun T, Zhou W, Yu Z, Lu J, Du Q. Multi-level causal reasoning for emotion recognition in conversations. In: 2024 IEEE Smart World Congress (SWC). 2024: 1120–6.
  13. Qian Y, Su L, Xu M, Tang L. Enhanced protein secondary structure prediction through multi-view multi-feature evolutionary deep fusion method. IEEE Trans Emerg Top Comput Intell. 2025:1–12.
  14. Shen X, Huang X, Zou S, Gan X. Multimodal knowledge-enhanced interactive network with mixed contrastive learning for emotion recognition in conversation. Neurocomputing. 2024;582:127550.
  15. Akram A, Rashid J, Jaffar MA, Faheem M, Amin RU. Segmentation and classification of skin lesions using hybrid deep learning method in the Internet of Medical Things. Skin Res Technol. 2023;29(11):e13524. pmid:38009016
  16. Khemani B, Patil S, Kotecha K, Tanwar S. A review of graph neural networks: concepts, architectures, techniques, challenges, datasets, applications, and future directions. J Big Data. 2024;11(1).
  17. Lu N, Han Z, Han M, Qian J. Bi-stream graph learning based multimodal fusion for emotion recognition in conversation. Information Fusion. 2024;106:102272.
  18. Shen S, Liu F, Wang H, Zhou A. Towards speaker-unknown emotion recognition in conversation via progressive contrastive deep supervision. IEEE Trans Affective Comput. 2025:1–13.
  19. Đurkić T, Simić N, Suzić S, Bajović D, Perić Z, Delić V. Multimodal emotion recognition using compressed graph neural networks. In: Karpov A, Delić V, eds. Speech and Computer. Cham: Springer Nature Switzerland; 2025: 109–21.
  20. Peng J, Tang H, Zheng W. Hierarchical heterogeneous graph network based multimodal emotion recognition in conversation. Multimedia Syst. 2025;31(2).
  21. Fu C, Qian F, Su K, Su Y, Wang Z, Shi J, et al. HiMul-LGG: a hierarchical decision fusion-based local-global graph neural network for multimodal emotion recognition in conversation. Neural Netw. 2025;181:106764. pmid:39368277
  22. Jiao W, Yang H, King I, Lyu MR. HiGRU: hierarchical gated recurrent units for utterance-level emotion recognition. 2019.
  23. Ghosal D, Majumder N, Gelbukh A, Mihalcea R, Poria S. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. 2020.
  24. Gao Y, Xu C. Residual learning with bi-LSTM and multi-head attention for multi-modal emotion recognition. In: 2023 IEEE International Conference on Image Processing and Computer Applications (ICIPCA). 2023: 409–13.
  25. Hazarika D, Poria S, Zadeh A, Cambria E, Morency L-P, Zimmermann R. Conversational memory network for emotion recognition in dyadic dialogue videos. Proc Conf. 2018;2018:2122–32. pmid:32219222
  26. Hazarika D, Poria S, Mihalcea R, Cambria E, Zimmermann R. ICON: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; 2018: 2594–604.
  27. Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A. Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion. 2023;91:424–44.
  28. Hu D, Wei L, Huai X. DialogueCRN: contextual reasoning networks for emotion recognition in conversations. 2021.
  29. Hu D, Hou X, Wei L, Jiang L, Mo Y. MM-DFN: multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022: 7037–41.
  30. Wen J, Jiang D, Tu G, Liu C, Cambria E. Dynamic interactive multiview memory network for emotion recognition in conversation. Information Fusion. 2023;91:123–33.
  31. Guo Q, Luo Y, Ou Y, Liu M, Liu J-G. Identification of perceptive users based on the graph convolutional network. Expert Syst Applications. 2025;267:125844.
  32. Cao Z, Luo C. Link prediction for knowledge graphs based on extended relational graph attention networks. Expert Syst Applications. 2025;259:125260.
  33. Jin X, Zhu F, Shen Y, Jeon G, Camacho D. Data-driven dynamic graph convolution transformer network model for EEG emotion recognition under IoMT environment. Big Data Min Anal. 2025;8(3):712–25.
  34. Khemani B, Patil S, Malave S, Gupta J. Improved graph convolutional network for emotion analysis in social media text. MethodsX. 2025;14:103325. pmid:40491513
  35. Zhang C, Liu Y, Cheng B. A MoE multimodal graph attention network framework for multimodal emotion recognition. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2025: 1–5.
  36. Dai Y, Li J, Li Y, Lu G. Multi-modal graph context extraction and consensus-aware learning for emotion recognition in conversation. Knowledge Based Syst. 2024;298:111954.
  37. Lian Z, Liu B, Tao J. SMIN: semi-supervised multi-modal interaction network for conversational emotion recognition. IEEE Trans Affective Comput. 2023;14(3):2415–29.
  38. Ding Y, Tong C, Zhang S, Jiang M, Li Y, Lim KJ, et al. EmT: a novel transformer for generalized cross-subject EEG emotion recognition. IEEE Trans Neural Netw Learn Syst. 2025;36(6):10381–93. pmid:40208757
  39. Jiang X, An B, Zhao G, Qian X. M3Net: efficient time-frequency integration network with mirror attention for audio classification on edge. AAAI. 2025;39(17):17644–52.
  40. Park Y, Shin Y. Adaptive bi-encoder model selection and ensemble for text classification. Mathematics. 2024;12(19):3090.
  41. Makhmudov F, Kultimuratov A, Cho Y-I. Enhancing multimodal emotion recognition through attention mechanisms in BERT and CNN architectures. Appl Sci. 2024;14(10):4199.
  42. Raju AS, Kumari VS. Mel frequency cepstral coefficients based speech emotion recognition using decision tree algorithm in comparison with support vector machine classifier for better accuracy. In: 2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies. 2024: 1–5.
  43. Sindhu R, Arunachalam V. Design of an approximate Radix-2 FFT butterfly unit for LSTM-speech signal-based Parkinson’s disease classifier. Circuits Syst Signal Process. 2025;44(6):4258–78.
  44. Muthukkumar R. Enhancing the identification of autism spectrum disorder in facial expressions using DenseResNet-based transfer learning approach. Biomed Signal Process Control. 2025;103:107433.
  45. Chowdhury JH, Ramanna S, Kotecha K. Speech emotion recognition with light weight deep neural ensemble model using hand crafted features. Sci Rep. 2025;15(1):11824. pmid:40195486
  46. Fu F, Ai W, Yang F, Shou Y, Meng T, Li K. SDR-GNN: spectral domain reconstruction graph neural network for incomplete multimodal learning in conversational emotion recognition. Knowledge Based Syst. 2025;309:112825.
  47. Akram A, Rashid J, Jaffar A, Hajjej F, Iqbal W, Sarwar N. Weber law based approach for multi-class image forgery detection. CMC. 2024;78(1):145–66.
  48. Akram A, Ramzan S, Rasool A, Jaffar A, Furqan U, Javed W. Image splicing detection using discriminative robust local binary pattern and support vector machine. WJE. 2022;19(4):459–66.
  49. Akram A, Jaffar MA, Rashid J, Boulaaras SM, Faheem M. CMV2U-Net: a U-shaped network with edge-weighted features for detecting and localizing image splicing. J Forensic Sci. 2025;70(3):1026–43. pmid:40177991
  50. Aurelio C, Chowanda A. Using CNN and transformer model for unimodal speech emotion recognition on MELD and IEMOCAP. In: 2025 International Conference on Advancement in Data Science, E-learning and Information System (ICADEIS). 2025: 1–7.
  51. Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R. MELD: a multimodal multi-party dataset for emotion recognition in conversations. 2019.
  52. Melin P, Sánchez D, Castillo O. Fuzzy dynamic parameter adaptation for gray wolf optimization of modular granular neural networks applied to human recognition using the iris biometric measure. Handbook on computer learning and intelligence. World Scientific; 2021: 947–72.
  53. Hasan R, Nigar M, Mamun N, Paul S. EmoFormer: a text-independent speech emotion recognition using a hybrid transformer-CNN model. 2025.
  54. Li B, Fei H, Liao L, Zhao Y, Teng C, Chua T-S, et al. Revisiting disentanglement and fusion on modality and context in conversational multimodal emotion recognition. In: Proceedings of the 31st ACM International Conference on Multimedia. 2023: 5923–34.
  55. Yang J, Wang Y, Yi R, Zhu Y, Rehman A, Zadeh A. MTAG: modal-temporal attention graph for unaligned human multimodal language sequences. 2021.