Abstract
In emotion inference, a common problem is the lack of commonsense knowledge, particularly in dialogue, where traditional research has failed to effectively extract structural features, resulting in lower inference accuracy. To address this, this paper proposes a dialogue emotion inference model based on commonsense enhancement and graph modeling (CEICG). The model integrates external commonsense with graph-model techniques by dynamically constructing nodes and defining diverse edge relations to simulate the evolution of a dialogue, thereby effectively capturing its structural and semantic features. It employs two methods to incorporate external commonsense into the graph model, overcoming previous models’ limited understanding of complex dialogue structures and their lack of external knowledge. This strategy of integrating external commonsense significantly enhances the model’s emotion inference capabilities and improves the understanding of emotions in dialogue. Experimental results demonstrate that CEICG outperforms six existing baseline models on emotion inference tasks across three datasets.
Citation: Zhang Y, Xu K, Xie C, Gao Z (2024) Emotion inference in conversations based on commonsense enhancement and graph structures. PLoS ONE 19(12): e0315039. https://doi.org/10.1371/journal.pone.0315039
Editor: Ying Shen, Tongji University, CHINA
Received: August 27, 2024; Accepted: November 19, 2024; Published: December 11, 2024
Copyright: © 2024 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Emotion is an individual’s attitude and experience toward objective events during psychological processes. Humans are easily influenced by emotions, leading to a series of subtle decisions that shape personal habits, career planning, and social relationships. Emotion analysis can utilize various features, such as text features, facial features [1], and behavioral features [2], to understand people’s emotional characteristics, which hold significant importance across various fields [3]. Providing robots with emotional skills has been a key focus in this area, with one important aspect being the ability of robots to actively recognize emotions. On the other hand, emotion analysis enables robots to adapt to users’ emotional states and provide more appropriate emotional responses, enhancing user experience in fields such as medical services [4], counseling services [5], and educational services [6]. Moreover, with societal development, human communication has shifted from traditional audiovisual methods to text-based methods, such as online customer support systems. This shift presents challenges in obtaining emotional analysis data that was previously reliant on audiovisual cues. Therefore, analyzing emotions in dialogue text in the absence of audiovisual data contributes to expanding the research domain of emotion analysis.
Traditional emotion analysis methods are primarily based on statistical models and machine learning techniques [7–10]. However, these methods often only support binary classification, exhibit poor performance, and have weak generalization capabilities, as they mainly rely on objective patterns in the data to predict emotions. This reliance results in the need for extensive computation to build high-accuracy models using traditional emotion analysis methods, making it even more challenging for large-scale datasets. Combining deep learning with dialogue emotion analysis is a promising approach, as it enhances the model’s generalization ability while simulating human behavior to capture emotional features in vast amounts of data. It can also learn features that are often overlooked but highly relevant to emotion analysis results [11]. This approach aims to enhance machines’ understanding of human emotions and provide robust monitoring of various negative emotions in society.
Emotion prediction is a popular topic. Compared to traditional methods that only identify the emotions of specific sentences, the main goal of emotion prediction is to forecast the user’s emotional state at the next moment. Due to the dependency on emotional statements and the diversity of emotions, recognizing emotions during a conversation becomes increasingly challenging, as emotional changes correspond to variations in dialogue expressions. To accurately differentiate among several emotional categories within discourse, it is essential to consider the contextual features of past information. While this may be simple for humans, it poses a significant challenge for machines. Humans can grasp the implied meanings in language and context, such as sarcasm, humor, or puns. Different cultures have varying norms and customs for expressing and understanding emotions, and humans are able to accurately interpret emotions based on this background knowledge, which largely relies on sociocultural experience and contextual knowledge. Furthermore, previous research models typically handle individual sentences rather than entire conversations [12, 13]. Consequently, they are limited in their ability to understand linguistic relevance across multiple dialogue turns and may fail to capture subtle linguistic cues and contextual changes in ongoing conversations [14, 15]. To address this, we introduce external common knowledge through an external knowledge base as auxiliary information to enhance the representational performance of discourse. Additionally, to better capture the relationships between dialogues, we design a graph model to represent conversations, aiming to improve the understanding of dialogue features and enrich the variables extracted. The primary research contributions of this paper are as follows:
- (1) The CEICG model based on external knowledge was proposed. This model draws on the principles of deep learning, with sentences in the dialogue system being initially encoded and emotional features fine-tuned using RoBERTa. Feature extraction was subsequently performed using LSTM. A graph model was then constructed based on sentence nodes to capture the structural features within the dialogue. Furthermore, two methods were employed to extract knowledge from the sentences, which was integrated as auxiliary information into the graph model. This approach of incorporating external common knowledge significantly enhanced the model’s emotional reasoning capabilities. Finally, Graph Convolutional Networks (GCN) were utilized for node updates and feature extraction within the graph model, providing a basis for emotional reasoning.
- (2) Experiments were conducted across three distinct datasets, with the results being compared against existing baseline methods. The experimental findings indicate that, in comparison to traditional methods, the proposed model demonstrates significant advantages, validating its advanced and practical efficacy in addressing dialogue emotion reasoning tasks.
2 Related work
Emotions are regarded as patterns related to the environment, encompassing a series of interconnected events, including environmental stimuli, psychological changes, self-perception, and behavioral impulses. It has been demonstrated that emotions serve as a means to improve human social relationships and enhance our adaptability to the ever-changing environment [16]. Researchers have categorized emotions based on various emotional theories. Among these, the most well-known classification was proposed by Ekman [17], who identified six basic emotions: surprise, sadness, fear, joy, disgust, and anger, which are viewed as core patterns of emotional responses. Subsequent studies have expanded this classification to include eight categories [18], fifteen categories [19], and so on.
Emotion prediction has long been a popular topic. Hasegawa et al. [10] predicted the emotions of recipients in online dialogues using statistical methods within the context of two-turn dialogues. Zhang et al. [20] introduced a novel method for interactive emotional learning, designing a Transformer-based variational learning network to learn the response distribution between dialogues to predict future emotions. Soujanya et al. [21] explored spatial and temporal attention, as well as the parallel/sequential arrangement of spatial and temporal attention modules, to enhance emotion prediction performance by integrating information about emotions and emotional changes. Li et al. [22] proposed an interactive emotional model that summarizes and utilizes the characteristics of dialogue texts to effectively extract the historical emotional features and emotional change features of dialogue users in an interactive manner, predicting the speaker’s future emotions based on previous utterances. Radhika et al. [23] noted that emotional labels themselves might convey information, as these labels can guide the model’s attention distribution upon input. The emotions triggered by an event are often interrelated and exhibit a certain homogeneity. Sun et al. [24] combined reinforcement learning and emotional editing mechanisms for emotion prediction during response tasks, resulting in replies that are both logically and emotionally relevant. Liu et al. [25] proposed a simple and effective dialogue emotion prediction model based on relationship extraction, utilizing self and other dialogues to extract and incorporate relevant emotional dependencies, predicting their emotions without the emotional context of the current utterance, and measuring the similarity between the emotional distribution and the emotion prediction distribution using KL divergence. Enas et al. [26] modeled the three inherent dimensional relationships that evoke emotions in dialogues, merging them into two deep neural network architectures: one being a Graph Convolutional Network model and the other a sequence network model, aimed at capturing the network formation and discourse sequence features within dialogues. Gopendra et al. [27] proposed an end-to-end model that directly models category labels using label embedding techniques. This model can handle text, audio, and visual features concurrently, integrating various modalities through a cross-attention network to identify emotions and intentions in multimodal environments.
The objective of emotion reasoning tasks is primarily to infer how events evoke the emotions of characters in stories. In dialogue systems, researchers have redefined this task. Li et al. [28] redefined emotional reasoning within the dialogue process, where the main goal is to predict how utterances influence the listener’s emotions without knowing any responses from the listener, specifically predicting their emotional state in the next turn. However, the challenge lies in the unknown future dialogue content, which limits the ability to directly infer future emotional states. Particularly, understanding how emotions propagate among participants in dialogues is key to achieving accurate emotional reasoning. By accurately understanding and predicting the emotional changes of dialogue listeners, dialogue systems can communicate more naturally and effectively with users, thereby playing a crucial role in providing customer support, mental health counseling, and personalized recommendations. This enhancement in emotional perception capabilities lays the foundation for constructing a more human-like human-computer interaction experience. In existing research on emotional reasoning, with the development of deep learning, Li et al. [28] modeled the propagation of emotional states between participants by deeply analyzing dialogue history. A deep learning model based on address-aware modules and ensemble strategies was proposed, which automatically identifies whether participants maintain their previous emotional states or are influenced by other participants in the dialogue, thus exhibiting corresponding emotional responses in the next round of dialogue. Wang et al. [29] introduced an innovative global and local modeling technique that combines the ability of RNNs to process sequential data with the powerful capabilities of pre-trained language models (PLMs) for deep language understanding. 
Through this combination, the model captures dialogue dynamics while extracting and utilizing knowledge from extensive contexts. Specifically, the entire dialogue history is used as input to the PLM, employing a contextual learning mechanism to generate knowledge regarding the dialogue context, thereby providing a new approach for emotional reasoning. Narayana et al. [30] inferred short-term emotional states by focusing on long-term emotional influences and their changes, rather than relying on traditional contextual cues such as background scenes, locations, or social actors. This approach shifts away from the previous unidimensional methodology by integrating video information for multimodal emotional reasoning.
External knowledge bases refer to collections of knowledge relied upon by a system or program that exist outside the system itself, encompassing relevant definitions, concepts, examples, rules, experiences, and other information. In the fields of artificial intelligence and natural language processing, external knowledge bases are frequently used to supplement model training data, thereby enhancing the model’s understanding and reasoning capabilities. Numerous common-sense knowledge bases exist in current research, including Event2Mind [31] and ATOMIC [32], which contain over 877k instances of everyday common-sense knowledge organized into variable-type if-then relationships with strict logical connections. ConceptNet [33] is a semantic network that includes concept-level relationship common sense, aiding machines in understanding the meanings of words. SenticNet [34] is an emotional lexicon widely used in sentiment analysis tasks. COMET [35] is a generative model trained on ConceptNet and ATOMIC, which generates richer and more diverse common-sense knowledge beyond what is present in the original knowledge bases. To address the issue of machines being unable to rely on context and common-sense knowledge like humans, Zhou et al. [13] employed a method of retrieving common-sense knowledge relevant to user sentences and structuring each knowledge graph as a separate entity for encoding. This process strengthens the semantic meaning of sentences through the use of a static graph attention mechanism, thereby obtaining responses rich in information. Zhong et al. [36] developed a knowledge-enhanced transformer model that adopts a hierarchical self-attention mechanism to analyze contextual utterances and dynamically introduces external common sense through context-aware emotional graph attention mechanisms, significantly improving the accuracy of emotion recognition. Li et al. [37] proposed a knowledge integration strategy for integrating common-sense knowledge related to dialogues generated by event-based knowledge graphs. Ghosal et al. [38] developed a novel framework named COSMIC that integrates various common-sense features, including mental states, event themes, and their causal relationships, using this information to study the interactions between participants in dialogues.
Currently, most research on dialogue systems focuses on classifying emotional polarity as positive, negative, or neutral at the utterance level for both dialogue participants [10, 22], while emotional reasoning that draws on external commonsense remains largely unexplored.
3 Methodology
3.1 Definition of the problem
Let U = {μ1, μ2, …, μT} be a set of dialogues, where T is the length of the dialogue and μt represents the utterance spoken by the speaker at time t, with each utterance consisting of several words. The task of emotion inference is to deduce the emotional reaction of the recipient of the last utterance μT in the dialogue set, without knowing any information after μT, by utilizing the utterance information of the entire dialogue. Formally, it is defined as ET ∼ P(ET | μ1, μ2, …, μt, …, μT), where ET is the predicted emotional probability distribution corresponding to the recipient.
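As a concrete fixture for the problem definition above, the input/output contract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is our own, the label set is Ekman's six emotions plus neutral, and the uniform placeholder stands in for a learned distribution ET ∼ P(ET | μ1, …, μT).

```python
# Minimal sketch of the emotion-inference task interface (illustrative only).
EMOTIONS = ["surprise", "sadness", "fear", "joy", "disgust", "anger", "neutral"]

def infer_emotion(dialogue):
    """Map utterances (mu_1, ..., mu_T) to a probability distribution over the
    recipient's emotional reaction to the last utterance mu_T."""
    assert len(dialogue) > 0
    # A real model encodes the utterances; here a uniform placeholder fixes
    # only the input/output contract: a distribution summing to 1.
    return {e: 1.0 / len(EMOTIONS) for e in EMOTIONS}

dist = infer_emotion(["How was the exam?", "I failed again."])
assert abs(sum(dist.values()) - 1.0) < 1e-9
```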
This paper proposes a dialogue emotion inference model based on commonsense enhancement and graph modeling (CEICG). Based on the idea that each utterance in a dialogue should have a certain connection with previous utterances, we not only consider the relevant sequential information but also the rationality of information change over time. To achieve this, we transform the dialogue process into a graph structure for storage, and then use GCN [39] to update the node information within the graph. This node information can be viewed as emotional information derived from external commonsense and dialogue sequence features. Additionally, we integrate external commonsense as auxiliary information into the model, and finally obtain the corresponding emotional probability distribution through a classifier. The structure of the proposed CEICG model is shown in Fig 1. The proposed CEICG model comprises three main components: emotion feature extraction, construction of the dialogue graph with integrated external commonsense, and emotion inference.
3.2 Emotion feature extraction
The process primarily includes two main steps: word embedding and feature extraction.
(1) Word Embedding
Word embedding is a natural language processing technique that maps words or phrases from a vocabulary into a continuous space of dense numerical vectors. Traditional word representation methods, such as one-hot encoding, generate high-dimensional sparse vectors in which most elements are zeros, making them inefficient and unable to effectively capture relationships between words. RoBERTa (Robustly optimized BERT approach) is an improved version of BERT, introduced by Facebook AI in 2019. RoBERTa enhances BERT by training on larger datasets for longer periods and removing the Next Sentence Prediction (NSP) task during the pre-training phase. This allows RoBERTa to capture richer linguistic features and subtle semantic differences. Therefore, we choose RoBERTa as the encoder to generate dynamic word embeddings, meaning that the same word will have different embeddings in different contexts. This characteristic enables our embeddings to handle polysemy and context-related semantics. We obtain the vector representation of each sentence from the last layer of the encoder. The encoding process is represented as follows:
v1, v2, …, vT = RoBERTa(μ1, μ2, …, μT)  (1)
Here, v1, v2, …, vT are the representations of the corresponding utterances μ1, μ2, …, μT.
(2) Feature extraction
In real conversations or multi-turn dialogue environments, the sentiment of the text may change as the dialogue progresses. In other words, the sentiment at time t can change due to the accumulation of sentiment from previous conversations. Therefore, the introduction of LSTM will help the model better track and understand these dynamic changes in sentiment. For example, considering that the sentiment tendencies of the text may be influenced by distant phrases, LSTM effectively retains this important information through its gating mechanisms, even if there is a long interval between them. At the same time, the forget gate of LSTM allows the model to “forget” unimportant or irrelevant information, which helps the model focus on information more critical to the current task. For a specific utterance representation vt, we use it as the input to LSTM to obtain the sentiment features of each sentence. This process is formalized as follows:
(ht, ct) = LSTM(vt, (ht−1, ct−1))  (2)
Here, ht denotes the hidden state at time t, ct represents the cell state at time t, vt is the input at time t, ht−1 is the hidden state at time t − 1, and ct−1 is the cell state at time t − 1. In our experiments, the initial hidden state is set to 0.
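The gating behaviour described above can be written out explicitly. The sketch below implements a single LSTM step (Eq 2) in plain NumPy; the weight shapes, random initialisation, and dimensions are illustrative and not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W, U, b):
    """One LSTM step (Eq 2). W, U, b stack the input, forget, cell, and
    output gate parameters, so they have 4*d rows for a hidden size d."""
    d = h_prev.shape[0]
    z = W @ v_t + U @ h_prev + b          # pre-activations for all four gates
    i = sigmoid(z[0:d])                   # input gate: admit new information
    f = sigmoid(z[d:2*d])                 # forget gate: drop stale sentiment
    g = np.tanh(z[2*d:3*d])               # candidate cell state
    o = sigmoid(z[3*d:4*d])               # output gate
    c_t = f * c_prev + i * g              # updated cell state
    h_t = o * np.tanh(c_t)                # hidden state = utterance feature h_t
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 8, 4
W = rng.normal(size=(4 * d_hid, d_in))
U = rng.normal(size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)   # initial hidden state set to 0
for v in rng.normal(size=(3, d_in)):      # three utterance vectors v_1..v_3
    h, c = lstm_step(v, h, c, W, U, b)
assert h.shape == (d_hid,)
```

The forget gate `f` is what lets the cell "forget" irrelevant information while the input gate admits new sentiment cues, matching the motivation given above.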
3.3 Constructing dialogue graphs integrating external knowledge
3.3.1 Constructing dialogue graph.
Given a set of conversations U = {μ1, μ2, …, μt, …, μT}, these are transformed into conversation representation features {h1, h2, …, ht, …, hT} by the emotion feature extraction component. Let Ht = {h1, h2, …, ht} be the set of nodes in the graph at time t, and let γt = {(hi, hj) | hi, hj ∈ Ht, hi ≠ hj} be the set of relational edges. The graph Gt(Ht, γt) can then be used to represent the dialogue state at time t, where (hi, hj) represents a relational edge between the nodes hi and hj. To better represent the sequential information and relevant semantic information of the utterances, two different types of relational edges are designed:
- (1) Sequential Edge: This edge type represents a clear sequence between two utterances. For example, if a speaker says utterance hi, and another speaker replies with utterance hj, there is a clear sequential relationship between them. We define this relational edge as a sequential edge and set its weight to 1.
- (2) Semantic Edge: This type of relationship represents the case where there is no clear sequential relationship between two utterances. In this case, we calculate the edge weight between them using cosine distance. We then prune the edges with low weights to remove unnecessary information. In the next time step, we insert the node representing the new utterance into the current graph to construct a new graph, thereby reflecting the temporal sequence and semantic connections of the utterances throughout the dialogue.
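The two edge types above can be sketched as a weighted adjacency matrix. This is a minimal illustration: the pruning threshold of 0.5 is our assumption, since the paper does not state one, and cosine similarity is used as the semantic weight.

```python
import numpy as np

def build_graph(H, threshold=0.5):
    """H: (T, d) array of utterance features h_1..h_T.
    Returns a symmetric edge-weight matrix with sequential edges (weight 1)
    between adjacent utterances and pruned semantic edges elsewhere."""
    T = H.shape[0]
    A = np.zeros((T, T))
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    S = Hn @ Hn.T                              # pairwise cosine similarity
    for i in range(T):
        for j in range(i + 1, T):
            if j == i + 1:
                A[i, j] = A[j, i] = 1.0        # sequential edge (reply order)
            elif S[i, j] >= threshold:
                A[i, j] = A[j, i] = S[i, j]    # semantic edge, pruned if weak
    return A

H = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]])
A = build_graph(H)
assert A[0, 1] == 1.0          # adjacent utterances: sequential edge
assert A[0, 2] == 0.0          # dissimilar, non-adjacent: pruned away
assert A[0, 3] > 0.5           # similar, non-adjacent: semantic edge kept
```

Inserting a new utterance at the next time step corresponds to appending a row to `H` and rebuilding (or incrementally extending) `A`.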
3.3.2 Updating dialogue graphs with external commonsense.
External commonsense refers to knowledge that has been widely recognized and validated within a specific domain, including but not limited to linguistic rules, common sense judgments, cultural background, and behavior patterns in specific scenarios. By incorporating this knowledge into models, it can help them better understand the implicit emotions and intentions in dialogue content. For example, in a conversation with the sentence, “There will be heavy rain tomorrow, our game might be canceled,” analyzing the text based solely on surface emotions might make it difficult to accurately determine the speaker’s mood. However, if the model possesses commonsense knowledge related to games, such as understanding that people usually feel disappointed when a game is canceled, the model can more accurately infer that the speaker might be expressing regret or disappointment.
To achieve this, we extract commonsense knowledge for each utterance during the dialogue process and integrate it into a graph model. In previous studies, the social commonsense graph ATOMIC and the classification/lexical knowledge graph ConceptNet have proven to enhance machine understanding in emotion analysis within natural language research. Among them, ATOMIC provides a large number of “if-then” relationships covering various scenarios in daily life. This rich knowledge base can help NLP models understand the logic, causes, and effects behind human behavior, thereby enhancing the model’s contextual understanding ability. ATOMIC, by segmenting different types of social interactions and outcomes of human behavior, offers fine-grained social interaction commonsense, which aids NLP models in more accurately predicting the consequences of human actions, understanding the dynamics of interpersonal relationships, and predicting emotions and psychological states. ATOMIC contains a large number of commonsense knowledge tuples, consisting of a head phrase, relation, and attribute, such as: (“Someone is watching a movie alone at home,” xReact, “afraid”). The relationship types and their meanings in the ATOMIC knowledge base are shown in Table 1. Utilizing ATOMIC’s knowledge can improve a model’s generalization ability to unseen text, enabling the model to not only understand the text based on its literal meaning but also make inferences using underlying commonsense, thereby exhibiting better performance when handling new scenarios and tasks.
With the development of deep learning technology, more and more research is focusing on how to utilize deep learning models, especially pre-trained language models like BERT and GPT, to automatically generate and enhance knowledge graphs. Compared to traditional methods of knowledge graph construction, which primarily rely on manual editing, generative knowledge graphs can expand and update knowledge bases more quickly, improving the efficiency and coverage of knowledge acquisition. Moreover, generative methods can significantly reduce the labor costs associated with building and maintaining knowledge graphs. For example, COMET, a deep learning model based on GPT and pre-trained on ATOMIC, automatically generates commonsense knowledge tuples from text, thereby constructing and expanding knowledge graphs. In addition to acquiring traditional knowledge, this approach also allows us to obtain less frequently considered but still useful knowledge for our tasks.
Therefore, we divide knowledge acquisition into two parts: knowledge querying and knowledge generation. First, for knowledge querying, we utilize ATOMIC to perform knowledge queries. We use the SBERT model, based on cosine similarity and BERT, to calculate the similarity between each utterance and the Head phrases in ATOMIC, obtaining the relationship attributes that are most similar to the Head phrases as the commonsense knowledge we query. In our research, we mainly select two relationships that are highly relevant to sentiment analysis: xReact and oReact, which represent how the subject feels after the event and how others feel after the event, respectively. These two attributes are crucial for extracting emotions. Next, in the knowledge generation part, we use COMET-ATOMIC to generate new commonsense knowledge. We input the utterance μi and the relationship types to be selected (xReact and oReact) into the model, and the model outputs the corresponding results based on the relationship types we selected. We use BiLSTM to extract the feature vectors of each relationship, obtaining the knowledge representations based on knowledge querying, oA-xReact and oA-oReact, and the knowledge representations based on knowledge generation, oC-xReact and oC-oReact. Finally, all the knowledge is fused together, as shown below:
Ki = WT [oA-xReacti ; oA-oReacti ; oC-xReacti ; oC-oReacti]  (3)

where oA-xReacti, oA-oReacti, oC-xReacti, and oC-oReacti represent the commonsense knowledge extracted from the utterance μi, [· ; ·] denotes concatenation, WT is the learnable parameter matrix, and Ki is the feature obtained after fusing the commonsense knowledge related to the utterance μi. Here’s an example to illustrate how to derive the xReact and oReact relational properties from a sentence and generate common sense knowledge, followed by the fusion process. For convenience, the process is named DGIEKGen:
DGIEKGen Algorithm Example
Assume the input sentence i is: “Vincent passed the exam”.
- Step 1: Encode the Sentence—Use SBERT to encode sentence i, obtaining a vector representation.
- Step 2: Encode Head Phrases—Select a series of Head phrases from the ATOMIC database, such as “someone achieves success” or “someone reaches a goal,” and encode them using SBERT.
- Step 3: Calculate Similarity—Compute the cosine similarity between the encoded sentence vector and each Head phrase vector.
- Step 4: Select the most similar Head phrase—Choose the Head phrase most similar to the sentence based on the calculated similarity values. For sentence i, the most similar Head phrase is “someone achieves success.”
- Step 5: Retrieve Common Sense Knowledge—Use the selected Head phrase to find corresponding common sense knowledge from the ATOMIC database. For example, the common sense for the Head phrase “someone achieves success” might be “achieving a goal through effort.”
- Step 6: Extract xReact and oReact—Extract xReact and oReact relational properties related to emotion analysis from the retrieved common sense. For instance, for the common sense “achieving a goal through effort”:
- xReacti (Vincent’s reaction): “Vincent feels happy.”
- oReacti (others’ reaction): “Others feel proud.”
- Step 7: Generate New Common Sense Using COMET-ATOMIC. For example,
- For xReacti, the new common sense generated by the COMET-ATOMIC algorithm is XGeni: “Vincent feels proud because his efforts paid off.”
- For oReacti, the common sense generated by the COMET-ATOMIC algorithm is OGeni: “Others celebrate Vincent because he achieved a significant accomplishment.”
- Step 8: Use BiLSTM to extract the corresponding query-based and generation-based knowledge representations, and then perform fusion (Eq 3) to generate the features of Ki. Here,
- oA-xReacti represents features extracted from xReacti.
- oA-oReacti represents features extracted from oReacti.
- oC-xReacti represents features extracted from XGeni.
- oC-oReacti represents features extracted from OGeni.
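The retrieval part of this procedure (Steps 1–5) can be sketched with toy data. In this illustration SBERT is replaced by a bag-of-words stand-in encoder, and the three head phrases and their xReact/oReact attributes are invented; none of this is the actual ATOMIC content.

```python
import numpy as np

ATOMIC_TOY = {                                  # invented head phrases/attributes
    "someone achieves success": {"xReact": "happy", "oReact": "proud"},
    "someone loses a game": {"xReact": "sad", "oReact": "sympathetic"},
    "someone reaches a goal": {"xReact": "satisfied", "oReact": "impressed"},
}
SENTENCE = "Vincent passed the exam, a success"
VOCAB = sorted({w for s in list(ATOMIC_TOY) + [SENTENCE.lower()]
                for w in s.split()})

def encode(sentence):                           # bag-of-words stand-in for SBERT
    v = np.zeros(len(VOCAB))
    for w in sentence.lower().split():
        if w in VOCAB:
            v[VOCAB.index(w)] += 1.0
    return v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def query_knowledge(sentence):
    """Steps 1-5: encode the sentence, compare it with each head phrase,
    pick the most similar one, and return its xReact/oReact attributes."""
    q = encode(sentence)
    best = max(ATOMIC_TOY, key=lambda head: cos(q, encode(head)))
    return best, ATOMIC_TOY[best]

head, attrs = query_knowledge(SENTENCE)
assert head == "someone achieves success"
```

In the full model the retrieved attributes and the COMET-generated texts (Steps 6–7) would then be encoded by the BiLSTM and fused as in Eq (3).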
For the knowledge fusion features Ki extracted from μi, we insert them as new nodes into the corresponding graph Gi at time i, forming a new graph G′i. The resulting graph not only represents the structure of the conversation well but also integrates relevant external commonsense knowledge. To obtain the final utterance features, we use a Graph Convolutional Network (GCN), which has demonstrated excellent performance in processing graph data [39, 40]. For the graph converted from text data, the GCN leverages the structural information within it to better understand the syntactic and semantic features of the sentences, thereby improving sentiment analysis performance. Additionally, by stacking multiple graph convolutional layers, it is possible to learn multi-level features of the text data layer by layer. The initial layers may capture lexical or local features, and as the number of layers increases, the model can capture deeper semantic features [41, 42]. This approach enhances the model’s understanding of complex semantic relationships and structures. Ultimately, each node in the graph is updated, and the updated nodes contain not only dialogue history information but also relevant commonsense knowledge.
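The node update can be sketched as a standard Kipf–Welling graph convolution, H′ = ReLU(D^(−1/2)(A + I)D^(−1/2) H W), stacked twice; the layer sizes and random weights here are illustrative, not the paper's configuration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolutional layer: H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(0.0, d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)

rng = np.random.default_rng(1)
A = np.array([[0., 1., 0.],                   # 3-node path graph: u1 - u2 - u3
              [1., 0., 1.],
              [0., 1., 0.]])
H = rng.normal(size=(3, 4))                   # node features (utterances + knowledge)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
H1 = gcn_layer(A, H, W1)                      # first layer: local/lexical features
H2 = gcn_layer(A, H1, W2)                     # stacked layer: deeper semantics
assert H2.shape == (3, 2)
```

Each row of the output mixes a node's own features with those of its neighbours, which is how updated nodes come to carry both dialogue-history and commonsense information.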
3.4 Inference of emotions
The updated node gt from the GCN is used as the final discourse representation, where gt is the feature vector extracted from the discourse μt at time t through the GCN. To integrate the features extracted from previous layers, to enhance the model’s flexibility and adaptability, and to map these features to the target sentiment categories, we send them to a fully connected layer and apply softmax to obtain the final sentiment probability distribution, as shown below:

ET = softmax(We gT)  (4)

Here, We is a learnable parameter matrix, and gT is the node embedding corresponding to the discourse μT after GCN updates. In model training, we utilize the cross-entropy loss function to optimize the trainable parameters. This loss function measures the difference between the probability distribution predicted by the model and the true label’s probability distribution, making the training objective and performance evaluation more intuitive. In this paper, it is defined as follows:
L = −(1/N) Σi=1..N Σc=1..C yi,c log(pi,c)  (5)

where N is the total number of samples, C is the total number of sentiment categories, yi,c is 1 if the true sentiment category of the i-th sample is c and 0 otherwise, pi,c is the probability that the model predicts the i-th sample belongs to category c, and L is the average loss over the entire dataset. To minimize the difference between the predicted probability distribution of the model and the actual label’s probability distribution, the cross-entropy loss function is widely used in sentiment classification tasks to improve model accuracy. During this process, the model parameters are updated through the backpropagation algorithm to reduce the value of the loss function. Algorithm 1 describes the entire process of emotion inference using CEICG from a conversation dataset U.
Algorithm 1 CEICG: emotion inference from a conversation dataset U
Require: a conversation dataset U
Ensure: the predicted emotional probability distribution ET
1: {μ1, μ2, …, μT}←U
2: {v1, v2, …, vT} = RoBERTa(μ1, μ2, …, μT)
3: for each t from 1 to T do
4: ht ← LSTM(vt, (ht−1, ct−1))
5: H ← {h1, h2, …, hT}
6: γ ← {{hi, hj}|hi, hj ∈ H, hi ≠ hj}
7: for each t from 1 to T do
8: Gt ← (Ht, γt)
9: Kt ← DGIEKGen(Gt)
10: gt ← GCN(Gt, Kt)
11: ET ← softmax(W gT + b)
12: return ET
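The flow of Algorithm 1 can be sketched end to end in a highly simplified form, with random vectors standing in for the RoBERTa and LSTM encoders and plain neighbour averaging standing in for the knowledge-enhanced GCN update; the utterances, dimensions, and seven-class output below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def encode(utterances, dim=16):
    # Stand-in for steps 2-5 (RoBERTa embeddings + LSTM context);
    # real feature extractors would go here.
    return rng.normal(size=(len(utterances), dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

utterances = ["How are you?", "Great, thanks!", "Glad to hear it."]
T = len(utterances)
H = encode(utterances)                                           # h_1..h_T

# Step 6: edge set over distinct utterance pairs (fully connected here).
edges = [(i, j) for i in range(T) for j in range(T) if i != j]

# Steps 7-10: graph construction and node update; plain neighbour
# averaging replaces the commonsense-enhanced GCN of the paper.
G = np.stack([H[[j for i, j in edges if i == t]].mean(axis=0)
              for t in range(T)])

W = rng.normal(size=(16, 7))          # 7 emotion classes, as in MELD
E_T = softmax(G[-1] @ W)              # steps 11-12: distribution for mu_T
```

The skeleton mirrors the algorithm's shape (encode, build edges, update nodes, classify the last node) while leaving every learned component as a placeholder.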
4 Analyses and results of experiments
4.1 Experimental datasets
The English dialogue datasets used in this experiment include:
- MELD: This dataset is a resource covering multi-party conversations [20], which includes text, audio, and video data derived from the TV show “Friends.” The dataset contains over 1,400 dialogues and more than 13,000 utterances. Each dialogue involves multiple participants, and each utterance is annotated with one of the following emotion labels: surprise, joy, disgust, anger, sadness, fear, and neutral. In this experiment, the predefined textual utterances and the pre-split training and test sets from MELD are used directly. The dataset can be accessed and downloaded via the following URL: https://affective-meld.github.io/
- Topical Chat: This is a topical chat dataset obtained from Amazon. It contains over 8,000 dialogues and 184,000 utterances. Each utterance is labeled with one of the following emotion labels: curious, happy, sad, surprised, neutral, and others. These labels represent the emotions perceived by the sender. The dataset can be accessed and downloaded via the following URL: https://github.com/alexa/Topical-Chat
- DailyDialog: This dataset covers a wide range of topics, including daily life, work, travel, health, finance, etc., aiming to simulate real daily conversation scenarios. In addition to the original dialogue text, the dataset also provides annotations of emotions and communicative intents for each sentence in the dialogue, supporting research in emotion analysis and intent recognition. The dataset comprises multi-turn dialogues that are carefully designed and written to ensure they are both natural and reflective of real-life conversation patterns. The dataset includes over 13,000 multi-turn dialogues, with emotional annotations based on Ekman’s six basic emotions. The dataset can be accessed and downloaded via the following URL: http://yanran.li/dailydialog.html
Some statistical information about these three datasets is shown in Table 2.
4.2 Evaluation metrics
To evaluate the model’s performance, this experiment uses Precision, Recall, and the weighted F1-score as evaluation metrics. Precision measures the proportion of correctly predicted samples among all predicted results. Recall measures the proportion of correctly predicted samples among all actual samples. The F1-score is the harmonic mean of Precision and Recall. The formulas are as follows:
$$\text{Precision} = \frac{Num_{correct}}{Num_{pre}} \tag{6}$$

$$\text{Recall} = \frac{Num_{correct}}{Num_{gold}} \tag{7}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$

Here, $Num_{correct}$ represents the number of correctly predicted samples of a sentiment category (true positives), $Num_{pre}$ represents all samples predicted as that category (true positives plus false positives), and $Num_{gold}$ represents all samples that actually belong to that category (true positives plus false negatives). The weighted F1-score averages the per-category F1 scores, weighted by each category's support.
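These metrics can be reproduced in a few lines of plain Python; the toy prediction and gold-label lists are illustrative, not drawn from the experiments:

```python
from collections import Counter

def prf(preds, golds, label):
    """Precision, recall, and F1 for one emotion category."""
    tp = sum(p == g == label for p, g in zip(preds, golds))
    num_pre = sum(p == label for p in preds)    # predicted as `label`
    num_gold = sum(g == label for g in golds)   # actually `label`
    precision = tp / num_pre if num_pre else 0.0
    recall = tp / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def weighted_f1(preds, golds):
    """Per-category F1 averaged with support (class frequency) weights."""
    support = Counter(golds)
    n = len(golds)
    return sum(support[c] / n * prf(preds, golds, c)[2] for c in support)

golds = ["joy", "joy", "sad", "neutral", "neutral", "neutral"]
preds = ["joy", "sad", "sad", "neutral", "joy", "neutral"]
print(round(weighted_f1(preds, golds), 3))  # → 0.678
```

Weighting by support keeps frequent classes such as "neutral" from being drowned out by rare ones, which matters on the imbalanced emotion distributions of these datasets.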
4.3 Hyper parameters
In these experiments, the BERT layer uses the base version of RoBERTa with a hidden-layer dimension of 768 and employs dropout for regularization. The LSTM hidden-layer dimension is set to 128. The GCN consists of two layers, and the optimal number of fully connected layers is 2. The default number of training iterations is 1000, and parameters are optimized with the Adam optimizer. The learning rate is set to 0.001, and the batch size is set to 64. Experimental environment: an RTX 3090 GPU with 24 GB of VRAM, an i9-12900K CPU, the PyCharm development platform, and PyTorch 1.13.2 as the deep learning framework.
4.4 Baseline models
The proposed model is compared with six baseline models:
- LSTM [43]: This model uses a basic LSTM structure, which effectively captures long-distance dependencies by introducing three gates (forget gate, input gate, output gate) and a cell state. It ultimately obtains contextual representations of historical utterances from the conversation and uses an emotion classifier for emotion classification.
- LSTM+ATT [44]: This model combines LSTM with an attention mechanism. The attention mechanism helps the model identify words or sentences most crucial for emotion prediction, such as emotionally charged vocabulary, thereby improving prediction accuracy. It enhances the model’s performance by leveraging the ability of LSTM to process sequential data and remember long-term dependencies, along with the attention mechanism’s advantage of highlighting key information.
- DialogueRNN [45]: The DialogueRNN model is a recurrent neural network architecture specifically designed for dialogue systems, aimed at better capturing emotional dynamics and interaction relationships among participants in a conversation. Unlike traditional RNNs, DialogueRNN particularly considers dependencies between roles (such as speakers) and the dialogue context, to improve the accuracy and depth of tasks like emotion analysis and dialogue understanding.
- DialogueGCN [40] (Dialogue Graph Convolutional Networks): This model is a graph convolutional neural network specifically designed for dialogue systems, intended to capture complex structures and dynamic relationships within a conversation. By treating dialogue as a graph structure, where nodes represent utterances and edges denote various relationships between utterances (e.g., replies, references), DialogueGCN effectively understands and utilizes contextual information. Its advantage lies in its ability to integrate semantic information from the dialogue content and relational information from the dialogue structure, enhancing the dialogue system’s contextual understanding.
- DialogInfer-Ensemble [28]: This model focuses on emotion inference tasks in multi-turn dialogues by simulating the propagation of emotional states between speakers in the dialogue history. It introduces an address-aware module that can automatically learn whether participants in the next round of dialogue maintain historical emotional states or are influenced by others. Additionally, an ensemble strategy is proposed to extract multiple potential emotional responses, further enhancing model performance.
- DialogueGLP [29]: This method combines the temporal sequence processing abilities of RNNs with the deep language understanding capabilities of PLMs, enabling the model to deeply analyze dialogue history for more accurate predictions of upcoming emotional states, even in the absence of direct utterance information from the responder. It not only focuses on emotion expression in individual utterances but also considers the complex dynamics of emotion propagation and change in dialogues. Moreover, by inputting the entire dialogue history as contextual information into the PLM and using context learning to extract and generate knowledge, it provides a unique way for the model to understand and leverage the implicit deep semantics and emotional cues within the dialogue.
4.5 Analyses and results of experiments
RQ1: Performance comparison with existing models.
To verify the effectiveness of the proposed CEICG model in the emotion inference task, we conducted experimental comparisons between the CEICG model and the six baseline models mentioned above across three dialogue datasets. The experimental results are shown in Tables 3–5, and Fig 2.
From Table 3, it can be seen that on the DailyDialog dataset, the precision of the CEICG model improved by 1.1% compared to the second-ranked DialogueGLP and by 19.2% compared to the lowest-ranked LSTM. The recall of CEICG improved by 1.6% compared to DialogueGLP and by 10.0% compared to LSTM. Its F1 score improved by 1.3% compared to DialogueGLP and by 14.4% compared to LSTM. Compared to the MELD and Topical Chat datasets, the DailyDialog dataset consists primarily of two-person conversations with a greater number of dialogue turns (as shown in Table 2); therefore, all models achieved better performance on this dataset. The knowledge-based methods CEICG and DialogueGLP nevertheless outperform most models, because the DailyDialog dataset is generated from everyday conversations and therefore involves a broader range of external commonsense knowledge than the other datasets. Additionally, the dialogues in DailyDialog are relatively shorter than those in long-dialogue datasets, so the emotional features that can be extracted are limited, and models based on historical emotional features tend to perform relatively weaker. In contrast, models that incorporate external commonsense knowledge as auxiliary information can capture more features for the emotion inference task, thereby enhancing performance.
As shown in Table 4, on the Topical Chat dataset, the precision of the CEICG model improved by 0.6% compared to the second-ranked DialogInfer-Ensemble and by 14.2% compared to the lowest-ranked LSTM. Its recall rate improved by 1.6% compared to the second-ranked DialogueRNN and by 9.2% compared to LSTM. The F1 score improved by 1.5% compared to DialogInfer-Ensemble and by 11.6% compared to LSTM. The experiments indicate that, compared to the DailyDialog dataset, the main characteristics of the Topical Chat dataset are fewer conversations and a multi-party dialogue format (as shown in Table 2). Consequently, all models performed at a lower level on this dataset. Among these models, LSTM still exhibited the worst performance, while DialogueGLP performed slightly worse than DialogInfer-Ensemble. This is because DialogInfer-Ensemble extracts features from the dialogue process based on both sequence and graph approaches, whereas the knowledge generated by DialogueGLP has weaker weights within the same topic. The experiments also demonstrate that, compared to the DailyDialog dataset, on the more challenging inference dataset Topical Chat, our model shows better advantages in terms of both precision and recall compared to the second-ranked DialogInfer-Ensemble.
From Table 5, it can be seen that on the MELD dataset, the precision of the CEICG model improved by 3.1% compared to the second-ranked DialogueGLP and by 21.7% compared to the lowest-ranked DialogueRNN. Its recall improved by 2.8% compared to DialogueGLP and by 14.1% compared to the lowest-ranked LSTM. The F1 score of CEICG improved by 3.0% compared to DialogueGLP and by 15.8% compared to the lowest-ranked LSTM. This experiment shows that although LSTM can effectively handle long-distance dependencies, it often learns only superficial features from complex sentences. The LSTM+ATT model, with its added attention mechanism, allows the model to focus on sentences relevant to the emotion inference task, thereby reducing the impact of noise. DialogueRNN enhances the ability to capture factors influencing emotion inference in multi-party dialogues by considering the characteristics of the speaker and the interlocutors, resulting in significant improvements over LSTM and LSTM+ATT, which rely solely on historical dialogue features. DialogueGCN and DialogInfer-Ensemble are both based on sequence and graph models that transform the multi-party dialogue process into graph nodes, effectively simulating the evolution of the dialogue and allowing the model to better understand its semantic and structural features; this yields F1 score improvements of 0.0037 and 0.0137 over DialogueRNN, demonstrating the feasibility of converting the dialogue process into a graph structure. DialogueGLP, by incorporating external knowledge through a pre-trained model, shows some improvement over DialogueGCN and DialogInfer-Ensemble. The experimental results show that our model CEICG, by defining graph models with different relational edges and extracting external knowledge via knowledge querying and fusion, better accomplishes the emotion inference task.
The comparison of F1 scores for emotion inference across the three datasets is shown in Fig 2. On the DailyDialog dataset, the F1 scores of all models are relatively high, with the proposed CEICG model showing a modest advantage over the second-ranked DialogueGLP. On the more challenging Topical Chat dataset, CEICG exhibits a greater advantage over the second-ranked DialogInfer-Ensemble. Furthermore, on the most difficult MELD dataset, CEICG holds a clear advantage over the second-ranked DialogueGLP. Compared to all other models, the proposed model demonstrates robust performance, adapting well to more complex real-world data environments.
RQ2: Ablation experiments.
To understand the impact of the various components of the model on the final performance, we systematically removed or modified specific parts of the CEICG model and observed how these changes affected the model’s performance, thereby validating the effectiveness of each module in the CEICG model. This experiment was based on the Topical Chat dataset, and the results are shown in Table 6 and Fig 3. In Table 6, “-” indicates the removal of the corresponding structure. CEICG-G-K refers to the CEICG model with the graph model and external knowledge component removed; CEICG-G refers to the model with only the graph component removed, integrating external knowledge into the text feature vector; CEICG-K refers to the model with the external knowledge component removed; CEICG-KA refers to the model with the external knowledge module’s knowledge query structure removed; and CEICG-KC refers to the model with the external knowledge module’s knowledge generation structure removed.
The results indicate that whether removing the graph model or the external knowledge module, the model experiences a corresponding decline in performance. The performance decreased by 3.82% when only the graph model was removed, demonstrating that utilizing a graph structure to represent the dialogue process aids the model in better understanding the structural and semantic features of the conversation, leading to the greatest improvement in model performance. When only the external knowledge component was removed, the performance decreased by 1.84%. Although this impact is not as significant as that of the graph model, it still suggests that incorporating external knowledge as auxiliary information helps enhance the model’s understanding capabilities. Additionally, we analyzed the roles of knowledge querying and knowledge generation within the knowledge fusion component. From the experimental results, both parts contribute to the model’s performance enhancement. These findings reveal that the various components of our model play a crucial role in the task of emotion reasoning, highlighting the importance of each component within the model.
5 Conclusions
This paper presents a new CEICG model for the emotion inference task, which primarily leverages external knowledge and graph models. The model first utilizes RoBERTa to embed dialogue sentences, obtaining rich contextual representations. LSTM is then employed to extract emotional features, which are used to create nodes in the graph model. The edges of the graph model are constructed based on the sequential and semantic characteristics of the nodes. External knowledge is obtained through knowledge querying and knowledge generation and is then integrated into the graph model. Finally, Graph Convolutional Networks (GCN) are used to update the nodes in the graph model, followed by classification with an emotion classifier. A comparison with several existing baseline methods demonstrates that the CEICG model achieves competitive results in the emotion inference task, with F1 scores surpassing those of the baseline methods across three datasets. Particularly on the DailyDialog dataset, which involves a wide range of commonsense knowledge but consists of shorter dialogues, our model demonstrates superior performance.
Supporting information
S1 File. Data set MELD.
Description: Public dataset that was used in this study, including all measured variables. Format: pkl file. Availability: The complete dataset can be downloaded from the following link: URL: https://affective-meld.github.io/.
https://doi.org/10.1371/journal.pone.0315039.s001
(ZIP)
S2 File. Data set topical chat.
Description: Public dataset that was used in this study, including all measured variables. This dataset can be directly downloaded from the internet. Format: json file. Availability: The complete dataset can be downloaded from the following link: https://github.com/alexa/Topical-Chat.
https://doi.org/10.1371/journal.pone.0315039.s002
(ZIP)
S3 File. Data set dailyDialog.
Description: Public dataset that was used in this study, including all measured variables. Format: txt file. Availability: The complete dataset can be downloaded from the following link: http://yanran.li/dailydialog.html.
https://doi.org/10.1371/journal.pone.0315039.s003
(ZIP)
References
- 1. Wu M, Su W, Chen L, Liu Z, Cao W, Hirota K. Weight-adapted convolution neural network for facial expression recognition in human–robot interaction. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2019;51(3):1473–1484.
- 2. Lee CC, Mower E, Busso C, Lee S, Narayanan S. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication. 2011;53(9-10):1162–1171.
- 3. Ferrara E, Yang Z. Measuring emotional contagion in social media. PloS One. 2015;10(11):e0142390. pmid:26544688
- 4. Sgorbissa A, Papadopoulos I, Bruno B, Koulouglioti C, Recchiuto C. Encoding guidelines for a culturally competent robot for elderly care. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2018. p. 1988–1995.
- 5. Kanda T, Shiomi M, Miyashita Z, Ishiguro H, Hagita N. An affective guide robot in a shopping mall. In: Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction; 2009. p. 173–180.
- 6. Belpaeme T, Kennedy J, Ramachandran A, Scassellati B, Tanaka F. Social robots for education: A review. Science Robotics. 2018;3(21):eaat5954. pmid:33141719
- 7. Koolagudi SG, Maity S, Kumar VA, Chakrabarti S, Rao KS. IITKGP-SESC: speech database for emotion analysis. In: Contemporary Computing: Second International Conference, IC3 2009, Noida, India, August 17-19, 2009. Proceedings 2. Springer; 2009. p. 485–492.
- 8. Hakak NM, Mohd M, Kirmani M, Mohd M. Emotion analysis: A survey. In: 2017 International Conference on Computer, Communications and Electronics (COMPTELIX). IEEE; 2017. p. 397–402.
- 9. Alslaity A, Orji R. Machine learning techniques for emotion detection and sentiment analysis: current state, challenges, and future directions. Behaviour & Information Technology. 2024;43(1):139–164.
- 10. Hasegawa T, Kaji N, Yoshinaga N, Toyoda M. Predicting and eliciting addressee's emotion in online dialogue. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2013. p. 964–972.
- 11. Lutz C, White GM. The anthropology of emotions. Annual Review of Anthropology. 1986;15(1):405–436.
- 12. Stojanovski D, Strezoski G, Madjarov G, Dimitrovski I, Chorbev I. Deep neural network architecture for sentiment analysis and emotion identification of Twitter messages. Multimedia Tools and Applications. 2018;77:32213–32242.
- 13. Zhou H, Young T, Huang M, Zhao H, Xu J, Zhu X. Commonsense knowledge aware conversation generation with graph attention. In: IJCAI; 2018. p. 4623–4629.
- 14. Zhang Z, Li J, Zhao H. Multi-Turn Dialogue Reading Comprehension With Pivot Turns and Knowledge. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021;29:1161–1173.
- 15. Zhang A, Wu S, Zhang X, Chen S, Shu Y, Feng Z. EmoEM: Emotional Expression in a Multi-turn Dialogue Model. In: 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI); 2020. p. 496–501.
- 16. Plutchik R. The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist. 2001;89(4):344–350.
- 17. Ekman P. An argument for basic emotions. Cognition & Emotion. 1992;6(3-4):169–200.
- 18. Plutchik R. A general psychoevolutionary theory of emotion. In: Emotion: Theory, Research, and Experience; 1980. p. 3–33.
- 19. Lazarus RS. Emotion and Adaptation. Oxford University Press; 1991.
- 20. Zhang L, Lu L, Wang X, Zhu RM, Bagheri M, Summers RM, et al. Spatio-temporal convolutional LSTMs for tumor growth prediction by learning 4D longitudinal patient data. IEEE Transactions on Medical Imaging. 2019;39(4):1114–1126. pmid:31562074
- 21. Narayana S, Radwan I, Parameshwara R, Abbasnejad I, Asthana A, Subramanian R, et al. A weakly supervised approach to emotion-change prediction and improved mood inference. In: 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE; 2023. p. 1–8.
- 22. Li D, Li Y, Wang S. Interactive double states emotion cell model for textual dialogue emotion prediction. Knowledge-Based Systems. 2020;189:105084.
- 23. Gaonkar R, Kwon H, Bastan M, Balasubramanian N, Chambers N. Modeling label semantics for predicting emotional reactions. arXiv preprint arXiv:2006.05489. 2020.
- 24. Sun X, Li J, Wei X, Li C, Tao J. Emotional editing constraint conversation content generation based on reinforcement learning. Information Fusion. 2020;56:70–80.
- 25. Yingjian L, Xiaoping W, Shanglin L. Emotion Prediction in Conversation Based on Relationship Extraction. In: 2022 IEEE International Conference on Cyborg and Bionic Systems (CBS). IEEE; 2023. p. 53–58.
- 26. Altarawneh E, Agrawal A, Jenkin M, Papagelis M. Predicting Evoked Emotions in Conversations. arXiv preprint arXiv:2401.00383. 2023.
- 27. Singh GV, Firdaus M, Chauhan DS, Ekbal A, Bhattacharyya P. Zero-shot multitask intent and emotion prediction from multimodal data: A benchmark study. Neurocomputing. 2024;569:127128.
- 28. Li D, Zhu X, Li Y, Wang S, Li D, Liao J, et al. Emotion Inference in Multi-turn Conversations with Addressee-aware Module and Ensemble Strategy. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; 2021. p. 3935–3941.
- 29. Wang R, Feng S. Global-Local Modeling with Prompt-Based Knowledge Enhancement for Emotion Inference in Conversation. In: Findings of the Association for Computational Linguistics: EACL 2023; 2023. p. 2120–2127.
- 30. Narayana S, Radwan I, Subramanian R, Goecke R. Mood as a Contextual Cue for Improved Emotion Inference. arXiv preprint arXiv:2402.08413. 2024.
- 31. Rashkin H, Sap M, Allaway E, Smith NA, Choi Y. Event2Mind: Commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939. 2018.
- 32. Sap M, Le Bras R, Allaway E, Bhagavatula C, Lourie N, Rashkin H, et al. ATOMIC: An atlas of machine commonsense for if-then reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33; 2019. p. 3027–3035.
- 33. Speer R, Chin J, Havasi C. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv preprint arXiv:1612.03975. 2016.
- 34. Cambria E, Olsher D, Rajagopal D. SenticNet 3: A common and common-sense knowledge base for cognition-driven sentiment analysis. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 28; 2014.
- 35. Bosselut A, Rashkin H, Sap M, Malaviya C, Celikyilmaz A, Choi Y. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317. 2019.
- 36. Zhong P, Wang D, Miao C. Knowledge-enriched Transformer for emotion detection in textual conversations. arXiv preprint arXiv:1909.10681. 2019.
- 37. Li D, Zhu X, Li Y, Wang S, Li D, Liao J, et al. Enhancing emotion inference in conversations with commonsense knowledge. Knowledge-Based Systems. 2021;232:107449.
- 38. Ghosal D, Majumder N, Gelbukh A, Mihalcea R, Poria S. COSMIC: Commonsense knowledge for emotion identification in conversations. arXiv preprint arXiv:2010.02795. 2020.
- 39. Dyda F, Klein DC, Hickman AB. GCN5-related N-acetyltransferases: a structural overview. Annual Review of Biophysics and Biomolecular Structure. 2000;29(1):81–103. pmid:10940244
- 40. Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. arXiv preprint arXiv:1908.11540. 2019.
- 41. Xu S, Rao H, Hu X, Hu B. Multi-level co-occurrence graph convolutional LSTM for skeleton-based action recognition. In: 2020 IEEE International Conference on E-health Networking, Application & Services (HEALTHCOM). IEEE; 2021. p. 1–7.
- 42. Wu Z, Pan S, Chen F, Long G, Zhang C, Philip SY. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. 2020;32(1):4–24.
- 43. Graves A. Long Short-Term Memory. In: Supervised Sequence Labelling with Recurrent Neural Networks; 2012. p. 37–45.
- 44. Xie Y, Liang R, Liang Z, Huang C, Zou C, Schuller B. Speech Emotion Classification Using Attention-Based LSTM. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019;27(11):1675–1685.
- 45. Majumder N, Poria S, Hazarika D, Mihalcea R, Gelbukh A, Cambria E. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33; 2019. p. 6818–6825.