An effective emotion tendency perception model in empathic dialogue

  • Jiancu Chen,

    Roles Methodology, Writing – original draft, Writing – review & editing

    Affiliation College of Computer Science and Engineering, Chongqing Three Gorges University, Chongqing, China

  • Siyuan Yang,

    Roles Data curation, Validation, Visualization

    Affiliation College of Computer and Big Data, Fuzhou University, Fuzhou, China

  • Jiang Xiong,

    Roles Methodology, Project administration, Supervision

    Affiliation College of Computer Science and Engineering, Chongqing Three Gorges University, Chongqing, China

  • Yiping Xiong

    Roles Investigation, Validation, Visualization

    Affiliation College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract


The effectiveness of open-domain dialogue systems depends heavily on emotion. Previous dialogue models primarily detected emotions by looking for emotional words embedded in sentences. However, they did not precisely quantify the association of every word with emotions, which introduces bias. To overcome this issue, we propose an emotion tendency perception model. The model uses an emotion encoder to quantify the emotional tendencies of all words, and a shared fusion decoder to equip the decoder with the sentiment and semantic capabilities of the encoder. We conducted extensive evaluations on the Empathetic Dialogue dataset. The experimental results demonstrate the model's efficacy: compared with the state of the art, it improves emotion accuracy, perplexity, and response diversity.

Introduction


Empathy is a complex socio-emotional behavior resulting from emotional and cognitive interactions [1]. Human-computer dialogue aims to investigate how computers understand natural language and strengthen the connection with the user by sensing emotions [2]. It plays a vital role in improving user satisfaction [3–7].

In existing studies of empathic response generation, [8–12] control the generated content with a specified emotional label, while [13–17] propose various methods that focus on detecting the user’s emotions and generating appropriate responses. [18, 19] perceive implicit emotions through experience or external knowledge, which allows empathic dialogue systems to learn emotional interactions from a limited dialogue history.

However, the above emotion detection models overlook the effect of each individual word on the emotion of a dialogue. Inspired by the idea of multi-granularity computing [20–22], we note that in an actual multi-round conversation, people can express their emotions through both sentiment and non-sentiment words. Therefore, to comprehend emotions it is crucial to perceive them at the finest granularity of a sentence, i.e., at the word level. Fig 1 shows a real-world example of an empathic dialogue, in which the system should respond to the “Speaker” based on his descriptions. “Pred” is the response generated by KEMP [19], an advanced model, while “Ref” is the reference response from the dataset and is the more reasonable reply. Because the KEMP model neglects the latent emotion of many words in the sentence, it falls short in generating empathic replies. If we instead compute an emotional intensity value for every word in the dialogue, we can capture the associations between all words and emotions and, consequently, better infer users’ implicit emotions.

Fig 1. An example from the EMPATHETIC DIALOGUES dataset comparing the reference response with the response of the KEMP model.

To realize this goal, we propose an emotional tendency perception model for empathic dialogue, named EMO_SA. It quantifies the degree of emotional tendency in sentences to perceive emotions more directly, which helps generate more empathic responses. EMO_SA consists of an emotion encoder and a shared decoder. First, we evaluate the degree of emotional tendency of all words based on a popular taxonomy [13], which classifies emotion into 32 categories. Next, we feed the emotional tendencies of all words into a new emotion encoder. In addition, we share the weights of the multi-head self-attention layers in the encoder with the decoder and fuse them to enhance the decoder’s ability to respond empathically.

The main contributions of the paper are as follows:

  • We propose a novel approach to quantifying the degree of emotion. The approach better captures the user’s emotions by calculating the similarity of each word in the dialogue with 32 emotional words.
  • We propose a shared fusion decoder to mine hierarchical emotions. It fuses the encoder’s ability to perceive emotions and semantics in order to express empathic responses.
  • Experimental results on the Empathetic Dialogue dataset demonstrate the effectiveness of the proposed model. Compared with KEMP, the current best-performing model, EMO_SA improves accuracy, perplexity (PPL), distinct-1, and distinct-2 by 0.89, 2.38, 0.23, and 2.29, respectively.

The rest of this paper is organized as follows. Section Related Work introduces related work. Section Motivation describes the motivation for the EMO_SA model. Section Method introduces the structure and details of EMO_SA. Section Experimental Settings provides the experimental results and analysis. Section Conclusions and Future Work concludes the paper and discusses future work.

Related work

In open-domain dialogue systems, a single-round dialogue consists of one question and one response, whereas in a multi-round dialogue the user initiates a dialogue turn or query more than once. Fig 2 shows a schematic diagram of single-round and multi-round dialogue tasks.

Fig 2. Representation of single-round dialogue and multi-round dialogue tasks.

The difference between multi-round and single-round dialogue systems is that the former takes historical dialogue content into account. In previous research on emotional dialogue, Seq2Seq [23] uses an encoder-decoder structure that maps the extracted features to the output, which handles sequences of indefinite length, but it can suffer from the vanishing gradient problem. Bahdanau et al. [24] therefore proposed an attention mechanism adapted to Seq2Seq, which focuses on the important parts of the encoding during decoding and facilitates the extraction of semantic features. Zhou et al. [9] proposed the ECM framework based on Seq2Seq, which incorporates an internal dynamic simulation mechanism of emotions and a lexicon-based adaptive response generation mechanism to generate emotional responses. Rashkin et al. [25] proposed a 25K-dialogue dataset grounded in emotional contexts to support research on emotional feelings in human-computer communication. Lin et al. [14] proposed an end-to-end approach to modeling empathy in dialogue systems, the mixture of empathic listeners (MoEL), which takes into account understanding the user’s emotions and reacting to specific emotions. Observing that empathic responses mimic the user’s emotions to varying degrees rather than treating emotions uniformly, Majumder et al. [15] proposed the MIME model, which enhances the contextual correlation of empathy and response. Li et al. [26] proposed using coarse-grained dialogue-level and fine-grained token-level emotions to capture the nuances of human emotion, and considered the potential of user feedback to generate more empathetic responses. Sabour et al. proposed the CEM [27] model, which uses user emotion recognition [28] and cognitive understanding to enhance the expression of empathy in generated responses. Considering that a lack of external knowledge makes it difficult for empathic dialogue systems to perceive users’ implicit emotions and to learn emotional interactions from a limited dialogue history, Li et al. introduced the external knowledge sources NRC-VAD and ConceptNet and proposed the KEMP [19] model to understand and express emotions explicitly.

The above studies improve empathic responses in dialogue systems, but they do not consider the emotional tendencies of individual words and therefore cannot perceive emotions accurately. To address this problem, we propose a model that perceives emotional tendencies for empathic dialogue. By quantifying the sentiment tendency of both emotion and non-emotion words in a sentence, we can better perceive the user’s sentiment and generate more empathic responses.

Motivation


We believe that, in human-computer dialogue, each word potentially conveys the underlying emotion of the user. Previous studies did not fully utilize the sentiment information that may be embedded in each word. Therefore, we propose an emotion encoder that expresses the correlation between words and emotions, so that each word reflects how strongly it expresses each emotion. We also note that when the encoder extracts information from the input data, the self-attention layer can capture and retain some semantic information from the original input utterance, which was also overlooked by earlier studies. Therefore, we propose a shared fusion decoder that introduces a shared attention mechanism, enabling the attention layers in the decoder and the encoder to share part of the semantic information. This enriches the parameters of the attention layer so that the decoder can consider the original information of the input data when generating replies.

Method



Based on this motivation, we propose the EMO_SA (Emotional_ShareAttention) model, which builds on KEMP [19]. An overview of the EMO_SA model is shown in Fig 3.

Fig 3. EMO_SA’s architecture diagram.

It is composed of an emotional context graph, an emotional context encoder, and an emotion-dependency decoder. Compared with the original KEMP model, we mainly add an emotion encoder at the same level as the KEMP encoder, and a shared fusion decoder within the emotion-dependency decoder.

In summary, we take as input a set of dialogue histories $D$ with $B$ sequences, i.e., $D = [X_1, X_2, \ldots, X_i, \ldots, X_B]$, where $X_i$ is a sequence containing $m$ words. Following the KEMP model, the dialogue histories $D$ are enriched through the concept network into the emotional context $g$, from which the emotional signals $c_e$ and $e_p$ are extracted. Finally, a shared attention network and a Transformer-based decoder generate the response $y = [y_1, y_2, \ldots, y_n]$ with sentiment information.

Emotional context encoder

Since each word in a dialogue utterance potentially expresses the user’s emotional information, we additionally introduce an emotion encoder. First, it calculates the emotional correlation of each word with each of the 32 emotion categories. Then, it concatenates the emotional correlation with the IDF value. Finally, it feeds the result to the emotional tendency encoder to obtain the emotional tendency.

Emotional correlation.

Every word in a dialogue is correlated to some degree with each emotion. To characterize this correlation, we calculate the cosine similarity between each word vector and each of the 32 emotion vectors.

Let the word embeddings of the input utterance be $w_i \in [w_1, w_2, \ldots, w_d]$, where $d$ is the number of words, and let the emotion vectors be $e_j \in [e_1, e_2, \ldots, e_{32}]$. The emotional correlation of $w_i$ with $e_j$ is their cosine similarity,

$$\cos(w_i, e_j) = \frac{w_i \cdot e_j}{\lVert w_i \rVert \, \lVert e_j \rVert}. \tag{1}$$

During the experiments, we noticed that, owing to limitations of the word embedding layer, the computed similarity values are small; they do not reflect the correlation with emotions well and interfere with the calculation of the emotion vectors. To make the correlation of words with emotions more pronounced, we de-average the cosine similarity per emotion category, which stabilizes the overall emotional expression of words. That is, we first compute the mean cosine similarity between all words and a given emotion $e_j$,

$$\overline{\cos}_j = \frac{1}{n} \sum_{i=1}^{n} \cos(w_i, e_j), \tag{2}$$

where $n$ is the number of words in the whole dataset.

We then perform the de-averaging operation to obtain the emotional correlation $O_{ij}$ between $w_i$ and $e_j$,

$$O_{ij} = \cos(w_i, e_j) - \overline{\cos}_j. \tag{3}$$
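As an illustration of the emotional correlation step, the following is a minimal sketch that computes the cosine similarity of every word with every emotion and then subtracts the per-emotion mean. It assumes that de-averaging means subtracting the mean. The function names, the toy 2-dimensional embeddings, and the use of 2 emotions instead of 32 are assumptions made for the example, not details taken from the EMO_SA implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector carries no emotional signal
    return dot / (norm_u * norm_v)

def emotional_correlation(word_vecs, emotion_vecs):
    """Return O[i][j]: cosine similarity of word i with emotion j,
    minus the mean similarity of all words with emotion j."""
    sims = [[cosine(w, e) for e in emotion_vecs] for w in word_vecs]
    n = len(word_vecs)
    means = [sum(row[j] for row in sims) / n for j in range(len(emotion_vecs))]
    return [[row[j] - means[j] for j in range(len(means))] for row in sims]

# Toy embeddings (hypothetical values): 3 words, 2 emotions.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emotions = [[1.0, 0.0], [0.0, 1.0]]
O = emotional_correlation(words, emotions)
```

After de-averaging, the values for each emotion are centered on zero, so a word that is closer to an emotion than the average word receives a positive score and a word that is farther away receives a negative one.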

Word frequency processing.

Each word in the dataset has a corresponding word frequency, i.e., the number of times it occurs. Our analysis of examples reveals that high-frequency words such as “I”, “you”, and “he” appear often in dialogue but carry little emotional significance, whereas words such as “like”, “disgust”, and “hate” often express a particular emotion of the user. Therefore, to reduce the influence of stop words and high-frequency words on the judgment of emotional tendency, we introduce the IDF algorithm to distinguish the importance of different words in the dialogue, i.e., to obtain the word weight $W_{i,j}$ (Eq 4), where $|D|$ denotes the total number of documents in the corpus and $|N|$ denotes the number of documents containing the word. Eq 4 shows that the weight $W_{i,j}$ is inversely correlated with how frequently the word appears in the corpus.
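The IDF weighting can be sketched as follows. This example assumes the common unsmoothed form log(|D| / |N|); the exact form of Eq 4 (for instance, whether smoothing is applied) may differ. The function name and the three-utterance toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Map each word to log(|D| / |N|), where |D| is the number of documents
    and |N| is the number of documents containing the word."""
    num_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word at most once per document
    return {word: math.log(num_docs / df) for word, df in doc_freq.items()}

# Toy corpus (hypothetical): each utterance is treated as one document.
corpus = [
    ["i", "like", "this"],
    ["i", "hate", "rain"],
    ["i", "am", "fine"],
]
weights = idf_weights(corpus)
```

In this toy corpus, “i” occurs in every document and so receives weight zero, while “like” and “hate” each occur in one document and receive the largest weight.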

Emotional tendency encoder.

To obtain the input $ET_{i,j}$ of the emotional tendency encoder, we fuse the de-averaged emotional correlation with the word weights (Eq 5). $ET_{i,j}$ denotes the emotional tendencies of word $i$, with $ET_{i,j} = [ET_{i,1}, ET_{i,2}, \ldots, ET_{i,64}]$. We feed $ET_{i,j}$ into the emotional tendency encoder, which has the same structure as the Transformer encoder, to obtain the output $OUT_{emo}$ (Eq 6), where Emo_encoder denotes the emotional tendency encoder within the emotion encoder.


Let the encoder output of KEMP be $OUT_{kenc}$ (Eq 7), where KEMP_encoder denotes the encoder in the KEMP model.

We concatenate $OUT_{emo}$ with the output of the KEMP encoder to obtain the spliced output $OUT_e$ (Eq 8).

Emotional signal perception.

$OUT_e$ is then fed into the emotional signal perception module, which encodes it to obtain the emotional context variable and the emotional signal $c_e$. Eq 9 gives the output of the multi-headed attention layer, where $k$ is the number of vertices in the contextual concept network. In Eq 10, $\eta_i$ denotes the corresponding emotional intensity, and $c_e$ is a $d$-dimensional vector.

Then, we use a linear layer followed by a softmax to project the vector $c_e$ onto the emotion signal $p_e$ (Eq 11), where $W_e$ is a weight matrix of size $[32, d]$.

We use the negative log-likelihood as the emotion loss function for parameter learning (Eq 12), where $e^*$ is the ground-truth emotion category and $e$ denotes the predicted emotion category.
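The projection and the emotion loss can be sketched as follows, assuming the usual form in which the emotion distribution is softmax(W_e c_e) and the loss is the negative log-probability of the ground-truth category. The helper names and the toy values (3 emotions instead of 32, d = 2) are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def project(weight, vec):
    """Matrix-vector product; weight has shape [num_emotions, d]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def emotion_nll(weight, c_e, gold_index):
    """Return (loss, probs): the negative log-likelihood of the gold emotion
    under softmax(W_e c_e), and the full emotion distribution."""
    probs = softmax(project(weight, c_e))
    return -math.log(probs[gold_index]), probs

# Toy example (hypothetical values): 3 emotions, d = 2.
W_e = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
c_e = [1.0, 0.5]
loss, probs = emotion_nll(W_e, c_e, gold_index=0)
```

The loss is small when the model assigns high probability to the ground-truth emotion and grows as that probability shrinks.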

Finally, we feed the signals $c_e$ and $e_p$ output by the emotional context encoder into the emotion-dependency decoder to recognize emotions and generate empathic responses.

Emotion-dependency decoder

The emotion-dependency decoder takes as input the embedding vectors of the emotion signal and of the standard output. The emotion signal is the encoder output $e_p$, while the standard output is passed through the embedding layer to obtain its embedding vector.

When extracting information from the input data, the self-attention layers of the encoder capture and retain emotional information from the original input utterance. If some of the encoder’s semantic information is shared with the decoder, the decoder can take this original information into account when generating responses. Therefore, we propose a shared fusion decoder based on the Transformer model. It employs two shared attention networks and uses a multivariate residual model (MRM) [29] to fuse their outputs.

Multi-headed attention sharing.

The multi-head attention-sharing mechanism aims to pass attention information from the encoder to the decoder. We consider that the parameters of the multi-headed self-attention layers carry certain semantic information, so we share them. That is, the parameters of the second self-attention layer in the encoder are shared with the decoder (Eq 13), where MHAttdi, i = 1, 2, …, n, denotes the parameters of the multi-headed self-attention layers in the encoder, MHAtte2 denotes the second layer, and n is the maximum number of layers.

Moreover, to diversify the information in the attention layer of the decoder, we share two self-attention networks with different parameters into the attention layer of the decoder. The self-attention of the layer is then given by Eq 14, which is expressed in terms of the self-attention of the previous layer and the input from the upper-level structure, i.e., the emotion signal and the word embedding vector of the standard output. MHAtt denotes a multi-headed self-attention sublayer consisting of H attention heads, and LayerNorm denotes layer normalization.

The output of the two shared attention networks will be fed into the MRM for feature fusion.
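The idea of sharing encoder attention parameters with the decoder can be sketched as below. This is a simplified illustration, not the EMO_SA implementation: it uses single-head attention without layer normalization, identity-like parameter matrices with hypothetical values, and it takes encoder layers 2 and 3 as the two shared networks. The class and variable names are assumptions.

```python
import math

def matmul(a, b):
    """Multiply an n-by-d matrix with a d-by-k matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

class SelfAttention:
    """Single-head self-attention; params is a dict with 'q', 'k', 'v' matrices."""

    def __init__(self, params):
        self.params = params  # stored by reference, so layers can share it

    def __call__(self, x):
        q = matmul(x, self.params["q"])
        k = matmul(x, self.params["k"])
        v = matmul(x, self.params["v"])
        scale = math.sqrt(len(q[0]))
        scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
        return matmul([softmax(r) for r in scores], v)

def scaled_identity(d, value):
    return [[value if i == j else 0.0 for j in range(d)] for i in range(d)]

# Encoder with three attention layers (hypothetical parameter values).
encoder_params = [
    {"q": scaled_identity(2, s), "k": scaled_identity(2, s), "v": scaled_identity(2, s)}
    for s in (1.0, 0.5, 2.0)
]
encoder_layers = [SelfAttention(p) for p in encoder_params]

# The decoder reuses the parameter objects of encoder layers 2 and 3 (indices 1 and 2).
shared_a = SelfAttention(encoder_params[1])
shared_b = SelfAttention(encoder_params[2])

decoder_input = [[1.0, 0.0], [0.0, 1.0]]
A_s1 = shared_a(decoder_input)  # output of the first shared attention network
A_s2 = shared_b(decoder_input)  # output of the second shared attention network
```

Because the decoder holds references to the encoder's parameter objects rather than copies, any update to those parameters during training is seen by both the encoder and the decoder.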

MRM feature fusion.

For the multi-headed self-attention parameters shared by the encoder, the decoder uses an MRM to fuse the features. The MRM is mainly used to integrate information across modalities in multimodal tasks; we adapt it to fuse multiple attention outputs. It extracts the outputs of the shared attention networks and fuses the emotional information they contain. The MRM is divided into two parts: projection and association.

Projection. The projection uses two independent residual networks to extract semantic feature information; its structure is shown in Fig 4.

The projection first maps the features $A_{s1}$ and $A_{s2}$ of the two attention layers into the same object space: $A_{s1}$ is mapped to $H_{s1}$ and $A_{s2}$ to $H_{s2}$ (Eqs 15 and 16), where $W_{ms1}$ and $W_{ms2}$ are weight matrices and ReLU is a nonlinear activation function.

The two feature vectors $H_{s1}$ and $H_{s2}$ are then fused in this common object space to give the fused feature vector $H$ (Eq 17).

Association. The association uses a bilinear strategy to model the feature relationships between the different attention outputs; its structure is shown in Fig 5.

As shown in Fig 5, we obtain $x_1$ and $x_2$ by combining $A_{s1}$ and $A_{s2}$ with the weight matrix $W$, respectively, and then multiply $x_1$ and $x_2$ to obtain $Z$ (Eq 18).

Since the weight matrix $W$ can be decomposed (Eq 19), $Z$ can be rewritten as in Eq 20, where $\circ$ denotes the Hadamard product.

Finally, a pooling layer is used to obtain the output $R$ (Eq 21). In summary, the output $M$ of the MRM is the fusion of the two components $R$ and $H$ (Eq 22).

This layer thus takes the outputs $A_{s1}$ and $A_{s2}$ of the shared attention layers as input and produces the fused feature vector $M$ as output.
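The decomposition used in the association step relies on a standard identity for low-rank bilinear forms: if $W = U V^{T}$, then $x_1^{T} W x_2$ equals the sum of the entries of the Hadamard product $(U^{T} x_1) \circ (V^{T} x_2)$. The sketch below checks this identity numerically. It assumes this particular factorization, which may not match Eqs 19 and 20 exactly; the matrices, vectors, and function names are hypothetical.

```python
def matvec_t(m, x):
    """Compute m^T x for a d-by-k matrix m and a length-d vector x."""
    return [sum(m[i][j] * x[i] for i in range(len(x))) for j in range(len(m[0]))]

def bilinear_full(x1, w, x2):
    """x1^T W x2 with a full d-by-d matrix W."""
    return sum(x1[i] * w[i][j] * x2[j] for i in range(len(x1)) for j in range(len(x2)))

def bilinear_low_rank(x1, u, v, x2):
    """Sum of the entries of the Hadamard product (U^T x1) * (V^T x2)."""
    p, q = matvec_t(u, x1), matvec_t(v, x2)
    return sum(a * b for a, b in zip(p, q))

# Hypothetical factors U and V (d = 3, rank k = 2) and input vectors.
U = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]]
V = [[0.5, 1.0], [1.0, 0.0], [-1.0, 2.0]]
W = [[sum(U[i][r] * V[j][r] for r in range(2)) for j in range(3)] for i in range(3)]
x1, x2 = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]

full = bilinear_full(x1, W, x2)
low_rank = bilinear_low_rank(x1, U, V, x2)
```

The two values agree up to floating-point error, while the low-rank form needs only the two thin matrices rather than the full weight matrix.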

Then, we feed $M$ into the residual and normalization layers to obtain $a$. The resulting $a$ is fed into the multi-headed cross-attention network together with the encoder output $c_e$, and the result passes through the feed-forward neural network to obtain the dialogue response. Finally, the response output $y$ is obtained from the normalization layer.

Parameter learning.

In the emotional encoder, the emotion loss $L_{emo}$ is obtained by Eq 12 and is designed to improve the correctness of emotion perception. We also calculate the reinforced emotional attention loss $L_{att}$ and the response generation loss $L_{gen}$. $L_{att}$ is the emotional attention loss of the KEMP model, which is used to increase the emotional intensity of the responses. The response generation loss is given by Eq 23, where $y_j$ is the correct result, i.e., the response paired with the input utterance in the dataset, against which the model’s prediction is scored.

Ultimately, we learn the parameters by minimizing the integrated loss function $L$,

$$L = \gamma_1 L_{emo} + \gamma_2 L_{att} + \gamma_3 L_{gen}, \tag{24}$$

where $\gamma_1$, $\gamma_2$, and $\gamma_3$ are hyperparameters.

Experimental settings

Dataset


We use Empathetic Dialogue [13], a benchmark dataset widely used to generate empathic responses, which contains 24,850 multi-round conversations. In each round of the dialogue, the speaker talks about one of the 32 emotions and the content associated with the emotion label, and the listener makes sense of what the speaker says to generate an empathic response [17]. The 32 emotion categories of the Empathetic Dialogue dataset are shown in Table 1.

Table 1. Emotional category of Empathetic Dialogue dataset.

A sample from the Empathetic Dialogue dataset is shown in Fig 6. Red denotes the emotion, green denotes the speaker’s content, blue denotes the listener’s content, and gray indicates the complete content of the previous dialogue. Fig 6 shows that each multi-round dialogue is converted into single-round samples for processing: the complete preceding dialogue is concatenated and used as the input for each later round.
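The conversion of a multi-round dialogue into single-round samples can be sketched as follows. The function name, the separator, the example utterances, and the emotion label are hypothetical and only illustrate the concatenation of the preceding turns.

```python
def flatten_dialogue(emotion, turns, sep=" "):
    """Turn one multi-round dialogue into (emotion, context, response) samples.

    turns alternates speaker, listener, speaker, and so on. Each listener
    turn becomes a target response, and all turns before it form the context.
    """
    samples = []
    for idx in range(1, len(turns), 2):  # listener turns sit at odd indices
        context = sep.join(turns[:idx])
        samples.append((emotion, context, turns[idx]))
    return samples

# Hypothetical dialogue with two rounds.
dialogue = [
    "I lost my keys today.",
    "Oh no, did you find them?",
    "Not yet, I am still looking.",
    "I hope they turn up soon.",
]
samples = flatten_dialogue("anxious", dialogue)
```

Here the first sample uses only the opening speaker turn as context, while the second sample uses the first three turns joined together.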

Baselines for comparison

We compared EMO_SA with the following baseline models:

  • The Transformer [30] adopts the encoder-decoder architecture and then uses the self-attention mechanism instead of the RNN network structure commonly used in NLP tasks.
  • EmoPrepend-1 [25] is an extension of the Transformer that includes an additional supervised emotion classifier.
  • MoEL [14] is a transformer-based generative model that mixes response representations from several decoders and integrates decoder outputs under the projected distribution of emotions.
  • MIME [15] is a transformer-based generative model that replicates human emotions based on emotion grouping and uses stochastic sampling for a range of responses.
  • EmpDG [16] consists of an adversarial framework including a generator and discriminators that reflect the user feedback, which exploits multi-resolution emotions and user feedback.
  • KEMP [19] is an implicit emotion perception model containing external knowledge of NRC-VAD and ConceptNet.

We also conducted an ablation study to better analyze the effects of the different components in our model.

  • w/o SA adds only the emotional tendency encoder to KEMP, without the shared decoder.
  • w/o EMO adds only the shared decoder to KEMP, without the emotional tendency encoder.
  • w/o MRM uses both the emotional tendency encoder and the shared decoder, but without feature fusion in the shared decoder.

Evaluation metrics

We used accuracy, perplexity, and distinct-n to evaluate the model.

  • Accuracy [31] is the fundamental metric for measuring classification performance. Emotion accuracy measures the agreement between the emotion categories predicted for the generated replies and the labels, i.e., the proportion of correctly predicted samples among all samples.
  • Perplexity (PPL) [32] evaluates the quality of the language model. It indicates the confidence of the model on the candidate response set: the higher the confidence, the lower the perplexity.
  • Distinct-n [33] measures the diversity of the generated responses. It does not depend on reference responses; we report distinct-1 and distinct-2.
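The two generation metrics can be sketched as follows, using their common definitions: distinct-n as the ratio of unique n-grams to all n-grams in the generated responses, and perplexity as the exponential of the mean negative log-probability the model assigns to the reference tokens. The exact normalization used in the paper's evaluation scripts may differ (for example, scores reported as percentages), and the example responses and probabilities are hypothetical.

```python
import math

def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

def perplexity(token_probs):
    """exp of the mean negative log-probability of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical generated responses and token probabilities.
generated = [["i", "am", "so", "sorry"], ["i", "am", "happy", "for", "you"]]
d1 = distinct_n(generated, 1)
d2 = distinct_n(generated, 2)
ppl = perplexity([0.5, 0.25, 0.5, 0.25])
```

Higher distinct-n values indicate more varied responses, and lower perplexity indicates that the model is more confident about the reference tokens.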

Implementation details

We partitioned the dataset in the ratio of 8:1:1 into a training set, a test set, and a validation set. We use pre-trained GloVe vectors to initialize the word embeddings and the same common hyperparameters as the KEMP model. The number of emotions in the emotion encoder is 32, which is consistent with the emotion categories in the dataset. The shared attention network has six attention layers in total, and the shared attention layers in the encoder are layers 2 and 3. We implemented all models in PyTorch on a single Tesla T4 GPU. During training, we found that performance is best with a batch size of 16 dialogues and about 30,000 iterations; increasing the number of iterations further leads to overfitting. Therefore, the models were trained with 16 dialogues per batch for about 30,000 iterations, which takes about 5 hours.
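An 8:1:1 partition can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions for illustration; the source does not state how the split was drawn.

```python
import random

def split_8_1_1(items, seed=0):
    """Shuffle items and split them into train/validation/test in an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_8_1_1(range(100))
```

Integer arithmetic is used for the split sizes so that any remainder from an uneven dataset size falls into the last partition.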

Results and analysis

When replicating the KEMP model, we found that a six-layer attention structure gives results closest to those of the original KEMP. Therefore, the proposed EMO_SA model adopts a six-layer attention structure. For fairness, in addition to the baseline models, we also compare against KEMP_6, a KEMP model with six attention layers. The experimental results are displayed in Table 2, with the best result for each metric highlighted in bold. EMO_SA performs best overall: compared with the best-performing KEMP model, accuracy, perplexity (PPL), distinct-1, and distinct-2 improve by 0.89, 2.38, 0.23, and 2.29, respectively.

We next consider how to choose the encoder weights that are shared with the decoder. To limit the loss of semantic information caused by sharing a large number of attention layers, we restrict the comparison to the third and lower attention layers, and run it without the MRM. In this experiment, we compare unshared weights, sharing the weights of a single layer, and splicing the weights of different layers. The experimental results are shown in Table 3.

Table 3. Automatic evaluation results with different layer fusions.

Here, shareAttetion_1 denotes sharing only the first-layer weights, shareAttetion_2 denotes sharing only the second-layer weights, shareAttetion_2&3 denotes sharing and fusing the second- and third-layer weights, and shareAttetion_2&origin denotes sharing and fusing the second-layer weights with the original layer weights. The best result for each metric is highlighted in bold. According to Table 3, shareAttetion_2&3 achieves better accuracy and perplexity while its distinct_1 and distinct_2 scores are comparable to those of the other settings. Therefore, we splice the weights of the second and third layers in our model.

In addition, we also performed an ablation study to better understand the contributions of the main parts of our model. The results of the ablation study are shown in Table 4.

As shown in Table 4, emotion accuracy and perplexity deteriorate significantly when we add only the emotional tendency encoder to KEMP (w/o SA). This indicates that the shared fusion decoder plays a crucial role in understanding emotions and generating empathic responses. Furthermore, accuracy, perplexity, distinct-1, and distinct-2 all deteriorate when we use only the shared fusion decoder (w/o EMO), which demonstrates the importance of emotional tendency. Finally, when emotional tendency and the shared decoder are used without fusion (w/o MRM), accuracy is the same as for EMO_SA, but perplexity, distinct-1, and distinct-2 all deteriorate, which shows that the fusion is effective.

Case study

Table 5 compares the responses generated by EMO_SA and the six main baselines. In case 1, EMO_SA generates the response with the most consistent content and the most accurate sentiment, which makes it more empathetic. In cases 2–5, EMO_SA perceives the user’s sentiment more accurately and generates more empathetic responses. These examples demonstrate that EMO_SA better balances content and emotion.

Table 5. Case study of the generated responses by EMO_SA and the baselines.

Conclusions and future work

In this paper, we propose a novel emotional tendency encoder and shared fusion decoder. On the one hand, the emotion tendency encoder measures the emotional tendency of each word underlying the emotions of the user. On the other hand, the shared fusion decoder shares and fuses the self-attention layer of the encoder with the decoder to generate more empathic responses. The experimental results validate the effectiveness of our approach, and the ablation study illustrates the contribution of the main parts of the model. In the future, we will perform further precise mining of the word-emotion relationship to capture users’ emotions more effectively.

References


  1. Yalçın, Özge Nilay. Empathy framework for embodied conversational agents. Cognitive Systems Research. 2020;59:123–132.
  2. Beale, Russell and Creed, Chris. Affective interaction: How emotional agents affect users. International journal of human-computer studies. 2009;67(9):755–776.
  3. Brave, Scott and Nass, Clifford and Hutchinson, Kevin. Computers that care: investigating the effects of orientation of emotion exhibited by an embodied computer agent. International journal of human-computer studies. 2005;62(2):161–178.
  4. Klein, Jonathan and Moon, Youngme and Picard, Rosalind W. This computer responds to user frustration. In: CHI’99 extended abstracts on Human factors in computing systems; 1999. p. 242–243.
  5. Partala, Timo and Surakka, Veikko. The effects of affective interventions in human–computer interaction. Interacting with computers. 2004;16(2):295–309.
  6. Ochs, Magalie and Sadek, David and Pelachaud, Catherine. A formal model of emotions for an empathic rational dialog agent. Autonomous Agents and Multi-Agent Systems. 2012;24(3):410–440.
  7. Picard, Rosalind W and Liu, Karen K. Relative subjective count and assessment of interruptive technologies applied to mobile monitoring of stress. International Journal of Human-Computer Studies. 2007;65(4):361–375.
  8. Zhou, Xianda and Wang, William Yang. Mojitalk: Generating emotional responses at scale. arXiv preprint arXiv:1711.04090. 2017.
  9. Zhou, Hao and Huang, Minlie and Zhang, Tianyang and Zhu, Xiaoyan and Liu, Bing. Emotional chatting machine: Emotional conversation generation with internal and external memory. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32; 2018.
  10. Wang, Ke and Wan, Xiaojun. Sentigan: Generating sentimental texts via mixture adversarial networks. In: IJCAI; 2018. p. 4446–4452.
  11. Song, Zhenqiao and Zheng, Xiaoqing and Liu, Lu and Xu, Mu and Huang, Xuan-Jing. Generating responses with a specific emotion in dialog. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019. p. 3685–3695.
  12. Shen, Lei and Feng, Yang. CDL: Curriculum dual learning for emotion-controllable response generation. arXiv preprint arXiv:2005.00329. 2020.
  13. Rashkin, Hannah and Smith, Eric Michael and Li, Margaret and Boureau, Y-Lan. Towards empathetic open-domain conversation models: A new benchmark and dataset. arXiv preprint arXiv:1811.00207. 2018.
  14. Lin, Zhaojiang and Madotto, Andrea and Shin, Jamin and Xu, Peng and Fung, Pascale. Moel: Mixture of empathetic listeners. arXiv preprint arXiv:1908.07687. 2019.
  15. Majumder, Navonil and Hong, Pengfei and Peng, Shanshan and Lu, Jiankun and Ghosal, Deepanway and Gelbukh, Alexander, et al. MIME: MIMicking emotions for empathetic response generation. arXiv preprint arXiv:2010.01454. 2020.
  16. Li, Qintong and Chen, Hongshen and Ren, Zhaochun and Ren, Pengjie and Tu, Zhaopeng and Chen, Zhumin. EmpDG: Multiresolution interactive empathetic dialogue generation. arXiv preprint arXiv:1911.08698. 2019.
  17. Kim, Wongyu and Ahn, Youbin and Kim, Donghyun and Lee, Kyong-Ho. Emp-RFT: Empathetic Response Generation via Recognizing Feature Transitions between Utterances. arXiv preprint arXiv:2205.03112. 2022.
  18. Zhong, Peixiang and Wang, Di and Miao, Chunyan. Knowledge-enriched transformer for emotion detection in textual conversations. arXiv preprint arXiv:1909.10681. 2019.
  19. Li, Qintong and Li, Piji and Ren, Zhaochun and Ren, Pengjie and Chen, Zhumin. Knowledge bridging for empathetic dialogue generation. 2022.
  20. Xia, Shuyin and Dai, Xiaochuan and Wang, Guoyin and Gao, Xinbo and Giem, Elisabeth. An Efficient and Adaptive Granular-ball Generation Method in Classification Problem. arXiv preprint arXiv:2201.04343. 2022.
  21. Xia, Shuyin and Peng, Daowan and Meng, Deyu and Zhang, Changqing and Wang, Guoyin and Giem, Elisabeth, et al. A fast adaptive k-means with no bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020.
  22. Xia, Shuyin and Zheng, Shaoyuan and Wang, Guoyin and Gao, Xinbo and Wang, Binggui. Granular ball sampling for noisy label classification or imbalanced classification. IEEE Transactions on Neural Networks and Learning Systems. 2021. pmid:34460405
  23. Sutskever, Ilya and Vinyals, Oriol and Le, Quoc V. Sequence to sequence learning with neural networks. Advances in neural information processing systems. 2014;27.
  24. Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2014.
  25. Rashkin, Hannah and Smith, Eric Michael and Li, Margaret and Boureau, Y-Lan. I know the feeling: Learning to converse with empathy. 2018.
  26. Li, Q and Chen, H and Ren, Z and Chen, Z and Tu, Z and Ma, J. EmpGAN: Multi-resolution Interactive Empathetic Dialogue Generation. arXiv preprint arXiv:1911.08698. 2019.
  27. Sabour, Sahand and Zheng, Chujie and Huang, Minlie. Cem: Commonsense-aware empathetic response generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36; 2022. p. 11229–11237.
  28. Zhu, Linan and Xu, Minhao and Bao, Yinwei and Xu, Yifei and Kong, Xiangjie. Deep learning for aspect-based sentiment analysis: a review. PeerJ Computer Science. 2022;8:e1044. pmid:36092006
  29. Wang, Weixuan and Chen, Zhihong and Hu, Haifeng. Hierarchical attention network for image captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33; 2019. p. 8957–8964.
  30. Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N, et al. Attention is all you need. Advances in neural information processing systems. 2017;30.
  31. Raamkumar, Aravind Sesagiri and Yang, Yinping. Empathetic Conversational Systems: A Review of Current Advances, Gaps, and Opportunities. arXiv preprint arXiv:2206.05017. 2022.
  32. Vinyals, Oriol and Le, Quoc. A neural conversational model. arXiv preprint arXiv:1506.05869. 2015.
  33. Li, Jiwei and Galley, Michel and Brockett, Chris and Gao, Jianfeng and Dolan, Bill. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055. 2015.