Reinforcing personalized persuasion in task-oriented virtual sales assistant

Aritra Raut; Abhisek Tiwari; Subrata Das; Sriparna Saha; Anutosh Maitra; Roshni Ramnani; Shubhashis Sengupta

doi:10.1371/journal.pone.0275750

Abstract

Purpose

Existing task-oriented virtual agents can assist users with simple tasks like ticket booking, hotel reservations, etc. effectively and with high confidence. These virtual assistants, however, assume specific, predictable end-user behavior, such as predefined/servable objectives, which results in conversation failures in challenging situations, such as when goals are unavailable.

Methodology

Inspired by the practice and its efficacy, we propose an end-to-end framework for task-oriented persuasive dialogue generation that combines pre-training and reinforcement learning to generate context-aware persuasive responses. We use four novel rewards to improve consistency and reduce repetitiveness in generated responses. We also apply a meta-learning strategy to make the model parameters better suited to domain adaptation. Furthermore, we curate a personalized persuasive dialogue (PPD) corpus, which contains utterance-level intent, slot, sentiment, and persuasion strategy annotations.

Findings

The obtained results and detailed analysis firmly establish the effectiveness of the proposed persuasive virtual assistant over traditional task-oriented virtual assistants. The proposed framework considerably improves the quality of dialogue generation in terms of consistency and repetitiveness. Additionally, our experiments in few-shot and zero-shot settings show that the meta-learned model learns to adapt quickly to new domains with a few or even zero training epochs. It outperforms the non-meta-learning-based approaches when the base model is kept constant.

Originality

To the best of our knowledge, this is the first effort to improve a task-oriented virtual agent’s persuasiveness and domain adaptation.

Citation: Raut A, Tiwari A, Das S, Saha S, Maitra A, Ramnani R, et al. (2023) Reinforcing personalized persuasion in task-oriented virtual sales assistant. PLoS ONE 18(1): e0275750. https://doi.org/10.1371/journal.pone.0275750

Editor: Lalit Chandra Saikia, National Institute of Technology Silchar, India, INDIA

Received: June 22, 2022; Accepted: September 22, 2022; Published: January 5, 2023

Copyright: © 2023 Raut et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Data are available through a Github repository: https://github.com/aritraraut/USBAR.

Funding: The research reported in this paper is an outcome of the project “Autonomous Goal-Oriented and Knowledge-Driven Neural Conversational Agents ” (Project No. IITP/2020/458), sponsored by Accenture LLP.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

Recent research in natural language processing has focused on developing models for conversational agents, which have applications in many domains such as healthcare, business, and sales. Conversational agents can be of two types based on the nature of the goal: task-oriented (goal-oriented) virtual agents [1] and chit-chat agents [2]. Chit-chat agents interact with users as companions to satisfy communication needs and build long-term relationships, whereas task-oriented agents strive to assist users in completing tasks.

In the last few years, task-oriented conversational agents have attracted growing interest in the field of natural language generation. As mentioned earlier, these agents help users with tasks such as hotel booking, ticket reservation, and product purchasing. A simple task-oriented dialogue generation setting comprises several modules: a dialogue state tracker (DST) that extracts the belief states, a module that queries the database based on the belief states, a policy learning module that determines a suitable action for the context, and finally a natural language generation (NLG) module that generates the response. These modules are frequently modeled and assessed independently. The pipeline approach has the obvious disadvantage that errors propagating from cascaded components might harm succeeding sub-tasks [3]. The appeal of end-to-end systems has therefore increased, and there have been a few attempts. Some research has generated the system act and response jointly while relying on ground truth belief states (Chen et al. 2019 [4], Wang et al. 2020 [5]). Some approaches have come close to fully modeling a task-oriented dialogue (TOD) agent, but they use different decoders for each component. For example, Lei et al. [6] and Liang et al. [7] generated belief spans and responses using a seq2seq model. On the other hand, Zhang, Ou, and Yu [8] proposed multiple decoders to generate belief spans, act spans, and responses. SimpleTOD [9] further generalised this concept to an end-to-end setting in which the belief states are generated by the model instead of being supplied as ground truth values. Additionally, they include database results in the training procedure. Finally, Yang et al. [10] proposed a fully end-to-end dialogue system, UBAR. Based on the context, this model not only extracts belief states but also generates actions and responses on its own. The training objective of UBAR was to maximize the probability of predicting the next word given the preceding context.

In some task domains, such as sales, the agent needs to generate persuasive responses that fit the context. In cases of goal unavailability or user dissatisfaction, the previous models fail to complete the task in most cases. If the agent can instead persuade while keeping the user’s needs in mind, the chance of task completion increases. One example is shown in Fig 1. Persuasion can also help the user: a different set of products may fulfill most of the user’s needs without the user knowing about them, and during persuasion the model may suggest one of those products, which can leave the user satisfied. According to current research on tailored conversational agents [11, 12], adopting distinct human-oriented chatbot identities or conversational methods can substantially impact user reactions and make the interaction more engaging. These conversational agents considerably improved user-targeted personalization.

Fig 1. Performance of a traditional virtual agent and proposed agent on a goal unavailability scenario.

https://doi.org/10.1371/journal.pone.0275750.g001

While performing persuasion, complex situations can arise, for example when the agent receives negative sentiment from the user over consecutive turns. This indicates that the kind of response the agent generates does not satisfy the user. Losing consistency with the context and repeating the same information while persuading may be the two significant issues in this scenario. To tackle them, we use a reinforcement learning (RL) based reward to penalize the loss. Our reward is a collection of four sub-rewards. Two of them, the context consistency reward and the repetitiveness reward, address the problems mentioned above. We also introduce an action consistency reward and a sentiment-based reward to ensure, respectively, that the agent generates responses consistent with the chosen action and that its responses satisfy the user.

A task-oriented virtual assistant with such a persuasive skill may benefit various sales environments. For this reason, we must consider an agent’s domain adaptability while developing one. In our case, we have five sub-domains: phone, camera, tablet, laptop, and computer. This can be extended to other domains such as refrigerators and microwave ovens. Doing so requires sufficient data, which may not be available, and the high annotation cost inhibits developers from creating their own NLG component from the ground up. Thus, using a fair amount of annotated data to train a language generator module that can be adapted to other domains or tasks (which lack a fair amount of annotated data) is tremendously beneficial. We used a generalized optimization-based meta-learning approach to directly improve the optimization procedure for the low-resource NLG challenge rather than framing the problem as a model-based approach. We found that a recently developed model-agnostic meta-learning algorithm (MAML) [13] is a good match for the low-resource NLG challenge. MAML aims to learn a better initialization of model parameters that facilitates fast adaptation to new low-resource NLG scenarios.

We follow the work of Yang et al. [10] and try to improve its performance in the persuasive dialogue generation setting. In persuasive dialogues, sentiment plays an essential role in response generation. For instance, Wang et al. [14] included user sentiment to build an effective user-adaptive system. We therefore pass sentiment information as an extra token. We trained and evaluated the model on our PPD dataset, which contains persuasive dialogues, and performance improved over UBAR after introducing the sentiment token. Our main contributions are fourfold:

We develop a large-scale personalized persuasive dialogue corpus annotated with semantic information (intent, slot, sentiment, user persona, and dialogue act) for the e-commerce domain. This data set has dialogues utilizing different persuasion strategies depending on the context.
To the best of our knowledge, this is the first work towards building an end-to-end dialogue agent capable of persuading the user in a goal unavailability situation. This module can even follow different persuasion strategies, depending on the context. For example, let’s say a user has come to buy a mobile phone for his daughter, and the model faces some goal conflict. In that case, instead of some logical dialogues, if the model generates emotional or personal persuasive dialogues, that would be more effective.
We have infused RL-based rewards with task-oriented end-to-end NLG module UBAR, which helps the model generate more soothing, more consistent, and more appealing responses while performing persuasion.
We have experimented with a different training setup of optimization-based meta-learning to make the model parameters better for low resource sub-domain adaptation.

For the reader’s convenience, acronyms often used in this paper are listed in Table 1.

Table 1. Acronyms used in the paper.

https://doi.org/10.1371/journal.pone.0275750.t001

2 Related work

Our work mainly touches on three areas: personalized persuasive dialogue agents, the infusion of RL into dialogue agents, and the domain adaptability of task-oriented dialogue agents. We summarize the relevant work in the following subsections.

Task oriented virtual agent

Many sequence-to-sequence based conversation generation approaches have been suggested in recent years [15], which encode dialogue context using RNN units (LSTM/GRU) and create answers utilizing the encoded information. After that, the use of pre-trained models such as GPT became popular. Large pre-trained language models have outperformed small pre-trained language models on a variety of NLP tasks [16–18], with GPT-2 [19] being especially good at language generation tasks. GPT-2 [19] has been extended to create responses in chit-chat dialogue [20, 21]. In the task-oriented dialogue domain, Budzianowski and Vulic (2019) [22] first pointed out the ability to fine-tune GPT-2 on all essential information in plain text, which drove a line of enhanced and simplified task-oriented dialogue system designs. Yang et al. [10] then developed an end-to-end task-oriented agent that outperformed all previous work in this domain. On the other hand, in a few recent approaches [23, 24], researchers have attempted to close the gap between chit-chat and task-oriented dialogue agents in an effort to make task-oriented conversation more interesting and appealing.

Persuasive virtual agent

Attempts to incorporate persuasion into the NLG module have also been made. The Elaboration Likelihood Model (ELM) of Petty and Cacioppo [25] claims that a person’s persuasion depends on the varying degree of thought with which the person processes the information and the persuasive context. The Persuasion Knowledge Model (PKM) proposed by Friestad and Wright proposes that scientific and common persuasion knowledge are interconnected [26]. Furthermore, the authors of [27] claimed that combining personal traits with persuasive information might increase a person’s drive to respond to persuasive communications. More recently, [28] proposed a personalized end-to-end task-oriented conversation system that uses a memory network to create attractive and persona-consistent replies. In other recent publications [29–32], the researchers emphasized the DST module to carry out persuasion in task-oriented conversation agents to catch and address dynamic user needs effectively.

Reinforcement learning on NLG module

It is difficult to create a personalized conversation agent in a supervised learning (SL) framework that can generalize to various users in different settings, because of a lack of accessible data and the inherently shifting attitudes and emotions of users in an ongoing dialogue. Because MLE-based models are prone to exposure bias, researchers have recently turned to reinforcement learning (RL) to fine-tune these models, given its capacity to learn from user interactions and improve based on user feedback in the form of rewards [33–38]. In a recent work [39], reinforcement learning has also been used to enhance the performance of a dialogue generation agent in a different domain, namely medical diagnosis. In our case, we take the idea of following different persuasion strategies depending on the context and apply it to the work of Yang et al. [10] to develop an end-to-end persuasive natural dialogue generation module. To enhance the model’s performance, we added a few sub-rewards and slightly modified the context (see Fig 2).

Fig 2. An example of our modified context.

https://doi.org/10.1371/journal.pone.0275750.g002

Meta learning

Meta-learning, also known as learning-to-learn, has recently received a lot of attention. It may be traced back to some early publications [40]. “Quick adaptation to fresh and restricted observation data” is a major issue. There are three types of meta-learning that can be used to solve this problem:

In metric-based meta-learning, the objective is to learn a metric space and then compare low-resource testing samples to high-resource training samples using it. Siamese Network [41], Matching Network [42], Memory-augmented Neural Network (MANN) [43], Prototype Net [44], and Relation Network [45] are some examples of representative works in this domain.
In the model-based approach, the idea is to employ a second meta-learner to update the primary learner with a few training instances. Andrychowicz et al. (2016) [46] created an LSTM-based meta-learner. For quick model adaptation, Hypernetwork [47], MetaNet [48], and TCML [49] all learn a different set of representations. Ravi and Larochelle [50] suggested an LSTM-based meta-learner to learn the original network’s optimization technique (gradients).
The optimization-based method can be built in such a way that it supports rapid adaptation. By optimizing the gradient towards a good parameter initialization for easy fine-tuning in low-resource scenarios, model-agnostic meta-learning [13, 51, 52] achieved state-of-the-art performance. In 2019, Lin et al. [53] used optimization-based meta-learning to make the model adapt to new personalities and generate personalized responses in a task-oriented setting. They used meta-learning algorithms to learn multiple personas as separate tasks, which is fundamentally different from optimizing the model to represent all of the personas.

3 Problem formulation

We aim to build a neural goal-unavailability-adapted virtual assistant that can serve end-users even in goal unavailability scenarios and alleviate task failures due to goal conflicts. The agent is also capable of using different persuasion strategies to convince the user to buy an alternative product. The agent’s response (R_t) at time t is conditioned on the user sentiment, belief states, and chosen action, and is generated as follows:

We define the context C_t at time step t as:
$$C_t = [U_0, S_0, B_0, D_0, A_0, R_0, \ldots, U_{t-1}, S_{t-1}, B_{t-1}, D_{t-1}, A_{t-1}, R_{t-1}, U_t, S_t, B_t, D_t, A_t] \tag{1}$$
where U_i, S_i, B_i, D_i, A_i, R_i stand for user utterance, sentiment, belief states, database query, agent action, and agent response at the i^th turn, respectively.
The proposed model first encodes (e) the information and generates one token (R_t[j]) at each time step, depending on the encoded information and the previously generated tokens. It can be expressed as follows:
$$P(R_t \mid C_t) = \prod_{j=1}^{n} P\big(R_t[j] \mid e(C_t), R_t[1], \ldots, R_t[j-1]\big) \tag{2}$$
where n is the number of words in the generated sequence (R_t) and R_t[j] is the j^th word of the generated sequence.

4 Dataset

We looked at a number of benchmark task-oriented corpora, but we were unable to find a single dataset suitable for our purpose. The properties of several existing conversation datasets are presented in Table 2. In the current work, we first created a sizable personalised persuasive dialogue (PPD) corpus, since persuasion is an essential quality of any sales agent and we want to encourage researchers to work towards intelligent persuasive conversational agents. This dataset is intended to accelerate research on conversational agents that can persuade users to purchase products when a goal is unavailable. The dataset includes many conversations in which a salesperson tries to persuade a client to buy something, using a variety of persuasive techniques depending on the consumer’s traits and personality.

Table 2. Statistics of the existing datasets and our developed corpus (PPD).

https://doi.org/10.1371/journal.pone.0275750.t002

PPD: Data creation and annotation

Virtual assistants are widely used in commercial applications like online shopping. Thus, for our in-house data creation, we chose the task of selling various electronic devices. With the help of five mobile retailers, we extensively reviewed the task and produced 100 sample conversations between sellers and buyers about purchasing electronic items (Mobile, Tab, Camera, Computer, and Laptop). The dialogues had the following three crucial elements: (i) dynamic goal, (ii) goal unavailability, and (iii) personalized persuasion. The user intent, slot (BIO tag), user sentiment, user personality, persuasive strategy, and dialogue act of each utterance in the conversation were also annotated.

Role of sentiment

Speakers’ responses in conversations are influenced by the semantic aspects of other speakers’ utterances as well as by the content of their own utterances. Sentiment is one such feature: it subtly conveys feedback and details about the type of action the user wants to communicate through the message. Sentiment can be used to track goal conflicts and the results of the agent’s persuasion efforts in goal-shifting situations. A consumer may comment, ‘Oh, the colour of the phone is rather drab,’ for example. Here, the (negative) sentiment attached to the colour aspect is the key signal for spotting such goal conflicts.

Role of personalized persuasive strategy

The effectiveness of persuasion is a subjective and dynamic matter that depends heavily on the relevance of the persuasion target and the persuadee’s personality. Even the same persuasion target and strategy might not convince the same person in two distinct situations. Motivated by the personalized and dynamic nature of the persuasion task, the proposed approach harnesses both user personality and dialogue context to persuade users in goal unavailability scenarios. We give examples of several such strategies in Table 3. The distributions of sentiment and persuasion strategies within the corpus are shown in Fig 3.

Table 3. Examples of different persuasion strategies.

https://doi.org/10.1371/journal.pone.0275750.t003

Fig 3. (a) Sentiment and (b) persuasion strategy distribution across PPD corpus.

https://doi.org/10.1371/journal.pone.0275750.g003

To scale up the conversational dataset in accordance with the sample conversations and a detailed guideline report, we hired five English linguists. For knowledge-grounded dialogue creation, we used GSMArena’s mobile database [60]. After they constructed and analysed 931 dialogues, the resulting corpus comprised 1031 conversations and 11602 utterances. Each utterance has been labelled with the appropriate persuasion strategy, dialogue act, user sentiment, slot, and intent. The kappa coefficient (k), which measures the degree of agreement among annotators, was found to be 0.77, indicating substantial agreement. Table 4 contains statistics of the PPD dataset. In Table 5, we additionally report metadata such as the intent and slot lists.

Table 4. PPD dataset statistics.

https://doi.org/10.1371/journal.pone.0275750.t004

Table 5. Intent, slot, dialogue act, sentiment, and persuasion strategy list of the PPD dataset.

https://doi.org/10.1371/journal.pone.0275750.t005

5 Methodology

The work aims to develop a neural-based persuasive dialogue generation framework to deal with goal unavailability scenarios effectively. In the first part of this section we have elaborated the pipeline of UBAR and in the later parts we have discussed how our USBAR model works and what rewards we have introduced to improve the performance.

5.1 UBAR pipeline

The pipeline of the UBAR module is very simple. Let’s say, we have our very first user utterance, U₀, at turn t = 0. After receiving the user utterance, UBAR generates the components as described below.

The model extracts the belief states B₀, based on U₀. Belief state at each turn is basically a set of decoupled slot-value pairs {slt₀, v₀, slt₁, v₁, ….slt_n, v_n}, where each pair (slt_i, v_i) consists of slot and value information extracted from the current utterance. One example of this belief state has been shown in Fig 4.
After extracting the belief states B₀, it performs the database query. This provides the number of database instances D₀ matching with the belief states, B₀.
Finally based on [U₀, B₀, D₀], it generates agent action A₀, and the delexicalized response R₀. Delexicalized response means, the model is generating special placeholders in the responses for specific slots. For example a brand name in the generated response is <value_brand>. Later these placeholders should be replaced by the respective values from the database query result. This completes the very first turn.
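As an illustration of the final step, the following minimal sketch fills the placeholders of a delexicalized response with values from one database row. The function name `lexicalize`, the `<value_slot>` pattern generalized from the `<value_brand>` example above, and the sample row are assumptions made for this sketch, not the authors' implementation.

```python
import re


def lexicalize(delex_response, db_row):
    """Replace <value_slot> placeholders with values from one database row.

    Placeholders whose slot is missing from db_row are left untouched.
    """
    def fill(match):
        slot = match.group(1)
        return str(db_row.get(slot, match.group(0)))

    return re.sub(r"<value_(\w+)>", fill, delex_response)


# Hypothetical database row and delexicalized response.
row = {"brand": "Acme", "price": "299 USD"}
print(lexicalize("The <value_brand> phone costs <value_price>.", row))
```

Keeping the response delexicalized during generation lets the same learned template serve any database entry that matches the belief state.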

Fig 4. An example of our belief state.

https://doi.org/10.1371/journal.pone.0275750.g004

At the next turn (t = 1), the user again says something (U₁). For belief state (B₁) extraction, U₁ is concatenated with all the previous content to form the context. The final context for extracting the belief states for this turn is [U₀, B₀, D₀, A₀, R₀, U₁]. The flow from this point is exactly the same as the steps above. Similarly, at turn t, UBAR takes [U₀, B₀, D₀, A₀, R₀, …, U_t−1, B_t−1, D_t−1, A_t−1, R_t−1, U_t] as context and generates B_t, D_t, A_t and R_t in turn. The overall pipeline of UBAR is shown in Fig 5.

Fig 5. UBAR workflow.

https://doi.org/10.1371/journal.pone.0275750.g005

The previous models used only dialogue history ([U₀, R₀, U₁, R₁, …, U_t]) in the context to generate response at turn t, while UBAR uses all the previous components (user utterance, belief states, database query instances, actions & agent responses) in the context.

5.2 USBAR workflow

We largely keep the workflow of UBAR (see section 5.1) and introduce the sentiment token as extra information in the context. At turn t = 0, after receiving the user’s utterance U₀, the flow of USBAR is as follows:

The model performs sentiment classification of the user utterance (U₀). Basically it generates a word (positive/negative/neutral) representing the sentiment (S₀) of the user utterance.
At this point, our model uses [U₀, S₀] as the context and performs belief state extraction. The belief state B₀ for our model is the same as described in section 5.1.
Following UBAR, our model also performs database query based on the belief state B₀, and finds the number of database instances (D₀) matching with B₀.
Finally based on [U₀, S₀, B₀, D₀], the model generates required agent action A₀ and delexicalized response R₀. This overall flow is shown in Fig 6.

Fig 6. USBAR (our proposed model) workflow.

https://doi.org/10.1371/journal.pone.0275750.g006

This flow continues till the end of the conversation.
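The way the context grows from turn to turn can be sketched as follows. The `Turn` container and the `build_context` helper are hypothetical names introduced for illustration; in the actual model the components are token sequences fed to the language model rather than Python strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One completed dialogue turn (all fields are plain strings here)."""
    user: str        # U_i
    sentiment: str   # S_i
    belief: str      # B_i
    db: str          # D_i
    action: str      # A_i
    response: str    # R_i


def build_context(history: List[Turn], current: List[str]) -> List[str]:
    """Concatenate all completed turns with the components generated so far
    in the current turn (e.g. [U_t], then [U_t, S_t], and so on)."""
    context: List[str] = []
    for turn in history:
        context.extend([turn.user, turn.sentiment, turn.belief,
                        turn.db, turn.action, turn.response])
    context.extend(current)
    return context


# Hypothetical example: one finished turn, then a new user utterance
# together with its predicted sentiment.
history = [Turn("I want a phone", "neutral", "type phone", "db_3",
                "request brand", "Which brand do you prefer?")]
print(build_context(history, ["Anything but black", "negative"]))
```

Dropping the sentiment field from each turn gives the UBAR context of section 5.1, which makes the single change introduced by USBAR easy to see.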

At training time, UBAR simply calculates the cross-entropy loss and updates the model parameters accordingly. Following this procedure, we faced a few issues such as repetitiveness and a lack of consistency with the context and the chosen agent action. These problems occur especially when the user expresses negative sentiment in consecutive turns, and they may keep the conversation from going the way we want. To tackle these problems, in USBAR we calculate a reward r^{t = 0} for the generated response. This reward makes the model aware of the quality of the response. For the following turns, we repeat the previous steps and keep calculating a reward for every turn. At the very end of the whole conversation, following UBAR’s procedure, we calculate a cross-entropy loss, l. In addition, we take the expectation of the rewards calculated at each turn, in order to have a single reward for the whole conversation:
$$R = \frac{1}{n} \sum_{t=0}^{n-1} r^{t} \tag{3}$$
where we assume there are n turns in the conversation. We use this reward R to calculate our final loss as: (4)

This penalized loss is used to perform the back propagation in order to update the model parameters. The whole training pipeline has been shown in Fig 7. The details of the rewards are provided in section 5.3.
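The session-level computation can be sketched as below. The averaging follows the description of Eq (3); the additive combination with the cross-entropy loss is an assumption based on the statement in section 5.3 that the batch loss is penalized by adding the rewards, and the exact form of Eq (4) may differ. Function names are hypothetical.

```python
def session_reward(turn_rewards):
    """Average the per-turn rewards r^t into a single session reward R."""
    if not turn_rewards:
        raise ValueError("a conversation needs at least one turn")
    return sum(turn_rewards) / len(turn_rewards)


def penalized_loss(cross_entropy, turn_rewards):
    """ASSUMPTION: the session reward is simply added to the loss l."""
    return cross_entropy + session_reward(turn_rewards)


# Hypothetical per-turn reward values for a three-turn conversation.
print(penalized_loss(cross_entropy=0.5, turn_rewards=[1.0, 2.0, 3.0]))
```

In a real training loop the reward term would typically be a tensor on the same device as the loss, so that the sum can be back-propagated in one call.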

Fig 7. The overall training pipeline.

https://doi.org/10.1371/journal.pone.0275750.g007

Other than this, we also have used gradient based meta learning technique to train our model which is discussed in a detailed manner in section 5.5.

5.3 Rewards

We have introduced four sub-rewards in order to penalize the loss at the session level. These rewards are: repetitiveness reward (r₁), consistency reward (r₂), action consistency reward (r₃), and sentiment-based reward (r₄). So, the final reward function is: (5)

Then we will penalize the batch loss with these rewards just by adding them. This is just to make sure that we get less repetitiveness and more consistency (both with the context and the chosen action) while generating the responses. The details of these sub-rewards are discussed below.

5.3.1 Repetitiveness reward.

According to [61], models tend to generate the utterances that occur most often in the dataset, and this repetition usually occurs at the exact lexical level. As a result, the conversation falls flat, which can eventually hurt the persuasion. If the model repeatedly tries to persuade with the same features of some other product, the user will lose interest. To avoid this problem, we use the Jaccard score, a unigram-based measure of similarity between earlier utterances and the currently generated response. The sentences are first normalised with spaCy, and the resulting score is then used as a sub-reward. (6)
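A minimal sketch of the unigram Jaccard score is given below. It assumes lowercasing and whitespace tokenization in place of the spaCy normalisation used in the paper, and it pools all earlier utterances into one unigram set; both choices, and the function name, are assumptions for illustration.

```python
def unigram_jaccard(previous_utterances, response):
    """Jaccard similarity between the unigram set of earlier utterances
    and the unigram set of the current response."""
    previous = set()
    for utterance in previous_utterances:
        previous.update(utterance.lower().split())
    current = set(response.lower().split())
    union = previous | current
    if not union:
        return 0.0  # nothing to compare
    return len(previous & current) / len(union)


# Hypothetical agent utterances: a high score signals repetition.
earlier = ["this phone has a great camera"]
print(unigram_jaccard(earlier, "this phone has a great battery"))
```

A score near 1 means the new response reuses almost all of the earlier wording, which is the behaviour the repetitiveness sub-reward is meant to discourage.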

5.3.2 Consistency reward.

We use the Meteor score [62], a machine translation evaluation metric based on a generalised idea of unigram matching between machine-produced and human-produced reference translations. Here, we compute the Meteor score [62] between the generated response (hypothesis) and the gold human response (reference) in order to encourage human-like responses. We chose the gold human response as the benchmark because we believe it is optimally consistent with the dialogue. Once all generalized unigram matches between the two strings have been found, Meteor calculates a score using a combination of unigram precision, unigram recall, and a fragmentation measure designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. Meteor was chosen because it employs WordNet to find synonyms when exact matches aren’t found [63], and it correlates highly with human assessment in machine translation tasks.
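To make the precision and recall component concrete, the sketch below computes only the recall-weighted harmonic mean of unigram precision and recall on exact matches. It deliberately omits the fragmentation penalty and the WordNet synonym matching, so it is a simplified stand-in and not the full Meteor metric; the function name and the default weight `alpha = 0.9` (recall weighted nine times as much as precision) are assumptions for this sketch.

```python
from collections import Counter


def unigram_fmean(hypothesis, reference, alpha=0.9):
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplification: exact matches only, no fragmentation penalty.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Clipped unigram matches: each reference word can be matched once.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


# Hypothetical generated response versus gold response.
print(unigram_fmean("this phone suits your budget",
                    "this phone fits your budget well"))
```

For experiments, an established implementation of the full metric would be preferable to this reduced form.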

5.3.3 Action consistency reward.

Consistency between the generated response and the chosen action is very important, especially with different persuasion strategies. For example, suppose that for a particular turn the chosen action is emotional appeal, but the model tries to convince the user with product features in a logical manner. That may bring the conversation to an end. To make sure that the generated response is consistent with the chosen agent action, we have introduced this sub-reward.

To calculate this reward we need to have the probability distribution of the generated response over the action classes (different persuasion strategies are included as actions), and then we can calculate this sub-reward as: (7) Where, denotes the probability of the response R at turn t belonging to the i^th action class. Here in this equation, j^th class is the ground truth action class given in that context. To get the probability distribution over all the action classes, we built an action or strategy classifier (mentioned in section 5.4) by fine tuning RoBERTa, which achieved an overall accuracy of 82% and macro F1 score of 68.56.

5.3.4 Sentiment based reward.

The main motive of persuasion is to satisfy the user. In other words, the model should make sure that the user doesn’t express negative sentiment in consecutive turns. If the user shows negative sentiment again and again, the same kind of response is being generated and is not working. In that scenario, the model must understand that it has to change its persuasive strategy, or stop persuading and recommend some other products to the user. The idea of this sub-reward is simple: if the user’s sentiment is negative in 3 or more consecutive turns, a penalty of 1 is added. For example, if we get negative sentiment from the user 4 times in a row, the penalty is 2 (= 1 + 1). If we then get some positive or neutral sentiments followed by 3 consecutive negative sentiments again, the penalty becomes 3 (= 2 + 1).
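The counting rule in the worked example can be sketched as follows: a penalty of 1 is added at every turn where the current run of negative sentiments has reached at least three. The function name and the `threshold` parameter are hypothetical; the sentiment labels follow the positive/negative/neutral vocabulary of section 5.2.

```python
def sentiment_penalty(sentiments, threshold=3):
    """Accumulate a penalty of 1 for each turn at which the user has been
    negative for `threshold` or more consecutive turns."""
    penalty = 0
    run = 0  # length of the current streak of negative turns
    for sentiment in sentiments:
        run = run + 1 if sentiment == "negative" else 0
        if run >= threshold:
            penalty += 1
    return penalty


# The example from the text: four negatives, a break, three negatives.
turns = ["negative"] * 4 + ["positive"] + ["negative"] * 3
print(sentiment_penalty(turns))  # 3
```

Resetting the streak on any positive or neutral turn means a single successful response is enough to stop the penalty from growing.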

5.4 Action classifier

Here we have 26 different kinds of actions, including 5 main persuasion strategies (Personal appeal, Persona appeal, Logical appeal, Emotional appeal and Credibility appeal) and excluding the ‘Default’ persuasion strategy. Our model chooses one of them at every turn, depending upon the context. We tried a few models, namely BiLSTM, CNN and RoBERTa [64], for this task. As we failed to achieve an accuracy beyond 60 percent, we adopted a hierarchical approach. We divided all the classes into four main classes (inform, request, persuasion and others) and then divided the persuasion class into 5 sub-classes representing the different persuasion strategies (details provided in Fig 8). We experimented with different context window sizes and, using RoBERTa, obtained the best classifier in terms of accuracy and macro F1 score. The accuracy metric is defined as:
$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \tag{8}$$

Fig 8. 26 action class grouping for the hierarchical action classifier.

https://doi.org/10.1371/journal.pone.0275750.g008

The macro-averaged F1 score (or macro F1 score), on the other hand, is the arithmetic (unweighted) mean of all the per-class F1 scores, where the F1 score for each class is defined as: $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (9)
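For concreteness, the two metrics can be computed as in the following sketch. It uses the standard definitions of accuracy, precision, recall and F1; the function names and the small label lists are illustrative assumptions, not data from the paper.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels: one "inform" turn is misclassified as "request".
gold = ["inform", "inform", "request", "persuasion"]
pred = ["inform", "request", "request", "persuasion"]
acc = accuracy(gold, pred)   # 3 of 4 predictions are correct
mf1 = macro_f1(gold, pred)   # mean of F1 = 2/3, 2/3 and 1
```

Because every class contributes equally to the macro average, a weak class pulls the macro F1 score down even when overall accuracy is high, which is why the two numbers can differ noticeably on imbalanced action classes.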

The detailed results are presented in Tables 6 and 7.

Table 6. Performances of action classifier for main 4 classes.

https://doi.org/10.1371/journal.pone.0275750.t006

Table 7. Performances of action classifier for 5 persuasive classes.

https://doi.org/10.1371/journal.pone.0275750.t007
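The two-level decision described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the two scorers stand in for the fine-tuned RoBERTa classifiers and simply return fixed, made-up probabilities, and all function names are hypothetical.

```python
from typing import Callable, Dict

MAIN_CLASSES = ("inform", "request", "persuasion", "others")
PERSUASION_STRATEGIES = (
    "personal appeal",
    "persona appeal",
    "logical appeal",
    "emotional appeal",
    "credibility appeal",
)

# A scorer maps a dialogue context to a probability per label.
Scorer = Callable[[str], Dict[str, float]]


def argmax_label(scores: Dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def classify_action(context: str, coarse: Scorer, fine: Scorer) -> str:
    """First pick one of the four main classes; only if it is 'persuasion',
    ask the second classifier for the persuasion strategy."""
    main_class = argmax_label(coarse(context))
    if main_class != "persuasion":
        return main_class
    return argmax_label(fine(context))


def stub_coarse(context: str) -> Dict[str, float]:
    """Stand-in for the main-class classifier (fixed, made-up probabilities)."""
    return {"inform": 0.1, "request": 0.1, "persuasion": 0.7, "others": 0.1}


def stub_fine(context: str) -> Dict[str, float]:
    """Stand-in for the persuasion-strategy classifier (fixed, made-up probabilities)."""
    return {s: (0.6 if s == "logical appeal" else 0.1) for s in PERSUASION_STRATEGIES}


predicted = classify_action("user: this phone is too expensive", stub_coarse, stub_fine)
```

Splitting the decision this way means the second classifier only has to separate the five strategies from one another, instead of competing with every other action class at once.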

5.5 Meta learning

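Algorithm of Meta Learning for domain adaptation

Require: D_train

Require: α_meta, β_meta: step size hyperparameters

1: Randomly initialize θ

2: while not done do

3: Sample batch of different sub-domains $D_{s_i} \sim D_{train}$

4: for all $D_{s_i}$ do

5: Sample $(D_{s_i}^{train}, D_{s_i}^{valid}) \sim D_{s_i}$

6: Compute adapted parameters with gradient descent: $\theta'_{s_i} = \theta - \alpha_{meta} \nabla_\theta \mathcal{L}_{D_{s_i}^{train}}(f_\theta)$

7: end for

8: $\theta \leftarrow \theta - \beta_{meta} \sum_{D_{s_i} \sim D_{train}} \nabla_\theta \mathcal{L}_{D_{s_i}^{valid}}(f_{\theta'_{s_i}})$

9: end while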

To extend this work to unseen sub-domains such as refrigerator, air conditioner and microwave oven, we need to make the parameters easily adaptable. Following PAML (Persona Agnostic Meta Learning) [53], we use a meta-learning algorithm to learn the different sub-domains as separate tasks, which is fundamentally different from optimising the model to represent all of the sub-domains jointly. A high-level intuition of the difference between these two approaches is shown in Fig 9. We define the sub-domain meta-dataset $\mathcal{D} = \{D_{s_1}, \ldots, D_{s_m}\}$, where m is the number of sub-domains (here, we merged the computer and laptop domains, giving a meta-dataset of 4 sub-domains). Before training, we divide the dataset into two parts, D_train and D_test. For each training epoch, we uniformly sample a set of conversations from each $D_{s_i}$, as $D_{s_i}^{train}$ and $D_{s_i}^{valid}$. After t iterations on D_train, the model f_θ parameterized by θ is updated to $\theta'_{s_i}$ by standard gradient descent: $\theta'_{s_i} = \theta - \alpha_{meta} \nabla_\theta \mathcal{L}_{D_{s_i}^{train}}(f_\theta)$ (10) where α_meta is the learning rate of the inner optimisation and $\mathcal{L}_{D_{s_i}^{train}}$ is the training loss. The model is then updated such that it maximizes the log-likelihood of the unseen dialogues, i.e., $D_{s_i}^{valid}$. We again apply stochastic gradient descent to the meta-model parameters θ by computing the gradient of $\mathcal{L}_{D_{s_i}^{valid}}(f_{\theta'_{s_i}})$, that is: $\theta \leftarrow \theta - \beta_{meta} \sum_{D_{s_i} \sim D_{train}} \nabla_\theta \mathcal{L}_{D_{s_i}^{valid}}(f_{\theta'_{s_i}})$ (11) where β_meta is the meta learning rate. This procedure requires second-order partial derivatives, which can be computed with any automatic differentiation library (e.g., PyTorch, TensorFlow). The overall algorithm is shown above.
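To make the two nested updates concrete, the sketch below runs the same inner step (Eq 10) and meta step (Eq 11) on a deliberately tiny problem. It is an illustration under stated assumptions, not the dialogue model: each "sub-domain" is a toy regression task y = slope · x, the model is f_θ(x) = θ · x with a single scalar parameter, and the second-order term is written out by hand instead of using an automatic differentiation library. All names are invented for the example.

```python
import random
from typing import List, Sequence, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def sample_batch(slope: float, rng: random.Random, size: int = 2) -> Batch:
    """Draw `size` points from the toy sub-domain y = slope * x."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    return [(x, slope * x) for x in xs]


def loss_grad(theta: float, batch: Batch) -> float:
    """Gradient of the mean squared error of f_theta(x) = theta * x."""
    return sum(2.0 * x * (theta * x - y) for x, y in batch) / len(batch)


def loss_hessian(batch: Batch) -> float:
    """Second derivative of the same loss with respect to theta."""
    return sum(2.0 * x * x for x, _ in batch) / len(batch)


def meta_train(slopes: Sequence[float], alpha_meta: float = 0.05,
               beta_meta: float = 0.05, steps: int = 500, seed: int = 0) -> float:
    rng = random.Random(seed)
    theta = rng.uniform(-1.0, 1.0)              # randomly initialize theta
    for _ in range(steps):                      # "while not done"
        meta_grad = 0.0
        for slope in slopes:                    # every toy sub-domain is visited each step
            train = sample_batch(slope, rng)    # 2 instances for the inner update
            valid = sample_batch(slope, rng)    # 2 instances for the meta update
            # Inner update (Eq 10): one gradient step on the training split.
            theta_i = theta - alpha_meta * loss_grad(theta, train)
            # Chain rule through the inner step; this is the second-order term.
            d_theta_i = 1.0 - alpha_meta * loss_hessian(train)
            meta_grad += loss_grad(theta_i, valid) * d_theta_i
        # Meta update (Eq 11): step on the summed validation gradients.
        theta -= beta_meta * meta_grad
    return theta


meta_theta = meta_train([1.0, 2.0, 3.0])
```

In this toy setting the meta-learned parameter settles between the slopes of the individual tasks, so a single further gradient step moves it close to any one of them; this mirrors the intuition in Fig 9 that the meta-learned initial parameters are a good starting point for fine-tuning.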

Fig 9. The difference between finetuning from a) joint training on all sub-domains and b) meta-learning sub-domain.

The solid line represents the optimization path of the initial parameters and the dashed line the fine-tuning path. Meta-learned initial parameters can adapt faster to a new sub-domain.

https://doi.org/10.1371/journal.pone.0275750.g009

6 Training setup & implementation details

We implemented our model with HuggingFace’s Transformers [65] and DistilGPT2 [66], a distilled version of GPT-2, at the session level: the whole conversation is passed to the model, which learns to predict the next word from the preceding words. We used cross-entropy as the loss function, AdamW as the optimizer and the standard greedy decoding method with a temperature of 0.7. We calculated the respective rewards at every iteration and added them to the loss, and this penalized loss was then used to update the parameters. We used α₁ = 1, α₂ = −1, α₃ = −1 and α₄ = 1 in Eq 5, β = 1 in Eq 7, and a batch size of 2 at every iteration.
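A minimal sketch of how such a penalized loss could be assembled is given below. It assumes that Eq 5 combines the four sub-rewards as a weighted sum with the weights α₁ to α₄ and that the combined term is simply added to the cross-entropy loss, as the paragraph above states; the function names and the example values are hypothetical.

```python
from typing import Sequence

# Weights alpha_1 .. alpha_4 as reported in the text.
ALPHAS = (1.0, -1.0, -1.0, 1.0)


def combined_reward(sub_rewards: Sequence[float], alphas: Sequence[float] = ALPHAS) -> float:
    """Weighted sum of the sub-rewards (assumed form of Eq 5)."""
    if len(sub_rewards) != len(alphas):
        raise ValueError("expected one weight per sub-reward")
    return sum(a * r for a, r in zip(alphas, sub_rewards))


def penalized_loss(cross_entropy: float, sub_rewards: Sequence[float],
                   alphas: Sequence[float] = ALPHAS) -> float:
    """Cross-entropy loss with the combined reward term added to it."""
    return cross_entropy + combined_reward(sub_rewards, alphas)


# Made-up values for r1..r4 at one training iteration.
example_loss = penalized_loss(1.0, [0.5, 0.2, 0.1, 2.0])
```

In an actual training loop the cross-entropy term would be a tensor produced by the model, and the scalar reward term would be added to it before the backward pass.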

7 Results

We evaluated our model with different setups on our PPD dataset. For automatic evaluation, we calculated the BLEU (BiLingual Evaluation Understudy) score [67] and the ROUGE score. We also performed a human evaluation of the model in terms of repetitiveness, consistency, personalized persuasion and grammatical correctness. The metrics are defined as follows:

Repetitiveness: It measures how similar the generated responses are to one another. We define it as the Jaccard similarity (see Eq 6) between agent responses in a single conversation.
Consistency: This is defined as the number of slots fulfilled by the agent / number of slots asked by the user.
Personalized persuasion: Responses were rated on the degree of perceived personalization tactic executed by the agent. On a scale of 5, agent responses that used the contextual information provided by the user at the start of the conversation were given relatively higher points. As expected, a neutral response was given 2.5 points.
Grammatical Correctness Score (G.C Score): It measures how grammatically correct the generated sentences are. For each conversation, it is calculated as: Grammatical Correctness = number of grammatically correct responses / total number of turns. We then take the mean over all the conversations to get the final Grammatical Correctness score (G.C Score).
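The ratio-based metrics above can be sketched as follows. This is an illustrative reading of the definitions, not the authors' evaluation script: the sketch assumes that Jaccard similarity is taken over lower-cased word sets and averaged over all pairs of agent responses in a conversation, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two responses."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def repetitiveness(agent_responses: Sequence[str]) -> float:
    """Mean pairwise Jaccard similarity within one conversation (assumed aggregation)."""
    pairs = list(combinations(agent_responses, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def consistency(slots_fulfilled: int, slots_asked: int) -> float:
    """Slots fulfilled by the agent divided by slots asked by the user."""
    return slots_fulfilled / slots_asked if slots_asked else 0.0


def gc_score(conversations: Sequence[Sequence[bool]]) -> float:
    """Mean over conversations of (grammatically correct responses / turns).

    Each conversation is a list of booleans, one per turn, marking whether
    the annotator judged that response grammatically correct."""
    ratios = [sum(flags) / len(flags) for flags in conversations if flags]
    return sum(ratios) / len(ratios) if ratios else 0.0


similarity = jaccard("the phone is red", "the phone is blue")   # 3 shared words of 5
score = gc_score([[True, True, False, True], [True, False]])    # mean of 0.75 and 0.5
```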

We compared the performance of our model (USBAR) with two established techniques: UBAR [10] and SimpleTOD [9], the latter also being a GPT-2-based technique, trained on turn-level data without generated belief states and system acts in the dialogue history. Additionally, we experimented with changing the base model in the USBAR configuration from DistilGPT2 to DialoGPT [20]. We fine-tuned each model for 50 epochs and experimented with different settings. For each model, we tried combinations of true and generated belief states and actions and measured performance in terms of the automatic evaluation metrics. We also measured the performance of each model in terms of the aforementioned human evaluation metrics, but only in an end-to-end setting (using generated belief states and actions in the context). Automatic evaluation results are shown in Table 8 and human evaluation results in Table 9.

Table 8. Automatic evaluation results with different setup.

https://doi.org/10.1371/journal.pone.0275750.t008

Table 9. Human evaluation results in an end-to-end setting.

https://doi.org/10.1371/journal.pone.0275750.t009

7.1 Results without rewards

A careful inspection of the results attained by SimpleTOD, UBAR and USBAR (our model) in Table 8 reveals a clear performance improvement in each setting. The improvement is modest, but it is achieved over UBAR merely by including a small piece of information, the sentiment, in the context. The improvement is reflected in both the automatic and the human evaluation metrics (see Table 9). We also experimented with changing the pre-trained model from DistilGPT2 to DialoGPT, which did not yield better performance in any of the cases.

7.2 Results with r₁ & r₂

These rewards were used mainly to improve performance in terms of repetitiveness and consistency with the context. After including these 2 rewards, we observe a significant rise in performance in terms of both automatic and human evaluation metrics. In particular, two human evaluation metrics (repetitiveness and consistency) were designed to capture the effect of these two rewards, and the corresponding columns of Table 9 also show a significant rise in performance. This indicates that these two rewards served their intended purpose.

7.3 Results using r₁, r₂, r₃ & r₄

After including all the rewards, we do not get a significant improvement over the previous model (with the 1st and 2nd rewards only), although we achieve a good improvement in terms of grammatical correctness and personalized persuasion. The action classifier we designed is not perfect, which prevents a significant improvement in the case of UBAR. In the case of USBAR, on the other hand, performance drops in some settings after introducing the 3rd and 4th rewards on top of the first 2 rewards. The classifier we use was not trained with a context that contains sentiment, so in the case of USBAR the classifier becomes more confused. As a result, whenever we use ground-truth actions, the model gets confused and generates responses that differ from the ground truth. When the model chooses the action itself, on the other hand, USBAR chooses it in such a way that the generated responses are closer to the gold human responses. Consequently, a performance comparison between USBAR+R₁+R₂ and USBAR+R₁+R₂+R₃+R₄ reveals that relying on generated actions leads to better results from the latter model, whereas adding ground-truth actions to the context reverses the result. Nevertheless, in an end-to-end setting, we get the best results on almost all the metrics (both human and automatic evaluation metrics) from the USBAR+R₁+R₂+R₃+R₄ module.

A comparison between the performance of UBAR and our final USBAR+rewards module is shown in Fig 10. A close look at the left image, which is from the UBAR module, reveals that when the user shows negative sentiment for the 3rd consecutive time, the model has clearly lost the context: it tries to persuade by mentioning different sub-domains. In the same situation, our USBAR module trained with the reinforcement-learning-based rewards (right image) captures the context correctly and, even after receiving three consecutive negative sentiments, is able to persuade with proper information. Note also that in the left image the model passes the same message in different forms, which can eventually frustrate the user. In the latter case, the model changes its strategy: on the second consecutive negative sentiment, it tries to persuade the user with the brand and then returns to the uniqueness of the color. This is possibly a better approach. Our final model has thus clearly achieved an improvement over UBAR. A few more situations are covered in section 10.

Fig 10. Outputs generated from two different models: (a) output from simple UBAR; (b) output from our USBAR+rewards (for simplicity the intermediate components, i.e., sentiment, belief states, db query, agent action, have been omitted).

https://doi.org/10.1371/journal.pone.0275750.g010

Meta learning.

For meta learning, we held out each of the domains in turn (except phone, as it has a good amount of data) and fine-tuned DistilGPT2 for 2000 iterations, following the algorithm described earlier. At each iteration, we picked 4 instances from each of the selected training domains, 2 for training and 2 for validation. We then loaded the resulting parameters and fine-tuned them on the held-out domain, on which domain adaptation is tested. This time we fine-tuned the model for only 25 epochs and measured the performance. We experimented with few-shot (trained for 25 epochs) and zero-shot settings. The results of the domain adaptation experiments are shown in Table 10.

Table 10. Domain adaptation results using Meta learning.

https://doi.org/10.1371/journal.pone.0275750.t010
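The hold-out protocol used for these experiments can be sketched as below. This is a hedged illustration only: the domain names other than phone and the merged computer and laptop domain are placeholders, the conversation identifiers are made up, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def leave_one_domain_out(domains: Sequence[str],
                         never_held_out: Sequence[str] = ("phone",)
                         ) -> List[Tuple[List[str], str]]:
    """Return (training domains, held-out domain) pairs, skipping domains
    that always stay in the training set."""
    splits = []
    for held_out in domains:
        if held_out in never_held_out:
            continue
        train_domains = [d for d in domains if d != held_out]
        splits.append((train_domains, held_out))
    return splits


def sample_episode(conversations: Sequence[str], rng: random.Random,
                   n_train: int = 2, n_valid: int = 2
                   ) -> Tuple[List[str], List[str]]:
    """Pick 4 distinct conversations from one domain: 2 for the inner
    (training) update and 2 for the meta (validation) update."""
    picked = rng.sample(list(conversations), n_train + n_valid)
    return picked[:n_train], picked[n_train:]


# Placeholder names for the 4 sub-domains of the meta-dataset.
DOMAINS = ["phone", "computer_laptop", "subdomain_3", "subdomain_4"]
splits = leave_one_domain_out(DOMAINS)
train_part, valid_part = sample_episode([f"conv_{i}" for i in range(10)], random.Random(0))
```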

8 Error analysis

We observed the following two key issues with the proposed model.

The model gets confused between different persuasive strategies. For example, the chosen action may be emotional appeal while the generated response is not an emotional appeal at all. We tried to resolve this problem with our 3rd reward, but the action classifier we built is not accurate enough, especially for the different persuasive appeals. A more accurate classifier can help avoid this problem.
It is very important for our model to be able to generate correct delexicalized responses. In some scenarios, we noticed that the model is unable to generate placeholders for some slots. We did not annotate slot values for agent responses in the dataset, so the model depends entirely on the slot-value annotations of user utterances to learn the placeholders for the respective slots. However, a few slots (such as processor and release date) rarely appear in user utterances. As a result, the model fails to learn the proper placeholders for those slots.

9 Advantage & limitation

As mentioned earlier, the sales field faces its most difficult predicament when the end user’s goal is unavailable. Unlike existing models, the agent we propose tries to convince the user or recommends something else that they might like, rather than failing in such circumstances. As a limitation, in some situations when the model opts for recommendation over persuasion, it makes a suggestion simply by disregarding the most recent specification (belief state) that the user has provided rather than drawing on the strength of a separate recommendation system.

10 Case studies

Here we show a few more examples of how the two models (UBAR and USBAR+rewards) generate responses in different situations. In each figure, the responses from UBAR are on the left and the responses from USBAR+rewards are on the right.

10.1 Generated sample 1

In this sample (Fig 11), the user is not happy about the colour. He/she repeatedly expresses negative sentiment about it, and the model tries to convince him/her. In the UBAR response (left image), at some point the model says “… it is in her favorite color…”. The model has no prior knowledge about the person for whom the product is being purchased, and the user never gave it that information, so this sentence may mislead the conversation. In addition, in the last reply, the model surprisingly tries to convince the user by saying that it has never received any complaint about the battery. It is clearly losing the context, and as a result the response is not what is expected. The USBAR+Rewards module, in contrast, does not lose the context and continues persuading till the end while staying consistent with the context.

Fig 11. Sample-1.

https://doi.org/10.1371/journal.pone.0275750.g011

10.2 Generated sample 2

Here (Fig 12), we examine the models’ performance when the user expresses consecutive positive sentiments. In its second reply (left image), UBAR asks about the user’s budget, and in the very last reply in that image it asks the same question again. In addition, after receiving the 2nd consecutive positive response, UBAR passes the same information about the battery. Here the model fails to avoid repetitiveness, and the response is not even properly delexicalized. The USBAR module (right image), on the other hand, passes the battery information after the first positive sentiment and then provides information about the radio and GPS. It is thus clear that USBAR+Rewards has improved over UBAR in terms of repetitiveness.

Fig 12. Sample-2.

https://doi.org/10.1371/journal.pone.0275750.g012

10.3 Generated sample 3

Here (Fig 13), we again pass negative sentiments, but this time regarding the RAM. In this scenario, the UBAR module tries to convince the user with the colour attribute. The user repeatedly expresses dissatisfaction with the RAM and every time receives a response about the colour, which can frustrate the user. The USBAR+rewards module, on the other hand, first highlights the storage capacity. On the next turn, when it realises that this is not working, it tries to convince the user by pointing to the good processor, which can avoid problems despite the low RAM.

Fig 13. Sample-3.

https://doi.org/10.1371/journal.pone.0275750.g013

11 Conclusion and future works

The current work reports the development of an end-to-end neural response generation system for the sales domain with several features: (a) it is capable of persuading the user in goal-unavailability situations: if the user’s specified goals/specifications are not available in the database, the agent tries to persuade the user towards an alternative goal; (b) the agent uses different persuasion strategies to convince the user according to the context information; there are 5 different persuasion strategies the agent can follow: emotional appeal, personal appeal, persona-based appeal, logical appeal and credibility appeal. The appropriate strategy is selected by the agent based on the context information. To the best of our knowledge, this is the first work on automatic neural response generation in an end-to-end setting for developing a persuasive conversational agent. Moreover, reinforcement-learning-based rewards are introduced with this end-to-end NLG module to improve the model’s performance in terms of reduced repetitiveness and improved consistency (both with the context and with the generated action). The model is trained on a newly developed dataset, PPD (personalized persuasive dialogue), and the incorporation of persuasion behaviour makes the model more useful in practical scenarios, especially in the sales domain or where the agent needs to make a reservation. We have also improved the domain adaptation ability of the model through optimization-based meta learning. Results on the PPD dataset illustrate the impact of meta-learning on domain adaptation and the utility of the RL-based reward functions for improving the quality of responses in terms of automatic and human evaluation metrics. In this paper we concentrated on only a few electronic goods sales domains, but we believe this work can be extended to other sub-domains such as air conditioner, refrigerator and microwave oven, and with only a small amount of resources. Future work also includes introducing multimodality into the neural response generation system, so that the system can extract slot-value pairs (belief states) from images shown by the user. Moreover, we also aim to develop models for a negotiating chatbot.

References

1. Lipton Z, Li X, Gao J, Li L, Ahmed F, Deng L. BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems. Proceedings of the AAAI Conference on Artificial Intelligence. 2018;32(1).
2. Li X, Chen YN, Li L, Gao J. End-to-End Task-Completion Neural Dialogue Systems. 2017.
3. Liu B, Lane I. End-to-End Learning of Task-Oriented Dialogs. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. New Orleans, Louisiana, USA: Association for Computational Linguistics; 2018. p. 67–73. Available from: https://aclanthology.org/N18-4010.
4. Chen W, Chen J, Qin P, Yan X, Wang WY. Semantically conditioned dialog response generation via hierarchical disentangled self-attention. arXiv preprint arXiv:190512866. 2019.
5. Wang K, Tian J, Wang R, Quan X, Yu J. Multi-domain dialogue acts and response co-generation. arXiv preprint arXiv:200412363. 2020.
6. Lei W, Jin X, Kan MY, Ren Z, He X, Yin D. Sequicity: Simplifying Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics; 2018. p. 1437–1447. Available from: https://aclanthology.org/P18-1133.
7. Liang W, Tian Y, Chen C, Yu Z. MOSS: End-to-End Dialog System Framework with Modular Supervision. Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34(05):8327–8335.
8. Zhang Y, Ou Z, Yu Z. Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context; 2019.
9. Hosseini-Asl E, McCann B, Wu CS, Yavuz S, Socher R. A simple language model for task-oriented dialogue. Advances in Neural Information Processing Systems. 2020;33:20179–20191.
10. Yang Y, Li Y, Quan X. UBAR: Towards Fully End-to-End Task-Oriented Dialog Systems with GPT-2. In: AAAI; 2021.
11. Mazaré PE, Humeau S, Raison M, Bordes A. Training Millions of Personalized Dialogue Agents. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics; 2018. p. 2775–2779. Available from: https://aclanthology.org/D18-1298.
12. Zheng Y, Chen G, Huang M, Liu S, Zhu X. Personalized dialogue generation with diversified traits. arXiv preprint arXiv:190109672. 2019.
13. Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. PMLR; 2017. p. 1126–1135.
14. Wang X, Shi W, Kim R, Oh Y, Yang S, Zhang J, et al. Persuasion for good: Towards a personalized persuasive dialogue system for social good. arXiv preprint arXiv:190606725. 2019.
15. Li J, Monroe W, Shi T, Jean S, Ritter A, Jurafsky D. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:170106547. 2017.
16. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, et al. Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics; 2018. p. 2227–2237. Available from: https://aclanthology.org/N18-1202.
17. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. 2018.
18. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.
19. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. Language models are unsupervised multitask learners. OpenAI blog. 2019;1(8):9.
20. Zhang Y, Sun S, Galley M, Chen YC, Brockett C, Gao X, et al. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:191100536. 2019.
21. Wu Z, Galley M, Brockett C, Zhang Y, Gao X, Quirk C, et al. A controllable model of grounded response generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. p. 14085–14093.
22. Budzianowski P, Vulić I. Hello, it’s GPT-2–how can I help you? towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:190705774. 2019.
23. Chiu S, Li M, Lin YT, Chen YN. SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues. arXiv preprint arXiv:220410591. 2022.
24. Sun K, Moon S, Crook P, Roller S, Silvert B, Liu B, et al. Adding chit-chat to enhance task-oriented dialogues. arXiv preprint arXiv:201012757. 2020.
25. Petty RE, Cacioppo JT. The elaboration likelihood model of persuasion. In: Communication and persuasion. Springer; 1986. p. 1–24.
26. Friestad M, Wright P. The persuasion knowledge model: How people cope with persuasion attempts. Journal of consumer research. 1994;21(1):1–31.
27. Dijkstra A. The psychology of tailoring-ingredients in computer-tailored persuasion. Social and personality psychology compass. 2008;2(2):765–784.
28. Qiu S, Zhang K. Learning Personalized End-to-End Task-Oriented Dialogue for Fast and Reliable Adaptation. In: 2021 International Conference on Digital Society and Intelligent Systems (DSInS). IEEE; 2021. p. 62–66.
29. Tiwari A, Saha T, Saha S, Sengupta S, Maitra A, Ramnani R, et al. A dynamic goal adapted task oriented dialogue agent. Plos one. 2021;16(4):e0249030. pmid:33793633
30. Tiwari A, Saha T, Saha S, Sengupta S, Maitra A, Ramnani R, et al. A persona aware persuasive dialogue policy for dynamic and co-operative goal setting. Expert Systems with Applications. 2022;195:116303.
31. Tiwari A, Saha T, Saha S, Sengupta S, Maitra A, Ramnani RR, et al. Multi-Modal Dialogue Policy Learning for Dynamic and Co-operative Goal Setting. In: International Joint Conference on Neural Networks, IJCNN 2021, Shenzhen, China, July 18-22, 2021. IEEE; 2021. p. 1–8. Available from: https://doi.org/10.1109/IJCNN52387.2021.9533878.
32. Priya N, Tiwari A, Saha S. Context Aware Joint Modeling of Domain Classification, Intent Detection and Slot Filling with Zero-Shot Intent Detection Approach. In: Mantoro T, Lee M, Ayu MA, Wong KW, Hidayanto AN, editors. Neural Information Processing—28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8-12, 2021, Proceedings, Part III. vol. 13110 of Lecture Notes in Computer Science. Springer; 2021. p. 582–595. Available from: https://doi.org/10.1007/978-3-030-92238-2_48.
33. Singh S, Kearns M, Litman D, Walker M. Reinforcement learning for spoken dialogue systems. Advances in neural information processing systems. 1999;12.
34. Li J, Monroe W, Ritter A, Galley M, Gao J, Jurafsky D. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:160601541. 2016.
35. Casanueva I, Budzianowski P, Su PH, Ultes S, Rojas-Barahona L, Tseng BH, et al. Feudal reinforcement learning for dialogue management in large domains. arXiv preprint arXiv:180303232. 2018.
36. Chen L, Chen Z, Tan B, Long S, Gašić M, Yu K. AgentGraph: Toward universal dialogue management with structured deep reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2019;27(9):1378–1391.
37. Mesgar M, Simpson E, Gurevych I. Improving factual consistency between a response and persona facts. arXiv preprint arXiv:200500036. 2020.
38. Saha T, Chopra S, Saha S, Bhattacharyya P. Reinforcement learning based personalized neural dialogue generation. In: International Conference on Neural Information Processing. Springer; 2020. p. 709–716.
39. Tiwari A, Saha S, Bhattacharyya P. A knowledge infused context driven dialogue agent for disease diagnosis using hierarchical reinforcement learning. Knowledge-Based Systems. 2022;242:108292.
40. Naik DK, Mammone RJ. Meta-neural networks that learn by learning. In: [Proceedings 1992] IJCNN International Joint Conference on Neural Networks. vol. 1. IEEE; 1992. p. 437–442.
41. Koch G, Zemel R, Salakhutdinov R, et al. Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop. vol. 2. Lille; 2015. p. 0.
42. Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. Matching networks for one shot learning. Advances in neural information processing systems. 2016;29.
43. Santoro A, Bartunov S, Botvinick M, Wierstra D, Lillicrap T. Meta-learning with memory-augmented neural networks. In: International conference on machine learning. PMLR; 2016. p. 1842–1850.
44. Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. Advances in neural information processing systems. 2017;30.
45. Sung F, Yang Y, Zhang L, Xiang T, Torr PH, Hospedales TM. Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 1199–1208.
46. Andrychowicz M, Denil M, Gomez S, Hoffman MW, Pfau D, Schaul T, et al. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems. 2016;29.
47. Ha D, Dai A, Le QV. Hypernetworks. arXiv preprint arXiv:160909106. 2016.
48. Munkhdalai T, Yu H. Meta networks. In: International Conference on Machine Learning. PMLR; 2017. p. 2554–2563.
49. Mishra N, Rohaninejad M, Chen X, Abbeel P. Meta-learning with temporal convolutions. arXiv preprint arXiv:170703141. 2017;2(7):23.
50. Ravi S, Larochelle H. Optimization as a model for few-shot learning. 2016.
51. Yoon J, Kim T, Dia O, Kim S, Bengio Y, Ahn S. Bayesian model-agnostic meta-learning. Advances in neural information processing systems. 2018;31.
52. Gu J, Wang Y, Chen Y, Cho K, Li VO. Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:180808437. 2018.
53. Lin Z, Madotto A, Wu CS, Fung P. Personalizing dialogue agents via meta-learning. arXiv preprint arXiv:190510033. 2019.
54. Hemphill CT, Godfrey JJ, Doddington GR. The ATIS spoken language systems pilot corpus. In: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990; 1990.
55. Budzianowski P, Wen TH, Tseng BH, Casanueva I, Ultes S, Ramadan O, et al. MultiWOZ–A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. arXiv preprint arXiv:181000278. 2018.
56. Zhang S, Dinan E, Urbanek J, Szlam A, Kiela D, Weston J. Personalizing Dialogue Agents: I have a dog, do you have pets too? In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2018. p. 2204–2213.
57. Bordes A, Boureau YL, Weston J. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:160507683. 2016.
58. Lewis M, Yarats D, Dauphin Y, Parikh D, Batra D. Deal or No Deal? End-to-End Learning of Negotiation Dialogues. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; 2017. p. 2443–2453.
59. Saha A, Khapra M, Sankaranarayanan K. Towards building large scale multimodal domain-aware conversation systems. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32; 2018.
60. Baichoo A. Kaggle GSMArena; 2017. Available from: https://www.kaggle.com/arwinneil/gsmarena-phone-dataset.
61. Shi W, Li Y, Sahay S, Yu Z. Refine and Imitate: Reducing Repetition and Inconsistency in Persuasion Dialogues via Reinforcement Learning and Human Demonstration. arXiv preprint arXiv:201215375. 2020.
62. Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization; 2005. p. 65–72.
63. Castillo J, Estrella P. Semantic textual similarity for MT evaluation. In: Proceedings of the Seventh Workshop on Statistical Machine Translation; 2012. p. 52–58.
64. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:190711692. 2019.
65. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:191003771. 2019.
66. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:191001108. 2019.
67. Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics; 2002. p. 311–318.
