Recently, there has been a surge of interest in dialogue generation for chatbots, which aim to converse naturally and meaningfully with humans on open domain topics (Vinyals and Le, 2015). Although often called “non-goal-oriented” dialogue systems, such conversational agents are often built to keep users engaged in human-machine interactions for as long as possible (Ram et al., 2018). While most existing effort has gone into generating relevant and diverse responses for static contexts (Serban et al., 2016, 2017b; Sordoni et al., 2015; Li et al., 2015), it is not clear whether relevance and diversity are sufficient for engagement in dynamic human-machine interactions, and if not, what else is needed to achieve engagement.
In this work, we investigate the following problems: (1) how to understand human engagement in social chat; (2) how to imitate such behavior in dialogue generation; (3) how to learn such a dialogue model; and (4) whether the model can control its responses in interactions and thus enhance user engagement.
We design dialogue acts that describe how humans behave with regard to conversational contexts in their social interactions. The dialogue acts, when applied to real data, give rise to an interesting finding: in addition to replying with relevance and diversity, people habitually drive their social chat by constantly switching to new contexts and properly asking questions. Such behavior has been little explored before, and thus is difficult for the existing end-to-end learning methods to imitate. To mimic the behavior, we propose modeling open domain dialogue generation as an alternation of dialogue act selection and response generation, where the dialogue acts control the types of the generated responses and thus manage the flow of interactions as policies. The model is learned from large scale human-human dialogues tagged with a dialogue act classifier, and the policy of act selection is further optimized for long-term conversation through a reinforcement learning approach. Our model enjoys several advantages over the existing models: (1) the dialogue acts provide interpretation of response generation from a discourse perspective; (2) the dialogue acts enhance diversity of responses by expanding the search space from language to act; (3) the dialogue acts improve user engagement in human-machine interactions; and (4) the dialogue acts allow engineers to control their systems by picking responses from their desired acts. Evaluation results on large scale test data indicate that our model can significantly outperform state-of-the-art methods in terms of the quality of generated responses for given contexts and lead to long-term conversation in both machine-machine simulation and human-machine conversation.
Our contributions in this work include: (1) design of dialogue acts that represent human behavior with regard to conversational contexts, and insights from analysis of human-human interactions; (2) joint modeling of dialogue act selection and response generation in open domain dialogue generation; (3) proposal of learning the model through a supervised learning approach and a reinforcement learning approach; (4) empirical verification of the effectiveness of the model through automatic metrics, human annotations, machine-machine simulation, and human-machine conversation.
2.1 Definition of Dialogue Acts
We define our dialogue acts by extending the 42 tags (Jurafsky et al., 1997; Stolcke et al., 2006) based on the DAMSL annotation scheme (Core and Allen, 1997). Specifically, we merge some acts and define two high-level ones that describe how people behave with regard to conversational contexts in their interactions. As will be seen later, the extension brings us insights on engagement in social chat. Details of the dialogue acts are described in Table 1.
The dialogue acts in Table 1 are generally applicable to open domain dialogues from various sources in different languages, such as Twitter, Reddit, Facebook, Weibo (www.weibo.com), and Baidu Tieba (https://tieba.baidu.com/). Existing annotated data sets (e.g., the Switchboard Corpus) do not have dialogue acts that relate to conversational contexts. Therefore, it is not clear how such dialogue acts depict human behavior in interactions, and there are no large scale data available for learning dialogue generation with the dialogue acts either. To resolve these problems, we build a data set.
2.2 Data Set
We crawled 30 million dyadic dialogues (conversations between two people) from Baidu Tieba. Baidu Tieba is the largest Reddit-like forum in China, where users communicate with each other by posting comments and replying to them. We randomly sample 9 million dialogues as a training set, 90 thousand dialogues as a validation set, and 1000 dialogues as a test set. These data are used to learn a dialogue generation model later. We employ the Stanford Chinese word segmenter (https://nlp.stanford.edu/software/tokenizer.shtml) to tokenize utterances in the data. Table 2 reports statistics of the data.
For dialogue act learning, we randomly sample 500 dialogues from the training set and recruit 3 native speakers to label dialogue acts for each utterance according to the definitions in Table 1. Table 3 shows a labeling example from one annotator. Each utterance receives 3 labels, and the Fleiss’ kappa of the labeling work is 0.45, indicating moderate agreement among the labelers.
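As a rough illustration of how an agreement score such as the Fleiss' kappa above can be computed, the following is a minimal sketch in plain Python. The function name `fleiss_kappa` and the toy ratings are assumptions made for illustration; they are not the annotation data or the tooling used in this work.

```python
from collections import Counter

# Dialogue act inventory, in the order the acts are listed in Section 2.3.
ACTS = ["CM.S", "CM.Q", "CM.A", "CS.S", "CS.Q", "CS.A", "O"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`, a list of per-item label lists.

    Every item must be labeled by the same number of raters (3 in the
    labeling setup described above). Undefined (division by zero) when all
    raters always pick one single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: average over items of the per-item agreement P_i.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in p_cat)

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy usage with made-up labels (hypothetical, not the real annotations).
toy = [["CM.S", "CM.S", "CS.S"], ["CM.Q", "CM.Q", "CM.Q"], ["CS.S", "CM.S", "CS.S"]]
kappa = fleiss_kappa(toy, ACTS)
```

Values between 0.41 and 0.60 are conventionally read as moderate agreement, which is the band the reported 0.45 falls into.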
2.3 Insights from the labeled data
The frequencies of the dialogue acts in terms of percentages of the total number of utterances in the labeled data are CM.S 55.8%, CM.Q 11.7%, CM.A 12.2%, CS.S 12.4%, CS.Q 4.8%, CS.A 2%, and O 1.1%. In addition to the numbers, we also get further insights from the data that are instructive to our dialogue generation learning:
Context switch is a common skill to keep conversation going. In fact, we find that 78.2% of dialogues contain at least one CS.* act. The average number of turns of dialogues that contain at least one CS.* is 8.4, while the average number of turns of dialogues that do not contain a CS.* is 7. When dialogues are shorter than 5 turns, only 47% of them contain a CS.*, but when dialogues exceed 10 turns, more than 85% of them contain a CS.*. Because there are no specific goals in their conversations, people seldom stay long in one context. The average number of turns before a context switch is 3.39. We also observe consecutive context switches in many dialogues (43.7%). The numbers suggest that dialogue generation should combine smooth context switch with moderate context maintenance.
Question is an important building block in open domain conversation. In fact, 13.9% of CM.* acts are CM.Q, and the percentage is even higher in CS.*, at 20.27%. People need to ask questions in order to maintain contexts. The average number of turns of contexts with questions (i.e., consecutive CM.* with at least one CM.Q) is 3.92, while the average number of turns of contexts without questions is only 2.95. The observation indicates that a good dialogue model should be capable of asking questions properly, as suggested by Li et al. (2017a). A further step in studying human questioning behavior is to look into the types and functions of questions. We leave it as future work.
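Statistics of the kind reported above can be derived directly from act-tagged dialogues. The sketch below shows one possible way to do so, assuming each dialogue is represented simply as a list of act tags; the function names and the toy input are hypothetical and the exact counting conventions behind the reported numbers may differ.

```python
def is_switch(act):
    """True for context-switch acts (CS.S, CS.Q, CS.A)."""
    return act.startswith("CS.")


def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def context_switch_stats(dialogues):
    """Summarize context-switch and question usage over tagged dialogues.

    `dialogues` is a list of dialogues, each a list of act tags in turn order.
    """
    with_cs = [d for d in dialogues if any(is_switch(a) for a in d)]
    without_cs = [d for d in dialogues if not any(is_switch(a) for a in d)]
    cm_acts = [a for d in dialogues for a in d if a.startswith("CM.")]
    return {
        "ratio_with_cs": len(with_cs) / len(dialogues),
        "avg_turns_with_cs": mean([len(d) for d in with_cs]),
        "avg_turns_without_cs": mean([len(d) for d in without_cs]),
        "question_share_in_cm": mean([1.0 if a == "CM.Q" else 0.0 for a in cm_acts]),
    }


# Toy usage with two made-up dialogues.
stats = context_switch_stats([["CM.S", "CS.S", "CM.Q"], ["CM.S", "CM.A"]])
```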
The observations raise new challenges that are difficult for the existing end-to-end methods to tackle (e.g., smoothly interleaving context blocks with switch actions), and thus encourage us to create a new model. Before elaborating on the model, we first build a classifier that can automatically tag large scale dialogues with the dialogue acts.
Table 1: Definition of dialogue acts.
Table 2: Statistics of the experimental data sets.
2.4 Dialogue Act Classification
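We aim to learn a classifier $c$ from $\mathcal{D}_A = \{d_i\}_{i=1}^{N}$, where $d_i = \{(u_{i,1}, a_{i,1}), \ldots, (u_{i,n_i}, a_{i,n_i})\}$ represents a dialogue with $u_{i,k}$ the $k$-th utterance and $a_{i,k}$ the labeled dialogue act. Given a new dialogue $d = \{u_1, \ldots, u_n\}$, $c$ can sequentially tag the utterances in $d$ with dialogue acts by taking $u_i$, $u_{i+1}$, and the predicted $a_i$ as inputs and outputting a vector $c(u_i, u_{i+1}, a_i)$, with the $j$-th element representing the probability of $u_{i+1}$ being tagged as the $j$-th dialogue act.
We parameterize $c(\cdot,\cdot,\cdot)$ using neural networks. Specifically, $u_i$ and $u_{i+1}$ are first processed by bidirectional recurrent neural networks with gated recurrent units (biGRUs) (Chung et al., 2014) respectively. Then the last hidden states of the two biGRUs are concatenated with an embedding of $a_i$ and fed to a multi-layer perceptron (MLP) to calculate a dialogue act distribution. Formally, suppose that $u_i = (w_{i,1}, \ldots, w_{i,n})$ where $w_{i,j}$ is the embedding of the $j$-th word, then the $j$-th hidden state of the biGRU is given by $h_{i,j} = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]$, where $\overrightarrow{h}_{i,j}$ is the $j$-th state of a forward GRU, $\overleftarrow{h}_{i,j}$ is the $j$-th state of a backward GRU, and $[\cdot;\cdot]$ is a concatenation operator. $\overrightarrow{h}_{i,j}$ and $\overleftarrow{h}_{i,j}$ are calculated by

$$\overrightarrow{h}_{i,j} = \mathrm{GRU}_f(\overrightarrow{h}_{i,j-1}, w_{i,j}), \qquad \overleftarrow{h}_{i,j} = \mathrm{GRU}_b(\overleftarrow{h}_{i,j+1}, w_{i,j}). \tag{1}$$

Similarly, we have $h_{i+1,j}$ as the $j$-th hidden state of $u_{i+1}$. Let $e(a_i)$ be the embedding of $a_i$, then $c(u_i, u_{i+1}, a_i)$ is defined by a two-layer MLP:

$$c(u_i, u_{i+1}, a_i) = \mathrm{MLP}\big([h_{i,n}; h_{i+1,n}; e(a_i)]\big), \tag{2}$$

where we pad zeros for $u_0$ and $a_0$. We learn $c(\cdot,\cdot,\cdot)$ by minimizing cross entropy with $\mathcal{D}_A$. Let $p_j(a_{i,k})$ be the probability of $a_{i,k}$ being the $j$-th dialogue act and $c(u_{i,k-1}, u_{i,k}, a_{i,k-1})[j]$ be the $j$-th element of $c(u_{i,k-1}, u_{i,k}, a_{i,k-1})$, then the objective function of learning is formulated as

$$-\sum_{i=1}^{N} \sum_{k=1}^{n_i} \sum_{j} p_j(a_{i,k}) \log c(u_{i,k-1}, u_{i,k}, a_{i,k-1})[j]. \tag{3}$$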
We randomly split the labeled dialogues as 400/30/70 dialogues with 3280/210/586 utterances for training/validation/test. Details of model training are given in the Appendix. The learned classifier achieves an accuracy of 70.1% on the test data. We employ it to tag the training, validation, and test sets in Table 2.
Table 3: An example of dialogue with labeled acts.
Figure 1: Policy network and generation network.
3.1 Supervised Learning
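We aim to learn a dialogue generation model $g$ from $\mathcal{D} = \{d_i\}_{i=1}^{N}$, where $d_i = \{(u_{i,1}, a_{i,1}), \ldots, (u_{i,n_i}, a_{i,n_i})\}$ refers to a human-human dialogue with $u_{i,k}$ the $k$-th utterance and $a_{i,k}$ the dialogue act of $u_{i,k}$ tagged by the classifier in Section 2.4. Given $s_i = \{(u_1, a_1), \ldots, (u_{i-1}, a_{i-1})\}$ as a dialogue session, $g$ can generate a response as the next turn of the dialogue.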
Our dialogue model consists of a policy network and a generation network. A dialogue act is first selected from the policy network according to the conversation history, and then a response is generated from the generation network based on the conversation history and the dialogue act. Formally, the dialogue model can be formulated as

$$g(s_i) = p_d(a_i \mid s_i) \times p_r(r_i \mid s_i, a_i), \quad a_i \in \mathcal{A}, \tag{4}$$

where $a_i$ is the selected dialogue act for the $i$-th turn, and $r_i$ is the response. $p_d$ and $p_r$ are the policy network and the generation network respectively. $\mathcal{A}$ is the space of dialogue acts.
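The two-step structure of the dialogue model (first select an act, then generate a response conditioned on it) can be sketched as follows. This is only an illustration of the control flow: `toy_policy` and `toy_generator` are hypothetical stand-ins for the trained policy and generation networks, and greedy act selection is an assumption of this sketch rather than a statement about how acts are chosen in the experiments.

```python
ACTS = ["CM.S", "CM.Q", "CM.A", "CS.S", "CS.Q", "CS.A", "O"]


def respond(history, policy, generator, forced_act=None):
    """One turn of the dialogue model.

    history: list of (utterance, act) pairs, i.e. the session so far.
    policy: callable mapping history to a dict {act: probability}.
    generator: callable mapping (history, act) to a response string.
    forced_act: optional act chosen by an engineer instead of the policy,
        which is what makes the system controllable.
    """
    if forced_act is None:
        probs = policy(history)
        # Assumption: pick the most probable act (greedy selection).
        act = max(ACTS, key=lambda a: probs.get(a, 0.0))
    else:
        act = forced_act
    return act, generator(history, act)


def toy_policy(history):
    """Hypothetical policy: switch context after two context-maintaining turns."""
    recent = [act for _, act in history[-2:]]
    if len(recent) == 2 and all(a.startswith("CM.") for a in recent):
        return {"CS.Q": 0.6, "CM.S": 0.4}
    return {"CM.S": 0.7, "CS.S": 0.3}


def toy_generator(history, act):
    """Hypothetical generator: returns a placeholder instead of real text."""
    return f"(response conditioned on {act})"


history = [("hi", "CM.S"), ("hello", "CM.S")]
chosen_act, reply = respond(history, toy_policy, toy_generator)
forced_act, forced_reply = respond(history, toy_policy, toy_generator, forced_act="CM.Q")
```

Passing `forced_act` corresponds to picking responses from a desired act, while leaving it unset lets the policy manage the flow of the conversation.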
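Figure 1(b) shows the architecture of the policy network. The utterance sequence and the act sequence are encoded with a hierarchical encoder and a GRU encoder respectively. Then, the last hidden states of the two encoders are concatenated and fed to an MLP to calculate a probability distribution of dialogue acts for the next turn. Formally, each utterance $u_j$ in $s_i$ is first transformed to hidden vectors $(h_{j,1}, \ldots, h_{j,n})$ through a biGRU parameterized as Equation (1). Then, $(h_{1,n}, \ldots, h_{i-1,n})$ is processed by a GRU parameterized as $h^s_j = \mathrm{GRU}_s(h^s_{j-1}, h_{j,n})$. In parallel, $(a_1, \ldots, a_{i-1})$ is transformed to $(h^a_1, \ldots, h^a_{i-1})$ by another GRU. $p_d(a_i \mid s_i)$ is then defined by

$$p_d(a_i \mid s_i) = \mathrm{MLP}\big([h^s_{i-1}; h^a_{i-1}]\big). \tag{5}$$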
We build the generation network in a sequence-to-sequence framework. Here, we simplify $p_r(r_i \mid s_i, a_i)$ by conditioning only on $a_i$ and the most recent utterances in $s_i$, since decoding natural language responses from long conversation history is challenging. Figure 1(a) illustrates the architecture of the generation network. The only difference from the standard encoder-decoder architecture with an attention mechanism (Bahdanau et al., 2015) is that in encoding, we concatenate these utterances as a long sentence, and attach $a_i$ to the top of the long sentence as a special word. The technique here is similar to that in zero-shot machine translation (Johnson et al., 2016). Formulation details are given in the Appendix.
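The encoder input described above (context utterances concatenated into one long sentence, with the dialogue act attached at the top as a special word) can be illustrated with a few lines of Python. The function name `build_encoder_input` and the bracketed token format are assumptions of this sketch; the source does not specify how the special word is spelled.

```python
def build_encoder_input(context_utterances, act):
    """Prepend the act as a special token to the concatenated context.

    context_utterances: list of tokenized utterances (lists of words),
        oldest first.
    act: dialogue act tag for the response to be generated, e.g. "CS.Q".
    """
    tokens = [f"[{act}]"]  # hypothetical spelling of the special word
    for utterance in context_utterances:
        tokens.extend(utterance)
    return tokens


# The same context yields different encoder inputs for different acts,
# which is how the act steers the type of the generated response.
context = [["do", "you", "like", "tea"], ["yes", "i", "do"]]
as_question = build_encoder_input(context, "CS.Q")
as_statement = build_encoder_input(context, "CM.S")
```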
The dialogue model is then learned by minimizing the negative log likelihood of $\mathcal{D}$:

$$-\sum_{i=1}^{N} \sum_{j=1}^{n_i} \Big[ \log p_d(a_{i,j} \mid s_{i,j}) + \log p_r(u_{i,j} \mid s_{i,j}, a_{i,j}) \Big], \tag{6}$$

where $s_{i,j} = \{(u_{i,k}, a_{i,k})\}_{k=1}^{j-1}$. Through supervised learning, we fit the dialogue model to human-human interactions in order to learn their conversational patterns and human language. However, supervised learning does not explicitly encourage long-term conversation (e.g., 45.35% of the dialogues in our training set are no more than 5 turns), and the policy network is optimized without awareness of what is going to happen in the future when a dialogue act is selected. This motivates us to further optimize the model through a reinforcement learning approach.
3.2 Reinforcement Learning
We optimize the dialogue model through self-play (Li et al., 2016b; Lewis et al., 2017) where we let two models learned with the supervised approach talk to each other in order to improve their performance. In the simulation, a dialogue is initialized with a message sampled from the training set. Then, the two models continue the dialogue by alternately taking the conversation history as an input and generating a response (top one in beam search) until T turns (T = 20 in our experiments).
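To speed up training and avoid generated responses diverging from human language, we fix the generation network and only optimize the policy network by reinforcement learning. Thus, the policy in learning is naturally defined by the policy network $p_d(a_i \mid s_i)$, with $s_i$ a state and $a_i$ an action. We define a reward function

$$r(a_i, s_i) = \alpha \, \mathbb{E}[\mathrm{len}(a_i, s_i)] + \beta \, \mathbb{E}[\mathrm{rel}(a_i, s_i)], \tag{7}$$

where $\mathbb{E}[\mathrm{len}(a_i, s_i)]$ is the expected dialogue length after taking $a_i$ under $s_i$, $\mathbb{E}[\mathrm{rel}(a_i, s_i)]$ is the expected response relevance within the conversation, and $\alpha$ and $\beta$ weight the two terms. Through Equation (7), we try to encourage actions that can lead to long (measured by $\mathbb{E}[\mathrm{len}(a_i, s_i)]$) and reasonable (measured by $\mathbb{E}[\mathrm{rel}(a_i, s_i)]$) conversations.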
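To estimate $\mathbb{E}[\mathrm{len}(a_i, s_i)]$ and $\mathbb{E}[\mathrm{rel}(a_i, s_i)]$, we fix $(s_i, a_i)$ and construct a dialogue set $\mathcal{D}(a_i, s_i) = \{d_j\}_{j=1}^{M}$ by sampling after $(s_i, a_i)$ with self-play, where every subsequent response is randomly sampled from the top 5 beam search results of $p_r$ according to Equation (4). Inspired by (Li et al., 2016b), we terminate a simulated dialogue if (1) the representations $e(\cdot)$ of three consecutive utterances are highly similar, (2) the representations of the utterances that one agent gives in two consecutive turns are highly similar, or (3) the length of the dialogue reaches $T$, where $e(\cdot)$ denotes the representation of an utterance given by the encoder of $p_r$. Condition (1) means three consecutive turns are (semantically) repetitive, and Condition (2) means one agent gives repetitive responses in two consecutive turns. Both conditions indicate a high probability that the conversation falls into a bad infinite loop.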
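The termination check for a simulated dialogue can be sketched as below. Cosine similarity and the `threshold` parameter are assumptions of this sketch (the concrete similarity measure and cut-off are not recoverable from the text above), and the utterance representations are passed in as plain lists of floats rather than real encoder states.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_terminate(reps, max_turns, threshold):
    """Decide whether a simulated dialogue should stop.

    reps: utterance representations produced so far, in turn order.
    max_turns: the maximum dialogue length T.
    threshold: hypothetical similarity cut-off above which two utterances
        are treated as repetitive.
    """
    n = len(reps)
    # Condition (3): the dialogue has reached the maximum length.
    if n >= max_turns:
        return True
    if n >= 3:
        # Condition (1): three consecutive turns are repetitive.
        if (cosine(reps[-3], reps[-2]) > threshold
                and cosine(reps[-2], reps[-1]) > threshold):
            return True
        # Condition (2): the same agent repeats itself. With two agents
        # alternating, an agent's consecutive turns are two positions apart.
        if cosine(reps[-3], reps[-1]) > threshold:
            return True
    return False
```

For example, with `threshold=0.9` a dialogue whose last three representations are `[1, 0]`, `[0, 1]`, `[1, 0]` is stopped by Condition (2), because the first and third utterances come from the same agent and are identical.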
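$\mathbb{E}[\mathrm{len}(a_i, s_i)]$ and $\mathbb{E}[\mathrm{rel}(a_i, s_i)]$ are then estimated over the $M$ simulated dialogues by

$$\mathbb{E}[\mathrm{len}(a_i, s_i)] \approx \frac{1}{M} \sum_{j=1}^{M} \mathrm{len}(d_j), \qquad \mathbb{E}[\mathrm{rel}(a_i, s_i)] \approx \frac{1}{M} \sum_{j=1}^{M} \frac{1}{\mathrm{len}(d_j)} \sum_{k} m(c_{j,k}, u_{j,k}), \tag{8}$$

where $d_j$ is the $j$-th simulated dialogue after $(s_i, a_i)$, $u_{j,k}$ is its $k$-th utterance with $c_{j,k}$ the preceding context, and $m(\cdot,\cdot)$ is the dual LSTM model proposed in (Lowe et al., 2015) which measures the relevance between a response and a context. We train $m(\cdot,\cdot)$ with the 30 million crawled data through negative sampling. The objective of learning is to maximize the expected future reward:

$$J(\theta) = \mathbb{E}_{p_d}\Big[\sum_{i=1}^{T} r(a_i, s_i)\Big]. \tag{9}$$

The gradient of the objective is calculated by the REINFORCE algorithm (Williams, 1992):

$$\nabla_\theta J(\theta) \approx \sum_{i=1}^{T} \nabla_\theta \log p_d(a_i \mid s_i) \Big[\sum_{k=i}^{T} r(a_k, s_k) - b\Big], \tag{10}$$

where $\theta$ denotes the parameters of the policy network and $b$ is an empirically set baseline.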
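To make the learning step more concrete, the sketch below estimates a reward from self-play rollouts and applies one REINFORCE update with a baseline to a softmax policy over acts. It is a simplified stand-in under stated assumptions: the policy is reduced to a vector of logits instead of the policy network, the reward is assumed to combine the two estimates linearly with hypothetical weights `alpha` and `beta`, and the learning rate and baseline values are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_reward(rollouts, alpha, beta):
    """Monte Carlo reward estimate for one (state, act) pair.

    rollouts: list of (dialogue_length, mean_relevance) pairs, one per
        simulated dialogue.
    alpha, beta: hypothetical weights of the length and relevance terms.
    """
    n = len(rollouts)
    expected_len = sum(length for length, _ in rollouts) / n
    expected_rel = sum(rel for _, rel in rollouts) / n
    return alpha * expected_len + beta * expected_rel


def reinforce_step(logits, action, future_return, baseline, lr):
    """One gradient-ascent step on log pi(action) * (return - baseline).

    For a softmax policy, the gradient of log pi(action) with respect to
    the logits is one_hot(action) - pi.
    """
    probs = softmax(logits)
    advantage = future_return - baseline
    return [
        x + lr * advantage * ((1.0 if k == action else 0.0) - p)
        for k, (x, p) in enumerate(zip(logits, probs))
    ]


reward = estimate_reward([(10, 0.5), (6, 0.3)], alpha=1.0, beta=1.0)
updated = reinforce_step([0.0, 0.0], action=0, future_return=2.0, baseline=1.0, lr=1.0)
```

An act whose return exceeds the baseline has its logit pushed up, so acts that tend to lead to longer and more relevant simulated conversations become more likely.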
4.1 Experiment Setup
Our experiments are conducted with the data in Table 2. The following methods are employed as baselines: (1) S2SA: sequence-to-sequence with attention (Bahdanau et al., 2015) in which utterances in contexts are concatenated as a long sentence. We use the implementation with Blocks (https:
(2) HRED: the hierarchical encoder-decoder model in (Serban et al., 2016) implemented with the source code available at (https://github.com/julianser/hed-dlg-truncated); (3) VHRED: the hierarchical latent variable encoder-decoder model in (Serban et al., 2017b) implemented with the source code available at (https://github.com/julianser/hed-dlg-truncated); and (4) RL-S2S: dialogue generation with reinforcement learning (Li et al., 2016b). We implement the algorithm by finishing the code at (https:
All baselines are implemented with the configurations recommended in the literature. We denote our Dialogue Act aware Generation Model with only Supervised Learning as SL-DAGM, and the full model (supervised learning + reinforcement learning) as RL-DAGM. Implementation details are given in the Appendix.
4.2 Response Generation for Given Contexts
The first experiment checks whether the proposed models can generate high-quality responses for given contexts. To this end, we take the last turn of each test dialogue as ground truth, and feed the previous turns as a context to different models for response generation. Top one responses from beam search (beam size = 20) of different models are collected, randomly shuffled, and presented to 3 native speakers to judge their quality. Each response is rated by the three annotators under the following criteria: 2: the response is not only relevant and natural, but also informative and interesting; 1: the response can be used as a reply, but might not be informative enough (e.g., “Yes, I see”); 0: the response makes no sense, is irrelevant, or is grammatically broken.
Table 4 (a) summarizes the annotation results. Improvements from our models over the baseline methods are statistically significant (t-test, p-value < 0.01). Besides human annotations, we also compare different models using automatic metrics with the ground truth. These metrics include BLEU (Papineni et al., 2002), embedding based metrics (Liu et al., 2016) such as Embedding Average (Average), Embedding Extrema (Extrema), and Embedding Greedy (Greedy), and ratios of distinct unigrams (distinct-1) and bigrams (distinct-2) in the generated responses, which are employed in (Li et al., 2015) to measure response diversity. Table 5 reports the results.
Table 4: Evaluation Results
(a) Human annotations. Ratios are calculated by combining labels from the three judges.
We can see that diversity of responses is significantly improved with the dialogue acts. This is supported by the many more responses rated 2 from the two models in Table 4 (a) and the significant improvement on distinct n-grams in Table 5. The reason is that we search for a response not only in a language space, but also in an act space. The dimension of dialogue acts provides further variations to the generated responses. On the other hand, due to the diversity, responses from our models may sometimes diverge from the ground truth. This is why improvements on other automatic metrics are not significant. To further explain the advantages of our models, we show an example in Table 6. Besides responses from the dialogue acts selected by our models, we also show responses from other reasonable but not selected acts. With the dialogue acts, the generated responses become really rich, from confirmation (CM.Q) to an open question (CS.Q) and then to a long informative statement (CS.S). More importantly, the dialogue acts let us know why we have such responses: both SL-DAGM and RL-DAGM try to switch to new topics (e.g., Xiamen, noodle, and plan) in order to continue the conversation. One can also change the flow of the conversation by picking responses from other dialogue acts. The example demonstrates that besides good performance, our models enjoy good interpretability and controllability as well. We show more such examples in the Appendix.
To further understand how the dialogue acts affect response generation, we collect generated responses from a specific dialogue act for the contexts of the test dialogues, and characterize the responses with the following metrics: (1) distinct-1 and distinct-2; (2) words out of context (OOC): ratio of words that are in the generated responses but not contained by the contexts; and (3) average length of the generated responses (Ave Len).
Table 5: Automatic evaluation results. Numbers in bold mean that improvement from the model on that metric is statistically significant over the baseline methods (t-test, p-value < 0.01).
Table 6: An example of response generation. Utterances in the context are split by “
Table 7 reports the results. In general, responses generated from CS.* are longer, more informative, and contain more new words than responses generated from CM.*, as illustrated in Table 6. Another interesting finding is that statements and answers are generally more informative than questions in both CS.* and CM.*. In addition to these metrics, we also calculate BLEU scores and embedding based metrics, but do not observe significant differences among responses from different dialogue acts. The reason might be that these metrics are based on comparison of the generated responses with human responses, but human responses in the test set are inherently a mixture of responses from different dialogue acts.
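The three response characteristics used above (distinct-n, OOC, and average length) are simple to compute from tokenized text. The following sketch shows one way to do it; the function names are hypothetical, and normalizing distinct-n by the total number of n-grams is an assumption of this sketch, since normalization conventions vary across implementations.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams in the generated responses.

    responses: list of tokenized responses (lists of words).
    """
    ngrams = [
        tuple(resp[i:i + n])
        for resp in responses
        for i in range(len(resp) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def ooc_ratio(contexts, responses):
    """Words out of context: share of response words absent from the context.

    contexts: list of tokenized contexts, aligned with `responses`.
    """
    total = 0
    new = 0
    for context, resp in zip(contexts, responses):
        context_words = set(context)
        total += len(resp)
        new += sum(1 for word in resp if word not in context_words)
    return new / total if total else 0.0


def average_length(responses):
    """Average number of words per generated response."""
    return sum(len(resp) for resp in responses) / len(responses)


# Toy usage on made-up data.
toy_contexts = [["do", "you", "like", "tea"], ["what", "about", "coffee"]]
toy_responses = [["i", "like", "tea"], ["i", "like", "coffee"]]
d1 = distinct_n(toy_responses, 1)
d2 = distinct_n(toy_responses, 2)
ooc = ooc_ratio(toy_contexts, toy_responses)
ave_len = average_length(toy_responses)
```

Grouping the generated responses by dialogue act before calling these functions gives a per-act breakdown of the kind summarized in Table 7.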
4.3 Engagement Test
Secondly, we study conversation engagement with the proposed models. Experiments are conducted through machine-machine simulation and human-machine conversation. In both experiments, we compare SL-DAGM and RL-DAGM with RL-S2S, as RL-S2S is the only baseline optimized for future success. Responses from all models are randomly sampled from the top 5 beam search results. Average length of dialogues is employed as an evaluation metric as in (Li et al., 2016b).
Machine-machine simulation is conducted in a way similar to (Li et al., 2016b), in which we let two bots equipped with the same model talk with each other in 1000 simulated dialogues. Each dialogue is initialized with the first utterance of a test example, and terminated according to the termination conditions for reward estimation in Section 3.2. In human-machine conversation, we recruit 5 native speakers as testers and ask them to talk with the bots equipped with the three models. Every time, a bot is randomly picked for a tester, and the tester does not know which model is behind it. Every tester finishes 100 dialogues with each bot. To make a fair comparison, we let the bots start dialogues. A starting message in a dialogue is randomly sampled from the test data and copied 3 times for all the 3 bots. A dialogue is terminated if (1) the tester thinks the conversation cannot be continued (e.g., due to bad relevance or repetitive content); or (2) the bot gives repetitive responses in two consecutive turns. The evaluation metric is calculated with the total 500 dialogues for each model.
Table 4 (b) reports the evaluation results. In both experiments, SL-DAGM and RL-DAGM can lead to longer conversations, and the improvements from both models over the baseline are statistically significant (t-test, p-value < 0.01). Improvements in human-machine conversation are smaller than those in machine-machine simulation, indicating the gap between the simulation environment and the real conversation environment and encouraging us to consider online optimization in human-machine conversations in the future. RL-DAGM is better than SL-DAGM in both experiments, indicating the efficacy of reinforcement learning. In addition to the overall average length, we also show the distributions of average length of dialogues across different testers in human-machine conversation in Figure 2. Although there exists variance among the testers, the overall trend is consistent with the numbers in Table 4 (b).
Table 7: Characteristics of the generated responses from different dialogue acts.
Figure 2: Average dialogue length of human-machine conversation in terms of different testers.
The reason that our models are better is that they capture conversational patterns in human-human interactions and are further optimized through reinforcement learning. First, the models can proactively switch contexts in a smooth way. In machine-machine simulation, 65.4% (SL) and 94.4% (RL) of dialogues contain at least one CS.*; and in human-machine conversation, the two percentages are 38.1% (SL) and 48.1% (RL) respectively. More interestingly, in machine-machine simulation, average lengths of dialogues without CS.* are only 4.78 (SL) and 2.67 (RL) respectively, which are comparable with or even worse than RL-S2S, while average lengths of dialogues with CS.* are 8.66 (SL) and 8.18 (RL) respectively. The results demonstrate the importance of context switch for engagement in open domain conversation, and one significant effect of RL is promoting context switch in interactions for future engagement, even at the cost of relevance in the current turn (e.g., more 0 responses than SL-DAGM in Table 4 (a)). Second, the models can drive conversations by asking questions. In machine-machine simulation, 36.5% (SL) and 32.4% (RL) of dialogues contain at least one question. The percentages in human-machine conversation are 17.7% (SL) and 22.3% (RL) respectively. We show examples of machine-machine simulation and human-machine conversation in the Appendix.
A common practice for building an open domain dialogue model is to learn a generative model in an end-to-end fashion. On top of the basic sequence-to-sequence with attention architecture (Vinyals and Le, 2015; Shang et al., 2015), various extensions have been proposed to tackle the “safe response” problem (Li et al., 2015; Mou et al., 2016; Xing et al., 2017a); to model complicated structures of conversational contexts (Serban et al., 2016; Sordoni et al., 2015; Xing et al., 2017b); to bias responses to some specific persona or emotions (Li et al., 2016a; Zhou et al., 2017); and to pursue better optimization strategies (Li et al., 2017b, 2016b). In this work, we consider open domain dialogue generation with dialogue acts. Unlike task-oriented dialogue systems (Young et al., 2013; Wen et al., 2016), where task specific dialogue acts have been extensively applied for dialogue management, little work on open domain dialogue modeling takes dialogue acts into account. Most of the existing work stops at performing utterance classification or clustering (Kim et al., 2010, 2012; Ivanovic, 2005; Wallace et al., 2013; Ritter et al., 2010). Recently, Zhao et al. (2017) incorporate dialogue acts in the Switchboard Corpus as prior knowledge into dialogue generation. Serban et al. (2017a) leverage dialogue acts as features in their response selection model. Our work is unique in that we design special dialogue acts to explain social interactions, control open domain response generation, and thus guide human-machine conversations.
We design dialogue acts to describe human behavior in social interactions and propose open domain dialogue generation with the dialogue acts as policies. The dialogue model is learned through a supervised learning approach and a reinforcement learning approach. Empirical studies show that the proposed models can significantly outperform state-of-the-art methods in terms of both response quality and user engagement.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Mark G Core and James Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In AAAI fall symposium on communicative action in humans and machines. Boston, MA, volume 56.
Edward Ivanovic. 2005. Dialogue act tagging for instant messaging chat sessions. In Proceedings of the ACL Student Research Workshop. Association for Computational Linguistics, pages 79–84.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2016. Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558.
Daniel Jurafsky, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard-DAMSL labeling project coder's manual. Tech. Rep. 97-02.
Su Nam Kim, Lawrence Cavedon, and Timothy Baldwin. 2010. Classifying dialogue acts in one-on-one live chats.
Su Nam Kim, Lawrence Cavedon, and Timothy Baldwin. 2012. Classifying dialogue acts in multi-party live chats. In PACLIC. pages 463–472.
Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? end-to-end learning of negotiation dialogues. In EMNLP. pages 2433–2443.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. NAACL.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. ACL.
Jiwei Li, Alexander H Miller, Sumit Chopra, Jason Weston, et al. 2017a. Learning through dialogue interactions by asking questions. ICLR.
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017b. Adversarial learning for neural dialogue generation. EMNLP.
Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. EMNLP.
Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.
Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL. Association for Computational Linguistics, pages 311–318.
Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. arXiv preprint arXiv:1801.03604.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 172–180.
Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. 2017a. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349.
Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. End-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. pages 3776–3784.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI. pages 3295–3301.
Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL) pages 1577–1586.
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.
Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2006. Dialogue act modeling for automatic tagging and recognition of conversational speech. Dialogue 26(3).
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
Byron C Wallace, Thomas A Trikalinos, M Barton Laws, Ira B Wilson, and Eugene Charniak. 2013. A generative joint, additive, sequential model of topics and speech acts in patient-doctor communication. In EMNLP. pages 1765–1775.
Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562.
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017a. Topic aware neural response generation. In AAAI. pages 3351–3357.
Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, and Wei-Ying Ma. 2017b. Hierarchical recurrent attention network for response generation. arXiv preprint arXiv:1701.07149.
Stephanie Young, Milica Gasic, Blaise Thomson, and John D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179.
Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2017. Emotional chatting machine: Emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.
7.1 Generation Network
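Suppose that the encoder input, formed from the dialogue act $a_i$ and the context $s_i$, is $(w_{i,1}, \ldots, w_{i,n_i})$, where $w_{i,k}$ is the embedding of the $k$-th word, then the $k$-th hidden state of the encoder is given by

$$v_{i,k} = [\overrightarrow{v}_{i,k}; \overleftarrow{v}_{i,k}],$$

where $\overrightarrow{v}_{i,k}$ and $\overleftarrow{v}_{i,k}$ are the $k$-th hidden states of a forward GRU and a backward GRU respectively. Positions of $k > n_i$ are padded with zeros. Let $r_i = (w'_{i,1}, \ldots, w'_{i,T})$, then in decoding the $j$-th word $w'_{i,j}$, $(v_{i,1}, \ldots, v_{i,n_i})$ is summarized as a context vector $c_{i,j}$ through an attention mechanism:

$$c_{i,j} = \sum_{k=1}^{n_i} \alpha_{j,k} v_{i,k}, \qquad \alpha_{j,k} = \frac{\exp(e_{j,k})}{\sum_{k'=1}^{n_i} \exp(e_{j,k'})}, \qquad e_{j,k} = u^\top \tanh(W_\alpha [v_{i,k}; h_{i,j-1}]),$$

where $u$ and $W_\alpha$ are parameters, and $h_{i,j-1}$ is the $(j-1)$-th hidden state of the decoder GRU in which $h_{i,j}$ is calculated by

$$h_{i,j} = \mathrm{GRU}(h_{i,j-1}, [w'_{i,j-1}; c_{i,j}]).$$

The generation probability of $w'_{i,j}$ is then defined as

$$p(w'_{i,j} \mid w'_{i,1}, \ldots, w'_{i,j-1}, a_i, s_i) = I(w'_{i,j})^\top \mathrm{softmax}(W_o [h_{i,j}; c_{i,j}]),$$

where $W_o$ is an output projection and $I(w'_{i,j})$ is a vector with only one element 1 indicating the index of $w'_{i,j}$ in the vocabulary. $p(r_i \mid a_i, s_i)$ is finally defined as

$$p(r_i \mid a_i, s_i) = \prod_{j=1}^{T} p(w'_{i,j} \mid w'_{i,1}, \ldots, w'_{i,j-1}, a_i, s_i).$$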
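As a rough illustration of the attention step described in this subsection, the following sketch computes attention weights and a context vector for a single decoding step. It assumes an additive (tanh-based) scoring function; the names `attention_context`, `W_enc`, `W_dec`, and `v`, as well as the toy dimensions, are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def attention_context(enc_states, dec_prev, W_enc, W_dec, v):
    """Summarize encoder states into a context vector (assumed additive scoring)."""
    dec_proj = matvec(W_dec, dec_prev)
    scores = []
    for state in enc_states:
        enc_proj = matvec(W_enc, state)
        hidden = [math.tanh(a + b) for a, b in zip(enc_proj, dec_proj)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * s[d] for w, s in zip(weights, enc_states)) for d in range(dim)]
    return weights, context


# Toy sizes for illustration only (the paper uses 1024-dimensional hidden vectors).
random.seed(0)
n_words, d_enc, d_dec, d_att = 5, 8, 6, 4
rand_vec = lambda size: [random.gauss(0.0, 1.0) for _ in range(size)]
enc_states = [rand_vec(d_enc) for _ in range(n_words)]
dec_prev = rand_vec(d_dec)
W_enc = [rand_vec(d_enc) for _ in range(d_att)]
W_dec = [rand_vec(d_dec) for _ in range(d_att)]
v = rand_vec(d_att)

weights, context = attention_context(enc_states, dec_prev, W_enc, W_dec, v)
```

The weights form a probability distribution over the encoder positions, and the context vector has the same dimensionality as one encoder state.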
7.2 Implementation Details of the Dialogue Act Classifier
We randomly split the 500 labeled dialogues into 400, 30, and 70 dialogues for training, validation, and test respectively. The three sets contain 3280, 210, and 586 utterances respectively. In training, we represent dialogue acts as probability distributions by averaging the labels given by the three annotators. For example, if an utterance is labeled as “CM.S”, “CM.S”, and “CS.S”, then the probability distribution is (0.67, 0, 0, 0.33, 0, 0, 0). In test, we predict the dialogue act of an utterance as the act with the highest predicted probability. To avoid overfitting, we pre-train word embeddings using word2vec with an embedding size of 200 on the 30 million data and fix them in training. We set the embedding size of the dialogue acts and the hidden state size of the biGRUs as 100, and the dimensions of the first layer and the second layer of the MLP as 200 and 7 respectively. We optimize the objective function (i.e., Equation (3) in the submission) using back-propagation, and the parameters are updated by stochastic gradient descent with the AdaDelta algorithm (Zeiler, 2012). The best performing model on the validation data is selected for test.
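The label averaging described above can be sketched as follows. The ordering of the seven acts in `ACTS` is an assumption inferred from the example distribution (only the positions of “CM.S” and “CS.S” are fixed by the text); the remaining act names, the helper names `soft_label` and `predict_act`, and the arg-max decision rule at test time are likewise illustrative assumptions.

```python
from collections import Counter

# Assumed act inventory and order; only "CM.S" (index 0) and "CS.S" (index 3)
# are confirmed by the example in the text.
ACTS = ["CM.S", "CM.Q", "CM.A", "CS.S", "CS.Q", "CS.A", "O"]


def soft_label(annotations, acts=ACTS):
    """Average the annotators' labels into a probability distribution over acts."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[act] / total for act in acts]


def predict_act(probs, acts=ACTS):
    """Pick the act with the highest probability (assumed test-time rule)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return acts[best]


dist = soft_label(["CM.S", "CM.S", "CS.S"])
print([round(p, 2) for p in dist])  # [0.67, 0.0, 0.0, 0.33, 0.0, 0.0, 0.0]
print(predict_act(dist))            # CM.S
```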
7.3 Implementation Details of the Dialogue Model
In learning of the generation network, we set the size of word embedding as 620 and the size of hidden vectors as 1024 in both the encoder and the decoder. Both the encoder vocabulary and the decoder vocabulary contain 30,000 words. Words out of the vocabularies are replaced by a special token “UNK”. We employ the AdaDelta algorithm (Zeiler, 2012) to train the generation network with a batch size of 128. We set the initial learning rate as 1.0 and reduce it by half if perplexity on validation begins to increase. We stop training if the perplexity on validation keeps increasing in two successive epochs.
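The learning-rate and stopping rule for the generation network can be sketched as below. This is one possible reading, assuming that "begins to increase" means the validation perplexity is higher than in the previous epoch; the function name `run_schedule` and the perplexity values are made up for illustration.

```python
def run_schedule(val_perplexities, initial_lr=1.0, patience=2):
    """Halve the learning rate on each increase in validation perplexity and
    stop after `patience` successive increases (assumed interpretation)."""
    lr = initial_lr
    consecutive_increases = 0
    previous = None
    log = []
    for epoch, ppl in enumerate(val_perplexities, start=1):
        if previous is not None and ppl > previous:
            lr *= 0.5
            consecutive_increases += 1
        else:
            consecutive_increases = 0
        log.append((epoch, lr))
        if consecutive_increases >= patience:
            break  # perplexity increased in two successive epochs
        previous = ppl
    return log


# Hypothetical validation perplexities, one per epoch.
log = run_schedule([50.0, 40.0, 42.0, 41.0, 43.0, 44.0, 39.0])
print(log)
# [(1, 1.0), (2, 1.0), (3, 0.5), (4, 0.5), (5, 0.25), (6, 0.125)]
```

In this toy run, training stops after epoch 6, so the seventh value is never reached.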
In learning of the policy network, we set the size of word embedding, the size of dialogue act, and the size of hidden states of the biGRU as 100. There are 50 neurons in the first layer of the MLP and 7 neurons in the second layer of the MLP. Vectors in the policy network have smaller sizes than those in the generation network because the complexity of dialogue act prediction is much lower than that of language generation.
In reinforcement learning, the minibatch size is 60 and the learning rate is fixed at 0.05. To estimate the reward, we train a dual LSTM (Lowe et al., 2015) with the size of word embedding and the size of hidden states as 100. Responses from the simulated dialogues are generated with a beam size of 20.
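To make the roles of the minibatch size and the learning rate concrete, here is a minimal sketch of one policy-gradient update over the 7 dialogue acts. It assumes a REINFORCE-style estimator (Williams, 1992) and, for brevity, a context-independent softmax policy parameterized directly by logits, whereas the actual policy network conditions on the dialogue context; `reinforce_step` and the constant rewards are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled_acts, rewards, lr=0.05):
    """One gradient-ascent step on the expected reward of a softmax policy.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for act, reward in zip(sampled_acts, rewards):
        for k in range(len(logits)):
            indicator = 1.0 if k == act else 0.0
            grad[k] += reward * (indicator - probs[k])
    batch_size = len(sampled_acts)
    return [x + lr * g / batch_size for x, g in zip(logits, grad)]


# Toy minibatch of 60 samples in which act index 3 always earns reward 1.0.
old_logits = [0.0] * 7
new_logits = reinforce_step(old_logits, [3] * 60, [1.0] * 60, lr=0.05)
new_probs = softmax(new_logits)
```

After the step, the probability of the rewarded act rises above its uniform starting value of 1/7, while the other acts lose probability mass.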
In RL-S2S, we define 8 responses as dull responses according to the frequency of responses in the training set. Table 8 gives the responses.
7.4 More Examples of Response Generation
We compare SL-DAGM and RL-DAGM with baseline models in terms of response quality for given contexts with more examples in Table 9.
7.5 Examples in Engagement Test
Table 10 gives some examples of machine-machine simulation. Unlike the dialogues from RL-S2S, which quickly converge to loops, dialogues from our models move forward smoothly under the management of the dialogue acts. The dialogue acts let us know why such responses are generated and make the simulated dialogues closer to human dialogues, with moderate context continuation and jumping out of the contexts at proper timing. Table 11 and Table 12 show some examples from the test of human-machine conversation. We denote a machine turn as “M” and a human turn as “H”. After each example, we give the reason for termination, in which “EOD-H” means the dialogue is terminated by the tester and “EOD-R” means the dialogue is terminated by the repetition check, with the next generated turn attached. Compared to dialogues with the baseline, dialogues with our models can go deeper with much richer content, although a side effect is that sometimes responses from CS.* might be nonsense (e.g., the first example of SL-DAGM). This sheds light on our future direction to further improve the generation network with knowledge.
Table 8: Dull responses for learning RL-S2S.
Table 9: More examples of response generation. Utterances in the context are split by “
Table 10: Comparison of simulated dialogues from different models.
Table 11: Example 1 of human-machine conversation. “M” means a machine turn, and “H” means a human turn.
Table 12: Example 2 of human-machine conversation. “M” means a machine turn, and “H” means a human turn.