Adaptive Parameterization for Neural Dialogue Generation

2020·Arxiv

Abstract

Neural conversation systems generate responses based on the sequence-to-sequence (SEQ2SEQ) paradigm. Typically, the model is equipped with a single set of learned parameters to generate responses for given input contexts. When confronting diverse conversations, its adaptability is rather limited and the model is hence prone to generate generic responses. In this work, we propose an Adaptive Neural Dialogue generation model, ADAND, which manages various conversations with conversation-specific parameterization. For each conversation, the model generates parameters of the encoder-decoder by referring to the input context. In particular, we propose two adaptive parameterization mechanisms: a context-aware and a topic-aware parameterization mechanism. The context-aware parameterization directly generates the parameters by capturing local semantics of the given context. The topic-aware parameterization enables parameter sharing among conversations with similar topics by first inferring the latent topics of the given context and then generating the parameters with respect to the distributional topics. Extensive experiments conducted on a large-scale real-world conversational dataset show that our model achieves superior performance in terms of both quantitative metrics and human evaluations.

1 Introduction

Open-domain dialogue models, usually called chitchat systems, have drawn increasing attention from both academia and industry in recent years. Building on the successful sequence-to-sequence (SEQ2SEQ) paradigm (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), contemporary mainstream open-domain dialogue generation models (Shang et al., 2015; Serban et al., 2016, 2017; Shen et al., 2017; Clark and Cao, 2017; Zhao et al., 2017; Xing et al., 2017; Zhao et al., 2018), trained on a large number of context-response pairs, attempt to generate an appropriate response for the given context based on a single set of model parameters.

Because of its great potential in understanding and modeling conversations, SEQ2SEQ has been widely applied in different kinds of conversation scenarios, including technical support, movie discussions, and social entertainment. However, when confronting conversations with diverse topics or themes, SEQ2SEQ is prone to generating generic, meaningless responses due to its over-simplified parameterization. To tackle this issue, Xing et al. (2017) proposed a topic-aware response generation model, which utilizes a single encoder-decoder augmented with topic information obtained from a pre-trained LDA model. Though effective, the model relies heavily on external topic information to capture the topic variations of different conversations. Another approach, the per-topic/theme encoder-decoder model (Choudhary et al., 2017), uses a separate encoder-decoder model for each topic or theme. This method needs pre-organized topic/theme annotations for each conversation, which are prohibitively expensive to obtain. Furthermore, building multiple separate topic/theme-specific encoder-decoders not only weakens the applicability and efficiency of the system, but also prevents parameter sharing across domains, which leads to overparameterization due to the excessive number of model parameters.

To gather the benefits of both approaches, in this paper we propose an adaptive neural dialogue generation model that utilizes a single encoder-decoder for diverse conversations, while the encoder-decoder is specifically parameterized according to each conversation. In particular, we propose two adaptive parameterization mechanisms: 1) context-aware parameterization directly generates parameters of the encoder-decoder model by capturing local semantics of the given context; 2) topic-aware parameterization enables parameter sharing among conversations with similar topic distributions by first inferring the latent topics of the input context, and then generating the parameters with respect to the inferred latent topics. Equipped with both the context-aware and topic-aware parameterization mechanisms, the model is capable of generating responses for diverse conversations with a single encoder-decoder through a more flexible and efficient approach. Moreover, our model is trained in an end-to-end fashion without any costly external or labeled topic annotations.

We empirically evaluate our approach on a large-scale real-world conversational dataset. Extensive experiments show that our proposed ADAND outperforms existing dialogue generation models in terms of both automatic evaluation metrics and human judgements.

2 Neural Dialogue Generation Model

Conventionally, a neural dialogue generation model follows the encoder-decoder paradigm. Given the context x and the target response y, the model learns to maximize the following conditional probability:

where the decoder hidden state up to time step $t$ depends on the encoded context, defined as:

where the encoder G and the decoder F can be implemented as recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Cho et al., 2014), or as the transformer (Vaswani et al., 2017), with the attention mechanism (Bahdanau et al., 2015) or the self-attention mechanism.
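For reference, the encoder-decoder objective described above is usually written in the following form. This is the standard SEQ2SEQ factorization rather than a verbatim restatement of the paper's equations; the response length $N$, the decoder state $s_t$, and the encoded context $h$ are symbols assumed here for illustration.

```latex
\begin{align}
p(y \mid x) &= \prod_{t=1}^{N} p(y_t \mid y_{<t}, x)
             = \prod_{t=1}^{N} p(y_t \mid s_t), \\
s_t &= F(y_{t-1}, s_{t-1}, h), \qquad h = G(x).
\end{align}
```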

Figure 1: General model architecture. Black solid lines denote information flow, and the red dashed line indicates the adaptive parameterization operation.

In this work, we employ the LSTM-based encoder-decoder dialogue generation model. The LSTM unit is formulated as:

where $\sigma$ is the sigmoid operator and $\odot$ stands for the Hadamard product. $h_{t-1}$ is the previous hidden state and $e_t$ is the input embedding at step $t$. $W$ and $b$ are the LSTM parameters. The model parameters are tuned on the training corpus.
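A single LSTM step can be sketched as follows. This is a minimal NumPy illustration of the standard LSTM cell, not the authors' implementation; the gate ordering inside `W`, the name `lstm_step`, and the use of `m` for the memory cell are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(W, b, e_t, h_prev, m_prev):
    """One step of a standard LSTM cell.

    W has shape (4*L, E+L) and b has shape (4*L,), where L is the hidden
    size and E is the embedding size. The four row blocks of W are assumed
    to hold the input, forget, output and candidate transforms in that order.
    """
    L = h_prev.shape[0]
    gates = W @ np.concatenate([e_t, h_prev]) + b
    i = sigmoid(gates[0:L])            # input gate
    f = sigmoid(gates[L:2 * L])        # forget gate
    o = sigmoid(gates[2 * L:3 * L])    # output gate
    g = np.tanh(gates[3 * L:4 * L])    # candidate memory
    m_t = f * m_prev + i * g           # element-wise (Hadamard) products
    h_t = o * np.tanh(m_t)
    return h_t, m_t

rng = np.random.default_rng(0)
L, E = 4, 3
W = rng.normal(scale=0.1, size=(4 * L, E + L))
b = np.zeros(4 * L)
h_t, m_t = lstm_step(W, b, rng.normal(size=E), np.zeros(L), np.zeros(L))
```

In a conventional model `W` and `b` are fixed after training; the adaptive mechanisms in Section 3 replace the fixed `W` with one generated from the conversation.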

When testing, given the input context, it generates a response with the learned parameters. This architecture shows great success in neural dialogue generation (Shang et al., 2015; Serban et al., 2016, 2017; Shen et al., 2017; Clark and Cao, 2017; Xing et al., 2017; Choudhary et al., 2017; Zhao et al., 2018). However, with a single set of model parameters and the oversimplified model architecture, the flexibility of the model is rather limited, especially when confronting conversations with diverse topics or themes.

3 Adaptive Neural Dialogue Generation Model

In this work, we propose an adaptive neural dialogue generation model which utilizes a single encoder-decoder model and a set of dynamic parameters to balance the model’s flexibility and efficiency. As depicted in Figure 1, we utilize the model adapter to parameterize the encoder-decoder for each conversation. It takes the dialogue context as its input and generates parameters of the encoder-decoder model through two adaptive parameterization mechanisms; the resultant encoder-decoder model then generates the response with a specific set of model parameters.

Figure 2: General architecture of the context-aware parameterization. (a) Context-aware parameter adapter. (b) One layer of the dialogue generation model. Black solid lines denote information flow, and red dashed lines indicate adaptive parameterization operations. The function symbol in the figure stands for the context-aware parameterization function (Eq.(7)).

3.1 Context-aware Parameterization

Context-aware parameterization parameterizes the dialogue encoder-decoder with respect to the local semantics of the given context. As shown in Figure 2, we first obtain the semantic representation of the context through a context encoder. Then, the parameter adapter dynamically adapts the parameters of the encoder-decoder at each time step. Here, we utilize a bidirectional LSTM to transform the input context into the semantic hidden representation.

The parameter adapter then generates the weights of LSTM units as:

where the adapter is parameterized by a tensor whose dimensions are determined by the hidden size of the LSTM, the embedding size, and the size of the context representation. The product of the context representation with the tensor results in a weight matrix W, where each row is computed by one slice of the tensor:

Although such parameterization seems straightforward, due to the quadratic size of the weight matrix W, the parameter adapter has many times more parameters than the basic encoder-decoder model. The resulting overfitting and expensive computational cost make it infeasible (Bertinetto et al., 2016; Ha et al., 2017). Following Flennerhag et al. (2018), we reduce the parameter space by factorizing the weights as:

where the recurrent function is implemented as an LSTM unit, $h_{t-1}$ is the previous encoder or decoder hidden state, and $c_t$ is the context representation at time step $t$. The context-aware parameterization function is defined as:

where $U$ and $V$ are learnable weights.

The context-aware parameterization function is reminiscent of the Singular Value Decomposition (SVD). Here, however, the function composes a projection by adapting the dialogue context to the weight matrices and does not actually perform a matrix decomposition. With this formulation, the total number of parameters of the model is linear in L, so it does not explode.
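The saving from an SVD-like factorization can be illustrated with a small sketch. The exact form of Eq.(7) is not reproduced here; the code below assumes one plausible reading, $W = U\,\mathrm{diag}(P c_t)\,V$, in which only the diagonal depends on the context. The matrix `P`, the rank `R`, and the function name are hypothetical.

```python
import numpy as np

def context_aware_weight(U, P, V, c_t):
    """Build W = U @ diag(P @ c_t) @ V without materializing the diagonal.

    Assumed SVD-like form: U and V are shared across contexts, and only the
    diagonal scaling d = P @ c_t changes with the context representation.
    """
    d = P @ c_t          # context-dependent scaling, shape (R,)
    return (U * d) @ V   # broadcasting scales column j of U by d[j]

rng = np.random.default_rng(0)
out_dim, in_dim, R, ctx_dim = 16, 7, 5, 6
U = rng.normal(size=(out_dim, R))
P = rng.normal(size=(R, ctx_dim))
V = rng.normal(size=(R, in_dim))
c_t = rng.normal(size=ctx_dim)

W = context_aware_weight(U, P, V, c_t)

# Parameter counts: factorized adapter vs. a full tensor that holds one
# (out_dim x in_dim) slice per context dimension.
factorized_params = U.size + P.size + V.size
full_tensor_params = out_dim * in_dim * ctx_dim
```

Under these toy sizes the factorized adapter needs 145 parameters against 672 for the full tensor, and the gap widens as the dimensions grow.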

3.2 Topic-aware Parameterization

Context-aware parameterization adapts the encoder-decoder parameters to each input context. As a result, the adapted encoder-decoder is sensitive to the input context. To enable parameter sharing among similar topics, we further propose a topic-aware parameterization mechanism. As shown in Figure 3, the topic inferrer first distills the topic distribution from the context (Figure 3 (a)), and then the parameters of the encoder-decoder model are constructed upon the inferred topic distribution (Figure 3 (b)).

3.2.1 Latent Topic Inference

We introduce a variational topic inferrer to infer the topic distribution of the conversation. Drawing inspiration from the neural topic model (Miao et al., 2017), as illustrated in Figure 3 (a), the generation process of the variational topic inferrer is formulated as follows:

(i) A latent variable is inferred to convey the underlying semantics of the given context.

(ii) The topic distribution is constructed from the latent variable through a softmax function.

(iii) The dialogue d, composed of a context-response pair (x, y), is drawn word by word: for each word, a topic assignment is sampled from a multinomial distribution parameterized by the topic distribution, and the word is then drawn from the topic-word distribution of that topic assignment.

Figure 3: Illustration of the topic-aware parameterization. (a) Topic inferrer. (b) One layer of the dialogue generation model. Black solid lines denote information flow, blue dashed lines only appear in training, and red dashed lines indicate adaptive parameterization operations.

Given a context x, the latent variable $z$ is sampled from the prior $p(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, the multivariate normal distribution with mean $\mu$ and diagonal covariance matrix $\mathrm{diag}(\sigma^2)$. In practice, $z$ is reparameterized as $z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is standard Gaussian noise. The Gaussian parameters $\mu$ and $\sigma$ are the outputs of multilayer perceptrons (MLP) given the bag-of-words (BoW) representation of the context as input. To reduce the encoding noise of stop words, we choose the BoW instead of an LSTM for context representations, following Miao et al. (2017). To implement neural variational inference, we utilize an inference network $q(z \mid d)$ to approximate the intractable true posterior, where its Gaussian parameters are computed in the same way as the prior, taking the bag-of-words representation of dialogue d as input. The dialogue d consists of the context-response pair (x, y).

The topic distribution is constructed from the latent variable through a softmax function:

where the latent variable first passes through a linear transformation, and the bias terms are left out for brevity.
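The inference steps above (Gaussian parameters from a bag-of-words input, reparameterized sampling, softmax over topics) can be sketched as follows. This is an illustration under stated assumptions: single linear maps stand in for the MLPs, the variance is parameterized through its logarithm, and all names and toy sizes are hypothetical.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def infer_topics(bow, W_mu, W_logvar, W_theta, rng):
    """Sample a latent variable and map it to a topic distribution."""
    mu = W_mu @ bow                        # Gaussian mean
    log_var = W_logvar @ bow               # log of the diagonal variance
    eps = rng.standard_normal(mu.shape)    # standard Gaussian noise
    z = mu + np.exp(0.5 * log_var) * eps   # reparameterization trick
    theta = softmax(W_theta @ z)           # topic distribution over K topics
    return z, theta

rng = np.random.default_rng(0)
vocab, latent, K = 10, 8, 5
W_mu = rng.normal(scale=0.1, size=(latent, vocab))
W_logvar = rng.normal(scale=0.1, size=(latent, vocab))
W_theta = rng.normal(scale=0.1, size=(K, latent))
bow = rng.integers(0, 3, size=vocab).astype(float)   # toy word counts

z, theta = infer_topics(bow, W_mu, W_logvar, W_theta, rng)
```

Because the noise enters only through `eps`, gradients can flow through `mu` and `log_var` during training.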

Then, the dialogue d is generated based on the topic distribution, and the marginal likelihood of a dialogue d is formulated as:

In addition, the topic assignment can be integrated out and the log-likelihood of a word in dialogue d can be factorized as:

The topic-word distribution is defined by:

where the two factors are the topical word embedding matrix and the topic embedding matrix, K is the number of topics, C is the number of topical words, and H is the embedding size.
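Integrating out the topic assignment amounts to mixing the K topic-word distributions with the topic distribution. The sketch below assumes the construction used in the neural topic model of Miao et al. (2017), where each topic-word distribution is a softmax over dot products between a topic embedding and the word embeddings; the function and variable names are hypothetical.

```python
import numpy as np

def softmax_rows(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def word_log_likelihood(theta, topic_emb, word_emb, word_ids):
    """Log-likelihood of the given words with the topic assignment summed out.

    theta:     (K,)   topic distribution
    topic_emb: (K, H) topic embedding matrix
    word_emb:  (C, H) topical word embedding matrix
    """
    beta = softmax_rows(topic_emb @ word_emb.T)   # (K, C) topic-word distributions
    p_words = theta @ beta                        # (C,) mixture over topics
    return np.log(p_words[word_ids]).sum(), beta

rng = np.random.default_rng(0)
K, C, H = 5, 12, 3
theta = np.full(K, 1.0 / K)
log_lik, beta = word_log_likelihood(
    theta, rng.normal(size=(K, H)), rng.normal(size=(C, H)), [0, 3, 7]
)
```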

3.2.2 Parameterization with Latent Topics

We parameterize the encoder-decoder with the inferred topic distribution. In context-aware parameterization, the parameters of the encoder-decoder are adapted dynamically at each time step, whereas in topic-aware parameterization, as illustrated in Figure 3 (b), we generate only one set of parameters for each conversation.

Similar to the context-aware parameterization function in Eq.(7), given the topic distribution, the topic-aware parameterization function constructs the LSTM weight W as follows:

where the weight matrices, among them $U$ and $V$, are learnable parameters, and K is the number of latent topics.

3.3 Parameterization with Both Context and Topics

Intuitively, context-aware parameterization is more adept at capturing local semantics of the input context, while topic-aware parameterization enables parameter sharing between conversations with similar topic distributions. To benefit the model parameterization with both local and global information, we further adapt the parameters of the encoder-decoder by utilizing both the context representation and the topic distribution. In particular, the LSTM weight W at time step t is adapted as follows:

where the gating function decides whether the parameterization relies more on the context or the topics, and its weights are learnable. The two gated terms are the context-aware and topic-aware parameterization functions, respectively, and $\sigma$ is the sigmoid function.
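One way to read the gated combination is as a convex mixture of the two generated weight matrices. The sketch below assumes a scalar sigmoid gate computed from the concatenated context representation and topic distribution; the exact gate input and shape used by the authors are not recoverable here, so `W_g` and the function name are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_weight(W_ctx, W_topic, W_g, c_t, theta):
    """Mix context-aware and topic-aware weights with a scalar gate.

    W_ctx and W_topic are the matrices produced by the two parameterization
    functions; W_g is an assumed learnable gate vector.
    """
    g = sigmoid(W_g @ np.concatenate([c_t, theta]))   # scalar in (0, 1)
    return g * W_ctx + (1.0 - g) * W_topic, g

rng = np.random.default_rng(0)
ctx_dim, K = 6, 5
W_ctx = rng.normal(size=(16, 7))
W_topic = rng.normal(size=(16, 7))
W_g = rng.normal(size=ctx_dim + K)
c_t = rng.normal(size=ctx_dim)
theta = np.full(K, 1.0 / K)

W_t, gate = gated_weight(W_ctx, W_topic, W_g, c_t, theta)
```

A gate near 1 makes the step rely on the local context, while a gate near 0 falls back to the conversation-level topic weights.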

3.4 Learning

To enable the joint optimization of latent topic inference, adaptive model parameterization, and response generation in ADAND, given the definitions in Eq.(9), similar to Kingma and Welling (2014) and Miao et al. (2017), we derive a variational lower bound for the generation likelihood:

where $p(z \mid x)$ is the prior estimate of the latent variable, which approximates the posterior $q(z \mid d)$. Both the prior and the posterior are Gaussians with diagonal covariance, as defined in Section 3.2.1. The first term is the dialogue generation objective in the latent topic inferrer, the second term is the response generation objective, and the third term is the KL divergence between two Gaussian distributions. All the parameters are learned by optimizing Eq.(14) and updated with back-propagation.
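The KL term between two diagonal Gaussians has a closed form, sketched below. The log-variance parameterization and the argument names are assumptions for illustration; the direction shown, KL(posterior || prior), is the usual one in variational lower bounds.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    per_dim = logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(per_dim)

zeros = np.zeros(3)
kl_same = kl_diag_gaussians(zeros, zeros, zeros, zeros)          # identical Gaussians
kl_shift = kl_diag_gaussians(np.ones(3), zeros, zeros, zeros)    # unit mean shift per dim
```

With unit variances, a mean shift of 1 in each of the three dimensions contributes 0.5 per dimension, so `kl_shift` equals 1.5 while `kl_same` is 0.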

The following previously proposed strategies (Bowman et al., 2016; Zhao et al., 2017) are adopted in training to alleviate the vanishing latent variable problem: (1) KL annealing: the weight of the KL divergence term is gradually increased from 0 to 1 during training; (2) Bag-of-words loss: the bag-of-words loss requires the latent variable, together with the dialogue context, to reconstruct the bag-of-words representation of the response.
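KL annealing can be sketched with a simple schedule. The text states only that the weight rises from 0 to 1; the linear shape, the `warmup_steps` parameter, and the way the loss terms are combined below are assumptions for illustration.

```python
def kl_weight(step, warmup_steps):
    """Linearly increase the KL weight from 0 to 1 over warmup_steps updates."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_loss(reconstruction_loss, kl_term, bow_loss, step, warmup_steps):
    """Negative lower bound with an annealed KL term plus the bag-of-words loss."""
    return reconstruction_loss + kl_weight(step, warmup_steps) * kl_term + bow_loss

start = kl_weight(0, 1000)
middle = kl_weight(500, 1000)
end = kl_weight(2000, 1000)
```

Early in training the KL term is nearly ignored, which gives the decoder time to start using the latent variable before the regularizer takes full effect.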

4 Experiments

4.1 Dataset and Competitor Baselines

To ascertain the effectiveness of the proposed model, we construct an open-domain conversation corpus covering a broad range of resources including a movie discussions dataset collected from Reddit (Dodge et al., 2015), an Ubuntu technical corpus (Lowe et al., 2015), and a chitchat dataset (Zhang et al., 2018). 87,468 context-response pairs were sampled for training, 4,460 for validation and 4,468 for testing.

The following state-of-the-art models are adopted as our comparison systems.

SEQ2SEQ The attention-based sequence-to-sequence model (Bahdanau et al., 2015), which is a representative baseline.

CVAE A latent variable conversation model that incorporates a latent variable at the sentence level to inject stochasticity and diversity (Clark and Cao, 2017; Zhao et al., 2017).

LAED A recurrent encoder-decoder conversation model using discrete latent actions for interpretable neural dialogue generation (Zhao et al., 2018).

TA-SEQ2SEQ A model that incorporates external topic information into response generation, where the topics are learned from a separate LDA model to enrich the context (Xing et al., 2017).

DOM-SEQ2SEQ A domain-aware conversation model consisting of multiple domain-targeted SEQ2SEQ models for response generation (Choudhary et al., 2017).

Table 1: Quantitative evaluation results (%).

4.2 Evaluation

Following the evaluation procedure in previous work (Li et al., 2016; Xing et al., 2017; Chen et al., 2018), experimental results of all models are reported in terms of relevance and informativeness. To evaluate the semantic relevance between the generated response and the ground-truth response, we adopted the BLEU metric (Papineni et al., 2002) and three embedding-based similarity metrics proposed in Liu et al. (2016): Embedding Average (Average), Embedding Extrema (Extrema) and Embedding Greedy (Greedy). To measure the informativeness and diversity of the responses, we used the Distinct-1, Distinct-2 and Distinct-3 metrics. A higher ratio of distinct n-grams implies more informative and diverse responses.
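The Distinct-n ratio can be computed as below. This is a minimal sketch that assumes pre-tokenized responses and a corpus-level ratio (unique n-grams over all n-grams across the generated responses); the function name is chosen for this example.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of token lists."""
    total = 0
    unique = set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

responses = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]
d1 = distinct_n(responses, 1)   # 5 unique unigrams out of 8
d2 = distinct_n(responses, 2)   # 4 unique bigrams out of 6
```

Generic replies that repeat the same phrases push the ratio down, which is why the metric is used as a proxy for informativeness.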

4.3 Implementation and Reproducibility

We implemented our model with ParlAI (Miller et al., 2017). The sequence lengths are truncated at 50. We used Adam (Kingma and Ba, 2014) with an initial learning rate of 0.001 to optimize the model. For all the experiments, we employed a 2-layer bidirectional LSTM as the encoder and a unidirectional one as the decoder. The hidden size and the word embedding dimension are both set to 300. The latent variable size is set to 64. The topic number K in our model is set to 5, and the 3,159 most frequent words in the training set, after stemming and stop-word filtering, are taken as the topical word vocabulary. The batch size is set to 128 for all models. We trained a Twitter LDA model to obtain the topical words for TA-SEQ2SEQ and set its model-specific parameters following the original paper (Xing et al., 2017). For regularization and to prevent over-fitting, a dropout of 0.1 and weight decay are applied. We used pretrained word embeddings (Pennington et al., 2014) of 300 dimensions, and the vocabulary size is set to 20,000. All models are trained with early stopping, i.e., training stops if the validation loss does not decrease for 10 consecutive validations. The loss is computed on the validation set every 0.5 epochs, and we save the parameters of the best model on the validation set. We finally report evaluation scores on the test set from the saved model.
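The early-stopping rule described above (stop after 10 validations without improvement, keep the best checkpoint) can be sketched as follows. The class and method names are hypothetical and do not correspond to ParlAI's API.

```python
class EarlyStopper:
    """Track validation loss and signal when to save and when to stop."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_validations = 0

    def update(self, val_loss):
        """Return (should_save, should_stop) for the latest validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_validations = 0
            return True, False
        self.bad_validations += 1
        return False, self.bad_validations >= self.patience

stopper = EarlyStopper(patience=2)
history = [stopper.update(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

With a patience of 2, the toy run saves at the first two validations and stops at the fourth.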

4.4 Overall Performance

Table 1 lists the performance of our system and the comparison systems. CVAE and LAED augment SEQ2SEQ with stochastic latent variables, resulting in more informative responses and better performance on Distinct-{1, 2, 3}. TA-SEQ2SEQ augments SEQ2SEQ with external topic information from LDA. It is not surprising that it performs much better on response relevance (BLEU, Average, Greedy, Extrema), while its improvements on informativeness are limited. DOM-SEQ2SEQ builds multiple domain-specific encoder-decoders. It gains improvements on both the relevance metrics and the informativeness metrics.

In general, with both the context-aware and topic-aware parameterization, our model outperforms all the competitive baselines in terms of the response relevance and informativeness.

4.5 Context-aware vs Topic-aware Parameterization

Context-aware parameterization captures local semantics of the given context, while topic-aware parameterization enables parameter sharing among conversations with similar topics. As shown in Table 1, both parameterization mechanisms perform much better than the original SEQ2SEQ model, while context-aware parameterization is slightly better in terms of informativeness. When jointly utilizing both the context-aware and topic-aware parameterization mechanisms, we observe the best performance, indicating that these two mechanisms are both beneficial and complementary.

Table 2: The results of human evaluation.

Table 3: Speed test.

4.6 Human Evaluation

We conducted human evaluations on the test set to further validate the effectiveness of the model. We randomly selected 500 samples from the test set. Three well-educated students were invited to conduct the evaluation. For each case, we provided annotators with triplets (sample, response 1, response 2), where one response is generated by ADAND and the other is generated by a competitor model. The annotators, who have no knowledge of which system each response is from, are then required to independently rate among win (response 1 is better), loss (response 2 is better) and tie (they are equally good or bad), considering four factors: context relevance, logical consistency, fluency and informativeness. Note that if annotators rate different options, the triplet is counted as “tie”. Table 2 reports the results of the subjective evaluation. The kappa scores indicate that the annotators came to a fair agreement in their judgments.
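The aggregation rule above, where any disagreement among annotators is counted as a tie, can be sketched as follows. The label strings and the function name are assumptions for this example, and the toy ratings are illustrative only.

```python
from collections import Counter

def aggregate_ratings(ratings):
    """Count win/loss/tie outcomes, treating annotator disagreement as a tie.

    ratings: one tuple per sample, holding each annotator's label
             ('win', 'loss' or 'tie').
    """
    counts = Counter()
    for sample in ratings:
        label = sample[0] if len(set(sample)) == 1 else "tie"
        counts[label] += 1
    return counts

toy_ratings = [
    ("win", "win", "win"),
    ("win", "loss", "win"),     # disagreement, counted as tie
    ("loss", "loss", "loss"),
    ("tie", "tie", "tie"),
]
counts = aggregate_ratings(toy_ratings)
```

This rule is conservative: a win is recorded only when all annotators agree.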

As expected, ADAND outperforms the other baselines and enjoys a large margin over the existing models. The relative performance of the competitors is consistent with the quantitative evaluation results, confirming the superior performance of our proposed method.

Table 4: Topics, each represented by its 10 highest-probability words, discovered by the latent topic inferrer.

4.7 Speed Test

We conducted a speed test to verify the efficiency of the ADAND model empirically (Table 3). Augmented with auxiliary components, all the extension models exhibit a higher time cost than the original SEQ2SEQ model. We observe that the decoding speeds of CVAE and LAED are comparable with that of our model. However, compared with TA-SEQ2SEQ and DOM-SEQ2SEQ, which also explicitly model conversations with diverse topics or themes, ADAND shows a clear advantage in decoding speed. TA-SEQ2SEQ relies on an outside LDA model to obtain the topic information, and its joint attention and copying mechanism also reduces its efficiency. For DOM-SEQ2SEQ, it is not surprising that the time cost of multiple topic/theme-specific encoder-decoders is much higher than that of all other comparison models. ADAND utilizes a single encoder-decoder that is parameterized dynamically with respect to the input context, which ensures its flexibility and efficiency.

4.8 Analysis & Case Study

To gain some insight into how topic-aware parameterization performs, we present each topic by its words (Eq.(10)) with the top-10 highest probabilities in Table 4. The discernible clusters of the topical words are illustrated in Figure 4. This evidence demonstrates that the topic inferrer in topic-aware parameterization effectively distills the latent topic distribution of each conversation, which enables parameter sharing among conversations with similar topics.

We also investigate the orthogonality of the learned U and V matrices in Eq.(7) and Eq.(12). We trained our model multiple times with different parameter initialization methods (values drawn from a normal distribution or a uniform distribution). We observe that the products of U and V with their respective transposes approximate identity matrices. We conjecture that such SVD-like parameterization implicitly enforces orthogonality during training.

Figure 4: t-SNE (van der Maaten and Hinton, 2008) projection of the topical word embeddings in the latent topic inferrer. Words with similar topics are in the same color.

We list several examples generated by different models in Table 5. The inferred latent topic distributions are also presented in the table. It can be observed that responses generated by the original SEQ2SEQ model are more generic. Latent variable conversation models (CVAE and LAED) generate more diverse but sometimes irrelevant responses, TA-SEQ2SEQ tends to produce short responses while DOM-SEQ2SEQ does not perform obviously better than TA-SEQ2SEQ. The responses generated by ADAND are not only relevant but also informative.

5 Related Work

Our work is closely related to research on dialogue generation in diverse conversations. Previous work relies on external pre-organized topic information (Xing et al., 2017; Wang et al., 2017) or predicted keywords (Yao et al., 2017; Wang et al., 2018) to boost response informativeness and coherence. Choudhary et al. (2017) further leveraged topic/theme annotations to build multiple separate encoder-decoder models for topic/theme-aware response generation. In contrast, we do not exploit any external or labeled topic information. The proposed model directly infers the latent topics of each conversation and is trained in an end-to-end manner. Another difference is that we maintain a single encoder-decoder for various conversations, whereas the model is dynamically and specifically parameterized.

Table 5: Test samples of our model (ADAND) and the baselines. The latent topic distributions inferred by ADAND are also presented. The reference is the ground-truth response in the dataset.

The second line of related work is parameterization in NLP. Ha et al. (2017) proposed to train a small network to generate the parameters for another larger network. Such adaptive parameterization has been shown to be successful in many NLP tasks, including language modeling (Suarez, 2017; Flennerhag et al., 2018), sequence generation (Ha and Eck, 2018; Peng et al., 2019), and neural machine translation (Platanios et al., 2018). In our work, we parameterize the encoder-decoder with respect to both the context and the latent topics.

Regarding latent variable conversation models, prior research strives to learn meaningful latent variables for dialogue systems, and reveals that latent variables benefit neural dialogue models with more diverse response generation (Serban et al., 2017; Zhao et al., 2017; Clark and Cao, 2017; Chen et al., 2018) and interpretable dialogue actions (Zhao et al., 2018). In our model, instead of directly injecting the latent variable into dialogue models, we distill the latent topics through neural variational inference, offering a more interpretable latent variable. Moreover, we parameterize the encoder-decoder with the inferred latent topics, which allows parameter sharing among conversations with similar topics.

6 Conclusion

This paper presents an adaptive neural dialogue generation model, ADAND, which allows the dynamic parameterization of the model for each conversation and enables the generation of appropriate responses in diverse conversations. Specifically, we propose two adaptive parameterization approaches: context-aware parameterization, which captures local semantics of the input context; and topic-aware parameterization, which enables parameter sharing by first inferring the latent topics of the given context and then generating the parameters with the inferred latent topics. The proposed approaches are assessed on a large-scale conversational dataset, and the results show that our model achieves superior performance and higher efficiency. It should be noted that our approach is not limited to LSTMs. We would like to explore the effectiveness of the approach with other architectures in future work.

Acknowledgments

We would like to thank all the reviewers for their insightful and valuable comments and suggestions. Hongshen Chen and Cheng Zhang are the corresponding authors.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip H. S. Torr, and Andrea Vedaldi. 2016. Learning feed-forward one-shot learners. In NIPS, pages 523–531.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In SIGNLL, pages 10–21.

Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. 2018. Hierarchical variational memory network for dialogue generation. In WWW, pages 1653–1662.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, pages 1724–1734.

Sajal Choudhary, Prerna Srivastava, Lyle H. Ungar, and João Sedoc. 2017. Domain aware neural dialog system. CoRR, abs/1708.00897.

Stephen Clark and Kris Cao. 2017. Latent variable dialogue models and their diversity. In EACL, pages 182–187.

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander H. Miller, Arthur Szlam, and Jason Weston. 2015. Evaluating prerequisite qualities for learning end-to-end dialog systems. CoRR, abs/1511.06931.

Sebastian Flennerhag, Hujun Yin, John A. Keane, and Mark Elliot. 2018. Breaking the activation function bottleneck through adaptive parameterization. In NeurIPS, pages 7750–7761.

David Ha, Andrew M. Dai, and Quoc V. Le. 2017. Hypernetworks. In ICLR.

David Ha and Douglas Eck. 2018. A neural representation of sketch drawings. In ICLR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In NAACL-HLT, pages 110–119.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, pages 2122–2132.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL, pages 285–294.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. 2017. Discovering discrete latent topics with neural variational inference. In ICML, volume 70, pages 2410–2419.

A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston. 2017. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Hao Peng, Ankur P. Parikh, Manaal Faruqui, Bhuwan Dhingra, and Dipanjan Das. 2019. Text generation with exemplar-based adaptive decoding. In NAACL-HLT, pages 2555–2565.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP, pages 1532–1543.

Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom M. Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In EMNLP, pages 425–435.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL, pages 1577–1586.

Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A conditional variational framework for dialog generation. In ACL, pages 504–509.

Joseph Suarez. 2017. Character-level language modeling with recurrent highway hypernetworks. In NIPS, pages 3269–3278.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In EMNLP, pages 2140–2150.

Wenjie Wang, Minlie Huang, Xin-Shun Xu, Fumin Shen, and Liqiang Nie. 2018. Chat more: Deepening and widening the chatting topic via A deep model. In SIGIR, pages 255–264.

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In AAAI, pages 3351–3357.

Lili Yao, Yaoyuan Zhang, Yansong Feng, Dongyan Zhao, and Rui Yan. 2017. Towards implicit content-introducing for generative short-text conversation systems. In EMNLP, pages 2190–2199.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In ACL, pages 2204–2213.

Tiancheng Zhao, Kyusong Lee, and Maxine Eskénazi. 2018. Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In ACL, pages 1098–1107.

Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL, pages 654–664.