Goal-oriented dialogue is an area of increasingly high interest, from both academic and industrial perspectives. Data-driven approaches to developing such systems (Lemon and Pietquin 2012) have proved to be more flexible and scalable to various scenarios and domains than the techniques previously employed in industry, which were mostly based on expert knowledge. The benefits of methods based on machine learning (especially deep learning) can only be realized when large amounts of training data are available; however, in real-world scenarios, only a small amount of initial data is available for a new domain. Training techniques must make the most of this small data, i.e. work in a data-efficient way, in order to enable rapid development of dialogue models for an ever-increasing number of domains and tasks. The most promising method to achieve this under the deep learning framework has become transfer learning, where a large, generic model is first trained on a highly represented source of data and then adapted to the target task.
In this paper, we explore this problem through the Fast Domain Adaptation task of the Eighth Dialogue System Technology Challenge (DSTC-8). Specifically, we propose a hybrid generative/retrieval dialogue model leveraging knowledge transfer from a large-scale pre-trained general-purpose language model. Our model is able to maintain goal-oriented dialogue in a closed domain having only been exposed to a small set of in-domain dialogues as the domain description. Our hybrid model achieves state-of-the-art performance on the MetaLWOz dataset when evaluated with human judges, and attains a competitive level of generalization when adapting to the goal-oriented MultiWOZ dataset, which is unseen at the main training stage. Automated word overlap-based metrics demonstrate that it outperforms a series of competitive baselines, both generative-only and retrieval-only models.
Generative dialogue modeling is an actively researched area, with the sequence-to-sequence (seq2seq) model (Vinyals and Le 2015) gaining wide adoption in both chat-oriented (Serban et al. 2016) and goal-oriented dialogue (Zhao et al. 2017). Initially these architectures were based on Recurrent Neural Networks such as LSTM (Hochreiter and Schmidhuber 1997) or GRU (Chung et al. 2014), which were quite challenging to train on large amounts of conversational data, causing researchers to focus on improving response diversity (Li et al. 2015) and the overall dialogue consistency (Li and Jurafsky 2016). Quite recently, self-attention mechanisms, like those used in the Transformer (Vaswani et al. 2017), have been adopted for conversation models; together with large-scale pre-training, this resulted in a new generation of seq2seq architectures.
The data efficiency of dialogue systems has also been extensively researched in the past. Initially, when modular dialogue system architecture was the prevalent approach, dialogue managers and state trackers were the components to which data-efficient methods were applied the most. As such, the dialogue state tracker domain adaptation task was initially proposed in DSTC-3 (Henderson, Thomson, and Williams 2014); that challenge featured approaches like Bayesian Processes (Gasic et al. 2017) and Recurrent Neural Networks (Mrksic et al. 2015). Later research was focused on the data efficiency of dialogue managers: for instance, Williams, Asadi, and Zweig (2017) introduced a model designed for bootstrapping from limited training data and further fine-tuning in the reinforcement learning fashion. Furthermore, a recent paper by Vlasov, Drissner-Schmid, and Nichol (2018) proposed a dialogue management model which used a unified embedding space for user and system turns, allowing for efficient cross-domain knowledge transfer.
Figure 1: Model diagram: (a) encode the target dialogue context and (b) produce the ‘generated candidate’; (c) encode support dialogue contexts in a similar way; (d) find the nearest ‘support’ neighbor and select its response as the ‘retrieved candidate’; (e) finally, rank the two candidates given the target context and produce the final result.
End-to-end dialogue response generation, the technique that followed modular architectures with the arrival of large conversational datasets, was also eventually approached in a data-efficient way. One such method used prior linguistic knowledge to improve zero-shot performance: Eshghi, Shalyminov, and Lemon (2017) proposed a linguistically informed model based on an incremental semantic parser (Eshghi, Purver, and Hough 2011) combined with a reinforcement learning-based agent. The parser was used for both maintaining the agent’s state and pruning the agent’s incremental, word-level generation actions to those leading to syntactically correct word sequences. While outperforming end-to-end dialogue models on bAbI Dialog Tasks (Bordes, Boureau, and Weston 2017) in the extreme zero-shot case (Shalyminov, Eshghi, and Lemon 2017), this method inherited the limitations of the dialogue grammar — specifically, it is limited to a single closed domain until a wide-coverage grammar is available.
Zhao and Eskénazi (2018) introduced the zero-shot dialogue generation (ZSDG) framework, under which a dialogue system was trained on dialogues from several source domains and a small amount of annotated utterances from the target domain. The key feature of their framework was the unified latent space, which was used to encode users’ queries, dialogue contexts, and annotations.
Later, Shalyminov et al. (2019b; 2019a) proposed Dialogue Knowledge Transfer Networks, which approached the problem in a few-shot setup with a separate out-of-domain pre-training stage on a large goal-oriented corpus (MetaLWOz, Lee et al. 2019a). In those approaches, MetaLWOz was used as the source dataset for transfer, whereas we treat it as the target dataset. While the authors used full target-domain dialogues, they ended up using only a fraction of ZSDG’s data in terms of the number of utterances.
More generally, transfer learning has been widely adopted for natural language problems with the emergence of large-scale pre-trained text representation models like ELMo (Peters et al. 2018), BERT (Devlin et al. 2019), and GPT-2 (Radford et al. 2018). When applied to dialogue response generation, the most successful approaches made use of a Transformer for chat-oriented dialogue (Wolf et al. 2019) and GPT/GPT-2 for goal-oriented dialogue (Budzianowski and Vulic 2019). Our approach is based on a similar technique, though in addition to fine-tuning a pre-trained model to our task, we augment the generative model with a retrieval component in a hybrid approach.
Finally, another recent approach applied to the problem of few-shot dialogue generation is meta-learning (Qian and Yu 2019), under which the task is split into multiple subtasks corresponding to dialogue domains. For each of them, a specialized dialogue model was trained, with their training progress then merged into the main model. In general, the intuition behind meta-learning is training a base model which would be best suited for data-efficient fine-tuning – otherwise known as rapid adaptation – making the most efficient gradient updates from the few data points available in the target domain.
Goal-oriented dialogue systems can be challenging to bootstrap: for a new domain, little data is available to train e.g. a natural language understanding (NLU) module or other parts of the pipeline. Often, a Wizard-of-Oz (WOz, Kelley 1984, Rieser, Kruijff-Korbayov, and Lemon 2005) schema can be used to obtain some initial test data; however, this requires training human agents for the task and setting up a complex pipeline. The value of WOz data is limited, since “users” are mostly hired and might not conform to real users. Additionally, any change in the chatbot interface requires collecting more data.
Figure 2: Human evaluation: rank densities by metric with the sample size of 100 dialogues (lower numbers are better). Our submission is denoted as Team B. Densities are determined by drawing 1000 times with replacement from the 100 dialogues and recomputing the rank.
Table 1: Ranking from judges’ pairwise comparisons
In the context of the DSTC-8 domain adaptation challenge, we aim to build a model that predicts user responses for a goal-oriented dialogue system for which only limited in-domain data is available. Such data could be collected from e.g. customer service transcripts, or written by the developers themselves. From this in-domain data, the support set, we would like to extrapolate responses to novel dialogue contexts (the target). However, the support set is typically too small to train a generative dialogue model. Instead, we adapt a generic dialogue model trained on a large corpus of conversations over multiple source domains.
Technically, the problem setup is as follows: having trained the base model on the source domains, the model is then fed with one target dialogue and a support set at a time. The model’s task is to predict the next user turn of the target dialogue, taking into account the support set before producing a prediction. At prediction time, each target dialogue is processed in isolation from other target dialogues, such that the model cannot use knowledge or state obtained from other target/support data.
We use a language model pre-trained on a very large and diverse collection of textual data providing a strong language prior and then adapt the model for our tasks in the form of fine-tuning. Our base model is GPT-2 (Wolf et al. 2019), a transformer-based language model. In order to adapt GPT-2 for dialogue generation, we first augment the input embedding for each token in the dialogue with (1) a speaker tag embedding identifying the speaker and (2) a turn embedding, identifying the turn number in the current dialogue. These additional embedding matrices are learned solely using the dialogue data. The input token embeddings are then obtained by summing up these representations. We also add two task-specific output layers (or “heads”) for our purposes: a language modeling (LM) head and a next-sentence prediction (NSP) classification head, both trained from randomly initialized parameters.
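The summation of the three embeddings can be illustrated with a minimal sketch. It is not the actual implementation: the table sizes, the token ids, and the helper names `make_table` and `build_input_embeddings` are assumptions made for illustration, and the positional embeddings that GPT-2 also adds are left out.

```python
import random

def make_table(rows, dim, rng):
    """Create a toy embedding table with `rows` vectors of size `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

def build_input_embeddings(token_ids, speaker_ids, turn_ids,
                           token_emb, speaker_emb, turn_emb):
    """Sum token, speaker-tag and turn embeddings position by position."""
    if not (len(token_ids) == len(speaker_ids) == len(turn_ids)):
        raise ValueError("all id sequences must have the same length")
    return [
        [t + s + u for t, s, u in zip(token_emb[tok], speaker_emb[spk], turn_emb[trn])]
        for tok, spk, trn in zip(token_ids, speaker_ids, turn_ids)
    ]

rng = random.Random(0)
dim = 4                                  # toy embedding size (assumption)
token_emb = make_table(50, dim, rng)     # stands in for the pre-trained token table
speaker_emb = make_table(2, dim, rng)    # 0 = bot, 1 = user (assumed coding)
turn_emb = make_table(8, dim, rng)       # one vector per turn number

# Two bot tokens in turn 0 followed by one user token in turn 1 (made-up ids).
token_ids = [3, 17, 42]
speaker_ids = [0, 0, 1]
turn_ids = [0, 0, 1]

inputs = build_input_embeddings(token_ids, speaker_ids, turn_ids,
                                token_emb, speaker_emb, turn_emb)
print(len(inputs), len(inputs[0]))
```

Every token of an utterance shares the same speaker and turn vectors, so the model can tell who said what and when without any change to the Transformer layers themselves.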
We fine-tune GPT-2 for response generation by minimizing the negative log-likelihood of response tokens given the concatenation of dialogue context and the previous tokens in the response,
$$\mathcal{L}_{LM}(X, C) = -\sum_{i=1}^{|X|} \log p(x_i \mid C, x_1, \ldots, x_{i-1}),$$
where $X = (x_1, \ldots, x_{|X|})$ is the response and $C$ is the dialogue context, i.e. the concatenation of the tokens in the previous utterances.
To predict the next sentence, we proceed as follows: given a context/response pair (C, X), the classification head is trained to produce a binary label y, which is 1 if X is the correct response given the context C, and 0 if X is a distractor (a random utterance from the corpus). We minimize a binary cross-entropy:
$$\mathcal{L}_{NSP}(C, X, y) = -y \log \sigma(W h) - (1 - y) \log\bigl(1 - \sigma(W h)\bigr),$$
where $h$ is the last hidden state of the last GPT-2 layer after having encoded the concatenation of $C$ and $X$, $\sigma$ is the logistic sigmoid, and $W$ is the next-sentence prediction head (in our case a simple linear transformation). In practice, for each (C, X) pair in the corpus, we sample 1 distractor.
Table 2: Automatic evaluation results on MetaLWOz
We obtain a suitable dialogue prior by fine-tuning the modified GPT-2 model on the source domains with both the language modeling and next-sentence prediction tasks as described above, therefore minimizing the two corresponding losses jointly.
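As a worked illustration of the two objectives described above, the following sketch computes them for a single training example. It makes several assumptions not fixed by the text: the token log-likelihoods are summed rather than averaged, the head output is squashed with a logistic sigmoid, the two losses are added with equal weight, and the function names are hypothetical.

```python
import math

def lm_loss(token_probs):
    """Negative log-likelihood of the gold response tokens.

    token_probs[i] is the probability the model assigns to the i-th gold
    response token given the context and the previous response tokens.
    """
    return -sum(math.log(p) for p in token_probs)

def nsp_loss(logit, label):
    """Binary cross-entropy for one context/response pair (label is 0 or 1)."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the head output
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def joint_loss(token_probs, gold_logit, distractor_logit):
    """Equal-weight sum of the LM loss and the NSP loss on both candidates."""
    return (lm_loss(token_probs)
            + nsp_loss(gold_logit, 1)
            + nsp_loss(distractor_logit, 0))

# Toy numbers: a two-token response, a confident head on the gold response,
# and a head that correctly rejects the single sampled distractor.
print(joint_loss([0.5, 0.25], gold_logit=2.0, distractor_logit=-2.0))
```

The distractor enters only through the classification term, since there is no reason to raise the likelihood of a random utterance under the language modeling head.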
Fine-tuning on target domains and prediction
As every test dialogue in the target domain/task is accompanied by a small support set of dialogues from the same domain/task, we make use of this data by further fine-tuning the dialogue model on the support dialogues. Crucially, we make sure not to accumulate any information between test dialogues: after each fine-tuning on the support set, we reset the weights of the model to the dialogue prior obtained by the fine-tuning stage described in the previous section.
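The fine-tune, predict, reset cycle can be sketched as follows. The `ToyDialogueModel` class is a hypothetical stand-in whose "weights" are simple utterance counts; it only serves to show that nothing learned from one support set survives into the next test dialogue.

```python
import copy

class ToyDialogueModel:
    """Stand-in for the dialogue model: its 'weights' are utterance counts."""

    def __init__(self):
        self.weights = {}

    def fine_tune(self, support_set):
        # One pass over the support dialogues.
        for dialogue in support_set:
            for utterance in dialogue:
                self.weights[utterance] = self.weights.get(utterance, 0) + 1

    def predict(self, context):
        # Toy decision rule that ignores the context: return the most
        # frequent utterance seen so far (alphabetical order breaks ties).
        if not self.weights:
            return "<unk>"
        return max(sorted(self.weights), key=self.weights.get)

def predict_with_reset(model, episodes):
    """Process each (support set, target context) pair in isolation."""
    prior = copy.deepcopy(model.weights)        # snapshot of the dialogue prior
    predictions = []
    for support_set, target_context in episodes:
        model.fine_tune(support_set)            # adapt to this support set only
        predictions.append(model.predict(target_context))
        model.weights = copy.deepcopy(prior)    # reset before the next dialogue
    return predictions

episodes = [
    ([["hello", "book a table", "book a table"]], ["hello"]),
    ([["hello", "play some jazz"]], ["hello"]),
]
model = ToyDialogueModel()
predictions = predict_with_reset(model, episodes)
print(predictions)
```

Without the reset, the counts accumulated in the first episode would dominate the second prediction, which is exactly the leakage between test dialogues that the setup forbids.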
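In order to add diversity to the responses, GPT-2 uses nucleus (top-p) sampling (Holtzman et al. 2019) during generation, i.e. the model’s vocabulary $V$ is pruned into the smallest set $V^{(p)} \subseteq V$ such that
$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p,$$
where $x_{1:i-1}$ denotes the preceding tokens, and the final distribution from which the words are sampled is rescaled as follows:
$$P'(x \mid x_{1:i-1}) = \begin{cases} P(x \mid x_{1:i-1}) / p' & \text{if } x \in V^{(p)}, \\ 0 & \text{otherwise,} \end{cases} \qquad p' = \sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}).$$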
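A minimal sketch of nucleus sampling over a single next-token distribution is given below. The toy vocabulary, the threshold value and the function names are illustrative assumptions; a real decoder would apply the same filtering to the model's output distribution at every generation step.

```python
import random

def nucleus_filter(probs, p):
    """Keep the smallest set of most probable words whose mass reaches p,
    then rescale the kept probabilities so that they sum to 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        mass += prob
        if mass >= p:
            break
    return {word: prob / mass for word, prob in kept}

def sample(dist, rng):
    """Draw one word from a {word: probability} distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

# Made-up next-token distribution.
probs = {"yes": 0.5, "sure": 0.3, "no": 0.15, "banana": 0.05}
filtered = nucleus_filter(probs, p=0.9)
print(sorted(filtered))
print(sample(filtered, random.Random(0)))
```

With this threshold the unlikely tail word is removed from the candidate set, while the three remaining words keep their relative proportions.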
Hybrid generative-retrieval prediction
In our experiments, we found that retrieval baselines are quite effective in the automatic metrics considered. Combining retrieval techniques with our generative model in a hybrid approach produced a stronger model.
The retrieval component is set up as follows: when predicting the t-th turn of the test dialogue, the model embeds its context, as well as all the support dialogue contexts of the same length, using the fine-tuned dialogue encoder. The encoding for the dialogue context is the hidden state of the last layer of the Transformer model at the position corresponding to the last token in the context. Then, it selects the nearest support context to the target context and picks its t-th turn as the retrieved candidate response.
Table 3: Automatic evaluation results on MultiWOZ pure task dataset
Finally, the model’s own generated response and the best retrieved candidate response are ranked using the NSP classification head, i.e. both responses are concatenated with the ground-truth context and the one with the higher NSP score is selected. The model is visualized in Figure 1.
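The retrieve-then-rank logic can be sketched in a few lines. The sketch assumes cosine distance between non-zero context encodings (the metric is stated only for the baselines), represents the NSP head as an arbitrary scoring function `nsp_score`, and uses made-up vectors and scores; all names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def retrieve(target_vec, support):
    """Return the response attached to the nearest support context.

    `support` is a list of (context_vector, response) pairs.
    """
    _, response = min(support, key=lambda item: cosine_distance(target_vec, item[0]))
    return response

def hybrid_response(target_vec, generated, support, nsp_score):
    """Rank the generated and the retrieved candidate and keep the better one."""
    retrieved = retrieve(target_vec, support)
    candidates = [("generated", generated), ("retrieved", retrieved)]
    return max(candidates, key=lambda cand: nsp_score(cand[1]))

# Toy encodings and scores; a real system would take both from the model.
support = [
    ([1.0, 0.0], "I need a taxi at 5pm"),
    ([0.0, 1.0], "What is the weather like?"),
]
toy_scores = {"ok": 0.2, "I need a taxi at 5pm": 0.7}
source, response = hybrid_response([0.9, 0.1], "ok", support, toy_scores.get)
print(source, "->", response)
```

Because the generated candidate is listed first, it wins ties, which matches the view of retrieval as a fallback option.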
We compare our hybrid model to the retrieval baselines provided by the DSTC-8 organizers. The baselines ignore the training data and rely solely on the support sets: they embed each support dialogue’s context and find the one nearest to the target context using cosine distance as the metric. They then return the turn following the identified context as the predicted response. There are two baselines, which differ in their encoder: (1) BERT (Devlin et al. 2019)-based, taken off-the-shelf, and (2) SentencePiece/FastText-based, modeled after Gu et al. (2018) with embeddings pre-trained on the Reddit Conversations corpus.
We also compare our model to a bidirectional LSTM-based HRED (Serban et al. 2016) trained on MetaLWOz. Given the time constraints, we could only evaluate a base model without fine-tuning to support sets.
We use MetaLWOz, the dataset for DSTC-8 Track 2 “Fast Domain Adaptation” (Lee et al. 2019a). It contains more than 37,000 human-human dialogues spanning a total of 227 tasks in 47 domains. The dialogues were collected in a Wizard-of-Oz style: human participants were assigned the role of bot or user, then given a problem domain and a related specific task, and instructed to reach the user’s goal over at least 10 dialogue turns.
For evaluation purposes, we additionally use MultiWOZ (Budzianowski et al. 2018), another multi-domain, multi-task dialogue dataset. Dialogues in MultiWOZ contain NLU annotations, particularly for intent and slots, which we use to evaluate the systems’ goal-oriented performance. A subset of MultiWOZ (MultiWOZ pure task), where dialogues only pertain to a single domain, was used for evaluation.
We perform training in two stages: training of the base model and fine-tuning it to the target dialogue’s support set. At the first stage, we train the model for a maximum of 5 epochs with early stopping. The second stage goes on for 1 epoch in the interest of time. GPT-2 models use the context of 3 exchanges, or 5 turns: bot-user-bot-user-bot, predicting the next user’s utterance. We mainly used the ‘small’ GPT-2 checkpoint by HuggingFace; we also tried the ‘medium’ one, but found no improvement with it in our task.
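The 5-turn context window can be illustrated with a small sketch that cuts a dialogue into context/response pairs. It assumes that dialogues start with a bot turn and alternate speakers, and it simply skips user turns that have fewer than five preceding turns; how such early turns were actually handled is not specified here. The function name and the example dialogue are made up.

```python
def context_response_pairs(turns, context_size=5):
    """Pair every user turn with the `context_size` turns preceding it.

    `turns` is a list of (speaker, text) tuples in dialogue order.
    """
    pairs = []
    for i in range(context_size, len(turns)):
        speaker, text = turns[i]
        if speaker == "user":
            pairs.append((turns[i - context_size:i], text))
    return pairs

dialogue = [
    ("bot", "Hello, how may I help you?"),
    ("user", "I need to set an alarm."),
    ("bot", "For what time?"),
    ("user", "7 am tomorrow."),
    ("bot", "Should it repeat?"),
    ("user", "No, just once."),
    ("bot", "Done. Anything else?"),
    ("user", "That is all, thanks."),
]
pairs = context_response_pairs(dialogue)
for context, response in pairs:
    print([speaker for speaker, _ in context], "->", response)
```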
Human evaluation
The systems’ main goal is to generate appropriate responses that maintain a natural, cooperative dialogue on the user’s side, so the main evaluation is performed by human judges. Specifically, Amazon Mechanical Turk workers were tasked with comparing the candidate responses given the dialogue context. Each comparison was pairwise between the results of two systems presented in random order. Judges ranked the responses against the following criteria:
• Usefulness: whether the response is useful given the dialogue context and the user’s overall final goal,
• Informativeness: whether the response specifically contains information relevant to the conversation,
• Appropriateness: whether the response is appropriate (on-topic, of a reasonable length, not repetitive) to the conversation,
• Easiness to answer: given a hypothetical conversational bot on the system side, whether the response will be a valid input for it and presumably straightforward to process.
For each pairing, 3 independent comparisons were performed against each metric. The number of comparisons required was reduced by letting the Multisort algorithm (Maystre and Grossglauser 2017) determine which responses to compare, so that systems with similar performance were compared with each other more often. Bootstrapping over the 100 randomly chosen dialogue contexts was used to determine average ranks and assess the ranking robustness (Hall, Miller, and others 2009).
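The bootstrap step can be sketched as follows. This is a simplification, not the evaluation code: it assumes each system already has one numeric score per dialogue (higher is better) and ranks systems by their summed score in each resample, whereas the actual ranks come from pairwise judgments. The system names and scores are made up.

```python
import random
from collections import Counter

def bootstrap_rank_distribution(scores, n_draws=1000, seed=0):
    """Resample dialogues with replacement and recompute system ranks.

    `scores` maps a system name to a list with one score per dialogue.
    Returns {system: Counter({rank: number of draws with that rank})}.
    """
    rng = random.Random(seed)
    systems = list(scores)
    n = len(scores[systems[0]])
    if any(len(scores[s]) != n for s in systems):
        raise ValueError("every system needs one score per dialogue")
    ranks = {s: Counter() for s in systems}
    for _ in range(n_draws):
        sample = [rng.randrange(n) for _ in range(n)]   # dialogue indices
        totals = {s: sum(scores[s][i] for i in sample) for s in systems}
        ordered = sorted(systems, key=lambda s: totals[s], reverse=True)
        for rank, system in enumerate(ordered, start=1):
            ranks[system][rank] += 1
    return ranks

# Made-up scores for 100 dialogues: 'gold' dominates, the other two are close.
scores = {
    "gold": [3] * 100,
    "team_b": [2 if i % 3 else 1 for i in range(100)],
    "team_c": [1 if i % 3 else 2 for i in range(100)],
}
ranks = bootstrap_rank_distribution(scores)
for system, counts in ranks.items():
    print(system, dict(sorted(counts.items())))
```

Dividing each count by the number of draws gives an empirical rank density per system; a density concentrated on a single rank indicates a ranking that is robust to the choice of dialogues.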
Automatic evaluation
In addition to human evaluation, we also assess model performance using automatic metrics. The models were evaluated on MetaLWOz against word-overlap metrics such as BLEU-1–3, CIDEr, METEOR, ROUGE-L using the NLGEval package (Sharma et al. 2017). Although not ideal for the specifics of dialogue and spoken language in general (Lowe et al. 2017; Dziri et al. 2019), such metrics approximate the overall quality of a generative model and are especially useful for intermediate evaluation. We evaluate models in two modes on MetaLWOz: in pure task, support dialogues are drawn from the same domain and task as target dialogue; in cross-task, support and target dialogues are from the same domain, but different tasks.
We also perform additional evaluation of Entity/Intent F1 on the MultiWOZ dataset in pure task mode with pre-trained NLU taggers from the ConvLab package (Lee et al. 2019b). There is no MultiWOZ data available at the first stage (base model training), so all the exposure our model has to this dataset is via support dialogues. Complementary to MetaLWOz evaluation, this stage is designed for assessing the models’ goal-oriented performance.
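For reference, an F1 score over tagged entities or intents can be computed as in the sketch below. It assumes micro-averaging over responses and uses made-up slot labels; the labels for real responses would come from the NLU taggers, which are not invoked here, and the exact averaging used in the evaluation may differ.

```python
def f1_score(predicted, gold):
    """Micro-averaged F1 over paired collections of label sets.

    predicted[i] and gold[i] hold the labels tagged in the i-th predicted
    response and in the i-th gold response, respectively.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)   # labels found in both
        fp += len(pred - ref)   # labels only in the prediction
        fn += len(ref - pred)   # labels missing from the prediction
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical tagger outputs for two responses.
predicted = [{"restaurant-food=italian", "restaurant-area=centre"}, {"hotel-stars=4"}]
gold = [{"restaurant-food=italian"}, {"hotel-stars=4", "hotel-parking=yes"}]
print(round(f1_score(predicted, gold), 3))
```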
Human evaluation
Results of pairwise comparisons are shown in Table 1. Our GPT-2 hybrid system’s responses (Team B) were preferred by the judges in 56% of direct comparisons. This surpasses the next best system (Team C) performance by more than 4%, with only the gold human responses being chosen more frequently.
Furthermore, from the bootstrap ranking distribution (Figure 2), we see that, apart from the gold human responses, our model’s outputs are consistently preferred over other submissions by the judges. Of all metrics used, the most notable are ‘appropriateness’ and ‘usefulness’. On the former, GPT-2 hybrid’s responses have the second visible peak at rank 1 competing with gold responses. On usefulness however, rank 1 is held by the gold responses with no variation, and our model has the second visible peak at rank 3, thus almost tying with Team C.
Automatic evaluation
Results on MetaLWOz and MultiWOZ against automatic evaluation metrics are shown in Tables 2 and 3, respectively. We observe that retrieval baselines attain very competitive performance on both datasets, with FastText embeddings from Reddit leading to overall better results than off-the-shelf BERT, especially in the pure task setting.
With GPT-2, we performed an ablation study to have a closer look into its performance. We evaluated three versions: ‘hybrid’, which we presented in this paper, ‘–ret’ with retrieval logic turned off, and ‘–sup’ with no retrieval logic and no fine-tuning to the support set. As seen in Table 2, there is a strong dependence on support dialogues (‘–sup’ vs. ‘–ret’), as the base model mostly struggles to compete with the baselines. Adding retrieval logic (‘hybrid’ vs. ‘–ret’) results in further performance gains. HRED and GPT-2–sup, the two models that did not use support dialogues, had comparable performance on MetaLWOz.
Table 4: GPT-2 Hybrid example responses
Table 5: GPT-2-hybrid generate/retrieve response ratio
In goal-oriented metrics on MultiWOZ (see Table 3), the same performance pattern is observed with retrieval models, but GPT-2 in the generative-only version performs surprisingly better when not fine-tuned to the support set (‘–sup’). On the other hand, the hybrid model shows an even larger performance gain than on MetaLWOz. Presumably, generating responses for this dataset is harder because it is not represented at the main training stage and there is not much utterance overlap with MetaLWOz, so little knowledge transfer takes place in this experiment. Compared to other submissions, we observe that GPT-2 hybrid still outperforms most of the competitors and only gives way to Team A’s system. We hypothesize that the best MultiWOZ model (Team A) was fitted to the automatic evaluation metrics too tightly, with the negative side effect observable in the human evaluation results of Table 1 and Figure 2, where this system was prevalently ranked 4th and 5th.
Retrieval and Generation Frequency
In Table 5, we show per-domain ratios of retrieved/generated responses from the hybrid model. We find that the majority of the responses are generated, and the retrieval logic works as the fallback option. On MetaLWOz, which the model had more exposure to during training, the ratio of generated responses is generally slightly higher than on MultiWOZ, which was only seen by the model via support dialogues. Consequently, the model’s overall confidence on the latter dataset is lower, which results in more frequent fallbacks.
Overall, we observe in Table 4 that there are many cases in the data where the gold response cannot possibly be inferred from the dialogue context. Specifically, the task was posed in such a way that no extra data, such as a knowledge base or task description, was provided to the system; therefore, the main goal intended for the hypothetical ideal system is to naturally model human responses in a cooperative goal-oriented dialogue, and to do that in a data-efficient way. This is reflected in the way human judges are asked about response quality.
We presented a hybrid generative/retrieval approach to goal-oriented dialogue with fast domain adaptation via transfer learning. It attains robust and diverse language generation performance across domains, and uses retrieval logic as a fallback mechanism in cases of low confidence. Our method is the winning entry at the DSTC-8 Fast Domain Adaptation task achieving state-of-the-art performance as evaluated with human judges. In additional automatic evaluation, it attains competitive generalization performance in adaptation to the goal-oriented MultiWOZ dataset without any exposure to that data during the main training stage.
Overall, we observe that transfer learning, while being at the core of state-of-the-art methods for dialogue domain adaptation and few-shot learning (Shalyminov et al. 2019a; Shalyminov et al. 2019b), still does not attain the performance level sufficient for direct adoption in industry. It is evident that the problem of data-efficient dialogue response generation needs further research, and one promising direction that we are going to explore in our own future work is the meta-learning framework (Qian and Yu 2019), or ‘learning to fine-tune’. Based on splitting the task into multiple subtasks and solving them with separate versions of the model, with further merging of each individual learner’s progress, the meta-learning approach will naturally fit our multi-domain setup as well as lead to potentially better fine-tuning performance.
[Bordes, Boureau, and Weston 2017] Bordes, A.; Boureau, Y.-L.; and Weston, J. 2017. Learning end-to-end goal-oriented dialog. ICLR.
[Budzianowski and Vulic 2019] Budzianowski, P., and Vulic, I. 2019. Hello, it’s GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. CoRR abs/1907.05774.
[Budzianowski et al. 2018] Budzianowski, P.; Wen, T.; Tseng, B.; Casanueva, I.; Ultes, S.; Ramadan, O.; and Gasic, M. 2018. Multiwoz - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP).
[Chung et al. 2014] Chung, J.; Gülçehre, Ç.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
[Devlin et al. 2019] Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[Dziri et al. 2019] Dziri, N.; Kamalloo, E.; Mathewson, K. W.; and Zaïane, O. R. 2019. Evaluating coherence in dialogue systems using entailment. In Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[Eshghi, Purver, and Hough 2011] Eshghi, A.; Purver, M.; and Hough, J. 2011. Dylan: Parser for dynamic syntax. Technical report, Queen Mary University of London.
[Eshghi, Shalyminov, and Lemon 2017] Eshghi, A.; Shalyminov, I.; and Lemon, O. 2017. Interactional Dynamics and the Emergence of Language Games. In Proc. ESSLLI workshop on Formal approaches to the Dynamics of Linguistic Interaction.
[Gasic et al. 2017] Gasic, M.; Mrksic, N.; Rojas-Barahona, L. M.; Su, P.; Ultes, S.; Vandyke, D.; Wen, T.; and Young, S. J. 2017. Dialogue manager domain adaptation using gaussian process reinforcement learning. Computer Speech & Language 45:552–569.
[Gu et al. 2018] Gu, J.; Wang, Y.; Chen, Y.; Li, V. O. K.; and Cho, K. 2018. Meta-learning for low-resource neural machine translation. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP).
[Hall, Miller, and others 2009] Hall, P.; Miller, H.; et al. 2009. Using the bootstrap to quantify the authority of an empirical ranking. The Annals of Statistics 37(6B):3929–3959.
[Henderson, Thomson, and Williams 2014] Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The third dialog state tracking challenge. In Spoken Language Technology Workshop (SLT).
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
[Holtzman et al. 2019] Holtzman, A.; Buys, J.; Forbes, M.; and Choi, Y. 2019. The curious case of neural text degeneration. CoRR abs/1904.09751.
[Kelley 1984] Kelley, J. F. 1984. An iterative design methodology for user-friendly natural language office information applications. In ACM Transactions on Information Systems.
[Lee et al. 2019a] Lee, S.; Schulz, H.; Atkinson, A.; Gao, J.; Suleman, K.; El Asri, L.; Adada, M.; Huang, M.; Sharma, S.; Tay, W.; and Li, X. 2019a. Multi-domain task-completion dialog challenge. In Dialog System Technology Challenges 8.
[Lee et al. 2019b] Lee, S.; Zhu, Q.; Takanobu, R.; Li, X.; Zhang, Y.; Zhang, Z.; Li, J.; Peng, B.; Li, X.; Huang, M.; and Gao, J. 2019b. Convlab: Multi-domain end-to-end dialog system platform. In Proc. Conf. Assoc. for Computational Linguistics (ACL).
[Lemon and Pietquin 2012] Lemon, O., and Pietquin, O. 2012. Data-Driven Methods for Adaptive Spoken Dialogue Systems: Computational Learning for Conversational Interfaces. Springer.
[Li and Jurafsky 2016] Li, J., and Jurafsky, D. 2016. Neural net models of open-domain discourse coherence. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP).
[Li et al. 2015] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, W. B. 2015. A diversity-promoting objective function for neural conversation models. In Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[Lowe et al. 2017] Lowe, R.; Noseworthy, M.; Serban, I. V.; Angelard-Gontier, N.; Bengio, Y.; and Pineau, J. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proc. Conf. Assoc. for Computational Linguistics (ACL).
[Maystre and Grossglauser 2017] Maystre, L., and Grossglauser, M. 2017. Just sort it! A simple and effective approach to active preference learning. In Proc. Int. Conf. on Machine Learning (ICML).
[Mrksic et al. 2015] Mrksic, N.; Ó Séaghdha, D.; Thomson, B.; Gasic, M.; Su, P.; Vandyke, D.; Wen, T.; and Young, S. J. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proc. Conf. Assoc. for Computational Linguistics (ACL).
[Peters et al. 2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[Qian and Yu 2019] Qian, K., and Yu, Z. 2019. Domain adaptive dialog generation via meta learning. In Proc. Conf. Assoc. for Computational Linguistics (ACL).
[Radford et al. 2018] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2018. Language models are unsupervised multitask learners.
[Rieser, Kruijff-Korbayov, and Lemon 2005] Rieser, V.; Kruijff-Korbayov, I.; and Lemon, O. 2005. A corpus collection and annotation framework for learning multimodal clarification strategies. In Proc. Meeting on Discourse and Dialogue (SIGDIAL).
[Serban et al. 2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. AAAI Conf. on Artificial Intelligence.
[Shalyminov et al. 2019a] Shalyminov, I.; Lee, S.; Eshghi, A.; and Lemon, O. 2019a. Data-efficient goal-oriented conversation with dialogue knowledge transfer networks. Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP).
[Shalyminov et al. 2019b] Shalyminov, I.; Lee, S.; Eshghi, A.; and Lemon, O. 2019b. Few-shot dialogue generation without annotated data: A transfer learning approach. Proc. Meeting on Discourse and Dialogue (SIGDIAL).
[Shalyminov, Eshghi, and Lemon 2017] Shalyminov, I.; Eshghi, A.; and Lemon, O. 2017. Challenging neural dialogue models with natural data: Memory networks fail on incremental phenomena. In Proc. Workshop on the Semantics and Pragmatics of Dialogue (SemDial).
[Sharma et al. 2017] Sharma, S.; El Asri, L.; Schulz, H.; and Zumer, J. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR abs/1706.09799.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS).
[Vinyals and Le 2015] Vinyals, O., and Le, Q. V. 2015. A neural conversational model. Proc. Int. Conf. on Machine Learning (ICML).
[Vlasov, Drissner-Schmid, and Nichol 2018] Vlasov, V.; Drissner-Schmid, A.; and Nichol, A. 2018. Few-shot generalization across dialogue tasks. CoRR abs/1811.11707.
[Williams, Asadi, and Zweig 2017] Williams, J. D.; Asadi, K.; and Zweig, G. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proc. Conf. Assoc. for Computational Linguistics (ACL).
[Wolf et al. 2019] Wolf, T.; Sanh, V.; Chaumond, J.; and Delangue, C. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. NeurIPS Workshop on Conversational AI.
[Zhao and Eskénazi 2018] Zhao, T., and Eskénazi, M. 2018. Zero-shot dialog generation with cross-domain latent actions. In Proc. Meeting on Discourse and Dialogue (SIGDIAL).
[Zhao et al. 2017] Zhao, T.; Lu, A.; Lee, K.; and Eskénazi, M. 2017. Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proc. Meeting on Discourse and Dialogue (SIGDIAL).