Unlike humans, who can handle both, goal-oriented dialogues [4, 5] and chit-chat conversations [6, 7] are often learned with separate models. A more desirable approach for users would be a single chat interface that can handle both casual talk and tasks such as reservation or scheduling. This can be formulated as a problem of learning different conversational skills across multiple domains. A skill can be querying a database, generating daily conversational utterances, or interacting with users in a particular task domain (e.g., booking a restaurant). One challenge of having multiple skills is that existing datasets focus either only on chit-chat or only on goal-oriented dialogues. This is because traditional goal-oriented systems are modularized [4, 8–10, 5]; thus, they cannot be jointly trained with an end-to-end architecture as in chit-chat. However, recently proposed end-to-end trainable models [11–14] and datasets [15, 2] allow us to combine goal-oriented [1, 2] and chit-chat [3] dialogues into a single benchmark dataset with multiple conversational skills, as shown in Table 1.
Table 1: An example from the dataset, which includes both chit-chat and task-oriented conversations. The model has to predict all the Sys turns, which include SQL queries and responses generated from the Memory content, which is dynamically updated with the query results. The skills are the prior knowledge needed for the response, where Persona refers to chit-chat.

A straightforward solution would be to have a single model for all the conversational skills, which has been shown to be effective to a certain extent by [16] and [17]. Putting aside task performance, such a fixed shared-parameter framework, without any task-specific designs, would lose controllability and interpretability in response generation. In this paper, instead, we propose to model multiple conversational skills using the Mixture of Experts (MoE) [18] paradigm, i.e., a model that learns and combines independent specialized experts using a gating function. For instance, each expert could specialize in a different dialogue domain (e.g., Hotel, Train, Chit-Chat) or skill (e.g., generating SQL queries). A popular implementation of MoE [19, 20] uses a set of linear transformations (i.e., experts) between two LSTM [21] layers. However, several problems arise with this implementation: 1) the model is computationally expensive, as it has to decode once per expert and combine the results at the representation level; 2) no prior knowledge (e.g., domains) is injected into the expert selection; 3) a Seq2Seq model has limited ability to extract information from a Knowledge Base (KB) (i.e., the one generated by the SQL query) [2], as required in end-to-end task-oriented dialogue systems [15]. The latter can be solved by using more advanced multi-hop models like the Transformer [22], but the remaining two need to be addressed. Hence, in this paper we:
• propose a novel Transformer-based architecture called Attention over Parameters (AoP). This model parameterizes the conversational skills of end-to-end dialogue systems with independent decoder parameters (experts), and learns how to dynamically select and combine the appropriate decoder parameter sets by leveraging prior knowledge from the data, such as domains and skill types;
• prove that AoP is algorithmically more efficient than forwarding all the Transformer decoders and then mixing their output representations, as is normally done in MoE. Figure 1 illustrates the high-level intuition of the difference;
• empirically show the effectiveness of using specialized parameters on a combined dataset of MultiWOZ [1], In-Car Assistant [2], and Persona-Chat [3], which, to the best of our knowledge, is the first evaluation of this kind, i.e., end-to-end, large-scale, and multi-domain/multi-skill. Moreover, we show that our model is highly interpretable and is able to combine different learned skills to produce compositional responses.
We use the standard encoder-decoder architecture and avoid any task-specific designs [12, 13], as we aim to build a generic conversation model for both chit-chat and task-oriented dialogues. More specifically, we use a Transformer for both encoder and decoder.
Figure 1: Comparison between a single model, Mixture of Experts (MoE) [18], and Attention over Parameters (AoP).

Let us define the sequence of tokens in the dialogue history as $D = \{d_1, \dots, d_m\}$ and the dynamic memory content as a sequence of tokens $M = \{m_1, \dots, m_z\}$. The latter can be the result of a SQL query execution (e.g., a table) or plain text (e.g., a persona description), depending on the task. The dialogue history $D$ and the memory $M$ are concatenated to obtain the final input, denoted by $X = [D; M]$. We then denote $Y = \{y_1, \dots, y_k\}$ as the sequence of tokens that the model is expected to produce. Without loss of generality, $Y$ can be both plain text and SQL-like queries. Hence, the model has to learn when to issue database queries and when to generate human-like responses. Finally, we define a binary skill vector $V = \{v_1, \dots, v_r\}$ that specifies the types of skills required to generate $Y$. This can be considered as a prior vector for learning to select the correct expert during training. For example, in Table 1 the first response is of type SQL in the Hotel domain, thus the skill vector $V$ will have $v_{SQL} = 1$ and $v_{Hotel} = 1$, while all the other skills/domains are set to zero. More importantly, we may set the vector $V$ to have multiple ones to force the model to compose skills and thereby achieve semantic compositionality of different experts.
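As a minimal sketch of the setup described above, the input concatenation and the binary skill prior could be built as follows. The skill inventory, its ordering, and the helper names `build_input` and `skill_vector` are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical skill inventory; names and order are illustrative assumptions.
SKILLS: List[str] = ["SQL", "BOOK", "Hotel", "Train", "Restaurant", "Persona"]


def build_input(history: Sequence[str], memory: Sequence[str]) -> List[str]:
    """Concatenate dialogue history D and memory M into the input X = [D; M]."""
    return list(history) + list(memory)


def skill_vector(active: Sequence[str], skills: Sequence[str] = SKILLS) -> List[int]:
    """Return the binary prior V with a 1 for every active skill or domain."""
    unknown = set(active) - set(skills)
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    return [1 if s in active else 0 for s in skills]


# A SQL response in the Hotel domain activates exactly two entries of V.
x = build_input(["i", "need", "a", "hotel"], ["<empty_memory>"])
v = skill_vector(["SQL", "Hotel"])
print(x)
print(v)
```

Setting several entries of the vector to one is what later allows skills to be composed, since each entry supervises one expert independently.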
2.1 Encoder-Decoder
To map the input sequence to the output sequence, we use a standard Transformer [22] and denote the encoder and decoder as $\mathrm{TRS}_{enc}$ and $\mathrm{TRS}_{dec}$, respectively. The input of a Transformer is the embedded representation of the input words; thus, we define a word embedding matrix $E \in \mathbb{R}^{d \times |V|}$, where $d$ is the embedding size and $|V|$ is the cardinality of the vocabulary. The input $X$, with its positional embedding (see Appendix A1 for more information), is encoded as follows:

$$H = \mathrm{TRS}_{enc}(E(X)), \qquad (1)$$

where $H \in \mathbb{R}^{d \times n}$, $n$ is the length of the input, and $E(X)$ denotes the embedded input tokens. Then the decoder receives the target sequence shifted by one, $Y_{:k-1} = \{\langle SOS \rangle, y_1, \dots, y_{k-1}\}$, as the input. Using teacher-forcing [23], the model is trained to produce the correct sequence $Y$. The output of the decoder is produced as follows:

$$O = \mathrm{TRS}_{dec}(E(Y_{:k-1}), H), \qquad (2)$$

where $O \in \mathbb{R}^{d \times k}$. Finally, a distribution over the vocabulary is generated for each token by an affine transformation $W \in \mathbb{R}^{d \times |V|}$ followed by a Softmax function:

$$P(Y|X) = \mathrm{Softmax}(O^{\top} W). \qquad (3)$$

In addition, $P(Y|X)$ is mixed with the encoder-decoder attention distribution to enable copying tokens from the input sequence, as in [24]. The model is then trained to minimize a standard cross-entropy loss and, at inference time, generates one token at a time in an auto-regressive manner [25]. Hence, the training loss is defined as:

$$\mathcal{L}_{P(Y|X)} = -\log P(Y|X). \qquad (4)$$
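The output projection and cross-entropy loss described in this subsection can be sketched in NumPy as follows. This is an illustrative sketch only: the decoder output `O` is random rather than produced by a Transformer, the copy mechanism of [24] is omitted, and the function names and toy sizes are assumptions.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def sequence_nll(O: np.ndarray, W: np.ndarray, targets: np.ndarray) -> float:
    """Negative log-likelihood of the gold tokens under teacher forcing.

    O: (d, k) decoder output, W: (d, vocab) projection, targets: (k,) token ids.
    """
    probs = softmax(O.T @ W, axis=-1)                 # (k, vocab)
    gold = probs[np.arange(len(targets)), targets]    # probability of each gold token
    return float(-np.log(gold).sum())


rng = np.random.default_rng(0)
d, k, vocab = 8, 5, 20                                # toy sizes, chosen arbitrarily
O = rng.normal(size=(d, k))                           # stand-in for the decoder output
W = rng.normal(size=(d, vocab))
y = rng.integers(0, vocab, size=k)
loss = sequence_nll(O, W, y)
print(loss)
```

With an all-zero decoder output the distribution is uniform, so the loss reduces to `k * log(vocab)`, which is a convenient sanity check.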
2.2 Attention over Parameters
The main idea is to produce a single set of parameters for the decoder $\mathrm{TRS}_{dec}$ as the weighted sum of $r$ independently parameterized decoders. This process is similar to attention [26], where the memories are the parameters and the query is the encoded representation. Let us define $\Theta = [\theta_1, \dots, \theta_r]$ as the list of parameters for the $r$ decoders, since a $\mathrm{TRS}_{dec}$ is represented by its parameters $\theta$. Since each $\theta$ can be sized in the order of millions, we assign a corresponding key vector to each $\theta$, similar to key-value memory networks [27]. Thus, we use a key matrix $K \in \mathbb{R}^{d \times r}$ and a Recurrent Neural Network (RNN), in this instance a GRU [28], to produce the query vector by processing the encoder output $H$. The attention weights for the decoders' parameters are computed as follows:

$$q = \mathrm{RNN}(H), \qquad (5)$$

$$\alpha = \mathrm{Softmax}(qK), \qquad (6)$$

where $q \in \mathbb{R}^{d}$ and $\alpha \in \mathbb{R}^{r}$ is the attention vector, where each $\alpha_i$ is the score corresponding to $\theta_i$. Hence, the new set of parameters is computed as follows:

$$\theta^{*} = \sum_{i=1}^{r} \alpha_i \theta_i. \qquad (7)$$

The combined set of parameters is then used to initialize a new $\mathrm{TRS}_{dec}$, and Equation 2 is applied to the input based on this. Equation 6 is similar to the gating function proposed in [19, 18], but the resulting scoring vector $\alpha$ is applied directly to the parameters instead of the output representation of each decoder, which yields an algorithmically faster computation.
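The parameter-mixing step can be illustrated with a small NumPy sketch. It makes strong simplifying assumptions: each expert is a single affine map rather than a full Transformer decoder, the query `q` is a random vector standing in for the GRU output, and all names and sizes are hypothetical. For linear experts, mixing parameters and mixing outputs give the same result, which the last lines check; this equality does not hold in general for non-linear decoders.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def attention_over_parameters(q: np.ndarray, K: np.ndarray, thetas: np.ndarray):
    """Mix r parameter sets into one.

    q: (d,) query, K: (d, r) key matrix, thetas: (r, d_in, d_out) expert weights.
    Returns the attention vector alpha (r,) and the mixed weights (d_in, d_out).
    """
    alpha = softmax(q @ K)                             # one score per expert
    theta_star = np.tensordot(alpha, thetas, axes=1)   # weighted sum of parameters
    return alpha, theta_star


rng = np.random.default_rng(0)
d, r, d_in, d_out, t = 16, 4, 8, 6, 5                  # toy sizes, chosen arbitrarily
q = rng.normal(size=d)                                 # stand-in for the GRU query
K = rng.normal(size=(d, r))
thetas = rng.normal(size=(r, d_in, d_out))
X = rng.normal(size=(t, d_in))

alpha, theta_star = attention_over_parameters(q, K, thetas)
aop_out = X @ theta_star                                    # one forward pass
aor_out = sum(a * (X @ W) for a, W in zip(alpha, thetas))   # r forward passes
print(np.allclose(aop_out, aor_out))
```

The design point is that the sum over experts happens once over the weights, so the input sequence is pushed through a single decoder regardless of how many experts exist.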
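Theorem 1. The computational cost of Attention over Parameters (AoP) is always lower than that of Mixture of Experts (MoE) for sequences longer than 1.

Proof. Let $f_{\theta}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{n}$ be a generic function parametrized by $\theta$. Without loss of generality, we define $\theta$ as an affine transformation $W \in \mathbb{R}^{d \times n}$. Let $X \in \mathbb{R}^{t \times d}$ be a generic input sequence of length $t$ and dimension $d$. Let the set $F = [f_{\theta_1}, \dots, f_{\theta_r}]$ be the set of $r$ experts. Hence, the operations done by MoE are:

$$\mathrm{MoE}(X) = f_{\theta_1}(X) + \dots + f_{\theta_r}(X) = XW_1 + \dots + XW_r. \qquad (8)$$

Thus the computational cost in terms of operations is $O(rtdn + rtn)$, since the cost of $f_{\theta_i}(X)$ is $O(tdn)$ and it is repeated $r$ times, and the cost of summing the representations is $O(rtn)$. On the other hand, the operations done by AoP are:

$$\theta^{*} = \theta_1 + \dots + \theta_r = W_1 + \dots + W_r, \qquad (9)$$

$$\mathrm{AoP}(X) = f_{\theta^{*}}(X) = XW^{*}. \qquad (10)$$

In this case the computational cost in terms of operations is $O((r + t)dn)$, since the cost of summing the parameters is $O(rdn)$ and the cost of $f_{\theta^{*}}(X)$ is $O(tdn)$. Hence, with at least two experts ($r \geq 2$), we have $rt \geq r + t$ whenever $t > 1$, and it is easy to verify that:

$$(r + t)dn < rtdn + rtn. \qquad (11)$$

Furthermore, the assumption of a simple affine transformation $W$ is a conservative one. Indeed, assuming that the number of operations of an expert equals its number of parameters is optimistic: with attention, for instance, the number of operations increases while the number of parameters, and hence the cost of summing them, remains constant.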
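The two operation counts from the proof can be compared numerically with a short sketch. The functions below simply evaluate the expressions `r*t*d*n + r*t*n` and `(r + t)*d*n`; the concrete values of `t`, `d`, and `n` are hypothetical, while `r = 13` matches the number of experts used in the experiments.

```python
def moe_cost(r: int, t: int, d: int, n: int) -> int:
    """Operation count for MoE: r expert passes plus summing r outputs."""
    return r * t * d * n + r * t * n


def aop_cost(r: int, t: int, d: int, n: int) -> int:
    """Operation count for AoP: summing r parameter sets plus one pass."""
    return (r + t) * d * n


# Hypothetical sizes: 13 experts, 20 tokens, 512-dimensional input and output.
r, t, d, n = 13, 20, 512, 512
print(moe_cost(r, t, d, n), aop_cost(r, t, d, n))

# Check the inequality over a small grid with at least two experts and t > 1.
always_cheaper = all(
    aop_cost(rr, tt, dd, dd) < moe_cost(rr, tt, dd, dd)
    for rr in range(2, 20)
    for tt in range(2, 50)
    for dd in (1, 8, 64)
)
print(always_cheaper)
```

The gap widens with both the number of experts and the sequence length, since the MoE count grows with their product and the AoP count with their sum.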
Importantly, if we apply $\alpha$ to each of the output representations $O_i$ generated by the corresponding $\mathrm{TRS}^{i}_{dec}$, we end up with a Transformer-based implementation of MoE. We call this model Attention over Representation (AoR). Finally, an additional loss term is used to supervise the attention vector $\alpha$ using the prior knowledge vector $V$. Since multiple decoder parameters can be selected at the same time, we use a binary cross-entropy to train each $\alpha_i$. Thus a second loss is defined as:

$$\mathcal{L}_{V} = -\sum_{j=1}^{r} \left[ v_j \log \sigma(qK)_j + (1 - v_j) \log\left(1 - \sigma(qK)_j\right) \right]. \qquad (12)$$

The final loss is the summation of $\mathcal{L}_{P(Y|X)}$ and $\mathcal{L}_{V}$.
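A minimal sketch of the supervision on the attention vector is given below, assuming that each expert score is passed through a sigmoid and compared with the corresponding entry of the binary prior. The function names, the clipping constant, and the example logits are assumptions for illustration.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def skill_loss(logits: np.ndarray, v: np.ndarray, eps: float = 1e-12) -> float:
    """Binary cross-entropy between per-expert scores and the binary prior.

    logits: (r,) unnormalized scores (q K), v: (r,) binary skill vector.
    """
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)   # avoid log(0)
    return float(-(v * np.log(p) + (1 - v) * np.log(1 - p)).sum())


v = np.array([1.0, 0.0, 1.0])                      # two active skills out of three
confident = np.array([10.0, -10.0, 10.0])          # scores that agree with the prior
uninformed = np.zeros(3)                           # scores with no preference

print(skill_loss(confident, v))
print(skill_loss(uninformed, v))
```

Because each expert is scored independently, several entries of the prior can be one at the same time, which a single softmax cross-entropy target would not allow.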
Finally, in AoP, and more generally in the MoE framework, stacking multiple layers (e.g., Transformer layers) leads to models with a large number of parameters, since multiple experts are repeated across layers. An elegant workaround is the Universal Transformer [29], which loops over a single layer and, as shown by [29], achieves similar or better performance than a multi-layer Transformer. In our experiments, we report a version of AoP that uses this architecture, which does not add any further parameters to the model.
3.1 Dataset
To evaluate the performance of our model for different conversational skills, we propose to combine three publicly available datasets: MultiWOZ [1], Stanford Multi-domain Dialogue [2] and PersonaChat [3].
MultiWOZ (MWOZ) is a human-to-human multi-domain goal-oriented dataset annotated with dialogue acts and states. In this dataset, there are seven domains (i.e., Taxi, Police, Restaurant, Hospital, Hotel, Attraction, Train) and two API interfaces: SQL and BOOK. The former is used to retrieve information about a certain domain and the latter is used to book restaurants, hotels, trains, and taxis. We refine this dataset to include SQL/BOOK queries and their outputs using the same annotation schema as [15]. Hence, each response can be either a plain-text turn with the user or a SQL/BOOK query, and the memory is dynamically populated with the results of the queries, since the generated response is based on such information. This transformation allows us to train end-to-end models that learn how and when to produce SQL queries, to retrieve knowledge from a dynamic memory, and to produce plain-text responses. A detailed explanation is reported in Appendix A3, together with some samples.
Stanford Multi-domain Dialogue (SMD) is another human-to-human multi-domain goal-oriented dataset that is already designed for end-to-end training. There are three domains in this dataset (i.e., Point-of-Interest, Weather, Calendar). The difference between this dataset and MWOZ is that each dialogue is associated with a set of records relevant to the dialogues. The memory is fixed in this case so the model does not need to issue any API calls. However, retrieving the correct entities from the memory is more challenging as the model has to compare different alternatives among records.
Persona-Chat is a multi-turn conversational dataset, in which two speakers are paired and different persona descriptions (4-5 sentences) are randomly assigned to each of them. For example, “I am an old man” and “I like to play football” are among the possible persona descriptions provided to the system. Training models on this dataset results in more persona-consistent and fluent conversations compared to other existing datasets [3]. This dataset has become one of the standard benchmarks for chit-chat systems; thus, we include it in our evaluation.
For all three datasets, we use the training/validation/test split provided by the authors, and we keep all the real entities in the input instead of using their delexicalized version as in [1, 2]. This makes the task more challenging, but at the same time more interesting, since we force the model to produce real entities instead of generic and frequent placeholders. Table 2 summarizes the dataset statistics in terms of number of dialogues, turns, and unique tokens. Finally, we merge the three datasets, obtaining 154,768/19,713/19,528 for training, validation, and test, respectively, and a vocabulary size of 37,069 unique tokens.
3.2 Evaluation Metrics
Goal-Oriented For both MWOZ and SMD, we follow the evaluation done by existing works [11, 16, 30, 31]. We use the BLEU score [32] to measure response fluency and the Entity F1-Score [33, 16] to evaluate the ability of the model to generate relevant entities from the dynamic memory. Since MWOZ also includes SQL and BOOK queries, we compute the exact match accuracy (i.e., $ACC_{SQL}$ and $ACC_{BOOK}$) and BLEU score (i.e., $BLEU_{SQL}$). Furthermore, we also report the F1-score for each domain in both MWOZ and SMD.
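To make the query and entity metrics concrete, here is a hedged sketch of an exact-match accuracy and a set-based entity F1. The exact aggregation used by the cited works (for example, micro-averaging over a corpus) is not specified in this section, so the per-response formulation and the function names below are assumptions.

```python
from typing import Sequence, Set


def exact_match_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predicted queries identical to the gold query."""
    pairs = list(zip(preds, golds))
    if not pairs:
        return 0.0
    return sum(p.strip() == g.strip() for p, g in pairs) / len(pairs)


def entity_f1(pred_entities: Set[str], gold_entities: Set[str]) -> float:
    """Harmonic mean of entity precision and recall for a single response."""
    if not pred_entities and not gold_entities:
        return 1.0                      # nothing to predict and nothing predicted
    true_positives = len(pred_entities & gold_entities)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_entities)
    recall = true_positives / len(gold_entities)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions, for illustration only.
acc = exact_match_accuracy(
    ["SELECT * FROM hotel", "SELECT * FROM train"],
    ["SELECT * FROM hotel", "SELECT * FROM taxi"],
)
f1 = entity_f1({"cambridge", "19:00"}, {"cambridge"})
print(acc, f1)
```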
Chit-Chat We compare perplexity, BLEU score, F1-score [34], and Consistency score of the generated sentences against the human-generated references. The Consistency score is computed using a Natural Language Inference (NLI) model trained on Dialogue NLI (DNLI) [35], a recently proposed corpus based on the Persona dataset. We fine-tune a pre-trained BERT model [36] on the DNLI corpus and achieve a test set accuracy of 88.43%, which is similar to the best-reported model in [35].
The consistency score is defined as follows:

$$\mathrm{NLI}(u, p_j) = \begin{cases} 1 & \text{if } u \text{ entails } p_j \\ 0 & \text{if } u \text{ is independent of } p_j \\ -1 & \text{if } u \text{ contradicts } p_j \end{cases} \qquad C(u) = \sum_{j=1}^{m} \mathrm{NLI}(u, p_j), \qquad (13)$$

where $u$ is a generated utterance and $p_j$ is one sentence in the persona description. In [35, 37], the authors showed that by re-ranking the beam search hypotheses using the DNLI score (i.e., the C score), they achieved a substantial improvement in dialogue consistency. Intuitively, a higher consistency C score means a more persona-consistent dialogue response.
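The consistency score can be sketched as a sum of per-sentence NLI judgments, assuming the common convention of +1 for entailment, 0 for neutral, and -1 for contradiction. The classifier `toy_nli` below is a hypothetical keyword rule standing in for the fine-tuned BERT model, and the label names are assumptions.

```python
from typing import Callable, Sequence

# Assumed mapping from NLI labels to scores.
NLI_SCORE = {"entailment": 1, "neutral": 0, "contradiction": -1}


def consistency_score(
    utterance: str,
    persona: Sequence[str],
    nli_label: Callable[[str, str], str],
) -> int:
    """Sum the NLI scores of the utterance against every persona sentence."""
    return sum(NLI_SCORE[nli_label(utterance, p)] for p in persona)


def toy_nli(utterance: str, persona_sentence: str) -> str:
    """Hypothetical keyword-based stand-in for a trained NLI classifier."""
    if "football" in utterance and "football" in persona_sentence:
        return "entailment"
    if "young" in utterance and "old" in persona_sentence:
        return "contradiction"
    return "neutral"


persona = ["i am an old man", "i like to play football"]
consistent = consistency_score("i love football", persona, toy_nli)
inconsistent = consistency_score("i am a young boy", persona, toy_nli)
print(consistent, inconsistent)
```

In practice the callable would wrap a real classifier; keeping it as a parameter makes the scoring logic independent of the underlying model.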
3.3 Baselines
In our experiments, we compare Sequence-to-Sequence (Seq2Seq) [24], Transformer (TRS) [22], Mixture of Experts (MoE) [19], and Attention over Representation (AoR) with our proposed Attention over Parameters (AoP). In all the models, we use the same copy mechanism as in [24]. In AoR, instead of mixing the parameters as in Equation 7, we mix the output representations of each Transformer decoder (i.e., Equation 2). For AoP, AoR, and MoE, $r = 13$ is the number of decoders (experts): 2 skills for SQL and BOOK, 10 different domains for MWOZ+SMD, and 1 for Persona-Chat. Furthermore, we also include the following experiments: AoP that uses the gold attention vector $V$, which we refer to as AoP w/ Oracle (or AoP + O); AoP trained by removing $\mathcal{L}_{V}$ from the optimization (AoP w/o $\mathcal{L}_{V}$); and, as mentioned above, the Universal Transformer for both AoP (AoP + U) and the standard Transformer (TRS + U) (i.e., 6 hops). Detailed model descriptions and the full set of hyper-parameters used in the experiments are reported in Appendix A4.
3.4 Results
Table 2 and Table 3 show the evaluation results on the MWOZ+SMD and Persona-Chat datasets, respectively.
Figure 3: Results for the Persona-Chat dataset.

From Table 2, we can identify four patterns. 1) AoP and AoR perform consistently better than the other baselines, which shows the effectiveness of combining parameters by using the correct prior $V$; 2) AoP performs consistently, but marginally, better than AoR, with the advantage of algorithmically faster inference; 3) using the Oracle (AoP + O) gives the highest performance in all the measures, which shows the performance upper bound for AoP. Hence, the performance gap when not using oracle attention is most likely due to the error in attention (i.e., a 2% error rate); 4) by removing $\mathcal{L}_{V}$ (AoP w/o $\mathcal{L}_{V}$), the model performance decreases, which confirms that a good inductive bias is important for learning how to select and combine different parameters (experts). Additionally, in Appendix A5, we report the per-domain F1-Score for SQL, BOOK, and sentences, and Table 3 and Table 2 with the standard deviation among the three runs.
Furthermore, from Table 3, we can notice that MoE has the lowest perplexity and F1-score, but AoP has the highest Consistency and BLEU score. Notice that the perplexity reported in [3] is lower since the vocabulary used in their experiments is smaller. In general, the difference in performance among the models is marginal except for the Consistency score; thus, we can conclude that all the models can learn this skill reasonably well. Consistent with the previous results, when $\mathcal{L}_{V}$ is removed from the optimization, the model's performance decreases.
Table 2: Results for the goal-oriented responses in both MWOZ and SMD. The last row, italicized, reports the Oracle results, and bold-faced results are the best in each setting (with and without Universal). Results are averaged over three runs (full table in Appendix A6).

Finally, in both Table 2 and Table 3, we report the results obtained by using the Universal Transformer, for both AoP and the Transformer. By adding the layer recursion, both models consistently improve on all the evaluated measures, in both Persona-Chat and the task-oriented tasks. This is especially true for AoP, which achieves better performance than the Oracle (i.e., single layer) in SQL accuracy, and consistently better performance in the Persona-Chat evaluation.
To demonstrate the effectiveness of our model in learning independent skills and composing them together, we manually trigger skills by modifying $\alpha$ and generate 14 different responses for the same input dialogue context. This experiment allows us to verify whether the model accurately captures the meaning of each skill and whether it can properly learn to compose the selected parameters (skills). Table 3 first shows the dialogue history along with the response of AoP at the top, and then different responses generated by modifying $\alpha$ (i.e., black cells correspond to 1 in the vector, while white cells are 0). By analyzing Table 3 we can notice that:
• The model learns the correct semantics of each skill. For instance, the AoP response is of type SQL and Train, and by deactivating the SQL skill and activating other domain skills, including Train, we can see that the responses are grammatical and coherent with the selected skill semantics. For instance, by just selecting Train, the generated answer becomes “what time would you like to leave?”, which is coherent with the dialogue context since such information has not yet been provided. Interestingly, when the Persona skill is selected, the generated response is conversational and also coherent with the dialogue, even though it is less fluent.
• The model effectively learns how to compose multiple skills. For instance, when SQL or BOOK are triggered the response produces the correct SQL-syntax (e.g. “SELECT * FROM ..” etc.). By also adding the corresponding domain-skill, the model generates the correct query format and attributes relative to the domain type (e.g. in SQL, Restaurant, the model queries with the relevant attribute food for restaurants).
Dialogue Task-oriented dialogue models [38] can be categorized into two types: module-based [4, 8–10, 5, 39] and end-to-end. In this paper, we focus on the latter, which are systems that train a single model directly on text transcripts of dialogues. These tasks are tackled by selecting from a set of predefined utterances [15, 40–42] or by generating a sequence of tokens [33, 43, 16, 44]. Especially in the latter, copy-augmented models [11, 13, 14] are very effective, since extracting entities from a knowledge base is fundamental. On the other hand, end-to-end open-domain chit-chat models have been widely studied [6, 7, 45–47]. Several works improved on the initially reported baselines with various methodologies [48, 14, 49, 50, 34]. Finally, [16] was the first attempt at an end-to-end system for both task-oriented dialogue and chit-chat. However, the dataset used for the evaluation was small and covered only a single domain, and the chit-chat ability was added manually through rules.
Table 3: Selecting different skills through the attention vector results in a skill-consistent response. The AoP response activates SQL and Train.

Mixture of Expert & Conditional Computation The idea of having specialized parameters, or so-called experts, has been a widely studied topic in the last two decades [18, 51]. For instance, different architectures and methodologies have been used, such as Gaussian Processes [52], Hierarchical Experts [53], and sequential expert addition [54]. More recently, the Mixture of Experts [19, 20] model was proposed, which added a large number of experts between two LSTMs. To the best of our knowledge, none of these previous works applied the results of the gating function to the parameters themselves. On the other hand, there are Conditional Computation models, which learn to dynamically select their computation graph [55, 56]. Several methods have been used, such as reinforcement learning [57], a halting function [58, 29, 59], pruning [60, 61], and routing/controller functions [62]. However, this line of work focuses more on optimizing the inference performance of the model than on specializing parts of it for a certain task.
Multi-task Learning Even though our model processes only input and output sequences of text, it actually jointly learns multiple tasks (e.g., SQL and BOOK queries, memory retrieval, and response generation); thus it is also related to multi-task learning [63]. Interested readers may refer to [64, 65] for a general overview of the topic. In Natural Language Processing, multi-task learning has been applied in a wide range of applications such as parsing [66–68], machine translation in multiple languages [69], and parsing, image captioning, and machine translation [70]. More interestingly, DecaNLP [17] has a large set of tasks that are cast as question answering (QA) and learned by a single model. In this work, we focus more on conversational data, but in future work we plan to include these QA tasks.
In this paper, we propose a novel way to train a single end-to-end dialogue model with multiple composable and interpretable skills. Unlike previous work, which mostly focused on representation-level mixing [19], our proposed approach, Attention over Parameters, learns how to softly combine independent sets of specialized parameters (i.e., making SQL queries, conversing with a consistent persona, etc.) into a single set of parameters. By doing so, we not only achieve compositionality and interpretability but also gain algorithmically faster inference speed. To train and evaluate our model, we organize multi-domain task-oriented datasets into end-to-end trainable formats and combine them with a conversational dataset (i.e., Persona-Chat). Our model learns to consider each task and domain as a separate skill that can be composed with the others, or used independently, and we verify the effectiveness of the interpretability and compositionality with competitive experimental results and thorough analysis.
Several extensions of this work are possible, for example incremental learning and zero-shot skill composition. The first would be similar to [71], where we can add skills over time and learn how to combine them with existing ones. The second, instead, is more related to the semantic compositionality shown in the analysis, where each skill is correctly learned and can be applied to control the generation. An interesting direction would be to learn more general skills (e.g., Machine Translation (MT) or emotional responses) and to mix them with existing skills to obtain compositional responses without labeled data.
[1] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, 2018.
[2] Mihail Eric, Lakshmi Krishnan, Francois Charette, and Christopher D. Manning. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/W17-5506.
[3] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/P18-1205.
[4] Jason D Williams and Steve Young. Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
[5] Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013.
[6] Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216, 2016.
[7] Oriol Vinyals and Quoc V Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
[8] Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, Hideki Kashioka, and Satoshi Nakamura. Statistical dialog management applied to wfst-based dialog systems. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2009. ICASSP 2009., pages 4793–4796. IEEE, 2009.
[9] Cheongjae Lee, Sangkeun Jung, Seokhwan Kim, and Gary Geunbae Lee. Example-based dialog modeling for practical multi-domain dialog system. Speech Communication, 51(5):466–484, 2009.
[10] Esther Levin, Roberto Pieraccini, and Wieland Eckert. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on speech and audio processing, 8 (1):11–23, 2000.
[11] Mihail Eric and Christopher Manning. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 468–473, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E17-2075.
[12] Chien-Sheng Wu, Richard Socher, and Caiming Xiong. Global-to-local memory pointer networks for task-oriented dialogue. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[13] Revanth Reddy, Danish Contractor, Dinesh Raghu, and Sachindra Joshi. Multi-level memory for task oriented dialogs. arXiv preprint arXiv:1810.10647, 2018.
[14] Semih Yavuz, Abhinav Rastogi, Guan-lin Chao, Dilek Hakkani-Tür, and Amazon Alexa AI. Deepcopy: Grounded response generation with hierarchical pointer networks. Conversational AI NIPS workshop, 2018.
[15] Antoine Bordes and Jason Weston. Learning end-to-end goal-oriented dialog. International Conference on Learning Representations, abs/1605.07683, 2017.
[16] Tiancheng Zhao, Allen Lu, Kyusong Lee, and Maxine Eskenazi. Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 27–36. Association for Computational Linguistics, August 2017. URL http://aclweb.org/anthology/W17-5505.
[17] Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
[18] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, Geoffrey E Hinton, et al. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
[19] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[20] Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. arXiv, 2017.
[21] Jurgen Schmidhuber. Evolutionary principles in self-referential learning. on learning now to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany, 14 May 1987. URL http://www.idsia.ch/~juergen/diploma.html.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
[23] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
[24] Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/P17-1099.
[25] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[26] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://aclweb.org/anthology/D15-1166.
[27] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1400–1409, 2016.
[28] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[29] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. ICLR, 2019.
[30] Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv preprint arXiv:1804.08217, 2018.
[31] Chien-Sheng Wu, Andrea Madotto, Genta Winata, and Pascale Fung. End-to-end recurrent entity network for entity-value independent goal-oriented dialog learning. In Dialog System Technology Challenges Workshop, DSTC6, 2017.
[32] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL http://www.aclweb.org/anthology/P02-1040.
[33] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei hao Su, Stefan Ultes, David Vandyke, and Steve J. Young. A network-based end-to-end trainable task-oriented dialogue system. In EACL, 2017.
[34] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). arXiv preprint arXiv:1902.00098, 2019.
[35] Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. Dialogue natural language inference. arXiv preprint arXiv:1811.00671, 2018.
[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[37] Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. Personalizing dialogue agents via meta-learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5454–5459, 2019.
[38] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational ai. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1371–1374. ACM, 2018.
[39] Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv preprint arXiv:1905.08743, 2019.
[40] Fei Liu and Julien Perez. Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1–10, Valencia, Spain, April 2017. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/E17-1001.
[41] Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–677, Vancouver, Canada, July 2017. Association for Computational Linguistics. URL http://aclweb.org/anthology/P17-1062.
[42] Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction networks for question answering. International Conference on Learning Representations, 2017.
[43] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776–3784, 2016.
[44] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301, 2017.
[45] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149, 2019.
[46] Zhaojiang Lin, Andrea Madotto, Jamin Shin, Peng Xu, and Pascale Fung. Moel: Mixture of empathetic listeners. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 121–132, 2019.
[47] Zhaojiang Lin, Peng Xu, Genta Indra Winata, Zihan Liu, and Pascale Fung. Caire: An end-to-end empathetic chatbot. arXiv preprint arXiv:1907.12108, 2019.
[48] Ilya Kulikov, Alexander H Miller, Kyunghyun Cho, and Jason Weston. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907, 2018.
[49] Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019.
[50] Yury Zemlyanskiy and Fei Sha. Aiming to know you better perhaps makes me a more engaging dialogue partner. CoNLL 2018, page 551, 2018.
[51] Michael I Jordan and Robert A Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 6(2):181–214, 1994.
[52] Volker Tresp. Mixtures of gaussian processes. In Advances in neural information processing systems, pages 654–660, 2001.
[53] Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-Fei. Hierarchical mixture of classification experts uncovers interactions between brain regions. In Advances in Neural Information Processing Systems, pages 2178–2186, 2009.
[54] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3366–3375, 2017.
[55] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[56] Andrew Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv, 2013.
[57] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. ICLR, 2016.
[58] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
[59] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1039–1048, 2017.
[60] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems, pages 2181–2191, 2017.
[61] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
[62] Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ry8dvM-R-.
[63] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[64] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
[65] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.
[66] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537, 2011.
[67] Kazuma Hashimoto, Yoshimasa Tsuruoka, Richard Socher, et al. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1923–1933, 2017.
[68] Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142, 2017.
[69] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
[70] Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. In International Conference on Learning Representations, 2016.
[71] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[72] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
[73] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Figure 5: Visualization of the Attention over Parameters vector for different reference (Ref.) and AoP-generated answers. The top rows (Usr) are the last utterances from each dialogue context.
A.1 Embedded Representation
Figure 4: Positional Embedding of the dialogue
Since the model input may include structured data (e.g. DB records), we further define another embedding matrix P for encoding the types and the segments, where S is the set of positional tokens and |S| its cardinality. P is used to inform the model of the token types such as speaker information (e.g. Sys and Usr), the data type of the memory content (e.g. Miles, Traffic, etc.), and segment types such as dialogue turn information and database record index [45]. Figure 4 shows an example of the embedded representation of the input. Hence, each token in the input X is paired with a type token and a segment token.
A.2 Attention Visualization
Figure 5 shows the attention vector over parameters for different generated sentences. From this figure, and from analyzing more examples, we can identify two patterns:
• AoP learns to focus on the correct skills (i.e., SQL, BOOK) when API calls are needed. In the first example in Figure 5, the activations are consistent with those in the correct attention vector P. There are also false positives, in which AoP puts too much weight on BOOK when the correct response is plain text that should request more information from the user (i.e., i can help you with that. when would you like to leave the hotel?). However, this example is, in fact, "almost correct", as triggering a booking API call may also be considered a valid response. Meanwhile, the third example also fails to attend to the correct skill but still generates a very fluent and relevant response. This is most likely because the answer is simple and generic.
• The attention often focuses on multiple skills not directly relevant to the task. We observe this pattern especially when other skill-related entities are mentioned in the context or the response. For example, in the second dialogue example in Figure 5, AoP not only accurately focuses on the taxi domain, but also has non-negligible activations for restaurant and hotel. This is because the words "hotel" and "restaurant" are both mentioned in the dialogue context and the model has to produce two entities of the same type (i.e. finches bed and breakfast and ask).
A.3 Data Pre-Processing
As mentioned in the main article, we convert MultiWOZ into an end-to-end trainable dataset. This requires adding SQL-syntax queries when the system response includes particular entities. To do so, we leverage two annotations: the state tracker and the speech acts. The first is used to generate a well-formed query, including key and attribute; the second is used to decide when to include the query. More details on the dialogue state-tracker slots and slot values, and on the different speech acts, can be found in [1].
A query is created from the slots, and their values, that have been updated in the latest turn. The SQL query uses the following syntax:
Similarly, for the booking API BOOK the syntax is the following:
In both cases the slot values are kept as real entities.
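The slot-update step above can be illustrated with a minimal sketch. It assumes the dialogue state is a flat mapping from slot name to value; the helper names `updated_slots` and `render_call`, and in particular the layout of the rendered string, are hypothetical and are not the syntax used in the dataset.

```python
from typing import Dict


def updated_slots(prev_state: Dict[str, str], curr_state: Dict[str, str]) -> Dict[str, str]:
    """Return the slots whose value is new or changed in the latest turn."""
    return {
        slot: value
        for slot, value in curr_state.items()
        if prev_state.get(slot) != value
    }


def render_call(api: str, domain: str, slots: Dict[str, str]) -> str:
    """Render an API call as text.

    The string layout here is an illustrative assumption only.
    Slot values are kept as real entities (no delexicalization).
    """
    conditions = " AND ".join(
        f"{slot} = {value}" for slot, value in sorted(slots.items())
    )
    return f"{api} {domain} WHERE {conditions}"


# Example: only the slots that changed in the latest turn enter the call.
prev = {"area": "north"}
curr = {"area": "north", "food": "italian", "pricerange": "cheap"}
print(render_call("SQL", "restaurant", updated_slots(prev, curr)))
```

Sorting the slots only makes the rendered string deterministic; any fixed slot order would serve the same purpose.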
More challenging is deciding when to issue such API calls. Speech acts are used for this decision, via the "INFORM-DOMAIN" and "RECOMMEND-DOMAIN" tags. Thus, any response that includes those speech-act tags will trigger an API call if and only if:
Manual checking suggests that this strategy is effective. However, as reported by [1], the speech-act annotation includes some noise, which is also reflected in our dataset.
The results of the SQL query can contain more than 1K records with multiple attributes. Following [1], we use the following strategy:
• If there is no INFORM or RECOMMEND speech act and the number of records is greater than 5, we use a special token < TM > in the memory.
• If there is no INFORM or RECOMMEND speech act and the number of records is less than or equal to 5, we put all the records in memory.
• If there is any INFORM or RECOMMEND speech act, we filter the records to include based on the act value. Note that this is a fair strategy, since all the resulting records are possible correct answers and the annotators picked one of the records at random [1].
Note that the answer to a booking call, instead, is a single record containing the booking information (e.g. reference number, taxi plate, etc.) or a "Not Available" token in case the booking cannot be made.
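The three rules for SQL results can be summarized in a short sketch. It assumes each record is a dictionary of attribute values and that filtering "based on the act value" means keeping records in which some attribute equals that value; both the matching rule and the function name `select_memory` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Union

MAX_RECORDS = 5            # threshold stated in the text
TOO_MANY_TOKEN = "< TM >"  # special memory token stated in the text

Record = Dict[str, str]


def select_memory(records: List[Record], act_value: Optional[str]) -> List[Union[Record, str]]:
    """Choose what goes into memory for one SQL result set.

    act_value is None when the turn has no INFORM or RECOMMEND speech act.
    """
    if act_value is None:
        if len(records) > MAX_RECORDS:
            return [TOO_MANY_TOKEN]  # too many results: single placeholder token
        return list(records)         # few results: keep all of them
    # Assumed matching rule: keep records that mention the act value.
    return [r for r in records if act_value in r.values()]


records = [{"name": f"hotel_{i}", "area": "north"} for i in range(7)]
print(select_memory(records, None))        # more than 5 records, no act
print(select_memory(records[:3], None))    # at most 5 records, no act
print(select_memory(records, "hotel_2"))   # filtered by the act value
```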
A.4 Hyper-parameters and Training
We used a standard Transformer architecture [22] with pre-trained GloVe embeddings [72]. For both Seq2Seq and MoE we use the Adam [73] optimizer with a learning rate of , whereas for the Transformer we used a warm-up learning rate strategy as in [22]. In both AoP and AoR we use an additional transformer layer on top of the output of the model. Figures 6, 7, and 8 show the high-level design of MoE, AoR, and AoP, respectively. For all models we used a batch size of 16, and we early-stopped training using the validation set. All the experiments were conducted using a single Nvidia 1080ti.
We used a small grid search for tuning each model. The selected hyper-parameters are reported in Table 4. We ran each experiment 3 times and report the mean and standard deviation of each result.
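For readers unfamiliar with the warm-up strategy, the following sketch shows the schedule commonly associated with the original Transformer, assuming that is what [22] refers to: the rate grows linearly for a number of warm-up steps and then decays with the inverse square root of the step. The default values of `d_model` and `warmup_steps` are placeholders, not the settings used in these experiments.

```python
def warmup_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Warm-up then inverse-square-root decay learning rate.

    d_model and warmup_steps are placeholder defaults, not values from Table 4.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises during warm-up, peaks at warmup_steps, then decays.
for s in (100, 4000, 8000):
    print(s, warmup_lr(s))
```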
Table 4: Hyper-Parameters used for the evaluations.
A.5 MWOZ and SMD with Std.
A.6 Persona Result with Std
A.7 Domain F1-Score
Table 5: Per Domain F1 Score.
Figure 6: The Mixture of Experts (MoE) [19] model consists of r feed-forward neural networks (experts) embedded between two LSTM layers, and a trainable gating network that selects the experts.
Figure 7: Attention over Representation (AoR) consists of a transformer encoder which encodes the source input and computes the attention over the skills. Then r transformer decoder layers compute r specialized representations, and the output response is generated based on the weighted sum of the representations. In the figure, we omitted the output layer.
Figure 8: Attention over Parameters (AoP) consists of a transformer encoder which encodes the source input and computes the attention over the skills, followed by r specialized transformer decoder layers and a dummy transformer decoder layer parameterized by the weighted sum of the parameters of the r specialized transformer decoder layers. In the figure, we omitted the output layer.
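The difference between the two designs in Figures 7 and 8 can be illustrated with a toy sketch. Here each "decoder layer" is replaced by a single tanh layer with a weight matrix, which is a strong simplification made only for illustration; the function names and the toy weights are hypothetical. AoR runs every expert and mixes their outputs, while AoP mixes the expert parameters first and runs a single layer.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, used for the attention over skills."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer(weights: Matrix, x: List[float]) -> List[float]:
    """Toy stand-in for a decoder layer: tanh(W x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def attention_over_representations(experts: List[Matrix], alpha: List[float], x: List[float]) -> List[float]:
    """AoR: r forward passes, then a weighted sum of the r outputs."""
    outputs = [layer(w, x) for w in experts]
    return [sum(a * out[j] for a, out in zip(alpha, outputs))
            for j in range(len(outputs[0]))]


def attention_over_parameters(experts: List[Matrix], alpha: List[float], x: List[float]) -> List[float]:
    """AoP: weighted sum of the r parameter sets, then one forward pass."""
    rows, cols = len(experts[0]), len(experts[0][0])
    mixed = [[sum(a * w[i][j] for a, w in zip(alpha, experts)) for j in range(cols)]
             for i in range(rows)]
    return layer(mixed, x)


experts = [[[1.0, 0.0], [0.0, 1.0]],   # toy parameters of skill 1
           [[0.0, 2.0], [2.0, 0.0]]]   # toy parameters of skill 2
alpha = softmax([0.0, 0.0])            # uniform attention over the two skills
x = [0.5, -0.5]
print(attention_over_representations(experts, alpha, x))
print(attention_over_parameters(experts, alpha, x))
```

With a one-hot attention vector both functions reduce to the selected expert; with a soft attention vector they generally differ because the layer is non-linear, and AoP needs only one forward pass instead of r.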