Table 3: The automatic evaluation results of the baselines and the proposed methods. Baseline is a context-to-response seq2seq model without attention. CO, PA, CG correspond to context-only attention, parallel attention and context-guided attention respectively. (B: BLEU; N: NIST; MET: METEOR)
Table 4: Human evaluation results from our offline evaluation and the official evaluation.
The context-guided attention model with CVAE (CG+CVAE) achieves better performance on most metrics that measure the similarity between the generated responses and the ground-truth responses. This supports the effectiveness of our context-guided attention: its goal is to generate responses containing more relevant knowledge, and these metrics capture relatedness to some extent. However, the context-only attention model with CVAE (CO+CVAE) obtains higher diversity, which is also important for this generation task. Overall, the results show a small improvement from the proposed CVAE model in terms of both generation quality and diversity.
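To make the notion of diversity concrete, the sketch below computes distinct-n, a commonly used diversity measure: the ratio of unique n-grams to total n-grams over a set of generated responses. This section does not state which diversity measure was used, so the choice of distinct-n, the whitespace tokenization, and the function name `distinct_n` are all assumptions made for illustration.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams across all responses.

    Assumption: responses are already tokenized strings separated by
    whitespace, as in the lowercased samples shown in this paper.
    """
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()
        # Collect every contiguous n-gram in this response.
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # Guard against an empty response set (or responses shorter than n).
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    samples = ["they are the same thing .", "it is not the same thing ."]
    print("distinct-1:", distinct_n(samples, 1))
    print("distinct-2:", distinct_n(samples, 2))
```

A higher value means fewer repeated n-grams across responses; a model that always emits the same generic sentence scores close to zero.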
Human Evaluation In order to understand the effect of our fact-grounded attention and variational generation, we conduct human evaluation on three proposed methods: the parallel attention model as our baseline (PA), the parallel attention model with variational generation (PA+CVAE), and the context-guided attention model (CG). First, we randomly sample 100 test samples that fulfill the following two conditions:
1. Each response has at least 3 words, because some methods tend to produce very short responses, which are hard to evaluate.
2. Because the goal is fact-grounded generation, we ensure that the context and the retrieved fact share more than 3 words in each sample, not counting punctuation and stop words.
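The two sampling conditions above can be sketched as a simple filter. This is a minimal illustration, not the procedure actually used: the stop-word list, the whitespace tokenization, the reading of condition 1 as applying to every model response, and the names `words`, `content_words`, and `keep_sample` are all assumptions.

```python
import string

# Small illustrative stop-word list (an assumption; the list that was
# actually used is not specified).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "is",
              "are", "from", "for", "who", "them", "their"}


def words(text):
    """Lowercase, split on whitespace, and drop pure-punctuation tokens."""
    stripped = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def content_words(text):
    """Set of words that are neither punctuation nor stop words."""
    return {tok for tok in words(text) if tok not in STOP_WORDS}


def keep_sample(context, fact, responses, min_words=3, min_overlap=3):
    """Return True if a sample satisfies both sampling conditions."""
    # Condition 1: every response has at least `min_words` words.
    if any(len(words(r)) < min_words for r in responses):
        return False
    # Condition 2: context and fact share MORE than `min_overlap` content words.
    shared = content_words(context) & content_words(fact)
    return len(shared) > min_overlap


if __name__ == "__main__":
    context = ("til in the united states , people who turn 100 years old "
               "receive a letter from the president")
    fact = ("in the united states , centenarians traditionally receive "
            "a letter from the president")
    print(keep_sample(context, fact, ["they are the same thing ."]))
```

Candidate test samples would be passed through `keep_sample`, and 100 of the survivors drawn at random.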
Then we conduct human evaluation for our proposed methods in a similar way to the official evaluation:
1. In addition to relevance and interest, which are asked in the official evaluation, we ask the judges to rate two additional metrics: fluency and knowledge relatedness (to the retrieved fact) of our response.
2. Because we only pick one fact based on the contexts as our model input, we directly provide this fact to the judges as extra information so that they can better evaluate the knowledge relatedness of the response.
The results are shown in Table 4. The submitted system, the best achieved results, and human performance are also included in Table 4 for comparison. Note that the numbers from the two sets of evaluations are not directly comparable and are shown for reference only.
In the offline human evaluation, the proposed models do not achieve better performance, and the differences among all models are small. In the official evaluation, our submitted results also fall between disagree (2) and neutral (3), as in our evaluation, although the context-guided attention achieves slightly better numbers than the other proposed models in the offline setting. Furthermore, the best achieved performance is about 2.94, which is still below neutral (3), indicating the difficulty of this task. There is clearly a large gap between current machine performance and human performance, so this task requires further investigation.
Qualitative Analysis
The above results show no significant difference between our proposed models and the baselines. A sample of model responses from the human evaluation set is shown in Table 5 for our qualitative analysis. In this example, adding CVAE generates a more diverse response than the parallel attention result, but may not effectively ground the knowledge in the sentence. Also, our context-guided result seems to focus more on the fact than the other models do. However, the ground truth in the data is very difficult for current models to reproduce, because it may require additional knowledge or common sense. From the current results achieved by our model, we conclude that this task still needs further investigation.
Fact: in the united states , centenarians traditionally receive a letter from the president , congratulating them for their longevity . nbc ’ s today show has also named new centenarians on air since 1983 . centenarians born in ireland receive a 2,540 ” centenarians ’ bounty ” and a letter from the president of ireland , even if they are resident abroad . [ 63 ] japanese centenarians receive a silver cup and a certificate from the prime minister of japan upon their 100th birthday , honouring them for their longevity and prosperity in their lives . swedish centenarians receive a telegram from the king and queen of sweden . [ 64 ] centenarians born in italy receive a letter from the president of italy . in japan , a ” national respect for the aged day ” has been celebrated every september since 1966 .
Conversation: – til in the united states , people who turn 100 years old receive a letter from the president , congratulating them on their
Ground Truth: is that the canadian exchange rate these days ?
PA Response: they are the same thing .
PA+CVAE Response: you can have to be a .
CG Response: it’s not the same thing in the uk .
Table 5: Model response sample.
We describe a variational knowledge-grounded conversation system, which attempts to model the relations between dialogue contexts and external facts in an end-to-end fashion. It points to a potential research direction: how external information interacts with dialogues, and how a machine can capture such interaction for better knowledge-grounded response generation. In the experiments on DSTC7, the results demonstrate the difficulty of this task, because almost all current models fail to generate reasonable responses. Therefore, knowledge-grounded dialogue modeling requires further study in order to advance the machine’s capacity to produce informative and knowledgeable conversation.