Multi-hop question answering tests the ability of a system to retrieve and combine multiple facts to answer a single question. HotpotQA (Yang et al., 2018) introduces a task where questions are freeform text, supporting facts come from Wikipedia, and answer text and supporting facts are labeled. The questions in HotpotQA are further categorized as bridge-type questions or comparison-type questions. For comparison questions, often all necessary facts may be retrieved using terms in the question itself. For challenging bridge-type questions, it may not be possible to retrieve all the necessary facts based on the terms present in the original question alone. Rather, partial information must first be retrieved and used to formulate an additional query.
Although many systems have been submitted to the HotpotQA leaderboard, surprisingly, only a few have directly addressed the challenge of followups. Systems can either be evaluated in a distractor setting, where a set of ten paragraphs containing all supporting facts is provided, or in a full wiki setting, where supporting facts must be retrieved from all of Wikipedia. The systems that compete only in the distractor setting can achieve good performance by combining and ranking the information provided, without performing followup search queries. Furthermore, even in the distractor setting, Min et al. (2019a) found that only 27% of the questions required multi-hop reasoning, because additional evidence was redundant or unnecessary or the distractors were weak. They trained a single-hop model that considered each paragraph in isolation and ranked confidences of the answers extracted from each, to obtain competitive performance.
Of the nine systems with documentation submitted to the full wiki HotpotQA leaderboard as of 24 November 2019, four of them (Nie et al., 2019; Ye et al., 2019; Nishida et al., 2019; Yang et al., 2018) attempt to retrieve all relevant data with one search based on the original question, without any followups. Fang et al. (2019) retrieves second hop paragraphs simply by following hyperlinks from or to the first hop paragraphs.
Qi et al. (2019), Ding et al. (2019), and Feldman and El-Yaniv (2019) form various kinds of followup queries without writing a new question to be answered. Qi et al. (2019) trains a span extractor to predict the longest common subsequence between the question plus the first hop evidence and the (unseen) second hop evidence. At inference time, these predicted spans become followup search queries. In Ding et al. (2019), a span extractor is trained using the titles of the second hop evidence. Feldman and El-Yaniv (2019) trains a neural retrieval model that uses maximum inner product with an encoding of the question plus first hop evidence to retrieve second hop evidence.
Figure 1: The architecture of our system to generate intermediate questions for answer extraction.
Min et al. (2019b) forms not just followup queries but followup questions. They use additional specially labeled data to train a pointer network to divide the original question into substrings, and use handcrafted rules to convert these substrings into subquestions. The original question is answered by the second subquestion, which incorporates a substitution of the answer to the first subquestion.
While performing followup retrievals of some sort should be essential for correctly solving the most difficult multi-hop problems, formulating a followup question whose answer becomes the answer to the original question is motivated primarily by interpretability rather than accuracy. In this paper, we pursue a trained approach to generating followup questions that is not bound by handcrafted rules, posing a new and challenging application for abstractive summarization and neural question generation technologies. Our contributions are to define the task of a followup generator module (Section 2), to propose a fully trained solution to followup generation (Section 3), and to establish an objective evaluation of followup generators (Section 5).
Our technique is specifically designed to address the challenge of discovering what new information is needed when that information is not specified by the terms of the original question. At the highest level, comparison questions do not pose this challenge, because each quantity to be compared is specified by part of the original question. (They also have different semantics from bridge questions, because a comparison must be applied after retrieving answers to the subquestions.) Therefore we focus only on bridge questions in this paper.
Figure 1 shows our pipeline to answer a multi-hop bridge question. As partial information is obtained, an original question is iteratively reduced to simpler questions generated at each hop. Given an input question or subquestion, possible premises which may answer the subquestion are obtained from an information retrieval module. Each possible premise is classified against the question as irrelevant, containing a final answer, or containing intermediate information, by a three-way controller module. For premises that contain a final answer, the answer is extracted with a single-hop question answering module. For premises that contain intermediate information, a question generator produces a followup question, and the process may be repeated with respect to this new question. It is this question generator that is the focus of this paper. Various strategies may be used to manage the multiple reasoning paths that may be produced by the controller. Details are in section 5.
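The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Pipeline` class, the label strings, the toy stub modules, and the rule of keeping the most confident answer found in the earliest successful hop are all hypothetical choices made here for concreteness.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Labels for the three-way controller decision (names are assumptions).
IRRELEVANT, FINAL, INTERMEDIATE = "irrelevant", "final", "intermediate"


@dataclass
class Pipeline:
    retrieve: Callable[[str], List[str]]    # question -> candidate premises
    controller: Callable[[str, str], str]   # (question, premise) -> label
    single_hop: Callable[[str, str], Tuple[Optional[str], float]]  # -> (answer, confidence)
    followup: Callable[[str, str], str]     # (question, premise) -> followup question

    def answer(self, question: str, max_hops: int = 2) -> Optional[str]:
        frontier = [question]
        for _ in range(max_hops):
            candidates = []      # (confidence, answer) pairs found in this hop
            next_frontier = []   # followup questions to try in the next hop
            for q in frontier:
                for premise in self.retrieve(q):
                    label = self.controller(q, premise)
                    if label == FINAL:
                        ans, conf = self.single_hop(q, premise)
                        if ans is not None:
                            candidates.append((conf, ans))
                    elif label == INTERMEDIATE:
                        next_frontier.append(self.followup(q, premise))
            if candidates:
                # Assumed path-management strategy: keep the most confident answer.
                return max(candidates, key=lambda c: c[0])[1]
            frontier = next_frontier
        return None


# Toy stand-ins for the trained modules, for illustration only.
CORPUS = [
    "Film X was directed by Jane Doe.",
    "Jane Doe was born in Springfield.",
    "Bananas are yellow.",
]

def toy_retrieve(q: str) -> List[str]:
    return CORPUS

def toy_controller(q: str, p: str) -> str:
    if "director of Film X" in q and "directed by" in p:
        return INTERMEDIATE
    if "Jane Doe" in q and "born in" in p:
        return FINAL
    return IRRELEVANT

def toy_single_hop(q: str, p: str) -> Tuple[Optional[str], float]:
    if "born in " in p:
        return p.split("born in ")[1].rstrip("."), 0.9
    return None, 0.0

def toy_followup(q: str, p: str) -> str:
    name = p.split("directed by ")[1].rstrip(".")
    return f"Where was {name} born?"


if __name__ == "__main__":
    pipe = Pipeline(toy_retrieve, toy_controller, toy_single_hop, toy_followup)
    print(pipe.answer("Where was the director of Film X born?"))  # Springfield
```

With `max_hops=1` the same toy question returns `None`, because the answer only becomes extractable after the followup question has been generated.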
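Although our method applies to bridge questions with an arbitrary number of hops, for simplicity we focus on two-hop problems and on training the followup question generator. Let Cont denote the controller, SingleHop denote the answer extractor, and Followup denote the followup generator. Let $Q_1$ be a question with answer $A$ and gold supporting premises $\hat{P}_1$ and $\hat{P}_2$, and suppose that $\hat{P}_2$ but not $\hat{P}_1$ contains the answer. The task of the followup generator is to use $Q_1$ and $\hat{P}_1$ to generate a followup question $Q_2$ such that
$$\mathrm{SingleHop}(Q_2, \hat{P}_2) = A, \tag{1}$$
$$\mathrm{Cont}(Q_2, \hat{P}_2) = \mathrm{Final}, \tag{2}$$
$$\mathrm{Cont}(Q_2, P) = \mathrm{Irrel} \quad \text{for } P \neq \hat{P}_2. \tag{3}$$
Failure of any of these desiderata could harm label accuracy in the HotpotQA full wiki or distractor evaluations.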
Some questions labeled as bridge type in HotpotQA have a different logical structure, called “intersection” by Min et al. (2019b). Here the subquestions specify different properties that the answer entity is supposed to satisfy, and the intersection of possible answers to the subquestions is the answer to the original question. Our approach is not oriented towards this type of question, but there is no trivial way to exclude them from the dataset.
One non-interpretable implementation of our pipeline would be for Followup to simply output the original question $Q_1$ concatenated with the first hop premise $\hat{P}_1$ as the “followup question.” Then SingleHop would operate on input that really does not take the form of a single question, along with the second hop premise $\hat{P}_2$, to determine the final answer. Effectively, SingleHop would be doing multi-hop reasoning. To ensure that Followup gets credit only for forming real followup questions, we first train SingleHop as a single-hop answer extractor on SQuAD 2.0 (Rajpurkar et al., 2018), then freeze it while Followup and Cont are trained.
Ideally, we might train Followup using cross entropy losses inspired by equations 1, 2, and 3 with SingleHop and Cont fixed, but the decoded output is not differentiable with respect to Followup parameters. Instead, we train Followup with a token-based loss against a set of weakly labeled ground truth followup questions.
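As a rough illustration of what a token-based loss against a reference followup looks like, the sketch below computes the mean negative log-likelihood of the reference tokens under the decoder's per-step distributions (teacher forcing). The function name `token_nll`, the list-of-lists input format, and the clamping constant are assumptions made for this example; the actual training code may differ.

```python
import math
from typing import Sequence


def token_nll(step_probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    step_probs[t][v] is the decoder's probability of vocabulary item v at
    step t; target_ids[t] is the reference (weak ground truth) token at step t.
    """
    if len(step_probs) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty distribution per target token")
    total = 0.0
    for probs, token_id in zip(step_probs, target_ids):
        # Clamp to avoid log(0) when the model assigns zero probability.
        total -= math.log(max(probs[token_id], 1e-12))
    return total / len(target_ids)


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(token_nll(probs, [0, 1]))
```

Because the loss is computed per token against a fixed reference, it is differentiable with respect to the generator's parameters, unlike a loss that depends on the decoded output string.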
The weakly labeled ground truth followups are obtained using a neural question generation (QG) network. Given a context $C$ and an answer $A$, QG is the task of finding the question $Q = \arg\max_{Q'} \Pr(Q' \mid C, A)$ most likely to have produced it. We use reverse SQuAD to train the QG model of Zhao et al. (2018), which performs near the top of an extensive set of models tested by Tuan et al. (2019) and has an independent implementation available. Applied to our training set with $C = \hat{P}_2$ and the gold answer $A$, it gives us a weak ground truth followup $\hat{Q}_2$.
We instantiate the followup question generator, which uses the original question $Q_1$ and the first hop premise $\hat{P}_1$ to predict the followup $Q_2$, with a pointer-generator network (See et al., 2017). This is a sequence-to-sequence model whose decoder repeatedly chooses between generating a word from a fixed vocabulary and copying a word from the input. Typically, pointer-generator networks are used for abstractive summarization. Although the output serves a different role here, their copy mechanism is useful in constructing a followup that uses information from the original question and premise.
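The generate-or-copy choice can be made concrete with the mixture used by pointer-generator networks (See et al., 2017): the final probability of a word is a generation probability times the vocabulary distribution, plus the remaining mass times the attention placed on source positions holding that word. The sketch below assumes plain Python dictionaries and lists in place of tensors; the function name and the toy numbers are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_dist: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix generation and copying into one distribution over words.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on source
    positions whose token equals w.
    """
    mixed: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    for token, weight in zip(source_tokens, attention):
        # Copying lets out-of-vocabulary source words receive probability.
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


if __name__ == "__main__":
    dist = final_distribution(
        p_gen=0.5,
        vocab_dist={"who": 0.6, "directed": 0.4},
        attention=[0.5, 0.3, 0.2],
        source_tokens=["Jane", "Doe", "directed"],
    )
    print(dist)
```

In the toy call, "Jane" and "Doe" are absent from the fixed vocabulary but still receive probability through copying, which is what allows a followup question to mention an entity that appears only in the retrieved premise.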
We train Cont with cross-entropy loss for ternary classification on the ground truth triples if
, and
for all other P. In this way the controller learns to predict when a premise has sufficient or necessary information to answer a question. Both Cont and SingleHop are implemented by BERT following the code by Devlin et al. (2019).
Evaluating a followup question generator by whether its questions are answered correctly is analogous to verifying the factual accuracy of abstractive summarizations, which has been studied by many, including Falke et al. (2019), who estimate factual correctness using a natural language inference model, and find that it does not correlate with ROUGE score. Contemporaneous work by Zhang et al. (2019) uses feedback from a fact extractor in reinforcement learning to optimize the correctness of a summary, suggesting an interesting future direction for our work.
A recent neural question generation model has incorporated feedback from an answer extractor into the training of a question generator, rewarding the generator for constructing questions the extractor can answer correctly (Klein and Nabi, 2019). Although the loss is not backpropagated through both the generator and extractor, the generator is penalized by token level loss against ground truth questions when the question is answered wrongly, but by zero loss when it constructs a variant that the extractor answers correctly.
To isolate the effect of our followup generator on the types of questions for which it was intended, our experiments cover the subset of questions in HotpotQA labeled with exactly two supporting facts, with the answer string occurring in exactly one of them. There are 38,806 such questions for training and 3,214 for development, which we use for testing because the structure of the official test set is not available. For a baseline we compare to a trivial followup generator that returns the original question without any rewriting.
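The filtering criterion can be sketched as follows. The field names (`answer`, `supporting_facts` as title and sentence-index pairs, `context` as title and sentence-list pairs) follow the public HotpotQA JSON release as we understand it, and the case-sensitive substring test for "answer string occurring" is an assumption; the exact matching rule used for the reported counts is not specified here.

```python
from typing import Any, Dict, List


def supporting_sentences(example: Dict[str, Any]) -> List[str]:
    """Look up the sentence text for each labeled supporting fact."""
    paragraphs = {title: sentences for title, sentences in example["context"]}
    found = []
    for title, sent_id in example["supporting_facts"]:
        sentences = paragraphs.get(title, [])
        if 0 <= sent_id < len(sentences):
            found.append(sentences[sent_id])
    return found


def keep_example(example: Dict[str, Any]) -> bool:
    """True if there are exactly two supporting facts and the answer
    string occurs in exactly one of them (assumed: substring match)."""
    facts = supporting_sentences(example)
    if len(example["supporting_facts"]) != 2 or len(facts) != 2:
        return False
    hits = sum(example["answer"] in sentence for sentence in facts)
    return hits == 1


if __name__ == "__main__":
    toy = {
        "answer": "Springfield",
        "supporting_facts": [["Film X", 0], ["Jane Doe", 0]],
        "context": [
            ["Film X", ["Film X was directed by Jane Doe."]],
            ["Jane Doe", ["Jane Doe was born in Springfield.", "She lives abroad."]],
        ],
    }
    print(keep_example(toy))  # True
```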
Table 1: Answer accuracy on filtered subset of HotpotQA development set in the distractor setting.
Table 2: Example generated followup questions $Q_2$, evaluated against oracle $\hat{P}_2$.
First, we evaluate performance using an oracle controller, which forwards only the original question and first hop premise, $(Q_1, \hat{P}_1)$, to the followup generator, and only the followup question and second hop premise, $(Q_2, \hat{P}_2)$, to the answer extractor. Results are shown in Table 1. Best performance is achieved using the system “$Q_1$ else $Q_2$,” which answers with $\mathrm{SingleHop}(Q_1, \hat{P}_2)$ or $\mathrm{SingleHop}(Q_2, \hat{P}_2)$, whichever is non-null. Thus, although many questions are really single-hop and best answered using the original question, using the followup questions when a single-hop answer cannot be found helps the F1 score by 8.9%. Table 2 shows followup generations and extracted answers in two typical successful and two typical failed cases.
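The best-performing configuration above, which prefers the answer extracted with the original question and falls back to the followup question only when that extraction is null, amounts to a short fallback rule. The sketch below is illustrative only: `answer_with_fallback` and the extractor signature are hypothetical names, and a null extraction is represented by `None`.

```python
from typing import Callable, Optional

# An extractor maps (question, premise) to an answer string or None (null).
Extractor = Callable[[str, str], Optional[str]]


def answer_with_fallback(
    single_hop: Extractor, original_q: str, followup_q: str, premise: str
) -> Optional[str]:
    """Answer with the original question if possible, else with the followup."""
    first = single_hop(original_q, premise)
    if first is not None:
        return first
    return single_hop(followup_q, premise)
```

An extractor trained on SQuAD 2.0 can abstain on unanswerable inputs, which is what makes the null case available as a signal for falling back.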
Next we consider the full system of Figure 1. We use the distractor paragraphs provided. We run the loop for up to two hops, collecting all answer extractions requested by the controller, stopping after the first hop where a non-null extracted answer was obtained. If multiple extractions were requested for the same problem, we take the answer for which SingleHop had the highest confidence. The controller requested 2,989 followups, and sent 975 (Q, P) pairs for answer extraction in hop one, and 1,180 in hop two. The performance gain shows that the followup generator often can generate questions that are good enough for the frozen single-hop model to understand and extract the answer with, even when the question must be specific enough to avoid distracting premises.
Followup queries are essential to solving the difficult cases of multi-hop QA, and real followup questions are an advance in making this process interpretable. We have shown that pointer-generator networks can effectively learn to read partial information and produce a fluent, relevant question about what is not known, which is a complement to their typical role in summarizing what is known. Our task poses a novel challenge that tests semantic properties of the generated output.
By using a neural question generator to produce weak ground truth followups, we have made this task more tractable. Future work should examine using feedback from the answer extractor or controller to improve the sensitivity and specificity of the generated followups. Additionally, the approach should be developed on new datasets such as QASC (Khot et al., 2019), which are designed to make single-hop retrieval less effective.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. 2019. Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2694–2703, Florence, Italy. Association for Computational Linguistics.
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220, Florence, Italy. Association for Computational Linguistics.
Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2019. Hierarchical graph network for multi-hop question answering. CoRR, 1911.03631.
Yair Feldman and Ran El-Yaniv. 2019. Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy. Association for Computational Linguistics.
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. 2019. QASC: A dataset for question answering via sentence composition. CoRR, 1910.11473.
Tassilo Klein and Moin Nabi. 2019. Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. CoRR, 1911.02365.
Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. Compositional questions do not necessitate multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4249–4257, Florence, Italy. Association for Computational Linguistics.
Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Multi-hop reading comprehension through question decomposition and rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6097–6109, Florence, Italy. Association for Computational Linguistics.
Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2553–2566, Hong Kong, China. Association for Computational Linguistics.
Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2335–2345, Florence, Italy. Association for Computational Linguistics.
Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
Luu Anh Tuan, Darsh J Shah, and Regina Barzilay. 2019. Capturing greater context for question generation. CoRR, 1910.10274.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Deming Ye, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, and Maosong Sun. 2019. Multi-paragraph reasoning with knowledge-enhanced graph neural network. CoRR, 1911.02170.
Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, and Curtis P. Langlotz. 2019. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. CoRR, 1911.02541.
Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910, Brussels, Belgium. Association for Computational Linguistics.