Open book exams are a common mechanism for assessing human understanding of a subject, where test takers are allowed free access to a relevant book, study guide, or class notes when answering questions. In this context, the goal is not to evaluate memorization but a deeper understanding of the material and its application to new situations (Jenkins, 1995; Landsberger, 1996). The application, in turn, often requires combining a fact in the book (e.g., metals conduct electricity) with additional common knowledge the test taker is expected to have acquired by this stage (e.g., a suit of armor is made of metal).
Figure 1: An example for a question with a given set of choices and supporting facts.
Motivated by this setting, we present a new kind of question answering dataset, OpenBookQA,1 that consists of two parts: Q, a set of 5957 multiple-choice questions, and F, a set of 1326 diverse facts about elementary level science. F has three key characteristics of an ‘open book’: (a) it forms the basis for generating Q; (b) it has been deemed central to scientific explanations (Jansen et al., 2018); and (c) by itself, F is generally insufficient to answer questions in Q. Faced with a question $q \in Q$, a student or system S is expected to retrieve a relevant fact $f \in F$, and appeal to their own common knowledge, $k$, when applying f to answer q.
Figure 1 provides an example. Here, metals are thermal conductors is a core scientific fact available in F. One way to apply this fact to decide whether a steel spoon would let the most heat travel through is to appeal to common knowledge that steel is metallic and heat travels through thermal conductors. In general, the expected common knowledge is relatively simple (taxonomic facts, definitions, object properties, etc.); the difficulty lies in identifying it and meaningfully combining it with a core fact from F to answer the question.
OpenBookQA questions are challenging as they require multi-hop reasoning with partial context provided by F. Specifically, unlike existing datasets for reading comprehension (RC), answering questions on the back of a textbook (TQA),2 as well as question answering over structured knowledge-bases (KBQA), the open book F that comes with OpenBookQA is not self-contained. A successful system must therefore go beyond the typical challenges such as paraphrase matching and coreference resolution, without benefiting from the canonicalized and complete information in KBQA.
Generating interesting open book questions is a difficult task. We used a multi-stage process starting with F, using crowd-sourcing to generate (noisy) questions based on F that probe novel situations, using an automatic filter to ensure hardness for retrieval and association based systems, using a crowd filter to ensure answerability by a lay person, and further using an expert filter to ensure higher quality in Dev and Test sets.
We evaluate a number of existing QA systems for science (without retraining) on OpenBookQA, finding that they perform surprisingly close to the random guessing baseline of 25%. Human performance, on the other hand, is close to 92%.3
Motivated by recent findings of gameability of NLP datasets (Gururangan et al., 2018), we also develop and evaluate simple, attention-based, neural baselines including a plausible answer detector (which ignores the question text completely) and an odd-one-out solver. These highlight inevitable human bias in any crowdsourced dataset, increasing performance on OpenBookQA to 48%.
Building upon a recent neural model for incorporating external knowledge in the story cloze setting (Mihaylov and Frank, 2018), we propose a knowledge-aware neural baseline that can utilize both the open book F and common knowledge retrieved from sources such as ConceptNet (Speer et al., 2017). While retrieving the most useful pieces of knowledge remains an open challenge, our ‘oracle’ experiments with the fact f used while generating a question q and an interpretation (by the question author) of the additional knowledge k needed for q provide valuable insight into the nature of this dataset: Facts from the open book F are valuable (5% improvement) but not sufficient. Using both f and k increases the accuracy to 76%, but is still far from human level performance, suggesting the need for non-trivial reasoning to combine these facts.
To encourage further research on this new task, for each Train and Dev question q, OpenBookQA also includes f as an intermediate supervision signal, which may be viewed as a partial explanation for q. We leave closing the large gap to human performance as a challenge for the NLP community.
By construction, answering OpenBookQA questions requires (i) some base science facts from a provided ‘open book’, (ii) broader understanding about the world (common or commonsense knowledge), and (iii) an ability to combine these facts (reasoning). This setup differs from several existing QA tasks, as summarized below.
Reading Comprehension (RC) datasets have been proposed as benchmarks to evaluate the ability of systems to understand a document by answering factoid-style questions over this document. These datasets have taken various forms: multiple-choice (Richardson et al., 2013), cloze-style (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2016), and span prediction (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). However, analysis (Chen et al., 2016; Sugawara et al., 2017) of these datasets has shown that many of the questions can be solved with context token matching (Chen et al., 2017a; Weissenborn et al., 2017) or relatively simple paraphrasing.
To focus on the more challenging problem of reasoning across sentences, new datasets have been proposed for multi-step RC. QAngaroo (Welbl et al., 2018) uses a knowledge base to identify entity pairs (s, o) with a known relation, r, which is also supported by a multi-hop path in a set of documents. It uses structured tuple queries (s, r, ?) and takes all the documents along the path as the input passage. NarrativeQA (Kočiský et al., 2017) is an RC dataset that has been shown to require iterative reasoning about the narrative of a story. Similar to OpenBookQA, the questions were generated to ensure that the answer is not a direct match or paraphrase that can be retrieved with an IR approach. Most recently, Khashabi et al. (2018) proposed MultiRC, a multiple-choice RC dataset that is designed to require multi-sentence reasoning and can have multiple correct answers. Again, like most RC datasets, it is self-contained.
Tasks with external knowledge. While many of the RC datasets could benefit from commonsense or background knowledge, they are designed to be self-contained, i.e., solvable by the document context alone. Datasets such as the Story Cloze Test (Mostafazadeh et al., 2016), MCScript,4 and ProPara (Mishra et al., 2018) do require additional domain knowledge about everyday events, scripts, and processes, respectively. However, these datasets need domain-specific modeling of events, whereas OpenBookQA appeals to broad common knowledge cutting across a variety of types and topics.
Stasaski and Hearst (2017) explore the creation of multi-hop questions and propose generating stronger distractors for the multiple-choice setting. Their work, however, starts with structured knowledge, specifically a Biology ontology.
Lastly, many Science Question Answering datasets (e.g. Clark et al., 2016, 2018) have been released that need broad external knowledge to answer the questions. However, these questions are not associated with a core set of facts, i.e., an “open book” used to define these questions. As a result, the questions vary widely in style and complexity (Clark et al., 2018). In contrast, OpenBookQA focuses on a more well-defined subset of science QA, appealing to one core fact from the open book and one (or few) relatively simple commonly known supporting facts.
The OpenBookQA dataset consists of about 6,000 4-way multiple-choice questions, each associated with one core fact from a “book” F of 1326 such facts, and an auxiliary set K of about 6000 additional facts. The questions were created via a multi-stage crowdsourcing and partial expert filtering process, discussed in Section 3.1.
The small “book” F consists of recurring science themes and principles, each of which can be (and here is) instantiated into multiple questions. For F, we use a subset of the WorldTree corpus which Jansen et al. (2018) have analyzed for sufficiency for elementary level science. The subset we use is taken from the 2287 WorldTree facts that were marked as “central” by the original authors in at least one explanation. We further filter them down to 1326 that appear general enough to be applicable to multiple situations.
OpenBookQA additionally requires broad common knowledge, which is expected to come from large corpora, such as ConceptNet, Wikipedia, or a corpus with 14M science-related sentences used by some existing baselines. The crowdsourcing process below also asks workers to mark a second fact, k, needed for each question q, in addition to f. These second facts, unfortunately, were often incomplete, over-complete, or only distantly related to q. We thus include in OpenBookQA the set K of such second facts only as auxiliary data for optional use. We emphasize that K should not be viewed as ‘gold’ additional facts, or as a substitute for broad common knowledge.
3.1 Crowdsourcing Process
The overall question generation and filtering pipeline is summarized in Figure 2. Given the “book” F of core facts, the process proceeds as follows, starting with an empty question set Qs and an empty ‘second facts’ set K:
1. A crowd-worker5 w is shown a random science fact f from the set F.
2. w is asked to think of a second common fact, k, that may be combined with f to derive a new, valid assertion s.
3. w then converts s into a question-answer pair and extends this into a 4-way multiple choice question by adding 3 incorrect answer choices, $q_{mc} = (q, \{c_1, c_2, c_3, c_4\})$, where one of the $c_i$'s is the unique correct answer.
4. The system verifies that $q_{mc}$ passes basic checks such as uniformity of answer choices.6
5. w then feeds the multiple-choice question $q_{mc}$ to an information retrieval solver (Clark et al., 2016) and a word association based solver (Turney, 2017), and verifies that (a) neither of them answers $q_{mc}$ correctly and (b) the top 3 IR retrieved sentences are insufficient to answer $q_{mc}$; if not, the question is edited and re-tried.
6. Question $q_{mc}$ is then shown to 5 new crowd-workers, who are asked to answer it.
7. If at least 4 out of 5 workers answer $q_{mc}$ correctly, it is deemed answerable and the process continues. If not, $q_{mc}$ is discarded.
8. The answer choices of $q_{mc}$ are randomly shuffled to avoid unintended bias.7
9. $q_{mc}$ is associated with f as the core science fact and added to the question set Q. k is added to the set K of additional (noisy) facts.
Figure 2: OpenBookQA question generation pipeline
The Dev and Test splits were further filtered by an in-house expert to ensure higher quality.
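The answerability filter (step 7) and the choice shuffling (step 8) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `is_answerable` and `shuffle_choices`, the representation of worker answers as choice indices, and the use of a seeded `random.Random` are assumptions made here, not details of the actual crowdsourcing infrastructure.

```python
import random


def is_answerable(worker_answers, correct_index, min_correct=4):
    """Step 7 (sketch): keep a question only if enough workers chose the correct answer.

    worker_answers: list of choice indices given by the crowd-workers (5 in the paper).
    min_correct: threshold of correct answers (4 out of 5 in the paper).
    """
    n_correct = sum(1 for answer in worker_answers if answer == correct_index)
    return n_correct >= min_correct


def shuffle_choices(choices, correct_index, rng):
    """Step 8 (sketch): shuffle the choices and track where the correct one ends up."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    new_correct_index = order.index(correct_index)
    return shuffled, new_correct_index


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical fixed seed, for reproducibility only
    choices = ["a steel spoon", "a wooden spoon", "a plastic spoon", "a paper cup"]
    if is_answerable([0, 0, 0, 0, 2], correct_index=0):
        print(shuffle_choices(choices, 0, rng))
```

Tracking the new index of the correct choice is what keeps the answer key valid after shuffling.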
3.2 Human Performance
To assess human accuracy on this dataset, we consider the following model: Each question $q \in Q$ has some (unknown) human accuracy $p_q$, defined as the probability that a random human subject, chosen uniformly from a large pool H, would answer q correctly. Thus, we can think of this as defining a Bernoulli random variable, $X_q \sim B(p_q)$, whose mean is (unknown) $p_q$. The average human accuracy on Q under this model is:
$$H(Q) = \frac{1}{|Q|} \sum_{q \in Q} p_q$$
where $\{p_q \mid q \in Q\}$ are unknown.
With H as the set of crowd-workers (cf. Footnote 5), step 6 of the above question generation process is equivalent to obtaining 5 independent samples, $X_{q,i}$, $i \in I$, $|I| = 5$, from $B(p_q)$. We must, however, be careful when using this data to estimate $p_q$, as the same 5 samples were used to decide whether q makes it into the question set Q or not. For instance, if we had kept only those questions that all 5 workers answered correctly, it would clearly be inaccurate to claim that the human accuracy on Q is 100%. Nevertheless, it is possible to re-use the judgments from Step 6 to approximate H(Q) with high confidence, without posing the questions to new workers.
Intuitively, if all questions in Q were difficult to answer (i.e., all $p_q$ were small), it would be unlikely that all |Q| questions would pass the test in Step 6. We can use the contrapositive of this observation to conclude that $p_q$, on average, must have been high for $q \in Q$.
Formally, aggregating across all questions gives the following empirical estimate of H(Q):
$$\tilde{H}(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|I|} \sum_{i \in I} X_{q,i} = \frac{1}{|Q||I|} \sum_{q \in Q,\, i \in I} X_{q,i}$$
For analysis, we assume all samples $X_{q,i}$ are independent, i.e., every answer is obtained independently.8 An application of Hoeffding’s Inequality (Hoeffding, 1963) shows that $\tilde{H}(Q)$ converges to H(Q) very rapidly as n = |Q||I| grows; specifically, $\tilde{H}(Q) \le H(Q) + t$ with probability at least $1 - \exp(-2nt^2)$; similarly for $\tilde{H}(Q) \ge H(Q) - t$. In our Dev and Test sets, where |Q| = 500 and |I| = 5, this translates into H(Q) being at least $\tilde{H}(Q) - 3\%$ with probability over 98.8% and at least $\tilde{H}(Q) - 2.5\%$ with probability 95.6%; we report the former as our conservative estimate on human performance.
Table 1: Statistics for full OpenBookQA dataset. Parenthetical numbers next to each average are the max.
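The two confidence levels quoted above can be checked numerically. The sketch below assumes the standard one-sided Hoeffding bound for samples in [0, 1], under which the estimate is within a margin t of the true mean with probability at least 1 − exp(−2nt²). The margins of 3% and 2.5% are the values that reproduce the quoted probabilities for n = 500 × 5 = 2500; the function name is hypothetical.

```python
import math


def hoeffding_confidence(n_samples, margin):
    """Lower bound on P(estimate <= true mean + margin) for n i.i.d. samples in [0, 1]."""
    return 1.0 - math.exp(-2.0 * n_samples * margin ** 2)


n = 500 * 5  # |Q| = 500 questions, |I| = 5 judgments each

print(hoeffding_confidence(n, 0.03))   # about 0.9889, i.e. over 98.8%
print(hoeffding_confidence(n, 0.025))  # about 0.9561, i.e. 95.6%
```

Because the bound depends on n only through the product n·t², halving the margin at the same confidence would require four times as many judgments.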
3.3 Question Set Analysis
OpenBookQA consists of 5957 questions, with 4957/500/500 in the Train/Dev/Test splits.9 Table 1 summarizes some statistics about the full dataset. Each question has exactly four answer choices and one associated fact used in the creation process. We report the average length of questions, candidate choices, and associated facts, as well as how often the longest/shortest choice is the correct one.
We analyzed 100 questions in the Train set to capture the kind of common knowledge and reasoning needed. For each, we wrote down the additional common knowledge needed to answer this question in addition to the original science fact. In 21% of the cases, the crowdsourced question actually tests for a fact that doesn’t necessarily need the original science fact. For example, the question: “On a rainy day the clouds are (A) low (B) white (C) small (D) gray” was written based on the science fact “clouds produce rain” but doesn’t need this fact to answer it. We ignore such questions in our analysis. For the remaining questions, we categorized the additional facts into five high-level categories (and collapsed the remaining facts into a catch-all OTHERS category) based on previous approaches on similar science questions (Clark et al., 2018; Jansen et al., 2016):
1. ISA: Basic taxonomic facts (isa knowledge).
2. PROPERTY: Properties of objects such as madeof(belt buckle, metal), has(mammals, four legs), contains(lemon juice, citric acid).
3. DEFINITION: Definitions of objects that may be based on their appearance (tape is a plastic with markings), working mechanism (telescope is a device that uses mirrors to view objects), etc.
4. CAUSAL: Causal facts such as causes(adding lemon juice to milk, milk to break down).
5. BASIC: General scientific fact that did not fit above, e.g. squirrels eat nuts for food.
Table 2: Percentage of questions and facts for the five most common types of additional facts. Note that % Questions does not add up to 100% since we count the percentage of questions where at least one such fact is needed.
Table 2 presents the proportions of these facts in our analyzed question set. For each type of fact, we calculate the percentage of questions that need at least one such fact (shown as % Questions). We also calculate the overall percentage of each fact type across all the common knowledge facts (shown as % Facts). Most of our questions need simple facts such as isa knowledge and properties of objects, further confirming the need for simple reasoning with common knowledge. Apart from these five major categories of facts, the catch-all OTHERS category contains commonsense facts (e.g., it is dark at night), world knowledge (e.g., Japan is often hit by earthquakes) and lexical rewrites10 (e.g., ad infinitum means over and over).
Most of our questions need simple facts that should be easily retrievable from any knowledge base or textual corpus. On average, each question needed 1.16 additional facts, ignoring any linguistic variations. Despite the simplicity of the knowledge needed for these questions, as we show empirically, most baseline approaches achieve a relatively low score on this dataset (even when the core fact is provided). We claim that this is because the reasoning needed to answer these questions is non-trivial. Table 3 shows a few questions with the associated facts and high-level reasoning needed to answer these questions. Assuming a model can extract the described relations (e.g. defn, contains), the QA system still needs to be able to chain these facts together, identify the resulting relation and verify its expression for each choice. In the extreme case (as shown in the last example), even though only one additional fact is needed to answer the question, the system must apply the core “general” science fact to a “specific” situation.
We evaluate the performance of several baseline systems on the Dev and Test subsets of OpenBookQA. For each question, a solver receives 1 point towards this score if it chooses the correct answer, and 1/k if it reports a k-way tie that includes the correct answer. The “Guess All” baseline, which always outputs a 4-way tie, thus achieves a score of 25%, same as the expected performance of a uniform random baseline.
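The tie-aware scoring rule can be written down directly. In the sketch below, a prediction is represented as a set of selected choice indices, which is an assumption made here for illustration; the function names are likewise hypothetical.

```python
def question_score(predicted, correct):
    """Score one question: 1/k for a k-way tie containing the correct answer, else 0.

    predicted: set of choice indices the solver reports (a single answer is a 1-way tie).
    correct: index of the correct choice.
    """
    if correct in predicted:
        return 1.0 / len(predicted)
    return 0.0


def dataset_score(predictions, answers):
    """Average per-question score over a dataset, as a percentage."""
    total = sum(question_score(p, a) for p, a in zip(predictions, answers))
    return 100.0 * total / len(answers)


# "Guess All": always report a 4-way tie, which scores 25%.
answers = [0, 3, 1, 2]
guess_all = [{0, 1, 2, 3} for _ in answers]
print(dataset_score(guess_all, answers))
```

Giving partial credit for ties means a solver cannot gain by abstaining or hedging: reporting all four choices earns exactly the expected score of a uniform random guess.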
4.1 No Training, External Knowledge Only
Since OpenBookQA is a set of elementary level science questions, one natural baseline category is existing systems that have proven to be effective on elementary- and middle-school level science exams. These pre-trained systems, however, rely only on their background knowledge and do not take the set F of core facts into account. Further, their knowledge sources and retrieval mechanism are close to those used by the IR solver that, by design, is guaranteed to fail on OpenBookQA. These two aspects place a natural limit on the effectiveness of these solvers on OpenBookQA, despite their excellent fit for the domain of multiple-choice science questions. We consider four such solvers.
PMI (Clark et al., 2016) uses pointwise mutual information (PMI) to score each answer choice using statistics based on a corpus of 280 GB of plain text. It extracts unigrams, bigrams, trigrams, and skip-bigrams from the question q and each answer choice $c_i$. Each answer choice is scored based on the average PMI across all pairs of question and answer n-grams.
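As a rough illustration of this scoring scheme, the sketch below computes PMI(x, y) = log(p(x, y) / (p(x) p(y))) from raw counts and averages it over all question/answer n-gram pairs. It is not the solver of Clark et al. (2016): the count dictionaries, the toy numbers, and the choice to score unseen pairs as 0 are all assumptions made here for illustration.

```python
import math
from itertools import product


def pmi(x, y, count, pair_count, total):
    """Pointwise mutual information of x and y estimated from raw counts."""
    joint = pair_count.get((x, y), 0)
    if joint == 0 or count.get(x, 0) == 0 or count.get(y, 0) == 0:
        return 0.0  # assumption: treat unseen pairs as uninformative
    p_xy = joint / total
    p_x = count[x] / total
    p_y = count[y] / total
    return math.log(p_xy / (p_x * p_y))


def score_choice(question_ngrams, choice_ngrams, count, pair_count, total):
    """Average PMI across all (question n-gram, answer n-gram) pairs."""
    pairs = list(product(question_ngrams, choice_ngrams))
    return sum(pmi(x, y, count, pair_count, total) for x, y in pairs) / len(pairs)


# Toy counts (hypothetical): "metal" co-occurs with "conductor" more than chance.
count = {"metal": 10, "conductor": 5, "cat": 20}
pair_count = {("metal", "conductor"): 4}
total = 100

print(score_choice(["metal"], ["conductor"], count, pair_count, total))
print(score_choice(["metal"], ["cat"], count, pair_count, total))
```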
TableILP (Khashabi et al., 2016) is an Integer Linear Programming (ILP) based reasoning system designed for science questions. It operates over semi-structured relational tables of knowledge. It scores each answer choice based on the optimal (as defined by the ILP objective) “support graph” connecting the question to that answer through table rows. The small set of these knowledge tables, however, often results in missing knowledge, making TableILP not answer 24% of the OpenBookQA questions at all.
TupleInference (Khot et al., 2017), also an ILP-based QA system, uses Open IE tuples (Banko et al., 2007) as its semi-structured representation. It builds these subject-verb-object tuples on-the-fly by retrieving text for each question from a large corpus. It then defines an ILP program to combine evidence from multiple tuples.
DGEM (Khot et al., 2018) is a neural entailment model that also uses Open IE to produce a semi-structured representation. We use the adaptation of this model to multiple-choice question answering proposed by Clark et al. (2018), which works as follows: (1) convert q and each answer choice $c_i$ into a hypothesis, $h_i$, and each retrieved fact into a premise $p_j$; and (2) return the answer choice with the highest entailment score, $\arg\max_i e(p_j, h_i)$.
4.2 No Training; F and Extr. Knowledge
We also consider providing the set F of core facts to two existing solvers: the IR solver of Clark et al. (2016) (to assess how far simple word-overlap can get), and the TupleInference solver.
4.3 Trained Models, No Knowledge
We consider several neural baseline models that are trained using the Train set of OpenBookQA. For ease of explanation, we first define the notation used in our models. For a given question $q_{mc} = (q, \{c_1, c_2, c_3, c_4\})$, we define the set of token sequences, $S = \{q, c_1, c_2, c_3, c_4\}$. For each token sequence $s \in S$, $w^s_j$ is the $j$-th token and $e^s_j = \mathrm{Emb}(w^s_j)$ is the embedding for this token. We use $n_s$ to indicate the number of tokens in s and d for the dimensionality of the embeddings.11 We model multiple-choice QA as multi-class classification: Given $q_{mc}$, predict one of four class labels $L = \{1, 2, 3, 4\}$, where the true label is the correct answer index.
Table 3: Example training questions (with their correct choices marked) along with the facts and reasoning needed. In the last example, grounding the science rule based on the common-knowledge fact produces a new rule: “As headlights of the car come closer, headlights will appear brighter”.
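Embeddings + Similarities as Features. We first experiment with a simple logistic regression model (Mihaylov and Nakov, 2016; Mihaylov and Frank, 2016, 2017) that uses centroid vectors $r^{emb}_s$ of the word embeddings of tokens in s, and then computes the cosine similarities between the question and each answer choice, $r^{cos}_{q,c_i}$:
$$r^{emb}_s = \frac{1}{n_s} \sum_{j=1}^{n_s} e^s_j \in \mathbb{R}^d, \qquad r^{cos}_{q,c_i} = \cos(r^{emb}_q, r^{emb}_{c_i}) \in \mathbb{R}^1$$
For each training instance, we build a feature representation by concatenating these vectors and train an L2 logistic regression classifier:
$$\vec{f} = [r^{emb}_q; r^{emb}_{c_{1..4}}; r^{cos}_{q,c_{1..4}}] \in \mathbb{R}^{5d+4}$$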
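The feature construction can be illustrated with a small pure-Python sketch: average the token embeddings of each sequence into a centroid, compute the cosine similarity between the question centroid and each choice centroid, and concatenate everything. The ordering of the concatenated parts and the function names are assumptions made here; the resulting vector has 5d + 4 entries for four choices and d-dimensional embeddings.

```python
import math


def centroid(token_embeddings):
    """Mean of a sequence's token embeddings (one list of floats per token)."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_features(question_embeddings, choices_embeddings):
    """Concatenate the question centroid, the choice centroids, and the cosine similarities."""
    q = centroid(question_embeddings)
    cs = [centroid(choice) for choice in choices_embeddings]
    features = list(q)
    for c in cs:
        features.extend(c)
    features.extend(cosine(q, c) for c in cs)
    return features


# Toy 2-dimensional embeddings (hypothetical values).
question = [[1.0, 0.0], [0.0, 1.0]]
choices = [[[1.0, 1.0]], [[1.0, 0.0]], [[0.0, -1.0]], [[-1.0, -1.0]]]
print(len(build_features(question, choices)))  # 5 * 2 + 4 = 14
```

The resulting vectors could then be passed to any off-the-shelf L2-regularized logistic regression implementation.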
BiLSTM Max-Out Baselines. As a simple neural baseline, we adapt the BiLSTM max-out model (Conneau et al., 2017) to our QA task. That is, we first encode the question tokens and choice tokens $w^s_{1 \ldots n_s}$ independently with a bi-directional context encoder (LSTM) to obtain a context (ctx) representation
$$h^{ctx}_{s_{1 \ldots n_s}} = \mathrm{BiLSTM}(e^s_{1 \ldots n_s}) \in \mathbb{R}^{n_s \times 2h}$$
Next, we perform an element-wise aggregation operation max on the encoded representations $h^{ctx}_{s_{1 \ldots n_s}}$ to construct a single vector:
$$r^{ctx}_s = \max(h^{ctx}_{s_{1 \ldots n_s}}) \in \mathbb{R}^{2h}$$
Given the contextual representations for each token sequence, we experiment with three configurations for using these representations for QA:
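(a) Plausible Answer Detector. This baseline goes to the extreme of completely ignoring q and trying to learn how plausible it is for $c_i$ to be the correct answer to some question in this domain. This captures the fact that certain choices like ‘a magical place’ or ‘flying cats’ are highly unlikely to be the correct answer to a science question without negation (which is the case for OpenBookQA).
We implement a plausible answer detector using a choice-only model for predicting the answer by obtaining, from the max-out representation $r^{ctx}_{c_i}$ of each choice, a score $\alpha_{c_i} = W_c^T r^{ctx}_{c_i} \in \mathbb{R}^1$, where $W_c \in \mathbb{R}^{2h}$ is a weights vector optimized during training and $i \in \{1..4\}$ is the index of the choice. To obtain the answer choice from the set of choice scores $\alpha_{c_{1..4}}$, we use $\arg\max(\mathrm{softmax}(\alpha_{c_{1..4}}))$, where $\mathrm{softmax}(\alpha_{c_i}) = \exp(\alpha_{c_i}) / \sum_{j=1}^{4} \exp(\alpha_{c_j})$.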
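The choice-only decision rule reduces to a dot product per choice followed by a softmax and an arg max. The sketch below shows that step in isolation, taking the choice representations as given vectors rather than BiLSTM outputs; the function names and the toy weight vector are assumptions for illustration, and no training is performed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


def plausible_answer(choice_vectors, weights):
    """Score each choice with a linear layer and return (best index, probabilities).

    The question is never consulted: only the choice representations are used.
    """
    scores = [sum(w * x for w, x in zip(weights, vec)) for vec in choice_vectors]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy 3-dimensional choice representations and weights (hypothetical values).
choice_vectors = [[0.1, 0.2, 0.0], [0.9, 0.8, 0.1], [0.0, 0.1, 0.3], [0.2, 0.0, 0.1]]
weights = [1.0, 1.0, -1.0]
print(plausible_answer(choice_vectors, weights))
```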
(b) Odd-One-Out Solver. It considers all 4 answer options jointly and selects the one that is least similar to the others. This captures bias in human authored questions arising from the fact that creating good quality incorrect answers is difficult. Workers generally start with the correct answer, and then come up with three incorrect ones. The latter often tend to be homogeneous or share other common properties (e.g., non-scientific terms) uncharacteristic of the correct answer.
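We implement this using a choice-to-choices attention model. For each choice $c_i$, we calculate the attention to the other choices as $\alpha_{c_i,c_j}$, $i \neq j$. We then sum these attention values to compute the attention for $c_i$ to the rest of the choices, $\alpha_{c_i 2 c_{rest}}$, and return the choice with the lowest sum. The attention is computed as $\alpha_{c_i,c_j} = \mathrm{Att}(r^{ctx}_{c_i}, r^{ctx}_{c_j})$, where
$$\mathrm{Att}(u, v) = W^T([u; v; u \cdot v; |u - v|]) \in \mathbb{R}^1$$
is a linear attention function and $W \in \mathbb{R}^{8h}$ is a weight vector. We then compute $\alpha_{c_i 2 c_{rest}} = \sum_{j=1}^{4} \alpha_{c_i,c_j}$ ($j \neq i$) and select the answer with the index $a_{c2c_{rest}} = \arg\max_i(\mathrm{softmax}(-\alpha_{c_i 2 c_{rest}}))$.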
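A minimal sketch of the odd-one-out selection follows. It assumes a linear attention over the concatenation of the two vectors, their element-wise product, and their absolute difference; this form, the function names, and the hand-set weights are assumptions for illustration (in a trained model the weights would be learned). Each choice's attention to the other three is summed and the choice with the lowest sum is returned.

```python
def att(u, v, weights):
    """Linear attention over [u; v; u * v; |u - v|] (element-wise product and difference)."""
    features = (
        list(u)
        + list(v)
        + [a * b for a, b in zip(u, v)]
        + [abs(a - b) for a, b in zip(u, v)]
    )
    return sum(w * f for w, f in zip(weights, features))


def odd_one_out(choice_vectors, weights):
    """Return (index of the choice with the lowest summed attention to the rest, all sums)."""
    totals = []
    for i, c_i in enumerate(choice_vectors):
        total = sum(
            att(c_i, c_j, weights)
            for j, c_j in enumerate(choice_vectors)
            if j != i
        )
        totals.append(total)
    odd_index = min(range(len(totals)), key=totals.__getitem__)
    return odd_index, totals


# Toy 2-dimensional choices: the last one is unlike the other three.
choices = [[1.0, 1.0], [1.0, 1.1], [0.9, 1.0], [-1.0, -1.0]]
# Hand-set weights (hypothetical): reward agreement, penalize distance.
weights = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, -1.0, -1.0]
print(odd_one_out(choices, weights))
```

Taking the arg max of a softmax over the negated sums, as in the text, selects the same index as taking the minimum of the sums.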
(c) Question Match. This solver tries to predict which choice best matches the question (Nakov et al., 2016), without relying on external knowledge. To achieve that, we compute an attention score $\alpha_{q,c_i}$ between q and each of the choices $c_i$ as $\alpha_{q,c_i} = \mathrm{Att}(r^{ctx}_q, r^{ctx}_{c_i})$, and select the one with the highest score. We also experiment with a model where $r^{ctx}_q$ and $r^{ctx}_{c_i}$ are obtained using token-wise interaction proposed in ESIM (Chen et al., 2017b).
4.4 Trained Model with External Knowledge
Lastly, we implement a two stage model for incorporating external common knowledge, K. The first module performs information retrieval on K to select a fixed size subset of potentially relevant facts $K_{Q,C}$ for each instance in the dataset (see Appendix A). The second module is a neural network that takes ($Q$, $C$, $K_{Q,C}$) as input to predict the answer $a_{Q,C}$ to a question Q from the set of choices C.
Knowledge-Enhanced Reader. As a base knowledge-aware model, we use a variant of the model of Mihaylov and Frank (2018), implemented by extending our BiLSTM max-out question-match baseline (c). For each instance the model reads the question q and answers $c_{1..4}$ independently and attends to the set of retrieved external knowledge facts $K_{Q,C}$. We encode each fact $k_j$ from $K_{Q,C} = k_{1..N_k}$ ($N_k$ is the number of facts) with the same BiLSTM as used for q and $c_{1..4}$ and construct a single vector $r^{ctx}_{k_j}$. Having such representations for each $k_j$ results in a knowledge memory matrix $M_k$. Note that $M_k$ is dynamic memory, specific for each instance in the batch and is encoded in each step during training. This memory is used to calculate a knowledge-aware representation, $r^{kn}_s$. Each context (ctx) representation $r^{ctx}_s$ ($s \in \{q, c_{1..4}\}$) is combined with $r^{kn}_s$ to obtain a knowledge-enhanced representation $r^{ctx+kn}_s$. We then model the knowledge-enhanced attention $\alpha^{kn}_{q,c_i}$ between q and $c_i$ as a linear combination of attention scores over the ctx, kn and ctx + kn representations, weighted by a vector $W$ that is initialized with the ones vector and optimized during training. We then select the answer $c_i$ with the highest score.
Table 4: Scores obtained by various solvers on OpenBookQA, reported as a percentage ± the standard deviation across 5 runs with different random seeds. Other baselines are described in the corresponding referenced section. For oracle evaluation, we use the gold science fact f associated with each question, and optionally the additional fact k provided by the question author. Bold denotes the best Test score in each category.
The results for various baseline models are summarized in Table 4, grouped by method category. We make a few observations:
First, the task is largely solvable by a layperson, as evidenced by the 92% score of crowd-workers. This is measured as described in Section 3.2. We use annotations from Step 6 of the question generation process and report the empirical estimate of H(Q) minus a 3% margin as a conservative lower estimate. As an additional assessment, we also obtained 5 new annotations for 100 randomly chosen questions from each of Train, Dev, and Test sets. The performance remained similar at 88.6%, 90.2%, and 91.6%, respectively.
The second group shows that pre-trained state-of-the-art solvers for multiple-choice science questions perform poorly. One explanation is their correlation with the IR method used for question filtering, as mentioned in Section 4.1.
The third group of results suggests that adding F to pre-trained models has a mixed effect, improving TupleInference by 8.7% but not changing DGEM.12 Unlike DGEM, TupleInference relies on brittle word-overlap similarity measures very similar to the ones used by IR. Since IR (KB) gets 0% by design, TupleInference (KB) also has poor performance and adding F helps it find better support despite the brittle measures.
The fourth group demonstrates that carefully designed trainable neural models—even if simplistic and knowledge-free—can be surprisingly powerful. For example, the “plausible answer detector” can predict the correct answer with 49.6% accuracy without even looking at the question. The “odd-one-out” solver, by considering other answer choices, raises this to 50.2%. The “question match” solver, which simply compares the BiLSTM max-out encoding of the question with that of various answer choices, also achieves 50.2%.13 Similar findings have been reported for several recent datasets (Gururangan et al., 2018), making it imperative to perform such tests early.
Interestingly, all of these neural knowledge-free baselines simultaneously succeed on 34.4% of the Dev questions, and simultaneously fail on 23.6%. For Question Match and ESIM we also experiment with ELMo (Peters et al., 2018), which improved their scores on Test by 0.4% and 1.8%, respectively.
The final group demonstrates the need for external knowledge and deeper reasoning. When the “oracle” science fact f used by the question author is provided to the knowledge-enhanced reader, it improves over the knowledge-less models by about 5%. However, there is still a large gap, showing that the core fact is insufficient to answer the question. When we also include facts retrieved from WordNet (Miller et al., 1990), the score improves by about 0.5%. Unlike the WordNet gain, adding ConceptNet (Speer et al., 2017) introduces a distraction and reduces the score. This suggests that ConceptNet is either not a good source of knowledge for our task, or only a subset of its relations should be considered. Overall, external knowledge helps, although retrieving the right bits of knowledge remains difficult. In the last row of Table 4, we use the oracle core fact along with the question author’s interpretation of the additional fact k. This increases the scores substantially, to about 76%. This big jump shows that improved knowledge retrieval should help on this task. At the same time, we are still not close to the human performance level of 92% due to various reasons: (a) the additional fact needed can be subjective, as hinted at by our earlier analysis; (b) the authored facts K tend to be noisy (incomplete, over-complete, or only distantly related), also as mentioned earlier; and (c) even given the true gold facts, performing reliable “reasoning” to link them properly remains a challenge.
Sample predictions and analysis of questions from Dev are provided in Appendix D.
We present a new dataset, OpenBookQA, of about 6000 questions for open book question answering. The task focuses on the challenge of combining a corpus of provided science facts (open book) with external broad common knowledge. We show that this dataset requires simple common knowledge beyond the provided core facts, as well as multi-hop reasoning combining the two. While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task. We leave closing this gap for future research, and illustrate, via oracle-style experiments, the potential of better retrieval and reasoning on this task.
The authors would like to thank Lane Aasen for helping develop the infrastructure for the crowd-sourcing task, and Madeleine van Zuylen for providing expert annotation for the Dev and Test questions.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. In IJCAI.
D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL, pages 2358–2367.
D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In ACL.
Q. Chen, X. Zhu, Z.-H. Ling, S. Wei, H. Jiang, and D. Inkpen. 2017b. Enhanced LSTM for natural language inference. In ACL, pages 1657–1668.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.
P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. 2016. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586.
A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, pages 670–680.
M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In NIPS, pages 1693–1701.
F. Hill, A. Bordes, S. Chopra, and J. Weston. 2016. The Goldilocks principle: Reading children’s books with explicit memory representations. In ICLR.
W. Hoeffding. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark. 2016. What’s in an explanation? Characterizing knowledge and inference requirements for elementary science exams. In COLING.
P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. 2018. WorldTree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC.
T. Jenkins. 1995. Open book assessment in computing degree programmes 1. Technical Report 95.28, University of Leeds.
M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pages 1601–1611.
A. Kembhavi, M. J. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi. 2017. Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension. In CVPR, pages 5376–5384.
D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. 2018. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL.
D. Khashabi, T. Khot, A. Sabharwal, P. Clark, O. Etzioni, and D. Roth. 2016. Question answering via integer programming over semi-structured knowledge. In IJCAI.
T. Khot, A. Sabharwal, and P. Clark. 2017. Answering complex questions using open information extraction. In ACL.
T. Khot, A. Sabharwal, and P. Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
D. P. Kingma and J. L. Ba. 2015. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1–15.
T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.
J. Landsberger. 1996. Study guides and strategies. http://www.studygs.net/tsttak7.htm.
T. Mihaylov and A. Frank. 2016. Discourse relation sense classification using cross-argument semantic similarity based on word embeddings. In CoNLL-16 shared task, pages 100–107.
T. Mihaylov and A. Frank. 2017. Story Cloze Ending Selection Baselines and Data Examination. In LSDSem Shared Task.
T. Mihaylov and A. Frank. 2018. Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension with External Commonsense Knowledge. In ACL, pages 821–832.
T. Mihaylov and P. Nakov. 2016. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval ’16.
G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: An online lexical database. International Journal of Lexicography, 3(4):235–244.
B. D. Mishra, L. Huang, N. Tandon, W.-t. Yih, and P. Clark. 2018. Tracking state changes in procedural text: A challenge dataset and models for process paragraph comprehension. In NAACL.
N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen. 2016. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories. In NAACL.
P. Nakov, L. Màrquez, A. Moschitti, W. Magdy, H. Mubarak, A. A. Freihat, J. Glass, and B. Randeree. 2016. SemEval-2016 task 3: Community question answering. In SemEval ’16, pages 525–545.
T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In EMNLP, pages 2230–2235, Austin, Texas.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392.
M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, pages 193–203.
P. Singh, T. Lin, E. Mueller, G. Lim, T. Perkins, and W. Zhu. 2002. Open mind common sense: Knowledge acquisition from the general public. In Lecture Notes in Computer Science, volume 2519, pages 1223–1237.
R. Speer, J. Chin, and C. Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI.
K. Stasaski and M. A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In BEA@EMNLP, 12th Workshop on Innovative Use of NLP for Building Educational Applications.
S. Sugawara, H. Yokono, and A. Aizawa. 2017. Prerequisite skills for reading comprehension: Multiperspective analysis of MCTest datasets and systems. In AAAI, pages 3089–3096.
A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.
P. D. Turney. 2017. Leveraging term banks for answering complex questions: A case for sparse vectors. CoRR, abs/1704.03543.
D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL, pages 271–280.
J. Welbl, P. Stenetorp, and S. Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL.
Y. Zhang, H. Dai, K. Toraman, and L. Song. 2018. KG^2: Learning to Reason Science Exam Questions with Contextual Knowledge Graph Embeddings. In arXiv.
This module is the first part of a two-stage model for incorporating knowledge from an external source $K$. For each instance $(q, C)$ in the dataset, where $q$ is a question and $C = \{c_1, \ldots, c_4\}$ a set of answer choices, it performs information retrieval (IR) on $K$ to select a fixed-size subset $K_{q,C}$ of potentially relevant facts. The second module is a neural network that takes $(q, C, K_{q,C})$ and predicts the answer $a_{q,C}$.
For the IR module, we use TfIdfVectorizer14 to build vector representations $\vec{q}$ and $\vec{c_i}$ for the question $q$ and each choice $c_i \in C$, based on the tokens in the training set. We then calculate similarity scores $t_{q,k}$ and $t_{c_i,k}$ between $q$ and $c_i$, resp., and each of the external facts $k \in K$:
$$t_{q,k} = \mathrm{sim}(\vec{q}, \vec{k}), \qquad t_{c_i,k} = \mathrm{sim}(\vec{c_i}, \vec{k}),$$
where sim is implemented as cosine similarity. Based on these similarity scores, we obtain a set of facts for each choice $c_i$:
$$K_{q,c_i} = K_q \cup K_{c_i},$$
where $K_q$ and $K_{c_i}$ are the top $N_k$ facts each with highest similarity $t_{q,k}$ and $t_{c_i,k}$, respectively. $N_k$ is a hyper-parameter chosen from {5, 10, 20} so as to yield the best Dev set performance.
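The retrieval step can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: it swaps the library vectorizer for a small hand-written TF-IDF, assumes lowercase whitespace tokenization, and assumes that the facts retrieved for the question and for a choice are merged by set union. The names `fit_idf`, `tfidf`, `top_facts`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def fit_idf(train_texts):
    """Smoothed inverse document frequencies over the training vocabulary."""
    n_docs = len(train_texts)
    doc_freq = Counter()
    for text in train_texts:
        doc_freq.update(set(tokenize(text)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; tokens outside the training vocabulary are dropped."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    return {tok: tf * idf[tok] for tok, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def top_facts(text, facts, idf, n):
    """The n facts with the highest cosine similarity to `text`."""
    vec = tfidf(text, idf)
    ranked = sorted(facts, key=lambda k: cosine(vec, tfidf(k, idf)), reverse=True)
    return ranked[:n]


def retrieve(question, choices, facts, idf, n=5):
    """Per-choice fact set: top-n facts for the question plus top-n for the choice.

    Assumption: the two lists are merged by union, keeping first-seen order.
    """
    q_facts = top_facts(question, facts, idf, n)
    return {c: list(dict.fromkeys(q_facts + top_facts(c, facts, idf, n)))
            for c in choices}
```

Calling `retrieve(question, choices, facts, fit_idf(train_texts), n=10)` returns, for each answer choice, at most 20 candidate facts that a downstream reader could consume.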
For experimentation with knowledge, we consider the ‘open book’ set of facts F in conjunction with two sources of common knowledge: the Open Mind Common Sense (Singh et al., 2002) part of ConceptNet (Speer et al., 2017), and its WordNet (Miller, 1995) subset.
Our neural models are implemented with AllenNLP15 (Gardner et al., 2017) and PyTorch16 (Paszke et al., 2017). We use cross-entropy loss and the Adam optimizer (Kingma and Ba, 2015) with initial learning rate 0.001. For the neural models without external knowledge, we typically train the model for a maximum of 30 epochs and stop training early if the Dev set accuracy does not improve for 10 consecutive epochs. We also halve the learning rate if there is no Dev set improvement for 5 epochs. For the neural models with external knowledge, we typically train for 60 epochs with a patience of 20 epochs. For most of our neural models, we use h = 128 as the LSTM hidden layer size. The embedding dropout rate is chosen from {0.1, 0.2, 0.5}, again based on the best Dev set performance.
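The interaction between early stopping and learning-rate halving can be sketched by replaying a list of per-epoch Dev accuracies through the schedule. This is only an illustration in plain Python, not the AllenNLP trainer configuration; the function name `run_schedule` is hypothetical, and halving the rate after every 5 further epochs without improvement is an assumption about how the rule repeats.

```python
def run_schedule(dev_accuracies, max_epochs=30, patience=10,
                 lr=0.001, lr_patience=5):
    """Replay per-epoch Dev accuracies through the training schedule.

    Returns (best_epoch, best_accuracy, final_lr, epochs_run), with
    epochs counted from 0.
    """
    best_acc = float("-inf")
    best_epoch = -1
    since_best = 0  # consecutive epochs without a new best Dev accuracy
    epochs_run = 0
    for epoch, acc in enumerate(dev_accuracies[:max_epochs]):
        epochs_run = epoch + 1
        if acc > best_acc:
            best_acc, best_epoch, since_best = acc, epoch, 0
            continue
        since_best += 1
        # Assumption: halve after each block of lr_patience stalled epochs.
        if since_best % lr_patience == 0:
            lr *= 0.5
        # Early stopping once the patience budget is exhausted.
        if since_best >= patience:
            break
    return best_epoch, best_acc, lr, epochs_run
```

With the settings for models that use external knowledge, the same function would be called with `max_epochs=60` and `patience=20`.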
For each model configuration, we perform 5 experiments with different random seeds. For each run, we take the model with the best performance on Dev and evaluate on Test. We report the average accuracy for the best Dev score and the average of the corresponding Test score, each ± the standard deviation across the 5 random seeds.
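As a small worked illustration of this reporting protocol, the sketch below selects, for each seed, the checkpoint with the best Dev accuracy and then averages the Dev and corresponding Test scores. The function name `summarize` and the input layout are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev


def summarize(runs):
    """Aggregate results over random seeds.

    `runs` holds one list per seed; each list contains (dev_acc, test_acc)
    pairs, one per saved checkpoint.
    """
    # For each seed, keep the checkpoint with the highest Dev accuracy.
    best = [max(run, key=lambda pair: pair[0]) for run in runs]
    dev = [d for d, _ in best]
    test = [t for _, t in best]
    return {
        "dev_mean": mean(dev),
        "dev_std": stdev(dev),    # assumption: sample standard deviation
        "test_mean": mean(test),
        "test_std": stdev(test),
    }
```

Note that the Test score is never used for selection: a checkpoint with a higher Test accuracy but a lower Dev accuracy is ignored.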
The code for the models and the configuration files required for reproducing the results are available at
C.1 Question Answering: ARC
We also perform experiments with the Question Match system on the Challenge (hard) set of the AI2 Reasoning Challenge or ARC (Clark et al., 2018). We train several models with different LSTM hidden sizes (128, 256, 384 (best), 512), and dropout of the embedding layer (0.0 (best), 0.2, 0.5) on the questions from the Challenge Train set and take the model that has the highest accuracy on the Dev set. The resulting system scores 33.87% on the Challenge Test set, which is 2.17% higher than the previous best score by Zhang et al. (2018). The code and model configuration are available at
C.2 Textual Entailment: SciTail
We perform textual entailment experiments on the Science enTailment dataset SciTail (Khot et al., 2018). We change the Question Match model to a classic BiLSTM Max-Out (Conneau et al., 2017) for textual entailment, by replacing the question q and a choice with the premise p and the hypothesis h, resp., and perform binary classification on the entailment labels (Entail, Neutral). We run experiments with BiLSTM encoders with LSTM hidden size of 384 and share the encoder parameters between the premise and the hypothesis. Without additional hyper-parameter tuning, this yields entailment accuracy scores of 87.9% and 85.4% on the Dev and Test sets, respectively.
We give some examples of questions that were answered correctly/incorrectly by various groups of models. We include here the first three questions in each case.
D.1 Neural Baseline Successes
We begin with three examples of questions that all neural models without external knowledge (namely Question Match, Plausible Answer, One-Odd-Out, and ESIM from the fourth group in Table 5) predicted correctly.
A body may find its temperature to be lowered after (A) water is heated up (B) fluid spreads from pores (C) the air becomes arid (D) the sky stays bright
Oil is a non-renewable resource which tells us that when (A) it can be remade (B) it can be found in other places (C) there is an endless supply (D) the final barrel is gone, there supply is finished
Magma contains (A) particles of iron (B) Loads of leaves (C) Soda (D) Silly Putty
Table 5: Sample questions predicted correctly (172/500) by all trained neural models without external knowledge.
In these examples, we observe that the correct answer usually contains a word that is semantically closer (than words in other answer choices) to an important word from the question: pores to body; non-renewable (negative sentiment) to gone, finished (also negative sentiment); iron to magma (liquid rock).
D.2 Neural Baseline Failures, Oracle Success
Table 6 shows example questions (with the Oracle facts) from the Dev set that were predicted correctly by the f + k Oracle model (405/500) but incorrectly by all of the 4 neural models without knowledge (69/405). In contrast to Table 5, a simple semantic similarity is insufficient. The questions require chaining of multiple facts in order to arrive at the correct answer.
Frilled sharks and angler fish live far beneath the surface of the ocean, which is why they are known as (A) Deep sea animals (B) fish (C) Long Sea Fish (D) Far Sea Animals. Oracle facts: (f) deep sea animals live deep in the ocean. (k) Examples of deep sea animals are angler fish and frilled sharks.
Gas can fill any container it is given, and liquid (A) is standard weight and size (B) is the opposite of variable (C) only needs a few (D) uses what it needs. Oracle facts: (f) Matter in the liquid phase has definite volume. (k) liquid cannot spread endlessly.
When birds migrate south for the winter, they do it because (A) they are genetically called to (B) their children ask for them to (C) it is important to their happiness (D) they decide to each year. Oracle facts: (f) migration is an instinctive behavior. (k) instinctive is genetic.
Table 6: Sample questions predicted correctly by the f + k Oracle model (405/500) but incorrectly by all of the 4 neural models without knowledge (total of 69 out of 405).
D.3 Neural Baseline and Oracle Failures
42/500 questions in the Dev set were predicted incorrectly by all models without external knowledge, as well as by the Oracle f + k model. In Table 7 we show 3 such questions. In all cases, the Oracle f + k model made an incorrect prediction with confidence higher than 0.9.
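The three groups of Dev questions discussed in this appendix (all knowledge-free neural models correct; Oracle correct while all of those models fail; every model fails) can be derived from per-question predictions with simple set logic. The sketch below is an assumption about how such a breakdown could be computed, not the analysis script used for the paper; the function name `partition` and the group labels are hypothetical.

```python
def partition(gold, baseline_preds, oracle_preds):
    """Group question ids by which models answered them correctly.

    gold:           dict mapping question id -> correct label
    baseline_preds: dict mapping model name -> (dict question id -> label)
    oracle_preds:   dict mapping question id -> label predicted by the Oracle
    """
    groups = {"all_baselines_correct": [], "oracle_only": [], "all_fail": []}
    for qid, answer in gold.items():
        hits = [preds[qid] == answer for preds in baseline_preds.values()]
        oracle_ok = oracle_preds[qid] == answer
        if all(hits):
            # Every knowledge-free model is right, regardless of the Oracle.
            groups["all_baselines_correct"].append(qid)
        elif not any(hits) and oracle_ok:
            # Oracle succeeds where every knowledge-free model fails.
            groups["oracle_only"].append(qid)
        elif not any(hits):
            # Neither the baselines nor the Oracle answer correctly.
            groups["all_fail"].append(qid)
    return groups
```

Questions on which the knowledge-free models disagree among themselves fall into none of the three groups, which is why the group sizes do not sum to the size of the Dev set.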
As noted earlier, there are several broad reasons why even this so-called oracle model fails on certain questions in OpenBookQA. In some cases, the core fact f associated with a question q isn’t actually helpful in answering q. In many other cases, the corresponding second fact k is noisy, incomplete, or only distantly related to q. Finally, even if f and k are sufficient to answer q, it is quite possible for this simple model to be unable to perform the reasoning that’s necessary to combine these two pieces of textual information in order to arrive at the correct answer.
In the shown examples, the first question falls outside the domain of Science where most of the core facts come from. The scientific fact “(f) An example of collecting data is measuring” is transformed into a question related to the law and judicial domain of collecting data for a (court) case. This is an indication that the model trained on the Train set does not perform well on distant domains, even if the core facts are provided.
In the second question, we have an option all of these. Indeed, the selected answer seems the most relevant (a generalized version of the other two), but the model did not know that if we have an option all of these and all answers are plausible, it should decide if all answers are correct and not pick the “most likely” individual answer.
The third question again requires the model to select a special type of aggregate answer (“all but xyz”), but the related Oracle facts are pointing to a specific answer.
An example of data collection is: (A - 0.9977) Deleting case files on the computer, (B - 0.0000) Touching evidence without gloves, (C - 0.0004) speaking with a witness, (D - 0.0019) Throwing documents in the trash. Oracle facts: (f) An example of collecting data is measuring. (k) Interviews are used to collect data.
If a farmland up the hill gets rainfall, what could happen to lower lands? (A - 0.0005) all of these, (B - 0.0245) they could get fertilizer washed to them, (C - 0.9542) they could experience unfavorable chemical change in their lands, (D - 0.0208) they could have their lands poisoned. Oracle facts: (f) runoff contains fertilizer from cropland. (k) fertilizers for certain crops could poison other crops or soil types.
Layers of the earth include all but: (A - 0.0429) mantle, (B - 0.0059) center, (C - 0.0334) crust, (D - 0.9177) inner core. Oracle facts: (f) the crust is a layer of the Earth. (k) the last layer is the outer core.
Table 7: Sample questions predicted incorrectly by all models w/o knowledge, as well as the f + k Oracle model, even though the Oracle model has confidence higher than 0.90.