Extensive recent work contributes computational models of sentence-level entailment, and proposition-level semantic similarity more broadly. We propose that end-users of these methods could eventually include: (i) historians of science tracking expression of the idea that “vaccines cause autism” after the 1998 study in The Lancet making this claim; (ii) political scientists and journalists tracking fine-grained opinions like “immigrants are often unfairly used as scapegoats for problems in society” in the media; and (iii) public servants seeking to understand the challenges facing a community after a disaster by tracking claims like “dealing with authorities is causing stress and anxiety.”
What all of these examples have in common is that a user specifies a natural language proposition query: an idea likely to occur in a given text collection. Natural languages offer many ways to express any idea, so it is an open question what kinds of semantic matching methods will be required to fulfill the information needs of different kinds of users.
In this paper, we demonstrate how semantic matching methods can be used more widely. In §3, we start with simple word-vector-based methods inspired by earlier work on semantic similarity and relatedness. As an initial test, we define two new tasks that exploit existing labeled corpora (§3.2). First, using the CNN/Daily Mail Reading Comprehension dataset (Hermann et al., 2015), we evaluate whether a model can identify the relevant sentence in an article given a proposition query. Second, using the Media Frames Corpus (a collection of news articles about immigration in the US; Card et al., 2015), we derive proposition queries from the examples in its annotation codebook. In both evaluations, we find that a simple word vector average-based matching algorithm can retrieve sentences marked by annotators reasonably well.
Figure 1: An example of semantic matching in the domain of natural disaster recovery.
Given those positive results, we introduce a more realistic application: propositions in the domain of natural disaster recovery (§3.3; example in Fig. 1). A domain expert collaborating on the research provided the proposition queries, and our evaluation is a user study with twenty emergency managers.
Since this application suggests that a more nuanced model of semantics than the word-vector-averaging models is necessary, we turn to more complex entailment-based models (§4). We introduce a new syntax-based model for matching, trained on the SNLI dataset (Bowman et al., 2015). Our user study shows that this model offers higher quality matches than the vector-averaging baselines. We find further confirmation of these results in a follow-up study where the emergency managers themselves created the proposition queries. Finally, we introduce an application of semantic matching, semantic measurement (§5), by qualitatively exploring the frequency of matches over time. In the user community we surveyed, we find that there is interest in tools for semantic matching and measurement.
We formalize the semantic matching problem as follows. Let C denote a corpus consisting of a collection of documents, each a list of English sentences (individually denoted by s). The proposition query is also a sentence. The goal is to find sentences s ∈ C such that s expresses the idea contained in the query. To do so, we assume that the sentences s ∈ C will be ranked by some function m that scores each s against the query, and the top n will be returned to the user as a set of matches. We can think of m as a model of semantic similarity (as in §3) or entailment (as in §4). This setup is quite similar to (sentence-level) text retrieval, except that the user is assumed to be interested in the full set of matches, rather than in answering a specific information need using any relevant match. (See §7 for further discussion of related tasks.)
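The ranking setup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `top_n_matches` and the toy scorer `word_overlap` are hypothetical, and the word-overlap score merely stands in for a real matching function m.

```python
from typing import Callable, List, Tuple


def top_n_matches(
    query: str,
    corpus: List[List[str]],
    m: Callable[[str, str], float],
    n: int,
) -> List[Tuple[float, str]]:
    """Score every sentence in every document with m and keep the n best."""
    sentences = [s for document in corpus for s in document]
    scored = [(m(query, s), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def word_overlap(query: str, sentence: str) -> float:
    """Toy stand-in for m: Jaccard overlap of lowercased whitespace tokens."""
    q, s = set(query.lower().split()), set(sentence.lower().split())
    union = q | s
    return len(q & s) / len(union) if union else 0.0


# A corpus is a list of documents; each document is a list of sentences.
corpus = [
    ["The council ignored residents.", "Roads reopened on Monday."],
    ["Residents say the council did not consult them."],
]
matches = top_n_matches(
    "The council should have consulted residents.", corpus, word_overlap, n=2
)
for score, sentence in matches:
    print(f"{score:.3f}  {sentence}")
```

Any scoring function with the same signature can be swapped in for `word_overlap`, which is the point of keeping m as a parameter.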
We note that our approach assumes segmentation at the sentence level, but alternative formulations (where the expression of an idea may span several sentences or only a clause or phrase in a sentence) can be considered straightforwardly. Here, document structure is not used in identifying matches, but could be an interesting source of information in future work.
We begin with simple word-vector-averaging models. We then construct two relevant tasks based on existing corpora (CNN/Daily Mail Reading Comprehension, Media Frames Corpus), and demonstrate the models’ viability on these tasks. Given these results, we introduce a more realistic application, where a domain expert specifies proposition queries about natural disaster recovery, and validate the output by performing a user study with emergency response professionals.
3.1 Word Vector Averaging
To match proposition queries to sentences s from a corpus, we first consider a scoring method inspired by work on paraphrase (Wieting et al., 2016) and averaging networks (Iyyer et al., 2015). Each sentence is represented as the average of its word vectors, and the similarity score between the query and s is the cosine similarity between their vectors.
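A minimal sketch of this scoring method follows. The three-dimensional `toy_embeddings` table is invented purely for illustration (real systems would load pre-trained vectors), and whitespace tokenization and skipping out-of-vocabulary words are simplifying assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List


def average_vector(tokens: List[str], embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of all in-vocabulary tokens (zero vector if none)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(query: str, sentence: str, embeddings: Dict[str, List[float]], dim: int) -> float:
    """Cosine similarity between the averaged word vectors of query and sentence."""
    q_vec = average_vector(query.lower().split(), embeddings, dim)
    s_vec = average_vector(sentence.lower().split(), embeddings, dim)
    return cosine(q_vec, s_vec)


# Hypothetical toy vectors; near-synonyms are placed close together by hand.
toy_embeddings = {
    "workers": [1.0, 0.0, 0.0],
    "builders": [0.9, 0.1, 0.0],
    "shortage": [0.0, 1.0, 0.0],
    "scarce": [0.1, 0.9, 0.0],
    "weather": [0.0, 0.0, 1.0],
    "sunny": [0.0, 0.1, 0.9],
}
close = match_score("shortage workers", "builders scarce", toy_embeddings, 3)
far = match_score("shortage workers", "sunny weather", toy_embeddings, 3)
print(close > far)
```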
Of course, the choice of pre-trained word vectors could have a large effect on the quality of a semantic matching system, so we examine two options. We first consider 300-dimensional paraphrastic word vectors generated by Wieting et al. (2016); we selected these because they were designed specifically for semantic similarity between sequences. We also select the widely used word2vec vectors (Mikolov et al., 2013), which are trained on Google News and contain 300-dimensional vectors for approximately 3 million words. These are of interest because they are relatively fast to train on large amounts of data. Because they are derived from unstructured news text, they are more likely to contain proper nouns/entities of interest than the paraphrastic vectors, which are trained on the Paraphrase Database (Ganitkevitch et al., 2013).
As a sanity check, we also test an even simpler information retrieval-inspired model that uses cosine similarity of tf-idf vectors of the query and s.
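For comparison, a tf-idf baseline of this kind might look like the sketch below. The exact weighting scheme is not specified in the text, so the smoothed idf used here, and the choice to drop query words unseen in the corpus, are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def build_idf(tokenized_sentences: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(tokenized_sentences)
    doc_freq: Counter = Counter()
    for tokens in tokenized_sentences:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse tf-idf vector; tokens missing from the idf table are dropped."""
    term_freq = Counter(tokens)
    return {t: count * idf[t] for t, count in term_freq.items() if t in idf}


def sparse_cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[t] for t, weight in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sentences = [
    ["council", "ignored", "residents"],
    ["roads", "reopened"],
    ["residents", "angry", "council"],
]
idf = build_idf(sentences)
query_vec = tfidf_vector(["council", "consulted", "residents"], idf)
scores = [sparse_cosine(query_vec, tfidf_vector(s, idf)) for s in sentences]
print(scores)
```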
3.2 Matching Queries in Existing Corpora
Before investing in the design of a new application, we exploit existing corpora labeled for related tasks (CNN/Daily Mail Reading Comprehension and Media Frames Corpus) to test the effectiveness of simple word-vector-averaging models (§3.1). In both cases, our evaluation differs from the tasks originally introduced by the dataset, because our interest is in semantic matching applications (§2).
3.2.1 CNN/Daily Mail
The CNN/Daily Mail Reading Comprehension dataset (Hermann et al., 2015) contains 93k articles from CNN and 220k articles from the Daily Mail. Each instance consists of an article, a query (constructed from bullet point summaries in the original articles), and an answer to the query. For each instance, we take the proposition query to be its query and the “corpus” C to be the set of sentences in its article; the model is asked to find the sentence which contains the entity in the answer. This problem is simpler than our target application, because the query is only being matched against sentences in one document (30 sentences on average). Nonetheless, this dataset provides an initial testbed. We emphasize that we are interested only in identifying relevant sentences, and not in finding the answer-entity. We consider a sentence relevant if it contains the correct answer.
3.2.2 Media Frames Corpus
The Media Frames Corpus (Card et al., 2015) contains several thousand news articles related to three policy issues (immigration, tobacco, and same-sex marriage). These articles were annotated with fifteen “framing dimensions” according to a codebook developed by Boydstun et al. (2014). The texts were annotated by a team of political science experts according to the framing dimensions; any span of text could be labeled with any frame, and overlapping is possible. An example span of text annotated with the quality of life frame is “we hear statistics rather than stories, stories of lives mired in human suffering.”
Figure 2: Word-vector-averaging model results for tasks based on existing corpora (§3.2).
Importantly, the codebook includes expert-designed examples for each framing dimension. We take proposition queries to be these examples. The intuition is that a sentence in the corpus that matches a codebook example for frame F is also expected to evoke frame F. For instance, “immigration rules have changed unfairly over time” and “allowing unauthorized immigration is unfair to those who apply and wait” are both examples of the fairness and equality frame.
In this work, we focus on the immigration-related articles, as the codebook for this subset of the corpus was most complete. From the codebook, we obtain 30 proposition queries across ten framing dimensions (not every framing dimension has examples provided for immigration). The full list of codebook examples used is provided in the appendix (Table 4).
Because annotated spans can be any part of a sentence, we consider a sentence to be annotated with a frame if any part of it is annotated with that frame. In cases where the corpus annotators disagree on which framing dimension is evoked, we note agreement if any of the annotators has specified the frame of interest. We will examine how well the output from the models described in §3.1 aligns with existing frame annotations. We do not expect high recall on this task, since many annotations in the corpus evoke framing dimensions in ways semantically distant from the codebook’s examples.
3.2.3 Results on Existing Corpora
We run each of the models in §3.1 across the train and test partitions of the CNN/Daily Mail corpus, and on the immigration section of the Media Frames Corpus. For the CNN/Daily Mail evaluation, we compute recall at different values of n (the number of top-scoring sentences to output) to see how well our models can identify the relevant sentence(s). In contrast, for the Media Frames Corpus, recall is not interesting since matches to frame annotations will certainly not cover all possible evocations of their frame, so we examine precision for varying values of n.
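The two metrics can be made concrete with a short sketch. The definitions below (fraction of the top n that is relevant, and fraction of relevant sentences found in the top n) are the standard ones and are assumed here; the sentence identifiers are made up.

```python
from typing import List, Set


def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n ranked sentences that are relevant."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for sentence_id in top if sentence_id in relevant) / len(top)


def recall_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of all relevant sentences that appear in the top n."""
    if not relevant:
        return 0.0
    found = set(ranked[:n]) & relevant
    return len(found) / len(relevant)


# Hypothetical model output (best first) and gold labels for one query.
ranked = ["s3", "s1", "s7", "s2"]
relevant = {"s1", "s2"}
for n in (1, 2, 4):
    print(n, precision_at_n(ranked, relevant, n), recall_at_n(ranked, relevant, n))
```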
We plot the results in Fig. 2. In both tasks, we find that the word-vector-based variants result in improved performance over the tf-idf baseline. (In the CNN/Daily Mail task, the word2vec and tf-idf baselines behave similarly for n = 1 and n = 2; as n increases, word2vec becomes significantly better.) We also find that the paraphrastic vector model performs better than word2vec, which may be a result of the paraphrastic vectors being trained with semantic similarity tasks in mind. Of course, we expect that this simple method can be improved with better sentence representations and/or application-specific supervision; we nonetheless consider these results encouraging.
3.3 Matching Expert Queries
The experiments in §3.2 provide a proof of concept: proposition queries can be matched in text using word vectors. We now turn to a design that considers real users who seek to match ideas to text in a specific domain. In particular, we collaborate with an expert in disaster recovery to examine how text sources (e.g., newspapers and reports from government and utility organizations) reveal how communities recover.
3.3.1 Domain Description and Data
Researchers and public servants are interested in understanding the challenges facing a community after a disaster. However, on-the-ground empirical studies can be expensive to conduct, especially across a multi-year recovery period and a wide variety of variables. We propose that these users might obtain additional data through semantic matching of ideas of interest in relevant text.
More specifically, we examine recovery after the Canterbury/Christchurch (New Zealand) earthquakes that took place in late 2010 and early 2011. We collected 982 earthquake-related articles from New Zealand news websites, spanning 2011 through 2015. We obtained 20 proposition queries from our domain expert; the queries cover topics like community wellbeing, infrastructure, and decision making. An example query is: “The council should have consulted residents before making decisions.” The full list of proposition queries is provided in Table 5 in the appendix.
3.3.2 User Study Evaluation
To evaluate and compare the performance of the models, we conducted a user study with twenty emergency managers. Emergency managers are state/local personnel responsible for planning, administration, operations, and logistics related to natural and manmade hazard events, and therefore might be interested in relevant ideas found in text.
Experimental design. In this experiment, we compare word2vec and paraphrastic word-vector-averaging models; unlike the tasks in §3.2, we turn to users to evaluate the quality of matches. Every sentence in the news corpus was scored against each of the 20 proposition queries, for each model considered.
User study. Ideally, we would have our users judge how well every candidate sentence matches every proposition query. Since expert users’ time is limited, we instead sampled sentences from the following categories for the word2vec and paraphrastic vector-based models: (i) top, the 25 highest-scoring sentences output by the model; (ii) middle, 25 sentences, sampled randomly from those in ranks 26–250 according to the model scores; (iii) bottom, 25 sentences, sampled randomly from those ranked at 251 or lower.
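The three sampling bands translate directly into list slices over a model's ranked output. The sketch below assumes uniform sampling without replacement within the middle and bottom bands; the function name and the seeded random generator are illustrative choices, not details from the study.

```python
import random
from typing import Dict, List


def sample_for_judging(
    ranked: List[int],
    rng: random.Random,
    band_size: int = 25,
    middle_end: int = 250,
) -> Dict[str, List[int]]:
    """Pick the top band outright and sample the middle and bottom bands."""
    top = ranked[:band_size]                  # ranks 1-25
    middle_pool = ranked[band_size:middle_end]  # ranks 26-250
    bottom_pool = ranked[middle_end:]         # ranks 251 and lower
    middle = rng.sample(middle_pool, min(band_size, len(middle_pool)))
    bottom = rng.sample(bottom_pool, min(band_size, len(bottom_pool)))
    return {"top": top, "middle": middle, "bottom": bottom}


# Stand-in for a ranked list of sentences: each item is its own 1-based rank.
ranked_output = list(range(1, 1001))
bands = sample_for_judging(ranked_output, random.Random(0))
print({name: len(items) for name, items in bands.items()})
```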
We gave each user the prompt, “Given an idea sentence, score each candidate sentence on a 1–5 scale based on how well it expresses the idea. The preceding and following sentences for each candidate are provided for context, but please score the quality of only the bolded candidate sentence.” We provided users with a sample idea sentence and candidate sentences scored by the same domain expert who supplied the idea sentences (Table 1). We also provided score descriptions from 1 through 5 (Table 2).
The candidate sentences to be scored were spread among all 20 participants; users were not made aware of which model or sentence category the output came from. To allow calculation of inter-annotator agreement, half of the sentences received three judgments (rather than just one). We computed Krippendorff’s α for interval data to be 0.784, which indicates reasonable agreement when users rate the same sentence (Krippendorff, 2012).
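For readers unfamiliar with the agreement statistic, the sketch below computes Krippendorff's α with the interval (squared-difference) metric, following the standard definition α = 1 − D_o/D_e. It is not the authors' evaluation code; the function name and toy ratings are hypothetical, and sentences with a single judgment are simply ignored because they form no rating pairs.

```python
from itertools import combinations
from typing import List, Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha for interval data.

    `units` holds one list of ratings per rated sentence. Sentences with
    fewer than two ratings cannot form a pair and are skipped.
    """
    pairable = [list(u) for u in units if len(u) >= 2]
    pooled: List[float] = [value for u in pairable for value in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: squared differences within each item,
    # weighted by 1 / (m_u - 1), where m_u is that item's rating count.
    within = sum(
        sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in pairable
    )
    # Expected disagreement: squared differences over all pooled rating pairs.
    total = sum((a - b) ** 2 for a, b in combinations(pooled, 2))
    if total == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - (n - 1) * within / total


# Toy ratings on the 1-5 scale; the last sentence has only one judgment.
ratings = [[5, 5, 4], [1, 1, 2], [3, 3, 3], [4]]
print(round(krippendorff_alpha_interval(ratings), 3))
```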
Table 1: Provided examples for the idea sentence: “There is a shortage of construction workers.” The candidate sentence is in bold, with the preceding and following sentences provided for context.
Table 2: Scoring guidelines for the user studies.
Figure 3: User study scores for output from the word-vector-averaging models (§3.3). Error bars represent standard error.
Results. Our findings, shown in Fig. 3, confirmed our expectations: users rated bottom sentences low (around 1), and top sentences better than middle ones. As in §3.2, the paraphrastic vectors led to output receiving better ratings than word2vec (3.1 vs. 2.7 on average), establishing a baseline that finds sentences “related to, but not (yet) adequately expressing” the proposition query.
In §3, we found that word-vector-averaging models perform only adequately in the disaster recovery application we introduced. We next consider a model based on a richer notion of semantic matching, where a matched sentence should entail the proposition query.
4.1 Tree Edit Models
As a starting point for the semantic matching function m, we use the tree edit model introduced by Heilman and Smith (2010). We select this model because it is simple and interpretable, and it was demonstrated to be suitable for a range of semantic similarity problems, including entailment, paraphrase, and answer ranking for question answering.
4.1.1 Base Model
We summarize the base model from Heilman and Smith (2010) and refer the reader to the original paper for further details.
For the candidate sentence s and the proposition query, we first obtain dependency parse trees. We then choose a tree edit sequence (i.e., a sequence of edit operations) that transforms T, the tree of s, into the tree of the query. Edit operations include adding nodes (words), deleting nodes, relabeling dependency relations, and so on; the full list is provided in the appendix (Table 7). The edit sequence is found using beam search, with a heuristic function that depends on the lemmas, part of speech tags, arc labels, and whether a node is a left or right child of its parent.
A set of 33 integer-valued features is extracted from the edit sequence. These features include the sequence length and counts of different edit types; the full list is provided in the appendix (Table 8). A logistic regression (LR) model is trained on these features.
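To illustrate the flavor of these count features, the sketch below derives the sequence length and per-type counts from an edit sequence. It covers only the five operation names mentioned in this paper's text, not the full set of 33 features in Table 8, and the `(operation, detail)` tuple representation is an assumption made for the example.

```python
from collections import Counter
from typing import List, Tuple

# Only the operation types named in the text; the complete inventory
# is in the paper's appendix and is not reproduced here.
EDIT_TYPES = (
    "INSERT-CHILD",
    "INSERT-PARENT",
    "RELABEL-NODE",
    "DELETE-LEAF",
    "DELETE-&-MERGE",
)


def count_features(edit_sequence: List[Tuple[str, str]]) -> List[int]:
    """Return [sequence length, count of each edit type in EDIT_TYPES order]."""
    counts = Counter(operation for operation, _detail in edit_sequence)
    return [len(edit_sequence)] + [counts[t] for t in EDIT_TYPES]


# Hypothetical edit sequence turning a candidate sentence into a query.
edits = [
    ("DELETE-LEAF", "very"),
    ("RELABEL-NODE", "builders->workers"),
    ("DELETE-LEAF", "today"),
]
features = count_features(edits)
print(features)
```

A feature vector of this kind could then be passed to any off-the-shelf logistic regression implementation.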
4.1.2 Neural Tree Edit Model
Given the many successes of non-linear models and the sequential nature of the tree edits, we introduce a neural network variant of the model. We select a tree edit sequence exactly as described above, and then use an LSTM (Hochreiter and Schmidhuber, 1997) that estimates m by reading in the tree edits in sequence. Each element in the tree edit sequence is vectorized as the concatenation of:
• A one-hot encoding of the operation type.
• A word-embedding-like vector, in the same space as the word embeddings, that aims to capture the word-embedding-space “difference” between the sentences before and after the edit operation. For example, if a new node is added to the tree (INSERT-CHILD, INSERT-PARENT), then we use the word embedding for that word. If a node is relabeled (RELABEL-NODE) with a new lemma, then we use the difference between word embeddings for the replacement and original word. If a word is deleted (DELETE-LEAF, DELETE-&-MERGE), then we use the negated embedding of the deleted word. In other cases, we use a zero vector.
This approach allows the model to take lexical and sequential information into account rather than just counts of operations. Note that both approaches make use of syntactic context when representing edits to sentences.
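The vectorization of a single edit described above can be sketched as follows. The operation names come from the text; the `OTHER` catch-all slot, the two-dimensional toy embeddings, and the treatment of unknown words as zero vectors are assumptions added for illustration.

```python
from typing import Dict, List, Optional

# Named operations from the text plus a hypothetical catch-all slot.
EDIT_TYPES = (
    "INSERT-CHILD",
    "INSERT-PARENT",
    "RELABEL-NODE",
    "DELETE-LEAF",
    "DELETE-&-MERGE",
    "OTHER",
)


def vectorize_edit(
    operation: str,
    old_word: Optional[str],
    new_word: Optional[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Concatenate a one-hot operation code with a lexical 'difference' vector."""
    slot = operation if operation in EDIT_TYPES else "OTHER"
    one_hot = [1.0 if slot == t else 0.0 for t in EDIT_TYPES]
    zero = [0.0] * dim

    def embed(word: Optional[str]) -> List[float]:
        return embeddings.get(word, zero) if word is not None else zero

    if operation in ("INSERT-CHILD", "INSERT-PARENT"):
        lexical = list(embed(new_word))        # embedding of the inserted word
    elif operation == "RELABEL-NODE":
        lexical = [a - b for a, b in zip(embed(new_word), embed(old_word))]
    elif operation in ("DELETE-LEAF", "DELETE-&-MERGE"):
        lexical = [-a for a in embed(old_word)]  # negated deleted word
    else:
        lexical = list(zero)
    return one_hot + lexical


toy_embeddings = {"builders": [1.0, 0.0], "workers": [0.5, 0.5]}
relabel = vectorize_edit("RELABEL-NODE", "builders", "workers", toy_embeddings, 2)
delete = vectorize_edit("DELETE-LEAF", "builders", None, toy_embeddings, 2)
print(relabel)
print(delete)
```

The resulting sequence of fixed-width vectors is the kind of input an LSTM would consume one edit at a time.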
4.1.3 Training
We use the Stanford Natural Language Inference corpus (SNLI; Bowman et al., 2015). SNLI contains approximately 570,000 pairs of sentences (premise and hypothesis); each sentence pair is human-annotated with an entailment, contradiction, or neutral label of the relationship between the two sentences. (As is standard, we ignore examples marked as “unlabeled” due to annotator disagreement.)
For the purposes of our matching function m, we recast the SNLI examples into a binary framework as follows. We treat the premise sentence as analogous to the candidate s and the hypothesis as the proposition query. Premise-hypothesis pairs labeled as entailment are considered positive matches, and those labeled as contradiction or neutral are considered negative matches.
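The recasting step is simple enough to show directly. The field names `sentence1`, `sentence2`, and `gold_label`, and the use of `-` for examples without a consensus label, follow the public SNLI JSONL release as we understand it and should be treated as assumptions; the output keys are hypothetical.

```python
from typing import Dict, Optional


def recast_snli(example: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Map a three-way SNLI example to a binary candidate/query match."""
    label = example["gold_label"]
    if label == "-":  # no annotator consensus: skip the example
        return None
    return {
        "candidate": example["sentence1"],  # premise plays the role of s
        "query": example["sentence2"],      # hypothesis plays the query
        "match": label == "entailment",     # contradiction/neutral are negative
    }


examples = [
    {"sentence1": "A man plays guitar on stage.", "sentence2": "A person makes music.", "gold_label": "entailment"},
    {"sentence1": "A man plays guitar on stage.", "sentence2": "The man is asleep.", "gold_label": "contradiction"},
    {"sentence1": "A dog runs.", "sentence2": "A dog chases a ball.", "gold_label": "neutral"},
    {"sentence1": "A dog runs.", "sentence2": "An animal is outside.", "gold_label": "-"},
]
recast = [r for r in (recast_snli(e) for e in examples) if r is not None]
print([r["match"] for r in recast])
```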
We train three model variants: the original logistic regression (LR) model, and the LSTM using the two pre-trained word embeddings discussed and motivated in §3.1. We use the standard SNLI train/development splits to tune hyperparameters; for the LSTM models, we optimize using Adam (Kingma and Ba, 2014).
4.1.4 Fast filter
Many entailment models, including the ones described in this section, require fairly sophisticated semantic analysis, and therefore significant computational expense. Furthermore, many sentences in C can be easily determined not to match the query. Therefore, we incorporate the word-vector-based matching functions from §3.1 as an initial fast filtering step on C.
Our procedure, then, is to first score every s ∈ C according to the fast filter, then pass the top k candidates to the entailment-based matching function for final selection. The full procedure is outlined in Algorithm 1.
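The two-stage procedure can be sketched as below. This is a reading of the prose description, not a reproduction of Algorithm 1; the function names and the toy scorers are hypothetical stand-ins for the word-vector filter and the tree edit model.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def filter_then_rerank(
    query: str,
    sentences: List[str],
    fast_score: Scorer,
    slow_score: Scorer,
    k: int,
    n: int,
) -> List[str]:
    """Keep the top k by the cheap scorer, then return the top n by the costly one."""
    shortlist = sorted(sentences, key=lambda s: fast_score(query, s), reverse=True)[:k]
    reranked = sorted(shortlist, key=lambda s: slow_score(query, s), reverse=True)
    return reranked[:n]


slow_calls: List[str] = []


def shared_tokens(query: str, sentence: str) -> float:
    """Toy fast filter: number of shared whitespace tokens."""
    return float(len(set(query.split()) & set(sentence.split())))


def prefers_two_tokens(query: str, sentence: str) -> float:
    """Toy expensive model; records each call so the cost can be inspected."""
    slow_calls.append(sentence)
    return 1.0 if len(sentence.split()) == 2 else 0.0


candidates = ["a b c", "a b", "a", "x y"]
result = filter_then_rerank("a b c", candidates, shared_tokens, prefers_two_tokens, k=2, n=1)
print(result, len(slow_calls))
```

The expensive scorer runs only on the k shortlisted sentences, which is where the savings come from when the corpus is large.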
Figure 4: User study scores comparing top-25 sentences from just the word-vector-based averaging models (+none) and with reranking from the LR and LSTM tree edit models. Error bars represent standard error.
4.2 Entailment Model Evaluation
We want to evaluate two hypotheses: (1) that adding the tree edit models on top of the word-vector-based ones (as described above) yields better matches; and (2) that using the LSTM-based tree edit model provides improved performance over the LR-based model.
Our preliminary investigation found that the tree edit models offered no consistent benefit on the existing-corpora tasks (§3.2). This is unsurprising; the semantic relationships in those tasks are much broader than entailment. Here we focus entirely on the new, more realistic application in §3.3. We take the 250 top-scoring sentences from both word-vector-averaging models as “fast filter” output, and rerank them using the LR and LSTM tree edit models. As part of the user study in §3.3, we had users judge the top 25 sentences from the tree edit models’ reranked output.
4.3 Results
First, we find that the tree edit model offers some benefit to sentence quality compared to using only the word vector filters (i.e., the averaging models). This difference is significant with the paraphrastic filter but within the range of statistical chance with the word2vec-based filter.
We also find that the LSTM on tree edit sequences offers slightly better matches than logistic regression; again, this difference is significant with the paraphrastic-based filter but not the word2vec one. (In fact, the output from the word2vec filter with LR and LSTM tree edit models overlaps at about 85%.)
User feedback. To gauge interest in the utility of semantic matching systems, we also asked each user to answer an optional set of questions after providing judgements. (All users answered the questions.) We found that (i) 85% were interested in a way to measure ideas in news or other corpora, and (ii) half of the respondents were interested in a follow-up study evaluating semantic matches from idea sentences of their own choosing.
4.4 Follow-Up Study
Our follow-up study was executed similarly to the original one described above, but with proposition queries solicited from users themselves. Instead of randomly distributing sentences among the follow-up study participants, we gave each user who participated in the follow-up the output for their own proposition queries. There were 18 idea sentences and seven participants in this study. (The full list of idea sentences is provided in Table 6 in the appendix.) Each participant scored approximately 250 sentences, which were drawn from different parts of the output (as in the original study).
Results. We find that the follow-up study replicates the findings of the original study. The average scores for the top-ranked output (by the word-vector-averaging models, and reranked by the LR/LSTM models) are generally 0.1-0.2 lower than those in the original study. However, this decrease holds across different model variants, so the relative performance benefits of using paraphrastic word vectors in the averaging model, as well as using the tree edit LSTM model to rerank, still hold. We suspect that the decreased scores are partially a function of some of our users’ queries being less applicable to the NZ earthquakes (resulting in fewer possible matches), as the emergency managers’ expertise is not centered around that particular disaster.
4.5 Other Entailment Models
Because of the limited availability of expert users, we were unable to include a wider range of entailment models in the user study. It is natural to ask whether alternatives to the model in §4.1 would have led to better results. We perform a post-hoc evaluation using the candidate sentences scored by our study participants. We consider two recent high-performing models: the decomposable attention model (DAM; Parikh et al. 2016) and the enhanced sequential inference model (ESIM; Chen et al. 2017b).
To compare performance of these models in this domain, we take all candidate sentences from both the original and follow-up studies (paired with their proposition query) and mark them as “entailment” if users gave them a score of 4 or higher. We split off a set of query-candidate sentence pairs to be a development set; we use these to tune the above models during training (rather than the development sets of SNLI or MultiNLI).
We train these in the two-class setting (entailment vs. contradiction/neutral) on SNLI; we use existing public implementations for DAM and ESIM. We also train these and the LSTM version of the tree-edit model on MultiNLI (Williams et al., 2018), the more recent multi-domain version of SNLI.
Results. Table 3 summarizes the scores. The relatively low performance from all models, despite high performance on SNLI, indicates that this application is indeed challenging. We also find that training on MultiNLI instead of SNLI does not offer consistent improvement; that is, the multi-domain nature of that dataset does not seem to improve generalization to our data. This suggests that our application requires more than modeling sentential entailment.
Table 3: Post-hoc evaluation results (§4.5).
Figure 5: Example of semantic measurement: frequency (3 month intervals) of the idea “Dealing with authorities is causing stress and anxiety.”
In this section, we propose an application of obtaining semantic matches of ideas: measuring the frequency of an idea in a corpus across an independent variable (e.g., time). To demonstrate this, we return to the example query from Fig. 1: “Dealing with authorities is causing stress and anxiety.” We select this example because it is not easily expressed through n-grams, and its output was one of the most highly scored in our user study. We take the top 50 matched sentences from the paraphrastic vector + tree edit (LSTM) system, determine the publication dates of their source articles via metadata, and compute frequencies in bins of three months.
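The binning step can be sketched as follows. Aligning the three-month bins to calendar quarters is an assumption (the text does not say where bins start), and the publication dates below are invented for illustration.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple


def quarter_of(day: date) -> Tuple[int, int]:
    """Map a date to (year, quarter), with quarters numbered 1 through 4."""
    return (day.year, (day.month - 1) // 3 + 1)


def frequency_by_quarter(publication_dates: List[date]) -> Dict[Tuple[int, int], int]:
    """Count matched sentences per calendar quarter, in chronological order."""
    counts = Counter(quarter_of(day) for day in publication_dates)
    return dict(sorted(counts.items()))


# Hypothetical publication dates of articles containing matched sentences.
matched_dates = [
    date(2011, 2, 22),
    date(2011, 3, 1),
    date(2011, 7, 15),
    date(2013, 12, 31),
]
frequencies = frequency_by_quarter(matched_dates)
print(frequencies)
```

Quarters with no matches are absent from the result; a plotting step would need to fill them with zeros.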
Our system detects an upward trend in expressions of this idea. To our domain expert, this is an interesting yet explainable finding: in the short term after the earthquake, the focus is more on immediate response and relief. It takes time for frustration to set in among the population (e.g., due to dealing with bureaucracy and denied insurance claims). Furthermore, as recovery efforts stretch across years, the media may be more inclined to bring individual stories of continued distress to the forefront. Future work on semantic measurement could include tuning of hyperparameters (filter width k and output size n) and measurement calibration.
We discuss some findings from our applications of semantic matching and potential future work.
Desired matches. The granularity of the desired matches varies between the applications we presented. For example, in the framing case, codebook examples are often phrased very generally (e.g., “supporting immigrants is the moral thing to do”), and evocations of this idea may diverge too much to be detectable by current semantic matching models. As a consequence, we found that for the two tasks based on existing datasets (§3.2), the entailment-based models from §4.1 did not help performance (and sometimes hurt).
In contrast, the disaster recovery application demands more specific semantic matches; users were less sure about scoring sentences where the idea was only partially expressed. From both our user study and post-hoc evaluation with other models, we found that while entailment models offer small improvements over the word-vector-averaging baselines, our application requires more than detecting sentence-level entailment.
Entities. Particularly in the disaster recovery application, corpus-specific entities can be very important. Entities like government agencies and insurance companies may be central to queries of interest but lack appropriate distributed representations (sometimes even in the Google News word2vec case) or presence in the training corpus. (A frequent example in our earthquake news corpus is the Canterbury Earthquake Recovery Authority, often written as “Cera” and conflated with the actor Michael Cera.)
Context and coreference. Currently, we do not take multiple sentences into account at once when determining sentence matches. (The user study in §3.3 provided context in the survey for the users alone; SNLI deals with this issue by grounding both premise and hypothesis in a specific scenario from an image caption, which is an approach not available in our setting.) In some cases, this leads to the system finding a match at the sentence level when it would otherwise be invalid from context; in others, a potential match is spread across a sentence boundary. In future work, including larger and smaller passages (not only sentences) may be worthwhile, especially if coupled with more preprocessing (e.g., coreference resolution, entity linking).
The semantic matching applications in this paper are reminiscent of several lines of research in NLP.
Retrieval. As mentioned in §2, finding coarse semantic matches of a proposition in a corpus is closely related to past work in IR, particularly sentence retrieval (Balasubramanian et al., 2007). Other relevant work in IR includes passage retrieval, which is a component in many web-scale question answering systems (Tellex et al., 2003). The main difference is that, here, we seek more than a single answer to a question-query; we seek all matches to the query (which is a proposition). Our use of a fast filter followed by a more expensive model also resembles recent question answering work that applies machine reading to passages already retrieved (Chen et al., 2017a).
Entailment and related tasks. There is a long line of entailment tasks and corpora: among others, the Recognizing Textual Entailment challenges (RTE; beginning with Dagan et al., 2006); the Sentences Involving Compositional Knowledge dataset (SICK; Marelli et al., 2014); the SNLI and MultiNLI datasets used here (Bowman et al., 2015; Williams et al., 2018); and the SciTail dataset (Khot et al., 2018). The RTE-5 through RTE-7 shared tasks, starting with Bentivogli et al. (2009), contain a similar task to ours; however, these have a very different end goal (using entailment models to improve text summarization) and much smaller corpora (10 documents).
Other related NLP tasks which involve semantic comparisons between pairs of sentences include identifying paraphrase pairs (Dolan et al., 2004; Dolan and Brockett, 2005) and semantic textual similarity (STS, beginning with Agirre et al., 2012). Both paraphrase and STS differ from entailment (and our semantic matching applications) in that they require bidirectional equivalence; STS furthermore treats similarity on a graded scale rather than as a binary label.
Measurement or tracking of ideas. Tracking or measurement of ideas in corpora has often been considered in a more exploratory way, without a user-generated query. Such exploration has long been a motivation for topic models (e.g., Blei and Lafferty, 2006). For example, Prabhakaran et al. (2016) use topics and their rhetorical roles in scientific journal abstracts to understand when topics are in growth or decline. Other work has allowed user specification of a particular query, though usually as an n-gram, as by Michel et al. (2011), or using keywords or topics (Tan et al., 2017) or short meme phrases (Leskovec et al., 2009). We define matches at a more fine-grained proposition level.
We introduced and explored a new application of semantically matching a proposition against a corpus. Our findings show that this problem is different from our initial benchmarks based on convenient existing corpora, and from the textual entailment problem. Our study identified a potential user community and illustrated some factors that will be important in future work.
This work was supported by NSF #1541025. LHL was also supported in part by an NSF Graduate Research Fellowship. Many thanks to the domain experts who participated in the user studies; Ryan Georgi for the CNN/Daily Mail dataset suggestion; Dallas Card for help with the Media Frames Corpus; and members of the ARK and UW NLP for their comments on earlier drafts.
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A pilot on semantic textual similarity. In SemEval 2012.
Niranjan Balasubramanian, James Allan, and W. Bruce Croft. 2007. A comparison of sentence retrieval techniques. In SIGIR.
Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC.
David M. Blei and John D. Lafferty. 2006. Dynamic topic models. In ICML.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
Amber E. Boydstun, Dallas Card, Justin Gross, Philip Resnik, and Noah A. Smith. 2014. Tracking the development of media frames within and across policy issues. Technical report, CMU.
Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The Media Frames Corpus: Annotations of frames across issues. In ACL.
Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In ACL.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In ACL.
Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In ACL.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, Lecture Notes in Computer Science, pages 177–190. Springer.
Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In IWP.
Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In NAACL.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL.
Michael Heilman and Noah A. Smith. 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In NAACL.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NIPS.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL.
Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SCITAIL: A textual entailment dataset from science question answering. In AAAI.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.
Klaus Krippendorff. 2012. Content analysis: An introduction to its methodology.
Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In KDD.
Lucy H. Lin, Scott Miles, and Noah A. Smith. 2018. Natural language processing for analyzing disaster recovery trends expressed in large text corpora. In IEEE GHTC.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations).
Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.
Vinodkumar Prabhakaran, William L. Hamilton, Dan McFarland, and Dan Jurafsky. 2016. Predicting the rise and fall of scientific topics from trends in their rhetorical framing. In ACL.
Chenhao Tan, Dallas Card, and Noah A. Smith. 2017. Friendships, rivalries, and trysts: Characterizing relations between ideas in texts. In ACL.
Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative evaluation of passage retrieval algorithms for question answering. In SIGIR.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In ICLR.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.
A.1 Proposition Queries
In Table 4, we provide the thirty proposition queries and associated frames used in the Media Frames Corpus-based evaluation (3.2). Table 5 lists the twenty proposition queries used in the original user study (3.3), and Table 6 lists the proposition queries generated by some of our study respondents; these were used in the follow-up study (4.4).
Table 4: Proposition queries used in the Media Frames Corpus evaluation.
Table 5: Proposition queries used in the original user study (3.3). (“Cera” is short for the “Canterbury Earthquake Recovery Authority”, and “Scirt” is short for the “Stronger Christchurch Infrastructure Rebuild Team.”)
Table 6: Proposition queries used in the follow-up study (4.4).
A.2 Tree Edit Model
For reference, we provide the full list of tree edit operations (Table 7) and features used in the logistic regression version of the model (Table 8) described in 4.1 and Heilman and Smith (2010).
Table 7: Tree edit operations (Heilman and Smith, 2010).
Table 8: A description of the tree edit features for LR classification (Heilman and Smith, 2010).