Frame-semantic parsing (Gildea and Jurafsky, 2002) is the task of identifying the semantic frames evoked in text, along with their arguments, as formalized in the FrameNet project (Baker et al., 1998). An example sentence and its frame-semantic annotations are shown in Figure 1. Frame semantics has been shown to be useful in question answering (Shen and Lapata, 2007), text-to-scene generation (Coyne et al., 2012), dialog systems (Chen et al., 2013) and social-network extraction (Agarwal et al., 2014), among others.
found that filtering heuristics based on predicted dependencies bounded recall below 72.6%. Recent research has begun to question whether syntax is necessary for semantic analysis (Zhou and Xu, 2015; Swayamdipta et al., 2016; He et al., 2017; Peng et al., 2017), and here, we begin by posing the same question for frame semantics in particular.
Our first model is the first syntax-free frame-semantic argument identification system, and it achieves competitive results. We follow Kong et al. (2016) in combining bidirectional RNNs with a semi-Markov CRF, a model known as the segmental RNN (SegRNN), to segment and label the sentence relative to each frame. We introduce the softmax-margin SegRNN, a modification to the model that we use to encourage recall over precision. This approach to argument identification forgoes syntactic filtering and syntactic features completely.
In the rest of the paper, using this basic model as a foundation, we test whether incorporating syntax is still worthwhile. In the first setting, we add standard features derived from either a dependency parser or a phrase-structure parser to the softmax-margin SegRNN. We find that this syntactic pipelining approach improves over our syntax-free model and achieves state-of-the-art performance. This finding confirms that syntactic features are beneficial, even on top of expressive neural representation learning.
Figure 1: A FrameNet sentence with color-coded frame annotations below. Target words and phrases are highlighted in the sentence, and their lexical units are shown italicized below. Frames are shown in colored blocks, and frame element segments are shown horizontally alongside the frame.
In our second setting, we preserve most of the benefit of syntactic features without the accompanying computational cost of feature extraction or syntactic parsing. We incorporate the training objective of our syntax-free model into a multitask setting where the second task is unlabeled constituent identification (i.e., a separate binary decision for each span). This task is trained on the Penn Treebank, sharing the underlying sentence representation with the frame-semantic parser. This syntactic scaffold task offers useful guidance to the frame-semantic model, leading to performance on par with our models that use syntactic features. This approach also achieves state-of-the-art performance, despite not involving a syntactic parser during training or testing.
To summarize, our contributions are:
a. the softmax-margin SegRNN, a recall-oriented extension to segmental RNNs, for frame-semantic argument identification without any syntactic information;
b. the addition of syntactic information to the above, achieving state-of-the-art performance, using:
i. a pipelined approach, incorporating features from automatic dependency or phrase-structure parsers;
ii. a syntactic scaffolding approach, discarding the need for a syntactic parser altogether.
Our open-source implementation is available as open-SESAME (SEmi-markov Softmax-margin ArguMEnt parser) at https://github.com/Noahs-ARK/open-sesame/.
The Berkeley FrameNet project (Baker et al., 1998; Ruppenhofer et al., 2010) provides a lexicon of 1,020 semantic frames, a corpus of sentences annotated with frames from that inventory, and a corpus of annotated exemplar sentences (not used in this work).
Each frame represents a kind of event, situation, or relationship, and has a set of frame elements (semantic roles) associated with it (Fillmore, 1976). In a sentence, frames are evoked by targets, which are words or phrases. The FrameNet lexicon maintains a list of lexical units for each frame, which are lemma and part-of-speech pairs that can evoke that frame. For example, in Figure 1, the target drying up has dry up.v as its lexical unit, associated with the frame BECOMING DRY. Our main use of the FrameNet lexicon, following earlier work, is as a mapping between frames and the roles they might take.
Frame-semantic parsing is usually performed as a pipeline of tasks: target identification (which words or expressions evoke frames?), frame identification (which frame does each target evoke?), and then argument identification (for each frame f, and each of its possible roles in the FrameNet-defined role set, which span of text provides the argument?). Target identification conventionally relies on heuristics, and frame identification is usually treated as a classification problem (Das et al., 2014). The focus of this paper is argument identification, and we evaluate variations of our approach on both gold-standard frame input and on the output of state-of-the-art frame identification (FitzGerald et al., 2015).
2.1 Formal Notation
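A single input instance for argument identification consists of: an n-word sentence $\mathbf{x} = \langle x_1, x_2, \ldots, x_n \rangle$, its predicted part-of-speech tag sequence $\mathbf{p} = \langle p_1, p_2, \ldots, p_n \rangle$, a single target span $t = \langle t_{start}, t_{end} \rangle$, its lexical unit $\ell$, and its evoked frame $f$. For brevity we denote the input as $x = \{\mathbf{x}, \mathbf{p}, t, \ell, f\}$. Given this input x, the task is to produce a segmentation of the sentence: $s = \langle s_1, s_2, \ldots, s_m \rangle$. The $k$th segment $s_k = \langle i_k, j_k, y_k \rangle$ corresponds to a labeled span of the sentence, with start index $i_k$, end index $j_k$, and label $y_k \in \mathcal{Y}_f \cup \{\mathrm{NULL}\}$, where $\mathcal{Y}_f$ is the set of roles of frame $f$. The label $y_k$ is either the role that the span fills, or NULL if the segment does not fill any role. The segmentation is constrained so that argument spans cover the sentence and do not overlap ($i_1 = 1$; $j_m = n$; $i_{k+1} = j_k + 1$ for $1 \le k \le m - 1$). Segments of length $1 \le j_k - i_k + 1 \le n$ are allowed. A separate segmentation is produced for each frame evoked in a sentence.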
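The constraints on a segmentation can be made concrete with a small check. The following is a minimal sketch, not the authors' implementation; the `Segment` type, the function name, and the role names in the example are hypothetical, and `None` stands in for the NULL label.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    start: int            # 1-based index of the first token
    end: int              # 1-based index of the last token (inclusive)
    label: Optional[str]  # role name, or None for the NULL label


def is_valid_segmentation(segments: List[Segment], n: int) -> bool:
    """Check that the segments tile tokens 1..n with no gaps and no overlaps."""
    expected_start = 1
    for seg in segments:
        # each segment must begin right after the previous one ends
        if seg.start != expected_start or seg.end < seg.start:
            return False
        expected_start = seg.end + 1
    return expected_start == n + 1


# Hypothetical role names, for illustration only.
example = [
    Segment(1, 2, "Theme"),
    Segment(3, 3, None),
    Segment(4, 6, "Place"),
]
```

Here `is_valid_segmentation(example, 6)` holds, while the same list fails for a 7-word sentence because the last token is left uncovered.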
Our first model for argument identification is a segmental RNN (SegRNN; Kong et al., 2016), a variant of a semi-Markov conditional random field (Sarawagi et al., 2004) whose span representations are computed using bidirectional RNNs. Semi-Markov CRFs model a conditional distribution over labeled segmentations of an input sequence, which is precisely the set of outputs possible for a single frame’s argument identification task. They support exact inference using dynamic programming, which takes O(n^2) time in the sentence length n when every span is a candidate segment. This can be reduced to O(nb) by filtering out segments longer than b tokens (we use b = 20, which prunes less than 1% of gold arguments).
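To make the effect of the length cap concrete, the sketch below counts candidate spans with and without a maximum length b. The function name is hypothetical, and the count ignores the label set, which multiplies both figures by the same factor.

```python
from typing import Optional


def count_spans(n: int, b: Optional[int] = None) -> int:
    """Number of spans (i, j) with 1 <= i <= j <= n and length at most b."""
    if b is None:
        b = n  # no cap: every span is a candidate
    # a span starting at i can end at any of the next min(b, n - i + 1) tokens
    return sum(min(b, n - i + 1) for i in range(1, n + 1))
```

For a 30-word sentence, the uncapped count grows quadratically (n(n + 1)/2 spans), while the capped count is bounded by n times b.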
Semi-Markov models are more general than BIO tagging schemes, which have been used successfully in PropBank SRL (Collobert et al., 2011; Zhou and Xu, 2015, inter alia). The semi-Markov assumption allows scoring functions that directly model an entire variable-length segment (rather than fixed-length label n-grams), while retaining exact inference and a linear runtime. Relatedly, Täckström et al. (2015) introduced a dynamic program that allows direct modeling of variable-length segments as well as enforcing constraints such as certain roles being filled at most once. Its runtime is linear in the sentence length, but exponential in the number of roles. The semi-Markov CRF’s inference algorithm is a relaxed special case of their method, with fewer constraints and without the exponential runtime constant.
SegRNNs use continuous vector representations of spans. In past work, they have been applied to joint word segmentation and part-of-speech tagging for Chinese (Kong et al., 2016) and to speech recognition (Lu et al., 2016).
Given an input x, a SegRNN defines a conditional distribution p(s | x). Every segment $s_k$ is given a real-valued score $\phi(s_k, x)$, detailed in Section 3.3. The score of a segmentation s is the sum of the scores of its segments:
$$\rho(s, x) = \sum_{k=1}^{m} \phi(s_k, x) \tag{1}$$
These scores are exponentiated and normalized to define the probability distribution. The sum-product variant of the semi-Markov dynamic programming algorithm is used to calculate the normalization term (required during learning). At test time, the max-product variant returns the most probable segmentation,
$$\hat{s} = \arg\max_{s} \rho(s, x) \tag{2}$$
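The definition of the distribution can be spelled out by brute force on a very short sentence. This is only an illustrative sketch: it enumerates every labeled segmentation, which is exponential in sentence length, and `toy_score` is a hypothetical stand-in for the learned segment score.

```python
import math


def segmentations(n, labels):
    """Yield every labeled segmentation of tokens 1..n as a list of (i, j, y)."""
    if n == 0:
        yield []
        return
    # choose the last segment (i, n, y), then segment the prefix 1..i-1
    for i in range(1, n + 1):
        for y in labels:
            for prefix in segmentations(i - 1, labels):
                yield prefix + [(i, n, y)]


def toy_score(i, j, y):
    """Hypothetical segment score; a real model would compute this with a network."""
    return 0.5 * (j - i) if y == "ARG" else 0.0


def distribution(n, labels, score):
    """Return [(segmentation, probability)], normalizing exp(sum of segment scores)."""
    segs = list(segmentations(n, labels))
    totals = [sum(score(*seg) for seg in s) for s in segs]
    log_z = math.log(sum(math.exp(t) for t in totals))
    return [(s, math.exp(t - log_z)) for s, t in zip(segs, totals)]
```

With two labels and three tokens there are 18 labeled segmentations, and their probabilities sum to one. The dynamic program replaces this enumeration in practice.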
The parameters of SegRNNs are learned to maximize a criterion related to the conditional log-likelihood of the gold-standard segments in the training corpus (Section 3.4). The learner evaluates and adjusts segment scores for every span in the sentence, which in turn involves learning embedded representations for all spans (Section 3.1). Representations of the target, lexical unit, frame, and frame elements are also learned (Section 3.2). The entire model is illustrated in Fig. 2; we discuss the details from the bottom up.
3.1 Input Span Embeddings
Figure 2: Illustration of the model architecture for an example sentence and its frame-semantic parse. The input token embeddings are depicted in black, and the input frame and frame-element embeddings in purple. The token biLSTM hidden states are shown in green, and the span embedding hidden states in red. A final multi-layer perceptron connects all the different components of the model into a segment factor, shown in gray. This analysis contains three segments, one for each of the two frame elements and a third NULL segment.
We use two bidirectional long short-term memory networks (biLSTMs; Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997; Graves, 2012) over the input sentence to obtain continuous representations of each token and each candidate span (all subsequences of length at most b).
At each word position q, we give as input to the first (token) biLSTM an input vector $v_q = [d_q; e_q; o_q; q - t_{start}]$, where $d_q$ is a learned embedding of the word type, $e_q$ is a fixed pre-trained embedding of the word type, $o_q$ is a learned embedding of the part-of-speech tag, and $q - t_{start}$ is the distance of the word from the beginning of the target. This yields a hidden state vector $h^{tok}_q$, which is a contextualized representation of the token:
$$h^{tok}_q = \mathrm{biLSTM}^{tok}(v_1, \ldots, v_n)_q \tag{3}$$
The second biLSTM embeds every candidate span, using the hidden representations from the token biLSTM as input:
$$h^{span}_{i:j} = \mathrm{biLSTM}^{span}(h^{tok}_i, \ldots, h^{tok}_j) \tag{4}$$
These span embeddings are calculated efficiently, sharing computation where possible; for details, see Kong et al. (2016).
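One simple way to share computation across spans is to run a recurrence once from each start position and record the state after every step, so that the state for span (i, j) is obtained by extending span (i, j-1) by one token. The sketch below shows only the forward direction with a generic `step` function. It is an assumption-laden illustration, not the scheme of Kong et al. (2016), and the toy scalar recurrence stands in for an LSTM cell.

```python
def forward_span_states(token_vecs, step, init, b):
    """Return ({(i, j): state}, number of step calls) for spans of length <= b.

    Indices are 0-based and inclusive. `step(state, x)` is any recurrent update.
    """
    states = {}
    calls = 0
    n = len(token_vecs)
    for i in range(n):
        state = init
        for j in range(i, min(i + b, n)):
            # one recurrent step extends span (i, j-1) to span (i, j)
            state = step(state, token_vecs[j])
            calls += 1
            states[(i, j)] = state
    return states, calls


def toy_step(state, x):
    """Hypothetical scalar recurrence standing in for an LSTM cell."""
    return 0.5 * state + x
```

Each candidate span costs exactly one recurrent step, so the total work is proportional to the number of candidate spans rather than to the sum of their lengths.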
3.2 Input Target and Frame Embeddings
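In addition to the token and span embeddings above, we learn an embedding $v_f$ for each frame $f$, and an embedding $v_\ell$ for each lexical unit $\ell$. To represent the target in context, we use the token biLSTM hidden states $h^{tok}$ in the target span t, as well as the neighboring token on each side, as an input to a forward LSTM:
$$v_t = \mathrm{LSTM}^{tgt}(h^{tok}_{t_{start}-1}, \ldots, h^{tok}_{t_{end}+1}) \tag{5}$$
The above are concatenated to form a representation of the target and frame, $v_{f,\ell,t} = [v_f; v_\ell; v_t]$, which is used in representing segments (Section 3.3).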
3.3 Segment Scores
The score of a segment should capture the interaction between the span, the frame element label, the target, the lexical unit, and the frame. We form a vector for a segment $s = \langle i, j, y \rangle$ by concatenating the span embedding $h^{span}_{i:j}$, a learned embedding $v_y$ for frame element $y$, the target and frame representation $v_{f,\ell,t}$, and two additional one-hot features (denoted $u$): the binned length of the span, and the span’s position relative to the target (before, after, overlapping, or within):
$$v_s = [h^{span}_{i:j}; v_y; v_{f,\ell,t}; u] \tag{6}$$
Then the representation is passed through a rectified linear unit (Nair and Hinton, 2010) to get the segment score:
$$\phi(s, x) = w_2 \cdot \mathrm{ReLU}(W_1 v_s) \tag{7}$$
where the matrix $W_1$ and the vector $w_2$ are parameters. Note that NULL-labeled spans are handled in the same way as other labeled spans.
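The two one-hot features can be sketched as follows. The bin boundaries in `LENGTH_BINS` are hypothetical, since the text does not give them, and the reading of "within" as "the span lies inside the target" is likewise an assumption.

```python
LENGTH_BINS = (1, 2, 3, 5, 10, 20)  # hypothetical bin upper bounds, not from the paper
POSITIONS = ("before", "after", "overlapping", "within")


def one_hot(index, size):
    return [1 if k == index else 0 for k in range(size)]


def length_bin(length):
    """Index of the first bin whose upper bound covers the span length."""
    for idx, upper in enumerate(LENGTH_BINS):
        if length <= upper:
            return idx
    return len(LENGTH_BINS) - 1  # clamp anything longer into the last bin


def relative_position(i, j, t_start, t_end):
    """Position of span (i, j) relative to the target span (inclusive indices)."""
    if j < t_start:
        return "before"
    if i > t_end:
        return "after"
    if t_start <= i and j <= t_end:
        return "within"  # assumption: the span lies inside the target
    return "overlapping"


def span_features(i, j, t_start, t_end):
    """Concatenate the binned-length and relative-position one-hot vectors."""
    u_len = one_hot(length_bin(j - i + 1), len(LENGTH_BINS))
    u_pos = one_hot(POSITIONS.index(relative_position(i, j, t_start, t_end)), len(POSITIONS))
    return u_len + u_pos
```

The resulting vector would be concatenated with the learned embeddings before the scoring layer.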
3.4 Softmax-Margin Segmental RNNs
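Most spans are not arguments; we therefore train to maximize a criterion that biases the model to favor recall (Mohit et al., 2012). Known as the softmax-margin (Gimpel and Smith, 2010), this criterion alters the partition function with a cost function that more strongly penalizes false negatives:
$$\log p(s^* \mid x) = \log \frac{\exp \rho(s^*, x)}{Z} \tag{8}$$
$$Z = \sum_{s} \exp\{\rho(s, x) + \mathrm{cost}(s, s^*)\} \tag{9}$$
$$\mathrm{cost}(s, s^*) = \alpha\, \mathrm{FN}(s, s^*) + \mathrm{FP}(s, s^*) \tag{10}$$
where $s^*$ is the gold segmentation, $\rho(s, x)$ is the score of segmentation $s$, FN counts false negatives, FP counts false positives, and $\alpha$ is a hyperparameter tuned on the development set. In order to keep inference tractable, the cost function needs to factorize by predicted span (so $\mathrm{cost}(s, s^*) = \sum_{k} \mathrm{cost}(s_k, s^*)$). But a false negative is not a property of an individual predicted span. We get around this by noting that a predicted span forces a false negative if it partially overlaps with a gold span. To avoid double counting, as multiple predicted spans may overlap with a gold span, we assign blame only to the span that contains the first token ($i^*$) of the missing gold span:
$$\mathrm{cost}(s_k, s^*) = \mathbb{1}\{y_k \neq \mathrm{NULL} \wedge s_k \notin s^*\} + \alpha \sum_{\langle i^*, j^*, y^* \rangle \in s^*,\; y^* \neq \mathrm{NULL}} \mathbb{1}\{i_k \le i^* \le j_k \wedge s_k \neq \langle i^*, j^*, y^* \rangle\} \tag{11}$$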
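The per-span cost can be sketched in a few lines. This is one reading of the description above rather than the authors' code: a predicted non-NULL span that is not in the gold segmentation counts as a false positive, and a missed gold argument is charged to the predicted span containing its first token. Function names are hypothetical, and `None` stands in for NULL.

```python
def span_cost(pred, gold, alpha):
    """Cost of one predicted segment (i, j, y) against the gold segmentation."""
    i, j, y = pred
    gold_args = [g for g in gold if g[2] is not None]
    # false positive: a predicted argument that is not a gold argument
    fp = 1 if (y is not None and pred not in gold_args) else 0
    # false negatives blamed on this span: gold arguments whose first token
    # falls inside the span but which the span does not reproduce exactly
    fn = sum(1 for g in gold_args if i <= g[0] <= j and pred != g)
    return fp + alpha * fn


def segmentation_cost(pred_segs, gold, alpha):
    """Total cost factorizes as a sum over predicted segments."""
    return sum(span_cost(p, gold, alpha) for p in pred_segs)
```

Because each gold argument has exactly one first token and the predicted segments tile the sentence, every missed argument is counted once.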
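The softmax-margin criterion, like log-likelihood, normalizes globally over all of the exponentially many possible labeled segmentations. The following zeroth-order semi-Markov dynamic program (Sarawagi et al., 2004) efficiently computes Z:
$$z_j = \sum_{s = \langle i, j, y \rangle,\; j - i + 1 \le b} z_{i-1} \exp\{\phi(s, x) + \mathrm{cost}(s, s^*)\} \tag{12}$$
where $Z = z_n$, under the base case $z_0 = 1$, and $\phi(s, x)$ is the score of segment $s$. The cost function is easily incorporated because it factors in the same way as the segmentation score $\rho$.
The model’s prediction can be calculated using a similar dynamic programming algorithm with the following recurrence (and the usual “arg max” bookkeeping):
$$z'_j = \max_{s = \langle i, j, y \rangle,\; j - i + 1 \le b} z'_{i-1} \exp \phi(s, x) \tag{13}$$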
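Both recurrences can be sketched in log space for numerical stability. This is a minimal illustration under stated assumptions: `score(i, j, y)` is any callable returning a segment score (during training the per-span cost would be added to it), indices are 1-based, and the function names are hypothetical.

```python
import math


def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def log_partition(n, labels, score, b):
    """Sum-product recurrence: log z_j over segments (i, j, y) of length <= b."""
    log_z = [0.0] + [float("-inf")] * n  # base case: log z_0 = log 1 = 0
    for j in range(1, n + 1):
        terms = [log_z[i - 1] + score(i, j, y)
                 for i in range(max(1, j - b + 1), j + 1)
                 for y in labels]
        log_z[j] = logsumexp(terms)
    return log_z[n]


def viterbi(n, labels, score, b):
    """Max-product recurrence with back-pointers; returns (best score, segments)."""
    best = [0.0] + [float("-inf")] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(1, j - b + 1), j + 1):
            for y in labels:
                cand = best[i - 1] + score(i, j, y)
                if cand > best[j]:
                    best[j] = cand
                    back[j] = (i, j, y)
    segs = []
    j = n
    while j > 0:  # follow back-pointers from the end of the sentence
        seg = back[j]
        segs.append(seg)
        j = seg[0] - 1
    return best[n], segs[::-1]


def toy_score(i, j, y):
    """Hypothetical score that rewards labeling tokens 2..3 as an argument."""
    return 1.0 if (y == "ARG" and (i, j) == (2, 3)) else 0.0
```

Both functions visit each candidate segment once, so their cost is proportional to n times b times the number of labels.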
Our model formulation enforces only the no-overlap constraint; we expect it to learn other SRL constraints from the data. We optimize using ADAM (Kingma and Ba, 2014). Models are trained using a single thread on an NVIDIA Tesla K40 GPU. Convergence requires 15 epochs. Hyperparameters are tuned based on performance on the held-out development set, with further implementation details given in Section 6.2.
Syntax has been important in many past models for semantic argument prediction (Punyakanok et al., 2008; Toutanova et al., 2008; Johansson and Nugues, 2008; FitzGerald et al., 2015, inter alia).
Table 1: Phrase-structure syntactic features.
The SegRNN-based model in Section 3 is syntax-free, but it is straightforward to incorporate features from a syntactic parse of the sentence as additional input to the model. We note that the computational cost of syntactic parsing, especially phrase-structure parsing, is significant, but it is important to test whether the SegRNN model can benefit from syntax.
Phrase-structure features. We apply a state-of-the-art phrase-structure parser (RNNG; Dyer et al., 2016), extracting the features in Table 1 for each span. We then concatenate these features to the span representation (Equation 6) in our span scoring model. We found that 84% of gold-standard argument spans correspond to predicted constituents.
Dependency features. We apply a state-of-the-art dependency parser (SyntaxNet; Andor et al., 2016), extracting the features in Table 2 from its output. The two word-level features are concatenated to the word vectors before they are fed into the token biLSTM. The three span-level features are concatenated to the span representation (Equation 6). The out# features capture information similar to that used in prior work’s span-filtering heuristics. The out#=1 feature, for example, is an approximation of the is phrase constituent feature. Since word-level representations include dependency head and label information, the path lstm has access to that information for each dependency along the path as well.
Finally, note that these two variants of our model do not filter candidate spans, as done in past work. Instead, we allow the learner to flexibly consider syntactic features when scoring spans, potentially leading to a kind of “soft” filter.
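As an illustration of the out# idea, the sketch below counts the tokens in a span whose dependency head lies outside the span. Treating out# as exactly this count is an assumption, since the feature table is not reproduced here; the function name and the example parse are hypothetical.

```python
def out_count(heads, i, j):
    """Count tokens in span i..j (0-based, inclusive) whose head is outside the span.

    heads[k] is the index of the head of token k, or -1 for the root.
    """
    return sum(1 for k in range(i, j + 1) if not (i <= heads[k] <= j))


# Hypothetical parse of "The cat sat on the mat":
# The -> cat, cat -> sat, sat -> ROOT, on -> mat, the -> mat, mat -> sat
example_heads = [1, 2, -1, 5, 5, 2]
```

A span with a count of one, such as "on the mat", behaves like a single subtree attached to the rest of the sentence, which is why a value of one roughly signals a constituent. The span "cat sat on" has a count of two.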
Table 2: Dependency syntax features.
A wide range of recent results in NLP have shown the benefit of multitask representation learning (Caruana, 1997), where the same embeddings of words are learned to minimize the loss functions of multiple tasks (Luong et al., 2015; Chen et al., 2017; Peng et al., 2017). Here we consider a “scaffold” task (one we use only during learning, and whose output we are not especially interested in) in a multitask setup with our basic model (Section 3). The second task in our setup is learning to predict syntactic constituent spans. Since frame-semantic arguments are often also constituents, we hypothesize that learning which text spans are constituents might help us learn which spans could be arguments. Toward this end we use additional annotations from a separate training corpus that does not significantly overlap the FrameNet corpus: the Penn Treebank (PTB). Our multi-task learning setting allows us to learn span embeddings that are shared between our frame-semantic parser and a model for predicting syntactic constituents.
Formally, we consider a segment $s = \langle i, j, r \rangle$, where $r \in \{0, 1\}$ is a label with a corpus-specific definition. Spans in PTB starting at position i and ending at position j which are gold constituents get $r = 1$, and others get $r = 0$. Similarly, for FrameNet, we assign $r = 1$ for every span in a sentence which has been annotated as a frame element for any frame, and $r = 0$ otherwise.
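Analogous to the scoring function under our basic model (Section 3.3), we define a new scoring function, $\phi'$, for every segment $s = \langle i, j, r \rangle$:
$$\phi'(s, x) = w_4 \cdot \mathrm{ReLU}(W_3 [v_r; h^{span}_{i:j}])$$
where $W_3$ and $w_4$ are parameters of the model, $v_r$ is a learned embedding for the label r, and $h^{span}_{i:j}$ is the span embedding, reused from our basic model (Section 3.1).
Our scaffold loss function is essentially a binary logistic regression loss for each text span:
$$\mathrm{loss}_{scaffold} = -\sum_{1 \le i \le j \le n,\; j - i + 1 \le b} \log \frac{\exp \phi'(\langle i, j, r^*_{ij} \rangle, x)}{\sum_{r \in \{0, 1\}} \exp \phi'(\langle i, j, r \rangle, x)}$$
where $r^*_{ij}$ is the gold label of the span from i to j.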
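The joint multi-task loss for a single sentence is:
$$\mathrm{loss} = \mathrm{loss}_{primary} + \delta \cdot \mathrm{loss}_{scaffold}$$
where $\mathrm{loss}_{primary}$ is the softmax-margin training loss of the frame-semantic model and $\delta$ is a hyperparameter used to deemphasize the scaffold task, tuned on the development set. The first term does not apply to sentences in PTB, since we do not have frame-semantic annotations for them.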
At prediction time (including test time), no syntactic prediction is necessary; the scaffold is removed.
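The scaffold objective can be sketched as a per-span two-way softmax combined with a weighted sum. This is an illustrative sketch only: `scorer(i, j, r)` is a hypothetical stand-in for the learned scaffold scoring function, the weight is passed in as `delta`, and a primary loss of `None` marks a sentence without frame-semantic annotation, such as a PTB sentence.

```python
import math


def span_logistic_loss(score_pos, score_neg, r):
    """Negative log-probability of gold label r under a 2-way softmax."""
    m = max(score_pos, score_neg)
    log_norm = m + math.log(math.exp(score_pos - m) + math.exp(score_neg - m))
    gold = score_pos if r == 1 else score_neg
    return log_norm - gold


def scaffold_loss(n, positive_spans, scorer, b):
    """Sum the binary loss over all spans of length <= b (1-based, inclusive)."""
    total = 0.0
    for i in range(1, n + 1):
        for j in range(i, min(n, i + b - 1) + 1):
            r = 1 if (i, j) in positive_spans else 0
            total += span_logistic_loss(scorer(i, j, 1), scorer(i, j, 0), r)
    return total


def joint_loss(primary_loss, scaffold, delta):
    """Weighted sum; the primary term is skipped when it is None."""
    primary = 0.0 if primary_loss is None else primary_loss
    return primary + delta * scaffold
```

With an uninformative scorer that returns zero everywhere, every span contributes log 2 to the scaffold loss, which is a quick sanity check on the bookkeeping.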
In this section, we provide details of the dataset and experimental setup for all four models: the basic SegRNN, the phrase-structure and dependency syntactic feature additions, and the syntactic scaffold.
6.1 Data
Our dataset contains sentences from the full-text portion of FrameNet release 1.5 (September 2010). We use the same test set as Das and Smith (2011) to facilitate comparison with related work. We chose eight additional files at random as a held-out development set; the rest of the files in the full-text data are used for training. The FrameNet full-text data occasionally contains multiple annotations for the same target. We use only the first annotation for such examples, following FitzGerald et al. (2015).
Figure 3: Development-set F1 with log-loss (no cost) vs. recall-oriented cost.
We use SyntaxNet (Andor et al., 2016) for predicted part-of-speech tags and Universal Dependencies, from a released pretrained model. For phrase-structure parses, we use the RNNG parser (Dyer et al., 2016), trained on WSJ. We stochastically (with probability 0.1) replace words that only appear once in the training data with an UNK token to acquire estimates for out-of-vocabulary words at test time.
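The stochastic UNK replacement can be sketched as below. The function name is hypothetical, and whether replacement is sampled once or afresh on every pass over the data is not stated in the text, so the single-pass sampling here is an assumption.

```python
import random
from collections import Counter

UNK = "UNK"


def replace_singletons(sentences, p=0.1, seed=0):
    """Replace words seen exactly once in `sentences` with UNK, each with probability p."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    counts = Counter(w for sent in sentences for w in sent)
    return [
        [UNK if counts[w] == 1 and rng.random() < p else w for w in sent]
        for sent in sentences
    ]
```

Words that occur more than once are never replaced, so the UNK embedding is trained only on contexts of rare words.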
For the syntactic scaffold, we used all 49,208 sentences from WSJ 00–24 of Penn Treebank.
6.2 Hyperparameters
We used single-layer LSTMs for sentence encoding, spans, targets, dependency and nonterminal paths, each with a hidden state of size 64. Pretrained GloVe (Pennington et al., 2014) vectors of dimension 100 are used, trained on a corpus of 6 billion words; we do not update these during training. Learned embeddings of size 60, 4, 100, 64, 50, 8 and 16 are used for words, POS tags, frames, lexical units, frame-elements, dependency labels and nonterminals, respectively. For ADAM we set the initial learning rate to 0.0005, the moving average parameter to 0.01, the moving average variance to 0.9999, and the parameter (to prevent numerical instability) to
; no learning rate decay is used. To prevent “exploding” gradients, we clip the 2-norm of the gradient (Graves, 2013) to 5 before each gradient update. These values were selected based on intuition and prior work; a more careful tuning of the above hyperparameters could be expected to improve performance.
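Clipping the 2-norm of the gradient can be sketched as follows, treating all gradient entries as one flat list. The function name is hypothetical, and a real implementation would operate on the framework's parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale the gradient so that its 2-norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already within the threshold: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]
```

The direction of the update is preserved; only its magnitude is capped.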
The remaining hyperparameters were chosen based on their performance on the held-out development set. We selected the dropout rate (Srivastava et al., 2014) of 0.05 from the set {0.01, 0.05, 0.1}. We selected the recall-oriented cost $\alpha$ from the set {1, 2, 5, 10}. We selected the scaffold weight $\delta$ from the set {0.17, 0.34, 0.89}.
Our experiments were run using the DyNet library (Neubig et al., 2017).
6.3 Self-Ensembling
To compensate for the variance resulting from different initializations, we use a self-ensembling approach. We train five models, differing only in their random initialization, and ensemble their local scores at test time. Specifically, we calculate the sum of the segment scores under each model to get the final ensembled segment score, which is then plugged into Eq. 13 to decode.
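Ensembling at the level of segment scores can be sketched as a function that wraps several scorers. The names are hypothetical; each element of `models` is assumed to be a callable `(i, j, y) -> float`, and the returned scorer can be handed to any semi-Markov decoder that accepts such a callable.

```python
def ensemble_score(models):
    """Return a scorer whose segment score is the sum of the member models' scores."""
    def score(i, j, y):
        return sum(m(i, j, y) for m in models)
    return score
```

Because the sum is taken per segment before decoding, the ensemble produces a single consistent segmentation instead of voting over the members' separate outputs.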
6.4 Evaluation
All systems are evaluated for precision, recall, and F1, micro-averaged across test examples, following standard practice. We use the standard script provided by SemEval 2007 (Baker et al., 2007), with a single modification provided by Kshirsagar et al. (2015) to optionally ignore the frame identification output. This allows us to evaluate for argument identification in isolation, which is the primary focus of this paper; for this setting we use gold frames (without rewarding them in evaluation). We also evaluate with predicted frames to illustrate our effect on end-to-end parsing performance; for this, we use the same predicted frames as FitzGerald et al. (2015), who retrained the frame identification model from Hermann et al. (2014) but with an updated dependency parser.
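Micro-averaging pools the counts over all test examples before computing the ratios. The sketch below is a simplified illustration and not the SemEval 2007 script, which has its own matching rules; here each example is assumed to be a pair of sets of predicted and gold items, for instance (start, end, role) triples.

```python
def micro_prf(examples):
    """Micro-averaged precision, recall and F1 over (predicted_set, gold_set) pairs."""
    tp = fp = fn = 0
    for pred, gold in examples:
        tp += len(pred & gold)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Examples with many arguments therefore weigh more than examples with few, in contrast to macro-averaging.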
6.5 Baselines
SEMAFOR (Das et al., 2014) is a widely used system that identifies frame-semantic arguments using a linear model with hand-engineered features based on dependency parses. SEMAFOR also prunes out argument spans using syntactic heuristics and uses beam search, or optionally AD³, to decode while respecting constraints (Das et al., 2012). Kshirsagar et al. (2015) extended SEMAFOR through the use of exemplar FrameNet annotations, guide features from PropBank, and the FrameNet hierarchy.
Table 3: Parsing results on the FrameNet 1.5 test set. The first three columns evaluate performance of argument identification only using gold frames. The last three columns are a combined evaluation of frame identification and argument identification together, using predicted frames from FitzGerald et al. (2015). These systems use additional semantic resources during training, a technique orthogonal to those presented in this paper.
FitzGerald et al. (2015) proposed a multi-task learning approach for frame-semantic parsing and PropBank SRL, using a feed-forward neural network to score candidate arguments. The input to the neural network is a set of hand-engineered features extracted from a dependency parse.
In a separate line of work, Framat (Roth and Lapata, 2015) adds features based on context and discourse to improve an SRL system (Björkelund et al., 2010) adapted for frame-semantics, using a global model with reranking. Roth and Lapata (2016) and Roth (2017) extend this model by learning embeddings for dependency paths between targets and their arguments. We borrow their use of path embeddings as syntactic features, but we explicitly model argument spans using SegRNNs, without any reranking. More importantly, our scaffolding model does not rely on a syntactic parser.
6.6 Results
Table 3 shows the performance of five published baseline systems, along with our four new models (with and without ensembling). Surprisingly, the basic model (Open-SESAME) outperforms the original SEMAFOR parser (Das et al., 2014), and matches an improved version (Kshirsagar et al., 2015), without using any cues from syntax. We attribute this improvement to representation learning.
Perhaps less surprisingly, all three syntactic additions improve the performance of our basic model. The performances of phrase-structure features and dependency features are comparable. The syntactic scaffold performance is not far behind, and in fact beats the dependency features model after ensembling.
Self-ensembling markedly improves performance across all models, presumably because it reduces variance due to the non-convexity of the learning objective. Our best ensembled performance is tied with the state-of-the-art system by FitzGerald et al. (2015), which uses external semantic resources such as PropBank in addition to extensive syntactic features and self-ensembling. Our best scaffolding model is within 0.2% absolute of state-of-the-art, without any use of a syntactic parser or external semantic resources.
We also tested our recall-oriented softmax-margin loss (Eq. 8) against a plain log loss. Softmax-margin proved a good choice consistently across all our models (see Fig. 3).
Joint syntax and semantics. There has been much research on jointly modeling syntax and semantic roles, mostly on PropBank dependencies. Other work has used parallel annotations to jointly model syntax and semantics, as in the CoNLL 2008–9 shared tasks (Surdeanu et al., 2008; Hajič et al., 2009). Such methods are able to more directly model the connection between syntax and semantics, but they require syntactic and semantic annotations over the same corpora (Johansson and Nugues, 2008; Lluís and Màrquez, 2008; Titov et al., 2009; Roth and Woodsend, 2014; Lewis et al., 2015; Swayamdipta et al., 2016, inter alia). Our work also learns syntactic representations, but needs only partial syntax annotations (bracketing), and not necessarily on the same data as that annotated with frame semantics.
Latent syntax. Some work has treated syntax as a latent variable and marginalized it out (Naradowsky et al., 2012; Gormley et al., 2014), an approach that requires no syntactic supervision, even during training, making it especially suitable for low-resource settings. Collobert et al. (2011) use only syntactic boundaries for semantic role labeling; Zhou and Xu (2015) extend their approach by forgoing all syntax and using very deep neural nets. Kim et al. (2017) and Parikh et al. (2016) use different kinds of attention mechanism (Bahdanau et al., 2014) to learn representations of sentences for natural language inference. To our knowledge, our basic model is the first non-syntactic frame-semantic parser.
We presented a softmax-margin semi-Markov model that uses representation learning to predict frame-semantic arguments. Our basic model achieves strong results without using any syntax. We add syntax through a traditional pipeline as well as a multi-task learning approach which uses a syntactic scaffold only at training time. Both approaches improve over the baseline, and achieve state-of-the-art performance, showing that syntax continues to be beneficial in frame-semantic parsing. We conclude that scaffolding is a cheaper alternative to syntactic features since it does not require syntactic parsing at train or at test time. Applying this technique to other tasks which rely on pipelining syntactic parsing is a promising avenue for future work. Our parser is open-source and available at: https://github.com/Noahs-ARK/open-sesame/.
We thank Adhiguna Kuncoro for help with RNNG, and Dipanjan Das and Michael Roth for providing their output and for their helpful communication. We also thank Hao Peng, George Mulcaire, Trang Tran, Kenton Lee, Luke Zettlemoyer, and ARK members for valuable feedback. This work was supported by DARPA grant FA8750-12-2-0342 funded under the DEFT program.
Apoorv Agarwal, Sriramkumar Balasubramanian, Anup Kotalwar, Jiehan Zheng, and Owen Rambow. 2014. Frame semantic tree kernels for social network extraction from text. In Proc. of EACL.
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. ArXiv:1603.06042. http://arxiv.org/abs/1603.06042.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473.
Collin Baker, Michael Ellsworth, and Katrin Erk. 2007. SemEval’07 Task 19: Frame semantic structure extraction. In Proc. of SemEval.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proc. of ACL.
Anders Björkelund, Bernd Bohnet, Love Hafdell, and Pierre Nugues. 2010. A high-performance syntactic and semantic dependency parser. In Proc. of COLING.
Rich Caruana. 1997. Multitask learning. Machine Learning 28(1).
Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-criteria learning for Chinese word segmentation. ArXiv:1704.07556.
Yun-Nung Chen, William Yang Wang, and Alexander I Rudnicky. 2013. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proc. of ASRU-IEEE.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.
Bob Coyne, Alex Klapheke, Masoud Rouhizadeh, Richard Sproat, and Daniel Bauer. 2012. Annotation tools and knowledge representation for a text-to-scene system. In Proc. of COLING.
Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics 40(1):9–56.
Dipanjan Das, André F. T. Martins, and Noah A. Smith. 2012. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In Proc. of *SEM. http://dl.acm.org/citation.cfm?id=2387636.2387671.
Dipanjan Das and Noah A Smith. 2011. Semisupervised frame-semantic parsing for unknown predicates. In Proc. of ACL.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. ArXiv:1602.07776.
Charles J. Fillmore. 1976. Frame semantics and the nature of language. Annals of the New York Academy of Sciences 280(1):20–32.
Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proc. of EMNLP.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics 28(3):245–288.
Kevin Gimpel and Noah A. Smith. 2010. Softmax-margin CRFs: Training log-linear models with cost functions. In Proc. of NAACL.
Matthew R. Gormley, Margaret Mitchell, Benjamin Van Durme, and Mark Dredze. 2014. Low-resource semantic role labeling. In Proc. of ACL. http://www.aclweb.org/anthology/P14-1111.
Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of Studies in Computational Intelligence. Springer.
Alex Graves. 2013. Generating sequences with recurrent neural networks. ArXiv:1308.0850.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proc. of CoNLL.
Silvana Hartmann, Ilia Kuznetsov, Teresa Martin, and Iryna Gurevych. 2017. Out-of-domain FrameNet semantic role labeling. In Proc. of EACL.
Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what’s next. In Proc. of ACL.
Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proc. of ACL.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Richard Johansson and Pierre Nugues. 2008. Dependency-based syntactic-semantic analysis with PropBank and NomBank. In Proc. of CoNLL.
Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. 2017. Structured attention networks. ArXiv:1702.00887.
Diederik P. Kingma and Jimmy Ba. 2014. ADAM: A method for stochastic optimization. ArXiv:1412.6980. http://arxiv.org/abs/1412.6980.
Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Segmental Recurrent Neural Networks. In Proc. of ICLR. http://arxiv.org/abs/1511.06018.
Meghana Kshirsagar, Sam Thomson, Nathan Schneider, Jaime Carbonell, Noah A. Smith, and Chris Dyer. 2015. Frame-semantic role labeling with heterogeneous annotations. In Proc. of NAACL.
Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG parsing and semantic role labelling. In Proc. of EMNLP.
Xavier Lluís and Lluís Màrquez. 2008. A joint model for parsing syntactic and semantic dependencies. In Proc. of CoNLL.
Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, and Steve Renals. 2016. Segmental recurrent neural networks for end-to-end speech recognition. ArXiv:1603.00223.
Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. ArXiv:1511.06114. http://arxiv.org/abs/1511.06114.
Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith. 2012. Recall-oriented learning of named entities in Arabic Wikipedia. In Proc. of EACL. http://dl.acm.org/citation.cfm?id=2380816.2380839.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proc. of ICML.
Jason Naradowsky, Sebastian Riedel, and David A. Smith. 2012. Improving NLP through marginalization of hidden syntactic structure. In Proc. of EMNLP. http://dl.acm.org/citation.cfm?id=2390948.2391035.
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv:1701.03980.
Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv:1606.01933.
Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In Proc. of ACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP. http://www.aclweb.org/anthology/D14-1162.
Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2):257–287.
Michael Roth. 2017. Improving frame semantic parsing via dependency path embeddings. https://goo.gl/3Exmip.
Michael Roth and Mirella Lapata. 2015. Context-aware frame-semantic role labeling. Transactions of the ACL 3:449–460.
Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. arXiv:1605.07515.
Michael Roth and Kristian Woodsend. 2014. Composition of word representations improves semantic role labelling. In Proc. of EMNLP.
Josef Ruppenhofer, Michael Ellsworth, Miriam RL Petruck, Christopher R Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended theory and practice.
Sunita Sarawagi, William W. Cohen, et al. 2004. Semi-Markov conditional random fields for information extraction. In Proc. of NIPS, volume 17.
Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.
Dan Shen and Mirella Lapata. 2007. Using semantic roles to improve question answering. In Proc. of EMNLP-CoNLL.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1).
Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proc. of CoNLL.
Swabha Swayamdipta, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Greedy, joint syntactic-semantic parsing with Stack LSTMs. In Proc. of CoNLL.
Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. Transactions of the ACL 3:29–41.
Ivan Titov, James Henderson, Paola Merlo, and Gabriele Musillo. 2009. Online graph planarisation for synchronous parsing of semantic and syntactic dependencies. In Proc. of IJCAI.
Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2008. A global joint model for semantic role labeling. Computational Linguistics 34(2):161–191.
Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. Wsabie: Scaling up to large vocabulary image annotation. In Proc. of IJCAI. http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-460.
David Wood, Jerome S. Bruner, and Gail Ross. 1976. The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry 17(2).
Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proc. of ACL.
To complement our argument identification system, we also release a syntax-free frame identification system, which we describe here. We treat frame identification as an independent multiclass classification task for each target. Given a sentence, a target, and its corresponding lexical unit, the task is to identify the frame the target evokes. Hermann et al. (2014) and FitzGerald et al. (2015) use gold targets and lexical units; for a fair comparison we do the same.
A.1 Model
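Formally, the input to our model is a vector $x = \langle \mathbf{w}, \mathbf{p}, t, \ell \rangle$, following the notation introduced earlier. As before, $\mathbf{w} = \langle w_1, \ldots, w_n \rangle$ is the $n$-word sentence, $\mathbf{p}$ is its predicted part-of-speech tag sequence, $t = \langle t_{\text{start}}, t_{\text{end}} \rangle$ is a target span in the sentence, and $\ell$ is its lexical unit. For each lexical unit $\ell$, FrameNet provides the set of frames $\mathcal{F}_\ell$ that it could possibly evoke.
At each word position $q$, we give as input to a token biLSTM (f-tok) an input vector $\mathbf{v}_q = [\mathbf{d}_{w_q}; \mathbf{e}_{w_q}; \mathbf{o}_{p_q}]$, where $\mathbf{d}_{w_q}$ is a learned embedding of the word type, $\mathbf{e}_{w_q}$ is a fixed pre-trained embedding of the word type, and $\mathbf{o}_{p_q}$ is a learned embedding of the part-of-speech tag. This yields a hidden state vector $\mathbf{h}^{\text{tok}}_q$ at the $q$th token, corresponding to a contextualized representation of the word:
$$\mathbf{h}^{\text{tok}}_q = \text{biLSTM}^{\text{f-tok}}(\mathbf{v}_1, \ldots, \mathbf{v}_n)_q$$
To represent the target in context, we use the token hidden states $\mathbf{h}^{\text{tok}}$ over the target span $t$, as well as the neighboring context window of size 1, as input to a forward LSTM:
$$\mathbf{h}^{\text{tgt}} = \text{LSTM}^{\text{f-tgt}}\left(\mathbf{h}^{\text{tok}}_{t_{\text{start}}-1}, \ldots, \mathbf{h}^{\text{tok}}_{t_{\text{end}}+1}\right)$$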
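To make the context window concrete, the following minimal sketch lists the token positions that would be fed to the target LSTM for a given span. It assumes 0-indexed, inclusive span boundaries and clips the window at the sentence edges; the clipping behavior and the function name `target_window` are assumptions for illustration, not details given in the text.

```python
def target_window(t_start, t_end, n, window=1):
    """Return the token indices covering the target span plus `window`
    neighbors on each side, clipped to the sentence [0, n - 1].

    Assumes 0-indexed, inclusive span boundaries (an illustrative choice).
    """
    lo = max(0, t_start - window)
    hi = min(n - 1, t_end + window)
    return list(range(lo, hi + 1))


# A two-token target in the middle of a six-word sentence.
middle = target_window(2, 3, 6)
# A one-token target at the start of a three-word sentence (left side clipped).
at_start = target_window(0, 0, 3)
```

The hidden states at these positions, read left to right, would then form the input sequence of the forward LSTM.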
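In addition to the above, we learn an embedding $\mathbf{y}_\ell$ for each lexical unit $\ell$. A final layer with a rectified linear unit (Nair and Hinton, 2010) is applied to the concatenation $[\mathbf{h}^{\text{tgt}}; \mathbf{y}_\ell]$ of the target representation and the lexical unit embedding to get our model scores:
$$\phi(f, x) = a_f + \mathbf{b}_f \cdot \mathrm{ReLU}\left([\mathbf{h}^{\text{tgt}}; \mathbf{y}_\ell]\right)$$
where the scalar $a_f$ and the vector $\mathbf{b}_f$ are parameters associated with the frame $f$. The probability under our model is given by:
$$p(f \mid x) = \frac{\exp \phi(f, x)}{\sum_{f' \in \mathcal{F}_\ell} \exp \phi(f', x)}$$
where the sum ranges over the frames $\mathcal{F}_\ell$ that the lexical unit $\ell$ can evoke.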
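As a rough illustration of the scoring and normalization step, the sketch below scores each candidate frame with a per-frame scalar and vector applied to a rectified feature vector, then normalizes only over the frames licensed by the lexical unit. The exact form of the final layer, the feature values, and the parameter values are assumptions made for the example; the frame names are used purely as labels.

```python
import math


def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vec]


def frame_score(a_f, b_f, features):
    """Per-frame scalar plus dot product of the per-frame vector with
    the rectified features (assumed form of the final layer)."""
    hidden = relu(features)
    return a_f + sum(b * h for b, h in zip(b_f, hidden))


def frame_distribution(features, candidate_frames, params):
    """Softmax restricted to the frames the lexical unit can evoke."""
    scores = {f: frame_score(*params[f], features) for f in candidate_frames}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {f: math.exp(s - max_score) for f, s in scores.items()}
    total = sum(exps.values())
    return {f: e / total for f, e in exps.items()}


# Hypothetical features (target representation concatenated with the
# lexical unit embedding) and hypothetical per-frame parameters (a_f, b_f).
features = [0.5, -1.0, 2.0]
params = {
    "Commerce_buy": (0.1, [1.0, 1.0, 0.5]),
    "Getting": (0.0, [0.2, 0.3, 0.1]),
    "Motion": (5.0, [1.0, 1.0, 1.0]),  # not a candidate, so never scored
}
probs = frame_distribution(features, ["Commerce_buy", "Getting"], params)
```

Restricting the normalization to the candidate set means that frames the lexical unit cannot evoke receive no probability mass, however high their raw score would be.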
Table 4: Test set accuracy of frame identification.
Table 5: Full end-to-end performance, when com- bined with our ensembled arg id models.
Negative log likelihood is minimized during training, using ADAM (Kingma and Ba, 2014). A single pass through the sentence is sufficient to predict frames for all targets.
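For reference, the training objective can be written out as follows. This is a sketch of the standard negative log-likelihood, writing $p(f \mid x)$ for the model probability of frame $f$; the symbols $\mathcal{D}$ for the training set and $f^{*}$ for the gold frame are notational assumptions introduced here.

```latex
\mathcal{L} = - \sum_{(x, f^{*}) \in \mathcal{D}} \log p(f^{*} \mid x)
```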
To compensate for the variance resulting from different initializations, we use a five-model self-ensemble to aggregate the frame scores in Eq. 19. Pretrained GloVe (Pennington et al., 2014) vectors of dimension 100 are used. For tuning hyperparameters, we use the same methods as described earlier.
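The self-ensemble can be sketched as follows, assuming that aggregation means averaging each frame's score across the independently initialized models and predicting the frame with the highest average. The averaging rule, the function name `ensemble_predict`, and the scores are assumptions for illustration; the text does not specify the aggregation function.

```python
def ensemble_predict(per_model_scores):
    """Average frame scores over models and return (best_frame, averages).

    per_model_scores: list of dicts mapping frame name -> score,
    one dict per model, all over the same candidate frames.
    """
    frames = per_model_scores[0].keys()
    n_models = len(per_model_scores)
    averages = {
        f: sum(scores[f] for scores in per_model_scores) / n_models
        for f in frames
    }
    best = max(averages, key=averages.get)
    return best, averages


# Hypothetical scores from five models for two candidate frames.
# The first model alone would pick "B"; the ensemble picks "A".
five_models = [
    {"A": 1.0, "B": 1.5},
    {"A": 2.0, "B": 1.0},
    {"A": 0.5, "B": 1.0},
    {"A": 1.5, "B": 1.0},
    {"A": 1.0, "B": 0.5},
]
best_frame, avg_scores = ensemble_predict(five_models)
```

Averaging smooths out the disagreement of any single run, which is the variance the ensemble is meant to reduce.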
A.2 Results
The performance of our model is shown in Table 4. Our model significantly outperforms SEMAFOR, which uses a variety of syntactic features. Hermann et al. (2014) reuse most of the features from SEMAFOR, as well as WSABIE embeddings (Weston et al., 2011) that also make use of syntax. FitzGerald et al. (2015) reimplement their model and report slightly better scores (possibly due to differences in parameter initialization or syntactic parses used). Our performance is similar to Hartmann et al. (2017), who focus on improving out-of-domain frame identification performance using a large external database. We use no external resources.
Finally, we show the performance of our end-to-end system with both frame and argument identification in Table 5. As expected, it is lower than the results in Table 3, illustrating the importance of frame identification in a frame-semantic parsing pipeline.