Motivation. While graph-based dependency parsers are simple interfaces, extensible and modular implementations for sustainable parser research and development have to date been severely lacking in the research community. Parsing research generally centres around particular components of parsers in isolation, for example, a novel decoding algorithm, the encoding of new features, a new learning algorithm, etc. However, due to perceived gains in performance or due to the lack of foresight in writing sustainable code, these components are rarely implemented modularly or with a view to extensibility. Both with previous sparse feature graph-based dependency parsers (such as McDonald and Pereira (2006)’s MST parser), as well as with recent state-of-the-art neural parsers (specifically Kiperwasser and Goldberg (2016) or Dozat and Manning (2017)’s neural parsers), implementations of parser components are generally hard-coupled with each other.
Additionally, dependency parsers are often evaluated using mildly differently interpreted metrics, different data preprocessing choices, and over different target hardware. The persistently inadequate setting for parser architecture comparison entails that comparing and thereby exploring the effect of different design choices often becomes impossible to properly gauge.
With UniParse, we provide a flexible, highly expressive, scientific framework for easy, lowbarrier of entry, highly modular, highly efficient development and fair benchmarking of graph-based dependency parsing architectures. Additionally, the framework is pre-configured with current state-of-the-art first-order sparse and neural graph-based parser implementations.
Novel contributions.
• We align sparse feature and neural research in graph-based dependency parsing to a common terminology. With this shared terminology we develop a unified framework for the UniParse toolkit to rapidly prototype new parsers and easily compare performance to previous work.
• Prototyping is now rapid due to modularity: parser components must be developed in isolation, with no resulting loss in efficiency. For example, measuring the empirical performance of a new decoder no longer requires implementing an encoder too, and investigating the synergy between a learning strategy and a decoder no longer requires more than a flag or calling a library function.
• Preprocessing is now made explicit within its own component and is thereby adequately isolated and portable.
• The evaluation module is now easy to read and fully specified. We specify the subtle differences in computing UAS and LAS from
previous literature and have implemented these in UniParse in an explicit way.
• To the best of our knowledge, UniParse is the first attempt at unifying existing dependency parsers to the same code base. Moreover, UniParse appears to be the first attempt to enable state-of-the-art first-order sparse-feature dependency parsing within a Python environment.
We make the parser freely available under a GNU General Public License.1
A demonstration on how to easily integrate two recent, more complex, embedding components– ELMo embeddings (Peters et al., 2018) and TCN representations (Bai et al., 2018)–is also made available.2
Traditionally, a graph-based dependency parser consists of three components. An encoder set of parameters
sible dependency relations between all words of a sentence S can modeled as a complete directed graph
where the words are nodes and arcs are the relations. To sub-sets of arcs from
factors),
associates a d-dimensional feature vector, its encoding. The set of parameters
used to produce scores from the constructed feature vectors according to some learning architecture. These parameters are optimised over treebanks. Lastly the decoder h is some maximum spanning tree algorithm with input
and scores for factors of
; it outputs a wellformed dependency tree, which is the raw output of a dependency model.
Recent work on neural dependency parsers learns factor embeddings discriminatively alongside the parameters used for scoring. The result is that of dependency parsers fuse together into a union of parameters. Thus, in this work we fold the notion of encoding into the parameter space. Now for the neural models, all parameters are trainable, whereas for sparse-feature models, the encodings of sub-sets of arcs are non-trainable. So the unified terminology addresses only parameters
and a decoder h.
We provide two abstractions to implementing graph-based dependency parsers. First, our descriptive high-level approach focuses on expressiveness, enabling models to be described in just a few lines of code by providing an interface where the required code is minimal, only a means to con-figure design choices. Second, as an alternative to the high-level abstraction we emphasise that parser definition is nothing more than a composition of pre-configured low-level modular implementations. With this we invite cherry picking of the included implementations of optimised decoders, data preprocessors, evaluation module and more. We now briefly overview the basic use of the unified API and list the central low-level module implementations included with the UniParse toolkit.
Elementary usage. For ease of use we provide a high-level class to encapsulate all components of a parser. Its use results in a significant reduction in amount of code required to implement a parser and counters unwanted boilerplate code. It provides default best-practice configurations for all included components, while enabling custom implementation whenever needed so long as it is callable and adheres to the framework’s function definition of the specific component. The minimum requirements with the use of this interface are: decoder, loss function, optimiser, and batch strategy. In Figure 1 is an example implementation of Kiperwasser and Goldberg (2016)’s neural parser in only a few lines. The full list of possible arguments along with their interfaces can be found in the toolkit documentation.
Vocabulary. This module preprocesses a CoNLL-U formatted dataset and provides a mapper and lookup from tokens to identifiers, with support for alignment according to pretrained word embeddings. Text preprocessing strategies have significant impact on NLP model performance. Despite this, little effort has put into describing such techniques in recent literature, which obfuscates where a model’s contribution actually lies. In the UniParse toolkit, we have included implementations for recently employed techniques within parsing for cleaning and preprocessing during the tokenisation stage.
Data Provider. This module organises the tokenised data into batches for efficient learning ac-
Figure 1, Listing 1 & Table 1: (Right code snippet) Implementation of Kiperwasser and Goldberg (2016)’s neural parser in only a few lines using UniParse.
(Right table and left figure) Seconds a decoder takes to decode an entire dataset, given a set of scores. Score matrix entries are generated uniformly on [0, 1]. The random generated data has an impact on CLE since worst-case performance depends on the sorting bottleneck; the figure demonstrates this by the increasingly broad standard deviation band. Experiments are run on an Ubuntu machine with an Intel Xeon E5-2660, 2.60GHz CPU.
cording to the user-specified arguments. We provide several implementations for different batching strategies. This includes (1) batching by sentence length (bucketing), (2) fixed-size batching with padding, and (3) scaled padded batching through approximate clustering (Dozat and Man- ning, 2017).3
Decoders. We include optimised Cython implementations of first-order decoders with the toolkit (as well as Python versions of these for comparison), including both our implementations and “generic”4 implementations: Eisner’s algorithm (Eisner, 1996) and Chu-Liu-Edmonds (CLE) (Chu and Liu, 1965; Edmonds, 1967; Zwick, 2013). We compare Cython implementations in Table 1 and Figure 1 over randomised score input. Note that our implementations are significantly faster.
Evaluation. Unlabeled attachment score (UAS) and labeled attachment score (LAS) are central dependency parser performance metrics, measuring unlabeled and labeled arc accuracy respectively with UAS = #correct arcs#arcs and LAS = #correctly labeled arcs
#arcs . Unfortunately, there are also a number unreported preprocessing choices preceding the application of these metrics, which renders direct comparison of parser performance in the literature futile, regardless of how well-motivated these preprocessing choices are. These are generally discovered by manually screening the code implementations when these implementations are made available to the research community. Two important variations found in state-the-art parser evaluation are the following.
1. Punctuation removal. Arcs incoming to any punctuation are sometimes removed. Moreover, the definition of punctuation is not universally shared. We provide a clear Python implementation for these metrics with and without punctuation arc deletion before application, where the definition of punctuation is clear: punctuation refers to tokens that consist of characters complying to the Unicode punctuation standard.5 This is the strategy employed
Table 2: UAS/LAS for included parser configurations. We provide results with (w.p.) and without (n.p) punctuation. For the English universal dependencies (UD) dataset we exclude the github repository suffix EWT. Regarding (Dozat and Manning, 2017), despite having access to the published TensorFlow code of we never observed scores exceed 95.58. Scores for neural parsers are averages of 10 runs for the (Kiperwasser and Goldberg, 2016) reimplementation and 3 runs for the (Dozat and Manning, 2017) reimplementation–this difference in number of runs reflects running time of the corresponding parsers.
by the widely used Perl evaluation script, which to our knowledge, originates from the CoNLL 2006 and 2007 shared tasks.6 We infer this from references in (Buchholz and Marsi, 2006).
2. Label prefixing. Some arc labels are “composite”, their components separated by a colon. An example from the English Universal Dependencies data set is the label obl:tmod. The official CoNLL 2017 shared-task evaluation script7 allows partial matching of labels, for example matching to the non-language-specific label prefix obl within the language-specific label obl:tmod for full points. We include this variant in UniParse’s evaluation module.
Callbacks. We include a number of useful callback utilities, such as a Tensorboard logger8 and a patience mechanism for early stopping together with a model saver over iterations.
Loss Functions. State-of-the-art sparse feature parsers (and UniParse’s included specification for this) evade direct loss computation (for example, using insights by Crammer and Singer (2003)), directly computing parameter adjustments from feature vectors. To train neural parser models we formalise a function interface and provide a set of common loss functions. The possibilities for loss in graph-based neural dependency models have not been greatly explored. Rather than typical loss, which is calculated over all predictions, in parsing loss has been computed over only a subset of the predicted score matrix. We impose a loss function to adhere to the type defini-tion score tensor produced by the neural model,
optional predicted tree, and
the gold tree.
Included parser configurations. We include three state-of-the-art first-order dependency parser implementations as example configurations of UniParse: McDonald and Pereira (2006)’s MST sparse feature parser reimplementation9, and Kiperwasser and Goldberg (2016) and Dozat and Manning (2017)’s respective graph-based neural parsers. Experiments are carried out on English and Danish: the Penn Treebank (Marcus et al., 1994) (en ptb, training on sections 2-21, development on section 22 and testing on section 23), converted to dependency format following the default configuration of the Standford Dependency converter (version >= 3.5.2), and the English (en ud), and Danish (da) datasets from version 2.1 of the universal dependencies project (Nivre et al., 2017). Tables 2 shows how our parser configu-rations perform compared with the originally reported parser performance.
In this paper, we have described the design and usage of UniParse, a high-level un-opinionated framework and toolkit that supports both featurebased models with on-line learning techniques, as well as recent neural architectures trained through backpropagation. We have presented the framework as answer to a long-standing need for highly efficient, easily extensible, and, most of all, directly comparable graph-based dependency parsing research.
S. Bai, J. Zico Kolter, and V. Koltun. 2018. An Em- pirical Evaluation of Generic Convolutional and Re- current Networks for Sequence Modeling. ArXiv eprints.
Sabine Buchholz and Erwin Marsi. 2006. Conll-x shared task on multilingual dependency parsing. In In Proceedings of CoNLL, pages 149–164. Association for Computational Linguistics.
Y.J. Chu and T.H. Liu. 1965. On the shortest arbores- cence of a directed graph. Sci. Sinica, 14:13961400.
Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991.
Timothy Dozat and Christopher M. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR.
J. Edmonds. 1967. Optimum branchings. J. Res. Nat. Bur. Standards, 71B:233240.
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING ’96, pages 340– 345, Stroudsburg, PA, USA. Association for Computational Linguistics.
Eliyahu Kiperwasser and Yoav Goldberg. 2016. Sim- ple and accurate dependency parsing using bidirectional lstm feature representations. Transactions of the ACL, 4:313–327.
Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, USA. Association for Computational Linguistics.
Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL. Association for Computational Linguistics.
Joakim Nivre et al. 2017. Universal dependencies 2.1. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics ( ´UFAL), Faculty of Mathematics and Physics, Charles University.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL. Association for Computational Linguistics.
Uri Zwick. 2013. Lecture notes on “analysis of algo- rithms”: Directed minimum spanning trees.