Neural networks have become the backbone of some of the top-performing transition-based dependency parsing systems. The use of a simple feed-forward network (FFN) by Chen and Manning (2014) kickstarted a string of improvements upon this approach. Weiss et al. (2015) trained a larger, deeper network, and used a final structured perceptron layer on top of it. Andor et al. (2016) used global normalization and a large beam to achieve state-of-the-art results for dependency parsers of this type.
On the other hand, recurrent neural network (RNN) models have also started receiving more attention. Kiperwasser and Goldberg (2016a) used hierarchical tree LSTMs to model the dependency tree itself, and then passed on the extracted information to a feed-forward/output layer structure similar to that in Chen and Manning’s (2014) original model. Dyer et al. (2015) used stack-LSTMs to model the current states of the different structures of a transition-based system, with separate stack-LSTMs modeling the stack, the buffer, and the history of transitions made so far.
Chen et al. (2015) used two heterogeneous gated recursive neural networks to represent the dependency tree, while Zhang et al. (2015) developed TreeLSTMs to estimate the probability that a certain dependency tree is generated given a sentence.
Kiperwasser and Goldberg (2016b) used a conceptually simpler approach, running a bidirectional LSTM (biLSTM) over the sentence. The inputs to these biLSTMs were features describing the word, its part-of-speech (POS) tag, and other structural information about each token in the sentence. The output of the biLSTMs was again passed on to a feed-forward layer to compute a hidden state before the final output layer.
These LSTM-based methods, however, all attempt to replace the original embeddings layer used by Chen and Manning (2014) with a more sophisticated feature representation but, as we pointed out, they keep the main structure of the model largely the same: a hidden layer encodes the input features at the current time-step before a final output layer scores possible transitions. These hidden layers can be seen as encoding the current configuration of inputs in a manner useful only for the decision made at that point in the transition sequence.
In contrast to these approaches, Kuncoro et al. (2016) extended the basic model of Chen and Manning (2014) by replacing the hidden layer with an LSTM, thus allowing the network to model sequences of transitions instead of only immediate input/transition pairs.
In this work we build on Kuncoro et al.’s (2016) approach by initialising the weights of an LSTM-based dependency parser with the weights of a pre-trained feed-forward network. We show that this method produces a substantial improvement in accuracy scores, and is also applicable to different kinds of RNNs. An additional contribution of this paper is a refinement of the basic training model of Chen and Manning (2014), producing a more accurate feed-forward model as a baseline for our experiments.
We begin with a brief overview of transition-based dependency parsing, followed by an explanation of our baseline models: the basic FFN and LSTM-based models that are the center of this work. We then explain our proposed method for the alternative initialization of the LSTM weights, and present the results of our experiments alongside a comparison with other state-of-the-art parsers. Finally we explore the use of GRUs and Elman networks in place of LSTMs, and show the effect of initializing individual gates using our proposed method on the overall performance.
A transition-based parsing system considers a given sentence one word at a time. The parser then makes a decision to either join this word to a word encountered previously with a dependency relation, or to store this word until it can be attached at a later point in the sentence. In this way the parser requires only a single pass through the sentence to produce a dependency tree.
In this work we use the arc-standard transition system (Nivre, 2004), which maintains two data structures: the stack (S), which holds the words that the parser has already seen and wishes to remember, and the buffer (B), containing all the words that it has yet to consider, in the order in which they appear in the sentence. In addition the parser keeps a list of all dependency arcs (A) produced throughout the parse. Together, the state of the stack, the buffer, and the list of arcs are referred to as the configuration (x) of the parser. In the initial configuration, the buffer contains all the words of a sentence in order, and the stack contains only the ROOT token, which is typically attached to the main verb in the sentence. The parser can then perform one of three transitions:
• SHIFT removes the front word $b_1$ from B, and pushes it onto S.
• LEFT-ARC adds an arc between the top two items $s_1$ and $s_2$ on S, with $s_1$ being the head. $s_2$ is then removed from S.
• RIGHT-ARC adds an arc between the top two items $s_1$ and $s_2$ on S, with $s_2$ being the head. $s_1$ is then popped from S.
Each of these transitions changes the state of one or more of the structures in the parser, and therefore produces a new x.
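The three transitions can be illustrated with a minimal Python sketch. The class and function names are hypothetical and chosen only for illustration; the top of the stack is taken to be the last list element, and arcs are stored as (head, label, dependent) triples, which is an assumption about representation rather than a description of our implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Parser configuration x = (S, B, A) for the arc-standard system."""
    stack: list                                # S: top of the stack is the last element
    buffer: list                               # B: front of the buffer is the first element
    arcs: list = field(default_factory=list)   # A: (head, label, dependent) triples


def initial_configuration(words):
    # The stack starts with ROOT; the buffer holds the sentence in order.
    return Configuration(stack=["ROOT"], buffer=list(words))


def shift(c):
    # SHIFT: move the front word of B onto S.
    c.stack.append(c.buffer.pop(0))


def left_arc(c, label):
    # LEFT-ARC: the top item heads the second item, which is removed from S.
    top = c.stack.pop()
    second = c.stack.pop()
    c.arcs.append((top, label, second))
    c.stack.append(top)


def right_arc(c, label):
    # RIGHT-ARC: the second item heads the top item, which is popped from S.
    top = c.stack.pop()
    c.arcs.append((c.stack[-1], label, top))


def is_terminal(c):
    # Parsing ends when the buffer is empty and only ROOT remains on the stack.
    return not c.buffer and c.stack == ["ROOT"]


c = initial_configuration(["She", "sleeps"])
shift(c)
shift(c)
left_arc(c, "nsubj")   # sleeps -> She
right_arc(c, "root")   # ROOT -> sleeps
```

Each call mutates the configuration in place, mirroring the statement that every transition produces a new x.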
Our proposed approach makes use of a simple feed-forward model to improve the performance of an LSTM-based model. We show that the final network surpasses both of our baselines, which are the original feed-forward network, and an LSTM model trained with randomly initialized weights. In this section we will describe the structure of both baselines.
3.1 Input Layer, Selected Features, & Output Layer
The Embeddings layer is a concatenation of the embedding vectors of select raw features of the parser configuration. The resulting layer is a dense feature representation of x. The features used in our implementation are shown in Table 1.
Table 1: Features extracted from a configuration. w, t, and l are words, POS tags, and dependency labels respectively; the child markers refer to the rightmost/leftmost child.
We represent the configuration of the parser at a particular timestep as a number of raw features extracted from the data structures of x. We use vector embeddings to represent each of the raw features.
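Each word (w), part-of-speech tag (t), and arc label (l) is represented as a d-dimensional vector $e^w \in \mathbb{R}^{d_w}$, $e^t \in \mathbb{R}^{d_t}$, and $e^l \in \mathbb{R}^{d_l}$ respectively. And so the embedding matrices for the different types of features are $E^w \in \mathbb{R}^{d_w \times V_w}$, $E^t \in \mathbb{R}^{d_t \times V_t}$, and $E^l \in \mathbb{R}^{d_l \times V_l}$, where $d_*$ is the dimensionality of the embedding vector for a feature type, and $V_*$ is the vocabulary size. We add additional vectors for “ROOT” and “NULL” for all feature types, as well as “UNK” (unknown), for unknown/infrequent words.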
This embeddings layer is used as the input layer in all models described in this work. For all models we use dropout (Hinton et al., 2012) on the input layer. We find that this improves the final accuracy of all the networks trained.
The output layer y consists of nodes representing every possible transition, with one node representing Shift, and a node for every possible pair of arc transitions (Left/Right-Arc) and dependency labels. This makes the size of the output layer constant at $2N_l + 1$, where $N_l$ is the number of dependency labels, regardless of the structure of the network.
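As a rough illustration of the input and output layers described in this subsection, the sketch below concatenates embedding vectors for a handful of raw features and computes the output layer size. All names, the tiny vocabularies, the embedding dimensions, and the initialization range are hypothetical and serve only to make the example runnable.

```python
import random


def make_embeddings(vocab, dim, rng, unk=False):
    # One vector per symbol, plus ROOT and NULL for every feature type.
    # UNK is added only where requested (here: for words).
    symbols = list(vocab) + ["ROOT", "NULL"] + (["UNK"] if unk else [])
    return {s: [rng.uniform(-0.01, 0.01) for _ in range(dim)] for s in symbols}


def input_layer(features, tables):
    # features: list of (feature_type, symbol) pairs extracted from x.
    # Returns the concatenation of the corresponding embedding vectors.
    x = []
    for ftype, symbol in features:
        table = tables[ftype]
        vec = table.get(symbol)
        if vec is None:
            # Unknown symbols fall back to UNK when available, else NULL.
            vec = table["UNK"] if "UNK" in table else table["NULL"]
        x.extend(vec)
    return x


def output_size(num_labels):
    # One node for Shift, plus Left-Arc and Right-Arc for every label.
    return 2 * num_labels + 1


rng = random.Random(0)
tables = {
    "w": make_embeddings(["She", "sleeps"], 4, rng, unk=True),
    "t": make_embeddings(["PRP", "VBZ"], 2, rng),
    "l": make_embeddings(["nsubj"], 2, rng),
}
features = [("w", "sleeps"), ("w", "unseen"), ("t", "VBZ"), ("l", "NULL")]
x = input_layer(features, tables)
```

The length of x is simply the sum of the embedding dimensions of the selected features, so it stays fixed for a given feature template.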
3.2 Feed-Forward Model
For our FFN model we use the same basic structure as Chen and Manning (2014), with a single hidden layer and a final softmax output layer. We do, however, follow Weiss et al. (2015) in using rectified linear units (ReLUs) (Nair and Hinton, 2010) as hidden neuron activation functions. Finally, we use dropout on the hidden layer as we do on the input layer. The structure of the FFN is specified below.
Following Weiss et al. (2015) we set the initial bias of the hidden layer to 0.02 in order to avoid having any dead ReLUs at the start of training.
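A minimal sketch of such a network's forward pass is given here, assuming a single ReLU hidden layer, a softmax output, dropout on both the input and hidden layers, and a hidden bias initialized to 0.02. The parameter names (W_h, b_h, W_y, b_y), the weight range, and the use of inverted dropout (scaling at training time) are assumptions made for this illustration.

```python
import math
import random


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def softmax(v):
    m = max(v)                              # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def dropout(v, rate, rng, train):
    # Inverted dropout (an assumption): kept units are rescaled during training.
    if not train or rate == 0.0:
        return list(v)
    return [0.0 if rng.random() < rate else a / (1.0 - rate) for a in v]


def init_ffn(n_in, n_hidden, n_out, rng, scale=0.01):
    def rand(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_h": rand(n_hidden, n_in),
        "b_h": [0.02] * n_hidden,           # small positive bias to avoid dead ReLUs
        "W_y": rand(n_out, n_hidden),
        "b_y": [0.0] * n_out,
    }


def ffn_forward(x, p, rng, rate=0.3, train=False):
    x = dropout(x, rate, rng, train)
    pre = [a + b for a, b in zip(matvec(p["W_h"], x), p["b_h"])]
    h = dropout([max(0.0, a) for a in pre], rate, rng, train)   # ReLU hidden layer
    scores = [a + b for a, b in zip(matvec(p["W_y"], h), p["b_y"])]
    return softmax(scores)                  # distribution over transitions


rng = random.Random(0)
params = init_ffn(n_in=6, n_hidden=8, n_out=5, rng=rng)
probs = ffn_forward([1.0] * 6, params, rng)
```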
3.3 RNN-based Model
Our RNN-based model is an extension of the basic feed forward model, with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) standing in for the traditional feed forward hidden layers.
The change allows for the information in the parser configuration to be shared as needed with future time-steps. This lets the network at any point in the sequence of transitions make a decision based on a more informative context, that is not only based on the current configuration, or the present state of the dependency tree, but also on the changes made to them.
Figure 1: An FFN and an RNN over 3 time-steps. The FFN shown in 1a only has access to information from the current configuration as represented in x. RNNs on the other hand also receive information about previous configurations as encoded in the hidden states from previous time-steps. The pair $(h_t, c_t)$ refers to the external and internal hidden states produced by an LSTM; other types of RNN units do not necessarily maintain an internal state.
In their standard forms, RNNs are affected by both exploding and vanishing gradients (Bengio et al., 1994), making them notoriously hard to train despite their expressive ability. LSTMs are a variety of RNN that maintain an internal state that forms the basis for the recurrence, and is passed from one time-step to the next. This direct connection is not interrupted by any weight matrices, as would be the case in simpler RNN architectures such as Elman networks (Elman, 1990), but is instead scaled and added to by a number of gates that handle extracting and scaling information from the input data, and computing a final hidden state at each time step to pass on to deeper layers. This uninterrupted connection of internal states throughout the sequence is an important part of how LSTMs address the shortcomings of RNNs.
There have been a variety of architectures in the literature referred to as LSTMs, all bearing slight differences to the basic LSTM unit. The definition of the LSTM we use in this work is shown below, with a final softmax output layer just as in the FFN model.
Unlike Kuncoro et al. (2016), we do not use peephole connections like those suggested by Graves (2013). Additionally, we add a bias of 1 to the LSTM’s forget gate following Gers et al. (2000). Finally, we also apply dropout similar to that in Zaremba et al. (2014).
As shown in this definition, the LSTM cell maintains an internal state $c_t$, where the previous internal state $c_{t-1}$ is modulated at each time-step by the forget gate $f_t$, and then added to by a scaled selection of the current input by the input gates $i_t$ and $j_t$. $c_t$ is then used for the external state $h_t$ and passed on to the next time-step. All gates rely on weighted activations of the current input $x_t$ and the previous external state $h_{t-1}$.
This pair of hidden states allows the LSTM to contribute to long-term decisions with $c_t$ while still being able to make immediate or short-term decisions with $h_t$, and it is this final calculation of $h_t$, along with $o_t$, that is the focus of our contribution in this work.
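The cell update described above can be sketched as a single time-step in Python. This is an illustration under stated assumptions, not a reproduction of our exact equations: it uses the common formulation without peephole connections, the gate names i (candidate input), j (input gate), f (forget gate), and o (output gate) in the style of Jozefowicz et al. (2015), a tanh candidate, and sigmoid gates. The dict layout and helper names are hypothetical.

```python
import math


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigm(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def gate(p, name, x, h_prev, act):
    # Every gate sees the current input x_t and the previous external state h_{t-1}.
    pre = [a + b + c for a, b, c in zip(matvec(p["W_x" + name], x),
                                        matvec(p["W_h" + name], h_prev),
                                        p["b_" + name])]
    return act(pre)


def lstm_step(p, x, h_prev, c_prev):
    i = gate(p, "i", x, h_prev, tanh)   # candidate input
    j = gate(p, "j", x, h_prev, sigm)   # input gate scaling the candidate
    f = gate(p, "f", x, h_prev, sigm)   # forget gate modulating c_{t-1}
    o = gate(p, "o", x, h_prev, sigm)   # output gate (sigmoid is an assumption)
    c = [cp * fg + ig * jg for cp, fg, ig, jg in zip(c_prev, f, i, j)]
    h = [math.tanh(ct) * og for ct, og in zip(c, o)]
    return h, c


def zero_params(n_in, n_hid):
    # All-zero weights with a forget-gate bias of 1, only to make the example checkable.
    p = {}
    for g in "ijfo":
        p["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + g] = [0.0] * n_hid
    p["b_f"] = [1.0] * n_hid
    return p


h, c = lstm_step(zero_params(3, 1), x=[0.2, -0.1, 0.4], h_prev=[0.0], c_prev=[1.0])
```

Note that the previous internal state reaches the new one through an element-wise product only, with no weight matrix on that path.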
Much has been written about the need for careful initialization of weights, often done to complement certain optimisation methods such as gradient descent with momentum (Sutskever et al., 2013). For deep networks, Hinton et al. (2006) and later Bengio et al. (2007) approached initialization differently by using a greedy layer-wise unsupervised learning algorithm, which trains each layer sequentially, before fine-tuning the entire network as a whole.
Le et al. (2015) suggested replacing traditional tanh units in a simple RNN with ReLUs, in addition to initializing the weights with an identity matrix.
Figure 2: A comparison of the architecture of an FFN and an LSTM-based model. The bold arrows represent the weight matrices that are roughly equivalent to those in an FFN, ending in the final softmax layer that scores each possible transition. We only show labels for the matrices that we initialize with their FFN counterparts. Additionally we replace the biases of the LSTM gates with the bias of the hidden layer of the FFN, and use the FFN-trained embeddings for all feature types.
As previously mentioned, Gers et al. (2000) suggested initializing the bias of the forget gate of an LSTM to 1. This allowed the LSTM unit to learn which information it needed to forget as opposed to detecting the opposite. This was later shown by Jozefowicz et al. (2015) to improve performance of an LSTM on a variety of tasks.
Alternatively, it has become increasingly common to use tuned outputs from one network as initialization for another. For example the use of pre-trained embeddings as initialization for word vectors has become de facto standard procedure for tasks such as dependency parsing, language modelling and question answering.
Following this approach, we propose initializing the LSTM weights, specifically the input weight matrix on the FFN-equivalent path and the bias of all LSTM gates, with the hidden weight matrix and hidden bias of a pre-trained, similarly structured feed-forward neural network. We also initialize the embedding matrices and the weights of the final softmax layer with those of the pre-trained feed-forward network.
To illustrate this idea we reproduce a modified version of the LSTM architecture diagram appearing in Jozefowicz et al. (2015) in Figure 2, with the addition of the final softmax layer. The flow of information from the current input (shown by the bold arrows) is almost identical to that in an FFN, except for the addition of input from $h_{t-1}$ to o, and the “interference” of information from $c_t$ to produce $h_t$.
This approach rests on the two hidden states of the LSTM requiring different information from the same input data. Since $h_t$ is more concerned with immediate decisions, it would strongly benefit from the trained weights of a feed-forward network, which are tuned to extract the maximum relevant information from the input of the current time-step, since the FFN has no access to prior information.
The various LSTM gates would still be able to learn to use information from $h_{t-1}$, but would be in a better position to do so with the biases and input weights closer to an optimum configuration.
Moreover, the internal state $c_t$ would receive less severe errors early on in the training process, owing to a better contribution from $o_t$ in the calculation of $h_t$, and a less disruptive result from $c_t$ due to the input and forget gates initially behaving more similarly to the regular hidden layer of the original FFN.
This would mean less pressure on the weights of the input and forget gates to adapt to immediate decisions while the internal state would be more capable of gradually learning longer term patterns.
We will henceforth differentiate networks initialized in the manner described in this section by referring to them as bootstrapped models, while we refer to the usual randomly initialized networks as baseline models.
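The bootstrapping step amounts to copying parameters between two models with matching dimensions, which the following sketch illustrates. The dict-of-lists model layout and all key names are hypothetical. Which gates receive the FFN hidden weights as their input matrix is exposed as the `gates` argument; the default of the output gate alone is an assumption made for illustration, and `gates="ijfo"` copies the matrix into every gate.

```python
import copy


def bootstrap_lstm(lstm, ffn, gates=("o",)):
    """Return a copy of `lstm` initialized from the pre-trained `ffn`.

    Assumes the FFN hidden layer and the LSTM layer have the same width,
    so W_h and b_h have the same shapes as the gate input weights and biases.
    """
    boot = copy.deepcopy(lstm)
    for g in gates:
        boot["W_x" + g] = copy.deepcopy(ffn["W_h"])   # FFN hidden weights
    for g in "ijfo":
        boot["b_" + g] = list(ffn["b_h"])             # FFN hidden bias for every gate
    boot["W_y"] = copy.deepcopy(ffn["W_y"])           # final softmax layer
    boot["b_y"] = list(ffn["b_y"])
    boot["E"] = copy.deepcopy(ffn["E"])               # FFN-trained embeddings
    return boot


# Tiny hypothetical models: 2 inputs, 1 hidden unit, 1 output.
ffn = {"W_h": [[1.0, 2.0]], "b_h": [0.5], "W_y": [[3.0]], "b_y": [0.1],
       "E": {"w": [[9.0]]}}
lstm = {"W_y": [[0.0]], "b_y": [0.0], "E": {"w": [[0.0]]}}
for g in "ijfo":
    lstm["W_x" + g] = [[0.0, 0.0]]
    lstm["W_h" + g] = [[0.0]]
    lstm["b_" + g] = [0.0]

boot = bootstrap_lstm(lstm, ffn)
full = bootstrap_lstm(lstm, ffn, gates="ijfo")
```

The recurrent matrices (the `W_h*` entries of the LSTM) have no FFN counterpart and keep their random initialization in this sketch.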
We begin by comparing the performance of our FFN and LSTM baseline networks with our bootstrapped model. For all networks we ran a model with a single hidden layer 256 neurons/LSTM units wide, and kept the embedding dimensions fixed. Weiss et al. (2015) showed that large gains can be made with a grid search to tune learning hyperparameters as well as embedding sizes and hidden layer dimensions, which we did not perform due to its very high computational cost. We use the GloVe pre-trained embeddings produced by Pennington et al. (2014) to initialize the word vectors.
Learning is done with mini-batch stochastic gradient descent (SGD) with momentum to minimise logistic loss, together with an additional regularization cost over all weight, bias, and embedding matrices. We also set the dropout rate to 0.3 for the embeddings layer and hidden layer for both the baselines and the bootstrapped model, and initialise all baseline weights randomly within a fixed range.
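For concreteness, one parameter update of SGD with momentum and a regularization term can be sketched as follows. The form of the penalty (an L2 term) and every hyperparameter value used in the example are assumptions for illustration only; they are not the settings used in our experiments.

```python
def sgd_momentum_step(theta, grad, velocity, lr, momentum, l2):
    """Update the flat parameter list `theta` in place.

    theta, grad, velocity: equal-length lists of floats.
    l2: coefficient of an assumed penalty (l2 / 2) * ||theta||^2.
    """
    for k in range(len(theta)):
        g = grad[k] + l2 * theta[k]                      # loss gradient plus penalty gradient
        velocity[k] = momentum * velocity[k] - lr * g    # accumulate momentum
        theta[k] += velocity[k]


# Hypothetical values, chosen only so the arithmetic is easy to follow.
theta = [1.0]
velocity = [0.0]
sgd_momentum_step(theta, [0.5], velocity, lr=0.1, momentum=0.9, l2=0.0)
first = theta[0]
sgd_momentum_step(theta, [0.5], velocity, lr=0.1, momentum=0.9, l2=0.0)
second = theta[0]
```

With a constant gradient the second step is larger than the first, which is the accumulation effect that momentum is meant to provide.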
For LSTM-based models we used truncated backpropagation through time (BPTT) with a fixed truncation limit. This means that errors are propagated backwards to layers in previous time steps until the limit is reached. In our experiments varying the limit between 5 and full backpropagation had a negligible effect on the final accuracy of the networks, while using a truncation limit produced a significant speed-up in training. We stress that this insignificant difference is most likely a task- and architecture-specific issue, and would probably be much more pronounced in other tasks and neural network set-ups.
For our experiments we use the Wall Street Journal (WSJ) section of the Penn Treebank (Marcus et al., 1993). We use sections 2-21 for training, 22 for development, and 23 for testing. We use Stanford Dependencies (SD) (De Marneffe et al., 2006) converted from constituency trees using version 3.3.0 of the converter. As is standard we use predicted POS tags for the train, dev, and test sets. We report unlabeled attachment score (UAS) and labeled attachment score (LAS), with punctuation excluded.
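The two metrics can be computed as in the sketch below. The function name, the token representation, and in particular the set of POS tags treated as punctuation are assumptions for illustration; the tag set follows a common convention for Penn Treebank evaluation rather than anything specified in this paper.

```python
# Assumed punctuation POS tags; tokens with these gold tags are skipped.
PUNCT_TAGS = {"``", "''", ".", ",", ":"}


def attachment_scores(gold, pred, tags):
    """Return (UAS, LAS) in percent.

    gold, pred: per-token (head_index, label) pairs.
    tags: per-token gold POS tags, used only to exclude punctuation.
    """
    total = uas = las = 0
    for (g_head, g_label), (p_head, p_label), tag in zip(gold, pred, tags):
        if tag in PUNCT_TAGS:
            continue
        total += 1
        if g_head == p_head:
            uas += 1                 # correct head
            if g_label == p_label:
                las += 1             # correct head and label
    return 100.0 * uas / total, 100.0 * las / total


gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (1, "punct")]
tags = ["PRP", "VBD", "NN", "."]
uas, las = attachment_scores(gold, pred, tags)
```

In this toy example the wrongly attached final token does not count against either score because it is punctuation, while the mislabelled third token lowers LAS only.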
The results in Table 2 show the effect of applying dropout on the input layer for our FFN baseline, when compared to the similarly sized Chen and Manning (2014) model, which has 200 neurons in its hidden layer. This is in addition to achieving very close dev score accuracy results with only a single 256-neuron hidden layer when compared to the significantly larger models of Weiss et al. (2015), with 2 layers of size 2048, and Andor et al. (2016), with 2 layers of size 1024.
Table 2: Final dev and test set scores on WSJ (SD). Zhang et al. (2015) do not use pre-trained word vectors for their final result. The values given for Andor et al. (2016) and Weiss et al. (2015) reflect only the performance of the greedy FFN models produced in their work, with other improvements made explained briefly in section 1. C & M refers to Chen & Manning, K & G refers to Kiperwasser & Goldberg.
Comparing our 2 baseline models shows that the LSTM-based model performs much better than the FFN model, with an almost 0.5% gain in dev score accuracy. Our main result is our bootstrapped model, which not only surpassed the original FFN baseline, but also the LSTM baseline.
We note that our LSTM-baseline achieves a substantial improvement over the similar architecture of Kuncoro et al. (2016). The main differences in this case are a slightly larger model and using LSTMs without peephole connections.
In addition, our bootstrapped model produces better results than all the mentioned feed-forward models as well as most of the LSTM-based approaches in Table 2, with the exception of Kiperwasser and Goldberg (2016b), despite only having a single hidden layer of LSTM units and making no use of biLSTMs, TreeLSTMs, or Stack LSTMs.
The results of our experiments seem to lend credence to the idea that learning short and long-term patterns separately is useful to the performance of an LSTM. To generalize this further, one could say that a sequence modelling task where a 1-to-1 relation between input/output pairs can be learned should first attempt this with an FFN, and then transfer that knowledge to an LSTM as described in section 4, so sequence specific information can be further modelled.
An additional benefit of this approach is that it can be applied to previously trained FFNs and can improve any of the models that we have compared our results with in Table 2. This is also true of the LSTM-based models, where the strength of their contributions lies in their innovative approaches to feature extraction while keeping the rest of the network essentially the same.
For example, we can merge our work with that of Kiperwasser and Goldberg (2016b) by first training their model (a biLSTM input layer going to a feed-forward hidden layer followed by an output layer) and then replacing the hidden layer with an LSTM initialized with the weights of that hidden layer.
Finally, our addition of applying dropout to the input layer can also be used here to further strengthen the performance of this example.
So far we have shown how to improve the performance of LSTMs by drawing parallels between the functions of certain gates and the traditional feed-forward network. In this section we attempt to do the same for two other popular forms of RNNs: the Simple Recurrent Network, otherwise known as the Elman network (Elman, 1990), and the Gated Recurrent Unit (GRU) (Cho et al., 2014).
7.1 Elman networks
The Elman network is one of the earliest and simplest RNNs found in literature. It was the subject of much study and suffered from all the original problems of vanishing and exploding gradients mentioned before, which later motivated the development and adoption of more sophisticated units such as LSTMs and GRUs.
Nevertheless there have been examples where Elman networks were capable of performing relatively well, notably the work of Mikolov et al. (2010) on language modelling and an extended memory version of Elman networks (Mikolov et al., 2014).
Figure 3: The architectures of an Elman and a GRU-based model. As in Figure 2, the bold arrows represent the path of information roughly equivalent to that in an FFN. The figure marks the matrices replaced in the Elman-based model and in the GRU-based model. For both RNNs this is in addition to initializing the embedding vectors with those trained by the baseline FFN for all feature types.
Elman networks themselves are only a simple addition to the architecture of the traditional feed-forward network. Whereas an FFN has a hidden layer, an Elman network has an additional context layer that represents the output of the hidden layer in the previous time-step. In a way, it can be compared to the output gate of an LSTM, without any additional tools to model the sequence.
In our experiment we use the ReLU activation function once more for the hidden layer similar to Le et al. (2015), but without their initialization strategy. The precise definition of the Elman network that we use is shown below.
In Figure 3a we illustrate the structure of this network. The simplicity of the addition here makes it far easier to draw parallels between the function of the weight matrices in the Elman network and in the FFN, as shown in Figure 2b.
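A minimal sketch of one Elman time-step with a ReLU hidden layer follows. The parameter names (W_xh for input weights, W_hh for context weights, b_h for the bias) are hypothetical; the sketch only illustrates that the network is an FFN hidden layer with one extra weighted input from the previous hidden state.

```python
def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def elman_step(p, x, h_prev):
    # Hidden layer of an FFN plus a weighted copy of the previous hidden state.
    pre = [a + b + c for a, b, c in zip(matvec(p["W_xh"], x),
                                        matvec(p["W_hh"], h_prev),
                                        p["b_h"])]
    return [max(0.0, v) for v in pre]   # ReLU activation


# Hypothetical parameters: 2 inputs, 1 hidden unit.
p = {"W_xh": [[1.0, -1.0]], "W_hh": [[0.5]], "b_h": [0.02]}
x = [2.0, 1.0]
h1 = elman_step(p, x, [0.0])   # first step: no context yet
h2 = elman_step(p, x, h1)      # second step: context from h1
```

With a zero context vector the step reduces exactly to an FFN hidden layer, which is why W_xh and b_h are the natural targets for initialization from a pre-trained FFN.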
7.2 Gated Recurrent Units
Introduced by Cho et al. (2014), GRUs are an architecture often compared to LSTMs. They also attempt to solve the vanishing gradient problem in a similar way, by keeping the modulation and addition of information in separate gates, and avoiding any weighted obstructions between the hidden states of one time-step and the next. A notable difference however is the lack of an internal state. All modifications are done directly to the external hidden state $h_t$, potentially complicating the learning process with conflicting information about short and long-term dependencies.
Despite this apparently simpler structure, Chung et al. (2014) found GRUs to outperform LSTMs on a number of tasks, and Jozefowicz et al. (2015) also found that GRUs can beat LSTMs except in language modelling. However, Jozefowicz et al. (2015) also found that initializing the LSTM forget gate bias to 1 allowed the LSTM to almost match the performance of the GRU on other tasks.
The internal architecture of a GRU consists of a reset gate modulating the previous state $h_{t-1}$, a candidate gate computing the next addition to $h_t$, and an update gate controlling how much of the candidate is added to $h_t$.
In this case the candidate gate is the most analogous to the hidden layer in an FFN. As shown by the bold lines in Figure 3b, this flow of information appears similar to that of the output gate in an LSTM, except that the additional input here is modulated by the reset gate instead of receiving $h_{t-1}$ directly, in addition to dealing with further interference from the update gate.
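The interaction of the three gates can be sketched as one GRU time-step. This follows a common formulation in which the update gate z interpolates between the previous state and the candidate; the parameter names, the tanh candidate, and the direction of the interpolation are assumptions for illustration rather than our exact definition.

```python
import math


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigm(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(p, name, x, h):
    return [a + b + c for a, b, c in zip(matvec(p["W_x" + name], x),
                                         matvec(p["W_h" + name], h),
                                         p["b_" + name])]


def gru_step(p, x, h_prev):
    r = sigm(affine(p, "r", x, h_prev))                    # reset gate
    z = sigm(affine(p, "z", x, h_prev))                    # update gate
    gated = [rv * hv for rv, hv in zip(r, h_prev)]         # reset-modulated h_{t-1}
    cand = [math.tanh(a) for a in affine(p, "c", x, gated)]  # candidate gate
    # All changes are applied directly to the single hidden state.
    return [zv * hv + (1.0 - zv) * cv for zv, hv, cv in zip(z, h_prev, cand)]


def zero_params(n_in, n_hid):
    # All-zero parameters, only to make the example easy to check by hand.
    p = {}
    for g in "rzc":
        p["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + g] = [0.0] * n_hid
    return p


h = gru_step(zero_params(2, 1), x=[0.3, -0.7], h_prev=[2.0])
```

Under this formulation the candidate's input weights (W_xc here) and bias play the role of the FFN hidden layer, while the candidate only ever sees the previous state after it has passed through the reset gate.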
7.3 Comparison & Results
For the experiments in this section we used the same network dimensions as in section 5, as well as the same training parameters and procedure.
For each RNN type we trained 2 FFN and RNN baselines, one with GloVe pre-trained word embeddings (Pennington et al., 2014) and another with randomly initialized embeddings. We then trained bootstrapped models initialized with the FFN baselines. The results are shown in Table 3.
Table 3: Dev set scores on WSJ (SD) for different RNN types. The Random/Pre-trained embedding only refers to the initial word vectors of the FFN/RNN baseline. All other RNNs in these categories use the final trained embeddings of their respective FFN baseline.
As in section 5, this initialization method shows a positive effect on an LSTM-based model, again surpassing both its baselines. The Elman network is stronger than expected and benefits greatly from this approach. Indeed, the bootstrapped Elman model is comparable in accuracy to some of the results in Table 2.
This cannot be said of GRUs, however, whose baselines perform significantly worse than those of the other RNNs. Moreover, bootstrapped GRU models perform even worse than their baselines, even failing to match the accuracy of the FFNs used to initialize them. This disparity in accuracy compared to LSTMs seems to lend credence to our earlier hypothesis that learning long-term sequences can interfere with learning to make immediate decisions based on the input from the current time step. The architecture of an LSTM, which maintains a long-term internal state $c_t$ separate from a short-term external state $h_t$, and the additional improvement gained from learning these separately, as opposed to the single common hidden state in GRUs, appears to provide a distinct advantage here.
The improvement achieved by a bootstrapped Elman model can thus be explained by the fact that it suffers from gradient vanishing (Bengio et al., 1994), and so sequence specific information does not affect training to the extent that it does in GRUs.
Our final set of experiments is to investigate whether or not individual gates of LSTMs and GRUs can benefit from this initialization technique. We follow the same initialization and training procedures described previously, and for every gate we also initialize its corresponding bias vectors. We keep the same size and parameters as in section 7.3, and also train baselines with and without pre-trained embeddings.
Bootstrapping individual LSTM gates produces mixed results, especially when considering the difference in performance between the random and pre-trained embeddings experiments.
Full bootstrapping, bootstrapping the j gate or bootstrapping the o gate seem to be the most reliable options based on these results.
Results for bootstrapping individual GRU gates vary drastically, with individual gates performing very differently in their random and pre-trained embedding experiments.
Surprisingly, bootstrapping all GRU gates achieves better results than the GRU baseline for random embeddings, while severely hurting accuracy with pre-trained embeddings. All GRU experiments, bootstrapped or not, still do not perform better than the FFN baseline.
Table 4: Dev set scores on WSJ (SD) for individually bootstrapped LSTM gates
In this paper we have presented a simple and effective LSTM transition-based dependency parser. Its performance rivals that of far more complicated approaches, while still being capable of integrating with them with minimal changes to their architecture.
Additionally, we showed that the application of dropout to the input layer can improve the performance of a network. Like our other contributions here this is simple to apply to other models and is not only limited to the architectures presented in this work.
Finally, we proposed a method of using pre-trained FFNs as initializations for an RNN-based model. We showed that this approach can produce gains in accuracy for both LSTMs and Elman networks, with the final LSTM model surpassing or matching most state-of-the-art LSTM-based models.
This initialization method can potentially be applied to any LSTM-based task where a 1-to-1 relation between inputs and outputs can first be modelled using an FFN. Exploring the effects of this method on other tasks is left for future work.
Table 5: Dev set scores on WSJ (SD) for individually bootstrapped GRU gates
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. 2007. Greedy layer-wise training of deep networks. Advances in neural information processing systems 19:153.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2):157–166.
Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP. pages 740–750.
Xinchi Chen, Yaqian Zhou, Chenxi Zhu, Xipeng Qiu, and Xuanjing Huang. 2015. Transition-based dependency parsing using two heterogeneous gated recursive neural networks. In EMNLP. pages 1879–1889.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC. volume 6, pages 449–454.
Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A Smith. 2015. Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075.
Jeffrey L Elman. 1990. Finding structure in time. Cognitive science 14(2):179–211.
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural computation 12(10):2451–2471.
Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18(7):1527–1554.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. Journal of Machine Learning Research.
Eliyahu Kiperwasser and Yoav Goldberg. 2016a. Easy-first dependency parsing with hierarchical tree LSTMs. arXiv preprint arXiv:1603.00375.
Eliyahu Kiperwasser and Yoav Goldberg. 2016b. Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv preprint arXiv:1603.04351.
Adhiguna Kuncoro, Yuichiro Sawai, Kevin Duh, and Yuji Matsumoto. 2016. Dependency parsing with LSTMs: An empirical evaluation. arXiv preprint arXiv:1604.06529.
Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.
Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics 19(2):313–330.
Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.
Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech. volume 2, page 3.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10). pages 807–814.
Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together. Association for Computational Linguistics, Stroudsburg, PA, USA, IncrementParsing ’04, pages 50–57. http://dl.acm.org/citation.cfm?id=1613148.1613156.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning. pages 1139–1147.
David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. arXiv preprint arXiv:1506.06158.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060.