Top-down Tree Long Short-Term Memory Networks

2015·arXiv

Abstract

1 Introduction

Neural language models have been gaining increasing attention as a competitive alternative to n-grams. The main idea is to represent each word using a real-valued feature vector capturing the contexts in which it occurs. The conditional probability of the next word is then modeled as a smooth function of the feature vectors of the preceding words and the next word. In essence, similar representations are learned for words found in similar contexts resulting in similar predictions for the next word. Previous approaches have mainly employed feed-forward (Bengio et al., 2003; Mnih and Hinton, 2007) and recurrent neural networks (Mikolov et al., 2010; Mikolov, 2012) in order to map the feature vectors of the context words to the distribution for the next word. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhu- ber, 1997; Hochreiter, 1998) have emerged as a popular architecture due to their strong ability to capture long-term dependencies. LSTMs have been successfully applied to a variety of tasks ranging from machine translation (Sutskever et al., 2014), to speech recognition (Graves et al., 2013), and image description generation (Vinyals et al., 2015).

Despite superior performance in many applications, neural language models essentially predict sequences of words. Many NLP tasks, however, exploit syntactic information operating over tree structures (e.g., dependency or constituent trees). In this paper we develop a novel neural network model which combines the advantages of the LSTM architecture and syntactic structure. Our model estimates the probability of a sentence by estimating the generation probability of its dependency tree. Instead of explicitly encoding tree structure as a set of features, we use four LSTM networks to model four types of dependency edges which altogether specify how the tree is built. At each time step, one LSTM is activated which predicts the next word conditioned on the sub-tree generated so far. To learn the representations of the conditioned sub-tree, we force the four LSTMs to share their hidden layers. Our model is also capable of generating trees just by sampling from a trained model and can be seamlessly integrated with text generation applications.

Our approach is related to but ultimately different from recursive neural networks (Pollack, 1990) a class of models which operate on structured inputs. Given a (binary) parse tree, they recursively generate parent representations in a bottom-up fashion, by combining tokens to produce representations for phrases, and eventually the whole sentence. The learned representations can be then used in classi-fication tasks such as sentiment analysis (Socher et al., 2011b) and paraphrase detection (Socher et al., 2011a). Tai et al. (2015) learn distributed representations over syntactic trees by generalizing the LSTM architecture to tree-structured network topologies. The key feature of our model is not so much that it can learn semantic representations of phrases or sentences, but its ability to predict tree structure and estimate its probability.

Syntactic language models have a long history in NLP dating back to Chelba and Jelinek (2000) (see also Roark (2001) and Charniak (2001)). These models differ in how grammar structures in a parsing tree are used when predicting the next word. Other work develops dependency-based language models for specific applications such as machine translation (Shen et al., 2008; Zhang, 2009; Sennrich, 2015), speech recognition (Chelba et al., 1997) or sentence completion (Gubbins and Vlachos, 2013). All instances of these models apply Markov assumptions on the dependency tree, and adopt standard n-gram smoothing methods for reliable parameter estimation. Emami et al. (2003) and Sennrich (2015) estimate the parameters of a structured language model using feed-forward neural networks (Bengio et al., 2003). Mirowski and Vlachos (2015) re-implement the model of Gubbins and Vlachos (2013) with RNNs. They view sentences as sequences of words over a tree. While they ignore the tree structures themselves, we model them explicitly.

Our model shares with other structured-based language models the ability to take dependency information into account. It differs in the following respects: (a) it does not artificially restrict the depth of the dependencies it considers and can thus be viewed as an infinite order dependency language model; (b) it not only estimates the probability of a string but is also capable of generating dependency trees; (c) finally, contrary to previous dependency-based language models which encode syntactic information as features, our model takes tree structure into account more directly via representing different types of dependency edges explicitly using LSTMs. Therefore, there is no need to manually determine which dependency tree features should be used or how large the feature embeddings should be.

We evaluate our model on the MSR sentence completion challenge, a benchmark language modeling dataset. Our results outperform the best published results on this dataset. Since our model is a general tree estimator, we also use it to rerank the top K dependency trees from the (second order) MSTPasrser and obtain performance on par with recently proposed dependency parsers.

2 Tree Long Short-Term Memory Networks

We seek to estimate the probability of a sentence by estimating the generation probability of its dependency tree. Syntactic information in our model is represented in the form of dependency paths. In the following, we first describe our definition of dependency path and based on it explain how the probability of a sentence is estimated.

2.1 Dependency Path

Generally speaking, a dependency path is the path between ROOT and w consisting of the nodes on the path and the edges connecting them. To represent dependency paths, we introduce four types of edges which essentially define the “shape” of a dependency tree. Let w0 denote a node in a tree and w1its left dependents. As shown in Figure 1, LEFT edge is the edge between w0 and its first left dependent denoted as (with 1 ) denote a non-first left dependent of w0. The edge from wkedge (NX stands for NEXT), where wkis the right adjacent sibling of wk. Note that the NX-LEFT edge replaces edge (illustrated with a dashed line in Figure 1) in the original dependency tree. The modification allows information to flow from w0 to wk through w1rather than directly from w0 to wk. RIGHT and NX-RIGHT edges are defined analogously for right dependents.

Given these four types of edges, dependency paths (denoted as D(w)) can be defined as follows

Figure 1: LEFT and NX-LEFT edges. Dotted line between w1 and wk(also between wk and wn) indicate that there may be 0 nodes inbetween.

bearing in mind that the first right dependent of ROOT is its only dependent and that wp denotes the parent of w. We use (...) to denote a sequence, where () is an empty sequence and is an operator for concatenating two sequences.

(a) if w is the first left dependent, then (b) if w is not the first left dependent and ws is its right adjacent sibling, then (3) if w is a right dependent of wp

A dependency tree can be represented by the set of its dependency paths which in turn can be used to reconstruct the original tree.1

Dependency paths for the first two levels of the tree in Figure 2 are as follows (ignoring for the moment the subscripts which we explain in the next section). D(sold) = (see definitions (1) and (3a)), D(manufacturer(see (2b)), (see (3a)), (according to (3b)).

2.2 Tree Probability

The core problem in syntax-based language modeling is to estimate the probability of sentence S given

Figure 2: Dependency tree of the sentence The luxury auto manufacturer last year sold 1,214 cars in the U.S. Subscripts indicate breadth-first traversal. ROOT has only one dependent (i.e., sold) which we view as its first right dependent.

its corresponding tree T, P(S|T). We view the probability computation of a dependency tree as a generation process. Specifically, we assume dependency trees are constructed top-down, in a breadth-first manner. Generation starts at the ROOT node. For each node at each level, first its left dependents are generated from closest to farthest and then the right dependents (again from closest to farthest). The same process is applied to the next node at the same level or a node at the next level. Figure 2 shows the breadth-first traversal of a dependency tree.

Under the assumption that each word w in a dependency tree is only conditioned on its dependency path, the probability of a sentence S given its dependency tree T is:

where D(w) is the dependency path of w. Note that each word w is visited according to its breadth-first search order (BFS(T)) and the probability of ROOT is ignored since every tree has one. The role of ROOT in a dependency tree is the same as the begin of sentence token (BOS) in a sentence. When computing P(S|T) (or P(S)), the probability of ROOT (or BOS) is ignored (we assume it always exists), but is used to predict other words. We explain in the next section how TREELSTM estimates P(w|D(w)).

2.3 Tree LSTMs

A dependency path D(w) is subtree which we denote as a sequence of tuples. Our

Figure 3: Generation process of left (w1) and right (w4) dependents of tree node wo (top) using four LSTMs (GEN-L, GEN-R, GEN-NX-L and GEN-NX-R). The model can handle an arbitrary number of dependents due to GEN-NX-L and GEN-NX-R.

innovation is to learn the representation of D(w) using four LSTMs. The four LSTMs (GEN-L, GEN-R, GEN-NX-L and GEN-NX-R) are used to represent the four types of edges (LEFT, RIGHT, NX-LEFT and NX-RIGHT) introduced earlier. GEN, NX, L and R are shorthands for GENERATE, NEXT, LEFT and RIGHT. At each time step, an LSTM is chosen according to an edge-type; then the LSTM takes a word as input and predicts/generates its dependent or sibling. This process can be also viewed as adding an edge and a node to a tree. Specifi-cally, LSTMs GEN-L and GEN-R are used to generate the first left and right dependent of a node (w1 and w4 in Figure 3). So, these two LSTMs are responsible for going deeper in a tree. While GEN-NX-L and GEN-NX-R generate the remaining left/right dependents and therefore go wider in a tree. As shown in Figure 3, w2 and w3 are generated by GEN-NX-L, whereas w5 and w6 are generated by GEN-NX-R. Note that the model can handle any number of left or right dependents by applying GEN-NX-L or GEN-NX-R multiple times.

We assume time steps correspond to the steps taken by the breadth-first traversal of the dependency tree and the sentence has length n. At time step t (1 denote the last tuple in breadth-first search order of wt and wt, respectively. zt edge type (see the definitions in Section 2.1). Let We denote the word embedding matrix and Who the output matrix of our model, where |V| is the vocabulary size, s the word embedding size and d the hidden unit size. We use tied We and tied Who for the four LSTMs to reduce the number of parameters in our model. The four LSTMs also share their hidden states. Let H shared hidden states of all time steps and eone-hot vector of wt. Then, H[:,t] represents at time step t, and the computation2 is:

where the initial hidden state H[:,0] is initialized to a vector of small values such as 0.01. According to Equation (2b), the model selects an LSTM based on edge type zt. We describe the details of LSTMzt in the next paragraph. The probability of wt given its dependency path is estimated by a softmax function:

We must point out that although we use four jointly trained LSTMs to encode the hidden states, the training and inference complexity of our model is no different from a regular LSTM, since at each time step only one LSTM is working.

We implement LSTMz in Equation (2b) using a deep LSTM (to simplify notation, from now on we write z instead of zt). The inputs at time step t are xt and ht(the hidden state of an earlier time step t) and the output is ht (the hidden state of current time step). Let L denote the layer number of LSTMz and ˆhlt the internal hidden state of the l-th layer of the LSTMz at time step t, where xt is ˆh0t and ht. The LSTM architecture introduces mul- tiplicative gates and memory cells ˆclt (at l-th layer) in order to address the vanishing gradient problem which makes it difficult for the standard RNN model to learn long-distance correlations in a sequence. Here, ˆclt is a linear combination of the current input signal ut and an earlier memory cell ˆcltinput information ut will flow into ˆclt is controlled

Figure 4: Generation of left and right dependents of node w0 according to LDTREELSTM.

by input gate it and how much of the earlier memory cell ˆcltwill be forgotten is controlled by forget gate ft. This process is computed as follows:

ut it ft ˆclt

where WzWzare weight matrices for ut, WzWzare weight matrices for it and Wzare weight matrices for ft. is a sigmoid function and the element-wise product.

Output gate ot controls how much information of the cell ˆclt can be seen by other modules:

Application of the above process to all layers L, will yield ˆhLt , which is ht. Note that in implementation, all ˆclt and ˆhlt (1 ) at time step t are stored, although we only care about ˆhLt (ht).

2.4 Left Dependent Tree LSTMs

TREELSTM computes P(w|D(w)) based on the dependency path D(w), which ignores the interaction between left and right dependents on the same level. In many cases, TREELSTM will use a verb to predict its object directly without knowing its subject. For example, in Figure 2, TREELSTM uses RIGHTto predict cars. This information is unfortunately not specific to cars (many things can be sold, e.g., chocolates, candy). Considering manufacturer, the left dependent of sold would help predict cars more accurately.

In order to jointly take left and right dependents into account, we employ yet another LSTM, which goes from the furthest left dependent to the closest left dependent (LD is a shorthand for left dependent). As shown in Figure 4, LD LSTM learns the representation of all left dependents of a node w0; this representation is then used to predict the first right dependent of the same node. Non-first right dependents can also leverage the representation of left dependents, since this information is injected into the hidden state of the first right dependent and can percolate all the way. Note that in order to retain the generation capability of our model (Section 3.4), we only allow right dependents to leverage left dependents (they are generated before right dependents).

The computation of the LDTREELSTM is almost the same as in TREELSTM except when zt In this case, let vt be the corresponding left dependent sequence with length K (vt in Figure 4). Then, the hidden state (qk) of vt at each time step k is:

where qK is the representation for all left dependents. Then, the computation of the current hidden state becomes (see Equation (2) for the original computation):

where qK serves as additional input for LSTMGEN-R. All other computational details are the same as in TreeLSTM (see Section 2.3).

2.5 Model Training

On small scale datasets we employ Negative Loglikelihood (NLL) as our training objective for both TREELSTM and LDTREELSTM:

where S is a sentence in the training set S, T is the dependency tree of S and P(S|T) is defined as in Equation (1).

On large scale datasets (e.g., with vocabulary size of 65K), computing the output layer activations and the softmax function with NLL would become prohibitively expensive. Instead, we employ Noise Contrastive Estimation (NCE; Gutmann and Hyv¨arinen (2012), Mnih and Teh (2012)) which treats the normalization term ˆZ in ˆPexp

ˆZ as constant. The intuition behind NCE is to discriminate between samples from a data distribution ˆPand a known noise distribution Pn(w) via binary logistic regression. Assuming that noise words are k times more frequent than real words in the training set (Mnih and Teh, 2012), then the probability of a word w being from our model Pd. We apply NCE to large vocabulary models with the following training objective:

where ˜wtis a word sampled from the noise distribution Pn(w). We use smoothed unigram frequencies (exponentiating by 0.75) as the noise distribution Pn(w) (Mikolov et al., 2013b). We initialize ln ˆZ = 9 as suggested in Chen et al. (2015), but instead of keeping it fixed we also learn ˆZ during training (Vaswani et al., 2013). We set k = 20.

3 Experiments

We assess the performance of our model on two tasks: the Microsoft Research (MSR) sentence completion challenge (Zweig and Burges, 2012), and dependency parsing reranking. We also demonstrate the tree generation capability of our models. In the following, we first present details on model training and then present our results. We implemented our models using the Torch library (Collobert et al., 2011) and our code is available at https:// github.com/XingxingZhang/td-treelstm.

3.1 Training Details

We trained our model with back propagation through time (Rumelhart et al., 1988) on an Nvidia GPU Card with a mini-batch size of 64. The objective (NLL or NCE) was minimized by stochastic gradient descent. Model parameters were uniformly initialized in . We used the NCE objective on the MSR sentence completion task (due to the large size of this dataset) and the NLL objective on dependency parsing reranking. We used an initial learning rate of 1.0 for all experiments and when there was no significant improvement in loglikelihood on the validation set, the learning rate was divided by 2 per epoch until convergence (Mikolov et al., 2010). To alleviate the exploding gradients problem, we rescaled the gradient g when the gradient norm (Pascanu et al., 2013; Sutskever et al., 2014). Dropout (Srivastava et al., 2014) was applied to the 2-layer TREELSTM and LDTREELSTM models. The word embedding size was set to s = d/2 where d is the hidden unit size.

3.2 Microsoft Sentence Completion Challenge

The task in the MSR Sentence Completion Challenge (Zweig and Burges, 2012) is to select the correct missing word for 1,040 SAT-style test sentences when presented with five candidate completions. The training set contains 522 novels from the Project Gutenberg which we preprocessed as follows. After removing headers and footers from the files, we tokenized and parsed the dataset into dependency trees with the Stanford Core NLP toolkit (Manning et al., 2014). The resulting training set contained 49M words. We converted all words to lower case and replaced those occurring five times or less with UNK. The resulting vocabulary size was 65,346 words. We randomly sampled 4,000 sentences from the training set as our validation set.

The literature describes two main approaches to the sentence completion task based on word vectors and language models. In vector-based approaches, all words in the sentence and the five candidate words are represented by a vector; the candidate which has the highest average similarity with the sentence words is selected as the answer. For language model-based methods, the LM computes the probability of a test sentence with each of the five candidate words, and picks the candidate completion which gives the highest probability. Our model belongs to this class of models.

Table 1: Model accuracy on the MSR sentence completion task. The results of KN5, RNNME and RNNMEs are reported in Mikolov (2012), LSA and RNN in Zweig et al. (2012), UDepNgram and LDepNgram in Gubbins and Vlachos (2013), depRNN+3gram and depRNN+4gram in Mirowski and Vlachos (2015), LBL in Mnih and Teh (2012), Skip-gram and Skipgram+RNNMEs in Mikolov et al. (2013a), and IVLBL in Mnih and Kavukcuoglu (2013); d is the hidden size and ber of parameters in a model.

Table 1 presents a summary of our results together with previoulsy published results. The best performing word vector model is IVLBL (Mnih and Kavukcuoglu, 2013) with an accuracy of 55.5, while the best performing single language model is LBL (Mnih and Teh, 2012) with an accuracy of 54.7. Both approaches are based on the log-bilinear language model (Mnih and Hinton, 2007). A combination of several recurrent neural networks and the skip-gram model holds the state of the art with an accuracy of 58.9 (Mikolov et al., 2013b). To fairly compare with existing models, we restrict the layer

Table 2: Performance of TREELSTM and LDTREELSTM on

reranking the top dependency trees produced by the 2nd order MSTParser (McDonald and Pereira, 2006). Results for the NN and S-LSTM parsers are reported in Chen and Manning (2014) and Dyer et al. (2015), respectively. * indicates that the model is initialized with pre-trained word vectors.

size of our models to 1. We observe that LDTREELSTM consistently outperforms TREELSTM, which indicates the importance of modeling the interaction between left and right dependents. In fact, LDTREELSTM (d = 400) achieves a new state-of-the-art on this task, despite being a single model. We also implement LSTM and bidirectional LSTM language models.3 An LSTM with d = 400 outperforms its smaller counterpart (d = 300), however performance decreases with d = 450. The bidirectional LSTM is worse than the LSTM (see Mnih and Teh (2012) for a similar observation). The best performing LSTM is worse than a LDTREELSTM (d = 300). The input and output embeddings (We and Who) dominate the number of parameters in all neural models except for RNNME, depRNN+3gram and ldepRNN+4gram, which include a ME model that contains 1 billion sparse n-gram features (Mikolov, 2012; Mirowski and Vlachos, 2015). The number of parameters in TREELSTM and LDTREELSTM is not much larger compared to LSTM due to the tied We and Who matrices.

3.3 Dependency Parsing

In this section we demonstrate that our model can be also used for parse reranking. This is not possible for sequence-based language models since they cannot estimate the probability of a tree. We use our models to rerank the top K dependency trees produced by the second order MSTParser (McDonald and Pereira, 2006).4 We follow closely the experimental setup of Chen and Manning (2014) and Dyer et al. (2015). Specifically, we trained TREELSTM and LDTREELSTM on Penn Treebank sections 2–21. We used section 22 for development and section 23 for testing. We adopted the Stanford basic dependency representations (De Marneffe et al., 2006); part-of-speech tags were predicted with the Stanford Tagger (Toutanova et al., 2003). We trained TREELSTM and LDTREELSTM as language models (singletons were replaced with UNK) and did not use any POS tags, dependency labels or composition features, whereas these features are used in Chen and Manning (2014) and Dyer et al. (2015). We tuned d, the number of layers, and K on the development set.

Table 2 reports unlabeled attachment scores (UAS) and labeled attachment scores (LAS) for the MSTParser, TREELSTM (d = 300, 1 layer, K 200, 2 layers, K = 4). We also include the performance of two neural network-based dependency parsers; Chen and Manning (2014) use a neural network classifier to predict the correct transition (NN parser); Dyer et al. (2015) also implement a transition-based dependency parser using LSTMs to represent the contents of the stack and buffer in a continuous space. As can be seen, both TREELSTM and LDTREELSTM outperform the baseline MSTParser, with LDTREELSTM performing best. We also initialized the word embedding matrix We with pre-trained GLOVE vectors (Pennington et al., 2014). We obtained a slight improvement over TREELSTM (TREELSTM* in Table 2; d = 200, 2 layer, K = 4) but no improvement over LDTREELSTM. Finally, notice that LDTREELSTM is slightly better than the NN parser in terms of UAS but worse than the S-LSTM parser. In the future, we would like to extend our model so that it takes labeled dependency information into account.

3.4 Tree Generation

This section demonstrates how to use a trained LDTREELSTM to generate tree samples. The generation starts at the ROOT node. At each time step t, for each node wt, we add a new edge and node to

Figure 5: Generated dependency trees with LDTREELSTM trained on the PTB.

the tree. Unfortunately during generation, we do not know which type of edge to add. We therefore use four binary classifiers (ADD-LEFT, ADD-RIGHT, ADD-NX-LEFT and ADD-NX-RIGHT) to predict whether we should add a LEFT, RIGHT, NX-LEFT or NX-RIGHT edge.5 Then when a classifier predicts true, we use the corresponding LSTM to generate a new node by sampling from the predicted word distribution in Equation (3). The four classifiers take the previous hidden state Hand the output embedding of the current node Who as features.6 Specifically, we use a trained LDTREELSTM to go through the training corpus and generate hidden states and embeddings as input features; the corresponding class labels (true and false) are “read off” the training dependency trees. We use two-layer rec-tifier networks (Glorot et al., 2011) as the four clas-sifiers with a hidden size of 300. We use the same LDTREELSTM model as in Section 3.3 to generate dependency trees. The classifiers were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.01. The accuracies of ADD-LEFT, ADD-RIGHT, ADD-NX-LEFT and ADD-NX-RIGHT are 94.3%, 92.6%, 93.4% and 96.0%, respectively. Figure 5 shows examples of generated trees.

4 Conclusions

In this paper we developed TREELSTM (and LDTREELSTM), a neural network model architecture, which is designed to predict tree structures rather than linear sequences. Experimental results on the MSR sentence completion task show that LDTREELSTM is superior to sequential LSTMs. Dependency parsing reranking experiments highlight our model’s potential for dependency parsing. Finally, the ability of our model to generate dependency trees holds promise for text generation applications such as sentence compression and simplification (Filippova et al., 2015). Although our experiments have focused exclusively on dependency trees, there is nothing inherent in our formulation that disallows its application to other types of tree structure such as constituent trees or even taxonomies.

Acknowledgments

We would like to thank Adam Lopez, Frank Keller, Iain Murray, Li Dong, Brian Roark, and the NAACL reviewers for their valuable feedback. Xingxing Zhang gratefully acknowledges the financial support of the China Scholarship Council (CSC). Liang Lu is funded by the UK EPSRC Programme Grant EP/I031022/1, Natural Speech Technology (NST).

References

[Bengio et al.2003] Yoshua Bengio, R´ejean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

[Charniak2001] Eugene Charniak. 2001. Immediatehead parsing for language models. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 124–131. Association for Computational Linguistics.

[Chelba and Jelinek2000] Ciprian Chelba and Frederick Jelinek. 2000. Structured language modeling. Computer Speech and Language, 14(4):283–332.

[Chelba et al.1997] Ciprian Chelba, David Engle, Freder- ick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld,

Andreas Stolcke, et al. 1997. Structure and performance of a dependency language model. In EUROSPEECH. Citeseer.

[Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. Association for Computational Linguistics.

[Chen et al.2015] X Chen, X Liu, MJF Gales, and PC Woodland. 2015. Recurrent neural network language model training with noise contrastive estimation for speech recognition. In In 40th IEEE International Conference on Accoustics, Speech and Signal Processing, pages 5401–5405, Brisbane, Australia.

[Collobert et al.2011] Ronan Collobert, Koray Kavukcuoglu, and Cl´ement Farabet. 2011. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376.

[De Marneffe et al.2006] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China, July. Association for Computational Linguistics.

[Eisner1996] Jason M Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pages 340–345. Association for Computational Linguistics.

[Emami et al.2003] Ahmad Emami, Peng Xu, and Fred- erick Jelinek. 2003. Using a connectionist model in a syntactical based language model. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 372–375, Hong Kong, China.

[Filippova et al.2015] Katja Filippova, Enrique Alfon- seca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with lstms. In EMNLP, pages 360–368.

[Glorot et al.2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.

[Graves et al.2013] Alan Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE.

[Gubbins and Vlachos2013] Joseph Gubbins and Andreas Vlachos. 2013. Dependency language models for sentence completion. In EMNLP, pages 1405–1410, Seattle, Washington, USA, October. Association for Computational Linguistics.

[Gutmann and Hyv¨arinen2012] Michael U Gutmann and Aapo Hyv¨arinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

[Hochreiter1998] Sepp Hochreiter. 1998. Vanishing gra- dient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 6(2):107–116.

[Manning et al.2014] Christopher D Manning, Mihai Sur- deanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

[McDonald and Pereira2006] Ryan T McDonald and Fer- nando CN Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

[Mikolov et al.2010] Tomas Mikolov, Martin Karafi´at, Lukas Burget, Jan Cernock`y, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the 2013 International Conference on Learning Representations, Scottsdale, Arizona, USA.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

[Mikolov2012] Tomas Mikolov. 2012. Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.

[Mirowski and Vlachos2015] Piotr Mirowski and An- dreas Vlachos. 2015. Dependency recurrent neural language models for sentence completion. In ACL, pages 511–517, Beijing, China, July. Association for Computational Linguistics.

[Mnih and Hinton2007] Andriy Mnih and Geoffrey Hin- ton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648.

[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings effi-ciently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pages 2265–2273.

[Mnih and Teh2012] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751–1758, Edinburgh, Scotland.

[Pascanu et al.2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1310–1318, Atlanta, Georgia, USA.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. EMNLP, 12:1532–1543.

[Pollack1990] Jordan B. Pollack. 1990. Recursive dis- tributed representations. Artificial Intelligence, 1– 2(46):77–105.

[Roark2001] Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational linguistics, 27(2):249–276.

[Rumelhart et al.1988] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive modeling, 5:3.

[Sennrich2015] Rico Sennrich. 2015. Modelling and op- timizing on syntactic n-grams for statistical machine translation. Transactions of the Association for Computational Linguistics, 3:169–182.

[Shen et al.2008] Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A new string-to-dependency machine translation algorithm with a target dependency language model. In Proceedings of ACL-08: HLT, pages 577–585, Columbus, Ohio, USA.

[Socher et al.2011a] Richard Socher, Eric H. Huang, Jef- frey Pennington, Christopher D. Manning, and Andrew Ng. 2011a. Dynamic pooling and unfolding

recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

[Socher et al.2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, Edinburgh, Scotland, UK.

[Srivastava et al.2014] Nitish Srivastava, Geoffrey Hin- ton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.

[Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.

[Vaswani et al.2013] Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, Washington, USA.

[Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, USA.

[Zhang2009] Ying Zhang. 2009. Structured language models for statistical machine translation. Ph.D. thesis, Johns Hopkins University.

[Zweig and Burges2012] Geoffrey Zweig and Chris J.C. Burges. 2012. A challenge set for advancing language modeling. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram

Model? On the Future of Language Modeling for HLT, pages 29–36, Montr´eal, Canada.

[Zweig et al.2012] Geoffrey Zweig, John C Platt, Christo- pher Meek, Christopher JC Burges, Ainur Yessenalina, and Qiang Liu. 2012. Computational approaches to sentence completion. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 601–610. Association for Computational Linguistics.

Designed for Accessibility and to further Open Science