Towards Formula Translation using Recursive Neural Networks

2018·Arxiv

Abstract

1 Introduction

Most mathematical formulae are denoted in presentation language (PL) [1] for the purpose of displaying them. However, PL does not allow machine-interpretation of the formula, i.e., any semantic information contained in a given formula cannot be understood by Computer Algebra Systems (CAS). One reason for this limitation is the ambiguity of PL. For example, x̄ (\bar{x}) may be the mean of x, the complex conjugate of x, or it may take on an array of other meanings. On the other hand, for humans, the machine-readable unambiguous notations of content language (CL) are often more laborious to produce, read, and remember. Therefore, a reliable machine translation between PL and CL is a crucial step towards the automated machine-interpretation of mathematical notation that is found in academic and technical documents.

The National Institute of Standards and Technology (NIST), the Digital Library of Mathematical Functions (DLMF), and the Digital Repository of Mathematical Formulae (DRMF) developed a set of LaTeX macros that allow an unambiguous mathematical notation within LaTeX [2]. We refer to this notation as semantic LaTeX and refer to the standard LaTeX maths notation as generic LaTeX. It is possible to translate from semantic LaTeX to generic LaTeX using LaTeXML. However, the reverse translation from generic to semantic LaTeX is not solved yet. To achieve this reverse translation, the disambiguation of mathematical formulae is mandatory.

Thus, we experimented with a standalone neural disambiguation that classifies which semantic LaTeX macro is most likely to be correct. To use this approach, we would require a rule-based translator that utilizes the neural network only in ambiguous cases.

Rule-based translators are not readily transferable to other mathematical languages. Furthermore, they require a very complex program that depends on the input and output language and are also hard to maintain if the languages are extended or a new language should be added.

Therefore, we use machine learning methods to build a full translator as announced in [3]. How well implicit mathematical disambiguation within a machine learning-based translator performs remains a subject of future research. We expect that our new approach outperforms existing approaches when it comes to the implicit mathematical disambiguation, since simple neural networks already yielded significantly better results than traditional approaches on similar tasks as discussed below.

Recently, recurrent neural networks with sequence-to-sequence models yielded state-of-the-art results on natural language translation tasks [4]. In natural language machine translation, even an inaccurate paraphrasing translation is sufficient to satisfy a user’s most basic information needs. In contrast, inaccurate translations in mathematical machine translation are most likely useless and misleading for the typical mathematical information needs. Thus, for mathematical formulae, a precise translation is mandatory.

As sequential recurrent neural networks do not utilize the tree-like structure of input and output, they lack precision in the domain of structure.

An alternative to recurrent neural networks is recursive neural networks, which are not linear but have a tree-like structure.

By building a recursive neural network, the neural network explicitly receives information on the location of the respective words and symbols. As soon as this information is relevant for the translation, a recursive neural network can yield better results than a recurrent neural network. Further, recursive neural networks have a lower depth than recurrent neural networks. Thus, the problem of exploding or vanishing gradients is reduced.

In most cases, no structural information about natural language is given. Therefore, it is required to deduce a structure for these sentences. Finding such a structure is ambiguous and error-prone. Mathematical formulae often allow a straightforward derivation of the semantic structure or are already given in a structured format. Thus, we believe that utilizing recursive neural networks is a promising approach for the automatic translation of mathematical formulae.

In summary, our research contributes towards:

• the disambiguation of mathematical formulae, and

• the neural machine translation of mathematical formulae.

Furthermore, our research yields the following contributions for recursive neural networks:

• faster training by using a semi-batched computation, and

• a local minima avoiding loss function for decoders.

The contributions for recursive neural networks may also be applicable to other machine-learning tasks, including sentiment analysis.

We structure the presentation of our contributions as follows. Section 2 presents an overview of prior research. Section 3 states the problem of translation and presents our neural network approach. Section 4 presents the applied preprocessing steps for the formulae. Section 5 presents our evaluation techniques and current results. Section 6 discusses our results and conclusions. Finally, avenues for future work are presented in section 7.

2 Related Work

To the best of our knowledge, no research towards machine learning translation for mathematical formulae between different representations has been performed thus far. Our research towards machine translation of mathematical formulae will enable future work on various topics from automated checking of mathematical formulae to plagiarism detection.

There have already been rule-based attempts to translate between mathematical languages. There exists a rule-based translator from semantic LaTeX to CASs (e.g., Maple) [5]. Thus, by translating from generic to semantic LaTeX, we also obtain a notation that is readable by CASs. The semantic LaTeX to Maple translator achieved an accuracy of 53.59% on correctly translating 4 165 test equations from the DLMF. The accuracy of the semantic LaTeX to CAS translator is relatively low because of the high complexity of these test equations, and because most of the functions which are represented by a DLMF/DRMF LaTeX macro are not defined in Maple [5]. Our approach may act as a substitute for this translator in cases where the aspired CAS is not yet implemented. The semantic LaTeX to CAS translator may also be used as a reference for our approach.

State-of-the-art for language translation uses recurrent neural networks employing, e.g., Long Short-Term Memory (LSTM) cells in sequence-to-sequence networks [6].

Recently, recursive neural networks have also been used for the translation of structured data in tree-to-tree networks. Chen et al. used tree-to-tree networks for program translation [7]. Recursive neural networks have mainly been used for different machine learning tasks, such as sentiment analysis of the Stanford sentiment tree bank [8], or classification of collimated sprays of energetic hadrons in quantum chromodynamics [9]. The majority of tasks do not require an output of arbitrary size and structure but rather a classification. Therefore, these neural networks only need a recursive encoder. Thus, only a minority of research has been performed on recursive decoders. Local minima of loss functions often cause neural networks to stop learning or even lead them to predict only one constant value. Thus, we propose a local minimum avoiding loss function for the training of recursive decoders.

Tai et al. [8] presented tree-structured LSTM networks. LSTMs are building blocks of neural networks that are composed of a cell, an input gate, an output gate and a forget gate. Their purpose is to remember values over a long time and avoid the problem of exploding gradients utilizing their forget gate. We extend this approach using multi-variate multi-valued LSTMs in order to build recursive LSTM networks with a recursive encoder, as well as a recursive decoder.

This means that we use a neural network whose encoder and decoder have the same topology as the binary parse tree of the input formula and the output formula, respectively. Therefore, the structure of the input defines the topology of the neural network. Thus, in general, for the training of recursive neural networks, a batched computation is not directly possible. A reason for that is that trees with different topologies usually cannot share one neural network [10]. Alternatively, the neural network could be made big enough to capture the whole dataset, which would result in neural networks with tens or hundreds of thousands of LSTM-units. Since one of the main issues of neural networks is long training times, a speedup of the training is essential. Recursive neural networks especially suffer from long training times since a batched computation is not directly possible [10]. Thus, we propose a new technique that enables semi-batched computation for recursive neural networks utilizing clustering and tree traversal.

3 Method

In this section, we will present the problem of translation of trees and give an overview of our approach. Further, we will present our method for a standalone disambiguation and comment on our implementation.

Figure 1: Overview of the model pipeline.

First, we split our data into a training set (e.g., 90%) and a validation set (e.g., 10%). Second, we apply data augmentation on the training set to make it more robust (cf. Section 4.1). Third, we transform the formulae into trees (cf. Section 4.2) and cluster the formula trees (cf. Section 4.3). Finally, we train our neural network on the training data set (cf. Section 3.2.1) and evaluate it on the validation data set (cf. Section 5).

3.1 Problem Statement

Let x ∈ X be the binary parse tree of a generic LaTeX formula. Further, let y ∈ Y be the respective binary parse tree of the semantic LaTeX translation of x. In these binary trees, only the leaves are labeled with positive integers corresponding to the words and symbols of the formula; all other nodes are labeled as zero. Further, for a binary tree a with a ≥ x, let x_a be the tree with the topology of a and the values of x, padded with a padding tag with respect to the topology of a. Here, “a ≥ b” means that for each node in b there is a node in a at the same position in the tree. The padding is performed such that a position is tagged with the padding tag iff there is no node at the respective position in x. We want to find a mapping f : X → Y. This mapping has to be derivable with respect to the tree topologies, the definition of our recursive neural network, and the trained weight and bias matrices. Therefore, we create a neural network and generate a loss as a distance measure between the ground truth and the prediction. By minimizing the loss, we get an improved mapping.

3.2 Approach

Figure 1 gives an overview of our model pipeline. In the following, we will describe our recursive neural network approach as shown in Figure 2. Here, the recursive topology of these LSTM networks depends on the structure of the input tree. In the following, we will use terminology from the “Deep Learning” book by Goodfellow et al. [11].

The input of our neural network is a batch of trees with the same topology. The ground truth of our neural network is a batch of trees with the same topology.

A one-hot encoding represents the values of the nodes of the trees. The encoder (upper half of Figure 2) is responsible for converting the binary tree into a hidden state, while the decoder (lower half of Figure 2) is responsible for converting a hidden state into a new binary tree.

For the encoder, we recursively obtain a hidden state by feeding the hidden states of the left and right child as well as the value of the node to one two-variate (i.e., binary) LSTM. The hidden states of the leaves are initialized as zero. At the root, we thus obtain a hidden state that represents the whole input tree.
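The recursive encoding can be sketched as follows. This is a minimal illustration only, assuming a binary Tree-LSTM cell in the style of Tai et al. [8] with one forget gate per child; the exact gate layout of the two-variate LSTM is not specified here. All names (`init_params`, `cell`, `encode`), the sizes, the random initialization, and the choice to give leaves zero-valued child states are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(vocab, hidden, seed=0):
    """Hypothetical parameter container: one weight block per gate
    (input, forget-left, forget-right, output, update)."""
    rng = np.random.default_rng(seed)
    shape = (5 * hidden, vocab + 2 * hidden)
    return {"W": rng.normal(0.0, 0.1, shape), "b": np.zeros(5 * hidden)}

def cell(params, x, left, right):
    """Combine the node value x with the (h, c) states of both children."""
    (h_l, c_l), (h_r, c_r) = left, right
    z = params["W"] @ np.concatenate([x, h_l, h_r]) + params["b"]
    i, f_l, f_r, o, u = np.split(z, 5)
    c = sigmoid(i) * np.tanh(u) + sigmoid(f_l) * c_l + sigmoid(f_r) * c_r
    h = sigmoid(o) * np.tanh(c)
    return h, c

def encode(params, tree, vocab, hidden):
    """tree is None (no node), an int label (leaf), or a (left, right) tuple."""
    zero = (np.zeros(hidden), np.zeros(hidden))
    if tree is None:
        return zero
    x = np.zeros(vocab)
    if isinstance(tree, tuple):
        x[0] = 1.0  # inner nodes carry the label zero
        left = encode(params, tree[0], vocab, hidden)
        right = encode(params, tree[1], vocab, hidden)
    else:
        x[tree] = 1.0  # one-hot encoding of the leaf value
        left = right = zero  # assumption: leaves see zero child states
    return cell(params, x, left, right)

params = init_params(vocab=5, hidden=4)
h_root, c_root = encode(params, ((1, 2), 3), vocab=5, hidden=4)
print(h_root.shape)
```

The state returned for the root is the vector that would be handed to the decoder.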

We may apply a quasi-linear function to this hidden state; whether this improves the translation is not yet determined. However, such a function is mandatory if the hidden state sizes of the encoder and the decoder differ.

For the decoder, we use one two-valued LSTM to generate the hidden state of the left and right child and predict the value of the node. Further, the decoder is fed with the prediction of its parent’s value. This two-valued LSTM is equivalent to two separate LSTMs, one LSTM being responsible for generating the left output and the other for the right output.

Figure 2: Scheme of the tree-to-tree neural network (figure based on a figure from [9])

Since we apply an additional tree traversal on the binary trees for the purpose of obtaining a better clustering, we need to consider this traversal inside the neural network. Each time we take or obtain two children, we look up the stored traversal information T, where T = 0 if the children were not exchanged and T = 1 if the children were exchanged in the preprocessing step. Then, we replace the left and right children’s weight matrices and the left and right children’s hidden states h accordingly: they are kept as they are for T = 0 and exchanged for T = 1. By this replacement, the neural network obtains the correct trees, while the topology of the neural network is independent of a clustering improving traversal.
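The effect of the stored traversal information can be sketched as follows. This is a minimal illustration, assuming that the replacement is a plain swap applied to one pair of per-child quantities (for example, the two children's hidden states, or alternatively the two weight matrices); the helper name `restore_order` is hypothetical.

```python
def restore_order(T, left, right):
    """Undo the preprocessing exchange for a pair of per-child quantities.

    T = 0: the children were not exchanged, so the pair is returned as is.
    T = 1: the children were exchanged, so the pair is swapped back.
    """
    if T not in (0, 1):
        raise ValueError("T must be 0 or 1")
    return (right, left) if T == 1 else (left, right)

# Placeholder strings stand in for hidden-state vectors.
print(restore_order(0, "h_first", "h_second"))
print(restore_order(1, "h_first", "h_second"))
```

With such a helper, the cell itself can keep fixed left and right weights while the stored trees remain in the clustering-friendly order.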

We use this structure per layer, while the hidden state of a previous layer is the input of the next layer. The input of the first layer of the encoder is the embedding of the respective value in the input tree. The output of the highest layer is used for the prediction. The prediction is computed by the application of two embeddings on the output of a respective LSTM. We initialize the weight matrices using the Xavier initializer [12] and the biases to zero.

3.2.1 Training

In this section, we describe how we trained the aforementioned recursive neural network. We use the clusters of trees as mini-batches. As loss function, we use softmax cross-entropy.

Since the actual formula is only a subtree of the input for the neural networks, with a size of roughly 30 to 60% of that input, most nodes are padded with zeros and padding tags. Consequently, the loss is dominated by the prediction of the padding zeros and padding tags. Thus, the neural network quickly learns the local minimum of only predicting these recurring values.

Furthermore, the high occurrence frequency of tokens such as “^” in mathematical formulae yields a strong bias towards predicting these tokens.

Therefore, we mask the loss function to lower the influence of commonly occurring symbols, such as parentheses, on the learning process. We achieve this by element-wise multiplication of the losses with a mask that depends on the prediction ŷ and the correct translation y. A first parameter with values in [0, 1] sets the strength of the masking. While a value of 0 causes the mask to have no influence, a value of 1 causes the influence of a loss to be the reciprocal of the number of occurrences of its symbol in the mini-batch. A second parameter is responsible for a reduction of the influence iff the prediction is correct. Setting this second parameter too low may cause oscillation.
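One way such a mask could look is sketched below. The exact formula is not reproduced here, so the functional form is only an assumption chosen to match the described behavior: a strength parameter (called `alpha` in the sketch) scales each loss by the symbol's count in the mini-batch raised to the power `-alpha`, and a second factor (called `beta`) reduces the weight of positions that are already predicted correctly. Both parameter names and both function names are hypothetical.

```python
from collections import Counter

def loss_mask(targets, predictions, alpha=1.0, beta=0.5):
    """Per-position weights for a flat list of target and predicted symbols."""
    counts = Counter(targets)  # occurrences of each symbol in the mini-batch
    mask = []
    for y, y_hat in zip(targets, predictions):
        weight = counts[y] ** (-alpha)  # alpha=0 -> 1, alpha=1 -> 1/count
        if y_hat == y:
            weight *= beta  # reduce the influence of correct predictions
        mask.append(weight)
    return mask

def masked_loss(losses, mask):
    """Element-wise product of losses and mask, reduced by a sum."""
    return sum(l * m for l, m in zip(losses, mask))

targets = [0, 0, 0, 0, 7]
predictions = [0, 0, 0, 1, 7]
print(loss_mask(targets, predictions, alpha=1.0, beta=0.5))
```

In this toy batch, the frequent symbol 0 receives a quarter of the weight of the rare symbol 7 before the correctness factor is applied.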

Because the speed of learning should be independent of the input tree’s size and the local mini-batch size, we use reduce_sum instead of reduce_mean in order to weight every node equally and independently of the mini-batch in which it is included. The advantages of the common reduce_mean, namely an easier search for a learning rate and better numeric stability, are outweighed in this case.

For training, we use the RMSProp optimizer [13] and the Adam optimizer [14] with learning rates between 10and 10. The learning rate for the Adam optimizer has to be lower than for the RMSProp optimizer. As the optimal dropout, we found that 5% performs best.

As an alternative to directly feeding the encoder with the input formula and the decoder with the output formula, we can use two auto-encoders, one for the input formula and another one for the output formula. This means that the encoder, as well as the decoder, is fed with the same values. Thus, the auto-encoders are trained to simulate the identity function. Because the information has to be compressed to be preserved while being propagated through the neural network as a state of lower size, a fitting representation is learned. Subsequently, we can combine the auto-encoder of the input formula as the encoder and the auto-encoder of the output formula as the decoder to a new neural network. By having translators from an input tree to a vector and from another vector to an output tree, we can easily train a mapping between these representations.

3.2.2 Standalone disambiguation

Since mathematical notations, as well as an element-wise mapping from generic to semantic LaTeX, are ambiguous, we implemented a standalone disambiguation. Given a generic LaTeX symbol and a bag-of-words of the formula, this disambiguation decides which semantic LaTeX macro is correct.

We restrict the classification to predictions that make sense. That is, we choose only between those semantic LaTeX macros that would be re-translated into the same generic LaTeX symbol.

We trained several neural networks with one to five hidden layers on this classification task.
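The restriction to admissible macros can be sketched as follows. This is a minimal illustration, not the trained classifier: the candidate table, the macro names (which are placeholders and not actual DLMF/DRMF macros), and the hand-written `toy_score` function standing in for a neural network are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical candidate table: generic symbol -> semantic macros that
# would be re-translated into that symbol (placeholder names).
CANDIDATES = {
    "\\Gamma": ["GAMMA_FUNCTION", "CHRISTOFFEL_SYMBOL"],
    "\\pi": ["CONSTANT_PI"],
}

def bag_of_words(tokens):
    """Unordered token counts of the whole formula."""
    return Counter(tokens)

def disambiguate(symbol, tokens, score):
    """Pick the best-scoring macro among the admissible candidates only."""
    bag = bag_of_words(tokens)
    return max(CANDIDATES[symbol], key=lambda macro: score(macro, bag))

def toy_score(macro, bag):
    """Stand-in for a trained network: hand-written weights, illustration only."""
    weights = {("GAMMA_FUNCTION", "z"): 1.0, ("CHRISTOFFEL_SYMBOL", "_"): 1.0}
    return sum(weights.get((macro, tok), 0.0) * n for tok, n in bag.items())

print(disambiguate("\\Gamma", ["\\Gamma", "(", "z", ")"], toy_score))
```

A symbol with a single candidate, such as the `\pi` entry above, is resolved without consulting the scores at all, which is consistent with the observation that some generic symbols are restricted to a single semantic macro.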

3.2.3 Implementation

At first, we implemented a single-layer version of the neural network in TensorFlow. We found that in our case plain Python resulted in a speed increase of about 10 to 50 times compared to TensorFlow. TensorFlow was slow because repeatedly changing the neural network topology takes a long time in TensorFlow. Thus, a majority of the training time was spent on creating the neural network, because we needed to recreate it for each mini-batch. Further, TensorFlow is optimized for GPUs, yet we used CPUs because of the high memory demand. Therefore, we rewrote our TensorFlow code in plain Python utilizing numpy and autograd. While the TensorFlow implementation consumed up to 500 GB of RAM for a single layer, our numpy implementation uses only up to 12 GB of RAM with three layers.

4 Data and Preprocessing

This section describes the preprocessing steps from formula strings to clusters of formula trees.

For training and evaluation, we used 13 939 generic / semantic LaTeX formula pairs from the DLMF/DRMF, NIST. As shown in Figure 1, we first apply data augmentation on the training set. Then, we parse the formulae to trees and cluster these trees with respect to their topology.

4.1 Data augmentation

To achieve a robust neural network, we extended our dataset using the following mechanism. We split each formula on comparators, since the subexpressions obtained by these splits are still valid mathematical expressions. We then take the single terms, every two and three adjacent terms, including one or two comparators, and the whole formula (fewer if fewer are available). The following example shows these steps. As a result of this data augmentation, we obtained a data set containing 51 217 mathematical terms.

1. π + δ

2. ph((λ)z)

3. δ

4. ph(()

5. ph((
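The augmentation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the set of comparators in `COMPARATORS` and the function name `augment` are hypothetical, and the formula is treated as a plain string rather than a token stream.

```python
import re

# Assumed comparator set; the capturing group makes re.split keep the comparators.
COMPARATORS = r"(=|<|>|\\leq|\\geq|\\neq)"

def augment(formula):
    """Return single terms, runs of two and three adjacent terms, and the whole formula."""
    parts = [p.strip() for p in re.split(COMPARATORS, formula)]
    terms, comps = parts[0::2], parts[1::2]
    results = []
    for width in (1, 2, 3):
        for start in range(len(terms) - width + 1):
            piece = terms[start]
            for k in range(1, width):
                piece += " " + comps[start + k - 1] + " " + terms[start + k]
            results.append(piece)
    results.append(formula.strip())
    return list(dict.fromkeys(results))  # drop duplicates, keep order

print(augment("a = b < c"))
```

For a formula without any comparator, the sketch returns only the formula itself, which corresponds to the "fewer if fewer are available" case.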

4.2 Formula Parsing and Tree Traversal

In the following, we will describe our heuristics-based formula parser for transforming the formulae into only-leaf-labeled binary parse trees.

Our parser first tokenizes the formula string. It then recursively transforms the token stream into a non-full n-ary tree. Structure-only generating braces and brackets are omitted, since the structure of the tree represents them. The following heuristics can either be activated or deactivated; due to lack of time, we have not yet thoroughly evaluated how well these heuristics perform. In future work, we will carry out this evaluation. To achieve the results reported in this paper, we have activated the first two heuristics.

• If a node forms a command, <COMMAND_END> is added as an additional terminating leaf.

• If a node forms a concatenation on the same level (e.g., a + b), <CONCAT_END> is added as an additional terminating leaf.

• Leaves that are infix operators are moved to the left, such that they are present in prefix notation.

After applying this tree traversal, we transform the n-ary tree into an adaptation of left-child right-sibling binary trees. I.e., we adapt left-child right-sibling binary trees to leaf-only valued trees. Here, the left child is the first child, and the right child is the same node excluding the first child.
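This transformation can be sketched as follows. It is a minimal illustration, assuming that the n-ary tree is given as nested Python lists whose non-list entries are the leaf tokens; the function name `to_binary` and the use of `None` for a missing child are choices made for the example.

```python
def to_binary(node):
    """Convert a nested-list n-ary tree into a leaf-only-labeled binary tree.

    A leaf is any non-list token. An inner node is returned as a
    (left, right) tuple, and None marks a missing child.
    """
    if not isinstance(node, list):
        return node  # leaf: keeps its label
    if not node:
        return None  # no children left
    first, rest = node[0], node[1:]
    # Left child: the first child. Right child: the same node without it.
    return (to_binary(first), to_binary(rest))

print(to_binary(["a", "+", "b"]))
```

Inner nodes carry no token in this representation, which corresponds to the zero label used for all non-leaf nodes.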

The following example input illustrates the parsing and traversal process:

First, we tokenize the input:

As shown in Figure 3, we generate an n-ary tree from this token stream. Next, we can (optionally) add the <COMMAND_END> and <CONCAT_END> tokens as shown in Figure 4. Additionally, we can (optionally) traverse the tree from infix to prefix notation as shown in Figure 5.

Figure 3: Parse tree without <COMMAND_END>, <CONCAT_END> or infix to prefix traversal (left is n-ary, right is binary)

Figure 4: Parse tree with <COMMAND_END> as <1> and <CONCAT_END> as <2> (left is n-ary, right is binary)

Figure 5: Parse tree with infix to prefix traversal (left is n-ary, right is binary)

Now, we can perform a traversal to gain the right-child-biggest property. For this, we recursively exchange the children of a tree if the size of the left child is greater than that of the right child. For a reconstruction within the neural network, we store the information on which children of a tree we performed the exchange. Thus, we obtain trees that are more similar, which improves the clustering (i.e., increases the speed of training).

After the subsequent clustering, these binary trees serve as input for the neural network.

4.3 Tree clustering

Because the topology of the neural network depends on the input, we would have to generate a new neural network for each formula. Therefore, we apply a padding to the binary trees. To do this, we create a binary tree a with a topology greater than or equal to that of every tree x ∈ X and compute x_a, the padded version of x with respect to a.
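The construction of such a covering tree and the padding can be sketched as follows. This is a minimal illustration under stated assumptions: trees are `(left, right)` tuples with non-tuple leaves and `None` for a missing child, a topology marks a leaf position as `(None, None)`, padded trees are written as `(value, left, right)` triples, and `PAD` is a placeholder name for the padding tag. All function names are hypothetical.

```python
from functools import reduce

PAD = "<PAD>"  # placeholder name for the padding tag

def topology(tree):
    """Strip the labels: None = no node, (left, right) = node."""
    if tree is None:
        return None
    if not isinstance(tree, tuple):
        return (None, None)
    return (topology(tree[0]), topology(tree[1]))

def hull(a, b):
    """Smallest topology that has a node wherever a or b has one."""
    if a is None:
        return b
    if b is None:
        return a
    return (hull(a[0], b[0]), hull(a[1], b[1]))

def pad(tree, top):
    """Embed tree into topology top; positions missing in tree get PAD."""
    if top is None:
        if tree is not None:
            raise ValueError("topology does not cover the tree")
        return None
    if tree is None:
        value, children = PAD, (None, None)
    elif isinstance(tree, tuple):
        value, children = 0, tree  # inner nodes are labeled zero
    else:
        value, children = tree, (None, None)
    return (value, pad(children[0], top[0]), pad(children[1], top[1]))

trees = [("a", "b"), (("c", "d"), None)]
cover = reduce(hull, (topology(t) for t in trees))
print(pad(trees[0], cover))
```

Computing one such cover per cluster, rather than for the whole data set, keeps the padded trees small.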

Generating the neural network for a single tree a that covers all trees would result in a network with 80 000 nodes, which would require tens of terabytes of RAM and is therefore impractical. Therefore, we cluster the pairs of binary trees (x, y) according to their topology.

We implemented numerous standard clustering algorithms, including k-means, DBSCAN, and linkage-algorithms. The results of these algorithms were insufficient. Therefore, we implemented Algorithm 1. For this algorithm, the choice of min-elems-list and max-size-list is crucial. The following choices yielded the best results:

Compared to the other tested clustering algorithms, this algorithm led to the best results on the given task of finding clusters with a small hull tree and finding as many elements as possible per cluster in a short time. The table in Figure 6 shows selected results of the clustering algorithms. Experiments showed that the training time is roughly proportional to a product that depends on the number of clusters and on the sizes of the hull trees, where the hull tree of a cluster is the minimal tree a such that a ≥ b holds for each tree b in the cluster (i.e., a has a node at every position at which b has one). Thus, we want to minimize this product to achieve a shorter training duration. In general, it is more important to reduce the number of clusters than the size of the paddings because this enables a better parallelization and because less native Python code is executed. By combining this clustering algorithm and the previously presented tree traversal, semi-batched computation and training is made possible.

Figure 6: Comparison of clustering algorithms. The application of tree traversal is implied. For performance reasons, the clusterings are only computed for 15 240 of the 51 217 mathematical terms in our data set.

Notes on Figure 6: For both the less important and the more important quality measure, lower is better. One measure is computed according to equation 3 with its parameter set to 1 and multiplied by a power of ten. The CPU time of the clustering includes preprocessing and evaluation on SCCKN. One setting is as in equations 1 and 2. Where marked, the result was very weak because either a lot of elements were not related to a cluster or the clusters were too large.

5 Results

We evaluated our recursive neural network using 10-fold cross validation. As the accuracy measure, we computed full accuracy and masked accuracy.

The accuracies ignore that a prediction may be almost correct if the result is, for example, shifted by one node. Therefore, if an error occurs near the root, a proper subtree at the wrong position is interpreted as wrong, too. For this reason, we additionally implemented a bag-of-words accuracy

where n is the number of possible values, and #_i(y) is the number of occurrences of i in y. The bag-of-words accuracy thus measures whether the prediction consists of the same unsorted values and frequencies as the ground truth but ignores their structure. This makes the measure robust against shifts of nodes. Since the training is in no way biased towards the direction of this measure, it is very likely that most of the errors which cause a reduction in the full accuracy, but are correct with respect to the bag-of-words accuracy, are only a slight displacement within the tree.
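The three measures can be sketched as follows. The exact formulas are not reproduced here, so these definitions are assumptions chosen to match the descriptions: full accuracy counts all positions, masked accuracy ignores positions whose ground truth is the structural zero label, and the bag-of-words accuracy is taken as the multiset overlap of values divided by their multiset union. Function names and the flat-list representation of trees are hypothetical.

```python
from collections import Counter

def full_accuracy(y_true, y_pred):
    """Fraction of all positions that are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def masked_accuracy(y_true, y_pred, ignore=(0,)):
    """Like full accuracy, but skipping positions whose true value is ignored."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t not in ignore]
    return sum(t == p for t, p in pairs) / len(pairs)

def bag_of_words_accuracy(y_true, y_pred):
    """Multiset overlap of values, independent of their positions."""
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    overlap = sum((true_counts & pred_counts).values())
    union = sum((true_counts | pred_counts).values())
    return overlap / union

y_true = [0, 5, 0, 7, 9]
y_pred = [0, 5, 0, 9, 7]  # two values displaced by one position
print(full_accuracy(y_true, y_pred))
print(masked_accuracy(y_true, y_pred))
print(bag_of_words_accuracy(y_true, y_pred))
```

In this toy example, the two displaced values lower the full and masked accuracies, while the bag-of-words accuracy stays at its maximum because the same values occur with the same frequencies.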

If the full accuracy is higher than the masked accuracy, the prediction of the topology of the tree (equal to the prediction of zeros) is better than the prediction of the values. This can indicate a bias towards predicting zeros. While this almost always occurs, our loss function avoids falling into a local minimum.

Until now, we mainly trained on a subset of only 15 240 mathematical terms for performance reasons. Further, we used relatively high learning rates.

So far, we have achieved a masked validation accuracy of 47.05%. Further, we repeatedly obtained full validation accuracies of up to 80%. Our experiment with the best masked validation accuracy reached a bag-of-words accuracy of 92.3%.

The results show that the single-layer approach outperforms the multi-layer approaches. An LSTM state size of 256 produced the best results.

By traversing and clustering the trees, we could achieve a speed increase of roughly a factor of ten.

5.1 Standalone disambiguation

The standalone disambiguation of mathematical formulae, which chooses which mathematical interpretation (i.e., semantic LaTeX macro) to use, yielded excellent results. Without training (i.e., random prediction), we achieved an accuracy of 50 to 60%. This is because some generic LaTeX symbols are restricted to a single semantic LaTeX macro. Using our bag-of-words approach and neural networks, we obtained an accuracy of about 97% per choice of a semantic LaTeX macro.

Since we would need a rule-based translator which only utilizes the neural network in ambiguous cases, our standalone disambiguation approach is currently not applicable. Our results show that neural networks are adequate for this disambiguation task. Thus, if a rule-based translator that has to resolve ambiguities is implemented, this would be a good solution.

6 Discussion and Conclusion

The current results cannot yet be applied to a complete translation but are promising considering that we did not yet use an attention mechanism, and that we did not yet use the entire data set for training. Further, a shortage of time did not allow a full training.

One major drawback of recursive neural networks is their performance, because batched training is generally not possible. We achieved a speed increase of roughly a factor of ten. How well our technique can be combined with other speed-optimizing approaches remains to be determined.

A drawback of clustering the training data is that the order of occurring formulae is biased, because formulae with a similar topology are used together for training. Experiments with a different neural network on the same bias showed that this bias can be neglected if the learning rate is low enough.

We think that the single-layer network outperforms the multi-layer network since pre-trained embeddings are not available, and since we have not yet performed layer-wise training.

7 Outlook

In future research, we will improve our implementation to enable a complete formula translation. For that, we will implement a novel attention mechanism for recursive neural networks. Additionally, we will extend our auto-encoder approaches and will apply layer-wise training.

For the parsing and tree traversal, we will implement an improved traversal approach based on post-fix notation to enhance the tree structure.

In the future, this work may allow the DLMF/DRMF to extend their repository of semantic LaTeX formulae more easily. Other applied use cases for this translator are mathematical spell-checkers, validation of mathematical books, easier interfaces for CASs, and the automated pre-correction of students’ math problem sets.

The standalone disambiguation could also be implemented using a recurrent or recursive neural network instead of a bag-of-words based approach. This would most likely yield better results, because the order of the formulae, or respectively, the structure of the formulae can be considered using this approach.

Acknowledgments

We would like to thank Howard Cohl et al. [3] for sharing their unreleased semantic LaTeX macros including 13 939 formulae from the DLMF/DRMF in this format. Further, we would like to thank Christian Borgelt for advising on the domain of neural networks. This work was supported by the FITWeltweit program of the German Academic Exchange Service (DAAD), as well as the German Research Foundation (DFG, grant GI-1259-1).

References

[1] M. Schubotz, A. Greiner-Petter, P. Scharpf, N. Meuschke, H. S. Cohl, and B. Gipp, “Improving the representation and conversion of mathematical formulae by considering their textual context,” in Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, TX, USA, June 03-07, 2018 (J. Chen, M. A. Gonçalves, J. M. Allen, E. A. Fox, M. Kan, and V. Petras, eds.), pp. 233–242, ACM, 2018.

[2] B. R. Miller and A. Youssef, “Technical aspects of the digital library of mathematical functions,” Annals of Mathematics and Artificial Intelligence, vol. 38, pp. 121–136, 05 2003.

[3] H. S. Cohl, M. Schubotz, M. A. McClain, B. V. Saunders, C. Y. Zou, A. S. Mohammed, and A. A. Danoff, “Growing the Digital Repository of Mathematical Formulae with Generic LaTeX Sources,” in Intelligent Computer Mathematics, Lecture Notes in Artificial Intelligence 9150 (M. Kerber, J. Carette, C. Kaliszyk, F. Rabe, and V. Sorge, eds.), vol. 9150 of LNCS, pp. 280–287, Springer, 2015.

[4] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” CoRR, vol. abs/1506.00019, 2015.

[5] H. S. Cohl, M. Schubotz, A. Youssef, A. Greiner-Petter, J. Gerhard, B. V. Saunders, M. A. McClain, J. Bang, and K. Chen, “Semantic preserving bijective mappings of mathematical formulae between document preparation systems and computer algebra systems,” in Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings (H. Geuvers, M. England, O. Hasan, F. Rabe, and O. Teschke, eds.), vol. 10383 of Lecture Notes in Computer Science, pp. 115–131, Springer, 2017.

[6] M. Luong, E. Brevdo, and R. Zhao, “Neural machine translation (seq2seq) tutorial,” 2017.

[7] X. Chen, C. Liu, and D. Song, “Tree-to-tree neural networks for program translation,” CoRR, vol. abs/1802.03691, 2018.

[8] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” CoRR, vol. abs/1503.00075, 2015.

[9] G. Louppe, K. Cho, C. Becot, and K. Cranmer, “QCD-Aware Recursive Neural Networks for Jet Physics,” arXiv, vol. 1702.00748, 2017.

[10] S. R. Bowman, J. Gauthier, A. Rastogi, R. Gupta, C. D. Manning, and C. Potts, “A fast unified model for parsing and sentence understanding,” CoRR, vol. abs/1603.06021, 2016.

[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[12] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Y. W. Teh and M. Titterington, eds.), vol. 9 of Proceedings of Machine Learning Research, (Chia Laguna Resort, Sardinia, Italy), pp. 249–256, PMLR, 05 2010.

[13] S. Ruder, “An overview of gradient descent optimization algorithms,” CoRR, vol. abs/1609.04747, 2016.

[14] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
