A Primer on Neural Network Models for Natural Language Processing

2015·Arxiv

Abstract

Abstract

Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with the neural techniques. The tutorial covers input encoding for natural language tasks, feed-forward networks, convolutional networks, recurrent networks and recursive networks, as well as the computation graph abstraction for automatic gradient computation.

1. Introduction

For a long time, core NLP techniques were dominated by machine-learning approaches that used linear models such as support vector machines or logistic regression, trained over very high dimensional yet very sparse feature vectors.

Recently, the field has seen some success in switching from such linear models over sparse inputs to non-linear neural-network models over dense inputs. While most of the neural network techniques are easy to apply, sometimes as almost drop-in replacements of the old linear classifiers, there is in many cases a strong barrier of entry. In this tutorial I attempt to provide NLP practitioners (as well as newcomers) with the basic background, jargon, tools and methodology that will allow them to understand the principles behind the neural network models and apply them to their own work. This tutorial is expected to be self-contained, while presenting the different approaches under a unified notation and framework. It repeats a lot of material which is available elsewhere. It also points to external sources for more advanced topics when appropriate.

This primer is not intended as a comprehensive resource for those that will go on and develop the next advances in neural-network machinery (though it may serve as a good entry point). Rather, it is aimed at those readers who are interested in taking the existing, useful technology and applying it in useful and creative ways to their favourite NLP problems. For more in-depth, general discussion of neural networks, the theory behind them, advanced optimization methods and other advanced topics, the reader is referred to other existing resources. In particular, the book by Bengio et al (2015) is highly recommended.

Scope The focus is on applications of neural networks to language processing tasks. However, some subareas of language processing with neural networks were decidedly left out of scope of this tutorial. These include the vast literature of language modeling and acoustic modeling, the use of neural networks for machine translation, and multi-modal applications combining language and other signals such as images and videos (e.g. caption generation). Caching methods for efficient runtime performance, methods for efficient training with large output vocabularies and attention models are also not discussed. Word embeddings are discussed only to the extent that is needed to understand in order to use them as inputs for other models. Other unsupervised approaches, including autoencoders and recursive autoencoders, also fall out of scope. While some applications of neural networks for language modeling and machine translation are mentioned in the text, their treatment is by no means comprehensive.

A Note on Terminology The word “feature” is used to refer to a concrete, linguistic input such as a word, a suffix, or a part-of-speech tag. For example, in a first-order part-of-speech tagger, the features might be “current word, previous word, next word, previous part of speech”. The term “input vector” is used to refer to the actual input that is fed to the neural-network classifier. Similarly, “input vector entry” refers to a specific value of the input. This is in contrast to a lot of the neural networks literature in which the word “feature” is overloaded between the two uses, and is used primarily to refer to an input-vector entry.

Mathematical Notation I use bold upper case letters to represent matrices (X, Y, Z), and bold lower-case letters to represent vectors (b). When there are series of related matrices and vectors (for example, where each matrix corresponds to a different layer in the network), superscript indices are used (). For the rare cases in which we want indicate the power of a matrix or a vector, a pair of brackets is added around the item to be exponentiated: (. Unless otherwise stated, vectors are assumed to be row vectors. We use [] to denote vector concatenation.

2. Neural Network Architectures

Neural networks are powerful learning models. We will discuss two kinds of neural network architectures, that can be mixed and matched – feed-forward networks and Recurrent / Recursive networks. Feed-forward networks include networks with fully connected layers, such as the multi-layer perceptron, as well as networks with convolutional and pooling layers. All of the networks act as classifiers, but each with different strengths.

Fully connected feed-forward neural networks (Section 4) are non-linear learners that can, for the most part, be used as a drop-in replacement wherever a linear learner is used. This includes binary and multiclass classification problems, as well as more complex structured prediction problems (Section 8). The non-linearity of the network, as well as the ability to easily integrate pre-trained word embeddings, often lead to superior classification accuracy. A series of works (Chen & Manning, 2014; Weiss, Alberti, Collins, & Petrov, 2015; Pei, Ge, & Chang, 2015; Durrett & Klein, 2015) managed to obtain improved syntactic parsing results by simply replacing the linear model of a parser with a fully connected feed-forward network. Straight-forward applications of a feed-forward network as a classi-fier replacement (usually coupled with the use of pre-trained word vectors) provide benefits also for CCG supertagging (Lewis & Steedman, 2014), dialog state tracking (Henderson, Thomson, & Young, 2013), pre-ordering for statistical machine translation (de Gispert, Iglesias, & Byrne, 2015) and language modeling (Bengio, Ducharme, Vincent, & Janvin, 2003; Vaswani, Zhao, Fossum, & Chiang, 2013). Iyyer et al (2015) demonstrate that multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering.

Networks with convolutional and pooling layers (Section 9) are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input. For example, in a document classification task, a single key phrase (or an ngram) can help in determining the topic of the document (Johnson & Zhang, 2015). We would like to learn that certain sequences of words are good indicators of the topic, and do not necessarily care where they appear in the document. Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position. Convolutional and pooling architecture show promising results on many tasks, including document classification (Johnson & Zhang, 2015), short-text categorization (Wang, Xu, Xu, Liu, Zhang, Wang, & Hao, 2015a), sentiment classification (Kalchbrenner, Grefenstette, & Blunsom, 2014; Kim, 2014), relation type classification between entities (Zeng, Liu, Lai, Zhou, & Zhao, 2014; dos Santos, Xiang, & Zhou, 2015), event detection (Chen, Xu, Liu, Zeng, & Zhao, 2015; Nguyen & Grishman, 2015), paraphrase iden-tification (Yin & Sch¨utze, 2015) semantic role labeling (Collobert, Weston, Bottou, Karlen, Kavukcuoglu, & Kuksa, 2011), question answering (Dong, Wei, Zhou, & Xu, 2015), predicting box-office revenues of movies based on critic reviews (Bitvai & Cohn, 2015) modeling text interestingness (Gao, Pantel, Gamon, He, & Deng, 2014), and modeling the relation between character-sequences and part-of-speech tags (Santos & Zadrozny, 2014).

In natural language we often work with structured data of arbitrary sizes, such as sequences and trees. We would like to be able to capture regularities in such structures, or to model similarities between such structures. In many cases, this means encoding the structure as a fixed width vector, which we can then pass on to another statistical learner for further processing. While convolutional and pooling architectures allow us to encode arbitrary large items as fixed size vectors capturing their most salient features, they do so by sacrificing most of the structural information. Recurrent (Section 10) and recursive (Section 12) architectures, on the other hand, allow us to work with sequences and trees while preserving a lot of the structural information. Recurrent networks (Elman, 1990) are designed to model sequences, while recursive networks (Goller & K¨uchler, 1996) are generalizations of recurrent networks that can handle trees. We will also discuss an extension of recurrent networks that allow them to model stacks (Dyer, Ballesteros, Ling, Matthews, & Smith, 2015; Watanabe & Sumita, 2015).

Recurrent models have been shown to produce very strong results for language modeling, including (Mikolov, Karafi´at, Burget, Cernocky, & Khudanpur, 2010; Mikolov, Kombrink, Luk´aˇs Burget, ˇCernocky, & Khudanpur, 2011; Mikolov, 2012; Duh, Neubig, Sudoh, & Tsukada, 2013; Adel, Vu, & Schultz, 2013; Auli, Galley, Quirk, & Zweig, 2013; Auli & Gao, 2014); as well as for sequence tagging (Irsoy & Cardie, 2014; Xu, Auli, & Clark, 2015; Ling, Dyer, Black, Trancoso, Fermandez, Amir, Marujo, & Luis, 2015b), machine translation (Sundermeyer, Alkhouli, Wuebker, & Ney, 2014; Tamura, Watanabe, & Sumita, 2014; Sutskever, Vinyals, & Le, 2014; Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk, & Bengio, 2014b), dependency parsing (Dyer et al., 2015; Watanabe & Sumita, 2015), sentiment analysis (Wang, Liu, SUN, Wang, & Wang, 2015b), noisy text normalization (Chrupala, 2014), dialog state tracking (Mrkˇsi´c, ´O S´eaghdha, Thomson, Gasic, Su, Vandyke, Wen, & Young, 2015), response generation (Sordoni, Galley, Auli, Brockett, Ji, Mitchell, Nie, Gao, & Dolan, 2015), and modeling the relation between character sequences and part-of-speech tags (Ling et al., 2015b).

Recursive models were shown to produce state-of-the-art or near state-of-the-art results for constituency (Socher, Bauer, Manning, & Andrew Y., 2013) and dependency (Le & Zuidema, 2014; Zhu, Qiu, Chen, & Huang, 2015a) parse re-ranking, discourse parsing (Li, Li, & Hovy, 2014), semantic relation classification (Hashimoto, Miwa, Tsuruoka, & Chikayama, 2013; Liu, Wei, Li, Ji, Zhou, & WANG, 2015), political ideology detection based on parse trees (Iyyer, Enns, Boyd-Graber, & Resnik, 2014b), sentiment classification (Socher, Perelygin, Wu, Chuang, Manning, Ng, & Potts, 2013; Hermann & Blunsom, 2013), target-dependent sentiment classification (Dong, Wei, Tan, Tang, Zhou, & Xu, 2014) and question answering (Iyyer, Boyd-Graber, Claudino, Socher, & Daum´e III, 2014a).

3. Feature Representation

Before discussing the network structure in more depth, it is important to pay attention to how features are represented. For now, we can think of a feed-forward neural network as a function NN(x) that takes as input a dimensional vector x and produces a dimensional output vector. The function is often used as a classifier, assigning the input x a degree of membership in one or more of classes. The function can be complex, and is almost always non-linear. Common structures of this function will be discussed in Section 4. Here, we focus on the input, x. When dealing with natural language, the input x encodes features such as words, part-of-speech tags or other linguistic information. Perhaps the biggest jump when moving from sparse-input linear models to neural-network based models is to stop representing each feature as a unique dimension (the so called one-hot representation) and representing them instead as dense vectors. That is, each core feature is embedded into a d dimensional space, and represented as a vector in that space.The embeddings (the vector representation of each core feature) can then be trained like the other parameter of the function NN. Figure 1 shows the two approaches to feature representation.

The feature embeddings (the values of the vector entries for each feature) are treated as model parameters that need to be trained together with the other components of the network. Methods of training (or obtaining) the feature embeddings will be discussed later. For now, consider the feature embeddings as given.

The general structure for an NLP classification system based on a feed-forward neural network is thus:

1. Extract a set of core linguistic features that are relevant for predicting the output class.

2. For each feature of interest, retrieve the corresponding vector

3. Combine the vectors (either by concatenation, summation or a combination of both) into an input vector x.

4. Feed x into a non-linear classifier (feed-forward neural network).

The biggest change in the input, then, is the move from sparse representations in which each feature is its own dimension, to a dense representation in which each feature is mapped to a vector. Another difference is that we extract only core features and not feature combinations. We will elaborate on both these changes briefly.

Dense Vectors vs. One-hot Representations What are the benefits of representing our features as vectors instead of as unique IDs? Should we always represent features as dense vectors? Let’s consider the two kinds of representations:

One Hot Each feature is its own dimension.

Figure 1: Sparse vs. dense feature representations. Two encodings of the information: (a) Sparse feature vector. Each dimension represents a feature. Feature combinations receive their own dimensions. Feature values are binary. Dimensionality is very high. (b) Dense, embeddings-based feature vector. Each core feature is represented as a vector. Each feature corresponds to several input vector entries. No explicit encoding of feature combinations. Dimensionality is low. The feature-to-vector mappings come from an embedding table.

• Features are completely independent from one another. The feature “word is ‘dog’ ” is as dis-similar to “word is ‘thinking’ ” than it is to “word is ‘cat’ ”.

Dense Each feature is a d-dimensional vector.

• Dimensionality of vector is d.

• Similar features will have similar vectors – information is shared between similar features.

One benefit of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. However, this is just a technical obstacle, which can be resolved with some engineering effort.

The main benefit of the dense representations is in generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities. For example, assume we have observed the word ‘dog’ many times during training, but only observed the word ‘cat’ a handful of times, or not at all. If each of the words is associated with its own dimension, occurrences of ‘dog’ will not tell us anything about the occurrences of ‘cat’. However, in the dense vectors representation the learned vector for ‘dog’ may be similar to the learned vector from ‘cat’, allowing the model to share statistical strength between the two events. This argument assumes that “good” vectors are somehow given to us. Section 5 describes ways of obtaining such vector representations.

In cases where we have relatively few distinct features in the category, and we believe there are no correlations between the different features, we may use the one-hot representation. However, if we believe there are going to be correlations between the different features in the group (for example, for part-of-speech tags, we may believe that the different verb inflections VB and VBZ may behave similarly as far as our task is concerned) it may be worthwhile to let the network figure out the correlations and gain some statistical strength by sharing the parameters. It may be the case that under some circumstances, when the feature space is relatively small and the training data is plentiful, or when we do not wish to share statistical information between distinct words, there are gains to be made from using the one-hot representations. However, this is still an open research question, and there are no strong evidence to either side. The majority of work (pioneered by (Collobert & Weston, 2008; Collobert et al., 2011; Chen & Manning, 2014)) advocate the use of dense, trainable embedding vectors for all features. For work using neural network architecture with sparse vector encodings see (Johnson & Zhang, 2015).

Finally, it is important to note that representing features as dense vectors is an integral part of the neural network framework, and that consequentially the differences between using sparse and dense feature representations are subtler than they may appear at first. In fact, using sparse, one-hot vectors as input when training a neural network amounts to dedicating the first layer of the network to learning a dense embedding vector for each feature based on the training data. We touch on this in Section 4.4.

Variable Number of Features: Continuous Bag of Words Feed-forward networks assume a fixed dimensional input. This can easily accommodate the case of a featureextraction function that extracts a fixed number of features: each feature is represented as a vector, and the vectors are concatenated. This way, each region of the resulting input vector corresponds to a different feature. However, in some cases the number of features is not known in advance (for example, in document classification it is common that each word in the sentence is a feature). We thus need to represent an unbounded number of features using a fixed size vector. One way of achieving this is through a socalled continuous bag of words (CBOW) representation (Mikolov, Chen, Corrado, & Dean, 2013). The CBOW is very similar to the traditional bag-of-words representation in which we discard order information, and works by either summing or averaging the embedding vectors of the corresponding features:

A simple variation on the CBOW representation is weighted CBOW, in which different vectors receive different weights:

Here, each feature has an associated weight , indicating the relative importance of the feature. For example, in a document classification task, a feature may correspond to a word in the document, and the associated weight could be the word’s TF-IDF score.

Distance and Position Features The linear distance in between two words in a sentence may serve as an informative feature. For example, in an event extraction taskgiven a trigger word and a candidate argument word, and asked to predict if the argument word is indeed an argument of the trigger. The distance (or relative position) between the trigger and the argument is a strong signal for this prediction task. In the “traditional” NLP setup, distances are usually encoded by binning the distances into several groups (i.e. 1, 2, 3, 4, 5–10, 10+) and associating each bin with a one-hot vector. In a neural architecture, where the input vector is not composed of binary indicator features, it may seem natural to allocate a single input vector entry to the distance feature, where the numeric value of that entry is the distance. However, this approach is not taken in practice. Instead, distance features are encoded similarly to the other feature types: each bin is associated with a d-dimensional vector, and these distance-embedding vectors are then trained as regular parameters in the network (Zeng et al., 2014; dos Santos et al., 2015; Zhu et al., 2015a; Nguyen & Grishman, 2015).

Feature Combinations Note that the feature extraction stage in the neural-network settings deals only with extraction of core features. This is in contrast to the traditional linear-model-based NLP systems in which the feature designer had to manually specify not only the core features of interests but also interactions between them (e.g., introducing not only a feature stating “word is X” and a feature stating “tag is Y” but also combined feature stating “word is X and tag is Y” or sometimes even “word is X, tag is Y and previous word is Z”). The combination features are crucial in linear models because they introduce more dimensions to the input, transforming it into a space where the data-points are closer to being linearly separable. On the other hand, the space of possible combinations is very large, and the feature designer has to spend a lot of time coming up with an effective set of feature combinations. One of the promises of the non-linear neural network models is that one needs to define only the core features. The non-linearity of the classifier, as defined by the network structure, is expected to take care of finding the indicative feature combinations, alleviating the need for feature combination engineering.

Kernel methods (Shawe-Taylor & Cristianini, 2004), and in particular polynomial kernels (Kudo & Matsumoto, 2003), also allow the feature designer to specify only core features, leaving the feature combination aspect to the learning algorithm. In contrast to neural-network models, kernels methods are convex, admitting exact solutions to the optimization problem. However, the classification efficiency in kernel methods scales linearly with the size of the training data, making them too slow for most practical purposes, and not suitable for training with large datasets. On the other hand, neural network classification efficiency scales linearly with the size of the network, regardless of the training data size.

Dimensionality How many dimensions should we allocate for each feature? Unfortunately, there are no theoretical bounds or even established best-practices in this space. Clearly, the dimensionality should grow with the number of the members in the class (you probably want to assign more dimensions to word embeddings than to part-of-speech embeddings) but how much is enough? In current research, the dimensionality of word-embedding vectors range between about 50 to a few hundreds, and, in some extreme cases, thousands. Since the dimensionality of the vectors has a direct effect on memory requirements and processing time, a good rule of thumb would be to experiment with a few different sizes, and choose a good trade-off between speed and task accuracy.

Vector Sharing Consider a case where you have a few features that share the same vocabulary. For example, when assigning a part-of-speech to a given word, we may have a set of features considering the previous word, and a set of features considering the next word. When building the input to the classifier, we will concatenate the vector representation of the previous word to the vector representation of the next word. The classifier will then be able to distinguish the two different indicators, and treat them differently. But should the two features share the same vectors? Should the vector for “dog:previous-word” be the same as the vector of “dog:next-word”? Or should we assign them two distinct vectors? This, again, is mostly an empirical question. If you believe words behave differently when they appear in different positions (e.g., word X behaves like word Y when in the previous position, but X behaves like Z when in the next position) then it may be a good idea to use two different vocabularies and assign a different set of vectors for each feature type. However, if you believe the words behave similarly in both locations, then something may be gained by using a shared vocabulary for both feature types.

For multi-class classification problems with k classes, the network’s output is a k-dimensional vector in which every dimension represents the strength of a particular output class. That is, the output remains as in the traditional linear models – scalar scores to items in a discrete set. However, as we will see in Section 4, there is a matrix associated with the output layer. The columns of this matrix can be thought of as d dimensional embeddings of the output classes. The vector similarities between the vector representations of the k classes indicate the model’s learned similarities between the output classes.

Historical Note Representing words as dense vectors for input to a neural network was introduced by Bengio et al (Bengio et al., 2003) in the context of neural language modeling. It was introduced to NLP tasks in the pioneering work of Collobert, Weston and colleagues (2008, 2011). Using embeddings for representing not only words but arbitrary features was popularized following Chen and Manning (2014).

4. Feed-forward Neural Networks

A Brain-inspired metaphor As the name suggest, neural-networks are inspired by the brain’s computation mechanism, which consists of computation units called neurons. In the metaphor, a neuron is a computational unit that has scalar inputs and outputs. Each input has an associated weight. The neuron multiplies each input by its weight, and then sumsthem, applies a non-linear function to the result, and passes it to its output. The neurons are connected to each other, forming a network: the output of a neuron may feed into the inputs of one or more neurons. Such networks were shown to be very capable computational devices. If the weights are set correctly, a neural network with enough neurons and a non-linear activation function can approximate a very wide range of mathematical functions (we will be more precise about this later).

Figure 2: Feed-forward neural network with two hidden layers.

A typical feed-forward neural network may be drawn as in Figure 2. Each circle is a neuron, with incoming arrows being the neuron’s inputs and outgoing arrows being the neuron’s outputs. Each arrow carries a weight, reflecting its importance (not shown). Neurons are arranged in layers, reflecting the flow of information. The bottom layer has no incoming arrows, and is the input to the network. The top-most layer has no outgoing arrows, and is the output of the network. The other layers are considered “hidden”. The sigmoid shape inside the neurons in the middle layers represent a non-linear function (typically a 1)) that is applied to the neuron’s value before passing it to the output. In the figure, each neuron is connected to all of the neurons in the next layer – this is called a fully-connected layer or an affine layer.

While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch to using more concise mathematic notation. The values of each row of neurons in the network can be thought of as a vector. In Figure 2 the input layer is a 4 dimensional vector (x), and the layer above it is a 6 dimensional vector (). The fully connected layer can be thought of as a linear transformation from 4 dimensions to 6 dimensions. A fully-connected layer implements a vector-matrix multiplication, h = xW where the weight of the connection from the ith neuron in the input row to the jth neuron in the output row is The values of h are then transformed by a non-linear function g that is applied to each value before being passed on to the next input. The whole computation from input to output can be written as: (where are the weights of the first layer and are the weights of the second one.

In Mathematical Notation From this point on, we will abandon the brain metaphor and describe networks exclusively in terms of vector-matrix operations. The simplest neural network is the perceptron, which is a linear function of its inputs:

W is the weight matrix, and b is a bias term.In order to go beyond linear functions, we introduce a non-linear hidden layer (the network in Figure 2 has two such layers), resulting in the 1-layer Multi Layer Perceptron (MLP1). A one-layer feed-forward neural network has the form:

Here are a matrix and a bias term for the first linear transformation of the input, g is a non-linear function that is applied element-wise (also called a non-linearity or an are the matrix and bias term for a second linear transform.

Breaking it down, is a linear transformation of the input dimensions to dimensions. g is then applied to each of the dimensions, and the matrix with bias vector are then used to transform the result into the dimensional output vector. The non-linear activation function g has a crucial role in the network’s ability to represent complex functions. Without the non-linearity in g, the neural network can only represent linear transformations of the input.

We can add additional linear-transformations and non-linearities, resulting in a 2-layer MLP (the network in Figure 2 is of this form):

of 6. The network in figure 2 does not include bias terms. A bias term can be added to a layer by adding to

it an additional neuron that does not have any incoming connections, whose value is always 1. 7. To see why, consider that a sequence of linear transformations is still a linear transformation.

It is perhaps clearer to write deeper networks like this using intermediary variables:

The vector resulting from each linear transform is referred to as a layer. The outer-most linear transform results in the output layer and the other linear transforms result in hidden layers. Each hidden layer is followed by a non-linear activation. In some cases, such as in the last layer of our example, the bias vectors are forced to 0 (“dropped”).

Layers resulting from linear transformations are often referred to as fully connected, or affine. Other types of architectures exist. In particular, image recognition problems benefit from convolutional and pooling layers. Such layers have uses also in language processing, and will be discussed in Section 9. Networks with more than one hidden layer are said to be deep networks, hence the name deep learning.

When describing a neural network, one should specify the dimensions of the layers and the input. A layer will expect a dimensional vector as its input, and transform it into a dimensional vector. The dimensionality of the layer is taken to be the dimensionality of its output. For a fully connected layer l(x) = xW + b with input dimensionality output dimensionality , the dimensions of 1

The output of the network is a dimensional vector. In case output is a scalar. Such networks can be used for regression (or scoring) by considering the value of the output, or for binary classification by consulting the sign of the output. Networks with 1 can be used for k-class classification, by associating each dimension with a class, and looking for the dimension with maximal value. Similarly, if the output vector entries are positive and sum to one, the output can be interpreted as a distribution over class assignments (such output normalization is typically achieved by applying a softmax transformation on the output layer, see Section 4.3).

The matrices and the bias terms that define the linear transformations are the parameters of the network. It is common to refer to the collection of all parameters as with the input, the parameters determine the network’s output. The training algorithm is responsible for setting their values such that the network’s predictions are correct. Training is discussed in Section 6.

4.1 Representation Power

In terms of representation power, it was shown by (Hornik, Stinchcombe, & White, 1989; Cybenko, 1989) that MLP1 is a universal approximator – it can approximate with any desired non-zero amount of error a family of functionsthat include all continuous functions on a closed and bounded subset of , and any function mapping from any finite dimensional discrete space to another. This may suggest there is no reason to go beyond MLP1 to more complex architectures. However, the theoretical result does not state how large the hidden layer should be, nor does it say anything about the learnability of the neural network (it states that a representation exists, but does not say how easy or hard it is to set the parameters based on training data and a specific learning algorithm). It also does not guarantee that a training algorithm will find the correct function generating our training data. Since in practice we train neural networks on relatively small amounts of data, using a combination of the backpropagation algorithm and variants of stochastic gradient descent, and use hidden layers of relatively modest sizes (up to several thousands), there is benefit to be had in trying out more complex architectures than MLP1. In many cases, however, MLP1 does indeed provide very strong results. For further discussion on the representation power of feed-forward neural networks, see (Bengio et al., 2015, Section 6.5).

4.2 Common Non-linearities

The non-linearity g can take many forms. There is currently no good theory as to which non-linearity to apply in which conditions, and choosing the correct non-linearity for a given task is for the most part an empirical question. I will now go over the common non-linearities from the literature: the sigmoid, tanh, hard tanh and the rectified linear unit (ReLU). Some NLP researchers also experimented with other forms of non-linearities such as cube and tanh-cube.

Sigmoid The sigmoid activation function ) is an S-shaped function, transforming each value x into the range [0, 1].

Hyperbolic tangent (tanh) The hyperbolic tangent activation func- tion is an S-shaped function, transforming the values x into the range [

Hard tanh The hard-tanh activation function is an approximation of the tanh function which is faster to compute and take derivatives of:

Rectifier (ReLU) The Rectifier activation function (Glorot, Bordes, & Bengio, 2011), also known as the rectified linear unit is a very simple activation function that is easy to work with and was shown many times to produce excellent results.The ReLU unit clips each value x < 0 at 0. Despite its simplicity, it performs well for many tasks, especially when combined with the dropout regularization technique (see Section 6.4).

As a rule of thumb, ReLU units work better than tanh, and tanh works better than sigmoid.

4.3 Output Transformations

In many cases, the output layer vector is also transformed. A common transformation is the softmax:

The result is a vector of non-negative real numbers that sum to one, making it a discrete probability distribution over k possible outcomes.

The softmax output transformation is used when we are interested in modeling a probability distribution over the possible output classes. To be effective, it should be used in conjunction with a probabilistic training objective such as cross-entropy (see Section 4.5 below).

When the softmax transformation is applied to the output of a network without a hidden layer, the result is the well known multinomial logistic regression model, also known as a maximum-entropy classifier.

4.4 Embedding Layers

Up until now, the discussion ignored the source of x, treating it as an arbitrary vector. In an NLP application, x is usually composed of various embeddings vectors. We can be explicit about the source of x, and include it in the network’s definition. We introduce a function from core features to an input vector.

It is common for c to extract the embedding vector associated with each feature, and concatenate them:

Another common choice is for c to sum the embedding vectors (this assumes the embedding vectors all share the same dimensionality):

The form of c is an essential part of the network’s design. In many papers, it is common to refer to c as part of the network, and likewise treat the word embeddings ) as resulting from an “embedding layer” or “lookup layer”. Consider a vocabulary of |V | words, each embedded as a d dimensional vector. The collection of vectors can then be thought of as a embedding matrix E in which each row corresponds to an embedded feature. Let -dimensional vector, which is all zeros except from one index, corresponding to the value of the ith feature, in which the value is 1 (this is called a one-hot vector). The multiplication will then “select” the corresponding row of ) can be defined in terms of

And similarly:

The input to the network is then considered to be a collection of one-hot vectors. While this is elegant and well defined mathematically, an efficient implementation typically involves a hash-based data structure mapping features to their corresponding embedding vectors, without going through the one-hot representation.

In this tutorial, we take c to be separate from the network architecture: the network’s inputs are always dense real-valued input vectors, and c is applied before the input is passed the network, similar to a “feature function” in the familiar linear-models terminology. However, when training a network, the input vector x does remember how it was constructed, and can propagate error gradients back to its component embedding vectors, as appropriate.

A note on notation When describing network layers that get concatenated vectors x, y and z as input, some authors use explicit concatenation ([x; y; z]W + b) while others use an affine transformation (xU + yV + zW + b). If the weight matrices U, V, W in the affine transformation are different than one another, the two notations are equivalent.

A note on sparse vs. dense features Consider a network which uses a “traditional” sparse representation for its input vectors, and no embedding layer. Assuming the set of all available features is V and we have k “on” features , the network’s input is:

and so the first layer (ignoring the non-linear activation) is:

This layer selects rows of W corresponding to the input features in x and sums them, then adding a bias term. This is very similar to an embedding layer that produces a CBOW representation over the features, where the matrix W acts as the embedding matrix. The main difference is the introduction of the bias vector b, and the fact that the embedding layer typically does not undergo a non-linear activation but rather passed on directly to the first layer. Another difference is that this scenario forces each feature to receive a separate vector (row in W) while the embedding layer provides more flexibility, allowing for example for the features “next word is dog” and “previous word is dog” to share the same vector. However, these differences are small and subtle. When it comes to multi-layer feed-forward networks, the difference between dense and sparse inputs is smaller than it may seem at first sight.

4.5 Loss Functions

When training a neural network (more on training in Section 6 below), much like when training a linear classifier, one defines a loss function ), stating the loss of predicting when the true output is y. The training objective is then to minimize the loss across the different training examples. The loss ) assigns a numerical score (a scalar) for the network’s output given the true expected output The loss is always positive, and should be zero only for cases where the network’s output is correct.

The parameters of the network (the matrices , the biases and commonly the embeddings E) are then set in order to minimize the loss L over the training examples (usually, it is the sum of the losses over the different training examples that is being minimized).

The loss can be an arbitrary function mapping two vectors to a scalar. For practical purposes of optimization, we restrict ourselves to functions for which we can easily compute gradients (or sub-gradients). In most cases, it is sufficient and advisable to rely on a common loss function rather than defining your own. For a detailed discussion on loss functions for neural networks see (LeCun, Chopra, Hadsell, Ranzato, & Huang, 2006; LeCun & Huang, 2005; Bengio et al., 2015). We now discuss some loss functions that are commonly used in neural networks for NLP.

Hinge (binary) For binary classification problems, the network’s output is a single scalar ˆy and the intended output The classification rule is classification is considered correct if 0, meaning that share the same sign. The hinge loss, also known as margin loss or SVM loss, is defined as:

The loss is 0 when share the same sign and 1. Otherwise, the loss is linear. In other words, the binary hinge loss attempts to achieve a correct classification, with a margin of at least 1.

Hinge (multiclass) The hinge loss was extended to the multiclass setting by Crammer and Singer (2002). Let be the network’s output vector, and y be the one-hot vector for the correct output class.

The classification rule is defined as selecting the class with the highest score:

The multiclass hinge loss attempts to score the correct class above all other classes with a margin of at least 1.

Both the binary and multiclass hinge losses are intended to be used with a linear output layer. The hinge losses are useful whenever we require a hard decision rule, and do not attempt to model class membership probability.

Log loss The log loss is a common variation of the hinge loss, which can be seen as a “soft” version of the hinge loss with an infinite margin (LeCun et al., 2006).

Categorical cross-entropy loss The categorical cross-entropy loss (also referred to as negative log likelihood) is used when a probabilistic interpretation of the scores is desired.

Let be a vector representing the true multinomial distribution over the labels 1be the network’s output, which was transformed by the softmax activation function, and represent the class membership conditional distribution ˆ). The categorical cross entropy loss measures the dissimilarity between the true label distribution y and the predicted label distribution , and is defined as cross entropy:

For hard classification problems in which each training example has a single correct class assignment, y is a one-hot vector representing the true class. In such cases, the cross entropy can be simplified to:

where t is the correct class assignment. This attempts to set the probability mass assigned to the correct class t to 1. Because the scores have been transformed using the softmax function and represent a conditional distribution, increasing the mass assigned to the correct class means decreasing the mass assigned to all the other classes.

The cross-entropy loss is very common in the neural networks literature, and produces a multi-class classifier which does not only predict the one-best class label but but also predicts a distribution over the possible labels. When using the cross-entropy loss, it is assumed that the network’s output is transformed using the softmax transformation.

Ranking losses In some settings, we are not given supervision in term of labels, but rather as pairs of correct and incorrect items , and our goal is to score correct items above incorrect ones. Such training situations arise when we have only positive examples, and generate negative examples by corrupting a positive example. A useful loss in such scenarios is the margin-based ranking loss, defined for a pair of correct and incorrect examples:

where NN(x) is the score assigned by the network for input vector x. The objective is to score (rank) correct inputs over incorrect ones with a margin of at least 1. A common variation is to use the log version of the ranking loss:

Examples using the ranking hinge loss in language tasks include training with the auxiliary tasks used for deriving pre-trained word embeddings (see section 5), in which we are given a correct word sequence and a corrupted word sequence, and our goal is to score the correct sequence above the corrupt one (Collobert & Weston, 2008). Similarly, Van de Cruys (2014) used the ranking loss in a selectional-preferences task, in which the network was trained to rank correct verb-object pairs above incorrect, automatically derived ones, and (Weston, Bordes, Yakhnenko, & Usunier, 2013) trained a model to score correct (head,relation,trail) triplets above corrupted ones in an information-extraction setting. An example of using the ranking log loss can be found in (Gao et al., 2014). A variation of the ranking log loss allowing for a different margin for the negative and positive class is given in (dos Santos et al., 2015).

5. Word Embeddings

A main component of the neural-network approach is the use of embeddings – representing each feature as a vector in a low dimensional space. But where do the vectors come from? This section will survey the common approaches.

5.1 Random Initialization

When enough supervised training data is available, one can just treat the feature embeddings the same as the other model parameters: initialize the embedding vectors to random values, and let the network-training procedure tune them into “good” vectors.

Some care has to be taken in the way the random initialization is performed. The method used by the effective word2vec implementation (Mikolov et al., 2013; Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) is to initialize the word vectors to uniformly sampled random numbers in the range [is the number of dimensions. Another option is to use xavier initialization (see Section 6.3) and initialize with uniformly sampled values from

In practice, one will often use the random initialization approach to initialize the embedding vectors of commonly occurring features, such as part-of-speech tags or individual letters, while using some form of supervised or unsupervised pre-training to initialize the potentially rare features, such as features for individual words. The pre-trained vectors can then either be treated as fixed during the network training process, or, more commonly, treated like the randomly-initialized vectors and further tuned to the task at hand.

5.2 Supervised Task-specific Pre-training

If we are interested in task A, for which we only have a limited amount of labeled data (for example, syntactic parsing), but there is an auxiliary task B (say, part-of-speech tagging) for which we have much more labeled data, we may want to pre-train our word vectors so that they perform well as predictors for task B, and then use the trained vectors for training task A. In this way, we can utilize the larger amounts of labeled data we have for task B. When training task A we can either treat the pre-trained vectors as fixed, or tune them further for task A. Another option is to train jointly for both objectives, see Section 7 for more details.

5.3 Unsupervised Pre-training

The common case is that we do not have an auxiliary task with large enough amounts of annotated data (or maybe we want to help bootstrap the auxiliary task training with better vectors). In such cases, we resort to “unsupervised” methods, which can be trained on huge amounts of unannotated text.

The techniques for training the word vectors are essentially those of supervised learning, but instead of supervision for the task that we care about, we instead create practically unlimited number of supervised training instances from raw text, hoping that the tasks that we created will match (or be close enough to) the final task we care about.

The key idea behind the unsupervised approaches is that one would like the embedding vectors of “similar” words to have similar vectors. While word similarity is hard to define and is usually very task-dependent, the current approaches derive from the distributional hypothesis (Harris, 1954), stating that words are similar if they appear in similar contexts. The different methods all create supervised training instances in which the goal is to either predict the word from its context, or predict the context from the word.

An important benefit of training word embeddings on large amounts of unannotated data is that it provides vector representations for words that do not appear in the supervised training set. Ideally, the representations for these words will be similar to those of related words that do appear in the training set, allowing the model to generalize better on unseen events. It is thus desired that the similarity between word vectors learned by the unsupervised algorithm captures the same aspects of similarity that are useful for performing the intended task of the network.

Common unsupervised word-embedding algorithms include (Mikolov et al., 2013, 2013), GloVe (Pennington, Socher, & Manning, 2014) and the Collobert and Weston (2008, 2011) embeddings algorithm. These models are inspired by neural networks and are based on stochastic gradient training. However, they are deeply connected to another family of algorithms which evolved in the NLP and IR communities, and that are based on matrix factorization (see (Levy & Goldberg, 2014b; Levy et al., 2015) for a discussion).

Arguably, the choice of auxiliary problem (what is being predicted, based on what kind of context) affects the resulting vectors much more than the learning method that is being used to train them. We thus focus on the different choices of auxiliary problems that are available, and only skim over the details of the training methods. Several software packages for deriving word vectors are available, including word2vecimplementing the word2vec models with word-windows based contexts, word2vecfwhich is a modified version of word2vec allowing the use of arbitrary contexts, and GloVeimplementing the GloVe model. Many pre-trained word vectors are also available for download on the web.

While beyond the scope of this tutorial, it is worth noting that the word embeddings derived by unsupervised training algorithms have a wide range of applications in NLP beyond using them for initializing the word-embeddings layer of a neural-network model.

5.4 Training Objectives

Given a word w and its context c, different algorithms formulate different auxiliary tasks. In all cases, each word is represented as a d-dimensional vector which is initialized to a random value. Training the model to perform the auxiliary tasks well will result in good word embeddings for relating the words to the contexts, which in turn will result in the embedding vectors for similar words to be similar to each other.

Language-modeling inspired approaches such as those taken by (Mikolov et al., 2013; Mnih & Kavukcuoglu, 2013) as well as GloVe (Pennington et al., 2014) use auxiliary tasks in which the goal is to predict the word given its context. This is posed in a probabilistic setup, trying to model the conditional probability P(w|c).

Other approaches reduce the problem to that of binary classification. In addition to the set D of observed word-context pairs, a set ¯D is created from random words and context pairings. The binary classification problem is then: does the given (w, c) pair come from D or not? The approaches differ in how the set ¯D is constructed, what is the structure of the classifier, and what is the objective being optimized. Collobert and Weston (2008, 2011) take a margin-based binary ranking approach, training a feed-forward neural network to score correct (w, c) pairs over incorrect