A Survey of Neural Network Techniques for Feature Extraction from Text
2017·arXiv
Abstract
This paper aims to catalyze research discussions about text feature extraction techniques using neural network architectures. The research questions discussed here focus on the state-of-the-art neural network techniques that have proven to be useful tools for language processing, language generation, text classification and other computational linguistics tasks.

RQ1 What are the relatively simple statistical techniques to extract features from text?

RQ2 Is there any inherent benefit to using neural networks as opposed to the simple methods?

RQ3 What are the trade-offs that neural networks incur as opposed to the simple methods?

RQ4 How do the different techniques compare to each other in terms of performance and accuracy?

RQ5 In what use-cases do the trade-offs outweigh the benefits of neural networks?

The research questions listed in Section 2 will be tackled by surveying a few of the important overview papers on the topic (Goldberg, 2016) (Bengio et al., 2003) (Morin and Bengio, 2005). A few of the groundbreaking research papers in this area will also be studied, including those on word embeddings (Mikolov et al., 2013a) (Mikolov et al., 2013b) (Mikolov et al., 2013c).

In addition to this, other less obvious methods of feature extraction will be surveyed, including tasks like part-of-speech tagging, chunking, named entity recognition, and semantic role labeling (Socher et al., 2011) (Luong et al., 2013) (Maas et al., 2015) (Li et al., 2015) (Collobert et al., 2011) (Pennington et al., 2014).


This section provides a high level background of the tasks within Computational Linguistics.


4.1 Part-of-Speech Tagging

POS tagging aims to label each word with a unique tag that indicates its syntactic role, such as noun, verb, or adjective.

The best POS taggers are based on classifiers trained on windows of text, which are then fed to a bidirectional decoding algorithm during inference.

In general, models resemble a bi-directional dependency network, and can be trained using a variety of methods including support vector machines and bi-directional Viterbi decoders.

4.2 Chunking

Chunking aims to label segments of a sentence with syntactic constituents such as noun or verb phrases. It is also called shallow parsing and can be viewed as a generalization of part-of-speech tagging to phrases instead of words.

Implementations of chunking usually require an underlying POS implementation, after which the words are compounded or chunked by concatenation.

4.3 Named Entity Recognition

NER labels atomic elements in a sentence into categories such as PERSON or LOCATION.

Features to train NER classifiers include POS tags, CHUNK tags, prefixes and suffixes, and large lexicons of the labeled entities.

4.4 Semantic Role Labeling

SRL aims to assign a semantic role to a syntactic constituent of a sentence.

State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags.

SRL systems usually entail numerous features like the parts of speech and syntactic labels of words and nodes in the tree, the syntactic path to the verb in the parse tree, whether a node in the parse tree is part of a noun or verb phrase etc.

Document vectorization is needed to convert text content into a numeric vector representation that can be used as features to train a machine learning model. This section discusses a few statistical methods for computing this feature vector (John and Vechtomova, 2017).

5.1 N-gram Model

N-grams are contiguous sequences of ‘n’ items from a given sequence of text or speech. Given a complete corpus of documents, each distinct n-gram, of either characters or words, is represented by a unique bit in a bit vector. Aggregated over a body of text, these bits form a sparse vectorized representation of the text in terms of n-gram occurrences.
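The bit-vector construction described above can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization and word-level n-grams; the function names are hypothetical and not taken from the paper.

```python
from itertools import chain

def word_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(corpus, n):
    """Map every distinct n-gram in the corpus to a fixed vector position."""
    grams = sorted(set(chain.from_iterable(
        word_ngrams(doc.split(), n) for doc in corpus)))
    return {gram: index for index, gram in enumerate(grams)}

def to_bit_vector(text, vocabulary, n):
    """Set position i to 1 if the i-th vocabulary n-gram occurs in the text."""
    present = set(word_ngrams(text.split(), n))
    # dicts preserve insertion order, so positions follow the vocabulary indices
    return [1 if gram in present else 0 for gram in vocabulary]

corpus = ["the cat sat", "the cat ran"]
vocab = build_vocabulary(corpus, 2)
vector = to_bit_vector("the cat sat", vocab, 2)
```

With a realistic corpus the vocabulary grows quickly, which is why such vectors are usually stored in a sparse format.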

5.2 TF-IDF Model

Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus (Sparck Jones, 1972). The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. It is a bag-of-words model and does not preserve word ordering.
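As a rough sketch of how such weights can be computed, the snippet below uses one common variant, relative term frequency multiplied by log(N / df). The exact weighting scheme is an assumption here, since several variants exist and the paper does not fix one.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(docs)
```

A word such as "the", which appears in every document, receives a weight of zero, while a word confined to a single document receives the highest weight within that document.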


5.3 Paragraph Vector Model

A Paragraph Vector model is comprised of an unsupervised learning algorithm that learns fixed-size vector representations for variable-length pieces of texts such as sentences and documents (Le and Mikolov, 2014). The vector representations are learned to predict the surrounding words in contexts sampled from the paragraph.

Two distinct implementations have gained prominence in the community.

Doc2Vec: A Python library implementation in Gensim.

FastText: A standalone implementation in C++. (Bojanowski et al., 2016) (Joulin et al., 2016).


Fully connected feed-forward neural networks are non-linear learners that can be used as a drop-in replacement wherever a linear learner is used.

The high accuracy observed in experimental results is a consequence of this nonlinearity along with the availability of pre-trained word embeddings.

Multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering.

Convolutional and pooling architectures show promising results on many tasks, including document classification, short-text categorization, sentiment classification, relation type classification between entities, event detection, paraphrase identification, semantic role labeling, question answering, predicting box-office revenues of movies based on critic reviews, modeling text interestingness, and modeling the relation between character sequences and part-of-speech tags.

Convolutional and pooling architectures allow us to encode arbitrarily large items as fixed-size vectors capturing their most salient features, but they do so by sacrificing most of the structural information.

Recurrent and recursive networks allow working with sequences and trees while preserving the structural information.

Recurrent models have been shown to produce very strong results for language modeling as well as for sequence tagging, machine translation, dependency parsing, sentiment analysis, noisy text normalization, dialog state tracking, response generation, and modeling the relation between character sequences and part-of-speech tags.

Recursive models were shown to produce state-of-the-art or near state-of-the-art results for constituency and dependency parse reranking, discourse parsing, semantic relation classification, political ideology detection based on parse trees, sentiment classification, target-dependent sentiment classification, and question answering.

Convolutional nets are observed to work well for summarization-related tasks, just as recurrent/recursive nets work well for language modeling tasks.

Goal: Knowing the basic structure of a sentence, one should be able to create a new sentence by replacing parts of the old sentence with interchangeable entities (Bengio et al., 2003).

Challenge: The main bottleneck is computing the activations of the output layer, since it is a fully-connected softmax activation layer.


One of the major contributions of this paper in terms of optimizations was data-parallel processing (different processors working on different subsets of the data) and asynchronous processor usage of shared memory.

The authors propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task).

With n-gram models, state-of-the-art results are typically obtained using trigrams.

Language generation via substitution of semantically similar language constructs of existing sentences can be done via shared-parameter multi-layer neural networks.

The objective of this paper is to obtain real-valued vector sequences of words and learn a joint probability function for those sequences of words alongside the feature vector, and hence, jointly learn both the real-valued vector representation and the parameters of the probability distribution.

This probability function can be tuned in order to maximize the log-likelihood of the training data, with a regularization penalty added to the cost function, similar to the penalty term used in ridge regression.

This will ensure that semantically similar words end up with almost equivalent feature vectors, called learned distributed feature vectors.

A challenge with modeling discrete variables like a sentence structure as opposed to a continuous value is that the continuous valued function can be assumed to have some form of locality, but the same assumption cannot be made in case of discrete functions.

N-gram models try to achieve a statistical modeling of languages by calculating the conditional probabilities of each possible word that can follow a set of n preceding words.

New sequences of words can be generated by effectively gluing together the popular combinations i.e. n-grams with very high frequency counts.
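To make the n-gram idea concrete, here is a minimal bigram (n = 1 preceding word) sketch. It estimates the conditional probabilities by relative frequency and then generates text by repeatedly choosing the most frequent continuation. The sentence boundary markers `<s>` and `</s>` and the greedy generation strategy are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next | previous) by relative frequency."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for previous, current in zip(tokens, tokens[1:]):
            pair_counts[previous][current] += 1
    return {
        previous: {word: count / sum(followers.values())
                   for word, count in followers.items()}
        for previous, followers in pair_counts.items()
    }

def generate(model, max_len=10):
    """Greedily glue together the most frequent continuations."""
    word, output = "<s>", []
    for _ in range(max_len):
        word = max(model[word], key=model[word].get)
        if word == "</s>":
            break
        output.append(word)
    return output

model = train_bigram_model(["the cat sat", "the cat ran", "the dog sat"])
generated = generate(model)
```

Any word pair that never occurs in the training data gets no probability at all, which is the locality problem that distributed representations are meant to address.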

Similar to the previous paper, this paper attempts to tackle the ‘curse of dimensionality’ (Section 7) and to produce a much faster variant (Morin and Bengio, 2005).

Back-off n-grams are used to learn a real-valued vector representation of each word.

The word embeddings learned are shared across all the participating nodes in the distributed architecture.

A very important component of the whole model is the choice of the binary encoding of the words, i.e. of the hierarchical word clustering. In this paper the authors combine empirical statistics with prior knowledge from the WordNet resource.

Goal: The paper attempts to build a paragraph embedding from the underlying word and sentence embeddings, and then proceeds to decode the paragraph embedding in an attempt to reconstruct the original paragraph (Li et al., 2015).

The implementation uses an LSTM layer to convert words into a vector representation of a sentence. A subsequent LSTM layer converts multiple sentences into a paragraph.

For this to happen, we need to preserve syntactic, semantic, and discourse-related properties while creating the embedded representation.

A hierarchical LSTM is utilized to preserve sentence structure.

Parameters are estimated by maximizing likelihood of outputs given inputs, similar to standard sequence-to-sequence models.

Estimates are calculated using softmax functions to maximize the likelihood of the constituent words.

Attention models using the hierarchical autoencoder could be utilized for dialog systems, since the autoencoder explicitly models discourse.

Goal: In this paper, the authors examine the vector-space word representations that are implicitly learned by the input-layer weights. These representations are surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words (Mikolov et al., 2013c). This is one of the seminal papers that led to the creation of Word2Vec, which is a state-of-the-art word embedding tool (Mikolov et al., 2013a).

In this model, words are converted via a learned lookup-table into real valued vectors which are used as the inputs to a neural network.

One of the main advantages of these models is that the distributed representation achieves a level of generalization that is not possible with classical n-gram language models.

The word representations in this paper are learned by a recurrent neural network language model.

The input vector w(t) represents input word at time t encoded using 1-of-N coding, and the output layer y(t) produces a probability distribution over words. The hidden layer s(t) maintains a representation of the sentence history. The input vector w(t) and the output vector y(t) have dimensionality of the vocabulary.

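The values in the hidden and output layers are computed as follows, with weight matrices U, W, and V:

$s(t) = f(U w(t) + W s(t-1))$

$y(t) = g(V s(t))$

where $f(z) = \frac{1}{1 + e^{-z}}$ and $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$.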
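A single time step of such a recurrent language model can be sketched as follows, assuming the standard formulation s(t) = f(U w(t) + W s(t-1)) and y(t) = g(V s(t)), with f the logistic sigmoid and g the softmax. The matrix names U, W, V, the toy sizes, and the random initialization are assumptions made for illustration.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-x)) for x in z]

def softmax(z):
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(U, W, V, w_t, s_prev):
    """One step: s(t) = f(U w(t) + W s(t-1)), y(t) = g(V s(t))."""
    pre = [a + b for a, b in zip(matvec(U, w_t), matvec(W, s_prev))]
    s_t = sigmoid(pre)
    y_t = softmax(matvec(V, s_t))
    return s_t, y_t

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, hidden_size = 5, 3
U = random_matrix(hidden_size, vocab_size, rng)
W = random_matrix(hidden_size, hidden_size, rng)
V = random_matrix(vocab_size, hidden_size, rng)

w_t = [0, 0, 1, 0, 0]  # 1-of-N encoding of the word at time t
s_t, y_t = rnn_step(U, W, V, w_t, [0.0] * hidden_size)
```

Because w(t) is 1-of-N encoded, the product U w(t) simply selects one column of U, and that column acts as the learned representation of the input word.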


Figure 1: RNN Language Model

To compute the answer to an analogy question a : b; c : d where d is unknown, continuous space word representations make the task as simple as calculating

$y = x_b - x_a + x_c$

y is the best estimate of d that the model could compute. If there is no vector amongst the trained words such that $y = x_w$, the nearest vector representation can be found using cosine similarity.
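The vector-offset procedure can be sketched as below, assuming the offset y = x_b - x_a + x_c followed by a nearest-neighbor search under cosine similarity. The embedding values are made up for illustration, and excluding the three query words from the candidates is a common practical choice rather than something stated in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(embeddings, a, b, c):
    """Answer a : b :: c : ? via y = x_b - x_a + x_c and cosine similarity."""
    y = [xb - xa + xc for xa, xb, xc
         in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Assumption: the query words themselves are not valid answers.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], y))

# Toy, hand-made vectors (hypothetical values, not trained embeddings).
toy = {
    "man": [1.0, 0.0, 1.0],
    "woman": [1.0, 1.0, 1.0],
    "king": [3.0, 0.0, 2.0],
    "queen": [3.0, 1.0, 2.0],
    "apple": [0.0, 0.5, 3.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```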


Goal: The paper aims to address the inaccuracy in vector representations of complex and rare words, supposedly caused by the lack of relation between morphologically related words (Luong et al., 2013).

The authors treat each morpheme as a basic unit in the RNNs and construct representations for morphologically complex words on the fly from their morphemes. By training a neural language model (NLM) and integrating RNN structures for complex words, they utilize contextual information to learn morphemic semantics and their compositional properties.

The paper discusses the problem that Word2Vec syntactic relations might not hold true if the vector representation of a rare word is inaccurate to begin with.

morphoRNN operates at the morpheme level rather than the word level. An example of this is illustrated in Figure 2.

Parent words are created by combining a stem vector and an affix vector, as shown in Equation 1.
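One plausible reading of this composition step is a single non-linear layer over the concatenated stem and affix vectors, p = f(W_m [x_stem; x_affix] + b_m). The sketch below assumes that form with f = tanh. The parameter names, toy dimensions, and vector values are hypothetical and chosen only for illustration.

```python
import math

def compose(stem, affix, W_m, b_m):
    """Parent vector p = tanh(W_m [stem; affix] + b_m)."""
    concatenated = stem + affix  # list concatenation plays the role of [x_stem; x_affix]
    return [
        math.tanh(sum(w * x for w, x in zip(row, concatenated)) + bias)
        for row, bias in zip(W_m, b_m)
    ]

# Toy 2-dimensional morpheme vectors and parameters (hypothetical values).
W_m = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
b_m = [0.0, 0.0]
x_fortunate = [0.2, -0.4]   # stem
x_un = [-0.6, 0.1]          # affix
x_unfortunate = compose(x_fortunate, x_un, W_m, b_m)
```

Since the parent vector has the same dimensionality as its children, the same function can be applied again to attach a further affix.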


Figure 2: morphoRNN

The paper describes both context sensitive and insensitive versions of the Morphological RNN.

Similar to a typical RNN, the network is trained by computing the activation functions and propagating the errors backward in a forward-backward pass architecture.

This RNN model performs better than most of the other neural language models, and could be used to supplement word vectors.

forming normalization at the final layer altogether.

The ideas presented in this paper build on the previous ideas presented by Bengio et al. (2003).

The objective was to obtain high-quality word embeddings that capture the syntactic and semantic characteristics of words in a manner that allows algebraic operations to proxy the distances in vector space.

The training time here scales with the dimensionality of the learned feature vectors and not with the volume of training data.

The approach attempts to find a distributed vector representation of values as opposed to a continuous representation of values as computed by methods like LSA and LDA.

The models are trained using stochastic gradient descent and backpropagation.

The RNN models are touted to have an inherently better representation of sentence structure for complex patterns, without the need to specify context length.

To allow for the distributed training of the data, the framework DistBelief was used with multiple replicas of the model. Adagrad was utilized for asynchronous gradient descent.

Two distinct models were conceptualized for the training of the word vectors based on context, both of which are continuous and distributed representations of words. These are illustrated in Figure 3.

Continuous Bag-of-Words model: This model uses the context of a word i.e. the words that precede and follow it, to predict the current word.

Skip-gram model: This model uses the current word to predict the context it appeared in.
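The difference between the two models is easiest to see in the training examples each one derives from the same sentence. The sketch below only builds those examples and does not train anything. The function names and the symmetric window are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    """(current word, context word) pairs: the current word predicts its context."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, current word) examples: the context predicts the current word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

sentence = "the quick brown fox".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

Skip-gram yields one example per (word, neighbor) pair, whereas CBOW yields one example per word with all neighbors combined into a single input.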

The experimental results show that the CBOW and skip-gram models consistently outperform the then state-of-the-art models. It was also observed that after a point, increasing the dimensions

Figure 3: CBOW and Skip-gram models

One of the optimizations suggested is to subsample the training set words to achieve a speed-up in model training.

Given a sequence of training words $[w_1, w_2, w_3, \ldots, w_T]$, the objective of the skip-gram model is to maximize the average log probability shown in Equation 3

$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$ (3)

where c is the size of the window or context surrounding the current word being trained on.

child nodes. These define a random walk that assigns probabilities to words.

The authors use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training. It has been observed before that grouping words together by their frequency works well as a very simple speedup technique for the neural network based language models.
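The effect of a Huffman tree on code lengths can be illustrated with the standard heap-based construction. This sketch only derives the binary codes from word frequencies and does not implement hierarchical softmax itself. The toy frequencies are made up.

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Return {word: bit string}; frequent words receive shorter codes."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tiebreak), {word: ""})
            for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes_a.items()}
        merged.update({w: "1" + c for w, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"the": 10, "cat": 3, "sat": 2, "mat": 1})
```

The code length of a word equals the number of binary decisions on the path from the root to its leaf, so shorter codes for frequent words translate into fewer computations per training example.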

Noise Contrastive Estimation (NCE), which is an alternative to hierarchical softmax, posits that a good model should be able to differentiate data from noise by means of logistic regression.

To counter the imbalance between the rare and frequent words, the authors use a simple subsampling approach: each word $w_i$ within the training set is discarded with probability

$P(w_i) = 1 - \sqrt{t / f(w_i)}$

where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold.

This is similar to a dropout of neurons from the network, except that it is statistically more likely that frequent words are removed from the corpus by virtue of this method.

Discarding the frequently occurring words allows for a reduction in computational and memory cost.
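A minimal sketch of frequency-based subsampling follows, assuming the discard probability 1 - sqrt(t / f(w)), where f(w) is the relative frequency of the word and t is a threshold. Clamping negative values to zero and the toy threshold of 0.05 are assumptions for this small example. Thresholds used on real corpora are far smaller.

```python
import math
import random
from collections import Counter

def discard_probabilities(tokens, threshold):
    """P(discard w) = max(0, 1 - sqrt(threshold / f(w)))."""
    total = len(tokens)
    return {
        word: max(0.0, 1.0 - math.sqrt(threshold / (n / total)))
        for word, n in Counter(tokens).items()
    }

def subsample(tokens, threshold, rng):
    """Drop each token independently with its word's discard probability."""
    probs = discard_probabilities(tokens, threshold)
    return [w for w in tokens if rng.random() >= probs[w]]

tokens = ["the"] * 90 + ["neural"] * 9 + ["embedding"]
probs = discard_probabilities(tokens, threshold=0.05)
kept = subsample(tokens, threshold=0.05, rng=random.Random(0))
```

Words whose relative frequency is below the threshold are never discarded, so rare words keep all of their training occurrences.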

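The individual words can easily be coalesced into phrases using unigram and bigram frequency counts:

$score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i) \times count(w_j)}$

where $\delta$ is a discounting coefficient that prevents phrases from being formed out of very infrequent words.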
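Phrase detection from counts can be sketched as follows, assuming a score of the form (count(a b) - delta) / (count(a) * count(b)), where delta discounts bigrams that occur only rarely. The example text and the value of delta are made up for illustration.

```python
from collections import Counter

def phrase_scores(tokens, delta):
    """score(a, b) = (count(a b) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

text = "new york is big and new york rocks and the city is new".split()
scores = phrase_scores(text, delta=1)
best = max(scores, key=scores.get)
```

Bigrams whose score exceeds a chosen threshold would then be merged into a single token, and the pass can be repeated to form longer phrases.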


Another interesting property of learning these distributed representations is that the word and phrase representations learned by the skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetic.

Goal: This paper proposes a global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods (Pennington et al., 2014).

The relationship between any arbitrary words can be examined by studying the ratio of their co-occurrence probabilities with various probe words.

The authors suggest that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.
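The role of co-occurrence ratios can be illustrated on a toy corpus. The sketch below counts symmetric window co-occurrences and compares P(probe | ice) with P(probe | steam). The corpus, the window size, and the function names are assumptions made for illustration, and no model is trained.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window):
    """X[i][j]: number of times word j occurs within `window` words of word i."""
    X = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[word][tokens[j]] += 1
    return X

def probability(X, word, probe):
    """P(probe | word) = X[word][probe] / sum_k X[word][k]."""
    return X[word][probe] / sum(X[word].values())

corpus = [
    "ice is solid", "ice is solid water", "ice is rarely gas",
    "steam is gas", "steam is gas water", "steam is rarely solid",
]
X = cooccurrence_counts(corpus, window=3)
ratios = {
    probe: probability(X, "ice", probe) / probability(X, "steam", probe)
    for probe in ("solid", "gas", "water")
}
```

A ratio well above one marks a probe word related to the first word, a ratio well below one marks a probe related to the second, and a ratio near one marks a probe that does not discriminate between them.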

We can express this co-occurrence relation as shown below

This makes the feature matrix interchangeable with its transpose.


which maintains the sparsity of X while avoiding the divergences while computing the co-occurrences matrix.

The model obtained in the paper could be compared to a global skip-gram model as opposed to the fixed window-size skip-gram model proposed by Mikolov et al. (2013a).

The performance seems to increase monotonically with an increase in training data.

RQ1 What are the relatively simple statistical techniques to extract features from text?

Word count frequency models like n-grams and simple bag-of-words models such as TF-IDF are still the easiest tools to obtain a numeric vector representation of text.

RQ2 Is there any inherent benefit to using neural networks as opposed to the simple methods?

The benefit of using neural nets primarily is their ability to identify obscure patterns, and remain flexible enough for a varied set of application areas from topic classification to syntax parse-tree generation.

RQ3 What are the trade-offs that neural networks incur as opposed to the simple methods?

The trade-offs are typically expressed in terms of computational cost and memory usage, although model complexity is a factor too, given that neural nets can be trained to learn arbitrarily complex generative models.

RQ4 How do the different techniques compare to each other in terms of performance and accuracy?

This question can only be answered subjectively as it varies from application to application. Typically, document similarity can be tackled with a simple statistical approach like TF-IDF. CNNs inherently model input data in a manner that iteratively reduces the dimensionality, making them a great fit for topic classification and document summarization. RNNs are great at modeling sequences of text, which makes them apt for language syntax modeling. Amongst the frameworks, GloVe’s pre-trained word embeddings perform better than vanilla Word2Vec, which is considered state-of-the-art.

RQ5 In what use-cases do the trade-offs outweigh the benefits of neural networks?

As explained for the previous question, for a simple information retrieval use case such as document ranking, models such as TF-IDF and word PMI (pointwise mutual information) are sufficient, and neural networks would be overkill.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.

Yoav Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57:345–420.

Vineet John and Olga Vechtomova. 2017. UW-FinSent at SemEval-2017 Task 5: Fine-grained sentiment analysis on financial news headlines. Proceedings of the 11th International Workshop on Semantic Evaluation.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML. volume 14, pages 1188–1196.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.

Thang Luong, Richard Socher, and Christopher D Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL. pages 104–113.

Andrew L Maas, Ziang Xie, Dan Jurafsky, and Andrew Y Ng. 2015. Lexicon-free conversational speech recognition with neural networks. In HLT-NAACL. pages 345–354.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In HLT-NAACL. volume 13, pages 746–751.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In AISTATS. Citeseer, volume 5, pages 246–252.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. volume 14, pages 1532–1543.

Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 129–136.

Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1):11–21.
