Distributed word representations have become a mainstay in natural language processing, enjoying a slew of applications (Sebastiani, 2002; Turian et al., 2010; Socher et al., 2013). Though Baroni et al. (2014) suggested that predictive models which use neural networks to generate the distributed word representations (also known as embeddings in this context) outperform counting models which work on co-occurrence matrices, recent work shows evidence to the contrary (Levy et al., 2014; Salle et al., 2016).
In this paper, we focus on improving a state-of-the-art counting model, LexVec (Salle et al., 2016), which performs factorization of the positive pointwise mutual information (PPMI) matrix using window sampling and negative sampling (WSNS). Salle et al. (2016) suggest that LexVec matches and often outperforms competing models in word similarity and semantic analogy tasks. Here we show that using positional contexts to approximate syntactic dependencies yields state-of-the-art performance on syntactic analogy tasks as well. We also show how it is possible to approximate WSNS using aggregate data, eliminating random access to the PPMI matrix, enabling the use of external memory. Though not undertaken in this paper, this modification effectively allows LexVec to be trained on web-scale corpora.
This paper is organized as follows: we review the LexVec model (2) and detail how positional contexts and external memory can be incorporated into the model (
3). We describe evaluation methods (
and discuss results in terms of related work (
finish with conclusions and future work.
Source code for the enhanced model is available at https://github.com/alexandres/ lexvec.
LexVec uses WSNS to factorize the PPMI matrix into two lower rank matrices. The co-occurrence matrix M is calculated from word-context pairs (w, c) obtained by sliding a symmetric window of size win over the training corpus C. The PPMI matrix is then calculated as follows
where represents index summation.
The word and context embeddings with dimensions
is the vocabulary and d the embedding dimension), are obtained by the minimization via stochastic gradient descent (SGD) of a combination of the loss functions
using WSNS. is the distribution used for drawing negative samples, chosen to be
with (Mikolov et al., 2013b; Salle et al., 2016), and #(w) the unigram frequency of w.
Two methods were defined for the minimization of eqs. (2) and (3): Mini-batch and Stochastic (Salle et al., 2016). Since the latter is more computationally efficient and yields equivalent results, we adopt it in this paper. The Stochastic method extends the context window with noise words drawn using negative sampling according to eq. (4). The key idea is that window sampling is likely to sample related words, approximating their vectors using eq. (2), while negative sampling is likely to select unrelated words, scattering their vectors using eq. (3). The resulting global loss, where
Given a word w and context word c, eq. (5) is proportional to #(w) and #(c). This is the desired behaviour for the global loss function, since the more frequent w or c are in the corpus, the more confident we can be about the corpus estimated Suppose both #(w) and #(c) are high, but
is low. This is unequivocal evidence of negative correlation between them, and so we should put more effort into approximating their PPMI. The argument is analogous for high PPMI. If on the other hand #(w) and #(c) are low, we cannot be too confident about the corpus estimated
, and so less effort should be spent on its approximation.
3.1 Positional Contexts
As suggested by Levy et al. (2015) and Salle et al. (2016), positional contexts (introduced in Levy et al. (2014)) are a potential solution to poor performance on syntactic analogy tasks. Rather than only accounting for which context words appear around a target word, positional contexts also account for their position relative to the target word. For example, in the sentence “the big dog barked loudly”, target word dog has contexts occurrence matrix, before having dimensions
|V |, takes on dimensions
using positional contexts.
This can be incorporated into LexVec with two minor modifications: 1) The context embedding takes on dimensions
tive sampling must now sample positional contexts rather than simple contexts. This latter point requires that the distribution from which negative samples are drawn become
Without positional contexts, either can be used as embeddings. Since positional contexts make the dimensions of both matrices incompatible,
cannot be used directly. We propose using the sum of all positional context vectors as the context vector for a word
3.2 External Memory
As window sampling scans over the training corpus and negative sampling selects random contexts, pairs are generated and the corresponding
cell must be accessed so that eqs. (2) and (3) can be minimized. Unfortunately, this results in random access to the PPMI matrix which requires it to be kept in main memory. Pennington et al. (2014) show that the under certain assumptions, this sparse matrix grows as
bounds the maximum corpus size that can be processed by LexVec.
We propose an approximation to WSNS that works as follows: All the word-context pairs generated by window sampling the corpus and by negative sampling each target word are first written to a file F. The file F is then sorted with duplicated lines collapsed, and the lines written in the format
indicates if the pair occurred or not in the corpus, tot is the number of times the pair occurs including negative sampling, and
the number of times it occurred in the corpus. F’s construction requires O(|C|) external memory, and only O(1) main memory. Additionally, all
are kept in main memory, using O(|V |) space. This is nearly identical to the way in which GloVe builds its sparse co-occurrence matrix on disk, with the additional logic for adding and merging negatively sampled pairs.
We now present two ways to proceed with training: multiple iteration or single iteration.
Multiple Iteration (MI): In this variant, F is shuffled. For each tuple F, eq. (2) is minimized tot times, using
and
to calculate
if marker is +, else
is equal to zero.
times to a new file
Then shuffle
and execute the MI algorithm described above on
Both MI and SI are minimizing the exact same global loss function given by eq. (5) as LexVec without external memory, the only difference between the three being the order in which word-context pairs are processed.
We report results from Salle et al. (2016) and use the same training corpus and parameters to train LexVec with positional contexts and external memory. The corpus is a Wikipedia dump from June 2015, tokenized, lowercased, and split into sentences, removing punctuation and converting numbers to words, for a final vocabulary of 302,203 words.
All generated embeddings have dimensionality equal to 300. As recommended in Levy et al. (2015) and used in Salle et al. (2016), the PPMI matrix used in all LexVec models and in PPMI-SVD is transformed using context distribution smoothing exponentiating context frequencies to the power 0.75. PPMI-SVD is the singular value decomposition of the PPMI matrix. LexVec and PPMI-SVD use symmetric windows of size 2. Both GloVe (Pennington et al., 2014) and Skip-gram with negative sampling (SGNS) (Mikolov et al., 2013b) were trained using a symmetric window of size 10. GloVe was run for 50 iterations, using parameters and learning rate of 0.05. LexVec and SGNS were run for 5 iterations, using 5 negative samples, and initial learning rate of 0.025. LexVec, PPMI-SVD, and SGNS use dirty subsampling (Mikolov et al., 2013b; Levy et al., 2015) with threshold
Words in the training corpus with unigram probability f greater than t are discarded with probability
. For LexVec, we report results for W and
embeddings when using positional contexts, otherwise
. For PPMI-SVD and GloVe we report
, and for SGNS, W, that correspond to their best results.
The goal of our evaluation is to determine whether: 1) Positional contexts improve syntactic performance 2) The use of external memory is a good approximation of WSNS. Therefore, we perform the exact same evaluation as Salle et al. (2016), namely the WS-353 Similarity (WSim) and Relatedness (WRel) (Finkelstein et al., 2001), MEN (Bruni et al., 2012), MTurk (Radinsky et al., 2011), RW (Luong et al., 2013), SimLex-999 (Hill et al., 2015), MC (Miller and Charles, 1991), RG (Rubenstein and Goodenough, 1965), and SCWS (Huang et al., 2012) word similarity tasks1, and the Google semantic (GSem) and syntactic (GSyn) analogy (Mikolov et al., 2013a) and MSR syntactic analogy dataset (Mikolov et al., 2013c) tasks. Word similarity tasks use cosine similarity. Word analogy tasks are solved using both 3CosAdd and 3CosMul (Levy et al., 2014).
Positional contexts improved performance in both similarity (table 1) and analogy tasks (table 2). As hypothesized, their use significantly improved LexVec’s performance on syntactic analogies, leading to the highest score on GSyn, surpassing GloVe and SGNS. This confirms the relevance of using positional contexts to capture syntactic dependencies.
Salle et al. (2016) reported that combining word and context embeddings scored marginally higher in word similarity tasks, and that holds true in our experiments, even for . In the analogy tasks, using only the word embedding leads to far better syntactic performance, indicating that the embedding W strikes a better balance between syntax and semantics than does
Table 1: Spearman rank correlation on word similarity tasks. Pos. = using positional contexts
Table 2: Results on word analogy tasks, given as percent accuracy.
The SI external memory implementation very closely approximates the standard variant (without the use of external memory), which was expected given that they minimize the exact same loss function. The gap between MI and standard was much wider. It seems that there is value in the way WSNS uses corpus ordering of word-context pairs to train the model. The SI variant more closely mimics this order, distributing same pair occurrences over the entire training. MI, on the other hand, has a completely artificial ordering, distant from corpus and SI’s ordering.
This paper presented two improvements to the word embedding model LexVec. The first yields state-of-the-art performance on syntactic analogies through the use of positional contexts. The second solves the need to store the PPMI matrix in main memory by using external memory. The SI variant of the external memory implementation was a good approximation of standard LexVec’s WSNS, enabling future training using web-scale corpora.
In future work, we plan to explore the model’s hyperparameter space, which could potentially boost model performance, having so far restricted ourselves to parameters recommended in Levy et al. (2015). Finally, following Tsvetkov et al. (2015), we will pursue evaluation of the model on downstream tasks in addition to the intrinsic evaluations used in this paper.
Marco Baroni, Georgiana Dinu, and Germ´an Kruszewski. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. volume 1, pages 238–247.
Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pages 136– 145.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web. ACM, pages 406–414.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics .
Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, pages 873– 882.
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics pages 211–225.
Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. CoNLL-2014 page 171.
Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. CoNLL-2013 104.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In HLT-NAACL. pages 746–751.
George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and cognitive processes 6(1):1–28.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014) 12.
Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web. ACM, pages 337–346.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.
Alexandre Salle, Marco Idiart, and Aline Villavicencio. 2016. Matrix factorization using window sampling and negative sampling for improved word representations. arXiv preprint arXiv:1606.00819 .
Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM computing surveys (CSUR) 34(1):1–47.
Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In ACL (1). pages 455–465.
Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation
of word vector representations by subspace alignment .
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pages 384–394.