Embedding Compression with Isotropic Iterative Quantization

2020·Arxiv

Abstract

Continuous representation of words is a standard component in deep learning-based NLP models. However, representing a large vocabulary requires significant memory, which can cause problems, particularly on resource-constrained platforms. Therefore, in this paper we propose an isotropic iterative quantization (IIQ) approach for compressing embedding vectors into binary ones, leveraging the iterative quantization technique well established for image retrieval, while satisfying the desired isotropic property of PMI based models. Experiments with pre-trained embeddings (i.e., GloVe and HDC) demonstrate a more than thirty-fold compression ratio with comparable and sometimes even improved performance over the original real-valued embedding vectors.

1 Introduction

Words are basic units in many natural language processing (NLP) applications, e.g., translation (Bahdanau et al. 2014) and text classification (Joulin et al. 2016). Understanding words is crucial but can be very challenging. One difficulty lies in the large vocabulary commonly seen in applications. Moreover, their semantic permutations can be numerous, constituting rich expressions at the sentence and paragraph levels.

In statistical language models, word distributions are learned for unigrams, bigrams, and generally n-grams. A unigram distribution gives the probability of each word. This histogram is already complex for a large vocabulary. The complexity of bigram distributions is quadratic in the vocabulary size, and that of n-gram distributions is exponential in n. This combinatorial explosion motivates researchers to develop alternative representations.

Instead of word distributions, continuous representations with floating-point vectors are much more convenient to handle: they are differentiable, and their differences can be used to draw semantic analogy. A variety of algorithms were proposed over the years for learning these word vectors. Two representative ones are Word2Vec (Mikolov et al. 2013a) and GloVe (Pennington et al. 2014). Word2Vec is a classical algorithm based on either skip-grams or a continuous bag of words, both of which are unsupervised and can directly learn word embeddings from a given corpus. GloVe is another embedding learning algorithm, which combines the advantage of a global factorization of the word co-occurrence matrix with that of the local context. Both approaches are effective in many NLP applications, including word analogy and named entity recognition.

Neural networks with word embeddings are frequently used in solving NLP problems, such as sentiment analysis (dos Santos and Gatti 2014) and named entity recognition (Lample et al. 2016). An advantage of word embeddings is that interactions between words may be modeled by using neural network layers (e.g., attention architectures).

Despite the success of these word embeddings, they often constitute a substantial portion of the overall model. For example, the pre-trained Word2Vec (Mikolov et al. 2013b) contains 3M word vectors and the storage is approximately 3GB. This cost becomes a bottleneck in deployment on resource-constrained platforms.

Thus, much work studies the compression of word embeddings. (Shu and Nakayama 2017) propose to represent word vectors by using multiple codebooks trained with Gumbel-softmax. (Grzegorczyk and Kurdziel 2017) learn binary document embeddings via a bag-of-words-like process. The learned vectors are demonstrated to be effective for document retrieval.

In information retrieval, iterative quantization (ITQ) (Gong et al. 2013) transforms vectors into binary ones, which are found to be successful in image retrieval. The method maximizes the bit variance while minimizing the quantization loss. It is theoretically sound and also computationally efficient. However, (Grzegorczyk and Kurdziel 2017) find that directly applying ITQ in NLP tasks may not be effective.

In (Mu et al. 2017), the authors propose an alternative approach that improves the quality of word embeddings without incurring extra training. The main idea lies in the concept of isotropy used to explain the success of pointwise mutual information (PMI) based embeddings. The authors demonstrate that the isotropy could be improved through projecting embedding vectors toward weak directions.

Therefore, in this work we propose isotropic iterative quantization (IIQ), which leverages iterative quantization while satisfying the isotropic property. The main idea is to optimize a new objective function regarding the isotropy of word embeddings, rather than maximizing the bit variance.

Maximizing the bit variance and maximizing isotropy are two opposite ideas, because the former performs projection toward large eigenvalues (dominant directions) while the latter projects toward the smallest ones (weak directions). Given prior success (Mu et al. 2017), it is argued that maximizing isotropy is more beneficial in NLP applications.

2 Related Work

In information retrieval, from which the proposed method draws inspiration, locality-sensitive hashing (LSH) is well studied and explored. The aim of LSH is to preserve the similarity between inputs after hashing. This aim is well aligned with that of embedding compression. For example, word similarity can be measured by the cosine distance of their embeddings. If LSH is applied, the hashed embeddings should maintain distances similar to the original cosine distances while having much lower complexity. A well-known LSH method in image retrieval is ITQ (Gong et al. 2013). However, its application in NLP tasks such as document retrieval is not as successful (Grzegorczyk and Kurdziel 2017). Rather, the authors propose to learn binary paragraph embeddings via a bag-of-words-like model, which essentially computes a binary hash function for the real-valued embedding vectors. On the other hand, (Shu and Nakayama 2017) propose a compact structure for embeddings by using the Gumbel-softmax. In this approach, each word vector is represented as the summation of a set of real-valued embeddings. This idea amounts to learning a low-rank representation of the embedding matrix.

Pre-trained embeddings may be directly used in deep neural networks (DNN) or serve as initialization (Kim 2014). There exist several compression techniques for DNNs, including pruning (Han et al. 2015) and low-rank compression (Sainath et al. 2013). Most of these techniques require retraining for specific tasks, so challenges arise when applying them to unsupervised word embeddings (e.g., GloVe). (See et al. 2016) successfully apply DNN compression techniques to unsupervised embeddings. The authors use pruning to sparsify embedding vectors, which however requires retraining after each pruning iteration. Although retraining is common when compressing DNNs, it often takes a long time to recover the model performance. Similarly, (Acharya et al. 2019) use low-rank approximation to compress word embeddings, but they face the same need to fine-tune a supervised model.

3 Preliminaries

3.1 Iterative Quantization

In this section, we briefly revisit the iterative quantization method by breaking it down into two steps. The first step is to maximize bit variance when transforming given vectors into binary representation. The second step is about minimizing the quantization loss while maintaining the maximum bit variance.

Maximize Bit Variance. Let $X \in \mathbb{R}^{n \times d}$ be the embedding dictionary, where each row $x_i \in \mathbb{R}^{d}$ denotes the embedding vector for the $i$-th word in the dictionary. Assuming that vectors are zero centered ($\sum_{i=1}^{n} x_i = 0$), ITQ encodes vectors with a binary representation through maximizing the bit variance, which is achieved by solving the following optimization problem:

$$\max_{W} \; \frac{1}{n} \mathrm{tr}(W^T X^T X W) \quad \text{s.t.} \quad W^T W = I, \quad B = \mathrm{sgn}(XW), \tag{1}$$

where $W \in \mathbb{R}^{d \times c}$ and $c$ is the dimension of the encoded vectors. Here, $B$ is the final binary representation of $X$, and $\mathrm{tr}(\cdot)$ and $\mathrm{sgn}(\cdot)$ are the trace and the sign function, respectively. The problem is the same as that of Principal Component Analysis (PCA) and could be solved by selecting the top $c$ right singular vectors of $X$ as $W$.

Minimize Quantization Loss. Given a solution $W$ to Equation (1), $U = WR$ is also a solution for any orthogonal matrix $R \in \mathbb{R}^{c \times c}$. Thus, we could minimize the quantization loss by adjusting the matrix $R$ while maintaining the solution to (1). The quantization loss is defined as the difference between the vectors before and after the quantization:

$$Q(B, R) = \| B - XWR \|_F^2, \tag{2}$$

where $\|\cdot\|_F$ is the Frobenius norm. Note that $B$ must be binary. The proposed solution in ITQ is an iterative procedure that updates $B$ and $R$ in an alternating fashion until convergence. In practice, ITQ achieves good performance with early stopping (Gong et al. 2013).
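The claim that any rotation $U = WR$ leaves the relaxed objective unchanged can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the function names `pca_projection` and `bit_variance_objective` and the toy matrix are illustrative assumptions, not part of ITQ itself.

```python
import numpy as np

def pca_projection(X, c):
    """Return W (d x c): the top-c right singular vectors of zero-centered X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:c].T

def bit_variance_objective(X, W):
    """Relaxed bit-variance objective (1/n) * tr(W^T X^T X W)."""
    n = X.shape[0]
    return np.trace(W.T @ X.T @ X @ W) / n

rng = np.random.default_rng(0)
# Toy data with unequal column scales so that PCA has clear dominant directions.
X = rng.standard_normal((500, 16)) * np.linspace(3.0, 0.5, 16)
X -= X.mean(axis=0)                                 # zero-center, as assumed in the text

W = pca_projection(X, c=8)
R, _ = np.linalg.qr(rng.standard_normal((8, 8)))    # a random orthogonal matrix
obj_W = bit_variance_objective(X, W)
obj_WR = bit_variance_objective(X, W @ R)           # unchanged by the rotation
B = np.where(X @ W @ R >= 0, 1, -1)                 # binary codes in {-1, +1}
```

Because the objective cannot distinguish between rotations, the choice of $R$ is free to be used for reducing the quantization loss instead.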

3.2 Isotropy of Word Embedding

In (Arora et al. 2016), isotropy is used to explain the success of PMI-based word embedding algorithms such as GloVe. However, (Mu et al. 2017) find that existing word embeddings are far from isotropic, although their isotropy can be improved. The proposed solution is to project word embeddings toward the weak directions rather than the dominant directions, which seems counter-intuitive but works well in practice. The isotropy of word embedding $X$ is defined as:

$$I(X) = \frac{\min_{\|c\|=1} Z(c)}{\max_{\|c\|=1} Z(c)}, \tag{3}$$

where $Z(c)$ is the partition function

$$Z(c) = \sum_{i=1}^{n} \exp(c^T x_i). \tag{4}$$

Here $c$ denotes a unit vector, not the code dimension of Section 3.1. The value of $I(X)$ is a measure of isotropy of the given embedding $X$. A higher $I(X)$ means a more isotropic and better quality embedding. It is found that making the singular values close to each other can effectively improve embedding isotropy.
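The isotropy measure cannot be evaluated exactly, since it involves a search over all unit vectors. A common shortcut, assumed here, restricts the search to the eigenvectors of $X^T X$; taking both signs of each eigenvector is a further assumption of this sketch. The function name `isotropy` and the toy data are illustrative.

```python
import numpy as np

def isotropy(X):
    """Estimate I(X) = min_c Z(c) / max_c Z(c), with Z(c) = sum_i exp(c^T x_i).

    Assumption: the unit vectors c are restricted to the eigenvectors of
    X^T X (with both signs), which only approximates the true min and max.
    """
    _, eigvecs = np.linalg.eigh(X.T @ X)
    candidates = np.concatenate([eigvecs, -eigvecs], axis=1)   # d x 2d
    Z = np.exp(X @ candidates).sum(axis=0)                     # one Z(c) per candidate
    return Z.min() / Z.max()

rng = np.random.default_rng(0)
# A shared offset creates one dominant direction, which hurts isotropy.
X_raw = rng.standard_normal((500, 10)) * 0.1 + 1.0
X_centered = X_raw - X_raw.mean(axis=0)

iso_raw = isotropy(X_raw)            # far below 1
iso_centered = isotropy(X_centered)  # close to 1
```

On this toy example, removing the common offset alone moves the estimate from near 0 to near 1, which illustrates why a dominant direction is harmful under this measure.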

4 Proposed Method

The preceding section hints that maximizing the isotropy and maximizing the bit variance are opposite in action: The former intends to make the singular values close by removing the largest singular values, whereas the latter removes the smallest singular values and maintains the largest. Given the success of isotropy in NLP applications, we propose to minimize the quantization loss while improving the isotropy, rather than maximizing the bit variance. We call the proposed method isotropic iterative quantization, IIQ.

The key idea of ITQ is based on the observation that U = WR is still a solution to the objective function of (1). In our approach IIQ, we show that the orthogonal transformation maintains the isotropy of the input embedding, so that we could apply a similar alternating procedure as in ITQ to minimize the quantization loss. As a result, our method is composed of three steps: maximizing isotropy, reducing dimension, and minimizing quantization loss.

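Maximize Isotropy. Following (Mu et al. 2017), the isotropy measure $I(X)$ can be approximated as:

$$I(X) \approx \frac{n - \|\mathbf{1}^T X\| + \frac{1}{2}\sigma_{\min}^2}{n + \|\mathbf{1}^T X\| + \frac{1}{2}\sigma_{\max}^2}, \tag{5}$$

where $\mathbf{1}$ is the all-ones vector and $\sigma_{\min}$ and $\sigma_{\max}$ are the smallest and largest singular values of $X$, respectively. For $I(X)$ to be 1, the middle term on both the numerator and the denominator must be zero and additionally $\sigma_{\min} = \sigma_{\max}$. The former requirement can be easily satisfied by zero-centering the given embeddings:

$$\tilde{x}_i = x_i - \mu, \quad i = 1, \ldots, n, \tag{6}$$

where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$. The latter may be approximately achieved by removing the large singular values such that the rest of the singular values are close to each other. The reason removing the large singular values brings the rest close together is that the large singular values often have substantial gaps, while the rest are clustered. Note, however, that removing singular components does not change the dimension of the embedding. We denote the maximized result as $\bar{X}$.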

Dimension Reduction. To make our method more flexible, we perform a dimension reduction afterward by using PCA. This step essentially removes the smallest singular values so that the clustering of the singular values may be further tightened. Note that PCA does not affect the maximized isotropy of the given embeddings, since it only works on the singular values that are already close to each other after the previous step. One can treat the dimension as a hyperparameter, tailored for each data set.
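The two steps above can be sketched with a singular value decomposition. This is an illustrative NumPy version under the assumption that "removing" a singular value means setting it to zero; the function names and the toy data are hypothetical, while `D` and `O` follow the parameter names used for Algorithm 1.

```python
import numpy as np

def maximize_isotropy(X, D):
    """Zero-center X, then zero out its D largest singular values."""
    X_tilde = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    s = s.copy()
    s[:D] = 0.0                      # drop the D dominant directions
    return (U * s) @ Vt              # same shape as X

def reduce_dimension(X_bar, O):
    """PCA: project onto the top-O right singular vectors of X_bar."""
    _, _, Vt = np.linalg.svd(X_bar, full_matrices=False)
    return X_bar @ Vt[:O].T          # n x O

rng = np.random.default_rng(0)
# Toy embedding with a strong shared component across all dimensions.
X = rng.standard_normal((300, 12)) + 2.0 * rng.standard_normal((300, 1))

s_before = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
X_bar = maximize_isotropy(X, D=2)
s_after = np.linalg.svd(X_bar, compute_uv=False)
X_low = reduce_dimension(X_bar, O=6)
s_low = np.linalg.svd(X_low, compute_uv=False)
```

In this sketch the spectrum of `X_bar` is the original spectrum with its two largest values removed, and `X_low` keeps the six largest of the remaining ones, so both ends of the spectrum are trimmed toward the clustered middle.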

Minimize Quantization Loss. Given a solution $\bar{X}$ to the maximization of (5), we prove that multiplying it with an orthogonal matrix $R$ results in the same isotropy $I(\bar{X})$. In other words, we could minimize the quantization loss (2) while maintaining the isotropy.

Figure 1: Quantization loss curve of 50000 embedding vectors from a pre-trained CNN model.

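Proposition 1. If $\bar{X}$ is isotropic and $R$ is orthogonal, then $U = \bar{X}R$ admits $I(U) = I(\bar{X})$.

Proof. Given that $R$ is orthogonal, we first prove that $U$ has the same singular values as does $\bar{X}$. Let $\bar{X}$ have the singular value decomposition (SVD)

$$\bar{X} = P \Sigma Q^T, \tag{7}$$

where $P$ has orthonormal columns, $\Sigma$ is the diagonal matrix of singular values, and $Q$ is an orthogonal matrix. Let $\hat{Q} = R^T Q$. Then, we have

$$U = \bar{X} R = P \Sigma Q^T R = P \Sigma \hat{Q}^T. \tag{8}$$

Since $\hat{Q}$ is also orthogonal, Equation (8) gives the SVD of $U$. Therefore, $U$ has the same singular values as does $\bar{X}$.

Moreover, $\mathbf{1}^T U = \mathbf{1}^T \bar{X} R = 0$, thus $U$ is also zero-centered. By Equation (5), we conclude $I(U) = I(\bar{X})$.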
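The two facts used in the proof, that an orthogonal rotation preserves the singular values and the zero-centering, can be verified numerically. The snippet below is a small sanity check on random data, not a substitute for the proof; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bar = rng.standard_normal((200, 8))
X_bar -= X_bar.mean(axis=0)                        # zero-centered input

R, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # random orthogonal matrix
U = X_bar @ R

sv_X = np.linalg.svd(X_bar, compute_uv=False)
sv_U = np.linalg.svd(U, compute_uv=False)

same_singular_values = np.allclose(sv_X, sv_U)     # spectrum is unchanged
still_centered = np.allclose(U.sum(axis=0), 0.0)   # column sums stay zero
```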

With Proposition 1, we can always use an orthogonal matrix $R$ to reduce the quantization loss without changing the isotropy. The iterative optimization strategy as in ITQ (Gong et al. 2013) is adopted to minimize the quantization loss. Two alternating steps lead to a local minimum. First, compute $B$ given $R$:

$$B = \mathrm{sgn}(\bar{X} R). \tag{9}$$

Second, update $R$ given $B$. The update minimizes the quantization loss, which essentially solves the orthogonal Procrustes problem. The solution is given by

$$S, \Omega, \hat{S}^T = \mathrm{SVD}(B^T \bar{X}), \qquad R = \hat{S} S^T, \tag{10}$$

where $\mathrm{SVD}(\cdot)$ is the singular value decomposition function and $\Omega$ is the diagonal matrix of singular values.
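Since each of the two updates minimizes the quantization loss over one variable with the other held fixed, the loss should never increase from one iteration to the next. The sketch below illustrates this on synthetic data; the random orthogonal initialization of `R` is an assumption borrowed from common ITQ practice, and the helper names are hypothetical.

```python
import numpy as np

def quantization_loss(B, V, R):
    """Squared Frobenius norm ||B - V R||_F^2."""
    return np.linalg.norm(B - V @ R, "fro") ** 2

def update_B(V, R):
    """Fix R; the loss-minimizing binary matrix is the elementwise sign."""
    return np.where(V @ R >= 0, 1.0, -1.0)

def update_R(V, B):
    """Fix B; solve the orthogonal Procrustes problem via an SVD."""
    S, _, S_hat_T = np.linalg.svd(B.T @ V)   # B^T V = S * Omega * S_hat^T
    return S_hat_T.T @ S.T                   # R = S_hat S^T

rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 16))
V -= V.mean(axis=0)
R, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # random orthogonal start (assumption)

losses = []
for _ in range(20):
    B = update_B(V, R)
    R = update_R(V, B)
    losses.append(quantization_loss(B, V, R))
```

Plotting `losses` against the iteration index would give a curve of the same kind as the one described for Fig. 1, although on different data.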

This iterative updating strategy runs until a locally optimal solution is found. Fig. 1 shows an example of the quantization loss curve. The behavior is similar to that of ITQ, whose authors propose early stopping to terminate the iteration in practice. We follow this guidance and run only 50 iterations in our experiments.

Overall Algorithm. Our method is an unsupervised approach, which does not require any label supervision. Therefore, it can be applied independently of downstream tasks and no fine tuning is needed. This advantage benefits many problems where embeddings often slow down the learning process because of the high space and computation complexity.

We present the pseudocode of the proposed IIQ method in Algorithm 1. The input D denotes the number of top singular values to be removed, T denotes the number of iterations for minimizing the quantization loss, and O denotes the dimension of the output binary vectors.

The first two lines zero-center the embedding. Lines 3 to 5 maximize the isotropy. Lines 6 to 8 reduce the embedding dimension, if necessary. Lines 9 to 15 minimize the quantization loss. Within the iteration loop, lines 11 to 12 update B based on the most recent R, whereas lines 13 to 14 update R given the updated B. The last line uses the final transformation R to return the binary embeddings as output.
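The steps described above can be put together in a short NumPy sketch. It is a reading of the prose description under stated assumptions, not a transcription of Algorithm 1: the random orthogonal initialization of `R`, the encoding of the output as 0/1 bits, and the function name `iiq` are choices made here for illustration, and the code lines do not correspond to the line numbers cited in the text.

```python
import numpy as np

def iiq(X, D, T, O, seed=0):
    """Sketch of IIQ.

    X: n x d real-valued embeddings.   D: number of top singular values removed.
    T: number of rotation iterations.  O: dimension of the output binary vectors.
    """
    rng = np.random.default_rng(seed)

    # Zero-center the embedding.
    X = X - X.mean(axis=0)

    # Maximize isotropy: drop the D largest singular components.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_bar = (U[:, D:] * s[D:]) @ Vt[D:]

    # Reduce the dimension with PCA, if necessary.
    if O < X_bar.shape[1]:
        _, _, Vt_bar = np.linalg.svd(X_bar, full_matrices=False)
        X_bar = X_bar @ Vt_bar[:O].T

    # Minimize the quantization loss by alternating between B and R.
    c = X_bar.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))   # assumption: random start
    for _ in range(T):
        B = np.where(X_bar @ R >= 0, 1.0, -1.0)        # update B given R
        S, _, S_hat_T = np.linalg.svd(B.T @ X_bar)     # update R given B
        R = S_hat_T.T @ S.T

    # Return the binary embedding as 0/1 bits.
    return (X_bar @ R >= 0).astype(np.uint8)

# Toy usage on random vectors (real use would load pre-trained embeddings).
X_demo = np.random.default_rng(1).standard_normal((200, 32))
bits_demo = iiq(X_demo, D=2, T=10, O=16)
```

No labels or downstream model appear anywhere in the procedure, which matches the unsupervised nature of the method.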

5 Experimental Results

We run the proposed method on pre-trained embedding vectors and evaluate the compressed embedding in various NLP tasks. For some tasks, the evaluation is directly conducted over the embedding (e.g., measuring the cosine similarity between word vectors); whereas for others, a classifier is trained with the embedding. We conduct all experiments in Python by using NumPy and Keras. The environment is Ubuntu 16.04 with Intel(R) Xeon(R) CPU E5-2698.

Pre-trained Embedding. We perform experiments with the GloVe embedding (Pennington et al. 2014) and the HDC embedding (Sun et al. 2015). The GloVe embedding is trained from 42B tokens of Common Crawl data. The HDC embedding is trained from public Wikipedia. It has a better quality than GloVe because the training process considers both syntagmatic and paradigmatic relations. All embedding vectors are used in the experiment without vocabulary truncation or post-processing.

Table 1: Experiment Configurations.

In addition, we evaluate embedding compression on a CNN model pre-trained with the IMDB data set. Different from the prior case, the embedding from CNN is trained with supervised class labels. We compress the embedding and retrain the model to evaluate performance. This way enables us to compare with other compression methods fairly.

Configuration. We compare IIQ with the traditional ITQ method (Gong et al. 2013), the pruning method (See et al. 2016), deep compositional code learning (DCCL) (Shu and Nakayama 2018), and a recent method (Tissier et al. 2019) which we refer to as NLB. The pruning method is set to prune 95% of the words for a similar compression ratio. The DCCL method is similarly configured. We run NLB with its default setting. We train the DCCL method for 200 epochs and set the batch size to be 1024 for GloVe and 64 for HDC. For our method, we set the iteration number T to be 50 since early stopping works sufficiently well. We set the same iteration number for ITQ. We also set the parameter D to be 2 for the HDC embedding and 14 for the GloVe embedding. Note that we perform all vector operations in the real domain on the platforms of (Jastrzebski et al. 2015) and (Conneau and Kiela 2018).

Table 1 lists the experiment configurations with method name, dimension, embedding value type, and compression ratio. The baseline means the original embedding. Our method starts with “IIQ,” followed by the compression ratio. The “dimension” column gives the number of vectors and the vector dimension. For DCCL, we list the parameters M and K that determine the compression ratio. Note that we use single precision for real values. The last column shows the compression ratio, which is the size of the original embedding over that of the compressed one. Thus, the compression ratio from real values to binary is 32. Moreover, we also apply dimension reduction in IIQ so that a higher compression ratio is possible.

Table 2: Word Similarity Results.
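The compression ratios follow from simple bit counting: a single-precision value takes 32 bits and a binary value takes 1 bit. The sketch below computes the ratio and shows one way of storing the bits compactly with `np.packbits`. The dimensions 300 and 150 and the random bit matrix are illustrative assumptions, not values taken from Table 1.

```python
import numpy as np

def compression_ratio(d_in, d_out, bits_in=32, bits_out=1):
    """Size of the original embedding over that of the compressed one."""
    return (d_in * bits_in) / (d_out * bits_out)

ratio_same_dim = compression_ratio(300, 300)   # binarization only
ratio_half_dim = compression_ratio(300, 150)   # binarization plus halved dimension

# Compact storage: pack 8 binary dimensions into each byte.
bits = np.random.default_rng(0).integers(0, 2, size=(1000, 300), dtype=np.uint8)
packed = np.packbits(bits, axis=1)                    # 38 bytes per word
restored = np.unpackbits(packed, axis=1)[:, :300]     # drop the padding bits
```

Note that byte alignment pads 300 bits to 38 bytes, so the ratio realized on disk can be slightly below the nominal one when the dimension is not a multiple of 8.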

5.1 Word Similarity

The task measures Spearman’s rank correlation between word vector similarity and human rated similarity. A higher correlation means a better quality of the word embedding. The similarity between two words is computed as the cosine of the corresponding vectors, i.e., $\cos(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$, where $x$ and $y$ are two word vectors. Out-of-vocabulary (OOV) words are replaced by the mean vector.
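The evaluation protocol can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the code of the benchmark platform used in the experiments; the toy words, vectors, and human ratings are hypothetical, and all function names are assumptions.

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = x^T y / (||x|| ||y||)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def word_similarity_score(emb, pairs, human_scores):
    """emb maps word -> vector; OOV words fall back to the mean vector."""
    mean_vec = np.mean(list(emb.values()), axis=0)
    model_scores = [cosine(emb.get(a, mean_vec), emb.get(b, mean_vec))
                    for a, b in pairs]
    return spearman(model_scores, human_scores)

# Toy data (hypothetical words and ratings, for illustration only).
emb = {"cat": np.array([1.0, 0.0]),
       "dog": np.array([0.9, 0.1]),
       "car": np.array([0.0, 1.0])}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [9.0, 1.0, 2.0]
score = word_similarity_score(emb, pairs, human)
```

Because only the ranking of the pairs matters, a compressed embedding can score well even when its cosine values differ from those of the original vectors.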

In this experiment, seven data sets are used, including MEN (Bruni et al. 2014) with 3000 pairs of words obtained from Amazon crowdsourcing; MTurk (Radinsky et al. 2011) with 287 pairs, focusing on word semantic relatedness; RG65 (Rubenstein and Goodenough 1965) with 65 pairs, an early published dataset; RW (Luong et al. 2013) with 2034 pairs of rare words selected based on frequencies; SimLex999 (Hill et al. 2015) with 999 pairs, aimed at genuine similarity estimation; TR9856 (Levy et al. 2015) with 9856 pairs, containing many acronyms and named entities; and WS353 (Agirre et al. 2009) with 353 pairs of mostly verbs and nouns. The experiment is conducted on the platform (Jastrzebski et al. 2015).

Table 2 summarizes the results. The performance of IIQ degrades as the compression ratio increases. This is expected, since a higher compression ratio leads to more loss of information. In addition, our IIQ method consistently achieves better results than ITQ, DCCL, NLB and the pruning method. In particular, on the MEN data set, IIQ even outperforms the baseline GloVe embedding. Another observation is that on TR9856, a higher compression ratio surprisingly yields better results for IIQ. We speculate that the cause is the multi-word term relations unique to TR9856. Interestingly, the pruning method results in negative correlation on SimLex999 for the GloVe embedding. This means that pruning too many small values inside a word embedding can drastically destroy the embedding quality.

5.2 Categorization

The task is to cluster words into different categories. The performance is measured by purity, which is defined as the fraction of correctly classified words. We run the experiment using agglomerative clustering and k-means clustering, and select the highest purity as the final result for each embedding. This experiment is conducted on the platform (Jastrzebski et al. 2015) where OOV words are replaced by the mean vector.
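Purity can be computed by assigning each cluster to its most frequent true category and counting the words that agree with that assignment. The following sketch uses this standard definition; the function name and the toy labels are illustrative.

```python
from collections import Counter

def purity(true_labels, cluster_ids):
    """Fraction of words that belong to the majority category of their cluster."""
    clusters = {}
    for label, cluster in zip(true_labels, cluster_ids):
        clusters.setdefault(cluster, []).append(label)
    # For each cluster, count the members that carry its most common label.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(true_labels)

# Toy example: five words, two categories, two clusters.
true_labels = ["a", "a", "b", "b", "b"]
cluster_ids = [0, 0, 0, 1, 1]
toy_purity = purity(true_labels, cluster_ids)   # 4 of 5 words agree: 0.8
```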

Four data sets are used in this experiment: AlmuharebPoesio (AP) (Almuhareb and Poesio 2005) with 402 words in 21 categories; BLESS (Baroni and Lenci 2011) with 200 nouns (animate or inanimate) in 17 categories; Battig (Battig and Montague 1969) with 5231 words in 56 taxonomic categories; and ESSLLI 2008 Workshop (M Baroni and Lenci 2008) with 45 verbs in 9 semantic categories.

Table 3 lists evaluation results for GloVe and HDC embeddings. One sees that the proposed IIQ method works better than ITQ, DCCL, and the pruning method on all data sets. However, NLB sometimes achieves the best result, for example on Battig. For ESSLLI, IIQ even outperforms the original GloVe and HDC embeddings.

5.3 Topic Classification

In this experiment, we perform topic classification by using sentence embedding. The embedding is computed as the average of the corresponding word vectors. The average of the binary embedding is fed to the classifier in single precision. Missing words are treated as zero and so are OOV words. In this task, we train a Multi-Layer Perceptron (MLP) as the classifier for each method. Due to the different sizes of the embeddings, we train for 10 epochs for all GloVe embeddings and 4 epochs for all HDC embeddings. Five-fold cross validation is used to report classification accuracy.
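The sentence embedding described above can be sketched as follows. One detail is an assumption of this sketch: OOV tokens contribute a zero vector and still count toward the average. The function name and the toy vocabulary are hypothetical.

```python
import numpy as np

def sentence_embedding(tokens, emb, dim):
    """Average the word vectors of a sentence in single precision.

    emb maps word -> vector (binary or real); words missing from emb are
    treated as zero vectors (assumed here to count toward the average).
    """
    zero = np.zeros(dim, dtype=np.float32)
    if not tokens:
        return zero
    vectors = [np.asarray(emb.get(t, zero), dtype=np.float32) for t in tokens]
    return np.mean(vectors, axis=0, dtype=np.float32)

# Toy binary embedding with 3 bits per word (illustrative values).
emb = {"good": np.array([1, 0, 1], dtype=np.uint8),
       "movie": np.array([0, 0, 1], dtype=np.uint8)}
sent = sentence_embedding(["good", "movie", "zzz"], emb, dim=3)
```

Although each word vector is binary, the averaged sentence vector is real-valued, which is why the classifier input is kept in single precision.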

Table 3: Categorization Results.

Four data sets are selected from (Wang and Manning 2012), including movie review (MR), customer review (CR), opinion-polarity (MPQA), and subjectivity (SUBJ). Similar performance is achieved by using the original embedding. The experiment is conducted on the platform of (Conneau and Kiela 2018).

Table 4 shows the results for each method. Similar to the previous tasks, the proposed IIQ method consistently performs better than ITQ, pruning, and DCCL. The only exception is that for MPQA and SUBJ, DCCL and NLB, respectively, achieve the best result for the GloVe embedding. As the compression ratio increases, the performance of IIQ degrades.

Table 4: Topic Classification Results.

Table 5: Configurations for IMDB Classification.

5.4 Sentiment Analysis

In this experiment, we evaluate the compressed embedding as input to a pre-trained Convolutional Neural Network (CNN) model on the IMDB data set (Maas et al. 2011). The CNN model follows the Keras tutorial (Chollet and others). We train 50,000 embedding vectors in 300 dimensions. The model is composed of an embedding layer, followed by a dropout layer with probability 0.2, a 1D convolution layer with 250 filters and kernel size 3, a 1D max pooling layer, a fully connected layer with hidden dimension 250, a dropout layer with probability 0.2, a ReLU activation layer, and a single output fully connected layer with sigmoid activation. Moreover, we use the Adam optimizer with learning rate 0.0001, sentence length 400, batch size 128, and train for 20 epochs. The input embedding fed into the CNN is kept fixed (not trainable).

The data set contains 25,000 movie reviews for training and another 25,000 for testing. We randomly separate 5,000 reviews from the training set as validation data. The model with the best performance on the validation set is kept as the final model for measuring test accuracy. Moreover, all results are averaged from 10 runs for each embedding. The baseline model is the pre-trained CNN model with 87.89% accuracy. Table 5 summarizes the configurations for this experiment. All configurations are similar to the previous experiments. The DCCL method is now configured with M = 32 and K = 32 to achieve a similar compression ratio.

Figure 2: IMDB CNN Test Accuracy Results.

We present in Fig. 2 the result of each embedding. The histogram shows the average accuracy of 10 experimental runs for each method and the error bar shows the standard deviation. One sees that among all compression methods, IIQ achieves the least performance degradation. IIQ with compression ratio 64 is the best.

Figure 3: Visualizing Binary IIQ Word Embedding.

5.5 Visualization

We visualize the binary IIQ embedding in Fig. 3. The nearest and furthest 100 word vectors are shown. The distance is calculated by the dot product. Fig. 3(a) shows the IIQ-compressed GloVe embedding and Fig. 3(b) shows the IIQ-compressed HDC embedding. The y axis lists every 10 words and the x axis is the dimension of the embedding. One sees that similar word vectors have similar patterns in many dimensions. A white column means that the dimension is zero for all words. A black column means one. Moreover, there is an obvious difference between the nearest and furthest words.

6 Conclusion

This paper presents an isotropic iterative quantization (IIQ) method for compressing word embeddings. While it is based on the ITQ method in image retrieval, it also maintains the embedding isotropy. We evaluate the proposed method on GloVe and HDC embeddings and show that it is effective for word similarity, categorization, and several other downstream tasks. For pre-trained embeddings that are less isotropic (e.g., GloVe), IIQ performs better than ITQ owing to the improvement in isotropy. These findings are based on a 32-fold (and higher) compression ratio. The results point to promising deployment of trained neural network models with word embeddings on resource-constrained platforms in real life.

References

[Acharya et al. 2019] Anish Acharya, Rahul Goel, Angeliki Metallinou, and Inderjit Dhillon. Online embedding compression for text classification using low rank matrix factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6196–6203, 2019.

[Agirre et al. 2009] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 19–27. Association for Computational Linguistics, 2009.

[Almuhareb and Poesio 2005] Abdulrahman Almuhareb and Massimo Poesio. Concept learning and categorization from the web. In proceedings of the annual meeting of the Cognitive Science society, 2005.

[Arora et al. 2016] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016.

[Bahdanau et al. 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[Baroni and Lenci 2011] Marco Baroni and Alessandro Lenci. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10. Association for Computational Linguistics, 2011.

[Battig and Montague 1969] William F Battig and William E Montague. Category norms of verbal items in 56 categories a replication and extension of the connecticut category norms. Journal of experimental Psychology, 80(3p2):1, 1969.

[Bruni et al. 2014] Elia Bruni, Nam-Khanh Tran, and Marco Baroni. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47, 2014.

[Chollet and others] Francois Chollet et al. Keras documentation, convolution1d for text classification. https://keras.io/examples/imdb_cnn/. Accessed: 2019-08.

[Conneau and Kiela 2018] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

[dos Santos and Gatti 2014] Cicero dos Santos and Maira Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78, 2014.

[Gong et al. 2013] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.

[Grzegorczyk and Kurdziel 2017] Karol Grzegorczyk and Marcin Kurdziel. Binary paragraph vectors. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 121–130, 2017.

[Han et al. 2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[Hill et al. 2015] Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.

[Jastrzebski et al. 2015] Stanisław Jastrzebski, Damian Leśniak, and Wojciech Marian Czarnecki. word-embeddings-benchmarks. https://github.com/kudkudak/word-embeddings-benchmarks, 2015.

[Joulin et al. 2016] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

[Kim 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[Lample et al. 2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.

[Levy et al. 2015] Ran Levy, Liat Ein-Dor, Shay Hummel, Ruty Rinott, and Noam Slonim. Tr9856: A multi-word term relatedness benchmark. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 419–424, 2015.

[Luong et al. 2013] Thang Luong, Richard Socher, and Christopher Manning. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, 2013.

[M Baroni and Lenci 2008] S Evert M Baroni and A Lenci. Bridging the gap between semantic theory and computational simulations: Proceedings of the esslli workshop on distributional lexical semantics. In Proceedings of the esslli workshop on distributional lexical semantics, 2008.

[Maas et al. 2011] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Mikolov et al. 2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[Mikolov et al. 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[Mu et al. 2017] Jiaqi Mu, Suma Bhat, and Pramod Viswanath. All-but-the-top: Simple and effective post-processing for word representations. arXiv preprint arXiv:1702.01417, 2017.

[Pennington et al. 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[Radinsky et al. 2011] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pages 337–346. ACM, 2011.

[Rubenstein and Goodenough 1965] Herbert Rubenstein and John B Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.

[Sainath et al. 2013] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6655–6659. IEEE, 2013.

[See et al. 2016] Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274, 2016.

[Shu and Nakayama 2017] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. arXiv preprint arXiv:1711.01068, 2017.

[Shu and Nakayama 2018] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[Sun et al. 2015] Fei Sun, Jiafeng Guo, Yanyan Lan, Jun Xu, and Xueqi Cheng. Learning word representations by jointly modeling syntagmatic and paradigmatic relations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 136–145, 2015.

[Tissier et al. 2019] Julien Tissier, Christophe Gravier, and Amaury Habrard. Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7104–7111, 2019.

[Wang and Manning 2012] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics: Short papers-volume 2, pages 90–94. Association for Computational Linguistics, 2012.