HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled Embedding of n-gram Statistics

2020·Arxiv

Abstract

Abstract

Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks, however, the models become more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using hyperdimensional computing enabled embedding. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on par scores while decreasing the time and memory requirements by several times compared to the conventional n-gram statistics, e.g., for one of the classifiers on a small dataset, the memory reduction was 6.18 times; while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, memory reduction was ca. 100 times and train and test speed-ups were over 100 times. Importantly, the usage of distributed representations formed via hyperdimensional computing allows dissecting strict dependency between the dimensionality of the representation and n-gram size, thus, opening a room for tradeoffs.

Index Terms—hyperdimensional computing, n-gram statistics, intent classification, embedding

I. INTRODUCTION

Recent work [1] has brought significant attention by demonstrating potential cost and environmental impact of developing and training state-of-the-art models for Natural Language Processing (NLP) tasks. The work suggested several countermeasures for changing the situation. One of them [1] recommends a concerted effort by industry and academia to promote research of more computationally efficient algorithms. The main focus of this paper falls precisely in this domain.

In particular, we consider NLP systems using a well-known technique called n-gram statistics. The key idea is that hyperdimensional computing [2] allows forming distributed

The work of DK was supported by the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie Individual Fellowship Grant Agreement 839179 and in part by the DARPA’s VIP (Super-HD Project) and AIE (HyDDENN Project) programs.

representations of the conventional n-gram statistics [3]. The use of these distributed representations, in turn, allows trading-off the performance of an NLP system (e.g., score) and its computational resources (i.e., time and memory). The main contribution of this paper is the systematic study of these tradeoffs on nine machine learning algorithms using several benchmark classification datasets. This is the first study where the computational tradeoffs of the distributed representations of n-gram statistics is studied in an extensive manner on numerous datasets. We demonstrate the usefulness of hyperdimensional computing-based embedding, which is highly time and memory efficient. Our experiments on a well-known dataset [4] for intent classification show that it is possible to reduce memory usage by x and speed-up training by x without compromising score.

Several important use-cases are motivating the efforts towards trading-off the performance of a system against computational resources required to achieve that performance: high-throughput systems with an extremely large number of requests/transactions (the power of one per cent); resourceconstrained systems where computational resources and energy are scarce (edge computing); green computing systems taking into account environmental sustainability when considering algorithms efficiency [5].

The paper is structured as follows. Section II covers the related work. Section III outlines the evaluation and describes the datasets. The methods being used are presented in Section IV. Section V evaluates of the experimental results. Discussion and conclusion are presented in Section VI.

II. RELATED WORK

Commonly, data for NLP tasks are represented in the form of vectors, which are then used as an input to machine learning algorithms. These representations range from dense learnable vectors to extremely sparse non-learnable vectors. Well-known examples of such representations include one-hot encodings, count-based vectors, and Term Frequency Inverse Document Frequency (TF-IDF) among others. Despite being very useful, non-learnable representations have their disadvantages such as resource inefficiency due to their sparsity and absence of contextual information (except for TF-IDF). Learnable vector representations such as word embeddings (e.g., Word2Vec [6] or GloVe [7]) partially address these issues by obtaining dense vectors in an unsupervised learning fashion. These representations are based on the distributional hypothesis: words located nearby in a vector space should have similar contextual meaning. The idea has been further improved in [8] by representing words with character n-grams. Another efficient way of representing a word is the concept of Byte Pair Encoding, which has been introduced in [9]. The disadvantage of the learnable representations, however, is that they require pretraining involving large train corpus as well as have a large memory footprint (in order of GB). As an alternative to word/character embedding, [10] introduced the idea of Subword Semantic Hashing that uses a hashing method to represent subword tokens, thus, reducing the memory footprint (in order of MB) and removing the necessity of pretraining over a large corpus. The approach has demonstrated state-of-the-art results on 3 intent classification datasets.

The Subword Semantic Hashing, however, relies on n-gram statistics for extracting the representation vector used as an input to classification algorithms. It is worth noting that the conventional n-gram statistics uses a positional representation where each position in the vector can be attributed to a particular n-gram. The disadvantage of the conventional n-gram statistics is that the size of the vector grows exponentially with n. Nevertheless, it is possible to untie the size of representation from n by using distributed representations [11], where the information is distributed across the vector’s positions. In particular, [3] suggest how to embed conventional n-gram statistics into a high-dimensional vector (HD vector) using the principles of hyperdimensional computing. Hyperdimensional computing also known as Vector Symbolic Architectures [2], [12]–[15] is a family of bio-inspired methods of manipulating and representing information. The method of embedding n-gram statistics into the distributed representation in the form of an HD vector has demonstrated promising results on the task of language identification while being hardware-friendly [16]. In [17] it was further applied to the classification of news articles and for traffic clustering in [18]. The method has also shown promising results [19] when using HD vectors for training Self-Organizing Maps [20], [21]. There are also other domains, which have not required the use of n-gram statistics, but benefited from the use of hyperdimensional computing: biomedical signal processing [22], [23], seizure onset detection and localization [24], gesture recognition [25], [26], fault isolation [27], physical activity recognition [28], and communications [29]–[31]. However, there are no previous studies comprehensively exploring tradeoffs achievable with the method on NLP datasets with the supervised classifiers.

III. EVALUATION OUTLINE

A. Classifiers and performance metrics

To obtain the results applicable to a broad range of existing machine learning algorithms, we have performed experiments with several conventional classifiers.1 In particular, the following classifiers were studied: Ridge Classifier (Ridge), k-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Passive Aggressive (PA), Random Forest (RF), Linear Support Vector Classifier (LSVC), Stochastic Gradient Descent (SGD), Nearest Centroid (NC), and Bernoulli Naive Bayes (BNB).2 All the classifiers are available in the scikit-learn library [34], which was used in the experiments.

Since the main focus of this paper is the tradeoff between classification performance and computational resources, we have to define metrics for both aspects. The quality of the classification performance of a model will be measured by a simple and well-known metric – score (please see [35]). The computational resources will be characterized by three metrics: the time it takes to train a model, the time it takes to test the trained model, and the memory, where the memory is defined as the sum of the size of input feature vectors for train and test splits as well as the size of the trained model. To avoid the dependencies such as particular specifications of a computer and dataset size, the train/test times and memory are reported as relative values (i.e., train/test speed-up and memory reduction), where the reference is the value obtained for the case of the conventional n-gram statistics.3

B. Datasets

Four different datasets were used to obtain the empirical results reported in this paper: the Chatbot Corpus (Chatbot), the Ask Ubuntu Corpus (AskUbuntu), the Web Applications Corpus (WebApplication), and the 20 News Groups Corpus (20NewsGroups). The first three are referred to as small datasets. The Chatbot dataset comprises questions posed to a Telegram chatbot. The chatbot, in turn, replied the questions of the public transport of Munich. The AskUbuntu and WebApplication datasets are questions and answers from the StackExchange. The 20NewsGroups dataset comprises news posts labelled into several categories. All datasets have predetermined train and test splits. The first three datasets [4] are available on GitHub.4

The Chatbot dataset consists of two intents: the (Departure Time and Find Connection) with 206 questions. The corpus has a total of five different entity types (StationStart, StationDest, Criterion, Vehicle, Line), which were not used in our benchmarks, as the results were only for intent classification. The samples come in English. Despite this, the train station names are in German, which is evident from the text where the German letters appear (¨a,¨o,¨u,ß). The dataset has the following data sample distribution (train/test): Departure Time (43/35); Find Connection (57/71).

The AskUbuntu dataset comprises five intents with the following data sample distribution (train/test): Make Update (10/37); Setup Printer (10/13); Shutdown Computer (13/14); Software Recommendation (17/40); None (3/5). Thus, it includes 162 samples in total. The samples were gathered directly from the AskUbuntu platform. Only questions with the highest scores and upvotes were considered.

The WebApplication dataset comprises 89 text samples of eight intents with the following distribution (train/test): Change Password (2/6); Delete Account (7/10); Download Video (1/0); Export Data (2/3); Filter Spam (6/14); Find Alternative (7/16); Sync Accounts (3/6); None (2/4).

The 20NewsGroups dataset has been originally collected by Ken Lang. It comprises of 20 categories. Each category has exactly 18, 846 text samples. Moreover, the samples of each category are split neatly into the train (11, 314 samples) and test (7, 532 samples) sets. The dataset comes prepackaged with Python scikitlearn library.

IV. METHODS

A. Conventional n-gram statistics

An empty vector s stores n-gram statistics for an input text D. D consists of symbols from the alphabet of size a; ith position in s keeps the counter of the corresponding n-gram from the set A of all unique n-grams; corresponds to a symbol in jth position of . The dimensionality of s equals the total number of n-grams in A and calculated as . Usually, s is obtained via a single pass-through D using the overlapping sliding window of size n. The value of a position in s (i.e., counter) corresponding to an n-gram observed in the current window is incremented, i.e., s summarizes how many times each n-gram in A was observed in D.

B. Embeddings with Subword Information

Work by [36] demonstrated that words’ representations can be formed via learning character n-grams, which are then summed up to represent words. This method (FastText) has an advantage over the conventional word embeddings since unseen words could be better approximated as it is highly likely that some of their n-gram subwords have already appeared in other words. Therefore, each word w is represented as a bag of its character n-gram. Special boundary symbols “<” and “>” are added at the beginning and the end of each word. The word w itself is added to the set of its n-grams, to learn a representation for each word along with character n-grams. Taking the word have and n = 3 as an example, (have) = [< ha, hav, ave, ve >, have]. Formally, for a given word denotes the set of G n-grams appearing in w. Each n-gram g has an associated vector representation . Word w is represented as the sum of the vector representations of its n-grams. A scoring function g is defined for each word that is represented as a set of respective n-grams and the context word (denoted as c), as:

where is the vector representation of the context word c. Practically, a word is represented by its index in the dictionary and a set of its n-grams.

C. Byte Pair Encoding

The idea of Byte Pair Encoding (BPE) was introduced in [9]. BPE iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. It can be similarly used to merge characters or character sequences for words representations. A symbol vocabulary is initialized with a character vocabulary with every word represented in the form of characters, where “.” is used as the end of word symbol. All symbol pairs are counted iteratively and then replaced with a new symbol. Each operation results in a new symbol, which represents an n-gram. Similarly, frequently occurring n-grams are eventually merged into a single symbol. This makes the final vocabulary size equal to the sum of initial vocabulary and number of merge operations.

D. SubWord Semantic Hashing

Subword Semantic Hashing (SemHash) is described in details in [10], [37]. SemHash represents the input sentence in the form of subword tokens using a hashing method reducing the collision rate. These subword tokens act as features to the model and can be used as an alternative to word/n-gram embeddings. For a given input sample text T, e.g., “I have a flying disk”, we split it into a list of words . The output of the split would look as follows: [“I”, “have”, “a”, “flying”, “disk”]. Each word is then passed into a prehashing function first adds a # at the beginning and at the end of . Then it generates subwords via extracting n-grams (n=3) from , e.g., H(have) = [#ha, hav, ave, ve#]. These tri-grams are the subwords denoted as , where j is the index of a subword. is then applied to the entire text corpus to generate subwords via n-gram statistics. These subwords are used to extract features for a given text.

E. Embedding n-grams into an HD vector

Alphabet’s symbols are the most basic elements of a system. We assign each symbol ‘a’ with a random d-dimensional bipolar HD vector. These vectors are stored in a matrix (denoted as H, where ), which is referred to as the item memory, For a given symbol S its HD vector is denoted as . To manipulate HD vectors, hyperdimensional computing defines three key operations5 on them: bundling (denoted with + and implemented via position-wise addition), binding (denoted with and implemented via position-wise multiplication), and permutation6 (denoted with ). The bundling operation allows storing information in HD vectors [38]; if several copies of any HD vector are included (e.g., ), the resultant HD vector is more similar to the dominating HD vector than to other components. Since the main focus of this paper is on empirical demonstration of the usefulness of embedding n-gram statistics to HD vectors it does not go into deep analytical details of why HD vectors allow embedding the conventional n-gram statistics, the diligent readers are referred to [39], [40] for the relevant analysis.

It is worth mentioning, however, that intuitively the whole approach works because the embedding is done in such a way that in the projected space, two similar n-gram statistics (in the original space) still remain very similar.

Three operations above allow embedding various data structures into distributed representation (HD vector): sequences [2], sets [41], state automata [42], [43]. See [44] for more examples. The embedding for n-gram statistics was proposed in [3]. First, H is generated for the alphabet. A position of symbol in is represented by applying to the corresponding HD vector times, which is denoted as . Next, a single HD vector for (denoted as ) is formed via the consecutive binding of permuted HD vectors representing symbols in each position j of .

To concrete these ideas with an example consider the tri-gram ‘cba’, it will be mapped to its HD vector as follows: . In general, the process of forming HD vector of an n can be formalized as follows:

where denotes the binding operation when applied to n HD vectors. Once it is known how to get , embedding the conventional n-gram statistics stored in s (see section IV-A) is straightforward. HD vector h corresponding to s is created by bundling all n-grams observed in the data:

where denotes the bundling operation when applied to several HD vectors.

Let us consider an example of forming h for tri-gram statistics extracted from the word “hello”. The tri-gram statistics includes three tri-grams: ‘hel’, ‘ell’, and ‘llo’. So the HD vector representing the statistics is formed as:

Note that h is not bipolar due to the usage of the bundling operation. In fact, the components in h will be integers in

the range

highly unlikely since HD vectors for different n-grams are quasi orthogonal, which means that in the simplest (but not practical) case when all n-grams have the same probability the expected value of a component in h is 0. Also, the use of means that two HD vectors mapping two different n-gram statistics might have very different magnitudes if the number of observations in these statistics are very different, therefore, it is convenient to use the cosine similarity between HD vectors as it neglects the magnitude. Since there is no simple way to set a particular metric for a given machine learning algorithm (usually the Euclidean distance is used), in the experiments below we have imposed the use of the cosine similarity implicitly by normalizing each h by its norm, thus, all h had the same norm and their dot product was equivalent to their cosine similarity. It is worth noting that the normalization of an HD vector could also be used as a way to keep its values in the limited range, which might be useful for efficient implementation of the conventional machine learning algorithms [21], [45], [46], but we do not study this effect in this paper.

F. Motivation for the chosen baselines

Since the primary claim in this paper is that with HD vectors, it is possible to approximate (even accurately) the results obtained with the conventional n-gram statistics, the most proper baseline for classification performance comparison is the conventional n-gram statistics itself7. It is also worth mentioning that there are methods (see, e.g., [47]) for making efficient data structures for storing n-gram statistics. However, such approaches rely on the fact that there are clear regularities when words are used as the basic elements for n-gram statistic. This is not the case for character n-grams are used here.

In addition to the methods presented above, while designing the evaluation experiments it was considered whether word embeddings such as Word2vec [6] or GloVe [7]) should be used as baselines. It was concluded that from a computational point of view, it would be unfair neglecting the computational resources spent while training these embeddings. On top of this, the require quite some memory even to keep the learned embedding for each word in the dictionary. Therefore, trainable word embeddings are not part of the baseline as the resources needed to train them are significantly higher. One exception, however, was made for the case of FastText, which are the trainable subword embeddings. Please see the discussion on this matter at the end of Section V-A.

When it comes to other well-known methods such as bag of words and TF-IDF, it was decided that since the dimensionality of the input feature equals the number of words in the dictionary, the computational efficiency of both approaches would not be much better than that of the conventional n-gram statistics. This assumption is correct, at least for the small datasets, where the number of unique n-grams is in the order of several thousand. At the same time, it was relevant to observe whether HD vectors could be used to embed bag of words and TF-IDF features, therefore, the experiments on the small datasets were also performed with these methods.

G. Setup

All datasets were preprocessed using the spacy library. It was used to remove stop words from data. We used the spacy model called “en core web lg” to parse the datasets. Also, all text samples were preprocessed by removing control characters; in particular, the ones in the set [Cc], which includes Unicode characters from U+0000 to U+009F. It is also worth noting that the realization of the conventional n-gram statistics used in the experiments was forming a model storing only n-grams present in the train split.

Since the 20NewsGroups dataset is already large, it does not seem to be necessary to apply the SemHash to it, therefore, it was omitted in the experiments (i.e., SH in Table IV refers to pure n-grams). Last, the small datasets were augmented, making all smaller classes having the same number of samples as the largest class in the train split for that dataset. Using WordNet as a dictionary, nouns and verbs were swapped with their synonyms creating new sentences until all the classes for that set have the same number of samples.

For BPE, vocabulary size of 1000 was used for WebApplication and AskUbuntu dataset whereas a vocabulary size of 250 was used for Chatbot dataset due to its smaller size. n-gram range of () was used with analyzer as char. Crossvalidation was set to 5.

For FastText, autotune validation was used to find optimal hyperparameters for all datasets. No quantization of the model was performed to prevent the compromise on model accuracy.

When it comes to hyperparameters, in order to find optimal hyperparameters, a grid-based search was applied to three small datasets for the following classifiers: MLP, RF, and KNN. The configuration performing best among all small datasets was chosen to be used report the results in this paper. Moreover, the same configuration was used for the 20NewsGroups dataset.

In the case of MLP, four different configurations of hidden layers were considered: [(100, 50), (300, 100),(300, 200, 100), and (300, 100, 50)]; (300, 100, 50) configuration has been chosen. The maximal number of MLP iterations was set to 500. In the case of RF, two hyperparameters were optimized number of estimators ([50, 60, 70]) and minimum samples leaf ([1, 11]); we used 50 estimators and 1 leaf. In the case of KNN, the number of neighbors between 3 and 7 was considered; 3 neighbors were used in the experiments. For all the other classifiers default hyperparameter settings provided by Sklearn library were used.

The range of n in the experiments with small datasets was while for the 20NewsGroups dataset it was since the number of possible 4-grams was overwhelming. All results reported for small datasets were obtained by averaging across 50 independent simulations. In the case of the 20NewsGroups dataset, the number of simulations was decreased to 10 due to high computational costs. To have a fair comparison of computational resources, all results for small datasets were obtained on a dedicated laptop without involving GPUs while the results for the 20NewsGroups were obtained with a computing cluster (CPU only) without the intervention of other users.

V. EMPIRICAL EVALUATION

A. Results

First, we report the results of the MLP classifier on all datasets as it represents a widely used class of algorithms – neural networks. The goal of the experiments was to observe how the dimensionality of HD vectors embedding n-gram statistics affects the scores and the computational resources. Figures 1a-1d present the results for the AskUbuntu, Chatbot, WebApplication, and 20NewsGroups datasets, respectively. The dimensionality of HD vectors varied as . All figures have an identical structure. Shaded areas depict 95% confidence intervals. Left panels depict the score while right panels depict the train and test speed-ups as well as memory reduction. Note that there are different scales (y-axes) in the right panels. A solid horizontal line indicates 1 for the corresponding y-axis, i.e., the moment when both models consume the same resources.

The results in all figures are consistent in a way that up to a certain point score was increasing with the increasing dimensionality. For the small datasets even small dimensionalities of HD vectors (e.g., ) led to the scores, which are far beyond random. For example, for the AskUbuntu dataset, it was 84 % of the conventional n-gram statistics score. For the values above 512 the performance saturation begins. Moreover, the improvements beyond 2048 are marginal. The situation is more complicated for the 20NewsGroups dataset where for 32-dimensional HD vectors score is fairly low though still better than a random guess (0.05). However, it increases steeply until 1024 and achieves its maximum at 4096 being 92 % of the conventional n-gram statistics score. The dimensionalities above 4096 showed worse results.

When it comes to computational resources, there is a similar pattern for all the datasets. The train/test speed-ups and memory reduction are diminishing with the increased dimensionality of HD vectors. At the point when the dimensionality of HD vectors equals the size of the conventional n-gram statistics, both approaches consume approximately the same resources. These points in the figures are different because the datasets have different size of n-gram statistics: 3729, 2753, 2734, and 192652, for the AskUbuntu, Chatbot, WebApplication, and 20NewsGroups datasets, respectively. Also, for all datasets, the memory reduction is higher than the speed-ups. The most impressive speed-ups and reductions were observed for the 20NewsGroups dataset (e.g., 186 times less memory for 1024-dimensional HD vectors). This is due to its large size it contains a huge number of n-grams resulting in large size of the n-

Fig. 1: MLP results vs. the dimensionality of HD vectors.

gram statistics. Nevertheless, even for small datasets, the gains were noticeable. For instance, for the WebApplication dataset at score was 99 % of the conventional n-gram statistics while the train/test speed-ups and the memory reduction were 5.6, 3.4, and 7.6, respectively. Thus, these empirical results suggest that the quality of embedding w.r.t. the achievable score improves with increased dimensionality, however, after saturation or peak increasing dimensionality further does not affect or worsen classification performance and becomes

impractical when considering computational resources.

Tables I-IV8 report the results for all datasets when applying all the considered classifiers. For the sake of brevity, a fixed dimensionality of HD vectors is reported only: 512 for small datasets in Tables I-III and 2048 for the 20NewsGroups dataset in Table IV. These dimensionalities were chosen based on the

TABLE I: Performance of all classifiers for the AskUbuntu dataset.

TABLE II: Performance of all classifiers for the Chatbot dataset.

results in Figures 1a-1d as the ones allowing to achieve a good approximation of score while providing substantial speed-up/reduction. We also performed experiments when using the BPE instead of the SemHash before extracting n-gram statistics.9 Throughout the tables, the BPE demonstrated scores comparable to the SemHash while showing the train/test speed-ups and memory reduction at about 2 times. This is because the usage of the BPE resulted in smaller sizes of the n-gram statistics, which were 2176, 1467, and 1508 for the AskUbuntu, Chatbot, and WebApplication datasets. In the case of HD vectors, the picture is less coherent. For example, there is a group of classifiers (e.g., MLP, SGD, KNN) where scores are well approximated (or even improved) while achieving noticeable computational reductions. In the case of LSVC, scores are well-preserved and there is times memory reduction but test/train speed-ups are marginal (even slower for training the Chatbot). This is because LSVC implementation benefits from sparse representations (conventional n-gram statistics) while HD vectors in this study are dense. Last, for BNB and RF scores were not approximated well (cf. 0.93 vs. 0.82 for BNB in the case of the Chatbot). This is likely because both classifiers are relying on local information contained in individual features, which is not the case in HD vectors where information is represented distributively. The slow train time of RF is likely because in the absence of well-separable features it constructs large trees. Due to the difference in the implementation (the official

TABLE III: Performance of all classifiers for the WebAppli- cation dataset.

TABLE IV: Performance of all classifiers for 20NewsGroups.

implementation of FastText only uses a linear classifier), we were not able to have a proper comparison of computational resources with the FastText.10 However, we obtained the following scores with auto hyperparameter search: 0.91, 0.97, 0.76 for the AskUbuntu, Chatbot, and WebApplication datasets, respectively. These results indicate that for the considered datasets there is no drastic classification performance improvement (even worse for the WebApplication) when using the learned representations of n-grams.

Table V reports the results for the AskUbuntu dataset when applying all the considered classifiers on the features extracted with bag of words (denoted as TF) and TF-IDF.11 In these experiments as input to the classifiers we either used the features extracted by these methods or HD vectors (N = 512) embedding these features. With respect to the compromise in terms of resources the classifiers performed similarly to the previous experiment with the difference that a typical speed-up and memory reduction were about three times for HD vectors. When it comes to scores the results are consistent with the original motivation for the SemHash method, which argued that subword representations help in getting better performance compared to word-based representations at least for small datasets due to limited training data.

Finally, for the small datasets Table VI places the results reported here in the context of results obtained in [10]12. One thing to note in Table VI is the differences in the

TABLE V: Performance for AskUbuntu dataset with TF-IDF.

TABLE VI: score comparison of various platforms on three smaller datasets with results in the paper.

scores of the SemHash approach from the ones reported in [10] for all three small datasets. There were some data augmentation techniques, which were used in the paper, most prominently a QWERTY-based word augmentation accounting for the spelling mistakes. This technique was not used here, which resulted in a slight difference in the obtained scores.

VI. DISCUSSION AND CONCLUSIONS

The first observation is that the results on the 20News- Groups dataset are not the state-of-the-art, which is currently score achieved with the BERT model as reported in [48]. Nevertheless, it is important to keep in mind that the main goal of the experiments with the 20NewsGroups dataset has been to demonstrate that n-gram statistics embedded into HD vectors allows getting the tradeoff even for a large text corpus. We even observed that for large datasets the usage of HD vectors is likely to provide the best gains in terms of resource-efficiency. Moreover, the gains on the small datasets were also noteworthy (several times). Thus, based on these observations we conclude that HyperEmbed would be a very useful feature in the standard ML libraries. It is also worth noting that an additional improvement for the inference step could be achieved when using binarized classifiers (see, e.g., [49]). A more general conclusion is that it is worth revisiting results in the area of random projection [50] as they are likely to allow achieving performance/resources tradeoff in a range of NLP scenarios (see, e.g., [51] for one such example).

It was stated in Section III-A the speed-ups reported above did not include the time for forming HD vectors. The main reason for that is that our Python-based implementation of the method was quite inefficient, especially the cyclic shifts implemented with numpy.roll. At the same time, as it could be seen from the formulation of the embedding method in Section IV-E its complexity is linear and depends on n as well as on the length of the sample text, thus, fast implementation is doable. We made the proof-of-concept in Matlab, which is much faster. For example, for AskUbuntu forming 512-dimensional HD vectors of the train split (the same machine) took 7.5 % of the MLP training time, which is a positive result.

Despite the demonstrated tradeoffs between the score and the computational resources, it is extremely hard to have an objective function, which would tell us when the compromise is acceptable and when it is not. In our opinion, a general solution would be to define a utility function, which would be able to assign a certain cost to both a unit of performance (e.g., 0.01 increase in score) and a unit of computation (e.g., 10 % decrease in the inference time). The use of the utility function would allow deciding whether an alternative solution, which is, e.g., faster but less accurate, is better or not than the existing one. However, the main challenge here would be to define such a utility function since it would have to be defined for each particular application. Moreover, defining such functions even for the considered classification problems is out of the scope of this study. Nevertheless, we believe that it is the way forward to get an objective comparison criterion.

REFERENCES

[1] E. Strubell, A. Ganesh, and A. McCallum, “Energy and Policy Con- siderations for Deep Learning in NLP,” in 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019, pp. 3645–3650.

[2] P. Kanerva, “Hyperdimensional Computing: An Introduction to Com- puting in Distributed Representation with High-Dimensional Random Vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.

[3] A. Joshi, J. T. Halseth, and P. Kanerva, “Language Geometry Using Random Indexing,” in International Symposium on Quantum Interaction (QI), 2016, pp. 265–274.

[4] D. Braun, A. Hernandez-Mendez, F. Matthes, and M. Langen, “Eval- uating Natural Language Understanding Services for Conversational Question Answering Systems,” in Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2017, pp. 174–185.

[5] AI HLEG, High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI, 2019.

[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv:1301.3781, 2013.

[7] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), vol. 14, 2014, pp. 1532–1543.

[8] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of Tricks for Efficient Text Classification,” in Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017, pp. 427– 431.

[9] P. Gage, “A New Algorithm for Data Compression,” The C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.

[10] K. Shridhar, A. Dash, A. Sahu, G. G. Pihlgren, P. Alonso, V. Pondenkan- dath, G. Kovacs, F. Simistira, and M. Liwicki, “Subword Semantic Hashing for Intent Classification on Small Datasets,” in International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–6.

[11] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, “Distributed Representations,” in Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Volume 1. Foundations, 1986, pp. 77–109.

[12] T. A. Plate, Holographic Reduced Representations: Distributed Representation for Cognitive Structures. Stanford: Center for the Study of Language and Information (CSLI), 2003.

[13] D. A. Rachkovskij, “Representation and Processing of Structures with Binary Sparse Distributed Codes,” IEEE Transactions on Knowledge and Data Engineering, vol. 3, no. 2, pp. 261–276, 2001.

[14] C. Eliasmith, How to Build a Brain. Oxford University Press, 2013.

[15] E. P. Frady, D. Kleyko, and F. T. Sommer, “Variable Binding for Sparse Distributed Representations: Theory and Applications,” arXiv:2009.06734, pp. 1–16, 2020.

[16] A. Rahimi, P. Kanerva, and J. Rabaey, “A Robust and Energy Effi- cient Classifier Using Brain-Inspired Hyperdimensional Computing,” in IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 2016, pp. 64–69.

[17] F. Najafabadi, A. Rahimi, P. Kanerva, and J. Rabaey, “Hyperdimensional Computing for Text Classification,” in Design, Automation and Test in Europe Conference (DATE), 2016, pp. 1–1.

[18] T. Bandaragoda and D. De Silva and D. Kleyko and E. Osipov and U. Wiklund and D. Alahakoon, “Trajectory Clustering of Road Traffic in Urban Environments using Incremental Machine Learning in Combination with Hyperdimensional Computing,” in IEEE Intelligent Transportation Systems Conference (ITSC), 2019, pp. 1664–1670.

[19] D. Kleyko and E. Osipov and D. De Silva and U. Wiklund and V. Vyatkin and D. Alahakoon, “Distributed Representation of n-gram Statistics for Boosting Self-Organizing Maps with Hyperdimensional Computing,” in International Andrei Ershov Memorial Conference on Perspectives of System Informatics (PSI), 2019, pp. 64–79.

[20] T. Kohonen, Self-Organizing Maps. Springer Series in Information Sciences, 2001.

[21] D. Kleyko and E. Osipov and D. De Silva and U. Wiklund and D. Alahakoon, “Integer Self-Organizing Maps for Digital Hardware,” in International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.

[22] D. Kleyko, E. Osipov, and U. Wiklund, “A Hyperdimensional Com- puting Framework for Analysis of Cardiorespiratory Synchronization During Paced Deep Breathing,” IEEE Access, vol. 7, pp. 34 403–34 415, 2019.

[23] A. Rahimi, P. Kanerva, L. Benini, and J. M. Rabaey, “Efficient Biosignal Processing Using Hyperdimensional Computing: Network Templates for Combined Learning and Classification of ExG Signals,” Proceedings of the IEEE, vol. 107, no. 1, pp. 123–143, 2019.

[24] A. Burrello, K. Schindler, L. Benini, and A. Rahimi, “Hyperdimensional Computing with Local Binary Patterns: One-Shot Learning of Seizure Onset and Identification of Ictogenic Brain Regions Using Short-Time iEEG Recordings,” IEEE Transactions on Biomedical Engineering, vol. 67, no. 2, pp. 601–613, 2020.

[25] D. Kleyko, A. Rahimi, D. Rachkovskij, E. Osipov, and J. Rabaey, “Classification and Recall with Binary Hyperdimensional Computing: Tradeoffs in Choice of Density and Mapping Characteristic,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5880–5898, 2018.

[26] A. Rahimi, S. Benatti, P. Kanerva, L. Benini, and J. M. Rabaey, “Hyperdimensional biosignal processing: A case study for emg-based hand gesture recognition,” in 2016 IEEE International Conference on Rebooting Computing (ICRC), 2016, pp. 1–8.

[27] D. Kleyko, E. Osipov, N. Papakonstantinou, and V. Vyatkin, “Hy- perdimensional Computing in Industrial Systems: The Use-Case of Distributed Fault Isolation in a Power Plant,” IEEE Access, vol. 6, pp. 30 766–30 777, 2018.

[28] O. Rasanen and S. Kakouros, “Modeling Dependencies in Multiple Parallel Data Streams with Hyperdimensional Computing,” IEEE Signal Processing Letters, vol. 21, no. 7, pp. 899–903, 2014.

[29] P. Jakimovski, H. R. Schmidtke, S. Sigg, L. W. F. Chaves, and M. Beigl, “Collective Communication for Dense Sensing Environments,” Journal of Ambient Intelligence and Smart Environments, vol. 4, no. 2, pp. 123– 134, 2012.

[30] D. Kleyko, N. Lyamin, E. Osipov, and L. Riliskis, “Dependable MAC Layer Architecture based on Holographic Data Representation using

[31] H. Kim, “HDM: Hyper-Dimensional Modulation for Robust Low-Power Communications,” in IEEE International Conference on Communications (ICC), 2018, pp. 1–6.

Hyper-Dimensional Binary Spatter Codes,” in Multiple Access Communications (MACOM), 2012, pp. 134–145.

[32] A. Rosato, M. Panella, and D. Kleyko, “Hyperdimensional Computing for Efficient Distributed Classification with Randomized Neural Networks,” in International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–10.

[33] C. Diao, D. Kleyko, J. M. Rabaey, and B. A. Olshausen, “Generalized Learning Vector Quantization for Classification in Randomized Neural Networks and Hyperdimensional Computing,” in International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–9.

[34] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[35] T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, pp. 861–874, 2006.

[36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.

[37] P. Huang, X.He, J. Gao, L. Deng, A. Acero, and L. Heck, “Learning Deep Structured Semantic Models for Web Search using Clickthrough Data,” in ACM international conference on Information and Knowledge Management (CIKM), 2013, pp. 2333–2338.

[38] D. Kleyko, E. Osipov, A. Senior, A. I. Khan, and Y. A. S¸ekerciog˘glu, “Holographic Graph Neuron: A Bioinspired Architecture for Pattern Processing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 6, pp. 1250–1262, 2016.

[39] E. Frady, D. Kleyko, and F. Sommer, “A Theory of Sequence Indexing and Working Memory in Recurrent Neural Networks,” Neural Computation, vol. 30, pp. 1449–1513, 2018.

[40] D. Kleyko, A. Rosato, E. P. Frady, M. Panella, and F. T. Sommer, “Perceptron Theory for Predicting the Accuracy of Neural Networks,” arXiv:2012.07881, pp. 1–12, 2020.

[41] D. Kleyko, A. Rahimi, R. W. Gayler, and E. Osipov, “Autoscaling Bloom Filter: Controlling Trade-off Between True and False Positives,” Neural Computing and Applications, vol. 32, pp. 3675–3684, 2020.

[42] T. Yerxa, A. Anderson, and E. Weiss, “The Hyperdimensional Stack Machine,” in Cognitive Computing, 2018, pp. 1–2.

[43] E. Osipov, D. Kleyko, and A. Legalov, “Associative Synthesis of Finite State Automata Model of a Controlled Object with Hyperdimensional Computing,” in Annual Conference of the IEEE Industrial Electronics Society (IECON), 2017, pp. 3276–3281.

[44] D. Kleyko, M. Davies, E. P. Frady, P. Kanerva, S. J. Kent, B. A. Olshausen, E. Osipov, J. M. Rabaey, D. A. Rachkovskij, A. Rahimi, and F. T. Sommer, “Vector Symbolic Architectures as a Computing Framework for Nanoscale Hardware,” arXiv, pp. 1–28, 2021.

[45] D. Kleyko, E. P. Frady, M. Kheffache, and E. Osipov, “Integer Echo State Networks: Efficient Reservoir Computing for Digital Hardware,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2020.

[46] D. Kleyko, M. Kheffache, E. P. Frady, U. Wiklund, and E. Osipov, “Density Encoding Enables Resource-Efficient Randomly Connected Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–7, 2020.

[47] G. E. Pibiri and R. Venturini, “Handling Massive N-Gram Datasets Efficiently,” ACM Transactions on Information Systems, vol. 37, no. 2, pp. 25:1–25:41, 2019.

[48] A. Mahabal, J. Baldridge, B. K. Ayan, V. Perot, and D. Roth, “Text Classification with Few Examples using Controlled Generalization,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019, pp. 3158–3167.

[49] K. Shridhar, H. Jain, A. Agarwal, and D. Kleyko, “End to End Binarized Neural Networks for Text Classification,” in Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), 2020, pp. 29–34.

[50] D. Rachkovskij, “Real-Valued Embeddings and Sketches for Fast Dis- tance and Similarity Estimation,” Cybernetics and Systems Analysis, vol. 52, no. 6, pp. 967–988, 2016.

[51] D. Nunes and L. Antunes, “Neural Random Projections for Language Modelling,” arXiv:1807.00930, pp. 1–15, 2018.