My stuff
Learning Directly from Grammar Compressed Text

Neural networks using numerous text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all of the previous machine learning methods assume already decompressed sequence data as their input. In this paper, we propose a method to directly apply neural sequence models to text data compressed with grammar compression algorithms without decompression. To encode the unique symbols that appear in compression rules, we introduce composer modules to incrementally encode the symbols into vector representations. Through experiments on real datasets, we empirically showed that the proposal model can achieve both memory and computational efficiency while maintaining moderate performance.

Machine learning methods have been successfully applied to huge text data such as large text corpora (Devlin et al., 2019; Radford et al., 2019) and multiple genome sequences (Phaml et al., 2005). In practice, large text data are usually stored in a compressed format. For example, genome sequences are treated and maintained in a compression format for each individual, and news items are compressed into a certain time period. Standard machine-learning models can only take an original (i.e., non-compressed) text as an input, and thus, a naive method for these compressed text data involves a decompression step. However, decompression can be expensive because its running time depends on the size of the original uncompressed text, which can be much longer than its compressed text. Also, since large data makes machine learning timeand memory-consuming, it is undesirable to decompress and inflate size of data especially if the compression ratio

of the data is high.

In this paper, we propose an efficient machine learning method that does not involve any decompression process and directly conducts inference and learning on compressed texts. More specifically, the proposed method directly applies neural sequence models to text data that are compressed with grammar compression algorithms.

Grammar compression algorithms are commonly used to compress text data since they allow the original text to be perfectly reconstructed from the compressed text, i.e., they are lossless compression, and it is known that they can effectively compress text in general (Larsson & Moffat, 1999; Goto et al., 2015). Technically, a grammar compression algorithm encodes an original text into a compressed sequence and a set of replacement rules of context-free grammar. A compressed sequence consists of terminals, which are the alphabets appeared in the original text, and non-terminals, which are new alphabets generated by a replacement rules; see Section 3 for the precise definition.

Since replacement rules of a compressed text may contain information of sensitive sub-parts of the original uncompressed text, it is not desirable to compress texts using a shared single set of replacement rules constructed on a whole corpus in applications that require privacy such as genome classification. Grammar compression algorithms can be also applied to such applications because they can construct unique replacement rules and non-terminals for each text. Thus, in this paper, we focus on compressed texts which are independently compressed by grammar compression algorithms. However, independently compressed text has different rules and non-terminals, and thus, we cannot naively replace non-terminals by embedding vectors, making it difficult to directly apply neural sequence models to grammar compressed text. To circumvent this diffi-culty, the proposed method uses a trainable composer module to incrementally compute vector representations of non-terminals from trainable embeddings of terminals based on replacement rules, making it possible to represent a compressed text by a sequence of vectors which can be input to existing neural sequence models.

Since length of the sequence of vector is same as that of the compressed text which is shorter than the uncompressed text, it is expected that the proposed method can reduce memory and computational cost of neural sequence models. Also, the proposed method employs a clustering algorithm of non-terminals to parallelize the computation of vector representations of non-terminals to improve ef-ficiency. We empirically show efficiency of the proposed method through experiments on real datasets1.

Our contributions are summarized as follows:

We propose a method to apply neural sequence models to grammar compressed text data without decompression. Also, we propose a clustering algorithm of non-terminals to improve efficiency of the proposed method by parallel computation.

Through experiments on real datasets, we empirically showed that the proposed model can achieve both memory and computational efficiency while maintaining moderate performance.


Text mining or matching on compressed text data has been extensively studied for many years (Bille et al., 2007; Navarro, 2003; Gasieniec et al., 2005).

To accelerate basic matrix operation, compressionbased methods, which is aim to in-memory processing, have been developed (Rendle, 2013; Tabei et al., 2016; Oyamada et al., 2018). After the each column of the matrix is compressed, for matrix operation, the methods conduct access to elements of the matrix without decompression.


While there is no existing method that learns from compressed text, there exists a method that learns directly from compressed image, which represented as JPEG format (Gueguen et al., 2018; Ulicny & Dahyot, 2017). JPEG image data from the original RGB data is encoded in three steps: first, the RGB image is converted into the YCbCr color space and the chroma channels are downsampled, next, the channels are projected through the discrete cosine transform (DCT) and quantized, and finally, the quantized coefficients are losslessly compressed by using Huffman codes (e.g., RGB  →YCbCr  →Quantized DCT  →Huffman codes steps). (Gueguen et al., 2018) run only the first step of decoding and then feed the DCT coefficients directly into a neural network. (Gueguen et al., 2018) extend (Ulicny & Dahyot, 2017) to the full early JPEG stack, to which allows for much deeper networks and for training on a much larger dataset and more difficult tasks.


There are a few lines of existing research which are investigating methods to make neural sequence models memory efficient.

Following idea of trading computation time and memory consumption during back-propagation (Dauvergne & Hasco¨et, 2006; Chen et al., 2016; Vaswani et al., 2017), in (Gruslys et al., 2016), the authors proposed efficient approach which reduce memory consumption of the backpropagation when training recurrent neural networks with additional computational costs.

More recently, (Kitaev et al., 2020) proposed various techniques including approximate attention computation to reduce memory usage of Transformers from  O(n2)to O(n log n).

3.1. Notations

For a finite set C, the cardinality of C is denoted by |C|. Let  Σbe a finite set of symbols. The set of all possible sequences of  Σis denoted by  Σ∗. For a sequence S, the length of S is denoted by |S|. For i = 1, . . . , |S|, the i-th symbol of S is denoted by S[i]. The contiguous subsequence that begins at position i and ends at position j is denoted by S[i..j].

3.2. Grammar Compression

A context free grammar (CFG) is a collection of rules of the form:  A → a1a2 . . . ak. The symbols appearing on the left-hand side of the rules are called non-terminals, and the symbols appearing on the input sequence called terminals. Note that both terminals and non-terminals appear on the right-hand side of the rules.

Given the original sequence S, a grammar compression method constructs a set of non-terminals V , a compressed sequence  C ∈ (Σ ∪ V )∗, and a set of restricted CFG rules R, from which we can uniquely derive the original sequence S. We define a compression tuple of S as a tuple Scomp = (V, R, C).

In this paper, we focus on grammar compression methods whose each replacement rule in R contains only two symbols in its right-hand side unless otherwise specified. Also, without loss of generality, we assume that both right-hand symbols  a(i)1 , a(i)2 in i-th replacement rule  A(i) → a(i)1 a(i)2in R are terminal symbols or appear in right-hand side of previous rules, i.e.,  a(i)1 , a(i)2 ∈ (Σ ∪ {A(1), ..., A(i−1))}.

Below, we introduce three commonly-used grammar com-

pression algorithms.


The recursive pairing compression algorithm (Re-Pair) is a grammar compression algorithm proposed by Larsson and Moffat (1999). This algorithm repeatedly finds a most frequently appearing bi-gram and replaces it with a new non-terminal symbol until the replacement does not reduce the representation cost (i.e., the sum of the length of the compressed sequence and the sizes of the rules). Here, each replacement rule is inserted into a rule dictionary.

Example 3.1. Suppose that we have the following sequence: S = aababcababcabcd. (3.1)

Then, Re-Pair gives the following compressed sequence C and the rule dictionary R:


Re-Pair first replaces ab with a new non-terminal symbol A, as ab is the most frequently appearing pair. Then it recounts the frequency of pairs in the replaced sequence aAAcAAcAcd. It replaces Ac with B and stops this procedure when it can no longer reduce the representation cost.


The LZ78 algorithm was proposed by Lempel and Jacob Ziv (1978). This algorithm works by constructing a dictionary of parts, which we call phrases, that have appeared in the sequence. Assume we are compressing a sequence S[1..n] and have already processed  S[1..i−1]into r phrases  P0P1..Pr−1, where phrase  P0represents an empty sequence for convenience. Then, to compute  Pr, we find the longest prefix  S[i..j − 1](with  j − 1 < n) that is equal to some  Pq, with q < r. Here we define  Pr = PqT [j], which is represented as the pair (q, T [j]).

Example 3.2. Suppose that we have the string (3.1). LZ78 starts with the shortest phrase on the left that we haven’t seen before. This will always be a single symbol; in this case,  P1 = a. It now takes the next phrase we haven’t seen. We have already seen a, so it takes  P2 = ab. Continuing, finally, we get the following sequence of phrases:


and obtains the rule dictionary R and compressed sequence C as


In reality, these can be efficiently represented by2


The LZ double algorithm (LZD) (Goto et al., 2015) is an extension of LZ78. (Goto et al., 2015) simply change the definition of a phrase  Pito the pair of the longest previously occurring phrase  Pj1and the longest previously occurring phrase  Pj2that also appears at position  |P1|+· · ·+|Pi−1|+|Pj1| + 1.

Example 3.3. Suppose that we have the sequence (3.1). The LZD phrase of the sequence (3.1) is


and can be represented by


3.3. Neural Sequence Models

Various neural network architectures have been applied to sequence data, including recurrent neural network (RNN) (Elman, 1990), long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), gated recurrent unit (GRU) (Cho et al., 2014), convolutional neural network (CNN) (Le & Mikolov, 2014), and Transformers (Vaswani et al., 2017).

Mathematically, these models can be represented as follows: (y1, ..., yn) = f((x1, ..., xn)),(3.8)

where n is the length of the input sequence, and  xt ∈ Rdiand  yt ∈ Rdoare input and output vector representations of the t-th entry of the input sequence, respectively. In clas-sification and regression tasks, the final prediction for each input sequence is generally computed from output vector representations  y1, ..., yn.

While neural sequence models have been successfully applied to various tasks involving sequence data, they require at least O(n) time, where n is the length of the input sequence (Vaswani et al., 2017).

4.1. Task Definitions

We focus on sequence classification tasks in which each input text is compressed by a grammar compression algorithm3, i.e., i-th input  Scomp,iis represented by its compression tuple  Scomp,i = (Vi, Ri, Ci), where  Viis non-terminals in  Scomp,i, Riis a rule dictionary corresponding to  Scomp,i, and  Ciis a compressed sequence. Note that we use different CFGs for each sequence. We call this setting individual compression. This setting is suitable for applications mentioned in Section 1.

4.2. Encoding Non-Terminals

We can assign a trainable parameter vector  xcfor each terminal symbol  c ∈ Σbecause the terminal symbols are common for all input sequences. However, we cannot assign a trainable vector for non-terminal symbols in  Vibecause the non-terminal symbols are generated for each input sequence; therefore, they are not shared across different sequences.

Here, we propose a method to obtain vector representations of non-terminals by composing vector representations of the terminals.


For the k-th replacement rule  A(k) → a(k)1 a(k)2 in Riand vector representations of the symbols in the right-hand, xa(k)1 , xa(k)2, we compute a vector representation of the left- hand symbol  xA(k)by using composer module M as follows:


Here, M is a function that has a trainable parameter  θ. In this paper, we test two composer modules, multi-layer perceptron (MLP) composer  MMLPand dual-GRU composer MGRU4.

MLP: As a baseline, we use the multi-layer perceptron fMLP (·; θ) : R2d → Rdas a composer module:


Here,  [·; ·]denotes the vector concatenation.

Dual-GRU: Because we recursively apply composer modules to encode non-terminals, the MLP composer may cause exploding and/or vanishing gradient problems. To alleviate this problem, we use a slightly modified version of GRU cell (Cho et al., 2014) as a composer module:


where  ◦is the element-wise product, and  z1, z2, zi, and i are defined by the following formulas:

[z1, z2, zi] =softmax(reshape(Wz[x1; x2] +  bz)),r =σ(Wr[x1; x2] +  br),


where  softmax : R3×d → Rd×Rd×Rddenotes a function that applies the softmax function to each row of the input matrix and outputs each column,  reshape : R3d → R3×dis the function that converts a 1D vector into a 2D matrix of a suitable shape, and  σdenotes the element-wise sigmoid function.  Wz ∈ R3d×2d, bz ∈ R3d, Wr ∈ R2d×2d, br ∈R2d, Wi ∈ Rd×2d,and  bi ∈ Rdare trainable parameters  θof the composer.


As described in Section 3.2, right-hand symbols in each rule are either a terminal symbol or a left-hand symbol in a previous rule. Thus, given the vector representations of terminal symbols, we can compute the vector representations of all non-terminals by sequentially applying a composer module to those vector representations of the right-hand symbols of each rule.


To take advantage of the massively parallel computing capabilities of GPUs, we propose a clustering algorithm for non-terminals to compute the vectors of non-terminal symbols in parallel.

First, we introduce a definition for a directed task graph (Sarkar, 1989).

Definition 4.1. A directed task graph is denoted as G = (V, E, T , C ), where V is a set of task nodes, E is a set of edges, T is the set of node computation costs, and C is the set of edge communication costs.  ti ∈ Tis the execution time of node  vi ∈ Vand  ci,j ∈ Cis the communication cost incurred along the edge  ei,j = (ni, nj) ∈ E, which is zero if both nodes are mapped on the same processor.

Clustering is a mapping of the task nodes in the graph into clusters. A cluster is a set of tasks that will be executed on the same processor. The clustering problem is known to be NP-complete for a general task graph and for several cost functions (Papadimitriou & Yannakakis, 1990; Sarkar, 1989). Even if the cost function is the minimization of parallel time with an unbounded number of processors, this problem is NP-hard (Papadimitriou & Yannakakis, 1990; Sarkar, 1989).

To overcome this complexity, we consider a simple task graph G = (V, E, T , C ), where V indicates tasks that


compute the representations for non-terminals, E indicates the inverse dependencies corresponding to rules, all the values in T are equal, and all the values in C are zero5. Furthermore, computational cost of a composer module can be assumed to be equal with fixed size of input vectors.

For this task graph, we give the simple clustering algorithm in Algorithm 1. In this algorithm, the subprocedure PAR- ALLELNODE starts with the input task graph G and i = 0. Then it adds nodes whose indegree is zero to the list  L0as the executable task nodes in parallel. Here, for a vertex x, the number of edges whose heads end adjacent to x is called the indegree of the vertex x. It recursively calls with i + 1 and  G′ = (V, E′)where  E′is edges excluding edges whose tail nodes are in  Li. Finally, after it calculates all of the lists of executable nodes in parallel, it assigns each node in the list for each depth to a different cluster.

As for the time complexity of this algorithm, it runs in O(V )-time, which is exactly equal to the number of rules R, since line 4 in Algorithm 1 traverses all edges only once. Note that, as described in Section 3.2, since we assume each replacement rule contains only two symbols in its right-hand side, the edges in our task graph are twice the number of the nodes.

4.4. Complexity

With an input sequence of length n, a l-layer sequence model requires  O(ln(nd + d2))(Transformer) or  O(lnd2)(CNN, RNN) time to process each sequence, where d is the dimension of vector representations (Vaswani et al., 2017).6 On the other hand, with a compressed sequence of length  n′and |R| replacement rules, the proposed method requires  O(|R|d2)time to encode all non-terminals with a composer module. Since the proposed method inputs a compressed sequence of length  n′to a sequence model, its overall time complexity is either  O((|R| + ln′)d2 + ln′2d)(Transformer) or  O((|R| + ln′)d2)(CNN, RNN).

With an input sequence of length n, a l-layer sequence model requires  O(l(n2 + dn))(Transformer) or O(lnd) (CNN, RNN) memory space to store intermediate vectors for back propagation. Since the proposed method requires additional O(|R|d) memory space to store intermediate vectors of composer modules and inputs a compressed sequence of length  n′to a sequence model, the overall memory complexity during the training phase is either  O((|R|+ln′)d+ln′2)(Transformer) or  O((|R|+ln′)d)(CNN, RNN).

The above complexity analysis shows that the proposed method can effectively reduce computational complexity and memory complexity during the training phase for compressed sequences with high compression ratios (i.e., |R| + n′ < n). Also, note that the above complexity analysis do not change with the clustering algorithm described in Section 4.3.

5.1. Dataset

We applied the proposed method to two sequence classi-fication tasks: character-level text classification and DNA sequence classification. For each of the datasets used in the tasks, we compressed sequences using Re-Pair and LZD algorithms. The average sequence length of the original and compressed sequences is shown in Table 1a7. Also, Table 1b shows the average number of replacement rules of Re-Pair and LZD algorithms in each dataset.

Table 1: Compression results.


We conducted experiments on the character-level text clas-sification dataset8 in (Zhang et al., 2015). Among the eight datasets released in the paper, we used Sogou News (Sogou), Yelp Review Polarity (Yelp P.), and Yelp Review Full (Yelp F.), as these three have a longer average sequence length than the other five. The Sogou and Yelp F. datasets have five class categories and the Yelp P. dataset has two. The alphabet used in our experiments consists of the 70 characters used in (Zhang et al., 2015) and one blank space character. We did not distinguish between lower and upper case letters in our experiments.


We conducted experiments on 10 DNA sequence classifica-tion datasets9 in (Phaml et al., 2005) which are constructed using DNA sequences reported in (Pokholok et al., 2005), each of which contains DNA sequences relating to a spe-cific type of histone protein. The number of class categories of all dataset is two. Sequences in the datasets consist of four characters, A, G, C, and T, which correspond to four nucleobases of DNA. The length of each sequence is fixed to 500. CNN has recently been successfully applied to these dataset with domain-specific 3-gram features (Nguyen et al., 2016).

5.2. Method

Baseline: As a simple baseline, we applied sequence models to the original decompressed sequences. To encode the decompressed sequences, we encode each character  c ∈ Σin the sequence by its corresponding trainable parameter vector  xc ∈ Rd.

Proposed Method: We tested two types of composer modules described in  §4.2. As  fMLPin a naive MLP composer, we used a multi-layer perceptron with a single hidden layer. The dimension of the hidden layer is set to d. We used ReLU (Nair & Hinton, 2010) and sigmoid activation in the


hidden and output layers, respectively10.

Sequence Model: For both the baseline and proposed method, we used a bidirectional LSTM (Bi-LSTM) (Schuster & Paliwal, 1997) as a sequence model. The dimension of the input and hidden representations d is set to 200, and the number of Bi-LSTM layers is set to one. We max-pooled the outputs of Bi-LSTM to compute a single representation of the input sequence. We applied dropout of p = 0.5 to the max-pooled output of Bi-LSTM in the text classification datasets. Linear transformation was applied to the sequence representation to compute a score for each category label.

We trained the models using Adam optimizer (Kingma & Ba, 2015) with a learning rate of  10−3for 50 epochs and halved the learning rate every 10 (text) or 20 (DNA) epochs. Also, we linearly increased the learning rate from 0 to  10−3during the first 1000 steps. We set the mini-batch size to 126 (text) or 10 (DNA) and randomly sampled 100000 training sequences every epoch for the text classification datasets. We randomly sampled 20% of the training data as a held-out development set and report the performance of the best performing model on the development set. To decide the hyper-parameter settings described above, we empirically tuned the learning rate, dropout rate, weight decay ratio, and frequency of learning rate decay on the basis of the performance of the baseline method on a development set.

5.3. Results


Table 2 and Table 3 show the performance of the baseline (Bi-LSTM (baseline)) and the proposed method with naive MLP and Dual-GRU composer module (Bi-LSTM-mlp, Bi-LSTM-gru) in DNA and text classification datasets respectively. Reported values are averages over four experiments, in which the initial network parameters and held-out development set were randomly sampled. For all settings,


Figure 1: Memory usage and training time for each mini-batch size. Solid (rectangle plot) , dashed (triangle plot), and dotted line (circle plot) represent Bi-LSTM, Bi-LSTM-gru (Re-Pair), and Bi-LSTM-gru (LZD) respectively. Corresponding mini-batch size bs is written beside each plot. These figures are visualized using adjustText library (https://github.com/Phlya/adjustText) to adjust label placement.

Table 2: Testing accuracy (%) in DNA classification datasets (mean  ±std. dev.).  †denotes that Dual-GRU composer performs better than Naive MLP composer on the specific dataset and compression algorithm (p < 0.02). ‡denotes that decline of performance of the best performing setting of the proposed method compared to the baseline Bi-LSTM is less than 5 points (p < 0.03). (bracket) and bold denote the best performing setting of the proposed method on the specific dataset (p > 0.05 and p < 0.05 respectively).


Table 3: Testing errors (%) in text classification datasets (mean  ±std. dev.).  †denotes that Dual-GRU composer performs better than Naive MLP composer on the specific dataset and compression algorithm (p < 0.02). ‡denotes that decline of performance of the best performing setting of the proposed method compared to the baseline Bi-LSTM is less than 3 points (p < 0.02). (bracket) and bold denote the best performing setting of the proposed method on the specific dataset (p > 0.05 and p < 0.05 respectively).


we used the same set of four seed values to sample the held-out development set. Also, we used Wilcoxon rank sum test (Wilcoxon, 1945) with multiple-test adjustment using Holm’s method (Holm, 1979) to check the statistical significance of the results.

We include the reported performances of character-level CNN (Zhang et al., 2015) (Char-CNN) in the text classifi-cation datasets, and those of seq-CNN (Johnson & Zhang, 2015) in the DNA classification datasets reported in (Nguyen et al., 2016)

As shown in Table 3 and Table 2, the performance drop of the best performing setting of the proposed method compared to the baseline Bi-LSTM was significantly less than three points on two out of three text dataset; and less than five points on six out of ten DNA datasets. Dual-GRU composer performed significantly better than naive MLP composer on all combinations of the text datasets and the compression algorithms; and on 17 out of 20 combinations of the DNA dataset and the compression algorithms, which suggests the importance of designing good composer modules.


We also investigated the memory and computational ef-ficiency of the proposed method compared to the baseline Bi-LSTM on non-compressed sequences.s Since the mini-batch size bs influences the memory usage and training time, we tested three different mini-batch sizes bs = {21, 42, 126}.11 Setting of the other hyperparemters are kept same as the setting in the experiments in Section 5.3. Also, we did not include decompression time to the results of the baseline method. Experiments of text datasets were conducted with a Tesla V100-SXM2 with a 32GB memory, and experiments of DNA datasets were conducted with Tesla V100-PCIE with 16GB memory.

Figure 1 shows peak GPU memory usage and required training time during training phase of the baseline method (Bi-LSTM) and the proposed method (Bi-LSTM-gru) with two compression algorithms (Re-Pair and LZD)12. As shown in Figure 1, with a fixed size of GPU memory, compared to the baseline method, training time of the proposed method is around five times faster in Sogou, around two times faster in Yelp.P and Yelp.F, and around 3 times faster in the DNA datasets respectively. These results agrees with complexity analysis in Section 4.4 that the proposed method is more memory and computationally efficient on the datasets with higher compression ratio (e.g. Sogou) than the datasets with moderate compression ratio (e.g. Yelp.P and Yelp.F).

In this paper, we proposed a method to directly apply sequence models to grammar compressed sequence data without decompression. To encode a compressed sequence into a sequence of vector representations that sequence models can handle, the proposed method incrementally encode unique symbols in compression rules into the vector representation by means of a composer module.

We empirically showed that, by applying the proposed method to compressed sequences, we can reduce the required training time and GPU memory compared to sequence models applied to original decompressed sequences.

There are several possible directions for future work. First, we will consider applying sequence models for compressed text using another kind of compression technique such as zip encoding, which is not grammar compression. Finally, we will investigate the performanceof the proposed method with shared replacement rules across a whole dataset.

The authors are grateful to Keigo Kimura, Carolin Lawrence, Yuzuru Okajima, and Kunihiko Sadamasa for their comments and discussion, which significantly improved the correctness and the presentation of this paper.

Bille, P., Fagerberg, R., and Gørtz, I. L. Improved approximate string matching and regular expression matching on ziv-lempel compressed texts. In Combinatorial Pattern Matching, 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007, Proceedings, pp. 52– 62, 2007.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

Cho, K., van Merri¨enboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111, Doha, Qatar, October 2014. Association for Computational Linguistics.

Dauvergne, B. and Hasco¨et, L. The data-flow equations of checkpointing in reverse automatic differentiation. In International Conference on Computational Science, pp. 566–573. Springer, 2006.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

Elman, J. L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Gasieniec, L., Kolpakov, R. M., Potapov, I., and Sant, P. Real-time traversal in grammar-based compressed files. In 2005 Data Compression Conference (DCC 2005), 29-31 March 2005, Snowbird, UT, USA, pp. 458, 2005.

Goto, K., Bannai, H., Inenaga, S., and Takeda, M. LZD factorization: Simple and practical online grammar compression with variable-to-fixed encoding. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Pro- ceedings, pp. 219–230, 2015.

Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4125–4133, 2016.

Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosinski, J. Faster neural networks straight from JPEG. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr´eal, Canada, pp. 3937–3948, 2018.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pp. 65–70, 1979.

Johnson, R. and Zhang, T. Effective use of word order for text categorization with convolutional neural networks. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pp. 103– 112, 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

Larsson, N. J. and Moffat, A. Offline dictionary-based compression. In Data Compression Conference, DCC 1999, Snowbird, Utah, USA, March 29-31, 1999, pp. 296–305, 1999.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 1188–1196, 2014.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 807–814, 2010.

Navarro, G. Regular expression searching on compressed text. J. Discrete Algorithms, 1(5-6):423–443, 2003.

Nguyen, N. G., Tran, V. A., Ngo, D. L., Phan, D., Lumbanraja, F. R., Faisal, M. R., Abapihi, B., Kubo, M., and Satou, K. Dna sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering, 9(05):280, 2016.

Oyamada, M., Liu, J., Ito, S., Narita, K., Araki, T., and Kitagawa, H. Compressed vector set: A fast and space-efficient data mining framework. JIP, 26:416–426, 2018.

Papadimitriou, C. H. and Yannakakis, M. Towards an architecture-independent analysis of parallel algorithms. SIAM J. Comput., 19(2):322–328, 1990.

Phaml, T. H., Tran, D. H., Ho, T. B., Satou, K., and Valiente, G. Qualitatively predicting acetylation and methylation areas in dna sequences. Genome Informatics, 16 (2):3–11, 2005.

Pokholok, D. K., Harbison, C. T., Levine, S., Cole, M., Hannett, N. M., Lee, T. I., Bell, G. W., Walker, K., Rolfe, P. A., Herbolsheimer, E., Zeitlinger, J., Lewitter, F., Gifford, D. K., and Young, R. A. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell, 122(4):517–527, 2005.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Rendle, S. Scaling factorization machines to relational data. PVLDB, 6(5):337–348, 2013.

Sarkar, V. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, Cambridge, MA, USA, 1989. ISBN 0262691302.

Schuster, M. and Paliwal, K. K. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11): 2673–2681, 1997.

Tabei, Y., Saigo, H., Yamanishi, Y., and Puglisi, S. J. Scalable partial least squares regression on grammarcompressed data matrices. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1875–1884, 2016.

Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1556–1566, 2015.

Ulicny, M. and Dahyot, R. On using cnn with dct based image data. In In Proceedings of the 19th Irish Machine Vision and Image Processing conference (IMVIP), 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008, 2017.

Wilcoxon, F. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.

Zhang, X., Zhao, J. J., and LeCun, Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 649–657, 2015.

Ziv, J. and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Information Theory, 24(5):530–536, 1978.

Designed for Accessibility and to further Open Science