xLSTM: Extended Long Short-Term Memory

1 month ago·arXiv

Abstract

1 Introduction

The Long Short-Term Memory (LSTM) ideas (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997b,a), i.e., the constant error carousel and gating, were introduced to overcome the vanishing gradient problem of recurrent neural networks (Hochreiter, 1991; Hochreiter et al., 2000):

$$c_t = f_t\, c_{t-1} + i_t\, z_t, \qquad h_t = o_t\, \psi(c_t). \tag{1}$$

The constant error carousel is the additive update of the cell state $c_{t-1}$ (green) by cell inputs $z_t$, moderated by sigmoid gates (blue). The input gate $i_t$ and the forget gate $f_t$ control this update, while the output gate $o_t$ controls the output of the memory cell, i.e., the hidden state $h_t$. The cell state is normalized or squashed by $\psi$, and then output gating gives the hidden state.

LSTMs have been successfully applied to various domains (Hochreiter et al., 2001, 2007; Schmidhuber, 2015), and prevailed over text generation until the dawn of Transformers in 2017 (Vaswani et al., 2017). The effectiveness of LSTMs has been demonstrated at numerous sequence-related tasks such as generating text (Graves, 2013; Karpathy, 2015), generating handwritings (Graves, 2013), sequence-to-sequence translation (Sutskever et al., 2014), evaluating computer programs (Zaremba & Sutskever, 2014), generating image captions (Karpathy & Fei-Fei, 2015; Hossain et al., 2019), generating source code (Karpathy, 2015), rainfall-runoff modeling (Kratzert et al., 2018, 2019), or hydrological models for flooding warnings (Nearing et al., 2024). In reinforcement learning, LSTMs are the best performing sequence models, e.g., the AlphaStar model for StarCraft II (Vinyals et al., 2017), the OpenAI Five model for Dota 2 (Karpathy, 2019), and models of the magnetic controller for nuclear fusion (Degrave et al., 2022). LSTMs excel at learning abstractions, i.e., adeptly extracting semantic information and storing it in their memory cells (Karpathy, 2015), which for example became evident by number and syntax neurons (Lakretz et al., 2019), linguistic neurons (Bau et al., 2019), and sentiment neurons (Radford et al., 2017). LSTMs are still used in highly relevant applications (Degrave et al., 2022; Nearing et al., 2024) and have stood the test of time.

Figure 2: LSTM limitations. Left: Nearest Neighbor Search problem in terms of mean squared error (MSE). Given a reference vector, a sequence is scanned sequentially for the most similar vector with the objective to return its attached value at sequence end. LSTM struggles to revise a stored value when a more similar vector is found. Our new xLSTM overcomes this limitation by exponential gating. Right: Rare Token Prediction. The perplexity (PPL) of token prediction on Wikitext-103, in buckets of token frequency. LSTM performs worse on predicting rare tokens because of its limited storage capacities, whereas our new xLSTM solves this problem by a matrix memory.

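LSTMs have three main limitations (see Figure 2): (i) Inability to revise storage decisions. In the Nearest Neighbor Search problem, LSTM struggles to revise a stored value when a more similar vector is found. Our new xLSTM overcomes this limitation by exponential gating. (ii) Limited storage capacities. In Rare Token Prediction, LSTM performs worse on rare tokens because of its limited storage capacities. Our new xLSTM solves this problem by a matrix memory. (iii) Lack of parallelizability due to memory mixing, i.e., the hidden-hidden connections between hidden states from one time step to the next, which enforce sequential processing.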

These limitations of LSTM have paved the way for the emergence of Transformers (Vaswani et al., 2017) in language modeling. What performances can we achieve in language modeling when overcoming these limitations and scaling LSTMs to the size of current Large Language Models?

2 Extended Long Short-Term Memory

To overcome the LSTM limitations, Extended Long Short-Term Memory (xLSTM) introduces two main modifications to the LSTM idea of Equation (1). Those modifications – exponential gating and novel memory structures – enrich the LSTM family by two members: (i) the new sLSTM (see Section 2.2) with a scalar memory, a scalar update, and memory mixing, and (ii) the new mLSTM (see Section 2.3) with a matrix memory and a covariance (outer product) update rule, which is fully parallelizable. Both sLSTM and mLSTM enhance the LSTM through exponential gating. To enable parallelization, the mLSTM abandons memory mixing, i.e., the hidden-hidden recurrent connections. Both mLSTM and sLSTM can be extended to multiple memory cells, where sLSTM features memory mixing across cells. Further, the sLSTM can have multiple heads without memory mixing across the heads, but only memory mixing across cells within each head. This introduction of heads for sLSTM together with exponential gating establishes a new way of memory mixing. For mLSTM multiple heads and multiple cells are equivalent.

Integrating these new LSTM variants into residual block modules results in xLSTM blocks (see Section 2.4). Residually stacking those xLSTM blocks in architectures provides xLSTM architectures (see Section 2.4). See Figure 1 for the xLSTM architecture with its components.

2.1 Review of the Long Short-Term Memory

The original LSTM idea (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997b,a) introduced the scalar memory cell as a central processing and storage unit that avoids vanishing gradients (Hochreiter, 1991; Hochreiter et al., 2000) through the constant error carousel (the cell state update). The memory cell contains three gates: input, output, and forget gate. The forget gate has been introduced by Gers et al. (2000). The LSTM memory cell update rules at time step $t$ are:

\begin{align}
c_t &= f_t\, c_{t-1} + i_t\, z_t \tag{2} \\
h_t &= o_t\, \tilde{h}_t, \qquad \tilde{h}_t = \psi(c_t) \tag{3} \\
z_t &= \varphi(\tilde{z}_t), \qquad \tilde{z}_t = w_z^\top x_t + r_z h_{t-1} + b_z \tag{4} \\
i_t &= \sigma(\tilde{i}_t), \qquad \tilde{i}_t = w_i^\top x_t + r_i h_{t-1} + b_i \tag{5} \\
f_t &= \sigma(\tilde{f}_t), \qquad \tilde{f}_t = w_f^\top x_t + r_f h_{t-1} + b_f \tag{6} \\
o_t &= \sigma(\tilde{o}_t), \qquad \tilde{o}_t = w_o^\top x_t + r_o h_{t-1} + b_o \tag{7}
\end{align}

The weight vectors $w_z$, $w_i$, $w_f$, and $w_o$ correspond to the input weight vectors between inputs $x_t$ and cell input, input gate, forget gate, and output gate, respectively. The weights $r_z$, $r_i$, $r_f$, and $r_o$ correspond to the recurrent weights between hidden state $h_{t-1}$ and cell input, input gate, forget gate, and output gate, respectively. $b_z$, $b_i$, $b_f$, and $b_o$ are the corresponding bias terms. $\varphi$ and $\psi$ are the cell input and hidden state activation functions (typically tanh). $\psi$ is used to normalize or squash the cell state, which would be unbounded otherwise. All gate activation functions are sigmoid, i.e., $\sigma(x) = 1/(1 + \exp(-x))$. In later formulations, multiple memory cells were combined in a vector, which allows the usage of recurrent weight matrices to mix the cell outputs of memory cells (Greff et al., 2015); for more details see Appendix A.1. Ablation studies showed that all components of the memory cell are crucial (Greff et al., 2015).
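The update rules of the scalar LSTM memory cell can be sketched in a few lines of NumPy. This is a minimal illustration, assuming tanh for both the cell input and hidden state activations; the function name `lstm_step` and the parameter dictionary layout (`w_*`, `r_*`, `b_*`) are choices made for this example only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM memory cell.

    x is the input vector, h_prev and c_prev are the previous scalar hidden
    and cell states, and p maps names to input weight vectors (w_*),
    scalar recurrent weights (r_*), and scalar biases (b_*).
    """
    def pre(gate):
        # Input- and hidden-dependent pre-activation plus bias term.
        return p["w_" + gate] @ x + p["r_" + gate] * h_prev + p["b_" + gate]

    z = np.tanh(pre("z"))      # cell input
    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    c = f * c_prev + i * z     # constant error carousel (additive update)
    h = o * np.tanh(c)         # squash the cell state, then output gating
    return h, c
```

Because the input gate is bounded by one and the forget gate only scales the old state, a value written at an earlier step cannot be strongly overwritten by a later, more relevant input, which is the limitation the exponential gates of sLSTM address.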

2.2 sLSTM

To empower LSTMs with the ability to revise storage decisions, we introduce exponential gates (red) together with normalization and stabilization. In particular, input and forget gates can have exponential activation functions. For normalization, we introduce a normalizer state that sums up the product of input gate times all future forget gates.

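The sLSTM forward pass is:

\begin{align}
c_t &= f_t\, c_{t-1} + i_t\, z_t \tag{8} \\
n_t &= f_t\, n_{t-1} + i_t \tag{9} \\
h_t &= o_t\, \tilde{h}_t, \qquad \tilde{h}_t = c_t / n_t \tag{10} \\
z_t &= \varphi(\tilde{z}_t), \qquad \tilde{z}_t = w_z^\top x_t + r_z h_{t-1} + b_z \tag{11} \\
i_t &= \exp(\tilde{i}_t), \qquad \tilde{i}_t = w_i^\top x_t + r_i h_{t-1} + b_i \tag{12} \\
f_t &= \sigma(\tilde{f}_t) \ \text{OR} \ \exp(\tilde{f}_t), \qquad \tilde{f}_t = w_f^\top x_t + r_f h_{t-1} + b_f \tag{13} \\
o_t &= \sigma(\tilde{o}_t), \qquad \tilde{o}_t = w_o^\top x_t + r_o h_{t-1} + b_o \tag{14}
\end{align}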

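We broadcast the original LSTM gating techniques, i.e., input- and/or hidden-dependent gating plus bias term, to the new architectures. Exponential activation functions can lead to large values that cause overflows. Therefore, we stabilize gates with an additional state $m_t$ (Milakov & Gimelshein, 2018):

\begin{align}
m_t &= \max\left(\log(f_t) + m_{t-1},\ \log(i_t)\right) \tag{15} \\
i'_t &= \exp\left(\log(i_t) - m_t\right) = \exp\left(\tilde{i}_t - m_t\right) \tag{16} \\
f'_t &= \exp\left(\log(f_t) + m_{t-1} - m_t\right) \tag{17}
\end{align}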

We show in Appendix A.2 that replacing $f_t$ by $f'_t$ and $i_t$ by $i'_t$ in the forward pass changes neither the output of the whole network nor the derivatives of the loss with respect to the parameters.
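The effect of the stabilizer state can be checked numerically. The sketch below is a simplified illustration, assuming a single scalar cell, exponential input and forget gates, zero initial states, and cell inputs and output gates that are passed in directly instead of being computed from weights. The function names are illustrative and not taken from a reference implementation.

```python
import numpy as np

def slstm_naive(z, i_pre, f_pre, o):
    """Unstabilized sLSTM recurrence with exponential input and forget gates."""
    c, n, hs = 0.0, 0.0, []
    for z_t, it, ft, o_t in zip(z, i_pre, f_pre, o):
        i_t, f_t = np.exp(it), np.exp(ft)
        c = f_t * c + i_t * z_t    # cell state
        n = f_t * n + i_t          # normalizer state
        hs.append(o_t * c / n)
    return np.array(hs)

def slstm_stabilized(z, i_pre, f_pre, o):
    """Same recurrence, with gates rescaled by the stabilizer state m."""
    c, n, m, hs = 0.0, 0.0, 0.0, []
    for z_t, it, ft, o_t in zip(z, i_pre, f_pre, o):
        m_new = max(ft + m, it)        # log f_t equals ft for an exponential forget gate
        i_s = np.exp(it - m_new)       # stabilized input gate, at most 1
        f_s = np.exp(ft + m - m_new)   # stabilized forget gate, at most 1
        c = f_s * c + i_s * z_t
        n = f_s * n + i_s
        m = m_new
        hs.append(o_t * c / n)
    return np.array(hs)
```

Both the cell state and the normalizer state are scaled by the same factor $\exp(-m_t)$, so the ratio $c_t / n_t$, and with it the hidden state, is unchanged, while the stabilized gates never exceed one.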

New Memory Mixing. sLSTM can have multiple memory cells like the original LSTM (see Appendix A.2). Multiple memory cells enable memory mixing via recurrent connections $R_z$, $R_i$, $R_f$, $R_o$ from hidden state vector h to memory cell input z and the gates i, f, o, respectively. A new aspect in memory mixing is the effect of exponential gating. The new sLSTM can have multiple heads with memory mixing within each head but not across heads. The introduction of heads for sLSTM together with exponential gating establishes a new way of memory mixing.

2.3 mLSTM

To enhance storage capacities of LSTMs, we increase the LSTM memory cell from a scalar $c \in \mathbb{R}$ to a matrix $C \in \mathbb{R}^{d \times d}$. Hence, retrieval is performed via a matrix multiplication. At time $t$, we want to store a pair of vectors, the key $k_t \in \mathbb{R}^d$ and the value $v_t \in \mathbb{R}^d$ (we use the Transformer terminology). Later at time $t + \tau$, the value $v_t$ should be retrieved by a query vector $q_{t+\tau} \in \mathbb{R}^d$. This is the setting of Bidirectional Associative Memories (BAMs) (Kohonen, 1972; Anderson, 1972; Nakano, 1972; Anderson et al., 1977). The covariance update rule (Sejnowski, 1977; Dayan & Willshaw, 1991) for storing a key-value pair is

$$C_t = C_{t-1} + v_t\, k_t^\top. \tag{18}$$

We assume a layer-norm before projecting inputs to keys and values, therefore they have zero mean. The covariance update rule is optimal (Dayan & Willshaw, 1991) for a maximal separability of retrieved binary vectors, which is equivalent to a maximal signal/noise ratio. Higher separability is possible when limiting retrieval to pairwise interactions and conceding quadratic complexity like attention (Krotov & Hopfield, 2016, 2017; Ramsauer et al., 2021). The covariance update rule is equivalent to Fast Weight Programmers (Schmidhuber, 1992; Schlag et al., 2021), which have later been equipped with a constant decay rate multiplied to $C_{t-1}$ and a constant learning rate multiplied to $v_t k_t^\top$ (Ba et al., 2016a). In this spirit, we integrate the covariance update rule into the LSTM framework, where the forget gate corresponds to the decay rate and the input gate to the learning rate, while the output gate scales the retrieved vector.
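Storage and retrieval with the covariance update rule can be illustrated with a small NumPy sketch. It assumes orthonormal keys, in which case retrieval by matrix multiplication is exact; for non-orthogonal keys the retrieved value is contaminated by crosstalk from other stored pairs. The helper names `store` and `retrieve` are hypothetical.

```python
import numpy as np

def store(C, k, v):
    """Covariance update rule: add the outer product of value and key."""
    return C + np.outer(v, k)

def retrieve(C, q):
    """Retrieve a value by multiplying the memory matrix with a query."""
    return C @ q

if __name__ == "__main__":
    d = 4
    keys = np.eye(d)[:3]                        # three orthonormal keys
    values = np.arange(12.0).reshape(3, d)      # arbitrary values to store
    C = np.zeros((d, d))
    for k, v in zip(keys, values):
        C = store(C, k, v)
    # Querying with a stored key returns its value, since k_i . k_j = 0 for i != j.
    print(retrieve(C, keys[1]))
```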

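For this matrix memory, the normalizer state is the weighted sum of key vectors, where each key vector is weighted by the input gate and all future forget gates. Again, the normalizer state keeps record of the strength of the gates. Since the dot product between query and normalizer state can be close to zero, we use the absolute value of this dot product and lower bound it by a threshold (typically 1.0) as done previously (Sun et al., 2023). The mLSTM forward pass is:

\begin{align}
C_t &= f_t\, C_{t-1} + i_t\, v_t\, k_t^\top \tag{19} \\
n_t &= f_t\, n_{t-1} + i_t\, k_t \tag{20} \\
h_t &= o_t \odot \tilde{h}_t, \qquad \tilde{h}_t = C_t\, q_t \,/\, \max\left\{ \left| n_t^\top q_t \right|,\ 1 \right\} \tag{21} \\
q_t &= W_q\, x_t + b_q \tag{22} \\
k_t &= \frac{1}{\sqrt{d}}\, W_k\, x_t + b_k \tag{23} \\
v_t &= W_v\, x_t + b_v \tag{24} \\
i_t &= \exp(\tilde{i}_t), \qquad \tilde{i}_t = w_i^\top x_t + b_i \tag{25} \\
f_t &= \sigma(\tilde{f}_t) \ \text{OR} \ \exp(\tilde{f}_t), \qquad \tilde{f}_t = w_f^\top x_t + b_f \tag{26} \\
o_t &= \sigma(\tilde{o}_t), \qquad \tilde{o}_t = W_o\, x_t + b_o \tag{27}
\end{align}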

mLSTM can have multiple memory cells like the original LSTM. For mLSTM, multiple heads and multiple cells are equivalent as there is no memory mixing. In order to stabilize the exponential gates of mLSTM, we use the same stabilization techniques as for sLSTM, see Equation (15). Since the mLSTM has no memory mixing, this recurrence can be reformulated in a parallel version. For more details we refer to Appendix A.3.
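One recurrent mLSTM step can be sketched as follows. This is a simplified illustration, assuming a single head, a sigmoid forget gate, and an exponential input gate, with the query, key, value, and output gate vectors passed in directly instead of being projected from the input. The stabilizer state is omitted for brevity, so the sketch is only suitable for moderate gate pre-activations; the name `mlstm_step` is illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(C, n, q, k, v, i_pre, f_pre, o):
    """One unstabilized recurrent mLSTM step for a single head.

    C is the (d, d) matrix memory, n the (d,) normalizer state, q, k, v
    the query, key, and value vectors, i_pre and f_pre the scalar gate
    pre-activations, and o the (d,) output gate vector.
    """
    i = np.exp(i_pre)                          # exponential input gate
    f = sigmoid(f_pre)                         # sigmoid forget gate
    C = f * C + i * np.outer(v, k)             # gated covariance update
    n = f * n + i * k                          # gated sum of keys
    h_tilde = C @ q / max(abs(n @ q), 1.0)     # normalized retrieval, lower bound 1
    h = o * h_tilde                            # component-wise output gating
    return C, n, h
```

With a unit-norm key that is also used as the query, the normalizer cancels the input gate strength, so the stored value is retrieved unchanged even when the input gate is larger than one.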

2.4 xLSTM Architecture

Figure 3: xLSTM blocks. Left: A residual sLSTM block with post up-projection (like Transformers): The input is fed into an sLSTM – with an optional convolution – followed by a gated MLP. Right: A residual mLSTM block with pre up-projection (like State Space models): mLSTM is wrapped inside two MLPs, via a convolution, a learnable skip connection, and an output gate that acts component-wise. See Figure 9 and Figure 10 in the appendix for details.

xLSTM Blocks. An xLSTM block should non-linearly summarize the past in a high-dimensional space to better separate different histories or contexts. Separating histories is the prerequisite to correctly predict the next sequence element such as the next token. We resort to Cover’s Theorem (Cover, 1965), which states that in a higher dimensional space non-linearly embedded patterns can more likely be linearly separated than in the original space. We consider two residual block architectures: (i) A residual block with post up-projection (like Transformers), which non-linearly summarizes the past in the original space, then linearly maps into a high-dimensional space, applies a non-linear activation function, and linearly maps back to the original space; see left panel of Figure 3 and third column in Figure 1. A more detailed version is depicted in Figure 9 in the appendix. (ii) A residual block with pre up-projection (like State Space Models), which linearly maps to a high-dimensional space, non-linearly summarizes the past in the high-dimensional space and then linearly maps back to the original space. For an xLSTM block containing an sLSTM, we mostly use the post up-projection block. For an xLSTM block containing an mLSTM, we use the pre up-projection block since the memory capacity becomes larger in the high-dimensional space. See right panel of Figure 3 and third column in Figure 1, or Figure 10 in the appendix for more details.
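The difference between the two block types is the order of up-projection and sequence summarization, which the following NumPy sketch makes explicit. It is a structural illustration only: a causal running mean stands in for the sLSTM or mLSTM layer, a plain ReLU MLP replaces the gated MLP, convolutions, skip connections inside the block, and output gates are left out, and the placement of the residual connections is an assumption. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each time step over the feature axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def causal_mean(x):
    """Hypothetical stand-in for sLSTM/mLSTM: running mean over time steps."""
    steps = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / steps

def post_up_projection_block(x, W_up, W_down, summarize=causal_mean):
    """Summarize in the original space, then up-project, activate, down-project."""
    y = x + summarize(layer_norm(x))
    u = np.maximum(layer_norm(y) @ W_up, 0.0)   # ReLU used here for simplicity
    return y + u @ W_down

def pre_up_projection_block(x, W_up, W_down, summarize=causal_mean):
    """Up-project first, summarize in the high-dimensional space, down-project."""
    u = layer_norm(x) @ W_up
    return x + summarize(u) @ W_down
```

Both functions take an input of shape `(T, d)` with projection matrices of shapes `(d, D)` and `(D, d)` and return an output of shape `(T, d)`, so they can be stacked residually in any order.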

xLSTM Architecture. An xLSTM architecture is constructed by residually stacking building blocks (Srivastava et al., 2015; He et al., 2016). We rely on the most commonly used preLayerNorm (Ba et al., 2016b) residual backbones as used in contemporary Large Language Models. See last column in Figure 1.

2.5 Memory and Speed Considerations

Contrary to Transformers, xLSTM networks have a linear computation and a constant memory complexity with respect to the sequence length. Since the xLSTM memory is compressive, it is well suited for industrial applications and implementations on the edge.
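As a rough illustration of the constant memory complexity, the following sketch counts the numbers that must be kept at inference time for one mLSTM head (matrix memory, normalizer state, and stabilizer state) and for the key-value cache of one attention head. The head dimension and sequence lengths are hypothetical values chosen for the example, and the function names are illustrative.

```python
def mlstm_state_floats(d: int) -> int:
    """Floats in one mLSTM head state: C (d x d), n (d), and the scalar m."""
    return d * d + d + 1

def kv_cache_floats(seq_len: int, d: int) -> int:
    """Floats in one attention head's key-value cache after seq_len tokens."""
    return 2 * seq_len * d

if __name__ == "__main__":
    d = 64  # hypothetical head dimension
    for seq_len in (2048, 16384):
        print(seq_len, mlstm_state_floats(d), kv_cache_floats(seq_len, d))
```

The recurrent state size does not depend on the sequence length, whereas the key-value cache grows linearly with it.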

The memory of mLSTM does not require parameters but is computationally expensive through its matrix memory and update. We trade off memory capacity against computational complexity. Nevertheless, the computations can be done in parallel on GPUs, therefore these computations have only a minor effect on the wall clock time.

While mLSTM is parallelizable analogously to FlashAttention (Dao et al., 2022; Dao, 2024) or GLA (Yang et al., 2023), sLSTM is not parallelizable due to the memory mixing (hidden-hidden connections). However, we developed a fast CUDA implementation with GPU memory optimizations down to the register level, which is typically less than two times slower than mLSTM.

3 Related Work

Linear Attention. Several methods have been suggested to overcome the quadratic complexity in terms of context length of the Transformer and make attention linear in the context length. The Synthesizer learns synthetic attention weights without token-token interactions (Tay et al., 2020). Linformer realizes self-attention by a low-rank matrix and even linearly approximates it (Wang et al., 2020). Linear Transformer linearizes the attention mechanism (Katharopoulos et al., 2020). Performer linearly approximates the attention softmax by positive orthogonal random features approach (Choromanski et al., 2021). Attention has been replaced by fast long convolutions in the Structured Global Convolution (SGConv) (Li et al., 2022) and the Hyena Hierarchy (Poli et al., 2023).

State Space Models. Recently, State Space Models (SSMs) became very popular since they are linear in the context length and show promising performance compared to Transformers. One of the first proposed models was Structured State Space sequence model (S4) (Gu et al., 2021), followed by Diagonal State Space (DSS) model (Gupta et al., 2022), Gated State Space (GSS) models (Mehta et al., 2022), S5 model (Smith et al., 2022), Bidirectional Gated SSM (BiGS) (Wang et al., 2022), H3 model (Fu et al., 2023), and Mamba (Gu & Dao, 2023).

Recurrent Neural Networks. Recurrent Neural Networks (RNNs) have been suggested to replace Transformer and attention due to their linearity in the context length. RNNs with Deep Linear Recurrent Units (LRUs) showed promising results for language modeling (Orvieto et al., 2023; De et al., 2024), as did Hierarchically Gated Linear RNN (HGRN) (Qin et al., 2023) and HGRN2 (Qin et al., 2024). A well-known RNN approach to large language modeling is RWKV (Peng et al., 2023, 2024), showcasing competitive performance to Transformers.

Gating. One of the key ideas of LSTM is gating, which was rediscovered and reinterpreted in many recent approaches. Gating was used in HGRN (Qin et al., 2023), HGRN2 (Qin et al., 2024), Gated Linear Attention (GLA) (Yang et al., 2023), Gated State Space (GSS) models (Mehta et al., 2022), Bidirectional Gated SSM (BiGS) (Wang et al., 2022), Moving Average Equipped Gated Attention (MEGA) (Ma et al., 2022), RWKV (Peng et al., 2023), and Mamba (Gu & Dao, 2023).

Covariance Update Rule. To enhance storage capacities, we equipped the mLSTM cell with a matrix memory with a covariance update rule. Other methods which build on such an update mechanism are Fast Weight Programmers (Schmidhuber, 1992; Schlag et al., 2021), RWKV-5 and RWKV-6 (Peng et al., 2024), Retention (Sun et al., 2023), Linear Transformer (Katharopoulos et al., 2020), and HGRN2 (Qin et al., 2024).

Most Related. Conceptually, the closest models to xLSTM are Retention (Sun et al., 2023), RWKV (Peng et al., 2023, 2024), and HGRN2 (Qin et al., 2024). These models share the concepts of matrix memory and/or gating. However, in contrast to the new sLSTM, these approaches do not allow memory mixing. Memory mixing makes it possible to solve state tracking problems, and therefore LSTMs are more expressive than State Space Models (SSMs) and Transformers (Merrill et al., 2024; Delétang et al., 2023). State tracking is required to evaluate code or to track entities in a long narrative.

Residually Stacking Architectures. Like almost all contemporary large deep learning models, xLSTM architectures are constructed by residually stacking building blocks (Srivastava et al., 2015; He et al., 2016). This construction enabled deep convolutional networks (He et al., 2016) and Transformers (Vaswani et al., 2017). Transformers are the ultimate force behind Large Language Models (LLMs) like GPT-3 (Brown et al., 2020), ChatGPT (Schulman et al., 2022), GPT-4 (Achiam et al., 2023), Megatron-LM (Shoeybi et al., 2019), Gopher (Rae et al., 2021), ERNIE 3.0 Titan (Wang et al., 2021), GLaM (Du et al., 2021), Chinese M6 (Lin et al., 2021), multilingual AlexaTM 20B (Soltan et al., 2022), OPT (Zhang et al., 2022), Chinchilla (Hoffmann et al., 2022), BLOOM (Scao et al., 2022), GLM-130B (Zeng et al., 2022), LaMDA (Thoppilan et al., 2022), PaLM (Chowdhery et al., 2022), Llama (Touvron et al., 2023), and Gemini (Google, 2023; Reid et al., 2024).

4 Experiments

In this section, we experimentally evaluate xLSTM and compare it to existing methods with a focus on language modeling. We investigate xLSTM’s specific capabilities on synthetic tasks in Section 4.1. In Section 4.2, we compare the validation set perplexity of various current language modeling methods that were trained on 15B tokens from SlimPajama (Soboleva et al., 2023). On the same dataset, we perform ablation studies for xLSTM. Then, we assess the scaling behavior of the different methods analogous to Kaplan et al. (2020) and Brown et al. (2020). In Section 4.3, we conduct a more thorough language modeling experiment. We compare xLSTM and the best performing methods from Section 4.2 after being trained on 300B tokens from SlimPajama (Soboleva et al., 2023). First, we assess how well the methods perform in extrapolating to longer contexts, secondly we test the methods via validation perplexity and performance on downstream tasks (Sutawika et al., 2024), thirdly we evaluate the methods on 571 text domains of the PALOMA language benchmark dataset (Magnusson et al., 2023), fourthly we again assess the scaling behavior of the different methods, but now with 20 times more training data.

For all experiments, we use the notation xLSTM[a:b] for the ratio a/b of mLSTM-based versus sLSTM-based xLSTM blocks. For example, xLSTM[7:1] means that out of eight blocks, seven are mLSTM-based blocks and one is an sLSTM-based block. For a common total block number of 48, this translates to 6 sLSTM-based blocks and 42 mLSTM-based blocks. Further, for all experiments, we use pre and post up-projection blocks for mLSTM and sLSTM, respectively.
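The xLSTM[a:b] notation translates into block counts by simple proportion, as the following sketch shows. The helper name `block_counts` is hypothetical, and the function only counts blocks; it makes no assumption about where the sLSTM-based blocks are placed in the stack.

```python
def block_counts(a: int, b: int, total: int) -> tuple[int, int]:
    """Return (num_mlstm, num_slstm) for xLSTM[a:b] with `total` blocks."""
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    if total % (a + b) != 0:
        raise ValueError("total must be a multiple of a + b")
    unit = total // (a + b)
    return a * unit, b * unit

if __name__ == "__main__":
    print(block_counts(7, 1, 48))  # (42, 6)
```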

4.1 Synthetic Tasks and Long Range Arena

First, we test the effectiveness of xLSTM’s new exponential gating with memory mixing on formal languages (Delétang et al., 2023). Then, we assess the effectiveness of xLSTM’s new matrix memory on the Multi-Query Associative Recall task (Arora et al., 2023). Finally, xLSTM’s performance at processing long sequences in the Long Range Arena is evaluated (Tay et al., 2021).

Test of xLSTM’s Exponential Gating with Memory Mixing. We test xLSTM’s new exponential gating with memory mixing, which should enable it to solve state tracking problems (Merrill et al., 2024; Merrill & Sabharwal, 2023). We implement and extend the formal language tasks from Delétang et al. (2023) to enable multi-length training for length extrapolation. For a detailed description of all tasks and extended results see Appendix B.1.1. We compare xLSTM to other methods including Transformers, State Space Models, and Recurrent Neural Networks. The accuracy of the tested methods is evaluated on those tokens relevant to the task. The accuracy is scaled between 0 (random) and 1 (perfect). We compare 2-block architectures of the following methods on these tasks: xLSTM[0:1] (i.e., only sLSTM), xLSTM[1:0] (i.e., only mLSTM), xLSTM[1:1], Llama, Mamba, RWKV, Retention, Hyena, LSTM, and LSTM in Transformer blocks (LSTM (Block)). The results of this experiment are shown in Figure 4. Models such as Transformers or State Space Models without memory mixing (no state tracking) cannot solve e.g. regular grammars like the parity task.

Figure 4: Test of xLSTM’s exponential gating with memory mixing. Results are given by the scaled accuracy of different models at solving formal language tasks, of which some require state tracking. The different tasks are grouped by the Chomsky hierarchy.

This result is in agreement with findings that Transformers and State Space models are fundamentally less powerful than RNNs (Merrill et al., 2024; Merrill & Sabharwal, 2023; Delétang et al., 2023).

Test of xLSTM’s Memory Capacities on Associative Recall Tasks. In this experiment, we test xLSTM’s new matrix memory in terms of the memory capacity on the Multi-Query Associative Recall task (Arora et al., 2023): For each sequence, key-value pairs are randomly chosen from a large vocabulary, which must be memorized for later retrieval. To enhance the difficulty of the original task, we increase the number of key-value pairs up to 256 and extend the context length up to 2048. Thus, we have broader tests for the memory capacities of different models. We compare 2-block architectures of Llama, Mamba, RWKV-5, RWKV-6, xLSTM[1:1] and xLSTM[1:0]. The models are evaluated by the accuracy at recalling the pairs. Since Transformers (e.g. Llama) have a memory that is exponential in the coding dimension (Ramsauer et al., 2021), they constitute the gold standard at this task. Results are shown in Figure 5. xLSTM[1:1] performs best among all non-Transformer models, also for small models. Interestingly, the sLSTM block does not diminish the memory capacity but rather leverages it, which becomes evident at the most difficult task with 256 key-value pairs. Additional results are presented in Appendix B.1.2, where extrapolation analyses indicate that xLSTM’s enhanced memory capacities also pertain when extrapolating to contexts that are longer than those seen during training.

Figure 5: Test of memory capacities of different models at the Multi-Query Associative Recall task with context length 2048. Each panel is dedicated to a different number of key-value pairs. The x-axis displays the model size and the y-axis the validation accuracy.

Test of xLSTM’s Long Context Capabilities on Long Range Arena. To assess xLSTM’s performance on long sequences and large contexts, we compare different methods on the Long Range Arena (Tay et al., 2021). xLSTM demonstrates consistent strong performance on all of the tasks, suggesting that the xLSTM architecture is remarkably efficient in handling different aspects of long context problems. For more details, see Appendix B.1.3.

4.2 Method Comparison and Ablation Study

The main question of this paper is, what can we achieve in language modeling when scaling up the new LSTM variants. Therefore, we train xLSTMs, Transformers, State Space Models, and other methods on 15B tokens from SlimPajama in an auto-regressive language modeling setting. We compare the trained models on the validation set. Finally, we perform ablation studies for xLSTM.

Table 1: Method comparison on next token prediction when trained on 15B tokens from SlimPajama. Best validation perplexities within model classes, i.e., Transformers, LSTMs, SSMs, RNNs, and linear Transformers are underlined and overall best is in bold. For each model class, the best performing methods are later used in Section 4.3 for LLM training. xLSTMs with new memory (xLSTM[1:0] and xLSTM[7:1]) perform best.

Comparing xLSTM to Other Methods. For comparison, we train models on 15B tokens from SlimPajama (Soboleva et al., 2023). The trained models are evaluated by their perplexity on the validation set. We compare the following methods: xLSTM (our new method), GPT-3 (Transformer) (Brown et al., 2020), Llama (Transformer) (Touvron et al., 2023), H3 (SSM) (Fu et al., 2023), Mamba (SSM) (Gu & Dao, 2023), RWKV-4 (RNN) (Peng et al., 2023), RWKV-5 (RNN) (Peng et al., 2024), RWKV-6 (RNN) (Peng et al., 2024), GLA (linear Transformer) (Yang et al., 2023), HGRN (RNN) (Qin et al., 2023), HGRN2 (RNN) (Qin et al., 2024), RetNet (linear Transformer) (Sun et al., 2023), Hyena (linear Transformer) (Poli et al., 2023), xLSTM[1:0], and xLSTM[7:1] (see Section 4). The models were trained with mixed precision, except RWKV-5, RWKV-6, GLA, HGRN, HGRN2, where mixed-precision training was not supported by the reference implementation. We categorize the methods into (a) Transformers, (b) State Space Models (SSMs), and (c) Recurrent Neural Networks (RNNs) together with linear Transformers. Linear Transformers are linear methods that substitute the Transformer attention mechanism. The models match a GPT-3 model with 350M parameters in size, i.e., embedding dim 1024 and 24 residual blocks. Only GPT-3 uses shared weights for token and output embeddings, and therefore has fewer parameters. The results in Table 1 show that xLSTM outperforms all existing methods in validation perplexity. For details see Appendix B.2. Figure 6 shows the scaling behaviour for this experiment, indicating that xLSTM will also perform favorably for larger models.

Ablation Studies. Table 1 and Figure 6 demonstrate that xLSTM achieves excellent results at language modeling when being trained on 15B tokens from SlimPajama. Thus, it is only natural to ask which of the elements of xLSTM is responsible for the improvements over vanilla LSTM performances, which calls for an ablation study of the individual new xLSTM components. To do so, we morph a vanilla LSTM architecture step-by-step into an xLSTM architecture. First, we integrate LSTM layers into pre-LayerNorm residual backbones, second we extend this to a post up-projection block, then we add exponential gating, and finally the matrix memory. The results are shown in Table 2 (top). The ablation studies attribute the strong performance improvement to both the exponential gating and the matrix memory. Additionally, since gating is an ever-occurring topic in RNNs and State Space Models, we ablate different gating mechanisms. In Table 2 (bottom), we conclude that having each gate learnable and influenced by the input has an incremental positive effect. Additional studies on the individual backbone components are discussed in Appendix B.2.

Figure 6: Method comparison on next token prediction when trained on 15B tokens from SlimPajama. Performance measure in validation perplexity for the best methods of each model class (see Table 1) are reported. The performance degradation of xLSTM[7:1] at 2.7B is due to initially slower training convergence that leads to an especially undertrained model. xLSTM is the best method at all sizes.

Table 2: Ablation studies. Top: Ablation studies on the new xLSTM components, attributing the strong performance improvement of xLSTM over vanilla LSTM to both the exponential gating and the matrix memory. Bottom: Ablation studies on different gating techniques. We consider an xLSTM[1:0] with sigmoid forget gate and exponential input gate. Bias initialization ∞ means that the forget gate is set to one, [3, 6] indicates that values are taken equidistant in the respective interval, and N(0, 0.1) that values are randomly chosen from a Gaussian with mean 0 and std 0.1. PPL denotes validation perplexity. The first two lines correspond to models similar to linearized attention, line four to Retention, line five to RWKV-5, and line six to RWKV-6. Dependencies of the gates on the input lead to better performance.

4.3 xLSTM as Large Language Model

We culminate this study in large-scale language modeling experiments, testing the potential of xLSTM as an LLM. We therefore increase the amount of training data and train on 300B tokens from SlimPajama. The same number of tokens is used in, e.g., Mamba (Gu & Dao, 2023) and Griffin (De et al., 2024). We compare xLSTM, RWKV-4, Llama, and Mamba, which were selected as the best-performing methods in their respective method classes in the model comparison in Section 4.2. We train different model sizes (125M, 350M, 760M, 1.3B), test all models for length extrapolation capabilities and evaluate their performance on the validation set. We assess their performance on downstream tasks, test their performance in language modeling on 571 text domains of the PALOMA benchmark, and, finally, investigate their scaling law behavior.

Sequence Length Extrapolation. First, we test the sequence length extrapolation for 1.3B-sized, large models of xLSTM, RWKV-4, Llama, and Mamba. All models are trained on context length 2048, and then tested for context lengths up to 16384. See Figure 7 for the results. In contrast to other methods, xLSTM models maintain low perplexities for longer contexts.

Figure 7: Sequence extrapolation in language modeling. This is a comparison of 1.3B-sized, large models of xLSTM, RWKV-4, Llama, and Mamba at next token prediction on the SlimPajama validation set after training on 300B tokens from SlimPajama. Models are trained with context length 2048 and then tested for context lengths up to 16384. Left: Token perplexities evaluated at different context lengths. In contrast to other methods, xLSTM models remain at low perplexities for longer contexts. Right: Prediction quality when extrapolating to long context sizes in terms of validation perplexity (PPL). xLSTM yields the best PPL values (best in bold, second best underlined).

Validation Perplexity and Downstream Tasks. Secondly, for all model sizes, we evaluate the performance of xLSTM, RWKV-4, Llama, and Mamba models on the SlimPajama validation set for next token prediction and on downstream tasks that measure common sense reasoning. The third column of Table 3 lists the validation set perplexities of different methods. Both xLSTM[1:0] and xLSTM[7:1] are the best models for all model sizes with respect to the validation set perplexity. The other columns of Table 3 provide the performance on downstream tasks. In the vast majority of tasks and across all model sizes xLSTM is the best method — only on the ARC task Mamba is in some cases the best method. For details see Appendix B.3.

Performance on PALOMA Language Tasks. Thirdly, for all model sizes, we test the next token prediction performance of xLSTM, RWKV-4, Llama, and Mamba models on PALOMA language tasks (Magnusson et al., 2023). We measure the performance by the perplexity for next token prediction on 571 text domains, which range from nytimes.com to r/depression on Reddit. Table 4 shows token prediction perplexity grouped into language modeling (first seven columns) and fine-grained domain benchmarks (last 5 columns). xLSTM[1:0] performs better than xLSTM[7:1] on these language tasks. xLSTM[1:0] has in 568 out of 571 (99.5%) text domains a lower perplexity than Mamba, in 486 out of 571 (85.1%) a lower perplexity than Llama, in 570 out of 571 (99.8%) a lower perplexity than RWKV-4. For details see Appendix B.3.

Table 3: Validation set perplexity and downstream tasks. Comparison of xLSTM, RWKV-4, Llama, and Mamba on the validation set at next token prediction and on downstream tasks after training on 300B tokens from SlimPajama. Model sizes are 125M, 350M, 760M, and 1.3B. The first column shows the methods and the second the actual number of parameters. The third column lists the validation set perplexities, while the remaining columns show the performance on downstream tasks. Best model per model size is given in bold and the second best is underlined. In the vast majority of tasks and across all model sizes xLSTM is the best method; only on the ARC task is Mamba in some cases the best method. xLSTM[1:0] and xLSTM[7:1] are the two best models with respect to validation set perplexity.

Table 4: Performance on PALOMA Language Modeling Tasks. Comparison of xLSTM, RWKV-4, Llama, and Mamba by the perplexity of next token prediction on the PALOMA language benchmark after training on 300B tokens from SlimPajama. Model sizes are 125M, 350M, 760M, and 1.3B. The second column shows the actual number of parameters. The 571 text domains are grouped into language modeling (next seven columns) and fine-grained domain benchmarks (following five columns). The last column shows the average perplexity across all of these tasks. The best model per model size is given in bold and the second best is underlined. xLSTM yields the best performance.

Scaling Laws. Fourthly, we assess the power-law scaling behavior, which allows extrapolating the performance to larger model sizes (Kaplan et al., 2020; Brown et al., 2020). Figure 8 presents the scaling behavior. All models share a similar scaling behavior but with different offsets. RWKV-4 performs worst, followed by Llama and Mamba. xLSTM is better than Mamba, by a margin similar to the one Mamba has over Llama. The scaling behavior indicates that for larger models xLSTM will continue to perform favorably compared to Transformers and State Space Models.

Figure 8: Scaling laws. Next token prediction perplexity of xLSTM, RWKV-4, Llama, and Mamba on the SlimPajama validation set when trained on 300B tokens from SlimPajama. Model sizes are 125M, 350M, 760M, and 1.3B. Best models for each model class, see Table 1, were selected. The scaling laws indicate that for larger models xLSTM will perform well, too.
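The extrapolation idea behind such scaling plots can be sketched as a straight-line fit in log-log space. The snippet below is only an illustration under stated assumptions: it assumes a pure power law perplexity = a * N^(-b) without an irreducible offset, the helper names `fit_power_law` and `extrapolate` are hypothetical, and the perplexity values are synthetic placeholders generated from made-up constants, not measurements from this paper.

```python
import numpy as np

def fit_power_law(num_params, perplexity):
    """Fit perplexity ~ a * num_params**(-b) by least squares in log-log space."""
    log_n = np.log(np.asarray(num_params, dtype=float))
    log_p = np.log(np.asarray(perplexity, dtype=float))
    slope, intercept = np.polyfit(log_n, log_p, deg=1)
    return float(np.exp(intercept)), float(-slope)

def extrapolate(a, b, num_params):
    """Evaluate the fitted power law at a new model size."""
    return a * float(num_params) ** (-b)

# Synthetic placeholder data (NOT results from the paper): model sizes as in
# Figure 8, perplexities generated from made-up constants a=60, b=0.08.
sizes = np.array([125e6, 350e6, 760e6, 1.3e9])
true_a, true_b = 60.0, 0.08
ppl = true_a * sizes ** (-true_b)

a_hat, b_hat = fit_power_law(sizes, ppl)
ppl_7b = extrapolate(a_hat, b_hat, 7e9)  # hypothetical larger model size
```

Two model classes with similar exponents `b_hat` but different prefactors `a_hat` would appear as parallel lines with different offsets in such a plot.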

5 Limitations

(i) In contrast to mLSTM, the memory mixing of the sLSTM prohibits parallelizable operations and therefore does not allow a fast parallel implementation. Nevertheless, we developed a fast CUDA kernel for sLSTM, which is currently around 1.5 times slower than our parallel mLSTM implementation. (ii) The CUDA kernels for mLSTM are not optimized, and therefore the current implementation is about 4 times slower than FlashAttention or the scan used in Mamba. Faster CUDA kernels could be obtained in the vein of FlashAttention. (iii) The matrix memory of mLSTM has high computational complexity since matrices must be processed. Still, the memory update and retrieval do not use parameters and can be parallelized using standard matrix operations; therefore, the wall clock time overhead due to the complex memory is minor. (iv) The initialization of the forget gates must be chosen carefully. (v) Since the size of the matrix memory is independent of the sequence length, increasing the sequence length might overload the memory for longer context sizes. Still, this does not appear to be a limitation for contexts up to 16k, see Section 4.3. (vi) Due to the expensive computational load of large language experiments, we fully optimized neither the architecture nor the hyperparameters, especially for larger xLSTM architectures. We anticipate that an extensive optimization process is needed for xLSTM to reach its full potential.

6 Conclusion

We have partly answered our simple question: How far do we get in language modeling when scaling LSTM to billions of parameters? So far, we can answer: “At least as far as current technologies like Transformers or State Space Models”. We have enhanced LSTM to xLSTM by exponential gating with memory mixing and a new memory structure. xLSTM models perform favorably on language modeling when compared to state-of-the-art methods like Transformers and State Space Models. The scaling laws indicate that larger xLSTM models will be serious competitors to current Large Language Models that are built with the Transformer technology. xLSTM has the potential to considerably impact other deep learning fields like Reinforcement Learning, Time Series Prediction, or the modeling of physical systems.

Acknowledgements

We thank Sebastian Lehner, Daniel Klotz, Thomas Adler, Matthias Dellago, Gerald Gutenbrunner, Fabian Paischer, Vihang Patil, Niklas Schmidinger, Benedikt Alkin, Kajetan Schweighofer, Anna Zimmel, Lukas Aichberger, Lukas Hauzenberger, Bernhard Schäfl, Johannes Lehner for helpful discussions and feedback.

References

J. Achiam, S. Adler, S. Agarwal, et al. GPT-4 technical report. ArXiv, 2303.08774, 2023.

J. Anderson, J. Silverstein, S. Ritz, and R. Jones. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84:413–451, 1977. doi: 10.1037/0033-295X.84.5.413.

J. A. Anderson. A simple neural network generating an interactive memory. Mathematical Biosciences, 14, 1972. doi: 10.1016/0025-5564(72)90075-2.

S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, and C. Ré. Zoology: Measuring and improving recall in efficient language models. ArXiv, 2312.04927, 2023.

J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu. Using fast weights to attend to the recent past. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 4331–4339. Curran Associates, Inc., 2016a.

J. Ba, J. R. Kiros, and G. Hinton. Layer normalization. ArXiv, 1607.06450, 2016b.

A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=H1z-PsR5KX.

Y. Bisk, R. Zellers, R. LeBras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.

S. L. Blodgett, L. Green, and B. O’Connor. Demographic dialectal variation in social media: A case study of African-American English. In Conference on Empirical Methods in Natural Language Processing, pp. 1119–1130, 2016. doi: 10.18653/v1/D16-1120.

T. Brown, B. Mann, N. Ryder, et al. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020.

K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller. Rethinking attention with performers. In 9th International Conference on Learning Representations (ICLR). OpenReview.net, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH.

A. Chowdhery, S. Narang, J. Devlin, et al. PaLM: scaling language modeling with pathways. ArXiv, 2204.02311, 2022.

A. Chronopoulou, M. Peters, and J. Dodge. Efficient hierarchical domain adaptation for pretrained language models. In Conference of the North American Chapter of the Association for Computational Linguistics, pp. 1336–1351, 2022. doi: 10.18653/v1/2022.naacl-main.96.

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. ArXiv, 1803.05457, 2018.

T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. Electronic Computers, IEEE Transactions on, EC-14(3):326–334, 1965.

T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), volume 12, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec.

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (eds.), Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://openreview.net/forum?id=H4DqfPSibmx.

P. Dayan and D. J. Willshaw. Optimising synaptic learning rules in linear associative memories. Biological Cybernetics, 65, 1991. doi: 10.1007/bf00206223.

S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. DeFreitas, and C. Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. ArXiv, 2402.19427, 2024.

J. Degrave, F. Felici, J. Buchli, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602:414–419, 2022. doi: 10.1038/s41586-021-04301-9.

G. Delétang, A. Ruoss, J. Grau-Moya, T. Genewein, L. K. Wenliang, E. Catt, C. Cundy, M. Hutter, S. Legg, J. Veness, and P. A. Ortega. Neural networks and the Chomsky hierarchy. In International Conference on Learning Representations (ICLR), volume 11, 2023. URL https://openreview.net/forum?id=WbxHAzkeQcn.

N. Du, Y. Huang, A. M. Dai, et al. GLaM: efficient scaling of language models with mixture-of-experts. ArXiv, 2112.06905, 2021.

D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=COZDy0WYGg.

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800gb dataset of diverse text for language modeling. ArXiv, 2101.00027, 2021.

F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Gemini Team Google. Gemini: A family of highly capable multimodal models. ArXiv, 2312.11805, 2023.

A. Graves. Generating sequences with recurrent neural networks. ArXiv, 1308.0850, 2013.

S. Greenbaum and G. Nelson. The international corpus of English (ICE) project. World Englishes, 15(1):3–15, 1996.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. ArXiv, 1503.04069, 2015.

A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv, 2312.00752, 2023.

A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. ArXiv, 2111.00396, 2021.

A. Gupta, A. Gu, and J. Berant. Diagonal state spaces are as effective as structured state spaces. ArXiv, 2203.14343, 2022.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Technische Universität München, 1991.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997a.

S. Hochreiter and J. Schmidhuber. LSTM can solve hard long time lag problems. In M. C. Mozer, M. I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems (NeurIPS), volume 9, pp. 473–479. MIT Press, Cambridge MA, 1997b.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer (eds.), A Field Guide to Dynamical Recurrent Networks. IEEE, 2000.

S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In G. Dorffner, H. Bischof, and K. Hornik (eds.), Proc. Int. Conf. on Artificial Neural Networks (ICANN 2001), pp. 87–94. Springer, 2001.

S. Hochreiter, M. Heusel, and K. Obermayer. Fast model-based protein homology detection without alignment. Bioinformatics, 23(14):1728–1736, 2007.

J. Hoffmann, S. Borgeaud, A. Mensch, et al. Training compute-optimal large language models. ArXiv, 2203.15556, 2022.

M. D. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6):118, 2019.

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. ArXiv, 2001.08361, 2020.

A. Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2015.

A. Karpathy. OpenAI Five defeats Dota 2 world champions. https://openai.com/research/openai-five-defeats-dota-2-world-champions, 2019.

A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137, 2015.

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In E. H. Daumé III and A. Singh (eds.), International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 5156–5165. PMLR, 2020.

T. Katsch. GateLoop: Fully data-controlled linear recurrence for sequence modeling. ArXiv, 2311.01927, 2023.

D. Kocetkov, R. Li, L. Ben Allal, J. Li, C. Mou, C. Muñoz Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries. The Stack: 3 TB of permissively licensed source code. ArXiv, 2211.15533, 2022.

T. Kohonen. Correlation matrix memories. IEEE Transactions on Computers, C-21(4), 1972. doi: 10.1109/tc.1972.5008975.

F. Kratzert, D. Klotz, C. Brenner, K. Schulz, and M. Herrnegger. Rainfall-runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences, 22(11):6005–6022, 2018.

F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing. Benchmarking a catchment-aware long short-term memory network (LSTM) for large-scale hydrological modeling. ArXiv, 1907.08456, 2019.

A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.

D. Krotov and J. J. Hopfield. Dense associative memory for pattern recognition. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, pp. 1172–1180. Curran Associates, Inc., 2016.

D. Krotov and J. J. Hopfield. Dense associative memory is robust to adversarial inputs. ArXiv, 1701.00939, 2017.

Y. Lakretz, G. Kruszewski, T. Desbordes, D. Hupkes, S. Dehaene, and M. Baroni. The emergence of number and syntax units in LSTM language models. In J. Burstein, C. Doran, and T. Solorio (eds.), Conference of the North American Chapter of the Association for Computational Linguistics, pp. 11–20. Association for Computational Linguistics, 2019. doi: 10.18653/v1/N19-1002.

Y. Li, T. Cai, Y. Zhang, D. Chen, and D. Dey. What makes convolutional models great on long sequence modeling? ArXiv, 2210.09298, 2022.

P. Liang, R. Bommasani, T. Lee, et al. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525:140–146, 2023.

J. Lin, R. Men, A. Yang, C. Zhou, M. Ding, Y. Zhang, P. Wang, A. Wang, L. Jiang, X. Jia, J. Zhang, J. Zhang, X. Zou, Z. Li, X. Deng, J. Liu, J. Xue, H. Zhou, J. Ma, J. Yu, Y. Li, W. Lin, J. Zhou, J. Tang, and H. Yang. M6: A Chinese multimodal pretrainer. ArXiv, 2103.00823, 2021.

D. Linsley, J. Kim, V. Veerabadran, C. Windolf, and T. Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

X. Ma, C. Zhou, X. Kong, J. He, L. Gui, G. Neubig, J. May, and L. Zettlemoyer. Mega: Moving average equipped gated attention. ArXiv, 2209.10655, 2022.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics, volume 49, pp. 142–150, 2011.

I. Magnusson, A. Bhagia, V. Hofmann, et al. Paloma: A benchmark for evaluating language model fit. ArXiv, 2312.10523, 2023.

H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur. Long range language modeling via gated state spaces. ArXiv, 2206.13947, 2022.

S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=Byj72udxe.

W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023. doi: 10.1162/tacl_a_00562.

W. Merrill, J. Petty, and A. Sabharwal. The illusion of state in state-space models. ArXiv, 2404.08819, 2024.

M. Milakov and N. Gimelshein. Online normalizer calculation for softmax. ArXiv, 1805.02867, 2018.

K. Nakano. Associatron – a model of associative memory. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3):380–388, 1972. doi: 10.1109/TSMC.1972.4309133.

G. Nearing, D. Cohen, V. Dube, M. Gauch, O. Gilon, S. Harrigan, A. Hassidim, D. Klotz, F. Kratzert, A. Metzger, S. Nevo, F. Pappenberger, C. Prudhomme, G. Shalev, S. Shenzis, T. Y. Tekalign, D. Weitzner, and Y. M. B. Kosko. Global prediction of extreme floods in ungauged watersheds. Nature, 627:559–563, 2024. doi: 10.1038/s41586-024-07145-1.

C. Olsson, N. Elhage, N. Nanda, et al. In-context learning and induction heads. ArXiv, 2209.11895, 2022.

A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De. Resurrecting recurrent neural networks for long sequences. In Proceedings of the 40th International Conference on Machine Learning (ICML). JMLR.org, 2023. doi: 10.5555/3618408.3619518.

A. Papasavva, S. Zannettou, E. DeCristofaro, G. Stringhini, and J. Blackburn. Raiders of the lost KeK: 3.5 years of augmented 4chan posts from the politically incorrect board. In International AAAI Conference on Web and Social Media (ICWSM), volume 14, pp. 885–894, 2020.

D. Paperno, G. Kruszewski, A. Lazaridou, N.-Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, Gemma G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Annual Meeting of the Association for Computational Linguistics, volume 1, pp. 1525–1534, 2016.

G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. ArXiv, 2306.01116, 2023.

B. Peng, E. Alcaide, Q. Anthony, et al. RWKV: Reinventing RNNs for the transformer era. ArXiv, 2305.13048, 2023.

B. Peng, D. Goldstein, Q. Anthony, et al. Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. ArXiv, 2404.05892, 2024.

M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, and C. Ré. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the 40th International Conference on Machine Learning (ICML). JMLR.org, 2023. doi: 10.5555/3618408.3619572.

M. Poli, A. W. Thomas, E. Nguyen, P. Ponnusamy, B. Deiseroth, K. Kersting, T. Suzuki, B. Hie, S. Ermon, C. Ré, C. Zhang, and S. Massaroli. Mechanistic design and scaling of hybrid architectures. ArXiv, 2403.17844, 2024.

Z. Qin, S. Yang, and Y. Zhong. Hierarchically gated recurrent neural network for sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2023. URL https://openreview.net/forum?id=P1TCHxJwLB.

Z. Qin, S. Yang, W. Sun, X. Shen, D. Li, W. Sun, and Y. Zhong. HGRN2: Gated linear RNNs with state expansion. ArXiv, 2404.07904, 2024.

D. R. Radev, P. Muthukrishnan, and V. Qazvinian. The ACL anthology network corpus. In Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL), pp. 54–61. Association for Computational Linguistics, 2009.

A. Radford, R. Jozefowicz, and I. Sutskever. Learning to generate reviews and discovering sentiment. ArXiv, 1704.01444, 2017.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. https://openai.com/index/better-language-models, 2019.

J. W. Rae, S. Borgeaud, T. Cai, et al. Scaling language models: Methods, analysis & insights from training Gopher. ArXiv, 2112.11446, 2021.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, 1910.10683, 2019.

H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. Hopfield networks is all you need. In International Conference on Learning Representations (ICLR). OpenReview, 2021.

M. Reid, V. Zhong, S. Gururangan, and L. Zettlemoyer. M2D2: A massively multi-domain language modeling dataset. In Conference on Empirical Methods in Natural Language Processing, pp. 964–975, 2022.

M. Reid, N. Savinov, D. Teplyashin, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv, 2403.05530, 2024.

M. H. Ribeiro, J. Blackburn, B. Bradlyn, E. DeCristofaro, G. Stringhini, S. Long, S. Greenberg, and S. Zannettou. The evolution of the manosphere across the web. In Proceedings of the international AAAI conference on web and social media, volume 15, pp. 196–207, 2021.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

T. L. Scao, A. Fan, C. Akiki, et al. BLOOM: A 176B-parameter open-access multilingual language model. ArXiv, 2211.05100, 2022.

I. Schlag, K. Irie, and J. Schmidhuber. Linear transformers are secretly fast weight programmers. In M. Meila and T. Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning (ICML), volume 139 of Proceedings of Machine Learning Research, pp. 9355–9366. PMLR, 2021.

J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.

J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. doi: 10.1016/j.neunet.2014.09.003.

J. Schulman, B. Zoph, C. Kim, J. Hilton, et al. ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022. OpenAI Research.

T. J. Sejnowski. Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 1977. doi: 10.1007/BF00275079.

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. ArXiv, 1909.08053, 2019.

J. T. H. Smith, A. Warrington, and S. W. Linderman. Simplified state space layers for sequence modeling. ArXiv, 2208.04933, 2022.

D. Soboleva, F. Al-Khateeb, R. Myers, J. R. Steeves, J. Hestness, and N. Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.

L. Soldaini, R. Kinney, A. Bhagia, et al. Dolma: an open corpus of three trillion tokens for language model pretraining research. ArXiv, 2306.01116, 2023.

S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Sridhar, F. Triefenbach, A. Verma, G. Tur, and P. Natarajan. AlexaTM 20B: Few-shot learning using a large-scale multilingual Seq2Seq model. ArXiv, 2208.01448, 2022.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems (NeurIPS), volume 28. Curran Associates, Inc., 2015.

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. ArXiv, 2307.08621, 2023.

L. Sutawika, L. Gao, H. Schoelkopf, et al. EleutherAI/lm-evaluation-harness: Major refactor, 2023.

L. Sutawika, H. Schoelkopf, L. Gao, B. Abbasi, S. Biderman, J. Tow, B. fattori, C. Lovering, farzanehnakhaee70, J. Phang, A. Thite, Fazz, T. Wang, N. Muennighoff, Aflah, sdtblck, nopperl, gakada, tttyuntian, researcher2, Chris, J. Etxaniz, H. A. Lee, Z. Kasner, Khalid, J. Hsu, A. Kanekar, P. S. Ammanamanchi, V. Boykis, and AndyZwei. EleutherAI/lm-evaluation-harness, 2024.

I. Sutskever, O. Vinyals, and Q. V. V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27 (NIPS’13), pp. 3104–3112. Curran Associates, Inc., 2014.

Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng. Synthesizer: Rethinking self-attention in transformer models. ArXiv, 2005.00743, 2020.

Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=qVyeW-grC2k.

R. Thoppilan, D. deFreitas, J. Hall, et al. LaMDA: Language models for dialog applications. ArXiv, 2201.08239, 2022.

TogetherComputer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. ArXiv, 2302.13971, 2023.

D. Vadas and J. R. Curran. Parsing noun phrases in the Penn Treebank. Computational Linguistics, 37(4):753–809, 2011.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pp. 5998–6008. Curran Associates, Inc., 2017.

O. Vinyals, T. Ewalds, S. Bartunov, et al. Starcraft II: A new challenge for reinforcement learning. ArXiv, 1708.04782, 2017.

J. Wang, J. N. Yan, A. Gu, and A. M. Rush. Pretraining without attention. ArXiv, 2212.10544, 2022.

S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. ArXiv, 2006.04768, 2020.

S. Wang, Y. Sun, Y. Xiang, et al. ERNIE 3.0 Titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation. ArXiv, 2112.12731, 2021.

Y. Wu and K. He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19, 2018.

L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. mT5: A massively multilingual pre-trained text-to-text transformer. In Conference of the North American Chapter of the Association for Computational Linguistics, pp. 483–498, 2021. doi: 10.18653/v1/2021.naacl-main.41.

S. Yang and Y. Zhang. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, 2024. URL https://github.com/sustcsonglin/flash-linear-attention.

S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated linear attention transformers with hardware-efficient training. ArXiv, 2312.06635, 2023.

S. Zannettou, B. Bradlyn, E. DeCristofaro, H. Kwak, M. Sirivianos, G. Stringhini, and J. Blackburn. What is Gab: A bastion of free speech or an alt-right echo chamber. In The Web Conference, pp. 1007–1014, 2018. doi: 10.1145/3184558.3191531.

W. Zaremba and I. Sutskever. Learning to execute. ArXiv, 1410.4615, 2014.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, 2019.

A. Zeng, X. Liu, Z. Du, et al. GLM-130B: An open bilingual pre-trained model. ArXiv, 2210.02414, 2022.

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer. OPT: Open pre-trained transformer language models. ArXiv, 2205.01068, 2022.

Contents

A Extended Long Short-Term Memory

A.1 Vanilla Long Short-Term Memory Formulation: Vector Notation

A.2 sLSTM

A.3 mLSTM

A.4 Detailed Block Structure

B Experiments

B.1 Synthetic Tasks and Long Range Arena

B.1.1 Test of xLSTM’s Exponential Gating with Memory Mixing

B.1.2 Test of xLSTM’s Memory Capacities on Associative Recall Tasks

B.1.3 Test of xLSTM’s Long Range Capabilities on the Long Range Arena

B.2 Method Comparison and Ablation Study on SlimPajama (15B)

B.3 xLSTM Large Language Models – SlimPajama300B

C Detailed Results on PALOMA Language Model Evaluation

A Extended Long Short-Term Memory

A.1 Vanilla Long Short-Term Memory Formulation: Vector Notation

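The vanilla LSTM memory cell update rules (Greff et al., 2015) at time step $t$ extend the scalar cell state formulation to a vector of cell states:

$$
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot z_t && \text{cell state} \\
h_t &= o_t \odot \psi(c_t) && \text{hidden state} \\
z_t &= \varphi(W_z x_t + R_z h_{t-1} + b_z) && \text{cell input} \\
i_t &= \sigma(W_i x_t + R_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + R_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + R_o h_{t-1} + b_o) && \text{output gate}
\end{aligned}
$$

The matrices $W_z$, $W_i$, $W_f$, and $W_o$ correspond to the input weights between inputs and cell input, input gate, forget gate, and output gate, respectively. The matrices $R_z$, $R_i$, $R_f$, and $R_o$ correspond to the recurrent weights between hidden state and cell input, input gate, forget gate, and output gate, respectively. $b_z$, $b_i$, $b_f$, and $b_o$ are the corresponding bias vectors. $\varphi$ and $\psi$ are the cell input and hidden state activation functions (typically $\tanh$). $\psi$ is used to normalize or squash the cell state, which would be unbounded otherwise.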
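As a concrete reading of the vectorized update rules, the following NumPy sketch runs a few steps of a vanilla LSTM cell. It is a minimal illustration rather than the authors' implementation: the function name `lstm_step`, the dictionary layout of the weights, the dimensions, and the random initialization are all assumptions made for this example, and both activation functions are taken to be tanh.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One vanilla LSTM step; W, R, b are dicts keyed by 'z', 'i', 'f', 'o'."""
    # Pre-activations: input weights, recurrent weights, and bias per gate.
    pre = {g: W[g] @ x_t + R[g] @ h_prev + b[g] for g in "zifo"}
    z_t = np.tanh(pre["z"])          # cell input
    i_t = sigmoid(pre["i"])          # input gate
    f_t = sigmoid(pre["f"])          # forget gate
    o_t = sigmoid(pre["o"])          # output gate
    c_t = f_t * c_prev + i_t * z_t   # additive cell state update
    h_t = o_t * np.tanh(c_t)         # squash the cell state, then gate the output
    return h_t, c_t

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
d_in, d = 3, 4
W = {g: rng.normal(size=(d, d_in)) for g in "zifo"}
R = {g: rng.normal(size=(d, d)) for g in "zifo"}
b = {g: np.zeros(d) for g in "zifo"}

h, c = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x_t, h, c, W, R, b)
```

Because the recurrent matrices in `R` are dense here, every cell can influence every other cell through the hidden state.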

A.2 sLSTM

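Similar to the LSTM in Section A.1, the sLSTM can also be vectorized to multiple cells:

$$
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot z_t && \text{cell state} \\
n_t &= f_t \odot n_{t-1} + i_t && \text{normalizer state} \\
h_t &= o_t \odot \tilde{h}_t, \quad \tilde{h}_t = c_t \odot n_t^{-1} && \text{hidden state} \\
z_t &= \varphi(\tilde{z}_t), \quad \tilde{z}_t = W_z x_t + R_z h_{t-1} + b_z && \text{cell input} \\
i_t &= \exp(\tilde{i}_t), \quad \tilde{i}_t = W_i x_t + R_i h_{t-1} + b_i && \text{input gate} \\
f_t &= \exp(\tilde{f}_t) \ \text{or} \ \sigma(\tilde{f}_t), \quad \tilde{f}_t = W_f x_t + R_f h_{t-1} + b_f && \text{forget gate} \\
o_t &= \sigma(\tilde{o}_t), \quad \tilde{o}_t = W_o x_t + R_o h_{t-1} + b_o && \text{output gate}
\end{aligned}
$$

Here, the cell input activation function $\varphi$ is $\tanh$, and the hidden state activation function is the identity. $\varphi$ helps to stabilize the recurrence.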

Considering the external gradient contribution from subsequent layers and the recurrent gradient contribution from gradients from future states flowing over the cell interaction matrix $R$, we obtain the recursive backward pass of sLSTM, where $\delta_a$ indicates gradients with respect to a parameter or internal variable $a$:

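with the derivatives of the respective gate activation functions $i'(x) = \exp(x)$, $o'(x) = \sigma'(x)$, and $f'(x) = \sigma'(x)$ or $f'(x) = \exp(x)$, depending on the forget gate activation. $z'(x) = \varphi'(x)$ is the derivative of the cell input activation function $\varphi(x)$.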

The matrices $R_z$, $R_i$, $R_f$, and $R_o$ are block-diagonal, which is analogous to multiple heads in the mLSTM. This way, the parameters of each matrix reduce to $d^2 / N_h$, where $N_h$ is the number of heads, limiting the cell interactions to individual heads. This parameter-efficient formulation of cell interactions together with the exponential gating is called the new memory mixing. Finally, to stabilize the backward pass, we clip the magnitude of the gradient with respect to the hidden state, as a means to prohibit exploding gradients for long context lengths.
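The effect of the block-diagonal structure can be illustrated with a few lines of NumPy. This is a sketch under assumptions: the helper name `block_diagonal_recurrent`, the sizes, and the random entries are made up for the example. It counts the free parameters of one recurrent matrix and checks that cells in one head cannot influence cells in another head through that matrix.

```python
import numpy as np

def block_diagonal_recurrent(d, num_heads, rng):
    """Build a d x d recurrent matrix that is block-diagonal with num_heads blocks."""
    assert d % num_heads == 0, "d must be divisible by the number of heads"
    d_head = d // num_heads
    R = np.zeros((d, d))
    for head in range(num_heads):
        s = slice(head * d_head, (head + 1) * d_head)
        R[s, s] = rng.normal(size=(d_head, d_head))
    return R

rng = np.random.default_rng(0)
d, num_heads = 8, 4
R = block_diagonal_recurrent(d, num_heads, rng)

# num_heads blocks of size (d / num_heads)^2 each, i.e. d^2 / num_heads in total.
num_free_params = num_heads * (d // num_heads) ** 2

# A hidden state that is active only outside head 0 ...
h_prev = np.zeros(d)
h_prev[d // num_heads:] = 1.0
# ... contributes nothing to the recurrent input of the cells in head 0.
recurrent_input_head0 = (R @ h_prev)[: d // num_heads]
```

With `d = 8` and four heads, a dense matrix would have 64 entries, while the block-diagonal one has 16 free parameters.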

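Proof of Equivalence for sLSTM Stabilized Version. The stabilization state $m$, see Equation (15) in the main paper, has no gradient, and hence does not influence the other gradients. We go back to the scalar version (Equation 8) here for simplicity. We re-define $c_t^{(s)}$ and $n_t^{(s)}$ as stabilized cell and normalizer states:

$$
c_t = c_t^{(s)} \exp(m_t), \qquad n_t = n_t^{(s)} \exp(m_t).
$$

Inserting Equation 15 into Equation 8 yields:

$$
\begin{aligned}
\tilde{h}_t^{(s)} &= \frac{c_t^{(s)}}{n_t^{(s)}} = \frac{c_t \exp(-m_t)}{n_t \exp(-m_t)} = \frac{c_t}{n_t} = \tilde{h}_t, \\
c_t^{(s)} &= \exp\!\left(\log(f_t) + m_{t-1} - m_t\right) c_{t-1}^{(s)} + \exp\!\left(\log(i_t) - m_t\right) z_t, \\
n_t^{(s)} &= \exp\!\left(\log(f_t) + m_{t-1} - m_t\right) n_{t-1}^{(s)} + \exp\!\left(\log(i_t) - m_t\right).
\end{aligned}
$$

Therefore, since the loss solely depends on $h_t$, there is no dependency on $m_t$, and consequently, no gradient exists for this stabilization state. Note that $m_t$ can be chosen arbitrarily. We choose $m_t = \max(\log(f_t) + m_{t-1}, \log(i_t))$, which stabilizes the exponential function. One can even find $m_t$ such that the normalizer state $n_t$ can be eliminated, but this version was experimentally found to be numerically unstable in the backward pass.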
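The equivalence can also be checked numerically. The sketch below is an illustration under assumptions, not the reference implementation: it uses the scalar sLSTM with exponential input and forget gates, starts from zero states with $m_0 = 0$, and the function name `slstm_scalar` and the random test inputs are made up for this example. It compares the ratio $c_t / n_t$ with and without the stabilization state.

```python
import numpy as np

def slstm_scalar(i_pre, f_pre, z, stabilized):
    """Scalar sLSTM with exponential input and forget gates.

    Returns the normalized state c_t / n_t for every time step.
    """
    c, n, m = 0.0, 0.0, 0.0
    ratios = []
    for i_t, f_t, z_t in zip(i_pre, f_pre, z):
        if stabilized:
            # For an exponential forget gate, log(f_t) is the pre-activation itself.
            m_new = max(f_t + m, i_t)
            i_gate = np.exp(i_t - m_new)
            f_gate = np.exp(f_t + m - m_new)
            m = m_new
        else:
            i_gate = np.exp(i_t)
            f_gate = np.exp(f_t)
        c = f_gate * c + i_gate * z_t
        n = f_gate * n + i_gate
        ratios.append(c / n)
    return np.array(ratios)

rng = np.random.default_rng(0)
T = 20
i_pre, f_pre = rng.normal(size=T), rng.normal(size=T)
z = np.tanh(rng.normal(size=T))

# Moderate pre-activations: both versions give the same ratio.
plain = slstm_scalar(i_pre, f_pre, z, stabilized=False)
stable = slstm_scalar(i_pre, f_pre, z, stabilized=True)

# Large input gate pre-activations: exp overflows without stabilization.
with np.errstate(over="ignore", invalid="ignore"):
    plain_big = slstm_scalar(i_pre + 800.0, f_pre, z, stabilized=False)
stable_big = slstm_scalar(i_pre + 800.0, f_pre, z, stabilized=True)
```

Shifting all input gate pre-activations by a constant scales $c_t$ and $n_t$ by the same factor, so `stable_big` should agree with `plain`, while `plain_big` is no longer finite.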

A.3 mLSTM

Throughout this section, $\mathbf{1} \in \mathbb{R}^{T}$ denotes a column vector of ones and $\mathbf{1}^\top \in \mathbb{R}^{1 \times T}$ a row vector of ones, where $T$ is the dimension of this vector.

Recurrent mLSTM Backward Pass. The recurrent formulation of the mLSTM cell in Equation 19 yields the following backward pass recurrence, where $\delta_a$ indicates gradients with respect to parameter or internal variable $a$, with a separate term for the gradients from subsequent layers:

with $\Theta$ being the Heaviside step function, and $f'(z)$ equal to $\sigma'(z)$ or $\exp(z)$, depending on the forget gate activation.

Parallel mLSTM Forward Pass. The mLSTM recurrence in Equations (19-27) can be reformulated in a parallel form, which is used to speed up training. After training we can still use the recurrent formulation for fast text generation.

Instead of processing each input $x_t \in \mathbb{R}^{d}$ at time step $t$ sequentially, the parallel version processes all timesteps of a full sequence $X \in \mathbb{R}^{T \times d}$ at once, where $T$ is the sequence length and $d$ is the head dimension. We present the forward pass of the mLSTM for a single head and drop the head dimension for simplicity.

Let $\tilde{f} \in \mathbb{R}^{T}$ be the forget gate pre-activations and $\tilde{i} \in \mathbb{R}^{T}$ be the input gate pre-activations for a full sequence. We construct the forget gate activation matrix $F \in \mathbb{R}^{T \times T}$

and the input gate pre-activation matrix $\tilde{I} \in \mathbb{R}^{T \times T}$

By applying the elementwise exponential input gate activation function naively, we obtain the unstabilized gate activation matrix $D \in \mathbb{R}^{T \times T}$

In order to avoid overflow due to the exponential function, we apply the same stabilization as in the recurrent sLSTM, see Equation 15. In the parallel formulation of the mLSTM we get a numerically stable gate activation matrix $D' \in \mathbb{R}^{T \times T}$ by taking the logarithm of $D$ element-wise, subtracting the row-wise maximum of $\log D$ from each element, and exponentiating the result:

Given the queries, keys and values $Q, K, V \in \mathbb{R}^{T \times d}$ for a full sequence, we can compute all hidden pre-activation states $\tilde{H} \in \mathbb{R}^{T \times d}$ in parallel for the unstabilized version by

Note that we extract the factor $\frac{1}{\sqrt{d}}$ for $K$ explicitly here and further on. For the stabilized version this yields

where for both versions the hidden pre-activation states $\tilde{H}$ are identical. With the output gate pre-activations $\tilde{O} \in \mathbb{R}^{T \times d}$ we can compute the hidden states $H \in \mathbb{R}^{T \times d}$ for all timesteps by applying the output gate in parallel for each timestep element-wise:

This gives the parallel forward pass of the mLSTM for a full input sequence $X \in \mathbb{R}^{T \times d}$.
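As an illustration of the forward pass described above, the following sketch computes the hidden pre-activation states for a single head in two ways: with the step-by-step recurrence and with the stabilized gate matrix built from cumulative log forget gates. It is a reconstruction under stated assumptions (sigmoid forget gate, exponential input gate, zero initial states, normalizer lower bound of 1 in the recurrent form and $\exp(-m)$ in the stabilized form); it uses plain Python lists instead of batched tensor operations, and all function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def mlstm_recurrent(q, k, v, i_pre, f_pre):
    """Reference: step-by-step matrix-memory recurrence for one head."""
    d = len(q[0])
    C = [[0.0] * d for _ in range(d)]  # matrix memory
    n = [0.0] * d                      # normalizer state
    out = []
    for t in range(len(q)):
        f_t = math.exp(log_sigmoid(f_pre[t]))   # sigmoid forget gate
        i_t = math.exp(i_pre[t])                # exponential input gate
        k_t = [x / math.sqrt(d) for x in k[t]]  # keys scaled by 1/sqrt(d)
        for a in range(d):
            n[a] = f_t * n[a] + i_t * k_t[a]
            for b in range(d):
                C[a][b] = f_t * C[a][b] + i_t * v[t][a] * k_t[b]
        denom = max(abs(sum(n[a] * q[t][a] for a in range(d))), 1.0)
        out.append([sum(C[a][b] * q[t][b] for b in range(d)) / denom for a in range(d)])
    return out


def mlstm_parallel_stabilized(q, k, v, i_pre, f_pre):
    """Gate-matrix form: row i only uses columns j <= i (causal structure)."""
    T, d = len(q), len(q[0])
    cum = [0.0]  # cum[t] = sum of the log forget gates of the first t steps
    for t in range(T):
        cum.append(cum[-1] + log_sigmoid(f_pre[t]))
    out = []
    for i in range(T):
        # log D_ij = sum_{s=j+1..i} log f_s + i_pre_j for j <= i
        log_d = [cum[i + 1] - cum[j + 1] + i_pre[j] for j in range(i + 1)]
        m = max(log_d)  # row-wise maximum used for stabilization
        c = [
            sum(q[i][a] * k[j][a] for a in range(d)) / math.sqrt(d) * math.exp(log_d[j] - m)
            for j in range(i + 1)
        ]
        denom = max(abs(sum(c)), math.exp(-m))
        out.append([sum(c[j] * v[j][a] for j in range(i + 1)) / denom for a in range(d)])
    return out
```

For moderate gate pre-activations both functions agree up to floating-point error. For large input gate pre-activations the recurrent reference overflows in `math.exp`, whereas the stabilized form only ever exponentiates non-positive numbers in the gate matrix.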

Parallel mLSTM Backward Pass. We present the backward pass of the mLSTM for the stabilized version only. For completeness we summarize the forward pass in the stabilized version before we present the backward pass.

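Given the forget gate matrix $F \in \mathbb{R}^{T \times T}$, the logarithm of the forget gate matrix $\overline{F} = \log F \in \mathbb{R}^{T \times T}$ and the input gate matrix $\tilde{I} \in \mathbb{R}^{T \times T}$ as introduced above, together with the queries, keys and values $Q, K, V \in \mathbb{R}^{T \times d}$, we can write the forward pass of the mLSTM in the stabilized version as: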

With this forward pass we can compute the gradients for all intermediate and input variables to the mLSTM forward pass in the backward pass. We denote the gradient with respect to variable $a$ as $\delta_a$.

Given the output gradient we can compute the backward pass for the intermediate gradients as:

We do not compute the gradients for m as they cancel out (see the proof in the recurrent sLSTM).

With these intermediate gradients, the gradients $\delta_{\overline{F}}$ for the logarithmic forget gate matrix, $\delta_{\tilde{I}}$ for the input gate matrix, and $\delta_Q$, $\delta_K$, $\delta_V$ for the queries, keys and values are given by

Having computed the gradients $\delta_{\overline{F}}$ for the logarithmic forget gate matrix, we can compute the gradients for the forget gate pre-activations $\tilde{f}$.

Recall that the logarithmic forget gate matrix $\overline{F} = \log F$ is computed by

With the substitution we compute the gradients for the logarithmic forget gate activations

where the last equation makes use of the following:

Finally, we compute the input gate pre-activations' gradients $\delta_{\tilde{i}}$ as the column-wise sum over the rows of the input gate matrix gradient $\delta_{\tilde{I}}$

This completes the backward pass of the parallel mLSTM for a full input sequence $X \in \mathbb{R}^{T \times d}$.

A.4 Detailed Block Structure

Figure 9: Schematic representation of an sLSTM Block – post up-projection: Embedded in a pre-LayerNorm residual structure, the input is optionally passed through a causal convolution of window size 4 that includes a Swish activation for input and forget gates. Then, for all input, forget and output gates i, f, o, and the cell update z the input is fed through a block-diagonal linear layer with four diagonal blocks or “heads”. These diagonal blocks coincide with the recurrent gate pre-activations from the last hidden state, which corresponds to an sLSTM with four heads depicted with the circular arrows. The resulting hidden state goes through a GroupNorm layer (Wu & He, 2018) – a head-wise LayerNorm for each of the four heads. Finally, the output is up- and down-projected using a gated MLP, with GeLU activation function and projection factor 4/3 to match parameters.

Figure 10: Schematic representation of an mLSTM block – pre up-projection: Embedded in a pre-LayerNorm residual structure, the input is up-projected first with projection factor 2, once for an externalized output gate and once as input for the mLSTM cells. The mLSTM cell input is dimension-wise causally convolved (kernel size 4), before entering a learnable skip connection. We obtain input q and k via block-diagonal projection matrices of block size 4. The values v are fed directly, skipping the convolution part. After the mLSTM sequence mixing, outputs are normalized via GroupNorm (Wu & He, 2018) – a head-wise layer norm for each of the four heads. Finally, the learnable skip input is added and the result is gated component-wise with the external output gate. The output is down-projected.

B Experiments

Training Setup. For all experiments, we use Python 3.11 with PyTorch 2.2.0 and CUDA 12.1 on NVIDIA A100 GPUs.

Nearest Neighbor Search Task. For this auxiliary task, we use randomly sampled feature vectors of dimension 2 and unit norm. The attached value is a uniformly distributed random number from [0, 1], leading to input vectors of dimension 3. The first feature vector serves as search key, with the first value being ignored. Then the model has to predict the value of the nearest neighbor so far in the sequence. We train on 8192 sequences of context length up to 64 (uniformly sampled) and validate on 8192 different samples. All models have two blocks and embedding dimension 128. We use a dropout of 0.1, 10% linear warm-up steps and cosine decay to 1e-7 for 100k total training steps. We sweep over learning rates 1e-4, 1e-3, 1e-2, 1e-1 and 5 seeds each. The reported values in Figure 2 are mean values for the best learning rate and 99% confidence intervals. Note that LSTM requires very high learning rates, whereas Transformers (Llama) perform best at the smallest learning rate. The xLSTM[0:1] reaches similar performance across all learning rates.
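A single sample of this task could be generated along the following lines. This is a minimal sketch, not the generator used for the experiments: it assumes Euclidean distance as the similarity measure, samples unit-norm features by drawing an angle uniformly, and marks the first position (the search key) as having no target. The function name is hypothetical.

```python
import math
import random


def make_nns_sample(length, rng):
    """Return (inputs, targets) for one nearest-neighbor-search sequence."""
    feats, vals = [], []
    for _ in range(length):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        feats.append((math.cos(angle), math.sin(angle)))  # unit-norm 2D feature
        vals.append(rng.random())                         # attached value in [0, 1)
    inputs = [(fx, fy, val) for (fx, fy), val in zip(feats, vals)]
    key = feats[0]             # first feature vector is the search key
    targets = [None]           # the value attached to the key is ignored
    best_dist, best_val = float("inf"), None
    for t in range(1, length):
        dist = math.dist(key, feats[t])
        if dist < best_dist:   # new nearest neighbor seen so far
            best_dist, best_val = dist, vals[t]
        targets.append(best_val)
    return inputs, targets


inputs, targets = make_nns_sample(8, random.Random(0))
print(targets)
```

The target sequence is piecewise constant: it only changes when a feature vector closer to the key than all previous ones appears.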

Wikitext-103 Rare Token Prediction. For this exemplary experiment on rare token prediction, we trained 125M-sized models on Wikitext-103 (Merity et al., 2017). All models have an embedding dimension of 768 in a post up-projection structure of 12 residual blocks. The Transformer model (Llama) uses Multi-Head Attention; in the model called LSTM, the Multi-Head Attention is replaced by an LSTM; and the xLSTM[1:0] contains mLSTM layers with matrix memory. Models were trained with maximum learning rate 1e-3, 4k steps of linear warm-up and cosine decay for 50k steps in total, using a batch size of 256 and a context length of 512. We use the validation perplexity as a stopping criterion and evaluate on the test set.

B.1 Synthetic Tasks and Long Range Arena

B.1.1 Test of xLSTM’s Exponential Gating with Memory Mixing.

We evaluate xLSTM on a suite of formal language tasks to test its exponential gating and memory mixing mechanism.

Formal languages provide a framework to probe the generalization capabilities of models. They allow us to test different expressivity levels specifically, e.g. along the Chomsky hierarchy. Typical language model architectures do not necessarily fit perfectly in these hierarchies (Delétang et al., 2023); nevertheless, these languages make it possible to illustrate differences in generalization and expressivity between different architectures. Our evaluation tasks are heavily based on the work of Delétang et al. (2023).

Experiment Setup. The different formal language tasks in the experiment (see individual task descriptions below) encompass different levels of the Chomsky hierarchy as well as additional counting and memory-focused tasks. We use different lengths per sample, which allows us to validate in a length extrapolation setting. We train on a varying task length up to 40. The evaluation is done for task lengths between 40 and 256, as we are only interested in the "task generalization capabilities" of the models.

In all experiments, we use two blocks (or layers for the pure LSTM) for all models. We compare Llama, Mamba, Retention, Hyena, RWKV-4, RWKV-5, RWKV-6, LSTM, xLSTM[0:1], xLSTM[1:0] and xLSTM[1:1]. The sLSTM block is used without a convolution and with normal weight initialization. LSTM (Block) refers to an architecture where a vanilla LSTM is used instead of self-attention inside a Transformer block.

All models are trained with 3 different learning rates (1e-2, 1e-3, 1e-4), each with two seeds. Batch size is 256 — cosine annealing (min lr: 1e-5) with 10% warm-up steps is applied. We use AdamW (Loshchilov & Hutter, 2019) and a weight decay of 0.1 for training. In each experiment we train for 100k steps — the samples are generated randomly, however, all experiments are trained and evaluated on the same samples.
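The learning rate schedule described here (linear warm-up over the first 10% of steps, followed by cosine annealing to a minimum learning rate) can be written as a small function. This is a sketch under assumptions: the exact warm-up shape and step indexing are not specified in the text, and the function name is hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr, min_lr, warmup_frac=0.1):
    """Learning rate at a given step: linear warm-up, then cosine annealing."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup
    # Cosine decay from peak_lr (progress 0) to min_lr (progress 1).
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Setting from this experiment: 100k steps, peak 1e-3, minimum 1e-5.
for step in (0, 9_999, 55_000, 100_000):
    print(step, lr_at(step, 100_000, 1e-3, 1e-5))
```

The same function covers the other schedules in this appendix by changing `total_steps`, `min_lr`, and the warm-up length.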

Figure 11: Supplementary results given by scaled accuracy of different models at solving formal language tasks. Tasks are grouped by the Chomsky hierarchy.

Additional Formal Language Results. Figure 11 showcases supplementary results on formal language tasks, detailing tasks where no model attained a minimum scaled accuracy of 0.3. Although no model achieves proper extrapolation of the task to a larger context length, xLSTM performs best among the evaluated models.

Individual Task Description. The majority of tasks are based on Delétang et al. (2023). We provide the vocabulary size |V | and the random accuracy (for accuracy scaling), used in the evaluation. As we evaluate different task lengths each task has a padding token which is used to pad the sequence to the given context length. In Listing 1 there is an example for each task.

• Bucket Sort Given a string of tokens of a sorted alphabet, compute the sorted string.

• Cycle Nav Given a string of “movement tokens” (+1, -1, STAY), compute the end position of the agent with start position 0. The position must be computed modulo the maximum position.

• Even Pairs Given a binary string of a and b tokens, compute whether the number of ab and ba is even. This task can be solved by checking if the first and last token of the string are equal.

• Majority Given a string of tokens, compute the token that occurred most often in the sequence.

• Majority Count Given a string of tokens of an ordered alphabet, compute the count of the token that occurred most often in the sequence. If the count exceeds the vocab size, the highest vocab token should be output.

• Missing Duplicate Given a string of tokens that is repeated, with one of the tokens masked in the repetition, output the token that is masked.

• Mod Arithmetic (w/o Brackets) Calculate the result — modulo the max number — of the arithmetic operations in the context. The maximum number is the vocabulary size minus the number of special tokens (+,-,*,=, [PAD]).

• Mod Arithmetic (w Brackets) Calculate the result — modulo the maximum number — of the arithmetic operations in the context. The maximum number is vocabulary size minus the number of special tokens (+,-,*,=,(,), [PAD]).

• Odds First A string of tokens is given. Output all tokens with an odd index, then the tokens with an even index. Apart from that, keep the ordering of the initial string.

• Parity Given a binary string of a and b tokens, compute if the number of b's is even. If the number is even output a, otherwise b. This is equivalent to sequentially calculating the half-adder sum.

• Repetition Given a string of tokens — repeat it.

• Reverse String Given a string of tokens — repeat it in reverse order.

• Stack Manipulation An initial stack content is given, followed by a sequence of push and pop operations. Compute the stack content after the operations.

• Set Given a string of tokens, compute the ordered set of the tokens. Keep the ordering so that tokens that occurred first are also outputted first.

• Solve Equation Given is an equation with the operators {+,-,*,=,(,)}, numbers, and an unknown variable x. Compute the value of the variable modulo the max number. The maximum number is vocabulary size minus the number of special tokens (+,-,*,=,(,), [PAD], [ACT]).
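Several of the task definitions above can be made concrete with short reference solutions that compute the expected output for a given input. The following sketch covers Even Pairs, Parity, Cycle Nav, Set, and Missing Duplicate. The token spellings (`"+1"`, `"-1"`, `"STAY"`, `"[MASK]"`), the cycle length argument, and the function names are assumptions made for illustration.

```python
def even_pairs(s):
    """True if the number of adjacent 'ab' and 'ba' pairs is even."""
    switches = sum(1 for x, y in zip(s, s[1:]) if x != y)
    return switches % 2 == 0


def parity(s):
    """Output 'a' if the number of 'b' tokens is even, otherwise 'b'."""
    return "a" if s.count("b") % 2 == 0 else "b"


def cycle_nav(moves, num_positions):
    """End position on a cycle of num_positions, starting at position 0."""
    step = {"+1": 1, "-1": -1, "STAY": 0}
    pos = 0
    for move in moves:
        pos = (pos + step[move]) % num_positions
    return pos


def ordered_set(tokens):
    """Distinct tokens in order of first occurrence."""
    return list(dict.fromkeys(tokens))


def missing_duplicate(tokens, mask="[MASK]"):
    """tokens is a string followed by its repetition with one token masked."""
    half = len(tokens) // 2
    first, second = tokens[:half], tokens[half:]
    return first[second.index(mask)]


print(even_pairs("abba"), parity("abb"), cycle_nav(["+1", "+1", "-1", "STAY"], 5))
print(ordered_set(list("cabca")), missing_duplicate(["x", "y", "z", "x", "[MASK]", "z"]))
```

The Even Pairs function counts the switches directly; the shortcut mentioned in the task description (comparing the first and last token) yields the same answer, because every switch flips the current token.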

Listing 1: Examples of the formal language tasks. Red tokens are evaluated for loss and accuracy metrics, but are padded for the input. The tokens are illustrated in a way that allows easy semantic interpretation for the given task — hence, some tokens are represented by multiple characters.

B.1.2 Test of xLSTM’s Memory Capacities on Associative Recall Tasks.

We test the memory capacity of xLSTM with the Multi-Query Associative Recall task proposed by Arora et al. (2023). Figure 12 illustrates the basic task setup.

Why Multi-Query Associative Recall for Memory Tests of LLM Architectures. Associative Recall (AR), the ability to retrieve a specific value (information) associated with a given key (information), constitutes a key capability for LLMs to perform well (Poli et al., 2024; Arora et al., 2023; Olsson et al., 2022). In particular, the quality of in-context learning seems to be strongly connected to this capability (Olsson et al., 2022). Arora et al. (2023) attribute performance gaps between early non-Transformer and Transformer language models specifically to performance gaps in associative recall. They argue that prior AR evaluations fall short of capturing these differences and propose MQAR, which can show the AR performance differences that translate to differences in language modeling performance. Hence, MQAR is especially suitable to analyze the memory capacity of LLMs. Transformer (e.g. Llama) models can be seen as the gold standard for this task as their memory is exponential in the coding dimension (Ramsauer et al., 2021).

Experiment Setup. There are two relevant variables that determine different experimental setups. (1) Context Length (CL): Length of the sequence of one sample — this influences the distances between the key-value definition and the recall. (2) Number Key-Value Pairs (KV): Influences how many key-value pairs the model needs to keep track of. The vocabulary size is always 8192.
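To illustrate how the two variables shape a sample, the sketch below generates a simplified MQAR sequence: key-value definitions first, then one query per key somewhere in the remaining context. It is not the generator of Arora et al. (2023). In particular, it places queries uniformly at random rather than according to a power law, draws keys and values from disjoint halves of the vocabulary, and uses token 0 as filler; these choices and the function name are assumptions.

```python
import random


def make_mqar_sample(context_len, num_kv, vocab_size=8192, rng=None):
    """Return (inputs, targets); targets are None wherever the loss is ignored."""
    rng = rng or random.Random()
    if context_len < 4 * num_kv:
        raise ValueError("context too short for definitions plus queries")
    half = vocab_size // 2
    keys = rng.sample(range(1, half), num_kv)             # token 0 is the filler
    values = rng.sample(range(half, vocab_size), num_kv)  # disjoint from the keys
    inputs = [0] * context_len
    targets = [None] * context_len
    # First part of the sequence: the key-value definitions.
    for j, (key, value) in enumerate(zip(keys, values)):
        inputs[2 * j], inputs[2 * j + 1] = key, value
    # Remainder: each key is queried once at a random free position.
    free = list(range(2 * num_kv, context_len))
    for key, value, pos in zip(keys, values, rng.sample(free, num_kv)):
        inputs[pos] = key
        targets[pos] = value  # the model must emit the value right after the key
    return inputs, targets


inputs, targets = make_mqar_sample(context_len=64, num_kv=4, rng=random.Random(0))
print(inputs[:8], [t for t in targets if t is not None])
```

Increasing `context_len` stretches the distance between a definition and its query, while increasing `num_kv` raises the number of associations that must be held in memory at once.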

In all experiments, we use two blocks (or layers for the pure LSTM) for all models. LSTM (Block) model refers to an architecture where a vanilla LSTM is used instead of self-attention inside a Transformer block.

For each task setup, we train each model with 4 different learning rates (batch size > 24: {1e-2, 2.15e-3, 4.6e-4, 1e-4}, batch size 24: {1e-3, 2.2e-4, 5e-5, 1e-5}). The batch size (BS) changes depending on the context length (CL) (CL=64/128: BS=512; CL=256: BS=256; CL=756: BS=128; CL=1024: BS=96; CL=2048: BS=24). We vary the embedding dimension (Model Dim) between different experiments – different numbers of heads are used accordingly. For each experiment, we generate 100,000 training samples (validation: 3,000 samples) and train for 64 epochs. We apply cosine annealing (min lr: 1e-4 and 1e-5) with 10% warm-up steps. We use AdamW (Loshchilov & Hutter, 2019) and a weight decay of 0.1 for training.

We conduct three different experiments:

• MQAR-Experiment 1 evaluates, in the same fashion as Arora et al. (2023), a variety of models (Llama, Mamba, Mamba (noWT) - i.e. without weight tying, Retention, Hyena, H3, RWKV-4, RWKV-5, RWKV-6, LSTM, LSTM (Block), xLSTM[0:1], xLSTM[1:0] and xLSTM[1:1]) on increasing task difficulty by increasing the context length and number of key-value pairs simultaneously. We benchmark three parameter settings: CL,KV={(64,4),(128,8),(256,16)}.

• MQAR-Experiment 2 increases the task difficulty notably and goes beyond previous evaluations on this task. We individually scale the context length (CL={756, 1024, 2048}) and the key-value pairs (KV={48, 96, 256}) and evaluate all combinations. This experiment especially probes the memory capacity because the number of key-value pairs is high. To reduce the computational burden we only evaluate models that perform flawlessly in Experiment 1 — additionally we evaluate Transformer only in the hardest setting (CL=2048) as sanity check, because no performance decrease is expected.

• MQAR-Experiment 3 analyzes whether the AR capability learned on a certain context length extrapolates to bigger context lengths. For each KV setting of Experiment 2, we use the models (we select the 3 biggest model dimensions) trained on CL=2048 and evaluate bigger context lengths (CL={4096, 6144, 8192}).

Extended Results. The result of Experiment 1 can be found in Figure 13. In accordance with the results of Arora et al. (2023), H3, Hyena, and RWKV-4 fail to solve the task with a smaller model dimension. In contrast, xLSTM[1:1], xLSTM[1:0], Mamba, RWKV-5 and RWKV-6 are able to solve these settings for all model dimensions. The comparison of xLSTM[0:1] with both original LSTM variants indicates that the exponential gating mechanism improves the AR capabilities of the model. However, both fall short because of the reduced memory capacity compared to xLSTM[1:1] and xLSTM[1:0].

The results of Experiment 2 are presented in Figure 14. Scaling the context length has a low impact on the performance of the models. However, while xLSTM[1:1] and xLSTM[1:0] show no clear decay, both RWKV variants slightly but consistently lose performance with increasing context lengths. The varying number of key-value pairs, which mainly probes the memory capacity of the non-Transformer models, has a more notable impact across all models. RWKV-5 seems to outperform RWKV-6. The latter fails to learn the task at all in some KV=256 settings. Overall, xLSTM[1:1] is the best-performing non-Transformer model, suggesting that it provides enhanced memory capacity, also in long contexts.

Figure 15 shows the extrapolation results from Experiment 3. For xLSTM[1:1], xLSTM[1:0], and Mamba the model performance does not change in the extrapolation setting. The RWKV models (especially RWKV-5) degrade slightly with increasing context length. xLSTM[1:1] performs best, as it maintains its superior performance of Experiment 2.

Figure 12: Illustration of the MQAR task. Color pairs represent key-value pairs (keys have darker shade). The first part of the sequence defines the key-value pairs for the respective sample. After that, the keys appear randomly according to a power law distribution. Grey tokens in the input sequence represent a zero token. The “target” sequence contains the value after the respective key appearance; the rest of the tokens are ignored for the accuracy and loss calculation. The model must predict the value tokens given the respective key.

B.1.3 Test of xLSTM’s Long Range Capabilities on the Long Range Arena.

We assess the performance of xLSTM across tasks in the Long Range Arena benchmark (Tay et al., 2021), examining its ability to effectively handle longer context lengths and diverse data types.

Our experiments on Long Range Arena benchmark are composed of five tasks:

• Retrieval: The task is to predict if two documents have a citation link. The dataset of text documents is derived from the ACL Anthology Network (Radev et al., 2009).

• ListOps: This is a set of modular arithmetic tasks including brackets and lists of numbers, using the operations MIN, MAX, MEDIAN and SUMMOD (modular sum). A particular example is:

• Image: This task is based on a version of the CIFAR dataset (Krizhevsky, 2009), where images are transformed to a sequence of pixels and this sequence has to be classified into the usual CIFAR classes. We test both a gray-scale (G-Image) and RGB (RGB-Image) version of this dataset, as Orvieto et al. (2023) use colored images contrary to the standard setup.

• Pathfinder: The input for this task is a 32x32 gray-scale image, given as pixel sequence, with two dots and several curved lines on it. The task is to predict if the two dots are connected by any of the lines (Linsley et al., 2018).

We omit the Text classification task (Maas et al., 2011), as the language modeling experiments already test this kind of data, and the Pathfinder-X version of Pathfinder.
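To make the ListOps task concrete, the following sketch evaluates a bracketed expression with the four operations named above. The whitespace tokenization, the modulus of 10 for SUMMOD, and the truncation of non-integer medians are assumptions; the expression in the usage line is an illustrative input, not an example taken from the benchmark files.

```python
import statistics

# Assumed semantics: operands are digits, SUMMOD is the sum modulo 10.
OPS = {
    "MIN": min,
    "MAX": max,
    "MEDIAN": lambda xs: int(statistics.median(xs)),
    "SUMMOD": lambda xs: sum(xs) % 10,
}


def eval_listops(expression):
    """Evaluate a whitespace-tokenized ListOps expression with a stack."""
    stack = []
    for tok in expression.split():
        if tok.startswith("["):
            stack.append([tok[1:]])         # open a new operation frame
        elif tok == "]":
            op, *args = stack.pop()         # close the innermost frame
            value = OPS[op](args)
            if not stack:
                return value                # outermost bracket closed
            stack[-1].append(value)
        else:
            stack[-1].append(int(tok))      # plain operand
    raise ValueError("unbalanced expression")


print(eval_listops("[MAX 4 3 [MIN 2 3 ] 1 0 [MEDIAN 1 5 8 9 2 ] ]"))
```

The nesting depth and the length of the operand lists determine how far apart an operator and its closing bracket can be, which is what makes the task a long-range problem.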

Experiment Setup. The architectures that are tested in this experiment comprise Llama, Mamba, LSTM, RWKV-4, and xLSTM. LSTM (Block) refers to an architecture where a vanilla LSTM is used inside a post up-projection block (like Transformer with attention replaced by LSTM). For xLSTM we choose the best performing of xLSTM[0:1] or xLSTM[1:0] on the validation set, specifically the former for the Image tasks and the latter for all other ones.

We use the hyperparameter settings of the S5 model (Smith et al., 2022) and Linear Recurrent Unit model (Orvieto et al., 2023), with additional hyperparameter search on learning rates and schedulers for all models. We use two different schedulers: Linear Warm-up Cosine Annealing and Linear Warm-up Cosine Annealing with Restarts. Both learning rate schedulers were evaluated with learning rates of 1e-3, 6e-4 and 1e-4. For the second scheduler, the number of restarts (R) is set to 3. The model hyperparameters for each dataset are displayed in Table 5.

Results. Table 6 shows the result of experiments on the Long Range Arena benchmark. xLSTM demonstrates consistent strong performance on all of the tasks, suggesting that the proposed architecture is remarkably efficient in handling different aspects of long context problems.

Figure 13: Result of MQAR-Experiment 1. The columns show different task settings (context length and key-value pairs). The rows group related models for better clarity. The x-axis gives the model size and the y-axis the validation accuracy.

Figure 14: Result of MQAR-Experiment 2. The columns and rows correspond to different numbers of key-value pairs and the context length, respectively. The x-axis gives the model size and the y-axis the validation accuracy.

Table 5: Long Range Arena model hyperparameters. These are the model hyperparameters used in each of the Long Range Arena tasks. For each model we used the best learning rate and the better of the two learning rate schedulers.

Figure 15: Result of MQAR-Experiment 3 (Extrapolation). All evaluated models were trained on context length 2048 and the number of key-value pairs given by the columns of the plot. The rows show the different context lengths used in the evaluation. The x-axis gives the model size and the y-axis the validation accuracy.

Table 6: Long Range Arena test accuracy. Bold highlights the best performing model, underlined the second best. X denotes models that fail to outperform random baselines. xLSTM is the best of xLSTM[1:0], xLSTM[0:1] based on validation dataset accuracy.

B.2 Method Comparison and Ablation Study on SlimPajama (15B)

General Training Procedure. We tokenize our datasets using the HuggingFace GPT-2 tokenizer (Radford et al., 2019; Brown et al., 2020) and use this tokenizer for all models in this paper. In general, we try to follow Brown et al. (2020) for the general training setup, i.e. we choose context length 2048 and batch sizes 256 or 512 for our models. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with beta parameters $(\beta_1, \beta_2)$. As learning rate scheduler we use a linear warm-up with 750 steps and cosine decay to 10% of the peak learning rate. We apply a weight decay of 0.1 to all our models and always exclude the token embedding matrix from weight decay. If not specified otherwise, we do not tie the weights of the token embedding and the language model head. For parallelization, we use PyTorch FSDP in SHARD_GRAD_OP mode with mixed precision in bfloat16, where applicable. For small models we use NO_SHARD. We keep the weights in float32 and reduce the gradients across GPUs in float32. We use torch.compile to speed up models, except for Transformer models as their training curves did not match the non-compiled versions. For xLSTM[7:1], we use positions [3, 5, 7, 40, 42, 44] for sLSTM-based blocks, except for the 125M size, where we use [3, 20] (this is actually a [11:1] ratio).

Table 7: Peak learning rates and model dimensions for scaling law plots.

Details on Comparison to Other Methods. For the model comparison on 15B training tokens of SlimPajama we train all models with context length 2048 and batch size 256. We use a peak learning rate of 1e-3 for all models for comparability. The learning rate decays over 30k training steps. The models are compared after one epoch at training step 28170. As model implementations we use the original repositories’ code for Mamba (Gu & Dao, 2023), RWKV-5, and RWKV-6 (Peng et al., 2024). For RWKV-4, we use a cleaned and validated re-implementation based on the original repo and kernels (Peng et al., 2023). For HGRN (Qin et al., 2023), GLA (Yang et al., 2023), and HGRN2 (Qin et al., 2024) we use a re-implementation by the authors of GLA (Yang et al., 2023; Yang & Zhang, 2024). For GPT-3 and Llama-like Transformers, we use our own implementations based on PyTorch. Note that for all xLSTMs, Transformers, Mamba and RWKV-4, we use Mixed Precision training with bfloat16 and weights in float32 precision, while resorting to full bfloat16 precision (weights and operations) for all other models due to their custom kernels that force one precision internally. Following the general training procedure we use torch.compile for all models, except for models using the flash-linear-attention (Yang & Zhang, 2024) library because of compilation problems.

General Details on Ablation Studies. We follow our general training procedure and train all models with context length 2048, batch size 256 and peak learning rate 1e-3. We report perplexity values on the validation set.

Additional Ablation Study on Matrix Memory. As default block configuration we use the mLSTM in the pre up-projection block (see Figure 10) and the sLSTM in the post up-projection block (see Figure 9). In this experiment we study the combination of mLSTM with different block variants using the xLSTM[1:0] architecture. We compare the mLSTM in a post up-projection block (see Figure 3 and 9) with ReLU activation function and non-gated feed-forward network to mLSTM in a pre up-projection block with and without a dimension-wise causal convolution. Table 8 shows that the matrix memory benefits from the pre up-projection block structure, and that the convolution within this block is important.

Table 8: Matrix Memory variants. We study different configurations for the matrix memory. Matrix memory in the pre up-projection block performs best and gives xLSTM[1:0]. Notably, it seems that the dimension-wise causal convolution within the pre up-projection block is important.

Details on new xLSTM Components Ablation Study. In Table 2 (top), we show our modifications to the vanilla LSTM that transform the vanilla LSTM into the xLSTM. We start with a large default PyTorch LSTM with 24 layers and 1536 hidden size. Due to a lack of skip-connections and LayerNorms, vanilla LSTMs of this size are not trainable. We then add skip-connections and pre-LayerNorms before each LSTM layer corresponding to a residual architecture. This enables training for LSTMs at this scale. Replacing every second LSTM layer by a non-gated feed-forward network with GeLU activation function (similar to Vaswani et al.), which corresponds to the post up-projection backbone (see Figure 3) further boosts performance. Adding Exponential Gating to this architecture yields the sLSTM as depicted in Figure 9, with another large performance improvement. Finally, adding the best Matrix Memory variant found in Table 8 by replacing some sLSTM blocks with the mLSTM (see Figure 10) gives xLSTM[7:1] with the best performance.

Details on Gating Technique Ablation Study. In Table 2 (bottom), we investigate the effect of trainable and input-dependent gates for mLSTM. The results show that, in contrast to other methods (Katharopoulos et al., 2020; Sun et al., 2023; Qin et al., 2023; Katsch, 2023; Yang et al., 2023; Qin et al., 2024; Peng et al., 2024), having the gates both learnable and input dependent gives the best results.

Details on Scaling Experiments. We follow our general training procedure (see paragraph above) and train all models, including the 1.3B and 2.7B model sizes, with context length 2048 and batch size 256. We use the peak learning rates from Table 7.

B.3 xLSTM Large Language Models – SlimPajama300B

General Training Procedure. We use the same general training procedure as in Section B.2 with peak learning rates from Table 7. All models are trained with context length 2048. The 125M, 350M and 760M models are trained with batch size 256 for 600k training steps, whereas the 1.3B models are trained with batch size 512 for 300k training steps. We keep the same learning rate scheduler across all models.

Details on Downstream Evaluation. We use the LM Evaluation Harness from EleutherAI (Sutawika et al., 2023) for evaluating the following tasks that measure common sense reasoning: LAMBADA (OpenAI version in LM Evaluation Harness) (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC-challenge, ARC-easy (Clark et al., 2018), WinoGrande (Sakaguchi et al., 2021). This selection of downstream tasks is inspired by Gu & Dao (2023).

Following Gu & Dao (2023), we report accuracy for LAMBADA, WinoGrande, PIQA, and ARC-easy, and accuracy normalized by sequence length for HellaSwag and ARC-challenge.

We evaluate all models in full float32, full bfloat16 and bfloat16 Mixed Precision with weights in float32. For each model we select the best value respectively.

Details on PALOMA. We use 16 out of the 18 data sources of the PALOMA dataset (Magnusson et al., 2023). We use C4 (Raffel et al., 2019), MC4-EN (Xue et al., 2021), Wikitext-103 (Merity et al., 2017), PennTreebank (Vadas & Curran, 2011), RedPajama (TogetherComputer, 2023), Falcon Refinedweb (Refined Web) (Penedo et al., 2023), Dolma v1.5 (Soldaini et al., 2023), M2D2 S2ORC, M2D2 Wikipedia (Reid et al., 2022), C4-100-Domains (C4 Domains) (Chronopoulou et al., 2022), Dolma-100-Subreddits (Dolma Subreddits) (Soldaini et al., 2023), Dolma-100-Programming Languages (Dolma Coding) (Soldaini et al., 2023; Kocetkov et al., 2022), TwitterAAE (Blodgett et al., 2016; Liang et al., 2023), Manosphere Corpus (Ribeiro et al., 2021), GAB Corpus (Zannettou et al., 2018), 4CHAN Corpus (Papasavva et al., 2020). We leave out ThePile (Gao et al., 2021) and ICE (Greenbaum & Nelson, 1996) as they are not part of Paloma’s Huggingface dataset repository. A detailed description of these datasets can be found in Magnusson et al. (2023, Table 2). All models are evaluated in bfloat16 Mixed Precision.

Results on the data sources TwitterAAE, Manosphere, GAB and 4CHAN are reported in Table 9 and for each individual dataset the results are given in Section C.

Table 9: Perplexity values per domain.

In order to evaluate the perplexity values on each data source, we split the text documents into sequences of length 2048, which corresponds to the pre-training context length of all models. For documents longer than 2048 tokens we split each document into non-overlapping input sequences. In this case for the last input sequence, we follow the LM Evaluation Harness and fill up the full 2048 token context window with previous tokens, but compute the perplexity only on the remaining tokens.
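The windowing scheme described above can be sketched as follows. The function returns, for each window, the tokens fed to the model and the number of trailing tokens that are scored. It is a simplified illustration: the one-token shift between inputs and next-token targets is omitted, and the function name is hypothetical.

```python
def split_for_ppl(tokens, ctx=2048):
    """Split a document into (window, num_scored) pairs for perplexity evaluation."""
    windows = []
    n = len(tokens)
    for start in range(0, n, ctx):
        end = min(start + ctx, n)
        if end - start == ctx or start == 0:
            # Full window, or a document shorter than ctx: score every token.
            windows.append((tokens[start:end], end - start))
        else:
            # Last, shorter chunk: fill the window with previous tokens as
            # context, but score only the tokens not scored before.
            windows.append((tokens[end - ctx:end], end - start))
    return windows


for window, num_scored in split_for_ppl(list(range(10)), ctx=4):
    print(window, "scored:", window[-num_scored:])
```

Every token of the document is scored exactly once, and the final window still sees a full context even though only its last tokens contribute to the perplexity.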

We compute the token perplexities per data source in Table 4 as the exponential of the negative log-likelihoods per domain weighted by the number of tokens per domain in that data source, as defined in Magnusson et al. (2023, Equation 1).
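The difference between this token-weighted aggregation and a plain average of per-domain perplexities can be seen in a small example. The sketch below assumes that each domain is summarized by its summed natural-log likelihood and its token count; the numbers are made up for illustration and the function names are hypothetical.

```python
import math


def aggregate_perplexity(domains):
    """Token-weighted perplexity from (sum_log_likelihood, num_tokens) pairs."""
    total_ll = sum(ll for ll, _ in domains)
    total_tokens = sum(n for _, n in domains)
    return math.exp(-total_ll / total_tokens)


def macro_average_perplexity(domains):
    """Unweighted mean of the per-domain perplexities, shown for contrast."""
    ppls = [math.exp(-ll / n) for ll, n in domains]
    return sum(ppls) / len(ppls)


# Made-up domains: 100 tokens at 1.0 nats/token and 300 tokens at 3.0 nats/token.
domains = [(-100.0, 100), (-900.0, 300)]
print(aggregate_perplexity(domains))      # exp(2.5)
print(macro_average_perplexity(domains))  # (e + e**3) / 2
```

Because the aggregate pools log-likelihoods before exponentiating, larger domains carry more weight, which is why aggregated values generally differ from the mean of the per-domain numbers.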

C Detailed Results on PALOMA Language Model Evaluation

We report the perplexity values on each of the 571 subdomains of PALOMA in Table 10. Note that the aggregated perplexity values in Table 4 are not macro averages of the values shown in Table 10.

Table 10: PPL Evaluations: For the 1.3B sized models trained on 300B SlimPajama tokens, these are the detailed evaluation results on the respective validation datasets.
