35:[["$","audio",null,{"id":"tts"}],["$","$L3a",null,{"paperID":"2405.05417","publisher":"arxiv","paperJSON":{"title":"Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models","paperID":"2405.05417","avgLineHeight":13.55,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"The disconnect between tokenizer creation and model training in language models allows for specific inputs, such as the infamous ","element":"span"},{"text":"_SolidGoldMagikarp ","element":"span"},{"text":"token, to induce unwanted model behaviour. ","element":"span"},{"text":"Although such ‘glitch tokens’, tokens present in the tokenizer vocabulary but that are nearly or entirely absent during model training, have been observed across various models, a reliable method to identify and address them has been missing. We present a comprehensive analysis of Large Language Model tokenizers, specifically targeting this issue of detecting under-trained tokens. Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop novel and effective methods for automatically detecting these problematic tokens. Our findings demonstrate the prevalence of such tokens across a diverse set of models and provide insights into improving the efficiency and safety of language models.","element":"span"}],[{"style":{"width":"82%"},"width":722,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/0-0.png","element":"img"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Large Language Models (LLMs) have undergone remarkable advancements, becoming increasingly capable of understanding and generating humanlike text. 
While most components of these models are trained in an unsupervised fashion on vast amounts of data, the tokenizer typically remains a separately trained component based on custom algorithms and smaller datasets.","element":"span"}],[{"text":"GPT-2 laid the foundation for much of current-day transformer-based language modelling (","element":"span"},{"href":"#id-0","text":"Rad","element":"a"},{"href":"#id-0","text":"ford et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","text":"2019","element":"a"},{"text":"), including a framework for tokenization building on previous work in byte-pair encoding (BPE) (","element":"span"},{"href":"#id-1","text":"Sennrich et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","text":"2016","element":"a"},{"text":"), that has since been widely adopted. ","element":"span"},{"text":"Tokenization using BPE converts input text to a sequence of subword tokens by iteratively merging two neighbouring tokens using a fixed set of merge rules. These rules","element":"span"}],[{"style":{"width":"99%"},"width":870,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/0-1.png","element":"img"}],[{"text":"Figure 1: Illustrative example of ‘glitch’ tokens.","element":"figcaption","subtype":"caption"}],[{"text":"are learned using a greedy training algorithm on a smaller dataset, which is ideally representative of the LLM’s training data. Recent work in this area has primarily focused on techniques to remove the need for tokenization altogether by moving to raw byte input (","element":"span"},{"href":"#id-2","text":"Xue et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","text":"2022","element":"a"},{"text":"). 
This choice typically comes at a significant inference speed cost, which can be compensated for by specialized architectures at the initial and final layers (","element":"span"},{"href":"#id-3","text":"Yu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","text":"2023","element":"a"},{"text":"), or variable compute at intermediate layers (","element":"span"},{"href":"#id-4","text":"Sla","element":"a"},{"href":"#id-4","text":"gle","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","text":"2024","element":"a"},{"text":"). However, these techniques have not been widely adopted, and the vast majority of contemporary models still rely on subword tokenization. The main alternative to BPE for subword tokenization is the Unigram method (","element":"span"},{"href":"#id-5","text":"Kudo","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","text":"2018","element":"a"},{"text":"), which despite work suggesting it outperforms BPE (","element":"span"},{"href":"#id-6","text":"Bostrom and Durrett","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","text":"2020","element":"a"},{"text":") is not in common use. For an in-depth overview of tokenization methods and their history, see ","element":"span"},{"href":"#id-7","text":"Mielke et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-7","text":"2021","element":"a"},{"text":").","element":"span"}],[{"text":"Despite its widespread use, the tokenization step has generally been found to be unsatisfactory, responsible for many unwanted LLM behaviours (","element":"span"},{"href":"#id-8","text":"Karpathy","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","text":"2024","element":"a"},{"text":"). ","element":"span"},{"text":"The disconnect between tokenizer and model training creates the potential for some tokens to rarely or never be seen in training. 
The presence of such tokens in model inputs can lead to unexpected model behaviour including hallucination or the generation of garbled outputs, leading to such tokens commonly being referred to as ‘glitch tokens’ (","element":"span"},{"href":"#id-9","text":"Geiping et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","text":"2024","element":"a"},{"text":"). We refer to these as ‘under-trained’ or ‘untrained’ tokens, reserving the latter term only for cases in which we have clear indication that the specific token had no model training data occurrences.","element":"span"}],[{"text":"The presence of such under-trained tokens has several drawbacks. ","element":"span"},{"text":"Firstly, they occupy capacity in a fixed-size tokenizer that could be better utilized for more frequently occurring tokens, reducing average input/output length and inference costs. Secondly, their deliberate or accidental presence in input data has the potential to cause unwanted model outputs and break downstream applications. Robustness to such unexpected or malicious input data is increasingly important with the proliferation of tool use and agents in LLMs that retrieve and process external data. 
Lastly, these tokens can potentially be exploited to more easily circumvent guardrails by pushing the model beyond its trained distribution (","element":"span"},{"href":"#id-9","text":"Geiping et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"Although previous work exists on identifying such tokens through model and tokenizer analysis (","element":"span"},{"href":"#id-10","text":"Rumbelow and Watkins","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","text":"Watkins and ","element":"a"},{"href":"#id-11","text":"Rumbelow","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","text":"Fell","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","text":"2023","element":"a"},{"text":"), there is a lack of reliable, automated methods that are well-tested and perform consistently across a wide range of models. Automated tools for detecting tokenizer issues provide ways to test and iteratively improve the development of tokenizers, and can also provide methods for protecting deployed models from unwanted inputs, for example through sanitization.","element":"span"}],[{"text":"In this work, we present effective and efficient techniques for identifying such problematic tokens based on the model embedding weights and tokenizer configuration. We apply these methods to a wide range of popular and recent open-weight models. Finally, we include a brief exploration of extensions of these techniques to closed-source models. To the best of our knowledge, this is the first work to present a set of automated, efficient, and theoretically sound methods that systematically and demonstrably identify ‘glitch’ tokens across various models and tokenizers. 
","element":"span"},{"text":"We also publish a general analysis tool compatible with Hugging Face models (","element":"span"},{"href":"#id-13","text":"Wolf et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","text":"2020","element":"a"},{"text":"), along with detailed results for each analyzed model.","element":"span"}]]},{"heading":"2 Methods","paragraphs":[[{"text":"Our approach consists of three main steps: i) First, we perform an in-depth tokenizer analysis by inspecting its vocabulary and observing its encoding and decoding behaviour, ii) Second, we calculate several indicators to identify candidate under-trained tokens, and iii) Third, we verify whether identified candidate tokens are indeed out of distribution by prompting the target model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Tokenizer analysis","element":"span"}],[{"text":"To aid our analysis, we start by defining a number of useful token categories.","element":"span"}],[{"text":"P","element":"span"},{"text":"ARTIAL ","element":"span"},{"text":"UTF-8 ","element":"span"},{"text":"SEQUENCES ","element":"span"},{"text":"are tokens representing byte sequences that cannot be converted to Unicode characters as they contain only part of the full UTF-8 encoding for a character. This is typical for ‘fallback byte’ tokens in the ","element":"span"},{"text":"0x80","element":"span"},{"text":"- ","element":"span"},{"text":"0xFF ","element":"span"},{"text":"range but can also include tokens with other partial Unicode characters, depending on whether BPE was applied directly at the byte level.","element":"span"}],[{"text":"U","element":"span"},{"text":"NREACHABLE TOKENS ","element":"span"},{"text":"are those that are never produced as a result of tokenizing text. We test this by checking if decoding a token to a string, and re-tokenizing this string, results in the original token ID. 
Such tokens are typically the result of tokenizer configuration errors or conflicts between trained and manually added tokens. As this test does not work when tokens cannot be decoded to a string, we exclude partial UTF-8 sequences from this category.","element":"span"}],[{"text":"S","element":"span"},{"text":"PECIAL TOKENS ","element":"span"},{"text":"are manually-defined tokens that typically bypass the standard pre-tokenization pipeline, and often serve specific purposes as control tokens, such as ","element":"span"},{"text":"","element":"span"},{"text":", which typically marks the beginning of an input sequence. We identify special tokens using the patterns ","element":"span"},{"text":"<...> ","element":"span"},{"text":"and ","element":"span"},{"text":"[...] ","element":"span"},{"text":"and list them separately from unreachable tokens, even if they may also be considered unreachable due to input sanitization in preprocessing.","element":"span"}],[{"text":"We detect and exclude partial UTF-8 sequences and unreachable tokens from our under-trained token detection pipeline, as they are not suitable for automatically building verification prompts. Our published model reports include separate tables with these tokens, and we briefly discuss some interesting model-specific results in section ","element":"span"},{"href":"#id-14","text":"3.2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Under-trained token indicators","element":"span"}],[{"text":"This section outlines our model architecture-dependent indicators, which we use to identify under-trained token candidates. 
A key distinction is made based on whether or not a model uses ‘tied’ embeddings (","element":"span"},{"href":"#id-15","text":"Inan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","text":"2017","element":"a"},{"text":"), that is, the model uses the same matrix for its input embeddings ","element":"span"},{"style":{"height":17.25},"width":60.43,"height":43.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/1-0.png","element":"img","alt":" Ein","inline":true,"padRight":true},{"text":"and the output embeddings matrix ","element":"span"},{"style":{"height":17.65},"width":254.04,"height":44.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/1-1.png","element":"img","alt":" Eout in the fi-","inline":true,"padRight":true},{"text":"nal ‘language modelling head’ layer","element":"span"},{"text":"1","element":"span"},{"text":". Regardless of whether tied embeddings are used, all weights of the output embeddings influence the token predictions at every training step. ","element":"span"},{"text":"Specifically, all untrained tokens will experience similar updates in training, ‘moving away’ from the mean output vector of the model (","element":"span"},{"href":"#id-16","text":"Biś et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2021","element":"a"},{"text":"). Thus, we can expect to find that under-trained token embeddings share a similar direction in output embedding space, and we can use this to identify them based on the distance to the embeddings of reference untrained tokens. 
This common direction can also be interpreted as a learnt constant vector that is shared between the residual stream and certain output embeddings, allowing the model to reliably generate highly negative logits for tokens that are never the correct prediction. To calculate the indicators based on the output embeddings ","element":"span"},{"style":{"height":17.25},"width":161.41,"height":43.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-0.png","element":"img","alt":" Eout, we","inline":true,"padRight":true},{"text":"start by defining a set of known untrained or highly under-trained embedding indices ","element":"span"},{"style":{"height":16.79},"width":54.37,"height":41.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-1.png","element":"img","alt":" tref","inline":true},{"text":", e.g. the token IDs for tokens such as ","element":"span"},{"text":"","element":"span"},{"text":", or the space of embeddings above the tokenizer vocabulary size.","element":"span"}],[{"text":"Next, we calculate the mean unused token embedding vector to serve as a reference:","element":"span"}],[{"style":{"width":"47%"},"width":412,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-2.png","element":"img"}],[{"text":"Finally, we take the cosine distances ","element":"span"},{"style":{"height":18.79},"width":236.36,"height":46.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-3.png","element":"img","alt":" C(Eout, uref)","inline":true,"padRight":true},{"text":"between this mean unused embedding vector and rows in ","element":"span"},{"style":{"height":18.45},"width":376.44,"height":46.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-4.png","element":"img","alt":" Eout, where C(A, x)","inline":true,"padRight":true},{"text":"is the vector of cosine distances between ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and rows 
in matrix ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"51%"},"width":454,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-5.png","element":"img"}],[{"text":"In addition to the cosine distance between output embeddings, we also calculate and visualize the Euclidean distance between output embeddings and the untrained reference ","element":"span"},{"style":{"height":18.79},"width":284.86,"height":46.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-6.png","element":"img","alt":" L2(Eout − uref)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":17.6},"width":279.85,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-7.png","element":"img","alt":" L2(A)i = ∥Ai∥","inline":true},{"text":". Finally, we also test more complex output embedding indicators which compensate for the possibility of a common directional bias being present in all embeddings. These experiments, which indicate that the simpler formulation is sufficient, are outlined in Appendix ","element":"span"},{"text":"A","element":"span"},{"text":".","element":"span"}],[{"text":"When embeddings are not tied, input embeddings for tokens which do not appear in the input for a training step are only affected by a potential weight decay term. If weight decay is applied to the input embedding matrix, the embeddings corresponding to under-trained tokens will tend to zero as training progresses. Alternatively, they will stay at a (typically low) initial value. The norm of the input embeddings thus provides an additional indicator of under-trained tokens with potentially higher sensitivity, and which conveniently does not require a set of previously known untrained tokens. 
Specifically, we expect that this indicator will not predict control tokens (such as ","element":"span"},{"text":"","element":"span"},{"text":") that are only seen in inputs.","element":"span"}],[{"text":"Thus, for models with tied embeddings, we use the cosine distance-based indicator ","element":"span"},{"style":{"height":18.8},"width":236.36,"height":46.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-8.png","element":"img","alt":" C(Eout, uref)","inline":true,"padRight":true},{"text":"to select candidate tokens. ","element":"span"},{"text":"For models without tied embeddings, we use the norm of ","element":"span"},{"style":{"height":17.65},"width":222.14,"height":44.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-9.png","element":"img","alt":" Ein, denoted","inline":true},{"style":{"height":18.45},"width":145.02,"height":46.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/2-10.png","element":"img","alt":"L2(Ein)","inline":true},{"text":", and additionally calculate and visualize all output embedding-based indicators.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Verification of candidate tokens","element":"span"}],[{"text":"The indicators we propose provide a natural ranking of candidate under-trained tokens, but do not give a definitive selection threshold. Their relative simplicity, while desired, is also likely to result in a somewhat noisy relation between indicator scores and model behaviour. To confirm that candidate tokens indeed induce unwanted model outputs, we verify all tokens that rank among the most likely 2% according to the chosen indicator, excluding partial UTF-8 sequences and unreachable tokens. 
This verification process involves constructing specific repetitive prompts that induce a high output probability for normal tokens, and checking if a particular candidate token has a very low (","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"text":"1","element":"span"},{"text":"%","element":"span"},{"text":") output probability. See Appendix ","element":"span"},{"text":"B ","element":"span"},{"text":"for details of parameters and model prompts.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Effectiveness of token indicators","element":"span"}],[{"text":"We validate our proposed indicators by relating them to both model behaviour and training data statistics. ","element":"span"},{"text":"Although such training data statistics are rarely publicly available, we are able to run a comprehensive three-way comparison on the open OLMo v1.7 model (","element":"span"},{"href":"#id-17","text":"Groeneveld et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","text":"2024","element":"a"},{"text":"). Figure ","element":"span"},{"href":"#id-18","text":"2 ","element":"a"},{"text":"shows a strong correlation between all proposed indicators and training data, not only predicting under-trained tokens, but extending to the entire range of token frequencies. Applying our verification step to all tokens shows that, despite their relative simplicity, our indicators are highly predictive of the maximal token output probability (Figure ","element":"span"},{"href":"#id-19","text":"3","element":"a"},{"text":"). More precisely, 191 out of 49,575 tokens pass our verification step, compared to 175 of 993 when testing only the top 2% candidate tokens, validating that the 2% threshold is a reasonable trade-off between computational cost and the ability to detect the majority of highly under-trained tokens. 
Finally, Figure ","element":"span"},{"href":"#id-20","text":"4 ","element":"a"},{"text":"shows examples of the visualizations we perform on all model indicators. These show a clear secondary peak near zero across models that contain the under-trained tokens, as well as high correlation between alternative indicators, further validating their effectiveness.","element":"span"}]]},{"heading":"3 Results","paragraphs":[[{"text":"In this section, we present a summary of our key findings on under-trained token detection. Table ","element":"span"},{"href":"#id-21","text":"1 ","element":"a"},{"text":"presents verification statistics and examples of verified under-trained tokens for a wide range of models. The number of verified tokens varies significantly across different model families and tokenizer vocabulary sizes, and also depends on the number of unused special tokens that a model’s tokenizer allows as plain-text input. The percentage of verified tokens typically ranges between 5–50% of tested candidate tokens, corresponding to 0.1–1% of the total vocabulary size.","element":"span"}],[{"text":"Given the model-specific nature and the extensive volume of results, we elaborate on some common findings and showcase representative examples for particular models. 
Comprehensive reports covering an increasing number of tested models and token types are available in our repository.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Common observations","element":"span"}],[{"text":"Although many of our findings are dependent on model-specific details such as tokenizer training and configuration, model architecture, and training data, there are a number of commonalities that appear across many different models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Single-byte tokens","element":"span"}],[{"text":"Tokens representing a single byte are a common source of untrained tokens. The most common occurrences are the bytes ","element":"span"},{"text":"0xF5","element":"span"},{"text":"–","element":"span"},{"text":"0xFF ","element":"span"},{"text":"which are not used in UTF-8 encoded text","element":"span"},{"text":"2","element":"span"},{"text":", and are a convenient source for quickly locating reference untrained tokens for indicators that require them. In addition, many tokenizers including those from the Gemma, Llama2 and Mistral families include every byte as a token, with many of them in the normal ASCII range ","element":"span"},{"text":"0x00","element":"span"},{"text":"–","element":"span"},{"text":"0x7F ","element":"span"},{"text":"being redundant and unreachable due to the existence of a token for the corresponding character.","element":"span"}],[{"text":"These issues are not universal, and we also find models which include precisely the 243 bytes used in UTF-8 as tokens. 
Untrained single byte tokens are typically classified as ‘partial UTF-8 sequences’ or ‘unreachable’, and our indicators are effective in revealing which ones are never or rarely seen in model training.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Intermediate BPE fragments","element":"span"}],[{"text":"All tested models use BPE-based tokenization, which retains the original tokens after a merge, often causing intermediate ‘junk’ tokens (","element":"span"},{"href":"#id-6","text":"Bostrom ","element":"a"},{"href":"#id-6","text":"and Durrett","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","text":"2020","element":"a"},{"text":"). ","element":"span"},{"text":"When mentioning examples of such under-trained fragments, we denote the more complete token in parentheses, e.g. ","element":"span"},{"text":"_TheNitrome ","element":"span"},{"text":"(","element":"span"},{"text":"_TheNitromeFan","element":"span"},{"text":") in the GPT-2 tokenizer. In some instances, the longest token is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"also ","element":"span"},{"text":"under-trained, along with a variety of fragments. The same mechanism appears to explain many under-trained partial UTF-8 sequences in byte-level BPE tokenizers, with multiple bytes being merged over several steps, potentially leaving multiple intermediate tokens with partial Unicode characters.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Special tokens","element":"span"}],[{"text":"Many models include untrained special tokens, such as ","element":"span"},{"text":"","element":"span"},{"text":", ","element":"span"},{"text":"","element":"span"},{"text":", or ","element":"span"},{"text":"<|unused_123|>","element":"span"},{"text":". 
In the following discussion we generally omit mentioning them, unless their status as an (un)trained token is particularly surprising, as their inclusion in the tokenizer and training data is typically deliberate, for purposes such as the ability to fine-tune models without changing tokenizers. ","element":"span"},{"text":"One common observation is that, on many occasions, tokens such as ","element":"span"},{"text":"","element":"span"},{"text":", which we expect to be completely untrained, nevertheless appear to have been seen in training. A likely cause is code repositories or guides about language models using these tokens in normal text, along with tokenizers allowing such special control tokens in input text.","element":"span"}],[{"text":"Special tokens can be unreachable due to input sanitization as well as configuration errors. In particular, both the Gemma and Yi models include special tokens relating to HTML tags, which were initially detected as unreachable, with the tags being split up in pre-tokenization.","element":"span"},{"text":"3","element":"span"}],[{"id":"id-14","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Model-specific observations","element":"span"}],[{"text":"In this section we highlight notable model-specific observations, ","element":"span"},{"text":"grouped by the tokenizer used. These examples are mainly intended to illustrate the variety of different under-trained tokens and","element":"span"}],[{"id":"id-21","style":{"width":"101%"},"width":1839,"height":1507,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/4-0.png","element":"img"}],[{"text":"Table 1: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Detection of under-trained tokens. 
","element":"figcaption","subtype":"caption"},{"text":"#Confirmed are the confirmed/tested numbers for the tokens tested in verification that are predicted with a maximal probability of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"< ","element":"figcaption","subtype":"caption"},{"text":"1","element":"figcaption","subtype":"caption"},{"text":"% ","element":"figcaption","subtype":"caption"},{"text":"across verification prompts. Examples were manually chosen for readability, similarity across models or for being particularly striking. Note that the leading ‘","element":"figcaption","subtype":"caption"},{"text":"_","element":"figcaption","subtype":"caption"},{"text":"’ in tokens such as ","element":"figcaption","subtype":"caption"},{"text":"_SolidGoldMagikarp ","element":"figcaption","subtype":"caption"},{"text":"indicates a leading space.","element":"figcaption","subtype":"caption"}],[{"id":"id-18","style":{"width":"98%"},"width":1785,"height":710,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/4-1.png","element":"img"}],[{"text":"Figure 2: Under-trained token indicators are highly predictive of training data. The embedding-based under-trained token indicators for the OLMo v1.7 7B model and the number of times each token appears in the first epoch of training are shown. All indicators correlate strongly with the number of times a token is seen in training, not only at the expected lower values, but extending across ten orders of magnitude.","element":"figcaption","subtype":"caption"}],[{"id":"id-19","style":{"width":"94%"},"width":829,"height":566,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/5-0.png","element":"img"}],[{"text":"Figure 3: Under-trained token indicators are predictive of verification probability. 
The rate of successful verification (","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"p < ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"01","element":"figcaption","subtype":"caption"},{"text":") correlates very highly with our proposed indicator, with no false positives at low values of the indicator scores and a low rate of false negatives. The dotted vertical line indicates the default 2% threshold used for verification.","element":"figcaption","subtype":"caption"}],[{"id":"id-20","style":{"width":"97%"},"width":853,"height":1108,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/5-1.png","element":"img"}],[{"text":"Figure 4: Comparison of indicators. The scatter plots are coloured by token ID, from light green to dark blue. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Top","element":"figcaption","subtype":"caption"},{"text":": Rakuten 7B showing a separate cluster for added tokens, and high correlation near zero, showing that the two different indicators are similar in effectiveness. 
","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Bottom","element":"figcaption","subtype":"caption"},{"text":": In density plots, a clear peak appears near zero for most models, giving rise to a bimodal distribution.","element":"figcaption","subtype":"caption"}],[{"text":"configuration issues that can be identified using our methods, and are not exhaustive.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"GPT-2 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-0","text":"Radford et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","text":"2019","element":"a"},{"text":") introduced the framework for much of current-day LLM development, and the tokenizer has been re-used extensively. ","element":"span"},{"text":"We confirm previous findings with a significant number of tokens related to (fragments of) usernames (e.g. ","element":"span"},{"text":"_TheNitrome","element":"span"},{"text":", ","element":"span"},{"text":"_RandomRedditor","element":"span"},{"text":"). ","element":"span"},{"text":"We also find a number of under-trained non-English tokens. ","element":"span"},{"text":"Additionally, all ASCII control characters except for the newline character, but including the tab and carriage return characters, appear untrained. 
This suggests a potential mismatch in data normalization between training and inference.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"GPT-J 6B ","element":"span"},{"text":"(","element":"span"},{"href":"#id-22","text":"Wang and Komatsuzaki","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","text":"2021","element":"a"},{"text":") and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Phi-2 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-23","text":"Microsoft","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","text":"2023","element":"a"},{"text":") are independent models which both also use the GPT-2 tokenizer, and have significantly more under-trained tokens, likely due to their training data being further removed from the data used to train the tokenizer. These additional tokens include ","element":"span"},{"text":"_SolidGoldMagikarp","element":"span"},{"text":", which is not among verified candidates in GPT-2.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"GPT-NeoX ","element":"span"},{"text":"is an open-source library and associated family of models that uses a tokenizer with the same vocabulary size as GPT-2, but trained on the same ‘The Pile’ dataset also used for model training, and with added tokens for multiple spaces (","element":"span"},{"href":"#id-24","text":"Black et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","text":"2022","element":"a"},{"text":"). The GPT-NeoX 20B model has very few under-trained tokens, likely in part due to this alignment between tokenizer and model training, with the fragment ","element":"span"},{"text":"FFIRMED ","element":"span"},{"text":"showing up most consistently. 
The Pythia 6.7B model based on the same library (","element":"span"},{"href":"#id-25","text":"Biderman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","text":"2023","element":"a"},{"text":") also shows very similar results.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"OLMo ","element":"span"},{"text":"open language models (","element":"span"},{"href":"#id-17","text":"Groeneveld ","element":"a"},{"href":"#id-17","text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","text":"2024","element":"a"},{"text":") also use the GPT-NeoX tokenizer, but have a much higher rate of under-trained tokens, including a wide range of punctuation-based tokens. We also detect over 200 unreachable tokens representing combinations of spaces and line breaks in the tokenizer, which appear to be caused by the aforementioned ‘multiple spaces’ tokens taking precedence. However, many of them appear to have been seen in training, based on both our indicators and training data statistics.","element":"span"},{"text":"4","element":"span"}],[{"text":"Furthermore, we noticed that embedding-based indicators are not near zero for the GPT-NeoX and Pythia models, as well as v1 of the OLMo","element":"span"}],[{"text":"model. For the GPT-NeoX/Pythia models, this is explained by a specific implementation of weight decay, where only weights that are used in the forward pass are affected, but we find that having low but non-zero embeddings is still a good predictor of under-trained tokens. The OLMo v1 model instead applies no weight decay, and requires using output embedding-based indicators instead. However, the OLMo v1.7 model does apply weight decay to embeddings, and its embedding norms are near zero for untrained tokens (cf. Figure ","element":"span"},{"href":"#id-18","text":"2","element":"a"},{"text":"), and we use only this more recent version in this work. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Llama2 ","element":"span"},{"text":"models (","element":"span"},{"href":"#id-26","text":"Touvron et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","text":"2023","element":"a"},{"text":") use a relatively compact BPE tokenizer, and have a low number of under-trained tokens, mostly relating to long non-English words, including ","element":"span"},{"style":{"height":15.6},"width":882.98,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-0.png","element":"img","alt":"_Mediabestanden, _Расподела, and _Portály.","inline":true,"padRight":true},{"text":"We also find under-trained intermediate fragments such as ","element":"span"},{"text":"_gepublic ","element":"span"},{"text":"(","element":"span"},{"text":"_gepubliceerd","element":"span"},{"text":"). Several of these tokens were also found in previous work on steering model outputs (","element":"span"},{"href":"#id-9","text":"Geiping et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","text":"2024","element":"a"},{"text":"). ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Mistral ","element":"span"},{"text":"models (","element":"span"},{"href":"#id-27","text":"Jiang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","text":"2023","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","text":"2024","element":"a"},{"text":") use a similar tokenizer, but its vocabulary includes a significant number of multi-character punctuation sequences ending in a carriage return (","element":"span"},{"text":"\\r","element":"span"},{"text":"), which are the main source of under-trained tokens. 
The ","element":"span"},{"text":"\\uefc0 ","element":"span"},{"text":"token representing a single unassigned Unicode character in the ‘private use area’ is consistently among the most under-trained, along with ","element":"span"},{"style":{"height":12.8},"width":26,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-1.png","element":"img","alt":"᥀","inline":true},{"text":", a character from the Limbu script. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Rakuten ","element":"span"},{"text":"7B (","element":"span"},{"href":"#id-29","text":"Rakuten Group et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","text":"2024","element":"a"},{"text":") is a derived model with an extended vocabulary for Japanese, and continued pre-training. Among the extended vocabulary we find a few under-trained fragments such as ","element":"span"},{"style":{"height":16},"width":532.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-2.png","element":"img","alt":" 稲田大学 (早稲田大学, ‘Waseda","inline":true,"padRight":true},{"text":"University’). Their presence is proportional to the extended vocabulary, which forms a distinct cluster when visualising their indicators (see Figure ","element":"span"},{"href":"#id-20","text":"4","element":"a"},{"text":"). ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Gemma ","element":"span"},{"text":"is a family of models by Google DeepMind (","element":"span"},{"href":"#id-30","text":"Gemma Team et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","text":"2024","element":"a"},{"text":") and uses a large 256,000 token vocabulary, which includes a significant number of under-trained fragments in various scripts. 
Most notably, we find many under-trained tokens which contain ‘","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-3.png","element":"img","alt":"ſ","inline":true},{"text":"’ (an archaic form of ‘s’ in German), including ","element":"span"},{"style":{"height":12.4},"width":151.15,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-4.png","element":"img","alt":" _müſſen","inline":true},{"text":", as well as a number of translations of ‘stock photos’ such as ","element":"span"},{"text":"_stockbilder ","element":"span"},{"text":"and ","element":"span"},{"text":"_stockfotos","element":"span"},{"text":". ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Command R and R+ ","element":"span"},{"text":"are models by ","element":"span"},{"href":"#id-31","text":"Cohere","element":"a"}],[{"text":"(","element":"span"},{"href":"#id-31","text":"2024","element":"a"},{"text":") which also have a large multi-lingual vocabulary with over 255,000 tokens. The most notable discovery in these models is that over 1,400 manually added emoji tokens are categorized as unreachable, and are all clearly untrained according to the indicators. 
","element":"span"},{"text":"Additionally, among partial UTF-8 sequences are several tokens related to the English flag followed by invisible Unicode ‘tag’ characters, which we tracked to a conversion step from image-based flags to emojis in an open-source pipeline for parsing Wikipedia pages, potentially affecting other models as well.","element":"span"},{"href":"#id-32","text":"5","element":"a"}],[{"text":"The ","element":"span"},{"style":{"fontWeight":"bold"},"text":"tiktoken ","element":"span"},{"text":"library by OpenAI (","element":"span"},{"href":"#id-33","text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","text":"2024","element":"a"},{"text":") includes the ‘cl100k’ tokenizer as used in GPT-3.5/GPT-4 as well as several other models. This tokenizer uses a pre-tokenization pattern which allows not only a starting space, but many other single punctuation characters at the start of a token. ","element":"span"},{"text":"This choice results in tokens such as ","element":"span"},{"text":"\\tTokenNameIdentifier ","element":"span"},{"text":"and ","element":"span"},{"text":"$$PostalCodesNL","element":"span"},{"text":", which are highly sensitive to pre-tokenization splitting, with leading spaces before the token resulting in different tokenization. In combination with their specific content, this is likely to have made them more severely under-trained across models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"StableLM2 ","element":"span"},{"text":"is a model by Stability AI (","element":"span"},{"href":"#id-34","text":"Bellagente ","element":"a"},{"href":"#id-34","text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","text":"2024","element":"a"},{"text":") that uses a slightly modified version of the ‘cl100k’ tokenizer. 
Due to the addition of digit splitting, the original multi-digit tokens were expected to show up as both unreachable and untrained, but were initially only detected as untrained due to a tokenizer configuration error.","element":"span"},{"href":"#id-35","text":"6","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Qwen ","element":"span"},{"text":"is a model family by Alibaba (","element":"span"},{"href":"#id-36","text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","text":"2023","element":"a"},{"text":") which significantly extends the ‘cl100k’ tokenizer to over 150,000 tokens. ","element":"span"},{"text":"The added tokens and large inherited tokenizer result in many under-trained tokens, and among added tokens we find archaic Chinese characters (such as ","element":"span"},{"style":{"height":16},"width":130.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-5.png","element":"img","alt":" 𬳽) and","inline":true,"padRight":true},{"text":"Korean characters which are typographically valid but never seen in normal text (such as ","element":"span"},{"style":{"height":16},"width":58.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-6.png","element":"img","alt":" 앐).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Llama3 ","element":"span"},{"text":"is a recent model family by ","element":"span"},{"href":"#id-37","text":"Meta AI ","element":"a"},{"text":"(","element":"span"},{"href":"#id-37","text":"2024","element":"a"},{"text":") which also extends this tokenizer with 28,000 additional tokens. 
","element":"span"},{"text":"Aside from sharing many under-trained tokens with other models using the ‘cl100k’ tokenizer, the newly added tokens include additional under-trained tokens such as ","element":"span"},{"style":{"height":12.8},"width":462.68,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/6-7.png","element":"img","alt":" ЎыџNЎыџN and krvldkf.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"StarCoder2 ","element":"span"},{"text":"is a family of models resulting from the BigCode project, an open-scientific collaboration focused on code (","element":"span"},{"href":"#id-38","text":"Lozhkov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","text":"2024","element":"a"},{"text":"). ","element":"span"},{"text":"The open nature of the project represents a great opportunity for further investigation, allowing us to determine the source","element":"span"}],[{"text":"5","element":"span"},{"id":"id-32","text":"Our ","element":"span"},{"href":"https://github.com/spencermountain/wtf_wikipedia/pull/573","text":"submitted fix ","element":"a"},{"text":"for this has been released. ","element":"span"},{"text":"6","element":"span"},{"id":"id-35","text":"This bug was fixed by disabling the ‘slow’ tokenizer.","element":"span"}],[{"text":"of under-trained tokens in the published tokenizer training data. 
We find a single document which illustrates maximal variable lengths in Java by repeating ‘LoremipumdolorsitametdconsecteturadipiscingelitIntegervelvelittr’ as the source of several long under-trained tokens, a single document with base-64 encoded strings as the origin of tokens such as ","element":"span"},{"text":"BjKPZFq","element":"span"},{"text":", and a single source code file with a list of solutions of a Wordle game with words categorized by dialect as the source of several tokens such as ","element":"span"},{"style":{"height":12.8},"width":476.58,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/7-0.png","element":"img","alt":" Ostschwizertütsch relating","inline":true,"padRight":true},{"text":"to Swiss German dialects. Furthermore, the tokenizer is unique in missing the ","element":"span"},{"text":"0xF1 ","element":"span"},{"text":"byte as a token in addition to not including unused UTF-8 bytes, and input text containing this byte results in ","element":"span"},{"text":"<|endoftext|> ","element":"span"},{"text":"being used as a fallback ‘unknown’ token.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Yi ","element":"span"},{"text":"9B is a base model by 01.AI whose training data is focused on English and Chinese (","element":"span"},{"href":"#id-39","referenceIndex":1,"text":"01.AI ","element":"a"},{"href":"#id-39","referenceIndex":1,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":1,"text":"2024","element":"a"},{"text":"). Most notable among results are a number of strange tokens starting with ‘n’, including ","element":"span"},{"text":"nConsequently ","element":"span"},{"text":"and ","element":"span"},{"text":"nInterestingly ","element":"span"},{"text":"which may have been caused by incorrectly processing newline characters in tokenizer training data. 
In addition, three tokens with Chinese phrases including ","element":"span"},{"style":{"height":14.4},"width":107.56,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/7-1.png","element":"img","alt":" 毛泽东","inline":true,"padRight":true},{"text":"are unusual unreachable tokens.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Jamba ","element":"span"},{"text":"v0.1 is a model from AI21 based on a hybrid Transformer-Mamba mixture-of-experts architecture with 52B total parameters (","element":"span"},{"href":"#id-40","text":"Lieber et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","text":"2024","element":"a"},{"text":"). This model has very few tokens that pass our strict threshold for verification, and probabilities for token output are often unusually close to one. Tokenizer analysis does reveal 1,542 untrained special tokens, with ","element":"span"},{"text":"<|startoftext|> ","element":"span"},{"text":"as the only special token which has been trained. The latter is also an extreme outlier in our verification, with indicators showing its clear presence in training data, while the maximal probability of producing the token is ","element":"span"},{"style":{"height":15.13},"width":133.04,"height":37.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/7-2.png","element":"img","alt":" ≈ 10−8","inline":true},{"text":". The unusually sharp probability distributions may be an effect of the novel architecture of this model.","element":"span"}]]},{"heading":"4 Application to closed-source models","paragraphs":[[{"text":"As our techniques involve directly using model weights, they are not directly applicable to closed-source models whose weights are not publicly available. However, the experience gained in inspecting a large variety of open-weight models provides insight which we adapt and transfer to closed models. 
For these tests, we use a custom prompt designed to exactly repeat strings and see if models appear incapable of doing so. For details of prompts and results, see Appendix ","element":"span"},{"text":"D","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Mistral’s ","element":"span"},{"text":"flagship API models do not consistently include information about tokenizers, but tokenizers are available for their openly-released models. Due to a confirmed leak of an early version of their ‘medium’ model as ‘miqu’, we have some indication of the ‘medium’ model being potentially derived from Llama2 70B. By prompting both the ‘medium’ and ‘large’ models, we can confirm that the ‘medium’ model is unable to repeat strings that are typically under-trained in Llama2 models, and the ‘large’ model fails on typical tokens from the ‘small’ and ‘Mixtral’ model series. In addition, in experimenting with such prompts we find that the ‘large’ model occasionally responds with special tokens including ","element":"span"},{"text":"[TOOL_CALLS] ","element":"span"},{"text":"and ","element":"span"},{"text":"[control_331]","element":"span"},{"text":", which were recently confirmed to be part of the tokenizer for the 8x22B model, further highlighting the effectiveness of this approach.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Anthropic’s ","element":"span"},{"text":"models have limited documentation on their tokenizers. The Anthropic SDK contains some tokenizer utilities for Claude 2, with ","element":"span"},{"href":"https://github.com/anthropics/anthropic-sdk-python/blob/8e3d8a68d309424238ae54e03ee962f7147cfc60/src/anthropic/_client.py#L276","text":"remarks ","element":"a"},{"text":"that they are not accurate for Claude 3. Using the tokenizer provided for Claude 2, we can identify some candidates for intermediate fragments that are likely under-trained by looking for long tokens which are included as part of even longer tokens. 
","element":"span"},{"text":"This results in candidates such as ","element":"span"},{"text":"CandidateFaciNum ","element":"span"},{"text":"(","element":"span"},{"text":"iCandidateFaciNum","element":"span"},{"text":"), ","element":"span"},{"text":"TrileptonPatTuple ","element":"span"},{"text":"(","element":"span"},{"text":"TrileptonPatTupleMC","element":"span"},{"text":"), ","element":"span"},{"text":"BFrontend ","element":"span"},{"text":"(","element":"span"},{"text":"DVBFrontend","element":"span"},{"text":") and others. Some of these tokens can be confirmed as problematic in Claude 2.1, although none appear effective in the Claude 3 family of models, consistent with the change in tokenizer implied by the SDK code.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"OpenAI’s ","element":"span"},{"text":"models have well-documented tokenizers from the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"tiktoken ","element":"span"},{"text":"package. ","element":"span"},{"text":"In addition, by using models that share a tokenizer (refer to section ","element":"span"},{"href":"#id-14","text":"3.2","element":"a"},{"text":"), ","element":"span"},{"text":"we already have access to a list of potential under-trained token candidates for GPT-3.5 and GPT-4, ","element":"span"},{"text":"including ","element":"span"},{"text":"_ForCanBeConverted","element":"span"},{"text":", ","element":"span"},{"text":"$$PostalCodesNL","element":"span"},{"text":", ","element":"span"},{"text":"useRalative","element":"span"},{"text":", ","element":"span"},{"text":"_typingsJapgolly","element":"span"},{"text":", and others. We find that all OpenAI models older than GPT-4o fail to handle many of them correctly, resulting in hallucinations followed by an inability to tell the difference between the inputs and incorrect outputs, or model output degrading into repetition. 
The GPT-4o model family uses a different tokenizer with a larger vocabulary, but the same techniques for tokenizer analysis are effective in finding under-trained tokens, including ","element":"span"},{"style":{"height":15.2},"width":168.42,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/8-0.png","element":"img","alt":" ម្បី, which","inline":true,"padRight":true},{"text":"appears to induce an ‘end of text’ token, as well as various tokens apparently derived from Chinese advertisements, such as ","element":"span"},{"style":{"height":14.4},"width":281.13,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/8-1.png","element":"img","alt":" _天天中彩票APP.","inline":true}]]},{"heading":"5 Discussion and Conclusion","paragraphs":[[{"text":"The presence of under-trained tokens has several negative consequences for language models, including inefficient inference and the potential to bypass guardrails. Our investigations show that a wide variety of untrained and under-trained tokens are present in model tokenizers, across many model classes. Even with our relatively conservative threshold for verification, we detect the presence of such tokens across all tested models, with typically around 0.1–1% of the vocabulary consisting of severely under-trained tokens, although their prevalence varies significantly.","element":"span"}],[{"text":"$3b","element":"span"}],[{"text":"Based on our findings, we summarize a number of recommendations within the scope of current LLM development tooling. Firstly, we recommend ensuring that input data pre-processing is identical across tokenizer training data, model training data, and model inference. In particular, this requires carefully considering how to handle carriage returns, as well as special tokens present as plain text in training data and user input. Secondly, careful consideration of tokenizer training data is required, ensuring that it is representative of model training data. 
Next, after training a tokenizer, we recommend checking for unreachable tokens by encoding and decoding the vocabulary to ensure that manually added tokens are handled correctly. Finally, when training base models, checking for under-trained tokens after smaller test runs, or testing on a different corpus to reveal pre-processing bugs that may cause unrepresentative inputs in the main training data, provides a valuable sanity check.","element":"span"}],[{"text":"In addition to providing a set of useful tools for improving models and tokenizers, our work indicates several directions for future research. Firstly, the results from StarCoder2 (see section ","element":"span"},{"href":"#id-14","text":"3.2","element":"a"},{"text":") highlight a potential limitation in BPE training where occurrences in a single document or repository can define a token by themselves. Strategies such as limiting the count for pairs to be merged by document can be explored to prevent this. Secondly, we note that byte-based BPE tokenization produces more intermediate fragments, which can additionally cause outputs to be undecodable. ","element":"span"},{"text":"The trade-off between more efficient encoding methods and these downsides is particularly under-explored. Although allowing such tokens may lead to lower average token counts, this also leads to the presence of more untrained ‘fragments’ and tokens which are less semantically meaningful. 
Techniques such as BPE-dropout (","element":"span"},{"href":"#id-41","text":"Provilkov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","text":"2020","element":"a"},{"text":") have been proposed to compensate for under-trained intermediate fragments, but direct comparisons on state-of-the-art models are lacking.","element":"span"}],[{"text":"Finally, we observe differences across models in terms of how weight decay is applied to tokens that are not present in the input, including not applying weight decay to embeddings, applying it only to tokens seen in a batch, or applying it across all model weights. This choice may affect the ability of models to learn richer semantic representations for rare tokens, and likely mitigate the severity and impact of under-trained tokens. Although this choice has been studied in older models (","element":"span"},{"href":"#id-42","text":"Sedhain ","element":"a"},{"href":"#id-42","text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","text":"2015","element":"a"},{"text":"), we are not aware of systematic ablations in recent LLMs.","element":"span"}],[{"text":"Our findings highlight a range of tokenizer issues, the severity of which varies across models. 
By analyzing tokenizers and model embeddings, we can identify under-trained tokens and improve the efficiency and security of LLMs.","element":"span"}]]},{"heading":"6 Limitations","paragraphs":[[{"text":"Although our pipeline for finding under-trained tokens is highly effective at finding such ‘glitch tokens’ across a wide range of models, our approach has a number of limitations.","element":"span"}],[{"text":"Most notably, the output embedding-based indicators require manually specifying a set of reference under-trained tokens, preventing the method from being fully automated for the minority of models with tied embeddings, and requiring at least minimal manual intervention.","element":"span"}],[{"text":"The output embedding-based indicators are heuristic, and based on a working hypothesis for the internal representation and training dynamics. Further research into model interpretability could refine our understanding of such representations, and lead to the development of more effective indicators. The input embeddings-based indicator, while not requiring such manual input, is only applicable to models without tied embeddings, and depends on particular choices for weight decay and initialization. ","element":"span"},{"text":"Although this constitutes the majority of models, there are various exceptions, and the exact weight decay applied is often not well documented.","element":"span"}],[{"text":"Aside from these limitations affecting the ability to automatically calculate under-trained token indicator scores, the relationship between our proposed indicators and model behaviour is noisy. 
Both the indicators themselves, as well as the verification results, can be more indicative of problematic model behaviour on different occasions.","element":"span"}],[{"text":"Specifically, there are certain situations where the indicators we use offer a more reliable guide to a token’s tendency to induce unwanted output in typical prompting compared to our verification prompting techniques. These cases include input/output asymmetry, where tokens are solely present as inputs (e.g., ","element":"span"},{"text":"","element":"span"},{"text":"), or situations where the model exhibits a strong bias towards a specific language such as English, consistently producing translated outputs.","element":"span"}],[{"text":"Another common occurrence is the output of an equivalent token without a leading space, although the variation in our verification prompts compensates for this. ","element":"span"},{"text":"On the other hand, there are cases where tokens are rejected by the verification process, but can still induce incorrect behaviour, mainly due to our strict threshold and repetitive verification prompts, which are designed to detect the most reliable under-trained tokens. However, despite these limitations, verification using prompting is highly effective at identifying a threshold below which candidate tokens induce unwanted behaviour, and selecting the most effective candidate tokens.","element":"span"}],[{"text":"Finally, the scope of our work is limited by an exclusive focus on models that use byte-pair encoding-based tokenization. 
","element":"span"},{"text":"Results for Unigram-based models may be significantly different, with both the lack of intermediate fragments, and randomized tokenization preventing the intermediate fragments which are a source of under-trained tokens, and we leave investigation of such models to future work.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"We are grateful to Dirk Groeneveld, Luca Soldaini, and Nathan Lambert from the Allen Institute for AI for insightful discussions and for providing data on weight decay, token counts, and tokenization in the OLMo models. We also thank Stella Biderman from EleutherAI for sharing information regarding weight decay and tokenization in the Pythia and GPT-NeoX models. Additionally, we appreciate the valuable feedback on the manuscript from Matthias Gallé, Phil Blunsom, and Kelly Marchisio, and thank Nathan Godey for helpful pointers to relevant literature.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-39","text":"01.AI, Alex Young, ","element":"span"},{"text":"Bei Chen, ","element":"span"},{"text":"Chao Li, ","element":"span"},{"text":"Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2403.04652","text":"Yi: Open foundation models ","element":"a"},{"href":"https://arxiv.org/abs/2403.04652","text":"by 01.AI","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2403.04652.","element":"span"}],[{"id":"id-36","text":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, ","element":"span"},{"text":"Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. ","element":"span"},{"href":"https://arxiv.org/abs/2309.16609","text":"Qwen technical report","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2309.16609.","element":"span"}],[{"id":"id-34","text":"Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy ","element":"span"},{"text":"Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, Meng Lee, Emad Mostaque, Michael Pieler, Nikhil Pinnaparju, Paulo Rocha, Harry Saini, Hannah Teufel, Niccolo Zanichelli, and Carlos Riquelme. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2402.17834","text":"Stable LM 2 1.6B technical report","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2402.17834.","element":"span"}],[{"id":"id-25","text":"Stella Biderman, Hailey Schoelkopf, Quentin An- ","element":"span"},{"text":"thony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. 2023. 
Pythia: a suite for analyzing large language models across training and scaling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 40th International Conference on Machine Learning","element":"span"},{"text":", ICML’23. JMLR.org.","element":"span"}],[{"id":"id-16","text":"Daniel Biś, Maksim Podkorytov, and Xiuwen Liu. ","element":"span"},{"text":"2021. ","element":"span"},{"href":"https://doi.org/10.18653/v1/2021.naacl-main.403","text":"Too much in common: Shifting of ","element":"a"},{"href":"https://doi.org/10.18653/v1/2021.naacl-main.403","text":"embeddings in transformer language models and its ","element":"a"},{"href":"https://doi.org/10.18653/v1/2021.naacl-main.403","text":"implications","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","element":"span"},{"text":", pages 5117–5130.","element":"span"}],[{"id":"id-24","text":"Sidney Black, Stella Biderman, Eric Hallahan, Quentin ","element":"span"},{"text":"Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. ","element":"span"},{"href":"https://doi.org/10.18653/v1/2022.bigscience-1.9","text":"GPT-NeoX-20B: An ","element":"a"},{"href":"https://doi.org/10.18653/v1/2022.bigscience-1.9","text":"open-source autoregressive language model","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models","element":"span"},{"text":", pages 95–136, virtual+Dublin. Association for Computational Linguistics.","element":"span"}],[{"id":"id-6","text":"Kaj Bostrom and Greg Durrett. 2020. 
","element":"span"},{"href":"https://doi.org/10.18653/v1/2020.findings-emnlp.414","text":"Byte pair ","element":"a"},{"href":"https://doi.org/10.18653/v1/2020.findings-emnlp.414","text":"encoding is suboptimal for language model pretraining","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2020","element":"span"},{"text":", pages 4617–4624, Online. Association for Computational Linguistics.","element":"span"}],[{"id":"id-31","style":{"width":"93%"},"width":821,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/10-0.png","element":"img"}],[{"id":"id-12","text":"Martin Fell. 2023. ","element":"span"},{"href":"https://www.lesswrong.com/posts/kmWrwtGE9B9hpbgRT/a-search-for-more-chatgpt-gpt-3-5-gpt-4-unspeakable-glitch","text":"A search for more ChatGPT / ","element":"a"},{"href":"https://www.lesswrong.com/posts/kmWrwtGE9B9hpbgRT/a-search-for-more-chatgpt-gpt-3-5-gpt-4-unspeakable-glitch","text":"GPT-3.5 / GPT-4 \"unspeakable\" glitch tokens","element":"a"},{"text":". Blog post.","element":"span"}],[{"id":"id-9","text":"Jonas Geiping, Alex Stein, Manli Shu, Khalid ","element":"span"},{"text":"Saifullah, Yuxin Wen, and Tom Goldstein. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2402.14020","text":"Coercing ","element":"a"},{"href":"https://arxiv.org/abs/2402.14020","text":"llms to do and reveal (almost) anything","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2402.14020.","element":"span"}],[{"id":"id-30","text":"Gemma Team, Thomas Mesnard, Cassidy Hardin, ","element":"span"},{"text":"Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, ","element":"span"},{"text":"Morgane Rivière, ","element":"span"},{"text":"Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, ","element":"span"},{"text":"Adam Roberts, ","element":"span"},{"text":"Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni,","element":"span"}],[{"text":"$3c","element":"span"},{"href":"https://arxiv.org/abs/2403.08295","text":"Gemma: Open models based ","element":"a"},{"href":"https://arxiv.org/abs/2403.08295","text":"on Gemini research and technology","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2403.08295.","element":"span"}],[{"id":"id-17","text":"Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita ","element":"span"},{"text":"Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, ","element":"span"},{"text":"Valentina Pyatkin, ","element":"span"},{"text":"Abhilasha Ravichander, ","element":"span"},{"text":"Dustin Schwenk, ","element":"span"},{"text":"Saurabh Shah, ","element":"span"},{"text":"William Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah Smith, and Hannaneh Hajishirzi. 2024. ","element":"span"},{"href":"https://doi.org/10.18653/v1/2024.acl-long.841","text":"OLMo: Accelerating the science of ","element":"a"},{"href":"https://doi.org/10.18653/v1/2024.acl-long.841","text":"language models","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pages 15789–15809, Bangkok, Thailand. Association for Computational Linguistics.","element":"span"}],[{"id":"id-15","text":"Hakan Inan, Khashayar Khosravi, and Richard Socher. ","element":"span"},{"text":"2017. ","element":"span"},{"href":"https://arxiv.org/abs/1611.01462","text":"Tying word vectors and word classifiers: A ","element":"a"},{"href":"https://arxiv.org/abs/1611.01462","text":"loss framework for language modeling","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:1611.01462.","element":"span"}],[{"id":"id-27","text":"Albert Q. Jiang, Alexandre Sablayrolles, Arthur ","element":"span"},{"text":"Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, ","element":"span"},{"text":"Lucile Saulnier, ","element":"span"},{"text":"Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. ","element":"span"},{"href":"https://arxiv.org/abs/2310.06825","text":"Mistral ","element":"a"},{"href":"https://arxiv.org/abs/2310.06825","text":"7B","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2310.06825.","element":"span"}],[{"id":"id-28","text":"Albert Q. Jiang, Alexandre Sablayrolles, Antoine ","element":"span"},{"text":"Roux, ","element":"span"},{"text":"Arthur Mensch, ","element":"span"},{"text":"Blanche Savary, ","element":"span"},{"text":"Chris","element":"span"}],[{"text":"Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2401.04088","text":"Mixtral of experts","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2401.04088.","element":"span"}],[{"id":"id-8","text":"Andrej Karpathy. 2024. ","element":"span"},{"href":"https://www.youtube.com/watch?v=zduSFxRajkE","text":"Let’s build the GPT Tokenizer","element":"a"},{"text":". 
YouTube Video.","element":"span"}],[{"id":"id-5","text":"Taku Kudo. 2018. ","element":"span"},{"href":"https://doi.org/10.18653/v1/P18-1007","text":"Subword regularization: Improving ","element":"a"},{"href":"https://doi.org/10.18653/v1/P18-1007","text":"neural network translation models with multiple ","element":"a"},{"href":"https://doi.org/10.18653/v1/P18-1007","text":"subword candidates","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pages 66–75, Melbourne, Australia. Association for Computational Linguistics.","element":"span"}],[{"id":"id-40","text":"Opher Lieber, Barak Lenz, Hofit Bata, Gal ","element":"span"},{"text":"Cohen, ","element":"span"},{"text":"Jhonathan ","element":"span"},{"text":"Osin, ","element":"span"},{"text":"Itay ","element":"span"},{"text":"Dalmedigos, ","element":"span"},{"text":"Erez Safahi, ","element":"span"},{"text":"Shaked ","element":"span"},{"text":"Meirom, ","element":"span"},{"text":"Yonatan ","element":"span"},{"text":"Belinkov, Shai Shalev-Shwartz, ","element":"span"},{"text":"Omri Abend, ","element":"span"},{"text":"Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, ","element":"span"},{"text":"Avashalom Manevich, ","element":"span"},{"text":"Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2403.19887","text":"Jamba: A hybrid ","element":"a"},{"href":"https://arxiv.org/abs/2403.19887","text":"Transformer-Mamba language model","element":"a"},{"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2403.19887.","element":"span"}],[{"id":"id-38","text":"Anton Lozhkov, Raymond Li, Loubna Ben Allal, ","element":"span"},{"text":"Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2402.19173","text":"StarCoder 2 and The Stack v2: The next ","element":"a"},{"href":"https://arxiv.org/abs/2402.19173","text":"generation","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2402.19173.","element":"span"}],[{"id":"id-37","text":"Meta AI. 2024. ","element":"span"},{"href":"https://ai.meta.com/blog/meta-llama-3/","text":"Introducing Meta Llama 3: The most ","element":"a"},{"href":"https://ai.meta.com/blog/meta-llama-3/","text":"capable openly available LLM to date","element":"a"},{"text":".","element":"span"}],[{"id":"id-23","text":"Microsoft. 2023. 
","element":"span"},{"href":"https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/","text":"Phi-2: The surprising power of small ","element":"a"},{"href":"https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/","text":"language models","element":"a"},{"text":".","element":"span"}],[{"id":"id-7","text":"Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, ","element":"span"},{"text":"Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and","element":"span"}],[{"text":"Samson Tan. 2021. ","element":"span"},{"href":"https://arxiv.org/abs/2112.10508","text":"Between words and characters: ","element":"a"},{"href":"https://arxiv.org/abs/2112.10508","text":"A brief history of open-vocabulary modeling and ","element":"a"},{"href":"https://arxiv.org/abs/2112.10508","text":"tokenization in NLP","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2112.10508.","element":"span"}],[{"id":"id-33","text":"OpenAI. 2024. ","element":"span"},{"href":"https://github.com/openai/tiktoken","text":"tiktoken: a fast BPE tokeniser for use ","element":"a"},{"href":"https://github.com/openai/tiktoken","text":"with OpenAI’s models.","element":"a"}],[{"id":"id-41","text":"Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. ","element":"span"},{"text":"2020. ","element":"span"},{"href":"https://doi.org/10.18653/v1/2020.acl-main.170","text":"BPE-dropout: Simple and effective subword ","element":"a"},{"href":"https://doi.org/10.18653/v1/2020.acl-main.170","text":"regularization","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","element":"span"},{"text":", pages 1882–1892, Online. 
Association for Computational Linguistics.","element":"span"}],[{"id":"id-0","text":"Alec Radford, Jeff Wu, Rewon Child, David Luan, ","element":"span"},{"text":"Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.","element":"span"}],[{"id":"id-29","text":"Rakuten Group, Aaron Levine, Connie Huang, ","element":"span"},{"text":"Chenguang Wang, Eduardo Batista, Ewa Szymanska, Hongyi Ding, Hou Wei Chou, Jean-François Pessiot, Johanes Effendi, Justin Chiu, Kai Torben Ohlhus, Karan Chopra, Keiji Shinzato, Koji Murakami, Lee Xiong, Lei Chen, Maki Kubota, Maksim Tkachenko, Miroku Lee, Naoki Takahashi, Prathyusha Jwalapuram, Ryutaro Tatsushima, Saurabh Jain, Sunil Kumar Yadav, Ting Cai, Wei-Te Chen, Yandi Xia, Yuki Nakayama, and Yutaka Higashiyama. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2403.15484","text":"RakutenAI-7B: Extending large language models ","element":"a"},{"href":"https://arxiv.org/abs/2403.15484","text":"for Japanese","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2403.15484.","element":"span"}],[{"id":"id-10","text":"Jessica Rumbelow and Matthew Watkins. 2023. ","element":"span"},{"href":"https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation","text":"SolidGoldMagikarp ","element":"a"},{"href":"https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation","text":"(plus, prompt generation)","element":"a"},{"text":". ","element":"span"},{"text":"Blog Post.","element":"span"}],[{"id":"id-42","text":"Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, ","element":"span"},{"text":"and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 24th International Conference on World Wide Web","element":"span"},{"text":", pages 111–112. 
ACM.","element":"span"}],[{"id":"id-1","text":"Rico Sennrich, Barry Haddow, and Alexandra Birch. ","element":"span"},{"text":"2016. ","element":"span"},{"href":"https://doi.org/10.18653/v1/P16-1162","text":"Neural machine translation of rare words ","element":"a"},{"href":"https://doi.org/10.18653/v1/P16-1162","text":"with subword units","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.","element":"span"}],[{"id":"id-4","text":"Kevin Slagle. 2024. ","element":"span"},{"href":"https://arxiv.org/abs/2404.14408","text":"SpaceByte: Towards deleting ","element":"a"},{"href":"https://arxiv.org/abs/2404.14408","text":"tokenization from large language modeling","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2404.14408.","element":"span"}],[{"id":"id-45","text":"The Unicode Consortium. 2023. ","element":"span"},{"href":"https://www.unicode.org/versions/Unicode15.0.0","text":"The Unicode standard. 
","element":"a"},{"href":"https://www.unicode.org/versions/Unicode15.0.0","text":"version 15.0 core specification","element":"a"},{"text":".","element":"span"}],[{"id":"id-26","text":"Hugo Touvron, Louis Martin, Kevin Stone, Peter ","element":"span"},{"text":"Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem","element":"span"}],[{"text":"Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. ","element":"span"},{"href":"https://arxiv.org/abs/2307.09288","text":"Llama ","element":"a"},{"text":"2: ","element":"span"},{"href":"https://arxiv.org/abs/2307.09288","text":"Open foundation and fine-tuned chat models","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:2307.09288.","element":"span"}],[{"id":"id-22","text":"Ben Wang and Aran Komatsuzaki. 2021. 
","element":"span"},{"href":"https://github.com/kingoflolz/mesh-transformer-jax","text":"GPT-J-6B: ","element":"a"},{"href":"https://github.com/kingoflolz/mesh-transformer-jax","text":"A 6 Billion Parameter Autoregressive Language ","element":"a"},{"href":"https://github.com/kingoflolz/mesh-transformer-jax","text":"Model","element":"a"},{"text":".","element":"span"}],[{"id":"id-11","text":"Matthew Watkins and Jessica Rumbelow. 2023. ","element":"span"},{"href":"https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology","text":"SolidGoldMagikarp III: ","element":"a"},{"href":"https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology","text":"Glitch token archaeology","element":"a"},{"text":". Blog Post.","element":"span"}],[{"id":"id-13","text":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien ","element":"span"},{"text":"Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. ","element":"span"},{"href":"https://arxiv.org/abs/1910.03771","text":"HuggingFace’s transformers: ","element":"a"},{"text":"State-of-","element":"span"},{"href":"https://arxiv.org/abs/1910.03771","text":"the-art natural language processing","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Preprint","element":"span"},{"text":", arXiv:1910.03771.","element":"span"}],[{"id":"id-2","text":"Linting Xue, Aditya Barua, Noah Constant, Rami ","element":"span"},{"text":"Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. 
","element":"span"},{"href":"https://doi.org/10.1162/tacl_a_00461","text":"ByT5: Towards a token-free ","element":"a"},{"href":"https://doi.org/10.1162/tacl_a_00461","text":"future with pre-trained byte-to-byte models","element":"a"},{"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Transactions of the Association for Computational Linguistics","element":"span"},{"text":", 10:291–306.","element":"span"}],[{"id":"id-3","text":"Lily Yu, Daniel Simig, Colin Flaherty, Armen ","element":"span"},{"text":"Aghajanyan, Luke Zettlemoyer, and Mike Lewis. 2023. ","element":"span"},{"href":"https://openreview.net/forum?id=JTmO2V9Xpz","text":"MEGABYTE: Predicting million-byte sequences ","element":"a"},{"href":"https://openreview.net/forum?id=JTmO2V9Xpz","text":"with multiscale transformers","element":"a"},{"text":". In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Thirty-seventh Conference on Neural Information Processing Systems","element":"span"},{"text":".","element":"span"}]]},{"heading":"A Alternative under-trained token indicators","paragraphs":[[{"text":"For some models, in particular those in the Gemma series (","element":"span"},{"href":"#id-30","text":"Gemma Team et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","text":"2024","element":"a"},{"text":"), we noticed a very high similarity between the rows of their (tied) embedding matrix. ","element":"span"},{"text":"Such similarity between embeddings has been noted before, and has been attributed to all embeddings being pushed in a common direction during training (","element":"span"},{"href":"#id-16","text":"Biś ","element":"a"},{"href":"#id-16","text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","text":"2021","element":"a"},{"text":"). 
Although a constant component in all output embeddings has no effect on model predictions, as softmax is invariant to a constant shift of all logits, such similarity may affect the effectiveness of our under-trained token indicators.","element":"span"}],[{"text":"To compensate for this, we tested two variations that reduce or remove this constant component, namely centering the embeddings by subtracting their mean and removing their first principal component:","element":"span"}],[{"style":{"width":"70%"},"width":617,"height":262,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/12-0.png","element":"img"}],[{"text":"We can then take the cosine distance between rows in these adjusted output embedding matrices to obtain the additional indicators ","element":"span"},{"style":{"height":22.41},"width":236.36,"height":56.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/12-1.png","element":"img","alt":"C(Ê_out, û_ref)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.61},"width":236.36,"height":54.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/12-2.png","element":"img","alt":"C(Ẽ_out, ũ_ref)","inline":true},{"text":". Testing these additional indicators on a small range of models (see Table ","element":"span"},{"href":"#id-43","text":"2","element":"a"},{"text":") shows no consistent improvement from using these more complex methods.","element":"span"}]]},{"heading":"B Verification details","paragraphs":[[{"text":"We use three repetitive prompts to induce models to output the candidate token we are testing, shown in Table ","element":"span"},{"href":"#id-44","text":"3","element":"a"},{"text":".","element":"span"}],[{"text":"These prompts are all designed to be suitable for base models and do not require specialized instruction tuning or prompt templating. 
For each prompt we generate three tokens and check the maximal probability of our target token being predicted, and then take the maximum of this again over all three prompts. Variation in quoting and spacing helps to ensure we do not detect false positives based on models producing similar tokens without spaces, or tokens which start with punctuation partially merging with quotes. By using a temperature of zero, and designing our prompts such that the desired token is typically the first one to be sampled, we minimize the effect of random sampling.","element":"span"}]]},{"heading":"C A short primer on UTF-8 encoding","paragraphs":[[{"text":"UTF-8 is the most prevalent encoding scheme used to represent text in computers and communication protocols worldwide. It efficiently encodes Unicode characters, which encompass a vast range of characters from various writing systems and symbols (","element":"span"},{"href":"#id-45","text":"The Unicode Consortium","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","text":"2023","element":"a"},{"text":"). Encoding to UTF-8 is often the first step in tokenization.","element":"span"}],[{"text":"UTF-8 encoding can be summarized as follows:","element":"span"}],[{"text":"• ASCII (code points below 128): Single byte, binary 0xxxxxxx representing up to 7 bits.","element":"span"}],[{"id":"id-43","style":{"width":"93%"},"width":1702,"height":499,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/13-0.png","element":"img"}],[{"text":"Table 2: Effectiveness of different indicators. For each under-trained token indicator, we verified the top 2% of tokens, and show the number of these that pass our 1% verification threshold. No consistent pattern is seen to justify the more complex alternatives.","element":"figcaption","subtype":"caption"}],[{"id":"id-44","text":"Verification prompt #1. 
","element":"span"},{"text":" ","element":"span"},{"text":"is replaced with the token we are testing.","element":"span"}],[{"style":{"width":"100%"},"width":877,"height":388,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/13-1.png","element":"img"}],[{"text":"Verification prompt #2 ","element":"span"},{"text":" ","element":"span"},{"text":"is replaced with the token we are testing.","element":"span"}],[{"style":{"width":"100%"},"width":877,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/13-2.png","element":"img"}],[{"text":"Verification prompt #3 ","element":"span"},{"text":" ","element":"span"},{"text":"is replaced with the token we are testing.","element":"span"}],[{"style":{"width":"100%"},"width":877,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/13-3.png","element":"img"}],[{"text":"Prompt used for API-based testing","element":"span"}],[{"text":"Table 3: Prompts","element":"figcaption","subtype":"caption"}],[{"text":"• 2-byte sequences: 110xxxxx, 10xxxxxx representing up to 11 bits.","element":"span"}],[{"text":"• 3-byte sequences: ","element":"span"},{"text":"1110xxxx, ","element":"span"},{"text":"10xxxxxx, 10xxxxxx representing up to 16 bits.","element":"span"}],[{"text":"• 4-byte sequences: ","element":"span"},{"text":"11110xxx, ","element":"span"},{"text":"10xxxxxx, 10xxxxxx, 10xxxxxx representing up to 21 bits.","element":"span"}],[{"text":"Where the bits indicated by ‘x’ are concatenated to form the Unicode code point.","element":"span"}],[{"text":"This encoding naturally gives rise to some byte values that are not used:","element":"span"}],[{"text":"• 111110xx, 1111110x, 11111110, 11111111 would represent the first byte of sequences of 5-8 bytes, which are not in use. 
Together with the bytes 11110101 to 11110111, whose sequences would encode code points beyond the U+10FFFF limit, this corresponds to decimal 245-255 or hexadecimal ","element":"span"},{"text":"0xF5","element":"span"},{"text":"–","element":"span"},{"text":"0xFF","element":"span"},{"text":".","element":"span"}],[{"text":"• 11000000, 11000001 are not in use, as the possible two-byte encodings that start with these bytes fit in 7 bits due to the four leading zeros. These are 192/193 in decimal and ","element":"span"},{"text":"0xC0","element":"span"},{"text":"/","element":"span"},{"text":"0xC1 ","element":"span"},{"text":"in hexadecimal.","element":"span"}],[{"text":"• Additionally, other starting bytes can be covered entirely by other tokens, and also turn out to be unused. ","element":"span"},{"text":"A common example of this is ","element":"span"},{"text":"0xC2","element":"span"},{"text":"/","element":"span"},{"text":"0xC3 ","element":"span"},{"text":"which are only used for Unicode code points 128-255. In addition, since code points ","element":"span"},{"text":"U+323B0 ","element":"span"},{"text":"to ","element":"span"},{"text":"U+DFFFF ","element":"span"},{"text":"are unassigned, the ","element":"span"},{"text":"0xF1 ","element":"span"},{"text":"and ","element":"span"},{"text":"0xF2 ","element":"span"},{"text":"bytes are not used in UTF-8 representations of currently defined Unicode characters. Similarly, ","element":"span"},{"text":"0xF4 ","element":"span"},{"text":"is only used through the “Supplementary Private Use Area”. However, even if not defined in the current Unicode standard, such characters can be easily inserted in text and are found on web pages.","element":"span"}]]},{"heading":"D API-based verification in closed-source models","paragraphs":[[{"text":"We use a specific prompt for API-based testing of under-trained tokens, shown in Table ","element":"span"},{"href":"#id-44","text":"3","element":"a"},{"text":". 
","element":"span"},{"text":"The ‘password’ strings consist of the problematic token, occasionally prefixed to help identify their source, and to avoid starting the string with a leading space, as we noticed that models often drop the leading space after a quotation mark, even for normal tokens. Although many other prompt formats are effective, we have found this code-based approach to more clearly avoid false positives. Figure ","element":"span"},{"href":"#id-46","text":"5 ","element":"a"},{"text":"shows the result for Mistral, Anthropic and OpenAI models.","element":"span"}],[{"id":"id-46","style":{"width":"80%"},"width":1456,"height":2405,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.05417/images/15-0.png","element":"img"}],[{"text":"Figure 5: API prompting results.","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]