2e:[[["$","$L33","0",{"heading":"Abstract","index":0,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Test-time scaling has emerged as an effective approach for improving language model performance by utilizing additional compute at inference time. Recent studies have shown that overriding end-of-thinking tokens (e.g., replacing "}],["$","$1","1",{"children":" "}],["$","$1","2",{"children":"with “Wait”) can extend reasoning steps and improve accuracy. "}],["$","$1","3",{"children":"In this work, we explore whether a dedicated "}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:4:style","children":"continue-thinking "}]}],["$","$1","5",{"children":"token can be "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:6:style","children":"learned "}]}],["$","$1","7",{"children":"to trigger extended reasoning. We augment a distilled version of "}],["$","$1","8",{"children":"DeepSeek-R1 "}],["$","$1","9",{"children":"with a single learned "}],["$","$1","10",{"children":"<|continue-thinking|> "}],["$","$1","11",{"children":"token, training only its embedding via reinforcement learning while keeping the model weights frozen. Our experiments show that this learned token achieves improved accuracy on standard math benchmarks compared to both the baseline model and a test-time scaling approach that uses a fixed token (e.g., “Wait”) for budget forcing. In particular, we observe that in cases where the fixed-token approach enhances the base model’s accuracy, our method achieves a markedly greater improvement. 
For example, on the GSM8K benchmark, the fixed-token approach yields a 1.3% absolute improvement in accuracy, whereas our learned-token method achieves a 4.2% improvement over the base model that does not use budget forcing."}]]}]]}],["$","$L33","1",{"heading":"1 Introduction","index":1,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Language models have demonstrated impressive reasoning abilities through test-time compute scaling ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:3"}]}],["$","$1","4",{"children":"; "}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:7"}]}],["$","$1","8",{"children":"; "}],["$","$1","9",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:9"}]}],["$","$1","10",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:10"}]}],["$","$1","11",{"children":", "}],["$","$1","12",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:12"}]}],["$","$1","13",{"children":"; 
"}],["$","$1","14",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:14"}]}],["$","$1","15",{"children":", "}],["$","$1","16",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:16"}]}],["$","$1","17",{"children":"). "}],["$","$1","18",{"children":"Two dominant paradigms have emerged for this process ("}],["$","$1","19",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:19"}]}],["$","$1","20",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:20"}]}],["$","$1","21",{"children":", "}],["$","$1","22",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:22"}]}],["$","$1","23",{"children":"): "}],["$","$1","24",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:24:style","children":"parallel "}]}],["$","$1","25",{"children":"and "}],["$","$1","26",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:26:style","children":"sequential"}]}],["$","$1","27",{"children":". 
The "}],["$","$1","28",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:28:style","children":"parallel "}]}],["$","$1","29",{"children":"approach involves generating multiple samples and selecting the best response based on a majority vote ("}],["$","$1","30",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:30"}]}],["$","$1","31",{"children":", "}],["$","$1","32",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:32"}]}],["$","$1","33",{"children":") or a model-provided score ("}],["$","$1","34",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:34"}]}],["$","$1","35",{"children":", "}],["$","$1","36",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:36"}]}],["$","$1","37",{"children":"). 
In contrast, the "}],["$","$1","38",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:38:style","children":"sequential "}]}],["$","$1","39",{"children":"approach—central to this work and popularized by OpenAI’s o1 model ("}],["$","$1","40",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:40"}]}],["$","$1","41",{"children":", "}],["$","$1","42",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:42"}]}],["$","$1","43",{"children":")—generates a single sample and encourages the model to revisit, backtrack, validate, and refine its reasoning before producing a final answer, typically resulting in a long chain-of-thought output ("}],["$","$1","44",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:44"}]}],["$","$1","45",{"children":", "}],["$","$1","46",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:46"}]}],["$","$1","47",{"children":"). 
Models that follow this paradigm are commonly referred to as "}],["$","$1","48",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:48:style","children":"reasoning models"}]}],["$","$1","49",{"children":", and they are often trained using reinforcement learning with verifiable rewards ("}],["$","$1","50",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:50"}]}],["$","$1","51",{"children":", "}],["$","$1","52",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:52"}]}],["$","$1","53",{"children":"; "}],["$","$1","54",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:54"}]}],["$","$1","55",{"children":", "}],["$","$1","56",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:56"}]}],["$","$1","57",{"children":"; "}],["$","$1","58",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:58"}]}],["$","$1","59",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:59"}]}],["$","$1","60",{"children":", "}],["$","$1","61",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:61"}]}],["$","$1","62",{"children":"; "}],["$","$1","63",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:63"}]}],["$","$1","64",{"children":", 
"}],["$","$1","65",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:65"}]}],["$","$1","66",{"children":")."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"A key property of reasoning models is their ability to decide when to stop thinking, typically by generating an explicit end-of-thinking token (e.g., "}],["$","$1","1",{"children":""}],["$","$1","2",{"children":"). However, since this decision is made by the model, users have no direct control over the amount of reasoning performed. Recently, budget forcing was introduced in "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:3:style","children":"s1: simple test-time scaling "}]}],["$","$1","4",{"children":"("}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:7"}]}],["$","$1","8",{"children":"), a sequential test-time scaling approach that provides direct control over the model’s computation time. By replacing end-of-thinking tokens with “Wait” tokens during generation, the authors showed that longer chain-of-thought reasoning could be achieved, resulting in improved accuracy. Conversely, early termination can be enforced by appending a "}],["$","$1","9",{"children":" "}],["$","$1","10",{"children":"token once the model reaches its compute budget, prompting it to generate the final answer. 
Budget forcing was quickly shown to perform well in various settings ("}],["$","$1","11",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:11"}]}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:13"}]}],["$","$1","14",{"children":"; "}],["$","$1","15",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:15"}]}],["$","$1","16",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:16"}]}],["$","$1","17",{"children":", "}],["$","$1","18",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:18"}]}],["$","$1","19",{"children":"); see "}],["$","$1","20",{"children":"section 3 "}],["$","$1","21",{"children":"for a detailed discussion."}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In this paper, we introduce a systematic approach for learning a special "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token as an alternative to the “Wait” or related fixed tokens such as “Alternatively” or “Hmm” suggested in ("}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:3"}]}],["$","$1","4",{"children":", "}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:5"}]}],["$","$1","6",{"children":"). 
Our primary goal is to rigorously investigate whether the simple practice of using a fixed “Wait” token for budget forcing can be improved by learning a dedicated token embedding. Importantly, we do not aim to design the best possible test-time scaling method for reasoning models. "}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/1-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:3:0"}]]}]}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 1: Text generation with budget forcing: Whenever the model outputs a "}],["$","$1","1",{"children":" "}],["$","$1","2",{"children":"token, we replace it with a "}],["$","$1","3",{"children":"<|continue-thinking|> "}],["$","$1","4",{"children":"token and feed that to the model."}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"
Rather, our focus is on understanding and analyzing the specific phenomenon reported in s1 ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:3"}]}],["$","$1","4",{"children":"). To this end, we deliberately constrain ourselves to a single-token mechanism for budget forcing, enabling a controlled and direct comparison with prior approaches."}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"As illustrated in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:6:1"}]}],["$","$1","2",{"children":", we introduce a special, learned "}],["$","$1","3",{"children":"<|continue-thinking|> "}],["$","$1","4",{"children":"token into the model’s vocabulary. During generation, we modify the model’s output generation process to replace any occurrence of the "}],["$","$1","5",{"children":" "}],["$","$1","6",{"children":"token with our "}],["$","$1","7",{"children":"<|continue-thinking|> "}],["$","$1","8",{"children":"token, as long as the token budget is not exhausted and the maximum number of forced continuations has not been reached. 
During training, we optimize only the embedding vector of the "}],["$","$1","9",{"children":"<|continue-thinking|> "}],["$","$1","10",{"children":"token while keeping all other model parameters frozen."}]]}],["$","$L1a","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We train this new token using reinforcement learning (RL), specifically the group relative policy optimization (GRPO) procedure ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:7:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:7:3"}]}],["$","$1","4",{"children":"). While supervised fine-tuning is a possible alternative, designing effective supervised reasoning demonstrations is difficult. In contrast, RL allows the training process to explore novel continuations aimed at improving task accuracy, making it a more natural fit for our setting. Notably, our method leverages a frozen model backbone, optimizing only a single parameter vector of the size of the model’s hidden dimension. This makes our "}],["$","$1","5",{"children":"approach highly generic and memory-efficient, enabling the use of relatively large context windows during RL training."}]]}],["$","$L1a","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We apply our proposed method to investigate the influence of the learned "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token on the reasoning process. 
Our results show that learning this token can significantly enhance model performance, yielding greater gains than the “Wait” or related tokens used in s1’s budget-forcing method. Notably, we find that whenever budget-forcing with “Wait” provides an improvement over the baseline, our learned token achieves even greater gains—up to a 320% increase in relative accuracy improvements, and as much as a 4% absolute improvement in overall accuracy. Conversely, in cases where budget-forcing with a fixed token (such as “Wait”) does not improve performance compared to the baseline, the learned token similarly does not offer a statistically significant benefit. Full results are reported in "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:8:3"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To summarize, our contributions have the following key features:"}]]}],["$","$L1a","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• We introduce the concept of learning a specialized "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token as an effective mechanism for budget-controlled compute scaling."}]]}],["$","$L1a","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Our experiments demonstrate that, in cases where 
budget-forcing with a fixed token such as “Wait” improves accuracy, learning the token yields even greater gains, while requiring"}]]}],["$","$L1a","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/2-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:12:0"}]]}]}]]}],["$","$L1a","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 1: Accuracy (pass@1) results for different token budget limits "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:13:1"}]]}]}],["$","$1","2",{"children":"and different numbers of forced thinking continuations "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:13:3"}]]}]}],["$","$1","4",{"children":". 
Results are obtained via regex-based evaluation, with an LLM evaluator used as a fallback when the model fails to generate an answer in the correct format. See "}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:13:5"}]}],["$","$1","6",{"children":"for the full results."}]]}],["$","$L1a","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/2-3.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:14:0"}]]}]}]]}],["$","$L1a","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• We use an external LLM to compare outputs with ground truth, mitigating the limitations of standard math benchmark evaluations that rely on rigid answer formats, assuming the final answer is in "}],["$","$1","1",{"children":"\\boxed{}"}],["$","$1","2",{"children":". 
Our experiments show that this rigid-format evaluation approach can misrepresent reasoning ability."}]]}],["$","$L1a","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The code used for training and evaluation is available at "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:16:1"}]}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:16:2"}]}],["$","$1","3",{"children":"."}]]}]]}],["$","$L33","2",{"heading":"2 Method","index":2,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:0:0:style","children":"2.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:0:1:style","children":"Single Token Optimization"}]}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Let "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:1"}]]}]}],["$","$1","2",{"children":"be a pretrained language model that takes a prompt "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:3:style","children":"x "}]}],["$","$1","4",{"children":"as input and generates an output. Key to our method is the introduction of a special token, "}],["$","$1","5",{"children":"<|continue-thinking|>"}],["$","$1","6",{"children":", which we add to the model’s vocabulary to promote longer reasoning traces at test time. We denote the embedding vector of this token by "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:7"}]]}]}],["$","$1","8",{"children":", where "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:9:style","children":"d "}]}],["$","$1","10",{"children":"is the embedding dimension of the model."}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We refer to the adapted model with the new token as "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:2:1"}]]}]}],["$","$1","2",{"children":". "}],["$","$1","3",{"children":"All model parameters are frozen except for "}],["$","$1","4",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:2:4"}]]}]}],["$","$1","5",{"children":", so the embedding of the "}],["$","$1","6",{"children":"<|continue-thinking|> "}],["$","$1","7",{"children":"token is the only parameter updated during training."}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Our objective is to maximize the reasoning performance of "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:1"}]]}]}],["$","$1","2",{"children":"by optimizing the embedding vector "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:3"}]]}]}],["$","$1","4",{"children":". Formally, we aim to optimize the following:"}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/2-10.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:0"}]]}]}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:1:style","children":"x "}]}],["$","$1","2",{"children":"is a question sampled from a distribution "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:3:style","children":"Q"}]}],["$","$1","4",{"children":", and "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:5:style","children":"o 
"}]}],["$","$1","6",{"children":"is the response generated by running the budget forcing algorithm on "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:7:style","children":"x "}]}],["$","$1","8",{"children":"using the model "}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:9"}]]}]}],["$","$1","10",{"children":". Note that "}],["$","$1","11",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:11:style","children":"o "}]}],["$","$1","12",{"children":"is not a standard sample from "}],["$","$1","13",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:13"}]]}]}],["$","$1","14",{"children":"; rather, its distribution is induced by the budget forcing algorithm BF"}],["$","$1","15",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:15"}]]}]}],["$","$1","16",{"children":", which we present below. 
The function "}],["$","$1","17",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:17:style","children":"R"}]}],["$","$1","18",{"children":"("}],["$","$1","19",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:19:style","children":"x, o"}]}],["$","$1","20",{"children":") "}],["$","$1","21",{"children":"represents the reward associated with the generated output. For example, for mathematical questions, "}],["$","$1","22",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:22:style","children":"R"}]}],["$","$1","23",{"children":"("}],["$","$1","24",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:24:style","children":"x, o"}]}],["$","$1","25",{"children":") "}],["$","$1","26",{"children":"could be an indicator function that equals "}],["$","$1","27",{"children":"1 "}],["$","$1","28",{"children":"if "}],["$","$1","29",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:29:style","children":"o "}]}],["$","$1","30",{"children":"contains the correct answer to question "}],["$","$1","31",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:31:style","children":"x"}]}],["$","$1","32",{"children":", and "}],["$","$1","33",{"children":"0 
"}],["$","$1","34",{"children":"otherwise."}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In more detail, the budget forcing algorithm BF"}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:6:1"}]]}]}],["$","$1","2",{"children":"modifies the generation process by enforcing additional reasoning steps before producing a final answer. "}],["$","$1","3",{"children":"During generation, whenever the model outputs an end-of-thinking "}],["$","$1","4",{"children":" "}],["$","$1","5",{"children":"token, it is replaced with the learned "}],["$","$1","6",{"children":"<|continue-thinking|> "}],["$","$1","7",{"children":"token, as a way to force longer reasoning traces. This process repeats until one of the following conditions is met:"}]]}],["$","$L1a","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"1. 
The number of forced thinking continuations reaches the preset maximum "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:1"}]]}]}]]}],["$","$L1a","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"2. The total number of generated tokens reaches the budget limit "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:8:1"}]]}]}],["$","$1","2",{"children":", in which case the reasoning process is immediately terminated and"}]]}],["$","$L1a","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"a "}],["$","$1","1",{"children":" "}],["$","$1","2",{"children":"token is inserted. 
This ensures that the model does not generate beyond the allowed compute budget."}]]}],["$","$L1a","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"During training, we set "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:1"}]]}]}],["$","$1","2",{"children":", meaning that only a single forced continuation is allowed per input. At test time, however, we also evaluate the model with "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:3"}]]}]}],["$","$1","4",{"children":"to assess its ability to generalize to multiple forced continuations."}]]}],["$","$L1a","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:11:0:style","children":"2.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:11:1:style","children":"Parallel Generation and 
Backpropagation"}]}]]}],["$","$L1a","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Fine-tuning LLMs with online RL often allocates separate GPUs for generation and backpropagation. This is because generation is significantly faster when using specialized inference engines such as vLLM ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:3"}]}],["$","$1","4",{"children":") and SGLang ("}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:5"}]}],["$","$1","6",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:8"}]}],["$","$1","9",{"children":"), which are optimized for inference speed but cannot easily share memory with backpropagation workloads. 
While this setup enables efficient training, there remains an opportunity for better resource utilization, as backpropagation starts only after all generations are finished, and generation resumes only after weight updates."}]]}],["$","$L1a","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To address this, we divide each batch into micro-batches, each containing the "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:13:1:style","children":"|"}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:13:2:style","children":"G"}]}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:13:3:style","children":"| "}]}],["$","$1","4",{"children":"generations of a single example. Once the generations for an example are complete, we immediately stream the micro-batch to the training GPU and initiate backpropagation—concurrently with ongoing generation for the rest of the batch. We accumulate gradients across micro-batches, updating the model weights only after the full batch is processed. 
This improves GPU utilization and accelerates training."}]]}],["$","$L1a","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:14:0:style","children":"2.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:14:1:style","children":"Training"}]}]]}],["$","$L1a","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Throughout this paper, we apply our method to the "}],["$","$1","1",{"children":"DeepSeek-R1-Distill-Qwen-1.5B "}],["$","$1","2",{"children":"model ("}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:3"}]}],["$","$1","4",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:4"}]}],["$","$1","5",{"children":", "}],["$","$1","6",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:6"}]}],["$","$1","7",{"children":"; "}],["$","$1","8",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:8"}]}],["$","$1","9",{"children":", "}],["$","$1","10",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:10"}]}],["$","$1","11",{"children":"), training only the embedding of a newly 
introduced token, while keeping all other parameters frozen. We set the number of forced continuations to "}],["$","$1","12",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:12"}]]}]}],["$","$1","13",{"children":"and the budget limit to "}],["$","$1","14",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:14"}]]}]}],["$","$1","15",{"children":". We initialize the new token’s embedding using that of the word “Wait.” The reward function used during training is the sum of two binary components: (i) a format reward that verifies whether the answer is in the expected format (specifically, wrapped in "}],["$","$1","16",{"children":"\\boxed{}"}],["$","$1","17",{"children":"), and (ii) a correctness reward, which checks whether the generated answer matches the ground truth. 
We implemented our training code on top of Open-R1 ("}],["$","$1","18",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:18"}]}],["$","$1","19",{"children":", "}],["$","$1","20",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:20"}]}],["$","$1","21",{"children":") and TRL ("}],["$","$1","22",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:22"}]}],["$","$1","23",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:23"}]}],["$","$1","24",{"children":", "}],["$","$1","25",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:25"}]}],["$","$1","26",{"children":"). For GRPO, we set "}],["$","$1","27",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:27:style","children":"G "}]}],["$","$1","28",{"children":"= 16 "}],["$","$1","29",{"children":"generations per example and used a batch size of 16. We performed 64 gradient accumulation steps, resulting in an effective batch size of 1,024 generations (i.e., 64 examples with 16 generations each) per optimizer update. The training dataset is "}],["$","$1","30",{"children":"DeepScaleR-Preview-Dataset "}],["$","$1","31",{"children":"("}],["$","$1","32",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:32"}]}],["$","$1","33",{"children":", "}],["$","$1","34",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:34"}]}],["$","$1","35",{"children":"), which is a collection of 40,000 math questions compiled from various datasets. 
The model was trained for 936 steps on 8 NVIDIA A100 GPUs (80 GB each), 4 for backpropagation and 4 for generating completions; total training time was about one week. See "}],["$","$1","36",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:15:36"}]}],["$","$1","37",{"children":"for a full list of parameters and system prompts used during training."}]]}]]}],["$","$L33","3",{"heading":"3 Related Work","index":3,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Budget forcing, introduced by "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:1"}]}],["$","$1","2",{"children":"("}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:3"}]}],["$","$1","4",{"children":"), is a simple and effective method for scaling compute at test time. L1 ("}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:7"}]}],["$","$1","8",{"children":") extends this idea using reinforcement learning to train models that satisfy user-specified reasoning lengths, enabling flexible cost-performance trade-offs. 
The m1 method by "}],["$","$1","9",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:9"}]}],["$","$1","10",{"children":"("}],["$","$1","11",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:11"}]}],["$","$1","12",{"children":") further explores the application of budget forcing in medical QA. "}],["$","$1","13",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:13"}]}],["$","$1","14",{"children":"("}],["$","$1","15",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:15"}]}],["$","$1","16",{"children":") introduce a variant of budget forcing for greedy decoding, comparing multi-word phrases instead of single tokens like “Wait.” Collectively, these works highlight budget forcing as a promising direction for compute budget control."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Several prior studies have explored incorporating additional tokens. 
The works reported in ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:3"}]}],["$","$1","4",{"children":") and ("}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:5"}]}],["$","$1","6",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:8"}]}],["$","$1","9",{"children":") introduce learnable tokens into reasoning traces to improve the model’s accuracy. In ("}],["$","$1","10",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:10"}]}],["$","$1","11",{"children":", "}],["$","$1","12",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:12"}]}],["$","$1","13",{"children":"; "}],["$","$1","14",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:14"}]}],["$","$1","15",{"children":", "}],["$","$1","16",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:16"}]}],["$","$1","17",{"children":"), additional tokens are inserted at random positions during training and appended to the prompt during inference, allowing the model to artificially increase the number of activations at test time. 
"}],["$","$1","18",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:18"}]}],["$","$1","19",{"children":"("}],["$","$1","20",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:20"}]}],["$","$1","21",{"children":") included a ‘planning token’ at the start of each reasoning step. These methods rely on supervised fine-tuning to learn token representations; in contrast, our method uses RL to optimize the new token embedding."}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Finally, various works on scaling test-time compute motivate our choice of token learning using RL: (1) recent empirical findings show that RL-based fine-tuning leads to better generalization ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:3"}]}],["$","$1","4",{"children":"); (2) theoretical analysis shows that RL enjoys higher expected cumulative reward"}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/4-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:0"}]]}]}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 2: Accuracy (pass@1) results for different token budget limits "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:1"}]]}]}],["$","$1","2",{"children":"and different numbers of forced thinking continuations "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:3"}]]}]}],["$","$1","4",{"children":". Results are obtained via a regex-based evaluation only. The percentage of final answers enclosed in "}],["$","$1","5",{"children":"\\boxed{} "}],["$","$1","6",{"children":"is shown in parentheses. 
See "}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:7"}]}],["$","$1","8",{"children":"for the full results."}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:3"}]}],["$","$1","4",{"children":"), and (3) learning to use additional tokens, which is related to our approach of designing a "}],["$","$1","5",{"children":"<|continue-thinking|> "}],["$","$1","6",{"children":"token, has been observed to be a hard learning problem when using supervised learning ("}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:7"}]}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:9"}]}],["$","$1","10",{"children":")."}]]}]]}],["$","$L33","4",{"heading":"4 Experiments","index":4,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:0:style","children":"4.1 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:1:style","children":"Evaluation protocol"}]}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We evaluate our model on three widely adopted mathematical "}],["$","$1","1",{"children":"reasoning "}],["$","$1","2",{"children":"datasets: "}],["$","$1","3",{"children":"GSM8K-Platinum ("}],["$","$1","4",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:4"}]}],["$","$1","5",{"children":", "}],["$","$1","6",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:6"}]}],["$","$1","7",{"children":"; "}],["$","$1","8",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:8"}]}],["$","$1","9",{"children":", "}],["$","$1","10",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:10"}]}],["$","$1","11",{"children":"), a revised version of the original GSM8K dataset containing 1209 grade-school level math problems; MATH500 ("}],["$","$1","12",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:12"}]}],["$","$1","13",{"children":", "}],["$","$1","14",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:14"}]}],["$","$1","15",{"children":"; 
"}],["$","$1","16",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:16"}]}],["$","$1","17",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:17"}]}],["$","$1","18",{"children":", "}],["$","$1","19",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:19"}]}],["$","$1","20",{"children":"), a 500-question subset of the MATH ("}],["$","$1","21",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:21"}]}],["$","$1","22",{"children":", "}],["$","$1","23",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:23"}]}],["$","$1","24",{"children":") dataset; and AIME24 ("}],["$","$1","25",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:25"}]}],["$","$1","26",{"children":", "}],["$","$1","27",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:27"}]}],["$","$1","28",{"children":") and AIME25 datasets, which contain 30 math problems from the 2024 and 2025 American Invitational Mathematics Examination, a national-level mathematics competition in the United States. 
We implemented the evaluation pipeline using a modified version of the LM-Evaluation-Harness library ("}],["$","$1","29",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:29"}]}],["$","$1","30",{"children":", "}],["$","$1","31",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:31"}]}],["$","$1","32",{"children":")."}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We compare our method against three fixed tokens that have previously demonstrated strong performance ("}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:2:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:2:3"}]}],["$","$1","4",{"children":"), as well as a "}],["$","$1","5",{"children":"baseline configuration that does not employ budget forcing. 
All competing methods were implemented using "}],["$","$1","6",{"children":"DeepSeek-R1-Distill-Qwen-1.5B"}],["$","$1","7",{"children":"."}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"$37"}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/5-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:4:0"}]]}]}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 2: Accuracy of different methods as a function of the average number of tokens generated by each method. 
Results for all datasets are obtained using "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:5:1"}]]}]}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"see "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:1"}]}],["$","$1","2",{"children":"for a detailed discussion. We utilized "}],["$","$1","3",{"children":"Qwen/Qwen2.5-7B-Instruct "}],["$","$1","4",{"children":"("}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:7"}]}],["$","$1","8",{"children":") as our evaluator LLM. See "}],["$","$1","9",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:9"}]}],["$","$1","10",{"children":"for the instruction prompt we used for the LLM evaluation."}]]}],["$","$L1a","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"During inference, we verify the generalization of our method by also using inference configurations not used during training. 
Concretely, we set the reasoning budgets "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:7:1"}]]}]}],["$","$1","2",{"children":"and the maximal number of forced continuations "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:7:3"}]]}]}],["$","$1","4",{"children":"We also provide an additional 1,024 tokens reserved for generating the final answer. For inference, we used the same system prompt as the one used for training; see "}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:7:5"}]}],["$","$1","6",{"children":"in the Appendix. 
Due to the prolonged length of the generated answers, evaluation took approximately 370 GPU hours, using 8 NVIDIA A40 GPUs."}]]}],["$","$L1a","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-36","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:0:style","children":"4.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:1:style","children":"LLM-Based Verification"}]}]]}],["$","$L1a","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"It is common practice to use regex-based functions to check the correctness of model outputs when evaluating language models on mathematical benchmarks. However, this approach can distort the evaluation of model performance, as it conflates output formatting with actual reasoning ability. In our experiments, evaluating with only regex-based functions suggested that the learned token led to substantial performance gains. However, a more detailed analysis using an evaluator LLM revealed that much of this improvement was due to better adherence to formatting rather than genuine reasoning improvements. 
As shown in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:1"}]}],["$","$1","2",{"children":", the results obtained using only regex-based evaluation are significantly higher than those reported in "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:3"}]}],["$","$1","4",{"children":", which use LLM-based evaluation. For example, on GSM8K with "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:5"}]]}]}],["$","$1","6",{"children":"and "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:7"}]]}]}],["$","$1","8",{"children":", the improvement of our method over the baseline is more than three times higher when using regex-based evaluation compared to LLM-based evaluation. 
Similarly, on the MATH500 dataset with "}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:9"}]]}]}],["$","$1","10",{"children":"and "}],["$","$1","11",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:11"}]]}]}],["$","$1","12",{"children":", regex-based evaluation indicates a statistically significant improvement, whereas LLM-based evaluation shows that this improvement is not actually present. Note that LLM-based verification is not flawless. 
We adopted it as a better means to mitigate the limitations of purely regex-based evaluation and enhance the trustworthiness of our findings."}]]}],["$","$L1a","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:0:style","children":"4.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:1:style","children":"Results"}]}]]}],["$","$L1a","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The evaluation results are depicted in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:1"}]}],["$","$1","2",{"children":". Our findings indicate that when budget forcing does not improve upon the baseline, the learned token similarly offers no statistically significant advantage. However, on datasets where budget forcing yields improvements, our learned token demonstrates a substantial performance increase, achieving up to a 320% relative gain over the best fixed token and a 4% absolute improvement in accuracy. 
Notably, although our model was trained exclusively with "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:3"}]]}]}],["$","$1","4",{"children":", we observe that increasing "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:5"}]]}]}],["$","$1","6",{"children":"3 during inference often leads to further improvements in accuracy. This suggests that the learned "}],["$","$1","7",{"children":"<|continue-thinking|> "}],["$","$1","8",{"children":"token can generalize to settings with multiple forced continuations, even though the model was not explicitly trained for them."}]]}],["$","$L1a","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To gain deeper insights into the behavior of the different methods, we analyzed the generated token distributions. "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:12:1"}]}],["$","$1","2",{"children":"depicts the accuracy of each method as a function of the average number of tokens generated per dataset. 
We observed that using the learned token consistently resulted in longer reasoning traces, suggesting that the performance improvement can be attributed to this increased reasoning length. Furthermore, "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:12:3"}]}],["$","$1","4",{"children":"shows that"}]]}],["$","$L1a","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-40","style":"$undefined","children":"Distribution of Reasoning Trace Lengths and Their Corresponding Accuracies "}]}],["$","$1","1",{"children":"(Darker Color = Higher Ratio of Correct Answers). Legend: Learned, Alternatively, Baseline."}]]}],["$","$L1a","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/6-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:14:0"}]]}]}]]}],["$","$L1a","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 3: Comparison of generated sequence length distributions across methods and datasets and their corresponding accuracies. 
Stacked bars represent the logarithmic count of answers within each length bin, with darker segments indicating a higher proportion of correct answers (fraction shown within each bin). Top row: GSM8K, Bottom row: AIME24. Left: Learned "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token vs. “Alternatively.” Right: Learned "}],["$","$1","3",{"children":"<|continue-thinking|> "}],["$","$1","4",{"children":"token vs. baseline model without budget forcing. Data was obtained using "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:15:5"}]]}]}]]}],["$","$L1a","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"for the AIME datasets, all methods generated a high average number of tokens. 
This can likely be attributed to the significant difficulty of the AIME problems for our model, which might also explain why our method did not yield substantial improvements on these datasets."}]]}],["$","$L1a","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"A comparison of the generated answer length distributions for the learned token, the baseline, and the “Alternatively” token is shown in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:17:1"}]}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:17:2"}]}],["$","$1","3",{"children":". The accuracy improvements observed with the learned token are consistent across different generated lengths, not just on average. This suggests that the enhanced performance is attributable to a genuine improvement in the model’s reasoning capabilities, beyond merely generating longer responses. In "}],["$","$1","4",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:17:4"}]}],["$","$1","5",{"children":", we show the probability that the learned token yields a correct answer where the baseline is incorrect, and vice versa. For both GSM8K and MATH500, the learned token is much more likely to fix a baseline error than to introduce one. 
For the AIME dataset, we see that the learned token and the baseline are comparable, which is expected, since both methods have similar accuracy."}]]}],["$","$L1a","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To better understand how the learned token influences the model’s reasoning process, we visualize in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:18:1"}]}],["$","$1","2",{"children":"a word cloud showing the first word generated immediately after the"}]]}],["$","$L1a","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/6-2.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:19:0"}]]}]}]]}],["$","$L1a","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 4: "}],["$","$1","1",{"children":"Word cloud of the first token generated "}],["$","$1","2",{"children":"immediately "}],["$","$1","3",{"children":"after "}],["$","$1","4",{"children":"injecting "}],["$","$1","5",{"children":"the "}],["$","$1","6",{"children":"learned 
"}],["$","$1","7",{"children":"<|continue-thinking|> "}],["$","$1","8",{"children":"token, across all datasets."}]]}],["$","$L1a","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"<|continue-thinking|> "}],["$","$1","1",{"children":"token. The most common continuations often prompt the model to selfverify or reconsider its previous steps, indicating that the token effectively encourages reflective reasoning and backtracking."}]]}],["$","$L1a","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The reasoning trace depicted in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:22:1"}]}],["$","$1","2",{"children":", taken from the GSM8K dataset, showcases how the "}],["$","$1","3",{"children":"<|continue_thinking|> "}],["$","$1","4",{"children":"token influences the final answer. "}],["$","$1","5",{"children":"The blue portion highlights the model’s initial reasoning, which leads to an incorrect result. 
"}],["$","$1","6",{"children":"However, the introduction of the "}],["$","$1","7",{"children":"<|continue_thinking|> "}],["$","$1","8",{"children":"token prompts a reexamination of the solution, ultimately guiding the"}]]}],["$","$L1a","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/7-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:0"}]]}]}]]}],["$","$L1a","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 3: Probability of observing a correct answer from the learned token and an incorrect answer from the baseline and vice versa. 
See "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:1"}]}],["$","$1","2",{"children":"in the Appendix for the full results."}]]}],["$","$L1a","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/7-1.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:25:0"}]]}]}]]}],["$","$L1a","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:26:0:style","children":"Question: "}]}],["$","$1","1",{"children":"Lloyd has an egg farm. His chickens produce 252 eggs per day and he sells them for $2 per dozen. How much does Lloyd make on eggs per week?"}]]}],["$","$L1a","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":" "}],["$","$1","1",{"children":"First, I need to calculate the daily revenue from egg sales. [...] 
Finally, to determine the weekly revenue, I’ll multiply the daily revenue by the number of days in a week, which is 7. So, $1,134 multiplied by 7 equals $7,938."}]]}],["$","$L1a","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"<|continue_thinking|> "}],["$","$1","1",{"children":"no, let me try that again. [...] Therefore, the correct answer is $294 per week."}]]}],["$","$L1a","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":" "}],["$","$1","1",{"children":"Lloyd sells his eggs at $2 per dozen. To determine his weekly revenue:"}]]}],["$","$L1a","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/7-2.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:30:0"}]]}]}]]}],["$","$L1a","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"1. 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:31:1:style","children":"Calculate weekly egg production"}]}],["$","$1","2",{"children":": - 252 eggs/day * 7 days = 1,764 eggs/week"}]]}],["$","$L1a","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/7-3.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:32:0"}]]}]}]]}],["$","$L1a","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[...] 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:33:1:style","children":"Answer: "}]}],["$","$1","2",{"children":"Lloyd makes \\boxed{294} dollars per week from his egg sales."}]]}],["$","$L1a","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/7-4.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:34:0"}]]}]}]]}],["$","$L1a","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 5: GSM8K reasoning trace demonstrating the positive impact of "}],["$","$1","1",{"children":"<|continue_thinking|> "}],["$","$1","2",{"children":"token. Blue indicates the original reasoning, yielding an incorrect answer of 7,938. Green shows the continuation after the special token was added, leading to the correct answer of 294."}]]}],["$","$L1a","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"model to the correct conclusion. 
See "}],["$","$1","1",{"children":"Appendix D "}],["$","$1","2",{"children":"for the full reasoning traces and additional examples."}]]}]]}],["$","$L33","5",{"heading":"5 Conclusions","index":5,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In this work, we have demonstrated that learning a dedicated "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token improves accuracy, specifically in scenarios where the baseline budget forcing method already provides performance improvements. We introduced a training methodology for this token that exhibits promising generalization capabilities across different inference settings. Our analysis indicates that the observed performance gains are not primarily due to better adherence to output formatting, but rather stem from the elicited longer reasoning traces and a genuine enhancement in the model’s underlying reasoning capabilities. Furthermore, this improvement holds both on average and when conditioned on reasoning length, suggesting its utility even when generating shorter completions. Our method of learning a specialized "}],["$","$1","3",{"children":"<|continue-thinking|> "}],["$","$1","4",{"children":"token is relatively simple and efficient, and practitioners can gauge its potential benefit for their specific scenario by first applying the vanilla budget-forcing technique with a fixed token, such as “Wait”; an observed performance increase with this baseline strongly suggests that training a dedicated "}],["$","$1","5",{"children":"<|continue-thinking|> "}],["$","$1","6",{"children":"token would be worthwhile. 
Finally, we emphasize the critical importance of rigorous evaluation for drawing meaningful conclusions and propose a refined evaluation scheme designed to mitigate some of the inherent limitations associated with relying solely on regex-based assessments."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:0:style","children":"Future Directions "}]}],["$","$1","1",{"children":"One promising future direction involves exploring the efficacy of learning distinct "}],["$","$1","2",{"children":"<|continue-thinking|> "}],["$","$1","3",{"children":"$38"}]]}]]}],["$","$L33","6",{"heading":"6 Limitations","index":6,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"While our proposed method demonstrates promising results, it is subject to several limitations. First, as our analysis indicates, the effectiveness of the learned token appears to be contingent on the baseline performance of the budget forcing technique itself. If standard budget forcing does not yield improvements, our learned "}],["$","$1","1",{"children":"<|continue-thinking|> "}],["$","$1","2",{"children":"token is unlikely to provide a significant advantage. Second, the process of learning the token embedding necessitates a training phase, which is inherently more computationally demanding and requires access to the model’s weights compared to simply employing fixed tokens. 
Third, our method requires the addition of a new token to the model’s vocabulary. This modification might not be feasible or permitted when utilizing LLMs through certain API interfaces, which often provide restricted access to the model’s architecture, thus preventing vocabulary modifications. Finally, our experiments were only performed on the domain of mathematical questions—we did not explore the generalization capabilities of this method to additional domains."}]]}]]}],["$","$L33","7",{"heading":"7 Acknowledgments","index":7,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"This research was supported by the European Union (ERC, SafetyBounds, 101163414). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. This research was also partially supported by the Israel Science Foundation (ISF "}],["$","$1","1",{"children":"grant 729/21). Y. R. acknowledges additional support from the Career Advancement Fellowship at the Technion."}]]}]]}],["$","$L33","8",{"heading":"References","index":8,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-11","style":"$undefined","children":"Pranjal Aggarwal and Sean Welleck. 2025. "}]}],["$","$1","1",{"children":"L1: Controlling how long a reasoning model thinks with reinforcement learning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:0:2:style","children":"arXiv preprint arXiv:2503.04697"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-3","style":"$undefined","children":"Anthropic. 2025. Claude’s extended thinking. "}]}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:1:1"}]}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:1:2"}]}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:1:3"}]}],["$","$1","4",{"children":". Accessed: 2025-05-18."}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-28","style":"$undefined","children":"Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang "}]}],["$","$1","1",{"children":"Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. "}],["$","$1","2",{"children":"SFT memorizes, RL generalizes: A comparative study of foundation model post-training. 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:2:3:style","children":"arXiv preprint arXiv:2501.17161"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-31","style":"$undefined","children":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, "}]}],["$","$1","1",{"children":"Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. "}],["$","$1","2",{"children":"Training verifiers to solve math word problems. "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:3:3:style","children":"arXiv preprint arXiv:2110.14168"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-35","style":"$undefined","children":"Leo Gao, Jonathan Tow, Baber Abbasi, Stella Bider- "}]}],["$","$1","1",{"children":"man, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. 
The language model evaluation harness."}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-25","style":"$undefined","children":"Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya "}]}],["$","$1","1",{"children":"Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. 2024. Think before you speak: Training language models with pause tokens. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:5:2:style","children":"International Conference on Learning Representations"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-7","style":"$undefined","children":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao "}]}],["$","$1","1",{"children":"Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and 1 others. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:6:2:style","children":"arXiv preprint arXiv:2501.12948"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-34","style":"$undefined","children":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul "}]}],["$","$1","1",{"children":"Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:7:2:style","children":"Neural Information Processing Systems, Datasets and Benchmarks Track"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-12","style":"$undefined","children":"Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, "}]}],["$","$1","1",{"children":"and Yuyin Zhou. 2025. m1: Unleash the potential of test-time scaling for medical reasoning with large language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:8:2:style","children":"arXiv preprint arXiv:2504.00869"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-20","style":"$undefined","children":"HuggingFace. 2025. "}]}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:9:1"}]}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:9:2"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-24","style":"$undefined","children":"Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup "}]}],["$","$1","1",{"children":"Kim. 2025. “well, keep thinking”: Enhancing llm reasoning with adaptive injection decoding. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:10:2:style","children":"arXiv preprint arXiv:2503.10167"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-17","style":"$undefined","children":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying "}]}],["$","$1","1",{"children":"Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:11:2:style","children":"Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-6","style":"$undefined","children":"Nathan Lambert, Jacob Morrison, Valentina Pyatkin, "}]}],["$","$1","1",{"children":"Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Nora Kassner, Timo Schick, Marzieh Saeidi, Noah A. Smith, and Matt Gardner. 2024. Tülu 3: Pushing frontiers in open language model post-training. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:12:2:style","children":"arXiv preprint arXiv:2411.15124"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-4","style":"$undefined","children":"Aitor "}]}],["$","$1","1",{"children":"Lewkowycz, "}],["$","$1","2",{"children":"Anders "}],["$","$1","3",{"children":"Johan "}],["$","$1","4",{"children":"Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. 2022. Solving quantitative reasoning problems with language models. In "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:13:5:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","6",{"children":"."}]]}],["$","$L1a","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-33","style":"$undefined","children":"Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri- "}]}],["$","$1","1",{"children":"son Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. 
In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:14:2:style","children":"International Conference on Learning Representations"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-22","style":"$undefined","children":"Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, "}]}],["$","$1","1",{"children":"William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, and Li Erran Li. 2025. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL. "}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:15:2"}]}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:15:3"}]}],["$","$1","4",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:15:4"}]}],["$","$1","5",{"children":". Notion Blog."}]]}],["$","$L1a","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-10","style":"$undefined","children":"Niklas Muennighoff, Zitong Yang, Weijia Shi, "}]}],["$","$1","1",{"children":"Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:16:2:style","children":"arXiv preprint arXiv:2501.19393"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-2","style":"$undefined","children":"OpenAI. 2024. Learning to reason with LLMs. "}]}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:17:1"}]}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:17:2"}]}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:17:3"}]}],["$","$1","4",{"children":". Accessed: 2025-05-18."}]]}],["$","$L1a","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-27","style":"$undefined","children":"Jacob Pfau, William Merrill, and Samuel R Bowman. "}]}],["$","$1","1",{"children":"2024. "}],["$","$1","2",{"children":"Let’s think dot by dot: Hidden computation in transformer language models. 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:18:3:style","children":"arXiv preprint arXiv:2404.15758"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-30","style":"$undefined","children":"Amrith Setlur, Nived Rajaraman, Sergey Levine, and "}]}],["$","$1","1",{"children":"Aviral Kumar. 2025. Scaling test-time compute without verification or RL is suboptimal. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:19:2:style","children":"arXiv preprint arXiv:2502.12118"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-14","style":"$undefined","children":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, "}]}],["$","$1","1",{"children":"Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan"}]]}],["$","$L1a","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Zhang, YK Li, Y Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:21:1:style","children":"arXiv preprint arXiv:2402.03300"}]}],["$","$1","2",{"children":"."}]]}],["$","$L1a","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-0","style":"$undefined","children":"Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral "}]}],["$","$1","1",{"children":"Kumar. 2024. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:22:2:style","children":"arXiv preprint arXiv:2408.03314"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-8","style":"$undefined","children":"Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, "}]}],["$","$1","1",{"children":"Zhaopeng Tu, Min Zhang, and Dong Yu. 2025. Expanding RL with verifiable rewards across diverse domains. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:23:2:style","children":"arXiv preprint arXiv:2503.23829"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-32","style":"$undefined","children":"Joshua Vendrow, Edward Vendrow, Sara Beery, and "}]}],["$","$1","1",{"children":"Aleksander Madry. 2025. "}],["$","$1","2",{"children":"Do large language model benchmarks test reliability? "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:24:3:style","children":"arXiv preprint arXiv:2502.03461"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-21","style":"$undefined","children":"Leandro von Werra, Younes Belkada, Lewis Tunstall, "}]}],["$","$1","1",{"children":"Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. TRL: Transformer reinforcement learning. "}],["$","$1","2",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:25:2"}]}],["$","$1","3",{"children":". 
GitHub repository."}]]}],["$","$L1a","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-26","style":"$undefined","children":"Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi "}]}],["$","$1","1",{"children":"Yuan, William Yang Wang, and Alessandro Sordoni. 2023. Guiding language model reasoning with planning tokens. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:26:2:style","children":"arXiv preprint arXiv:2310.05707"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-9","style":"$undefined","children":"Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, "}]}],["$","$1","1",{"children":"Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, and 1 others. 2025. Reinforcement learning for reasoning in large language models with one training example. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:27:2:style","children":"arXiv preprint arXiv:2504.20571"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-5","style":"$undefined","children":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten "}]}],["$","$1","1",{"children":"Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:28:2:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-1","style":"$undefined","children":"Sean Welleck, Amanda Bertsch, Matthew Finlayson, "}]}],["$","$1","1",{"children":"Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. "}],["$","$1","2",{"children":"From decoding to meta-generation: "}],["$","$1","3",{"children":"Inference-time algorithms for large language models. 
"}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:29:4:style","children":"arXiv preprint arXiv:2406.16838"}]}],["$","$1","5",{"children":"."}]]}],["$","$L1a","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-37","style":"$undefined","children":"An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, "}]}],["$","$1","1",{"children":"Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024a. Qwen2.5 technical report. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:30:2:style","children":"arXiv preprint arXiv:2412.15115"}]}],["$","$1","3",{"children":"."}]]}],["$","$L1a","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-19","style":"$undefined","children":"An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, "}]}],["$","$1","1",{"children":"Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Xingzhang Ren, and Zhenru Zhang. 2024b. "}],["$","$1","2",{"children":"Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:31:3:style","children":"arXiv preprint arXiv:2409.12122"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-18","style":"$undefined","children":"Lianmin Zheng, "}]}],["$","$1","1",{"children":"Liangsheng Yin, "}],["$","$1","2",{"children":"Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, and 1 others. 2024. SGLang: Efficient execution of structured language model programs. "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:32:3:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","4",{"children":"."}]]}]]}],["$","$L33","9",{"heading":"A Training and Evaluation Parameters","index":9,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:0"}]}],["$","$1","1",{"children":"summarizes the full list of parameters we used for training and 
evaluation."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/11-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:0"}]]}]}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 4: Full list of training and evaluation parameters"}]]}]]}],["$","$L33","10",{"heading":"B Artifacts Used","index":10,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The following datasets, software libraries, and models were used in this research. All artifacts were used in accordance with their respective licenses."}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:1:1:style","children":"Datasets"}]}],["$","$1","2",{"children":": "}],["$","$1","3",{"children":"DeepScaleR-Preview "}],["$","$1","4",{"children":"Dataset,"}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/11-1.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:2:0"}]]}]}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"GSM8K-Platinum"}],["$","$1","1",{"children":", "}],["$","$1","2",{"children":"licensed "}],["$","$1","3",{"children":"under "}],["$","$1","4",{"children":"CC BY-SA 4.0."}],["$","$1","5",{"children":"2 "}],["$","$1","6",{"children":"AIME24"}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":"licensed under Apache-2.0,"}],["$","$1","9",{"children":"3 "}],["$","$1","10",{"children":"and 
AIME25."}],["$","$1","11",{"children":"4"}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:4:1:style","children":"Models"}]}],["$","$1","2",{"children":": "}],["$","$1","3",{"children":"DeepSeek-R1-Distill-Qwen-1.5B"}],["$","$1","4",{"children":", licensed "}],["$","$1","5",{"children":"under "}],["$","$1","6",{"children":"the "}],["$","$1","7",{"children":"MIT "}],["$","$1","8",{"children":"license."}],["$","$1","9",{"children":"5"}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Qwen2.5-7B-Instruct"}],["$","$1","1",{"children":", "}],["$","$1","2",{"children":"licensed "}],["$","$1","3",{"children":"under the Apache-2.0 license."}],["$","$1","4",{"children":"6"}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$1b:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:6:1:style","children":"Software Packages"}]}],["$","$1","2",{"children":": LM-evaluation-harness, licensed under the MIT license,"}],["$","$1","3",{"children":"7 "}],["$","$1","4",{"children":"vLLM, licensed under the Apache-2.0 license,"}],["$","$1","5",{"children":"8 
"}],["$","$1","6",{"children":"Open R1, licensed under the Apache-2.0 license,"}],["$","$1","7",{"children":"9 "}],["$","$1","8",{"children":"and trl, licensed under the Apache-2.0 license."}],["$","$1","9",{"children":"10"}]]}]]}],["$","$L33","11",{"heading":"C Full Experimental Results","index":11,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We provide the complete set of results for all our experiments. "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:1"}]}],["$","$1","2",{"children":"shows the complete set of results when using an LLM evaluator in all configurations ("}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:3"}]]}]}],["$","$1","4",{"children":"). "}],["$","$1","5",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:5"}]}],["$","$1","6",{"children":"shows the complete set of results for regex-only evaluation. 
"}],["$","$1","7",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:7"}]}],["$","$1","8",{"children":"reports, for each configuration, the probability that the learned token yields a correct answer when the baseline does not, and the probability that the baseline yields a correct answer when the learned token does not."}]]}]]}],["$","$L33","12",{"heading":"D Generated Answers Examples","index":12,"length":13,"content":[["$","$L1a","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We include example reasoning traces for cases in which the learned token produced a correct answer while the baseline did not, and vice versa. In all figures, blue text indicates the original reasoning trace that is common to both the learned model and the baseline model, and green text indicates the reasoning trace that was generated after a forced continuation. 
"}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:0:1"}]}],["$","$1","2",{"children":"and "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:0:3"}]}],["$","$1","4",{"children":"show full reasoning traces for the learned model and the baseline model respectively on a question in which the baseline model was incorrect and adding the learned token allowed the model to continue reasoning through the problem and reach the correct"}]]}],["$","$L1a","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/12-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:1:0"}]]}]}]]}],["$","$L1a","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 5: Accuracy (pass@1) results for different token budget limits "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:2:1"}]]}]}],["$","$1","2",{"children":"and 
different numbers of forced thinking continuations "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:2:3"}]]}]}],["$","$1","4",{"children":". Results are obtained via regex-based evaluation and an LLM evaluator if the model fails to generate an answer in the correct format."}]]}],["$","$L1a","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"answer. "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:3:1"}]}],["$","$1","2",{"children":"and "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:3:3"}]}],["$","$1","4",{"children":"show full reasoning traces for the learned model and the baseline model, respectively, on a question in which both models were correct. 
Note that in this case, the learned token only adds a negligible number of tokens to the answer."}]]}],["$","$L1a","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/13-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:4:0"}]]}]}]]}],["$","$L1a","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 6: Accuracy (pass@1) results for different token budget limits "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:5:1"}]]}]}],["$","$1","2",{"children":"and different numbers of forced thinking continuations "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:5:3"}]]}]}],["$","$1","4",{"children":". Results are obtained via a regex-based evaluation only. 
The percentage of final answers enclosed "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"id-44","style":"$undefined","children":"in "}]}],["$","$1","6",{"children":"\\boxed{} "}],["$","$1","7",{"children":"is shown in parentheses."}]]}],["$","$L1a","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/13-3.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:6:0"}]]}]}]]}],["$","$L1a","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 7: Probability of observing a correct answer from the learned token and an incorrect answer from the baseline and vice versa."}]]}],["$","$L1a","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/14-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:8:0"}]]}]}]]}],["$","$L1a","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 6: GSM8K reasoning trace demonstrating the positive impact of the "}],["$","$1","1",{"children":"<|continue_thinking|> "}],["$","$1","2",{"children":"token. Blue indicates the original reasoning, yielding an incorrect answer of 7,938. Green shows the continuation after the special token was added, leading to the correct answer of 294."}]]}],["$","$L1a","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/15-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:10:0"}]]}]}]]}],["$","$L1a","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 7: GSM8K reasoning trace of the baseline model for the same question as in 
"}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:11:1"}]}],["$","$1","2",{"children":". The final answer provided by the baseline model is incorrect, as opposed to the correct answer given in "}],["$","$1","3",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:11:3"}]}],["$","$1","4",{"children":"."}]]}],["$","$L1a","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/16-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:12:0"}]]}]}]]}],["$","$L1a","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 8: GSM8K reasoning trace demonstrating that the "}],["$","$1","1",{"children":"<|continue_thinking|> "}],["$","$1","2",{"children":"token does not generate many tokens when the model is confident. Blue indicates the original reasoning. 
Green shows the short continuation after the special token was added."}]]}],["$","$L1a","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/17-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:14:0"}]]}]}]]}],["$","$L1a","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 9: GSM8K reasoning trace of the baseline model for the same question as in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:15:1"}]}],["$","$1","2",{"children":". 
In this case, both the baseline and the learned-token model output a correct answer, and the baseline's answer is almost identical to the one produced with the learned token."}]]}],["$","$L1a","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/18-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:16:0"}]]}]}]]}],["$","$L1a","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 10: GSM8K reasoning trace of a wrong answer given by both models."}]]}],["$","$L1a","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L35",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2506.11274/images/19-0.png","alt":"$undefined"}],["$","$L36",null,{"inAbstract":false,"imgScale":4,"avgLineHeight":13.52,"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:18:0"}]]}]}]]}],["$","$L1a","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 11: GSM8K reasoning trace of the baseline model for the same question as in "}],["$","$1","1",{"children":["$","$L34",null,{"fragment":"$1b:props:children:props:children:3:1:props:paperJSON:sections:12:paragraphs:19:1"}]}],["$","$1","2",{"children":"."}]]}]]}]],["$","$L39",null,{"paper":"$1b:props:children:props:children:0:props:product"}]]