36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"118864","publisher":"neurips","paperJSON":{"title":"Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL","paperID":"118864","avgLineHeight":10.92,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this overthinking problem, we explore how to equip LRMs with adaptive thinking capabilities, enabling them to dynamically decide whether to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis (\"...\") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"AutoThink","element":"span"},{"text":", a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"achieves favorable accuracy–efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. 
Notably, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"improves relative accuracy by 6.4% while reducing token usage by 52% on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Project Page: ","element":"span"},{"href":"https://github.com/ScienceOne-AI/AutoThink","style":{"fontWeight":"bold"},"text":"https://github.com/ScienceOne-AI/AutoThink","element":"a"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Recently, reasoning-focused Large Language Models (LLMs), also referred to as Large Reasoning Models (LRMs) [","element":"span"},{"href":"#id-0","referenceIndex":41,"text":"41","element":"a"},{"text":"], have demonstrated remarkable progress in solving complex reasoning tasks. In particular, DeepSeek-R1 [","element":"span"},{"href":"#id-1","referenceIndex":9,"text":"9","element":"a"},{"text":"] uses only outcome-based feedback and incentivizes explicit reasoning capabilities through reinforcement learning (RL) with verifiable rewards. DeepSeek-R1 and its distilled models typically follow the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<think> ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<answer> ","element":"span"},{"text":"format, where the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<think> ","element":"span"},{"text":"process generates explicit, step-by-step reasoning sequences to support obtaining a final answer during the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<answer> ","element":"span"},{"text":"phase. 
We refer to models that follow this Chain of Thought (CoT) [","element":"span"},{"href":"#id-2","referenceIndex":37,"text":"37","element":"a"},{"text":"] prompting scheme as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R1-style models","element":"span"},{"text":". The explicit thinking process, which enables self-reflection, backtracking, and validation, is widely regarded as essential for enhancing reasoning accuracy. Arising from this understanding, a popular paradigm has emerged that improves solution quality by increasing thinking token allocation during inference-time reasoning [","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":40,"text":"40","element":"a"},{"text":"]. However, this paradigm introduces a major bottleneck: excessive thinking token generation leads to high computational cost and latency, raising the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"overthinking ","element":"span"},{"text":"phenomenon, where many reasoning steps are redundant or inefficient [","element":"span"},{"href":"#id-5","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":14,"text":"14","element":"a"},{"text":"].","element":"span"}],[{"id":"id-14","style":{"width":"99%"},"width":1584,"height":523,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/1-0.png","element":"img"}],[{"text":"Figure 1: Overview of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"figcaption","subtype":"caption"},{"text":"Compared to Prior Reasoning Paradigms.","element":"figcaption","subtype":"caption"}],[{"text":"To mitigate overthinking, recent efforts have explored ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hybrid reasoning ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"concise 
reasoning ","element":"span"},{"text":"strategies. In industry, Claude 3.7 Sonnet [","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2","element":"a"},{"text":"] introduces a controllable reasoning framework that allows the model to switch between standard and extended reasoning modes. Similarly, Qwen3 [","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"30","element":"a"},{"text":"] proposes a thinking control scheme with a \"thinking\" mode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(slow thinking) ","element":"span"},{"text":"and a \"non-thinking\" mode ","element":"span"},{"style":{"fontStyle":"italic"},"text":"(fast thinking)","element":"span"},{"text":", and gives users the flexibility to choose whether the model should engage in reasoning. In the academic community, parallel research has focused on designing prompt-guided efficient reasoning [","element":"span"},{"href":"#id-9","referenceIndex":42,"text":"42","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":24,"text":"24","element":"a"},{"text":"] or training pruning-based models to achieve concise reasoning [","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":44,"text":"44","element":"a"},{"text":"]. While promising, these approaches either rely on manually predefined modes or uniformly prune reasoning steps, which may degrade performance on harder instances. 
A fundamental question then arises to address the overthinking issue:","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Can LLMs learn to adaptively determine thinking fast or slow based on given problems?","element":"span"}],[{"text":"To answer this question, we propose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink","element":"span"},{"text":", a multi-stage RL framework that enables R1-style LLMs to learn adaptive reasoning behaviors. Unlike prior approaches reliant on hard-coded prompting or external control signals, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"formulates reasoning as a learned dual-mode policy that determines both whether to engage the model’s \"thinking\" process and how to generate concise reasoning. As illustrated in Figure ","element":"span"},{"href":"#id-14","text":"1","element":"a"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"fundamentally differs from manual hybrid prompting and uniform pruning strategies by employing an ellipsis prompt and structured three-stage RL training process that enables adaptive reasoning to emerge. In detail, an ellipsis prompt acts as a controllable entry point for optional reasoning, triggering stochastic switching between thinking and no-thinking modes in R1-style LLMs. Then, the proposed multi-stage RL framework shapes this behavior progressively: Stage 1 stabilizes dual-mode coexistence, Stage 2 reinforces accurate reasoning to enhance solution quality, and Stage 3 prunes redundancy via length-aware rewards. This progression enables the model to allocate reasoning effort adaptively, achieving both accuracy and efficiency. 
The main contributions are as follows:","element":"span"}],[{"text":"• We identify the ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"ellipsis prompt","element":"span"},{"text":", a lightweight prompting scheme that activates a stochastic switching behavior in R1-style LLMs between thinking and no-thinking modes.","element":"span"}],[{"text":"• We propose a ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"multi-stage RL ","element":"span"},{"text":"framework that trains R1-style LLMs to dynamically modulate their reasoning behaviors according to problem complexity.","element":"span"}],[{"text":"• Experiments on mathematical benchmarks show that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"achieves ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"accuracy–efficiency trade-offs ","element":"span"},{"text":"better than existing pruning and compression methods, without sacrificing performance.","element":"span"}]]},{"heading":"2 An Ellipsis Unlocks Random Thinking in R1-Style Models","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"A Surprising Effect of Minimal Prompt Modification","element":"span"}],[{"text":"Recent efforts on concise reasoning aim to eliminate unnecessary thought, either via prompting that explicitly bypasses thinking [","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"19","element":"a"},{"text":"], or RL-based training that penalizes long outputs [","element":"span"},{"href":"#id-12","referenceIndex":6,"text":"6","element":"a"},{"text":"]. While effective at shortening responses, these methods enforce uniform brevity regardless of problem complexity. 
Rather than compressing by default, we pose a subtler question:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Can a small change, perhaps a few tokens, lead R1-style models to ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"decide whether to think","element":"span"},{"style":{"fontStyle":"italic"},"text":"?","element":"span"}],[{"id":"id-16","style":{"width":"99%"},"width":1582,"height":782,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/2-0.png","element":"img"}],[{"text":"Figure 2: Prompting strategies shape reasoning behavior and computational cost.","element":"figcaption","subtype":"caption"}],[{"text":"To investigate this question, we explore how a minimal modification to the prompt structure can influence reasoning behaviors in R1-style models. The baseline prompt typically includes a ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<think>\\n ","element":"span"},{"text":"tag followed by a fixed, detailed reasoning trace. In contrast, our modified prompt contains only a single ellipsis following the baseline tag. Specifically, the final prompt we provide is: ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"<think>\\n...\\n","element":"span"},{"text":". This minimal form acts as an open-ended signal, leaving it entirely up to the model to decide whether to engage in thinking, how much to elaborate, and when to stop.","element":"span"}],[{"text":"Surprisingly, this tiny change leads to a distinct shift in behavior. Without any additional training, the model often generates a closing ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"</think> ","element":"span"},{"text":"tag, sometimes immediately, skipping deep thinking entirely, and other times after producing a full derivation. 
As shown in Figure ","element":"span"},{"href":"#id-16","text":"2a","element":"a"},{"text":", evaluation on Distill-R1-1.5B [","element":"span"},{"href":"#id-1","referenceIndex":9,"text":"9","element":"a"},{"text":"] and DeepScaleR [","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"17","element":"a"},{"text":"] across five mathematical benchmarks shows that ellipsis prompting leads to a modest drop in accuracy, accompanied by a substantial reduction in token usage.","element":"span"}],[{"text":"Compared to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"no-thinking prompt ","element":"span"},{"text":"baseline [","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"19","element":"a"},{"text":"], which suppresses reasoning at the cost of accuracy, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"the ellipsis prompt triggers a stochastic switch in reasoning mode and provides a more balanced trade-off by preserving reasoning when needed and reducing unnecessary computation","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Prompting Alone Does Not Enable Difficulty-Aware Thinking","element":"span"}],[{"text":"The proposed ellipsis prompt seems to trigger selective reasoning: the model thinks on some inputs but not others. While this behavior appears desirable, it raises a deeper question:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Does the prompt-forcing model choose to engage in deep thinking ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"based on task difficulty","element":"span"},{"style":{"fontStyle":"italic"},"text":"?","element":"span"}],[{"text":"Ideally, a well-calibrated model should reason more on complex problems and skip unnecessary thinking on simpler ones. 
To assess this, we divide MATH500 problems into 8 difficulty levels based on the average accuracy of Distill-R1 (standard prompt) over 16 rollouts, with higher accuracy indicating lower difficulty. Figure ","element":"span"},{"href":"#id-16","text":"2b ","element":"a"},{"text":"(top) shows the no-thinking rate across these levels. Contrary to expectations, under the ellipsis prompt without additional training, no clear trend emerges: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"the flat distribution suggests that thinking is unguided and unaffected by problem complexity.","element":"span"}],[{"text":"A no-thinking rate that decreases as difficulty increases reflects a desirable reasoning pattern, in which the model allocates effort based on task difficulty. However, this behavior does not emerge from prompting alone. Even with diverse prompt designs (Appendix ","element":"span"},{"text":"A.2","element":"span"},{"text":"), the model fails to exhibit difficulty-aware reasoning. Prompt-only control thus suffers from a core limitation: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"without feedback, the model lacks a mechanism to learn when the thinking process is needed.","element":"span"}],[{"text":"To address this gap, we introduce a multi-stage RL framework that rewards appropriate reasoning behavior and encourages alignment between effort and difficulty. As shown in Figure ","element":"span"},{"href":"#id-16","text":"2b ","element":"a"},{"text":"(bottom), the resulting distribution from our final trained model exhibits clear difficulty-aware reasoning.","element":"span"}]]},{"heading":"3 Guiding When to Think via Multi-Stage Reinforcement Learning","paragraphs":[[{"text":"We propose ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"AutoThink","element":"span"},{"text":", a multi-stage RL framework with three training phases that induce difficulty-aware reasoning through progressively refined reward designs. 
At all stages, we employ the GRPO algorithm with a token-level policy gradient loss [","element":"span"},{"href":"#id-17","referenceIndex":25,"text":"25","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":46,"text":"46","element":"a"},{"text":"]. The training objective is:","element":"span"}],[{"style":{"width":"92%"},"width":1466,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-0.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":9.6},"width":28.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-1.png","element":"img","alt":" o_i","inline":true,"padRight":true},{"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th sampled output for a given query ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"is the number of sampled outputs per query; ","element":"span"},{"style":{"height":16.4},"width":99.48,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-2.png","element":"img","alt":" r_{i,t}(θ)","inline":true,"padRight":true},{"text":"is the token-level importance weight, defined as the ratio between the new and old token probabilities; and ","element":"span"},{"style":{"height":19.6},"width":60.52,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-3.png","element":"img","alt":"Â_{i,t}","inline":true,"padRight":true},{"text":"represents the estimated token-level advantage. The overall loss is normalized by the total number of tokens across all sampled trajectories. A visual overview of the reward mechanisms across the three training stages is given in Figure ","element":"span"},{"href":"#id-14","text":"1","element":"a"},{"text":". 
In the following subsections, we detail the reward design for each stage.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stage 1: Preventing Mode Collapse by Batch Reward Balance","element":"span"}],[{"text":"To promote efficient reasoning, higher rewards are assigned to correct answers without thinking, and stronger penalties to incorrect ones. Define think","element":"span"},{"style":{"height":16},"width":155.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-4.png","element":"img","alt":"i ∈ {0, 1}","inline":true,"padRight":true},{"text":"as an indicator of whether the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th output involves thinking, and correct","element":"span"},{"style":{"height":15.79},"width":155.52,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-5.png","element":"img","alt":"i ∈ {0, 1}","inline":true,"padRight":true},{"text":"as an indicator of whether it yields the correct answer. Based on these variables, the naive reward assignment is:","element":"span"}],[{"style":{"width":"53%"},"width":850,"height":198,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-6.png","element":"img"}],[{"text":"While this reward structure encourages difficulty-aware behavior, it causes instability during early training. The model may collapse into a degenerate policy, either always thinking or always skipping, depending on which yields a higher expected reward in the short term. This limits exploration and hinders later optimization. 
To mitigate this, we introduce ","element":"span"},{"style":{"fontWeight":"bold"},"text":"batch-level reward balancing","element":"span"},{"text":":","element":"span"}],[{"style":{"width":"34%"},"width":540,"height":460,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-7.png","element":"img"}],[{"text":"Figure 3: Effect of ","element":"figcaption","subtype":"caption"},{"id":"id-19","style":{"height":20},"width":141.48,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-8.png","element":"img","alt":" z on r_i^adj.","inline":true}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":155,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-9.png","element":"img","alt":" z ∈ [0, 1]","inline":true,"padRight":true},{"text":"denote the proportion of thinking trajectories ","element":"span"},{"style":{"fontWeight":"bold"},"text":"in a training batch","element":"span"},{"text":", and ","element":"span"},{"style":{"height":10.99},"width":89.52,"height":27.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-10.png","element":"img","alt":" 1 − z","inline":true,"padRight":true},{"text":"the no-thinking proportion. A target balance ratio ","element":"span"},{"style":{"height":16},"width":156,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-11.png","element":"img","alt":" γ ∈ (0, 1)","inline":true,"padRight":true},{"text":"and penalty slope ","element":"span"},{"style":{"height":13.41},"width":93.52,"height":33.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-12.png","element":"img","alt":" λ ≥ 0","inline":true,"padRight":true},{"text":"control the strength of adjustment. 
For thinking and no-thinking samples, we compute soft penalty factors:","element":"span"}],[{"style":{"width":"70%"},"width":1124,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-13.png","element":"img"}],[{"style":{"width":"72%"},"width":1152,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-14.png","element":"img"}],[{"text":"Each sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is first assigned an original reward ","element":"span"},{"style":{"height":17.79},"width":391,"height":44.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-15.png","element":"img","alt":" r_i^naive ∈ {+2, +1, 0, −1}","inline":true,"padRight":true},{"text":"based on its thinking flag and correctness. The final adjusted reward is then:","element":"span"}],[{"style":{"width":"89%"},"width":1416,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-16.png","element":"img"}],[{"text":"The adjusted reward ","element":"span"},{"style":{"height":20.19},"width":50,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-17.png","element":"img","alt":" r_i^adj","inline":true},{"text":"introduces a soft, piecewise-linear modulation over the naive reward, resembling a hinge-like transformation. Figure ","element":"span"},{"href":"#id-19","text":"3 ","element":"a"},{"text":"illustrates this behavior under a typical setting with ","element":"span"},{"style":{"height":14.4},"width":125.04,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-18.png","element":"img","alt":"γ = 0.5","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":11.6},"width":124.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-19.png","element":"img","alt":" λ = 2.0","inline":true},{"text":". 
When thinking dominates (","element":"span"},{"style":{"height":12.21},"width":94,"height":30.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-20.png","element":"img","alt":"z > γ","inline":true},{"text":"), the reward for thinking samples is softly reduced, especially for incorrect answers. Conversely, when no-thinking is overrepresented (","element":"span"},{"style":{"height":14.4},"width":124.52,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/3-21.png","element":"img","alt":"z ≪ γ),","inline":true,"padRight":true},{"text":"no-thinking rewards are suppressed. In both cases, the model is gently pushed to restore balance by favoring the less frequent behavior.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stage 2: Reinforcing Reliable Behavior within Dual Modes","element":"span"}],[{"text":"After establishing behavioral stability across thinking and no-thinking modes, the second stage focuses on improving task performance within each mode. Specifically, the objective is to enhance reasoning quality when invoked, and to promote accurate responses in the absence of thinking.","element":"span"}],[{"text":"To allow the model to refine its behavior without external constraints, we remove the batch-level balancing used in the previous stage and allow free evolution of the reasoning policy. The reward is set directly to the naive definition:","element":"span"}],[{"style":{"width":"56%"},"width":900,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-0.png","element":"img"}],[{"text":"In this stage, we allocate a larger context budget during training, enabling longer responses when needed. 
Owing to the regularization established in Stage 1, the proportion of thinking in Stage 2 remains balanced, fluctuating naturally rather than collapsing.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Stage 3: Pruning Unnecessary Reasoning Paths via Length-Aware Reward","element":"span"}],[{"text":"While the relaxed setup in Stage 2 improves accuracy, it also leads to overly long responses. Building on the stability established in prior stages, we now aim to improve reasoning efficiency.","element":"span"}],[{"text":"Inspired by GRPO-LEAD [","element":"span"},{"href":"#id-20","referenceIndex":50,"text":"50","element":"a"},{"text":"], we introduce a length-aware reward modulation, encouraging brevity in no-thinking mode and rewarding elaboration only when warranted. Specifically, the adjusted reward in this stage is defined as:","element":"span"}],[{"style":{"width":"57%"},"width":904,"height":106,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":24.4},"width":196,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-2.png","element":"img","alt":" y_i = (L_i − µ_q) / σ_q","inline":true},{"text":"is the standardized length of response ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"within its query group ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":". 
Here, ","element":"span"},{"style":{"height":13.6},"width":36,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-3.png","element":"img","alt":" L_i","inline":true,"padRight":true},{"text":"denotes the response length, while ","element":"span"},{"style":{"height":15.41},"width":152,"height":38.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-4.png","element":"img","alt":" µ_q and σ_q","inline":true,"padRight":true},{"text":"are the group-specific mean and standard deviation of lengths, computed separately for correct and incorrect sample groups. The hyperparameters ","element":"span"},{"style":{"height":14.61},"width":124,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-5.png","element":"img","alt":" α and β","inline":true,"padRight":true},{"text":"control the sensitivity of the shaping term.","element":"span"}],[{"style":{"width":"34%"},"width":540,"height":462,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-6.png","element":"img"}],[{"text":"Figure 4: Effect of ","element":"figcaption","subtype":"caption"},{"id":"id-21","style":{"height":20.19},"width":189.52,"height":50.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/4-7.png","element":"img","alt":" α, β on r_i^adj.","inline":true}],[{"text":"The reward decays with length for correct responses and grows for incorrect ones, encouraging concise success and thorough failure analysis, as illustrated by the example in Figure ","element":"span"},{"href":"#id-21","text":"4","element":"a"},{"text":". 
This final stage allows the model to adaptively regulate its reasoning depth, producing succinct responses without significantly compromising reliability.","element":"span"}]]},{"heading":"4 Experiments","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Setup","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Datasets and Models ","element":"span"},{"text":"We use the same training data as in DeepScaleR [","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"17","element":"a"},{"text":"], comprising 40K mathematical problems of varying difficulty. Following prior work [","element":"span"},{"href":"#id-22","referenceIndex":49,"text":"49","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":33,"text":"33","element":"a"},{"text":"], the evaluation is conducted on five standard math benchmarks: MATH, Minerva, Olympiad, AIME24, and AMC23. We evaluate the applicability of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"on three R1-style models ","element":"span"},{"style":{"fontWeight":"bold"},"text":"with varying sizes and RL post-training status","element":"span"},{"text":": DeepSeek-R1-Distill-Qwen-1.5B/7B (abbreviated as Distill-R1-1.5B/7B), and DeepScaleR-Preview-1.5B [","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"17","element":"a"},{"text":"] (abbreviated as DeepScaleR), the state-of-the-art 1.5B reasoning model obtained from Distill-R1-1.5B via context-extended RL at a training budget of up to $5,000.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Baselines ","element":"span"},{"text":"We benchmark our approach against two classes of baselines designed to promote efficient reasoning. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Prompt-only baselines","element":"span"},{"text":": we apply ","element":"span"},{"style":{"fontStyle":"italic"},"text":"standard ","element":"span"},{"text":"[","element":"span"},{"href":"#id-1","referenceIndex":9,"text":"9","element":"a"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"no-thinking ","element":"span"},{"text":"[","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"19","element":"a"},{"text":"], and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ellipsis (ours) ","element":"span"},{"text":"prompting strategies on the base models, following the description illustrated in Figure ","element":"span"},{"href":"#id-16","text":"2a","element":"a"},{"text":". ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) RL-trained baselines","element":"span"},{"text":": including ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Concise-RL ","element":"span"},{"text":"[","element":"span"},{"href":"#id-12","referenceIndex":6,"text":"6","element":"a"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ShorterBetter ","element":"span"},{"text":"[","element":"span"},{"href":"#id-13","referenceIndex":44,"text":"44","element":"a"},{"text":"], and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ThinkPrune ","element":"span"},{"text":"[","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":"], all of which aim to shorten reasoning traces by RL, but do not explicitly account for adaptive reasoning behavior. 
Among these methods, only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ThinkPrune ","element":"span"},{"text":"provides publicly available model checkpoints; we evaluate its two representative variants, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"iter-2K ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"4K","element":"span"},{"text":". For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Concise-RL ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ShorterBetter","element":"span"},{"text":", results are reported as published in their respective papers. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) ","element":"span"},{"text":"Additionally, we include a set of open-source","element":"span"}],[{"id":"id-31","text":"Table 1: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(Main Results) ","element":"figcaption","subtype":"caption"},{"text":"Accuracy, Token Usage, and Efficiency Comparison Across Methods.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1578,"height":1262,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-0.png","element":"img"}],[{"text":"RL-finetuned models based on Distill-R1-1.5B/7B as reference, including ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Open-RS3-1.5B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-24","referenceIndex":5,"text":"5","element":"a"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Still-3-1.5B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-25","referenceIndex":22,"text":"22","element":"a"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"FastCuRL-1.5B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-26","referenceIndex":28,"text":"28","element":"a"},{"text":"], 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Light-R1-DS-7B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-27","referenceIndex":38,"text":"38","element":"a"},{"text":"], ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AReaL-boba-RL-7B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-28","referenceIndex":21,"text":"21","element":"a"},{"text":"], and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"QwQ-32B ","element":"span"},{"text":"[","element":"span"},{"href":"#id-29","referenceIndex":31,"text":"31","element":"a"},{"text":"]. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"These models are not explicitly optimized for concise reasoning and differ significantly in both training objectives and computational budgets. We report their results for contextual reference only, aiming to highlight differences in design philosophy rather than to draw direct performance comparisons.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training and Evaluation ","element":"span"},{"text":"All experiments are implemented using the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"verl ","element":"span"},{"text":"framework [","element":"span"},{"href":"#id-30","referenceIndex":27,"text":"27","element":"a"},{"text":"], with most training hyperparameters retained at the default values. 
For all models, we set the batch size and training context length to ","element":"span"},{"text":"(128","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"8","element":"span"},{"text":"K","element":"span"},{"text":") ","element":"span"},{"text":"in Stage 1, ","element":"span"},{"text":"(64","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"16","element":"span"},{"text":"K","element":"span"},{"text":") ","element":"span"},{"text":"in Stage 2, and ","element":"span"},{"text":"(64","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"24","element":"span"},{"text":"K","element":"span"},{"text":") ","element":"span"},{"text":"in Stage 3. We save model checkpoints at empirically selected steps based on observed convergence throughout the procedure: 220/440/130 for Distill-R1-1.5B, 110/240/60 for DeepScaleR, and 220/450/20 for Distill-R1-7B across Stages. During evaluation, all models use a 32K context window. We sample 16 rollouts per instance with temperature 0.6 and report the average pass@1 accuracy. Reward shaping hyperparameters are set to ","element":"span"},{"style":{"height":14.8},"width":272.52,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-1.png","element":"img","alt":" γ = 0.5, λ = 2.0","inline":true,"padRight":true},{"text":"for Stage 1, and ","element":"span"},{"style":{"height":14.4},"width":224,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-2.png","element":"img","alt":" α = β = 0.05","inline":true,"padRight":true},{"text":"for Stage 3.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Main Results","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-31","text":"1 ","element":"a"},{"text":"reports average accuracy and token usage across five mathematical benchmarks. 
To jointly evaluate reasoning accuracy and efficiency, we introduce the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Efficiency-F1 score ","element":"span"},{"text":"(E-F1), defined as:","element":"span"}],[{"text":"E-F1 ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":38.4},"width":291.52,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-3.png","element":"img","alt":"2 · ∆acc · ∆len / (∆acc + ∆len)","inline":true,"padRight":true},{"text":"if acc ","element":"span"},{"style":{"height":14.21},"width":517.52,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-4.png","element":"img","alt":" > acc_std and len < len_std; else 0","inline":true}],[{"text":"where the normalized accuracy gain and token reduction are given by:","element":"span"}],[{"style":{"width":"47%"},"width":756,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/5-5.png","element":"img"}],[{"id":"id-32","style":{"width":"99%"},"width":1571,"height":402,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/6-0.png","element":"img"}],[{"text":"Figure 5: Prompting strategies shape reasoning behavior and computational cost.","element":"figcaption","subtype":"caption"}],[{"text":"The subscripts ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"std ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"no ","element":"span"},{"text":"refer to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"standard ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"no-thinking ","element":"span"},{"text":"baselines. 
A non-zero E-F1 indicates that the model improves upon the standard baseline in both accuracy and token usage, capturing the extent to which pruning enhances conciseness without degrading performance.","element":"span"}],[{"text":"Despite the strong performance of existing open-source models, their outputs are substantially longer, ","element":"span"},{"style":{"fontWeight":"bold"},"text":"even reaching twice the length of ours at the same model size, suggesting that their gains stem from verbose, non-adaptive reasoning","element":"span"},{"text":". Prompt-based baselines (","element":"span"},{"style":{"fontStyle":"italic"},"text":"no-thinking ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ellipsis","element":"span"},{"text":") reduce length at the cost of accuracy. RL-based baselines also shorten outputs, but offer limited improvements on Distill-R1 and in some cases even reduce accuracy on DeepScaleR.","element":"span"}],[{"text":"In contrast, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"exhibits a staged progression in both accuracy and efficiency. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"All three stages are consistently trained with the ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"ellipsis prompt ","element":"span"},{"style":{"fontWeight":"bold"},"text":"as the base prompting strategy. ","element":"span"},{"text":"Stage 1 primarily aims to stabilize the activation of reasoning behavior and has minimal impact on performance. Stage 2 leads to accuracy improvements over the standard prompt across all model backbones, demonstrating effective control over when to reason. Stage 3 introduces length-aware pruning, further reducing token usage while minimizing potential performance degradation. 
On Distill-R1-1.5B, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink-Stage3 ","element":"span"},{"text":"achieves 51.7% accuracy with half the token usage of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"standard prompt ","element":"span"},{"text":"baseline. Remarkably, even on the heavily optimized DeepScaleR, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink-Stage2 ","element":"span"},{"text":"further improves performance by 0.6 over the standard prompt while reducing token usage by an additional 10%. However, Stage 3 leads to a slight accuracy drop, likely because DeepScaleR has already undergone extensive optimization. This suggests that additional pruning may be unnecessary on fully optimized models.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Ablation Study","element":"span"}],[{"text":"We conduct ablations on the reward design of our multi-stage RL framework on Distill-R1-1.5B to assess the necessity of each stage. The performance gains achieved by Stage 2 and the pruning effect of Stage 3 are already reflected in Table ","element":"span"},{"href":"#id-31","text":"1","element":"a"},{"text":", in terms of accuracy and token usage. Here, we focus on two key aspects: (1) the role of batch reward balance in Stage 1, and (2) whether skipping Stage 2 and proceeding directly from Stage 1 to Stage 3 yields comparable performance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Batch Reward Balance Prevents Mode Collapse ","element":"span"},{"text":"To assess the role of batch-level balancing in Stage 1, we examine its impact on stabilizing dual-mode reasoning behavior. Specifically, we plot the average thinking rate across training steps, as shown in Figure ","element":"span"},{"href":"#id-32","text":"5a","element":"a"},{"text":". Under a naive reward, the model rapidly collapses into a thinking mode. 
Conversely, applying the length-aware reward (with ","element":"span"},{"style":{"height":13.2},"width":156,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/6-1.png","element":"img","alt":" α = 0.05,","inline":true},{"style":{"height":14.4},"width":95.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/6-2.png","element":"img","alt":"β = 0","inline":true},{"text":") in naive reward to encourage brevity leads the model to collapse into a degenerate no-thinking mode. In contrast, the batch reward balance mechanism, by enforcing a target thinking ratio via penalty slope ","element":"span"},{"style":{"height":11.41},"width":20.48,"height":28.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/6-3.png","element":"img","alt":" λ","inline":true},{"text":", helps stabilize training and supports the coexistence of thinking and no-thinking behaviors. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"We observe that response length rises and then falls during training, indicating an increasing share of shorter, no-thinking responses. ","element":"span"},{"text":"These observations imply that the model implicitly performs reasoning pruning, akin to concise reasoning.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pruning Without Reinforcement Limits Performance ","element":"span"},{"text":"We investigate the necessity of Stage 2 by applying Stage 3 directly after Stage 1, skipping the reinforcement phase. As shown in Figure ","element":"span"},{"href":"#id-32","text":"5b","element":"a"},{"text":", the complete training pipeline that includes Stage 2 prior to Stage 3 yields a notable boost in both accuracy and response length, followed by effective pruning with minimal performance degradation. In contrast, bypassing Stage 2 results in stagnant accuracy and an eventual increase in response length after an initial decline. 
With comparable response lengths, this variant achieves only 47.6% accuracy across","element":"span"}],[{"id":"id-33","style":{"width":"53%"},"width":848,"height":520,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/7-0.png","element":"img"}],[{"text":"Figure 6: Keyword usage of reasoning behaviors across thinking and no-thinking modes.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"42%"},"width":672,"height":500,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/7-1.png","element":"img"}],[{"text":"Figure 7: Accuracy vs. Thinking Rate. The numbers indicate response lengths (tokens).","element":"figcaption","subtype":"caption"}],[{"text":"five benchmarks, notably lower than the 51.7% from full training. These observations underscore ","element":"span"},{"style":{"fontWeight":"bold"},"text":"the importance of Stage 2 in establishing stable and discriminative reasoning behaviors ","element":"span"},{"text":"that enable reliable pruning in the subsequent stage.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"In-Depth Behavioral and Efficiency Analysis","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lexical Patterns in Two Reasoning Modes ","element":"span"},{"text":"We analyze linguistic differences between thinking and no-thinking responses by quantifying the frequency of reasoning-related keywords (e.g., “Wait”, “Alternatively”, “Check”) per 1,000 tokens, capturing how explicit reasoning is manifested in each mode. 
Following [","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":"], we categorize these keywords into three functional groups on the MATH500 benchmark: (1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Soliloquize & Thinking","element":"span"},{"text":", reflecting internal deliberation and self-correction, characteristic of R1-style reasoning; (2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Check & Confirm","element":"span"},{"text":", indicating procedural verification; and (3) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Summary & Calculation","element":"span"},{"text":", marking final deduction and computational closure. As illustrated in Figure ","element":"span"},{"href":"#id-33","text":"6","element":"a"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"training substantially reduces soliloquy-like expressions, particularly under the no-thinking mode, indicating a decline in explicit internal deliberation. In contrast, verification and computation-related terms appear slightly more frequently in the no-thinking setting, suggesting a shift toward focused conclusion and validation rather than step-by-step verbalization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Correlation Between Task Difficulty and Reasoning Tendency ","element":"span"},{"text":"We investigate the relationship between the reasoning behavior and the inherent difficulty of the tasks. As shown in Figure ","element":"span"},{"href":"#id-33","text":"7","element":"a"},{"text":", there exists a positive correlation between the thinking rate and task difficulty. To further quantify this relationship, we compute the average accuracy, thinking rate, and response length across all datasets. Here, accuracy serves as a proxy for dataset difficulty. 
The results indicate that, on more challenging datasets, models tend to invoke explicit reasoning more frequently and produce longer responses. This demonstrates that stronger models do not rely on explicit reasoning as frequently, yet outperform weaker models, highlighting an emergent ability to reason more selectively and efficiently.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Readability and Accuracy of Dual Reasoning Modes ","element":"span"},{"text":"A common concern in reinforcement fine-tuning is that reward-driven optimization may degrade the fluency or coherence of generated reasoning traces. To assess whether ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"introduces such effects, we follow the evaluation setup in [","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":"] and compute the perplexity (PPL) over the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":" ","element":"span"},{"text":"span traces using Distill-R1-1.5B. For no-thinking variants, PPL is calculated over the segment following ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"","element":"span"},{"text":". As shown in Table ","element":"span"},{"href":"#id-34","text":"2","element":"a"},{"text":", the think mode of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"maintains PPL comparable to standard prompting, while the no-think mode achieves the lowest PPL, reflecting more concise and fluent responses. Overall, all variants remain within acceptable readability ranges. Meanwhile, we analyze accuracy and token usage across reasoning modes. 
The results are also reported in Table ","element":"span"},{"href":"#id-34","text":"2","element":"a"},{"text":": ","element":"span"},{"style":{"fontWeight":"bold"},"text":"no-thinking responses are shorter and more accurate, suggesting effective handling of simpler problems. Thinking-mode responses are longer with slightly lower accuracy, reflecting allocation to harder cases. ","element":"span"},{"text":"These results indicate that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"adaptively adjusts reasoning depth based on task difficulty.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Evaluating ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"AutoThink ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Under Standard and No-Thinking Prompts ","element":"span"},{"text":"We analyze how the trained model responds to the standard and forced-no-think prompts. The forced-no-think prompt is defined as ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"\\n...\\n\\n\\n","element":"span"},{"text":", which builds upon the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ellipsis ","element":"span"},{"text":"prompt but enforces an immediate termination of the thinking phase. The results of Distill-1.5B-","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"are presented in Table ","element":"span"},{"href":"#id-35","text":"3","element":"a"},{"text":". 
As","element":"span"}],[{"id":"id-36","style":{"width":"94%"},"width":1492,"height":538,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/8-0.png","element":"img"}],[{"text":"Figure 8: Distribution of Reasoning Behaviors Across Models and Reasoning Modes.","element":"figcaption","subtype":"caption"}],[{"text":"expected, the standard prompt induces longer reasoning traces and achieves higher accuracy, while the forced no-think prompt reduces token usage at the cost of slight performance degradation. These findings suggest that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"has learned to internally compress its reasoning when appropriate, while retaining the ability to conditionally invoke reasoning via prompting.","element":"span"}],[{"id":"id-34","text":"Table 2: PPL, Acc & Token Length.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"35%"},"width":564,"height":380,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/8-1.png","element":"img"}],[{"text":"Table 3: ","element":"figcaption","subtype":"caption"},{"id":"id-35","style":{"fontStyle":"italic"},"text":"AutoThink ","element":"figcaption","subtype":"caption"},{"text":"Performance on Three Prompts.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"61%"},"width":976,"height":378,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/8-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Reasoning Behavior Profiling ","element":"span"},{"text":"To gain a deeper understanding of how reasoning behaviors evolve, we annotate the generated solutions from each model with high-level problem-solving phases using GPT-4o. 
As illustrated in Figure ","element":"span"},{"href":"#id-36","text":"8","element":"a"},{"text":", Distill-R1-1.5B distributes its reasoning effort across many surface-level activities, such as “reformulating the problem” and “understanding the problem.” In contrast, ThinkPrune slightly shifts focus toward answer-finalization routines, while still exhibiting dispersed reasoning patterns. Notably, AutoThink in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Think Mode ","element":"span"},{"text":"allocates a larger proportion of steps to core reasoning phases, including “computing or simplifying expressions” and “applying known theorems,” suggesting a more targeted and efficient reasoning trajectory. Meanwhile, in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"No-Think Mode","element":"span"},{"text":", AutoThink maintains strong task comprehension and delivers concise outputs, dedicating most steps to problem understanding and direct computation. These findings indicate that AutoThink not only reduces redundant steps, but also adapts its reasoning structure based on the selected mode.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Generality Beyond Mathematical Reasoning ","element":"span"},{"text":"To investigate whether AutoThink generalizes beyond mathematical reasoning, we additionally evaluate our models on three non-mathematical benchmarks: (i) ","element":"span"},{"style":{"fontWeight":"bold"},"text":"GPQA ","element":"span"},{"text":"for scientific multi-hop reasoning, (ii) ","element":"span"},{"style":{"fontWeight":"bold"},"text":"MMLU ","element":"span"},{"text":"for general multi-task language understanding, and (iii) ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Live-Code-Bench ","element":"span"},{"text":"for code generation (20250727 release).","element":"span"}],[{"text":"As shown in Table ","element":"span"},{"href":"#id-37","text":"4","element":"a"},{"text":", AutoThink retains competitive accuracy 
while reducing token usage, indicating that adaptive reasoning behaviors extend beyond math tasks. Stage 2 even surpasses the baseline in accuracy while halving response length, highlighting the transferability of our approach to diverse domains.","element":"span"}],[{"id":"id-37","text":"Table 4: Performance of AutoThink on non-math benchmarks. ","element":"figcaption","subtype":"caption"},{"text":"Each cell shows Accuracy (%) / Avg. Length.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"56%"},"width":896,"height":200,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/118864/images/8-3.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Additional Analysis. ","element":"span"},{"text":"We further conduct additional analyses on more base models, hyperparameters, training cost, and case studies. Details are presented in Appendix ","element":"span"},{"text":"B ","element":"span"},{"text":"due to space limitations.","element":"span"}]]},{"heading":"5 Related Works","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"RL-based Post-Training for LLMs. ","element":"span"},{"text":"Reinforcement fine-tuning (RFT) has been widely adopted to improve the reasoning ability of LLMs [","element":"span"},{"href":"#id-38","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":11,"text":"11","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":34,"text":"34","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":35,"text":"35","element":"a"},{"text":"]. Recent work on RL for LLMs has focused on improving the efficiency and effectiveness of large-scale RL training. 
Key techniques include decoupling the clipping mechanism and introducing dynamic group sampling [","element":"span"},{"href":"#id-18","referenceIndex":46,"text":"46","element":"a"},{"text":"], mitigating value bias over long sequences [","element":"span"},{"href":"#id-44","referenceIndex":48,"text":"48","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":47,"text":"47","element":"a"},{"text":"], difficulty-aware advantage reweighting [","element":"span"},{"href":"#id-20","referenceIndex":50,"text":"50","element":"a"},{"text":"], model ensembling [","element":"span"},{"href":"#id-46","referenceIndex":8,"text":"8","element":"a"},{"text":"], and designing minimal-form credit assignment strategies for rewards [","element":"span"},{"href":"#id-47","referenceIndex":4,"text":"4","element":"a"},{"text":"]. In addition, RFT has been shown to explicitly promote self-verification and self-correction behaviors [","element":"span"},{"href":"#id-48","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":18,"text":"18","element":"a"},{"text":"], while also supporting optimization of test-time compute [","element":"span"},{"href":"#id-50","referenceIndex":23,"text":"23","element":"a"},{"text":"]. Multi-stage, context-length-extended RL further amplifies the long-chain reasoning ability of R1-style models [","element":"span"},{"href":"#id-3","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":28,"text":"28","element":"a"},{"text":"]. In our work, RL is applied to train R1-style models to adaptively control their reasoning behavior, enabling selective thinking guided by multi-stage reward shaping.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Mitigating Overthinking for LLMs. 
","element":"span"},{"text":"While RFT improves performance, it may induce overthinking, causing models to generate overly verbose reasoning with limited benefit [","element":"span"},{"href":"#id-5","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":14,"text":"14","element":"a"},{"text":"]. [","element":"span"},{"href":"#id-51","referenceIndex":3,"text":"3","element":"a"},{"text":"] address overthinking in R1-style models by using self-generated short CoT as positive signals in DPO, encouraging concise reasoning. [","element":"span"},{"href":"#id-52","referenceIndex":51,"text":"51","element":"a"},{"text":"] mitigate overthinking by training models to terminate with “I don’t know” on unsolvable problems. Recent studies have shown that inserting pseudo-thinking cues into R1-style prompts [","element":"span"},{"href":"#id-15","referenceIndex":19,"text":"19","element":"a"},{"text":"], or manually controlling reasoning based on problem difficulty [","element":"span"},{"href":"#id-53","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-54","referenceIndex":39,"text":"39","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":15,"text":"15","element":"a"},{"text":"], can suppress the model’s thinking behavior, but at the cost of reduced performance. 
Other studies approach the problem from different perspectives: applying supervised fine-tuning (SFT) with short CoT responses [","element":"span"},{"href":"#id-56","referenceIndex":45,"text":"45","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":20,"text":"20","element":"a"},{"text":"], incorporating response length–aware rewards in RFT [","element":"span"},{"href":"#id-58","referenceIndex":1,"text":"1","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":44,"text":"44","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":6,"text":"6","element":"a"},{"text":", ","element":"span"},{"href":"#id-59","referenceIndex":16,"text":"16","element":"a"},{"text":"], or leveraging smaller models to guide larger ones toward faster reasoning [","element":"span"},{"href":"#id-55","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":36,"text":"36","element":"a"},{"text":"]. Furthermore, [","element":"span"},{"href":"#id-61","referenceIndex":43,"text":"43","element":"a"},{"text":"] designs an early-exit mechanism during the reasoning phase to prevent excessive thinking along an incorrect chain of thought. Inspired by these findings, we first design a minimal prompt that elicits random thinking behavior, then apply multi-stage RL to guide the model to think adaptively based on task difficulty, without using external signals or teacher models.","element":"span"}]]},{"heading":"6 Conclusion & Limitations","paragraphs":[[{"text":"This work explores how R1-style LLMs can learn to reason adaptively. We propose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink","element":"span"},{"text":", a minimal prompting strategy paired with a multi-stage RL framework that enables task-aware thinking. 
Through stage-wise reward shaping, the model stabilizes reasoning patterns, reinforces effective behaviors, and prunes unnecessary steps. Experiments show that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"achieves favorable accuracy–efficiency trade-offs, outperforming prompting and RL baselines without compromising performance, offering a scalable and controllable approach to efficient reasoning in LLMs.","element":"span"}],[{"text":"While ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"demonstrates promising adaptive reasoning capabilities, several limitations remain:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(1) Reward Hacking","element":"span"},{"text":": The model may bypass the separation between thinking and answering by embedding reasoning after the ","element":"span"},{"style":{"fontFamily":"monospace"},"text":" ","element":"span"},{"text":"tag. As shown in Figure ","element":"span"},{"href":"#id-33","text":"6","element":"a"},{"text":", reasoning-related tokens still appear in no-thinking mode, suggesting incomplete behavioral separation.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(2) Uncontrolled Reasoning Budget","element":"span"},{"text":": ","element":"span"},{"style":{"fontStyle":"italic"},"text":"AutoThink ","element":"span"},{"text":"adaptively decides when to think, but cannot control overall response length. Future work could explore budget-aware CoT generation, as seen in recent systems like Qwen3 [","element":"span"},{"href":"#id-8","referenceIndex":30,"text":"30","element":"a"},{"text":"].","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(3) Unfiltered Training Data","element":"span"},{"text":": We directly use the DeepScaleR dataset without filtering by task difficulty. Though simple data selection has shown utility, our focus lies in training design. 
Integrating curriculum-based filtering may further improve performance.","element":"span"}]]},{"heading":"Acknowledgments and Disclosure of Funding","paragraphs":[[{"text":"This work is supported by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDA0480303, Young Scientists Fund of The State Key Laboratory of Multimodal Artificial Intelligence Systems ES2P100112, National Natural Science Foundation of China 62402252, 62536003, and Guangdong High-Level Talent Programme 2024TQ08X283.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-58","text":"[1] ","element":"span"},{"text":"Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2nd Conference on Language Modeling (COLM)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-7","text":"[2] ","element":"span"},{"text":"Anthropic. Claude 3.7 sonnet, 2025. URL ","element":"span"},{"href":"https://www.anthropic.com/claude/sonnet","style":{"fontFamily":"monospace"},"text":"https://www.anthropic.com/claude/sonnet","element":"a"},{"text":".","element":"span"}],[{"id":"id-51","text":"[3] ","element":"span"},{"text":"Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-second International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-47","text":"[4] ","element":"span"},{"text":"Jie Cheng, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Gang Xiong, Yisheng Lv, and Fei-Yue Wang. Stop summation: Min-form credit assignment is all process reward model needs for reasoning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.15275","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-24","text":"[5] ","element":"span"},{"text":"Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.16219","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-12","text":"[6] ","element":"span"},{"text":"Mehdi Fatemi, Banafsheh Rafiee, Mingjie Tang, and Kartik Talamadupula. Concise reasoning via reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.05185","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-41","text":"[7] ","element":"span"},{"text":"Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Srft: A single-stage method with supervised and reinforcement fine-tuning for reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2506.19767","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-46","text":"[8] ","element":"span"},{"text":"Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, and Dongbin Zhao. Rlae: Reinforcement learning-assisted ensemble for llms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-1","text":"[9] ","element":"span"},{"text":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.12948","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-53","text":"[10] ","element":"span"},{"text":"Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.18547","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-40","text":"[11] ","element":"span"},{"text":"Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, ","element":"span"},{"text":"Fuxiang Zhang, ","element":"span"},{"text":"Jiacheng Xu, ","element":"span"},{"text":"Wei Shen, ","element":"span"},{"text":"Siyuan Li, ","element":"span"},{"text":"Liang Zeng, Tianwen Wei, ","element":"span"},{"text":"Cheng Cheng, ","element":"span"},{"text":"Bo An, ","element":"span"},{"text":"Yang Liu, ","element":"span"},{"text":"and Yahui Zhou. ","element":"span"},{"text":"Skywork open ","element":"span"},{"text":"reasoner ","element":"span"},{"text":"series. ","element":"span"},{"href":"https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680","style":{"fontFamily":"monospace"},"text":"https://capricious-hydrogen-41c.notion.site/ ","element":"a"},{"href":"https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680","style":{"fontFamily":"monospace"},"text":"Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680","element":"a"},{"text":", ","element":"span"},{"text":"2025. Notion Blog.","element":"span"}],[{"id":"id-11","text":"[12] ","element":"span"},{"text":"Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.01296","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-39","text":"[13] ","element":"span"},{"text":"Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.16720","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-6","text":"[14] ","element":"span"},{"text":"Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthink: Slowdown attacks on reasoning llms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.02542","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-55","text":"[15] ","element":"span"},{"text":"Yule Liu, Jingyi Zheng, Zhen Sun, Zifan Peng, Wenhan Dong, Zeyang Sha, Shiwen Cui, Weiqiang Wang, and Xinlei He. Thought manipulation: External thought can be efficient for large reasoning models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.13626","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-59","text":"[16] ","element":"span"},{"text":"Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.12570","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-3","text":"[17] ","element":"span"},{"text":"Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. 
Notion Blog.","element":"span"}],[{"id":"id-49","text":"[18] ","element":"span"},{"text":"Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, and Jia Li. S²R: Teaching llms to self-verify and self-correct via reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.12853","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-15","text":"[19] ","element":"span"},{"text":"Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.09858","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-57","text":"[20] ","element":"span"},{"text":"Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.09601","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-28","text":"[21] ","element":"span"},{"text":"Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Eighth Conference on Machine Learning and Systems","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-25","text":"[22] ","element":"span"},{"text":"Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.09413","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-50","text":"[23] ","element":"span"},{"text":"Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-second International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-10","text":"[24] ","element":"span"},{"text":"Matthew Renze and Erhan Guven. The benefits of a concise chain of thought on problem-solving in large language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2024 2nd International Conference on Foundation and Large Language Models (FLLM)","element":"span"},{"text":", pages 476–483. IEEE, 2024.","element":"span"}],[{"id":"id-17","text":"[25] ","element":"span"},{"text":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.03300","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-48","text":"[26] ","element":"span"},{"text":"Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory W Wornell, Subhro Das, David Daniel Cox, and Chuang Gan. Satori: Reinforcement learning with chain-of-action-thought enhances llm reasoning via autoregressive search. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-second International Conference on Machine Learning (ICML)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-30","text":"[27] ","element":"span"},{"text":"Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twentieth European Conference on Computer Systems","element":"span"},{"text":", pages 1279–1297, 2025.","element":"span"}],[{"id":"id-26","text":"[28] ","element":"span"},{"text":"Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with progressive context extension for efficient training r1-like reasoning models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.17287","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-5","text":"[29] ","element":"span"},{"text":"Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.16419","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-8","text":"[30] ","element":"span"},{"text":"Qwen Team. Qwen3 technical report, 2025. 
URL ","element":"span"},{"href":"https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf","style":{"fontFamily":"monospace"},"text":"https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf","element":"a"},{"text":".","element":"span"}],[{"id":"id-29","text":"[31] ","element":"span"},{"text":"Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL ","element":"span"},{"href":"https://qwenlm.github.io/blog/qwq-32b/","style":{"fontFamily":"monospace"},"text":"https://qwenlm.github.io/blog/qwq-32b/","element":"a"},{"text":".","element":"span"}],[{"id":"id-38","text":"[32] ","element":"span"},{"text":"Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pages 7601–7614, 2024.","element":"span"}],[{"id":"id-23","text":"[33] ","element":"span"},{"text":"Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, et al. Enhancing llm reasoning with iterative dpo: A comprehensive empirical investigation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2nd Conference on Language Modeling (COLM)","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-42","text":"[34] ","element":"span"},{"text":"Songjun Tu, Jingbo Sun, Qichao Zhang, Xiangyuan Lan, and Dongbin Zhao. Online preference-based reinforcement learning with self-augmented feedback from large language model. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems","element":"span"},{"text":", pages 2069–2077, 2025.","element":"span"}],[{"id":"id-43","text":"[35] ","element":"span"},{"text":"Songjun Tu, Qichao Zhang, Jingbo Sun, Yuqian Fu, Linjing Li, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, and Dongbin Zhao. Perception-consistency multimodal large language models reasoning via caption-regularized policy optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2509.21854","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-60","text":"[36] ","element":"span"},{"text":"Jikai Wang, Juntao Li, Lijun Wu, and Min Zhang. ","element":"span"},{"text":"Efficient reasoning for llms through speculative chain-of-thought. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.19095","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-2","text":"[37] ","element":"span"},{"text":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems (NeurIPS)","element":"span"},{"text":", 35:24824–24837, 2022.","element":"span"}],[{"id":"id-27","text":"[38] ","element":"span"},{"text":"Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, et al. Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.10460","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-54","text":"[39] ","element":"span"},{"text":"Tong Wu, Chong Xiang, Jiachen T Wang, and Prateek Mittal. 
Effectively controlling reasoning models through thinking intervention. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.24370","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-4","text":"[40] ","element":"span"},{"text":"Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for llm problem-solving. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirteenth International Conference on Learning Representations","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-0","text":"[41] ","element":"span"},{"text":"Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.09686","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-9","text":"[42] ","element":"span"},{"text":"Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.18600","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-61","text":"[43] ","element":"span"},{"text":"Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Zheng Lin, Li Cao, and Weiping Wang. Dynamic early exit in reasoning models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.15895","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-13","text":"[44] ","element":"span"},{"text":"Jingyang Yi and Jiazheng Wang. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.21370","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-56","text":"[45] ","element":"span"},{"text":"Bin Yu, Hang Yuan, Yuliang Wei, Bailing Wang, Weizhen Qi, and Kai Chen. Long-short chain-of-thought mixture supervised fine-tuning eliciting efficient reasoning in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2505.03469","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-18","text":"[46] ","element":"span"},{"text":"Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.14476","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-45","text":"[47] ","element":"span"},{"text":"Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.05118","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-44","text":"[48] ","element":"span"},{"text":"Yufeng Yuan, Yu Yue, Ruofei Zhu, Tiantian Fan, and Lin Yan. What’s behind ppo’s collapse in long-cot? value optimization holds the secret. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.01491","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-22","text":"[49] ","element":"span"},{"text":"Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.18892","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-20","text":"[50] ","element":"span"},{"text":"Jixiao Zhang and Chunsheng Zuo. ","element":"span"},{"text":"Grpo-lead: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.09696","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-52","text":"[51] ","element":"span"},{"text":"Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, and Doyen Sahoo. Automatic curriculum expert iteration for reliable llm reasoning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirteenth International Conference on Learning Representations (ICLR)","element":"span"},{"text":", 2024.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]