36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"117932","publisher":"neurips","paperJSON":{"title":"CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models","paperID":"117932","avgLineHeight":10.91,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need to sample multiple completions for each question. Our experiment and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training—their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experiments show that CPPO achieves up to ","element":"span"},{"style":{"height":11.2},"width":94.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/0-0.png","element":"img","alt":" 7.98×","inline":true,"padRight":true},{"text":"speedup on GSM8K and ","element":"span"},{"style":{"height":11.2},"width":94.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/0-1.png","element":"img","alt":" 3.48×","inline":true,"padRight":true},{"text":"on Math while preserving or even enhancing the accuracy compared to the original GRPO. 
We release our code at ","element":"span"},{"href":"https://github.com/lzhxmu/CPPO","text":"https://github.com/lzhxmu/CPPO.","element":"a"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Recently, there has been a surge in the development and application of advanced reasoning models, with models such as OpenAI-o1 [","element":"span"},{"href":"#id-0","referenceIndex":11,"text":"11","element":"a"},{"text":"], DeepSeek-R1 [","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":"], and Kimi-k1.5 [","element":"span"},{"href":"#id-2","referenceIndex":25,"text":"25","element":"a"},{"text":"] being prime examples. These models exhibit remarkable capability in complex reasoning tasks, such as mathematics, coding, and scientific reasoning, through step-by-step inference and reflection.","element":"span"}],[{"text":"Reinforcement learning has proven to be an effective method for training reasoning models. DeepSeek-R1 [","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":"] demonstrates that reasoning patterns can be effectively elicited through rule-based reinforcement learning. It employs Group Relative Policy Optimization (GRPO) [","element":"span"},{"href":"#id-3","referenceIndex":21,"text":"21","element":"a"},{"text":"], which differs from Proximal Policy Optimization (PPO) [","element":"span"},{"href":"#id-4","referenceIndex":20,"text":"20","element":"a"},{"text":"] by estimating the baseline directly from group scores, eliminating the need for a critic model. However, this necessitates sampling a group of completions for each question, rendering the training process computationally expensive. Subsequently, GRPO computes the reward for each completion using a rule-based reward function and calculates the relative advantage of each completion. 
To ensure training stability, GRPO also calculates the ratio of the predicted probabilities of the policy model, reference model, and old policy model for a group of completions as part of the policy objective function, further increasing the training overhead of reinforcement learning. The substantial training overhead of GRPO limits its training efficiency and scalability. Improving the training efficiency is an important and practical problem.","element":"span"}],[{"text":"The computational expense of GRPO training primarily stems from its core design: generating a large group of completions per prompt for intra-group comparison, which makes the training process computationally expensive. Moreover, the forward computation of GRPO scales by a factor of (3","element":"span"},{"style":{"height":13.6},"width":37,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/1-0.png","element":"img","alt":"×)","inline":true,"padRight":true},{"text":"completion number. It is natural to question whether the contribution of each completion to the reinforcement learning process is equal. In Sec. ","element":"span"},{"href":"#id-5","text":"3.2, ","element":"a"},{"text":"we find that the contribution of each completion is related to its relative advantage. In other words, the contribution of each completion to the policy model training is not equal. This insight inspires us to accelerate GRPO by pruning completions.","element":"span"}],[{"text":"$3c","element":"span"}],[{"text":"We have conducted experiments on multiple challenging benchmarks and models of different scales to evaluate CPPO’s effectiveness. 
Specifically, we train the Qwen-2.5 series models [","element":"span"},{"href":"#id-6","referenceIndex":29,"text":"29","element":"a"},{"text":"], such as Qwen-2.5-1.5B-Instruct and Qwen-2.5-7B-Instruct, on math datasets including Math [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"] and GSM8K [","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"4","element":"a"},{"text":"]. The results demonstrate that CPPO achieves up to ","element":"span"},{"style":{"height":11.2},"width":94.48,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/1-1.png","element":"img","alt":" 7.98×","inline":true,"padRight":true},{"text":"speedup on GSM8K and ","element":"span"},{"style":{"height":11.2},"width":95,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/1-2.png","element":"img","alt":"3.48×","inline":true,"padRight":true},{"text":"on Math while preserving or even enhancing the accuracy compared to the original GRPO.","element":"span"}]]},{"heading":"2 Related Work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Large Scale Reasoning Models","element":"span"},{"text":". Large Language Models (LLM) [","element":"span"},{"href":"#id-9","referenceIndex":1,"text":"1","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":24,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":2,"text":"2","element":"a"},{"text":"] have made impressive progress in various natural language processing tasks. 
Recently, researchers have continued to boost the performance of large language models in reasoning tasks, such as mathematics [","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"], coding [","element":"span"},{"href":"#id-13","referenceIndex":12,"text":"12","element":"a"},{"text":"], and scientific reasoning [","element":"span"},{"href":"#id-14","referenceIndex":19,"text":"19","element":"a"},{"text":"]. Snell ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al. ","element":"span"},{"text":"[","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"23","element":"a"},{"text":"] use dense, process-based verifier reward models and adaptively update the model’s response distribution based on the test-time prompt to enhance reasoning ability. rStar-Math [","element":"span"},{"href":"#id-16","referenceIndex":6,"text":"6","element":"a"},{"text":"] proposes a self-evolved deep thinking approach that significantly boosts the math reasoning capabilities of small LLMs. OpenAI-o1 [","element":"span"},{"href":"#id-0","referenceIndex":11,"text":"11","element":"a"},{"text":"] uses large scale reinforcement learning to train a reasoning model that can solve complex reasoning tasks, achieving state-of-the-art performance on multiple benchmarks. However, the training details of OpenAI-o1 have not been released, making it difficult to replicate and expand the reasoning model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Reinforcement Learning","element":"span"},{"text":". Recently, DeepSeek-R1 [","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":"] has incentivized the reasoning capability of large language models through Group Relative Policy Optimization. 
Inspired by DeepSeek-R1’s success, Logic-RL [","element":"span"},{"href":"#id-17","referenceIndex":28,"text":"28","element":"a"},{"text":"] adopts the REINFORCE++ algorithm to enhance the training efficiency and stability of rule-based reinforcement learning. Hu ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al. ","element":"span"},{"text":"[","element":"span"},{"href":"#id-18","referenceIndex":10,"text":"10","element":"a"},{"text":"] demonstrate that the vanilla PPO algorithm, without a KL divergence constraint, is sufficient to scale up both response length and benchmark performance on reasoning tasks. Nevertheless, these reinforcement learning algorithms universally require multiple completions for each question, resulting in substantial computational costs. There is an urgent need to accelerate the training of reinforcement learning algorithms.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Inference Acceleration for Reasoning Models","element":"span"},{"text":". The enhancement of model inference capabilities is often accompanied by increased computational overhead and longer response times. Recent works have attempted to accelerate the inference process of reasoning models through efficient Chain of Thought (CoT) methods. TokenSkip [","element":"span"},{"href":"#id-19","referenceIndex":27,"text":"27","element":"a"},{"text":"] proposes a controllable CoT compression method that improves reasoning efficiency by selectively skipping less important tokens while preserving critical ones, thus achieving a balance between efficiency and accuracy. Kang ","element":"span"},{"style":{"fontStyle":"italic"},"text":"et al. ","element":"span"},{"text":"[","element":"span"},{"href":"#id-20","referenceIndex":13,"text":"13","element":"a"},{"text":"] utilize a compressor to condense an original longer CoT into a shorter one while maintaining key information and interpretability. 
Although numerous works focus on inference acceleration, the acceleration of reasoning model training remains an underexplored area.","element":"span"}],[{"id":"id-22","style":{"width":"99%"},"width":1578,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-0.png","element":"img"}],[{"text":"Figure 1: Completion number ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"vs. ","element":"figcaption","subtype":"caption"},{"text":"(left) accuracy and (right) training time. Experiments are conducted on GSM8K ","element":"figcaption","subtype":"caption"},{"href":"#id-8","referenceIndex":4,"text":"[4] ","element":"a","subtype":"caption"},{"text":"using Qwen2.5-1.5B-Instruct ","element":"figcaption","subtype":"caption"},{"href":"#id-6","referenceIndex":29,"text":"[29]","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}]]},{"heading":"3 Method","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Preliminary","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Group Relative Policy Optimization","element":"span"},{"text":". GRPO [","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":"] foregoes the critic model that is typically the same size as the policy model and estimates the baseline from group scores instead. 
Specifically, for each question ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"sampled from the dataset distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":")","element":"span"},{"text":", GRPO generates ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"completions ","element":"span"},{"style":{"height":15.79},"width":261.52,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-1.png","element":"img","alt":"{o1, o2, · · · , oG}","inline":true,"padRight":true},{"text":"using the old policy model ","element":"span"},{"style":{"height":11.2},"width":74.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-2.png","element":"img","alt":" πθold","inline":true},{"text":". GRPO then optimizes the policy model ","element":"span"},{"style":{"height":9.6},"width":36.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-3.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"by maximizing the following objective:","element":"span"}],[{"id":"id-21","style":{"width":"95%"},"width":1522,"height":263,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-4.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"84%"},"width":1340,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-5.png","element":"img"}],[{"text":"Here, ","element":"span"},{"style":{"height":7.2},"width":13.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-6.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":14.61},"width":22.48,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-7.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"are hyperparameters. ","element":"span"},{"style":{"height":11.79},"width":70,"height":29.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-8.png","element":"img","alt":"πref","inline":true,"padRight":true},{"text":"is the reference model, which is usually the initial model before reinforcement learning, and ","element":"span"},{"style":{"height":14.21},"width":39,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-9.png","element":"img","alt":" Ai","inline":true,"padRight":true},{"text":"is the advantage computed using a group of rewards ","element":"span"},{"style":{"height":15.79},"width":257.48,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-10.png","element":"img","alt":"{r1, r2, . . . , rG}","inline":true,"padRight":true},{"text":"corresponding to the completions within each group:","element":"span"}],[{"style":{"width":"68%"},"width":1084,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-11.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Rule-based Reward Function","element":"span"},{"text":". 
Instead of training an additional reward model for reward computation, GRPO employs a rule-based reward system that consists of two components:","element":"span"}],[{"style":{"width":"67%"},"width":1078,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-12.png","element":"img"}],[{"text":"Here, the format reward ","element":"span"},{"style":{"height":16},"width":165.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-13.png","element":"img","alt":" Rformat(oi)","inline":true,"padRight":true},{"text":"ensures that the output adheres to the expected structure, while the accuracy reward ","element":"span"},{"style":{"height":16.8},"width":190,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-14.png","element":"img","alt":" Raccuracy(oi)","inline":true,"padRight":true},{"text":"prioritizes correctness with higher reward for accurate responses. The specific reward functions are presented in Appendix ","element":"span"},{"text":"A.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Analyzing Completion Impact on Policy Training","element":"span"},{"text":". From Eq. ","element":"span"},{"href":"#id-21","text":"(1)","element":"a"},{"text":", GRPO’s training overhead scales linearly with the number of completions sampled per question. This arises from the necessity of calculating predicted probabilities for the policy, reference, and old policy models over all completions. For instance, in DeepSeek-Math [","element":"span"},{"href":"#id-3","referenceIndex":21,"text":"21","element":"a"},{"text":"], using 64 completions requires 192 forward passes per question (64","element":"span"},{"style":{"height":8},"width":19.48,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/2-15.png","element":"img","alt":"×","inline":true},{"text":"3), incurring significant computational costs. 
This raises two critical questions: (1) How does the number of completions affect policy model accuracy? Does increasing completions always enhance performance? (2) Do all completions in a group contribute equally to training?","element":"span"}],[{"text":"To address the first question, we conduct an ablation study on GSM8K [","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"4","element":"a"},{"text":"] using Qwen2.5-1.5B-Instruct [","element":"span"},{"href":"#id-6","referenceIndex":29,"text":"29","element":"a"},{"text":"]. Results in Figure ","element":"span"},{"href":"#id-22","text":"1 ","element":"a"},{"text":"show that model accuracy improves with more completions, but training time grows multiplicatively. This indicates diminishing returns on performance gains as training costs increase. Crucially, reducing completions to cut costs risks degrading reasoning capabilities, making it impractical.","element":"span"}],[{"text":"For the second question, we investigate whether completions contribute uniformly to training effectiveness. Our comprehensive analysis in Sec. ","element":"span"},{"href":"#id-5","text":"3.2 ","element":"a"},{"text":"reveals that completion contributions are highly variable, with some samples providing significantly more training signals than others. These findings motivate the development of strategies to identify and prioritize high-value completions, potentially improving training efficiency without compromising model performance.","element":"span"}],[{"id":"id-5","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Completion Contribution Analysis","element":"span"}],[{"text":"To measure the contribution of each completion to the policy model training, we first compute the derivative of the policy objective function in Eq. 
","element":"span"},{"href":"#id-21","text":"(1) ","element":"a"},{"text":"with respect to the model parameters ","element":"span"},{"style":{"height":11.6},"width":69.52,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/3-0.png","element":"img","alt":" θ as:","inline":true}],[{"id":"id-23","style":{"width":"98%"},"width":1564,"height":1212,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/3-1.png","element":"img"}],[{"text":"We analyze the derivative components stressed in Eq. ","element":"span"},{"href":"#id-23","text":"(5)","element":"a"},{"text":". (1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advantage-weighted probability ratio term","element":"span"},{"style":{"height":26.4},"width":298,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/3-2.png","element":"img","alt":"πθ(oi,t|q,oi, 1 + ϵ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.21},"width":120.52,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-6.png","element":"img","alt":" Ai > 0","inline":true},{"text":", the clip function is activated. This action ","element":"span"},{"text":"effectively nullifies the policy model gradient term in Eq. ","element":"span"},{"href":"#id-24","text":"(6)","element":"a"},{"text":", equivalent to pruning all completions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Unifying Single-/Multiple-GPU(s) Settings","element":"span"},{"text":". In a multi-GPUs training scenario, we observe that the number of completions with significant advantages varies across devices. In such cases, the overall training efficiency is bottlenecked by the device processing the largest number of completions—a phenomenon referred to as the bucket effect. 
To mitigate this, for each GPU, we retain only the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"completions with the largest absolute advantage for each question, where","element":"span"}],[{"style":{"width":"60%"},"width":957,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-7.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":16},"width":158.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-8.png","element":"img","alt":" P ∈ (0, 1]","inline":true,"padRight":true},{"text":"denotes the pruning rate. The modified CPPO objective under this strategy is:","element":"span"}],[{"style":{"width":"94%"},"width":1494,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-9.png","element":"img"}],[{"text":"where the summation is taken only over the index set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"I ","element":"span"},{"text":"corresponding to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"completions with the highest absolute advantage values, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i.e.","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"77%"},"width":1228,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-10.png","element":"img"}],[{"text":"In Sec. ","element":"span"},{"href":"#id-26","text":"4.2.2, ","element":"a"},{"text":"we show that completions with high absolute advantage values, either having a correct format and correct answer or an incorrect format and incorrect answer, provide the clearest training signals. Partially correct completions with small absolute advantages contribute minimally or may mislead the policy model. 
Removing these completions from the training process can enhance training efficiency without compromising model performance.","element":"span"}],[{"text":"The key distinction between CPPO and GRPO is that CPPO does not use all completions for the forward computation of the policy model, reference model, and old policy model. Instead, by retaining only those with high absolute advantages for the gradient update, CPPO significantly reduces the computational overhead during forward passes, thereby accelerating the training process.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Parallel Processing through Dynamic Completion Allocation","element":"span"}],[{"text":"In this section, we introduce a novel dynamic completion allocation strategy to further optimize the training efficiency of CPPO. Conventional approaches, such as those employed in GRPO, face inherent limitations due to GPU memory constraints. Specifically, a single device can process a maximum of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"questions per batch, with each question generating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"candidate completions. After pruning, the total number of retained completions per device reduces to ","element":"span"},{"style":{"height":11.39},"width":91,"height":28.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/5-11.png","element":"img","alt":" B ×k","inline":true},{"text":", resulting in suboptimal GPU utilization and underleveraged parallel computing capabilities.","element":"span"}],[{"text":"To address this inefficiency, we dynamically allocate pruned completions from additional questions into the device’s processing pipeline, as illustrated in Figure ","element":"span"},{"href":"#id-27","text":"3. 
","element":"a"},{"text":"This strategy ensures that each device operates at full capacity by continuously populating its memory with high-quality completions derived from both the original and newly introduced questions. Critically, all newly incorporated completions undergo the same rigorous pruning process to maintain consistency.","element":"span"}],[{"id":"id-27","style":{"width":"98%"},"width":1561,"height":709,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/6-0.png","element":"img"}],[{"text":"Figure 3: Illustration of dynamic completion allocation for parallel processing. After pruning completions, completion allocation incorporates important completions from new questions. The ","element":"figcaption","subtype":"caption"},{"style":{"height":14.99},"width":45.52,"height":37.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/6-1.png","element":"img","alt":" onm","inline":true,"padRight":true},{"text":"represents the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"m","element":"figcaption","subtype":"caption"},{"text":"-th completion of the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"n","element":"figcaption","subtype":"caption"},{"text":"-th question.","element":"figcaption","subtype":"caption"}],[{"text":"The benefits of this approach are twofold. First, it maximizes GPU utilization by fully exploiting the device’s parallel computing potential. Second, it enables each device to process a larger number of questions per batch, thereby reducing the total number of training steps required to achieve convergence. This dual optimization boosts training efficiency while maintaining training quality. 
The CPPO algorithm can be found in Appendix ","element":"span"},{"text":"B.","element":"span"}]]},{"heading":"4 Experiments","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Settings","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training Details","element":"span"},{"text":". We implement CPPO on the Open R1 [","element":"span"},{"href":"#id-28","referenceIndex":5,"text":"5","element":"a"},{"text":"] and verl [","element":"span"},{"href":"#id-29","referenceIndex":22,"text":"22","element":"a"},{"text":"] frameworks, utilizing the vLLM inference library [","element":"span"},{"href":"#id-30","referenceIndex":14,"text":"14","element":"a"},{"text":"] for efficient completion generation. Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct are trained on two and four GPUs (each with 80GB memory), respectively. We set ","element":"span"},{"style":{"height":11.2},"width":121,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/6-2.png","element":"img","alt":" ϵ = 0.2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.59},"width":151,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/6-3.png","element":"img","alt":" β = 0.04","inline":true,"padRight":true},{"text":"in Eq. ","element":"span"},{"href":"#id-31","text":"(7)","element":"a"},{"text":", batch size to 16, number of epochs to 1, and learning rate to ","element":"span"},{"style":{"height":13.6},"width":145,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/6-4.png","element":"img","alt":"1 × 10−6","inline":true},{"text":". The policy model temperature is 1, group size is 16, and the maximum completion length is 1024. The prompt templates for CPPO can be found in Appendix ","element":"span"},{"text":"C.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Evaluation Details","element":"span"},{"text":". 
We evaluate the performance on multiple benchmarks with different difficulties, including Math [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"], AIME2024 [","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"18","element":"a"},{"text":"], AMC2023 [","element":"span"},{"href":"#id-33","referenceIndex":17,"text":"17","element":"a"},{"text":"], and GSM8K [","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"4","element":"a"},{"text":"]. We use vLLM [","element":"span"},{"href":"#id-30","referenceIndex":14,"text":"14","element":"a"},{"text":"] to accelerate the evaluation process. The evaluation batch size is set to 10. We use greedy decoding to generate completions for Math and GSM8K. For AIME2024 and AMC2023, we set the temperature as ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"6 ","element":"span"},{"text":"and use ","element":"span"},{"text":"4 ","element":"span"},{"text":"completions for each question. We use Pass@1 accuracy as the evaluation metric.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"4.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Main Results","element":"span"}],[{"text":"We evaluate CPPO by training models of different scales on GSM8K [","element":"span"},{"href":"#id-8","referenceIndex":4,"text":"4","element":"a"},{"text":"] and MATH [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"]. GSM8K contains 8.5K grade-school math problems, while MATH includes 7.5K competition-level problems. For the relatively simpler GSM8K dataset, we use Qwen2.5-1.5B-Instruct; for the more challenging MATH dataset, we use Qwen2.5-7B-Instruct. Each model is evaluated on the corresponding test subset. 
To further assess out-of-distribution reasoning ability, we test Qwen2.5-7B-Instruct on AMC2023 [","element":"span"},{"href":"#id-33","referenceIndex":17,"text":"17","element":"a"},{"text":"] and AIME2024 [","element":"span"},{"href":"#id-32","referenceIndex":18,"text":"18","element":"a"},{"text":"], as these benchmarks are too difficult for Qwen2.5-1.5B-Instruct. Additional results on larger models and different backbones are provided in Appendix ","element":"span"},{"text":"D, ","element":"span"},{"text":"E, ","element":"span"},{"text":"and ","element":"span"},{"text":"F. ","element":"span"},{"text":"Analyses of stability, convergence, and case studies are presented in Appendix ","element":"span"},{"text":"H ","element":"span"},{"text":"and ","element":"span"},{"text":"I.","element":"span"}],[{"id":"id-36","style":{"fontWeight":"bold"},"text":"4.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Performance Comparison","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training on GSM8K","element":"span"},{"text":". As shown in Table ","element":"span"},{"href":"#id-34","text":"1, ","element":"a"},{"text":"CPPO demonstrates clear advantages over GRPO in both accuracy and acceleration ratio. Notably, CPPO achieves comparable or even higher accuracy","element":"span"}],[{"id":"id-34","text":"Table 1: Comparison between GRPO and CPPO on GSM8K test subset. 
We train Qwen2.5-1.5B-","element":"figcaption","subtype":"caption"},{"text":"Instruct on the GSM8K training subset three times independently to calculate the mean and standard deviation, and the number of retained completions after pruning is denoted by ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":327,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-0.png","element":"img","alt":" k = ⌊G × (1 − P)⌋.","inline":true}],[{"style":{"width":"99%"},"width":1574,"height":382,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-1.png","element":"img"}],[{"id":"id-35","text":"Table 2: Comparison of GRPO and CPPO on the MATH test subset, as well as on out-of-distribution ","element":"figcaption","subtype":"caption"},{"text":"benchmarks AMC 2023 and AIME 2024. We train Qwen2.5-7B-Instruct on the MATH training dataset three times independently to calculate the mean and standard deviation, and the number of retained completions after pruning is denoted by ","element":"figcaption","subtype":"caption"},{"style":{"height":15.79},"width":327,"height":39.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-2.png","element":"img","alt":" k = ⌊G × (1 − P)⌋.","inline":true}],[{"style":{"width":"99%"},"width":1570,"height":327,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-3.png","element":"img"}],[{"text":"than GRPO across various pruning rates. At a pruning rate of 87.50%, CPPO attains an accuracy of 80.01%, surpassing GRPO’s 77.38% by 2.63 percentage points.","element":"span"}],[{"text":"For efficiency, CPPO greatly accelerates training. At a pruning rate of 93.75%, it achieves an acceleration ratio of ","element":"span"},{"style":{"height":11.2},"width":94,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-4.png","element":"img","alt":" 7.98×","inline":true},{"text":". The speedup stems from completion pruning and completion allocation. 
Completions pruning reduces computational overhead by discarding less important completions, while the completions allocation strategy maximizes the use of freed memory and leverages the GPU’s parallel processing capabilities. As a result, CPPO processes more questions per batch and reduces the total number of training steps required. These results demonstrate that CPPO not only maintains or improves accuracy but also significantly enhances training efficiency, making it a practical and effective solution for large-scale reasoning model training.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training on MATH","element":"span"},{"text":". As shown in Table ","element":"span"},{"href":"#id-35","text":"2, ","element":"a"},{"text":"CPPO scales well to larger models, achieving up to ","element":"span"},{"style":{"height":11.2},"width":95,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-5.png","element":"img","alt":" 3.48×","inline":true,"padRight":true},{"text":"acceleration on MATH without sacrificing accuracy. For instance, at a pruning rate of 87.5%, CPPO attains 75.95% accuracy, outperforming GRPO (75.26%) while reducing training time by a factor of ","element":"span"},{"style":{"height":11.2},"width":107.48,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/7-6.png","element":"img","alt":" 3.48×.","inline":true}],[{"text":"Furthermore, evaluation on the AMC2023 and AIME2024 benchmarks confirms that CPPO, despite training only on high absolute advantage completions, preserves the model’s generalization ability on out-of-distribution tasks. 
Thus, CPPO not only matches or even surpasses GRPO in enhancing reasoning capabilities but also substantially reduces training time, making it a more efficient alternative.","element":"span"}],[{"id":"id-26","style":{"fontWeight":"bold"},"text":"4.2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"An In-depth Analysis of CPPO’s Higher Accuracy","element":"span"}],[{"text":"Results in Sec. ","element":"span"},{"href":"#id-36","text":"4.2.1 ","element":"a"},{"text":"indicate that CPPO sometimes achieves better performance at higher pruning rates on the GSM8K and MATH datasets. For example, CPPO with a ","element":"span"},{"text":"75% ","element":"span"},{"text":"pruning rate achieves ","element":"span"},{"text":"78","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"76% ","element":"span"},{"text":"accuracy on GSM8K and ","element":"span"},{"text":"76","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"55% ","element":"span"},{"text":"accuracy on MATH, compared to ","element":"span"},{"text":"78","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"15% ","element":"span"},{"text":"and ","element":"span"},{"text":"76","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"01% ","element":"span"},{"text":"accuracy with a ","element":"span"},{"text":"50% ","element":"span"},{"text":"pruning rate, respectively. 
To rule out the possibility that this improvement is merely due to the increased number of questions processed per training step, which is enabled by the completion allocation strategy, we compare CPPO with GRPO under the same number of questions per training step, as shown in Figure ","element":"span"},{"href":"#id-37","text":"4.","element":"a"}],[{"text":"The key difference is that CPPO first generates a group of completions and retains only the top ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"completions for gradient update, whereas GRPO directly generates ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"completions for update. Despite this, CPPO consistently outperforms GRPO, demonstrating that its accuracy gains stem","element":"span"}],[{"id":"id-37","style":{"width":"99%"},"width":1575,"height":375,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/8-0.png","element":"img"}],[{"text":"Figure 4: Evaluation accuracy comparison. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Left","element":"figcaption","subtype":"caption"},{"text":": Qwen2.5-1.5B-Instruct on the GSM8K test subset. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Right","element":"figcaption","subtype":"caption"},{"text":": Qwen2.5-7B-Instruct on the MATH test subset. Here, ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"denotes the number of completions retained by CPPO (or generated by GRPO), and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"b ","element":"figcaption","subtype":"caption"},{"text":"represents the number of questions per training step.","element":"figcaption","subtype":"caption"}],[{"text":"not from processing more questions per step but from the higher quality of retained completions. 
The quality of completions plays a crucial role in training. CPPO selectively retains high absolute advantage completions from a larger pool, whereas GRPO updates the model with directly generated completions, which may vary in quality. This aligns with our completion contribution analysis in Sec. ","element":"span"},{"href":"#id-5","text":"3.2, ","element":"a"},{"text":"which highlights that completions with high absolute advantages contribute more effectively to training. In GRPO with a group size of 16, both high and low absolute advantage completions are used for training. In contrast, CPPO with a pruning rate of ","element":"span"},{"text":"87","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"50% ","element":"span"},{"text":"trains exclusively on high-advantage completions, yet still achieves superior performance.","element":"span"}],[{"text":"To better understand this, we categorize completions into four types: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Correct format and correct answer ","element":"span"},{"text":"— Guides the model to generate accurate completions. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) Incorrect format and incorrect answer ","element":"span"},{"text":"— Helps the model avoid incorrect completions. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) Correct format and incorrect answer ","element":"span"},{"text":"— May mislead the model into generating partially correct responses. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(4) Incorrect format but correct answer ","element":"span"},{"text":"— Similarly, may introduce noise in learning. Specific examples of the four types of completions can be found in Appendix ","element":"span"},{"text":"J.","element":"span"}],[{"text":"The first two types are high-quality completions that provide clear training signals. 
The latter two types, however, are low-quality completions, as their small positive advantage values can mislead the model and introduce noise. Unlike GRPO, which trains on all completions indiscriminately, CPPO filters out these low-quality completions through completions pruning, leading to more efficient learning and better overall performance. As more low-quality completions are removed (pruning rate: 0.00% ","element":"span"},{"style":{"height":8.4},"width":36,"height":21,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/8-1.png","element":"img","alt":" →","inline":true,"padRight":true},{"text":"87.50%), performance improves, as shown in Table ","element":"span"},{"href":"#id-34","text":"1 ","element":"a"},{"text":"and Table ","element":"span"},{"href":"#id-35","text":"2. ","element":"a"},{"text":"However, an excessively high pruning rate can also discard high-quality completions, reducing training effectiveness. This is evident in CPPO’s performance decline at a 93.75% pruning rate, as shown in Table ","element":"span"},{"href":"#id-34","text":"1 ","element":"a"},{"text":"and Table ","element":"span"},{"href":"#id-35","text":"2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"4.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Generalizing CPPO to Other Reinforcement Learning Algorithms","element":"span"}],[{"text":"CPPO reduces training cost by pruning low-quality completions. Therefore, CPPO can be generalized to other group relative policy optimization based algorithms such as DAPO [","element":"span"},{"href":"#id-38","referenceIndex":30,"text":"30","element":"a"},{"text":"] and Dr.GRPO [","element":"span"},{"href":"#id-39","referenceIndex":16,"text":"16","element":"a"},{"text":"]. 
As shown in Table ","element":"span"},{"href":"#id-40","text":"3, ","element":"a"},{"text":"CPPO can be combined with DAPO and Dr.GRPO to further improve training speed and accuracy, demonstrating the strong generalizability of CPPO.","element":"span"}],[{"text":"Table 3: ","element":"figcaption","subtype":"caption"},{"text":"Comparison between different reinforcement learning algorithms on the GSM8K test ","element":"figcaption","subtype":"caption"},{"id":"id-40","text":"subset. We train Qwen2.5-1.5B-Instruct on the GSM8K training subset, and the number of retained ","element":"figcaption","subtype":"caption"},{"text":"completions after pruning is ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":284.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/8-2.png","element":"img","alt":" k = ⌊G×(1−P)⌋","inline":true},{"text":". All experiments are conducted on the verl framework, which supports various RL algorithms.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":1558,"height":466,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/8-3.png","element":"img"}],[{"id":"id-41","text":"Table 4: Evaluation of CPPO with different pruning metrics on GSM8K. “Largest”/“Smallest” prune ","element":"figcaption","subtype":"caption"},{"text":"completions with the highest/lowest absolute advantages, while “Largest*”/“Smallest*” use raw advantage values. “Random” denotes random pruning.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"70%"},"width":1112,"height":378,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-0.png","element":"img"}],[{"id":"id-42","text":"Table 5: Ablation study on the key components of CPPO. 
Experiments are conducted on Math [","element":"figcaption","subtype":"caption"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a","subtype":"caption"},{"text":"] using Qwen2.5-7B-Instruct ","element":"figcaption","subtype":"caption"},{"href":"#id-6","referenceIndex":29,"text":"[29]","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"85%"},"width":1356,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"4.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Ablation Study","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Ablation Study on Pruning Metrics","element":"span"},{"text":". The analysis in Sec. ","element":"span"},{"href":"#id-5","text":"3.2 ","element":"a"},{"text":"reveals that a completion’s impact on policy model training is tied to the absolute value of its advantage— a higher absolute value provides stronger training signals. Based on this insight, we adopt the absolute advantage value as the pruning metric, removing completions with the lowest absolute advantages. As shown in Table ","element":"span"},{"href":"#id-41","text":"4, ","element":"a"},{"text":"“Smallest” achieves the best performance, while “Largest” performs the worst, with “Random” falling in between. Additionally, “Smallest*” and “Largest*”, which prune completions based on raw advantage values rather than absolute values, perform worse than “Smallest”, confirming that absolute advantage values are a more effective pruning metric. The results align with our analysis in Sec. ","element":"span"},{"href":"#id-5","text":"3.2 ","element":"a"},{"text":"and further validate the effectiveness of pruning based on absolute advantage values.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Ablation Study on Key Modules","element":"span"},{"text":". 
As shown in Table ","element":"span"},{"href":"#id-42","text":"5, ","element":"a"},{"text":"by discarding unimportant completions, the completion pruning module improves the training efficiency by ","element":"span"},{"style":{"height":11.2},"width":93.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-2.png","element":"img","alt":" 1.23×","inline":true},{"text":". By fully leveraging the benefits brought by completion pruning and the GPU parallel computing capability, the completion allocation strategy further improves the training efficiency to ","element":"span"},{"style":{"height":11.2},"width":106,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-3.png","element":"img","alt":" 1.65×.","inline":true}]]},{"heading":"5 Limitations and Future Work","paragraphs":[[{"text":"CPPO does not reduce the time required for generating completions. When completion generation dominates the overall training time, the speedup of CPPO may be reduced. However, CPPO can benefit from inference acceleration methods [","element":"span"},{"href":"#id-43","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":15,"text":"15","element":"a"},{"text":"], which are orthogonal to it, to improve training efficiency further. Due to the limited GPU resources of the academic community, we only evaluate the effectiveness of CPPO on relatively small-scale models (fewer than 14B parameters) and math datasets, including Math and GSM8K. In the future, we plan to (1) evaluate CPPO on larger models and more tasks and (2) optimize the completion generation time to further boost training efficiency.","element":"span"}]]},{"heading":"6 Conclusion","paragraphs":[[{"text":"In this paper, we proposed Completion Pruning Policy Optimization (CPPO) to enhance the training efficiency of GRPO-based reasoning models. 
By selectively pruning completions based on their relative advantages with a suitable pruning rate, CPPO reduces computational overhead without compromising model performance. Additionally, our dynamic completion allocation strategy fully leverages the benefits of completion pruning and GPU parallelism, further boosting training speed. The results demonstrate that CPPO achieves up to ","element":"span"},{"style":{"height":11.2},"width":94.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-4.png","element":"img","alt":" 7.98×","inline":true,"padRight":true},{"text":"speedup on GSM8K and ","element":"span"},{"style":{"height":11.2},"width":94.52,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/117932/images/9-5.png","element":"img","alt":" 3.48×","inline":true,"padRight":true},{"text":"on Math while sometimes preserving or even enhancing the accuracy compared to GRPO. These findings highlight CPPO as a practical solution for optimizing reasoning model training at a lower cost.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"This work was supported by the National Science Fund for Distinguished Young Scholars (No.62025603), National Science Fund for Excellent Young Scholars (No. 62222602), the National Natural Science Foundation of China (No. U21B2037, No. U22B2051, No. U23A20383, No. 62176222, No. 62176223, No. 62176226, No. 62072386, No. 62072387, No. 62072389, No. 62002305 and No. 62272401), and the Natural Science Foundation of Fujian Province of China (No. 2021J06003, No.2022J06001).","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-9","text":"[1] ","element":"span"},{"text":"Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 
Gpt-4 technical report, 2023.","element":"span"}],[{"id":"id-12","text":"[2] ","element":"span"},{"text":"Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2309.16609","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-43","text":"[3] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and ","element":"span"},{"text":"John Jumper. Accelerating large language model decoding with speculative sampling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2302.01318","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-8","text":"[4] ","element":"span"},{"text":"Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2110.14168","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-28","text":"[5] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025.","element":"span"}],[{"id":"id-16","text":"[6] ","element":"span"},{"text":"Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.04519","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-1","text":"[7] ","element":"span"},{"text":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.12948","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-7","text":"[8] ","element":"span"},{"text":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NeurIPS","element":"span"},{"text":", 2021.","element":"span"}],[{"text":"[9] ","element":"span"},{"text":"Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.03262","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-18","text":"[10] ","element":"span"},{"text":"Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, and Heung-Yeung Shum Xiangyu Zhang. Open-reasoner-zero: An open source approach to scaling reinforcement learning on the base model, 2025.","element":"span"}],[{"id":"id-0","text":"[11] ","element":"span"},{"text":"Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.16720","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-13","text":"[12] ","element":"span"},{"text":"Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2403.07974","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-20","text":"[13] ","element":"span"},{"text":"Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.11664","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-30","text":"[14] ","element":"span"},{"text":"Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SOSP","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-44","text":"[15] ","element":"span"},{"text":"Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2401.15077","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-39","text":"[16] ","element":"span"},{"text":"Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.20783","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-33","text":"[17] Mathematical Association of America. American mathematics competitions, 2023.","element":"span"}],[{"id":"id-32","text":"[18] ","element":"span"},{"text":"Mathematical Association of America. 
American invitational mathematics examination, 2024.","element":"span"}],[{"id":"id-14","text":"[19] ","element":"span"},{"text":"David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"COLM","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-4","text":"[20] ","element":"span"},{"text":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1707.06347","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-3","text":"[21] ","element":"span"},{"text":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.03300","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-29","text":"[22] ","element":"span"},{"text":"Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv: 2409.19256","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-15","text":"[23] ","element":"span"},{"text":"Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2408.03314","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-11","text":"[24] ","element":"span"},{"text":"Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models, 2023.","element":"span"}],[{"id":"id-2","text":"[25] ","element":"span"},{"text":"Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.12599","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-10","text":"[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- ","element":"span"},{"text":"thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models, 2023.","element":"span"}],[{"id":"id-19","text":"[27] ","element":"span"},{"text":"Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.12067","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-17","text":"[28] ","element":"span"},{"text":"Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.14768","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-6","text":"[29] ","element":"span"},{"text":"An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2412.15115","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-38","text":"[30] ","element":"span"},{"text":"Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.14476","element":"span"},{"text":", 2025.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]