36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"119509","publisher":"neurips","paperJSON":{"title":"Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models","paperID":"119509","avgLineHeight":10.91,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Segment Policy Optimization (SPO)","element":"span"},{"text":", a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) ","element":"span"},{"text":"flexible segment partition; ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) ","element":"span"},{"text":"accurate segment advantage estimation; and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) ","element":"span"},{"text":"policy optimization using segment advantages, including a novel probability-mask strategy. 
We further instantiate SPO for two specific scenarios: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SPO-chain ","element":"span"},{"text":"for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving ","element":"span"},{"text":"6","element":"span"},{"text":"-","element":"span"},{"text":"12 ","element":"span"},{"text":"percentage point improvements in accuracy over PPO and GRPO on GSM8K. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"SPO-tree ","element":"span"},{"text":"for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving ","element":"span"},{"text":"7","element":"span"},{"text":"-","element":"span"},{"text":"11 ","element":"span"},{"text":"percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at ","element":"span"},{"href":"https://github.com/AIFrameResearch/SPO","style":{"fontFamily":"monospace"},"text":"https://github.com/AIFrameResearch/SPO","element":"a"},{"text":".","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Reinforcement learning (RL) has become the cornerstone for training state-of-the-art reasoning large language models (LLMs), such as OpenAI o1 [","element":"span"},{"href":"#id-0","referenceIndex":11,"text":"11","element":"a"},{"text":"], DeepSeek R1 [","element":"span"},{"href":"#id-1","referenceIndex":7,"text":"7","element":"a"},{"text":"], Kimi K1.5 [","element":"span"},{"href":"#id-2","referenceIndex":31,"text":"31","element":"a"},{"text":"], and Qwen3 [","element":"span"},{"href":"#id-3","referenceIndex":34,"text":"34","element":"a"},{"text":"]. 
These models demonstrate RL’s unique ability to cultivate advanced reasoning capabilities, particularly in complex, STEM-related tasks. However, achieving both effectiveness and efficiency in RL training hinges on addressing a fundamental challenge: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"credit assignment, i.e., accurately attributing success or failure to individual actions within a sequence ","element":"span"},{"text":"[","element":"span"},{"href":"#id-4","referenceIndex":30,"text":"30","element":"a"},{"text":"]. In the context of LLMs, where actions correspond to generated tokens, this challenge is even greater due to sparse and delayed rewards that are typically only available at the end of the response. Advantage estimation is the common approach for credit assignment in RL, and existing methods differ in the granularity of advantage estimation, typically operating at two extremes, each with its own limitations.","element":"span"}],[{"text":"Fine-grained token-level methods like Proximal Policy Optimization (PPO) [","element":"span"},{"href":"#id-5","referenceIndex":26,"text":"26","element":"a"},{"text":"] use a critic model to estimate advantages for each token. However, accurately predicting state values poses a particular challenge in LLM training, due to the significant variability among states conditioned on different prompts and the limited per-prompt data to effectively train the critic model. Empirical findings by [","element":"span"},{"href":"#id-6","referenceIndex":12,"text":"12","element":"a"},{"text":"] provide extensive evidence that this difficulty causes the critic model to produce unreliable value predictions, leading to suboptimal credit assignment in practice. Additionally, PPO employs either a separate critic model or an additional critic head to predict the value function, leading to extra memory and computation overhead. 
At the other extreme, coarse-grained trajectory-level methods such as Group Relative Policy Optimization (GRPO) [","element":"span"},{"href":"#id-7","referenceIndex":28,"text":"28","element":"a"},{"text":"] bypass the critic model and compute a single advantage for the entire generated sequence based solely on the final outcome. While this approach is computationally efficient and unbiased, it leads to imprecise credit assignment over long sequences [","element":"span"},{"href":"#id-6","referenceIndex":12,"text":"12","element":"a"},{"text":"]. Applying a single advantage signal to a large number of tokens makes it challenging for the model to identify which specific tokens contribute positively or negatively, resulting in the model failing to reward partial progress or learning redundant solution paths [","element":"span"},{"href":"#id-8","referenceIndex":22,"text":"22","element":"a"},{"text":"]. Moreover, our experimental results, consistent with a concurrent work [","element":"span"},{"href":"#id-9","referenceIndex":35,"text":"35","element":"a"},{"text":"], find that GRPO can rapidly overfit on a fixed training set, with the number of unique responses decreasing and the performance on the validation set saturating early (see Figure ","element":"span"},{"text":"8","element":"span"},{"text":").","element":"span"}],[{"text":"To overcome the limitations of both token-level and trajectory-level methods, we propose ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"S","element":"span"},{"style":{"fontStyle":"italic"},"text":"egment ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":"olicy ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"O","element":"span"},{"style":{"fontStyle":"italic"},"text":"ptimization (SPO)","element":"span"},{"text":", a novel RL framework focusing on mid-grained, segment-level advantage 
estimation. Instead of assigning credit for each token or only at the end of a trajectory, SPO partitions the generated sequence into contiguous segments and estimates advantages at this intermediate granularity. This segment-level estimation offers several key benefits: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(1) Improved credit assignment: ","element":"span"},{"text":"Segment-level feedback provides more localized information than trajectory-level methods, allowing credit assignment to shorter segments. This finer granularity enables the model to reward partial progress even for ultimately unsuccessful responses and penalize redundancy or unnecessary portions within successful responses. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) More accurate advantage estimation: ","element":"span"},{"text":"Compared to token-level advantages, segment-level advantages involve fewer estimation points. This enables SPO to leverage effective Monte Carlo (MC) sampling, yielding accurate and unbiased advantage estimation directly from the policy, thus eliminating the need for an additional, unstable critic model. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) Flexibility and adaptability: ","element":"span"},{"text":"Our segment partition method can be arbitrarily defined without requiring semantic completeness, offering flexible adjustment of granularity from token-level to trajectory-level, also making it adaptable to a wide range of tasks.","element":"span"}],[{"text":"Our SPO framework contains three key components: (","element":"span"},{"style":{"fontWeight":"bold"},"text":"1) Flexible segment partition","element":"span"},{"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(2) Segment advantage estimation via MC","element":"span"},{"text":", and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"(3) Policy optimization using segment advantages","element":"span"},{"text":". 
This modular design allows for various strategies to be implemented within each component, making the framework highly adaptable to different tasks and scenarios. We further instantiate SPO with two specialized instances tailored for different reasoning scenarios. For short chain-of-thought (CoT), we introduce ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SPO-chain","element":"span"},{"text":", which employs a cutpoint-based segment partition strategy and chain-based segment advantage estimation. For long CoT, we introduce ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SPO-tree","element":"span"},{"text":", featuring a novel tree-based segment advantage estimation strategy specifically designed to significantly improve MC sampling efficiency. Additionally, we propose a novel probability-mask optimization strategy that selectively computes the loss for key tokens instead of all tokens within a segment, which can be applied to either SPO-chain or SPO-tree to further enhance credit assignment. Our experimental evaluations demonstrate the effectiveness of the SPO framework and its specialized instantiations. For short CoT, SPO-chain achieves 6–12 percentage point accuracy improvements over PPO and GRPO on GSM8K. For long CoT, SPO-tree achieves 7–11 percentage point accuracy improvements over GRPO on MATH500 under 2K and 4K context evaluation. Our major contributions are summarized as follows:","element":"span"}],[{"text":"1. We propose Segment Policy Optimization (SPO), a novel segment-level RL training framework for LLMs. SPO introduces mid-grained advantage estimation to overcome the limitations of token-level and trajectory-level methods, and features a modular architecture.","element":"span"}],[{"text":"2. 
We introduce several novel techniques integrated within the SPO framework, including cutpoint-based segment partition, tree-based segment advantage estimation, and a policy optimization strategy utilizing probability masks.","element":"span"}],[{"text":"3. Building on the proposed SPO framework, we introduce two specialized instantiations: SPO-chain and SPO-tree, for short and long CoT scenarios, respectively, and demonstrate their effectiveness through extensive experiments on mathematical reasoning benchmarks, GSM8K and MATH.","element":"span"}]]},{"heading":"2 Background","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Language Generation as an MDP. ","element":"span"},{"text":"Language generation tasks can be modeled as a Markov Decision Process (MDP) defined by a tuple ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"P","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":")","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is the state space, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is the action space, ","element":"span"},{"text":"P ","element":"span"},{"text":"represents transition dynamics, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R ","element":"span"},{"text":"is the reward function. 
Specifically, at each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the state ","element":"span"},{"style":{"height":13.79},"width":106,"height":34.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/119509/images/2-0.png","element":"img","alt":"st ∈ S","inline":true,"padRight":true},{"text":"consists of the prompt tokens ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"x ","element":"span"},{"text":"along with previously generated tokens ","element":"span"},{"style":{"height":17.09},"width":349.52,"height":42.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/119509/images/2-1.png","element":"img","alt":" y