38:[["$","audio",null,{"id":"tts"}],["$","$L3d",null,{"paperID":"2510.00911","publisher":"arxiv","paperJSON":{"title":"RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training","paperID":"2510.00911","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"ABSTRACT","paragraphs":[[{"text":"$3e","element":"span"},{"href":"https://github.com/RTkenny/RiskPO","style":{"height":12.8},"width":719.14,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/0-0.png","element":"img","alt":" https://github.com/RTkenny/RiskPO.","inline":true}]]},{"heading":"1 INTRODUCTION","paragraphs":[[{"style":{"width":"46%"},"width":730,"height":542,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/0-1.png","element":"img"}],[{"text":"Figure 1: Pass@32 and Avg@32 learning curves of DeepSeek-R1-Distill-Qwen-1.5B","element":"figcaption","subtype":"caption"}],[{"text":"Since reinforcement learning (RL) provides a unified framework that flexibly accommodates diverse training targets and feedback, it has become a key technique for the post-training of large language models (LLMs). Based on such a foundation, RL with verifiable reward (RLVR) has recently been recognized as an effective paradigm for enhancing the reasoning ability of LLMs. Unlike traditional RL from human feedback, it leverages objective and binary reward signals, providing clear optimization feedback. Maximizing the expected average reward is anticipated to improve task performance of LLMs. Within this framework, a series of efficiency-oriented extensions have been developed from the classical policy-based RL method. 
Among them, Group Relative Policy Optimization ","element":"span"},{"href":"#id-0","referenceIndex":44,"text":"(Shao et al., ","element":"a"},{"href":"#id-0","referenceIndex":44,"text":"2024; ","element":"a"},{"href":"#id-1","referenceIndex":19,"text":"Guo et al., ","element":"a"},{"href":"#id-1","referenceIndex":19,"text":"2025, ","element":"a"},{"text":"GRPO) achieves substantial efficiency gains by discarding redundant structures originally designed for standard RL tasks, and has become the de facto baseline in this area. Since then, several GRPO variants have been proposed; see Section ","element":"span"},{"text":"2 ","element":"span"},{"text":"for details.","element":"span"}],[{"text":"However, RLVR methods that maximize average performance suffer from the fundamental issue of entropy collapse. Prior work shows that models trained via RLVR often experience rapid entropy collapse in the early stages of training, leading to premature convergence and a plateau in performance with little subsequent improvement ","element":"span"},{"href":"#id-2","referenceIndex":14,"text":"(Cui et al., ","element":"a"},{"href":"#id-2","referenceIndex":14,"text":"2025; ","element":"a"},{"href":"#id-3","referenceIndex":18,"text":"Gao et al., ","element":"a"},{"href":"#id-3","referenceIndex":18,"text":"2025)","element":"a"},{"text":". Entropy, as emphasized by several studies, serves as a key indicator of exploration capacity in RL ","element":"span"},{"href":"#id-4","referenceIndex":47,"text":"(Wang et al., ","element":"a"},{"href":"#id-4","referenceIndex":47,"text":"2025b; ","element":"a"},{"href":"#id-5","referenceIndex":8,"text":"Cheng et al., ","element":"a"},{"href":"#id-5","referenceIndex":8,"text":"2025; ","element":"a"},{"href":"#id-6","referenceIndex":22,"text":"Hou et al., ","element":"a"},{"href":"#id-6","referenceIndex":22,"text":"2025)","element":"a"},{"text":". 
Once entropy collapses, the model becomes overconfident, reduces exploration prematurely, and fails to acquire new knowledge effectively. This constrained exploration ultimately limits its reasoning capabilities and overall performance. As a consequence, LLMs do not truly expand their intrinsic reasoning capacity or boundary; the observed improvements often reflect a more efficient sampling of known answers rather than genuinely stronger reasoning skills ","element":"span"},{"href":"#id-7","referenceIndex":51,"text":"(Yue et al., ","element":"a"},{"href":"#id-7","referenceIndex":51,"text":"2025a; ","element":"a"},{"href":"#id-8","referenceIndex":48,"text":"Xiong et al., ","element":"a"},{"href":"#id-8","referenceIndex":48,"text":"2025; ","element":"a"},{"href":"#id-9","referenceIndex":6,"text":"Chen et al., ","element":"a"},{"href":"#id-9","referenceIndex":6,"text":"2025; ","element":"a"},{"href":"#id-3","referenceIndex":18,"text":"Gao et al., ","element":"a"},{"href":"#id-3","referenceIndex":18,"text":"2025)","element":"a"},{"text":". This boundary effect implies that GRPO may only enhance short-horizon performance metrics (e.g., Pass@1) without significantly lifting the capability of the base model.","element":"span"}],[{"id":"id-10","style":{"width":"49%"},"width":782,"height":970,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/1-0.png","element":"img"}],[{"text":"We argue that a key reason behind these challenges is that GRPO employs the mean as its objective, which is inherently misaligned with the goal of improving reasoning ability. ","element":"span"},{"text":"A mean-based objective disproportionately emphasizes common, high-probability generation paths while neglecting rare yet informative reasoning trajectories, leading to premature convergence and limited exploration. 
Even worse, if the model consistently generates only wrong answers for a question, GRPO’s estimated advantage collapses to zero, leaving the model without any learning signal on its weakest areas. This overexploitation of gradients on easier questions yields marginal performance gains, as optimization predominantly reinforces knowledge the model already possesses rather than guiding it toward solving more challenging problems. In contrast, risk-averse optimization objectives, such as Conditional Value-at-Risk (CVaR) or Range Value-at-Risk (RVaR), can encourage the model to explore difficult problems and enhance reasoning abilities. By amplifying gradient signals from low-reward answers, these objectives naturally encourage the policy to reduce overconfidence, diversify its search, and promote novel reasoning strategies. Consequently, risk-based objectives provide an effective handle for mitigating entropy collapse, preventing overfitting to easy problems, and driving genuine improvements of the reasoning boundary.","element":"span"}],[{"text":"We propose Risk-based Policy Optimization (RiskPO), which employs a novel risk-sensitive objective termed Mixed Value-at-Risk (MVaR). Compared with mean-based post-training methods, our risk-based approach demonstrates superior performance in encouraging exploration and fostering stronger reasoning capabilities. The overall framework of RiskPO is illustrated in Figure ","element":"span"},{"href":"#id-10","text":"2. ","element":"a"},{"text":"We summarize our contributions as follows:","element":"span"}],[{"text":"1. To the best of our knowledge, we are the first to incorporate risk measures into the training objective. Since the reward for a single question is binary, we propose grouping multiple questions into a bundle to enrich the feedback signal. 
This bundling is shown to avoid the zero-advantage issue and strengthen gradient signals for hard problems, thereby facilitating exploration.","element":"span"}],[{"text":"2. We provide theoretical results that explain the superiority of the proposed MVaR objective. By analyzing the entropy mechanism, we demonstrate that the risk-averse configuration can effectively mitigate entropy collapse.","element":"span"}],[{"text":"3. We conduct extensive numerical experiments to evaluate the performance of our algorithm. RiskPO consistently outperforms GRPO and other baselines on multiple mathematical reasoning tasks. RiskPO also achieves better performance on Pass@k metrics, indicating its strong capacity for exploration and acquisition of new reasoning skills.","element":"span"}]]},{"heading":"2 RELATED WORKS","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"RL for LLM Post-training. ","element":"span"},{"text":"RL has played a critical role in the post-training phase of LLMs ","element":"span"},{"href":"#id-0","referenceIndex":44,"text":"(Shao ","element":"a"},{"href":"#id-0","referenceIndex":44,"text":"et al., ","element":"a"},{"href":"#id-0","referenceIndex":44,"text":"2024; ","element":"a"},{"href":"#id-11","referenceIndex":11,"text":"Christiano et al., ","element":"a"},{"href":"#id-11","referenceIndex":11,"text":"2017; ","element":"a"},{"href":"#id-12","referenceIndex":28,"text":"Lambert et al., ","element":"a"},{"href":"#id-12","referenceIndex":28,"text":"2022)","element":"a"},{"text":". 
Through verifiable rewards, LLMs learn complex reasoning skills for solving math problems and coding ","element":"span"},{"href":"#id-13","referenceIndex":53,"text":"(Zhao et al., ","element":"a"},{"href":"#id-13","referenceIndex":53,"text":"2025a; ","element":"a"},{"href":"#id-14","referenceIndex":7,"text":"Chen et al., ","element":"a"},{"href":"#id-14","referenceIndex":7,"text":"2024; ","element":"a"},{"href":"#id-15","referenceIndex":23,"text":"Huang et al., ","element":"a"},{"href":"#id-15","referenceIndex":23,"text":"2025)","element":"a"},{"text":". Building on PPO, many works have proposed new methods tailored to the requirements of RLVR. ReMax ","element":"span"},{"href":"#id-16","referenceIndex":30,"text":"(Li et al., ","element":"a"},{"href":"#id-16","referenceIndex":30,"text":"2024) ","element":"a"},{"text":"proposes to use the deterministic output of the LLM as the baseline to reduce variance. DAPO ","element":"span"},{"href":"#id-17","referenceIndex":50,"text":"(Yu et al., ","element":"a"},{"href":"#id-17","referenceIndex":50,"text":"2025) ","element":"a"},{"text":"incorporates four engineering tricks to improve GRPO. VAPO shows that value-based RL methods can also perform well in RLVR ","element":"span"},{"href":"#id-18","referenceIndex":52,"text":"(Yue et al., ","element":"a"},{"href":"#id-18","referenceIndex":52,"text":"2025b)","element":"a"},{"text":". GPG ","element":"span"},{"href":"#id-19","referenceIndex":12,"text":"(Chu et al., ","element":"a"},{"href":"#id-19","referenceIndex":12,"text":"2025) ","element":"a"},{"text":"investigates the normalizing factor in GRPO. 
Several works ","element":"span"},{"href":"#id-4","referenceIndex":47,"text":"(Wang et al., ","element":"a"},{"href":"#id-4","referenceIndex":47,"text":"2025b; ","element":"a"},{"href":"#id-2","referenceIndex":14,"text":"Cui et al., ","element":"a"},{"href":"#id-2","referenceIndex":14,"text":"2025; ","element":"a"},{"href":"#id-5","referenceIndex":8,"text":"Cheng et al., ","element":"a"},{"href":"#id-5","referenceIndex":8,"text":"2025; ","element":"a"},{"href":"#id-20","referenceIndex":46,"text":"Wang et al., ","element":"a"},{"href":"#id-20","referenceIndex":46,"text":"2025a) ","element":"a"},{"text":"investigate the entropy mechanism in RLVR and point out the significance of exploration. GSPO ","element":"span"},{"href":"#id-21","referenceIndex":55,"text":"(Zheng et al., ","element":"a"},{"href":"#id-21","referenceIndex":55,"text":"2025) ","element":"a"},{"text":"and GMPO ","element":"span"},{"href":"#id-22","referenceIndex":54,"text":"(Zhao et al., ","element":"a"},{"href":"#id-22","referenceIndex":54,"text":"2025b) ","element":"a"},{"text":"focus on stabilizing RL training. GSPO uses sequence-wise importance sampling, which performs well on Mixture-of-Experts models. GMPO uses the geometric mean in the gradient estimation, which decreases the importance. ProRL ","element":"span"},{"href":"#id-23","referenceIndex":32,"text":"(Liu et al., ","element":"a"},{"href":"#id-23","referenceIndex":32,"text":"2025a) ","element":"a"},{"text":"shows that with stabilized training, the performance gain has a log-scale relationship with training time. The above methods mainly rely on engineering tricks rather than investigating fundamental dynamics.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Risk-sensitive RL. 
","element":"span"},{"text":"Risk-sensitive RL (see, e.g., ","element":"span"},{"href":"#id-24","referenceIndex":40,"text":"Ren et al., ","element":"a"},{"href":"#id-24","referenceIndex":40,"text":"2024; ","element":"a"},{"href":"#id-25","referenceIndex":38,"text":"Petersen et al., ","element":"a"},{"href":"#id-25","referenceIndex":38,"text":"2019) ","element":"a"},{"text":"seeks to shape the entire reward distribution rather than merely optimizing its mean. ","element":"span"},{"href":"#id-26","referenceIndex":9,"text":"Chow et al. ","element":"a"},{"href":"#id-26","referenceIndex":9,"text":"(2015) ","element":"a"},{"text":"investigates the Markov Decision Process (MDP) under CVaR objective and proposes a dynamicprogramming based solution. ","element":"span"},{"href":"#id-27","referenceIndex":27,"text":"La & Ghavamzadeh ","element":"a"},{"href":"#id-27","referenceIndex":27,"text":"(2013) ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-28","referenceIndex":39,"text":"Prashanth et al. ","element":"a"},{"href":"#id-28","referenceIndex":39,"text":"(2016) ","element":"a"},{"text":"use finite difference to optimize risk measure under the MDP setting. ","element":"span"},{"href":"#id-29","referenceIndex":16,"text":"Dabney et al. ","element":"a"},{"href":"#id-29","referenceIndex":16,"text":"(2018b;","element":"a"},{"href":"#id-30","referenceIndex":15,"text":"a) ","element":"a"},{"text":"introduces stateaction value distribution approximation techniques to improve the effectiveness, which is referred to as distributional RL. 
CVaRPG ","element":"span"},{"href":"#id-31","referenceIndex":45,"text":"(Tamar et al., ","element":"a"},{"href":"#id-31","referenceIndex":45,"text":"2015) ","element":"a"},{"text":"and QPO ","element":"span"},{"href":"#id-32","referenceIndex":25,"text":"(Jiang et al., ","element":"a"},{"href":"#id-32","referenceIndex":25,"text":"2022) ","element":"a"},{"text":"design policy-gradient-style algorithms to optimize CVaR and quantiles, respectively. ","element":"span"},{"href":"#id-33","referenceIndex":26,"text":"Jiang et al. ","element":"a"},{"href":"#id-33","referenceIndex":26,"text":"(2024) ","element":"a"},{"text":"consider a more general case, optimizing distortion risk measures in a policy-gradient manner. Other works treat the risk level as a constraint. ","element":"span"},{"href":"#id-34","referenceIndex":4,"text":"Bertsekas ","element":"a"},{"href":"#id-34","referenceIndex":4,"text":"(1997) ","element":"a"},{"text":"uses a Lagrangian approach to solve RL problems. ","element":"span"},{"href":"#id-35","referenceIndex":5,"text":"Borkar & Jain ","element":"a"},{"href":"#id-35","referenceIndex":5,"text":"(2014) ","element":"a"},{"text":"use CVaR as a constraint, and ","element":"span"},{"href":"#id-36","referenceIndex":10,"text":"Chow et al. ","element":"a"},{"href":"#id-36","referenceIndex":10,"text":"(2018) ","element":"a"},{"text":"develop actor-critic algorithms under quantile and CVaR constraints.","element":"span"}]]},{"heading":"3 RETHINKING RLVR FROM A DISTRIBUTIONAL PERSPECTIVE","paragraphs":[[{"text":"We formalize the post-training problem of RLVR as follows. 
Given an input problem ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"sampled from a dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", an LLM parameterized as ","element":"span"},{"style":{"height":9.19},"width":37.74,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-0.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"generates a response ","element":"span"},{"style":{"height":16},"width":204.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-1.png","element":"img","alt":" y ∼ πθ(·|x)","inline":true},{"text":". A rule-based verifier ","element":"span"},{"style":{"height":16},"width":73.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-2.png","element":"img","alt":" R(·)","inline":true,"padRight":true},{"text":"then evaluates the correctness of the response, returning one if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"is correct and zero otherwise. Notably, no intermediate process-level feedback is provided. The standard objective in this setting is to maximize the expected reward: ","element":"span"},{"style":{"height":17.68},"width":514.26,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-3.png","element":"img","alt":" J (θ) = Ex∼D, y∼πθ(·|x)[R(y)]","inline":true},{"text":". 
With a score-function method, its gradient is given by ","element":"span"},{"style":{"height":16},"width":548.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-4.png","element":"img","alt":" ∇θJ (θ) = E[R(y)∇θ ln πθ(y|x)]","inline":true},{"text":", resulting in a standard RL framework, where a baseline or so-called value model is used for variance reduction.","element":"span"}],[{"text":"As a widely adopted baseline for RLVR, GRPO ","element":"span"},{"href":"#id-0","referenceIndex":44,"text":"(Shao et al., ","element":"a"},{"href":"#id-0","referenceIndex":44,"text":"2024) ","element":"a"},{"text":"replaces the value model with sequence-level standardized rewards computed within a group of responses. We denote by ","element":"span"},{"style":{"height":10.39},"width":56.44,"height":25.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/2-5.png","element":"img","alt":" y 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":19.23},"width":274.39,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-14.png","element":"img","alt":" fθ(F −1θ (β)) > 0","inline":true},{"style":{"fontStyle":"italic"},"text":"; and that the differentiation under the integral sign is justified. 
Then the gradient of RVaR is given by","element":"span"}],[{"style":{"width":"86%"},"width":1363,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-15.png","element":"img"}],[{"text":"Note that when ","element":"span"},{"style":{"height":10.8},"width":112.53,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-16.png","element":"img","alt":" α = 0","inline":true},{"text":", RVaR coincides with the lower-tail CVaR at level ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-17.png","element":"img","alt":" β","inline":true},{"text":", and the gradient in Theorem ","element":"span"},{"href":"#id-37","text":"1 ","element":"a"},{"text":"reduces to ","element":"span"},{"style":{"height":20.1},"width":1011.16,"height":50.26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-18.png","element":"img","alt":" ∇θJRVaR0:β(θ) = β−1E[− (F −1θ (β) − R(y))+∇θ ln πθ(y|x)]","inline":true},{"text":". Since RVaR effectively places a window for control on the reward distribution, it is natural to further combine several such segments to better shape the overall distribution. Accordingly, we introduce a new objective into RLVR, namely Mixed Value-at-Risk (MVaR), which integrates metrics over multiple distributional segments as follows:","element":"span"}],[{"id":"id-39","style":{"width":"79%"},"width":1265,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-19.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.2},"width":137,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-20.png","element":"img","alt":" ω ≥ 0","inline":true,"padRight":true},{"text":"controls the emphasis placed on tail samples during optimization, and high-performance samples are excluded from the current training process. 
Note that ","element":"span"},{"style":{"height":19.85},"width":249.82,"height":49.62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-21.png","element":"img","alt":" JMVaRωα:β(θ) =","inline":true},{"style":{"height":17.68},"width":705.9,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-22.png","element":"img","alt":"(1 + ω)αJRVaR0:α(θ) + (β − α)JRVaRα:β(θ)","inline":true},{"text":". The gradient of ","element":"span"},{"href":"#id-39","text":"(2) ","element":"a"},{"text":"can be derived by Theorem ","element":"span"},{"href":"#id-37","text":"1.","element":"a"}],[{"text":"However, the distributional information from a single question ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is very limited, since the feedback is binary and offers only coarse signals. To obtain a richer source of information, we propose to group several questions into a bundle, i.e., ","element":"span"},{"style":{"height":17.53},"width":394.78,"height":43.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-23.png","element":"img","alt":" X := {xi}Bi=1 ∼ D⊗B","inline":true},{"text":", and calculate the advantage based ","element":"span"},{"text":"on the sum of the individual question scores within the bundle. This aggregation transforms sparse binary feedback into a more informative distribution over bundle scores, enabling finer distinctions between different levels of performance and avoiding zero gradient on difficult questions. 
We then focus on optimizing the MVaR of the bundle score:","element":"span"}],[{"style":{"width":"84%"},"width":1340,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-24.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20.4},"width":319.98,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-25.png","element":"img","alt":" RB = ∑Bi=1 R(yi)","inline":true,"padRight":true},{"text":"denotes the bundle score. For each ","element":"span"},{"style":{"height":16},"width":253.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-26.png","element":"img","alt":" i ∈ {1, . . . , B}","inline":true},{"text":", we sample ","element":"span"},{"style":{"height":13.19},"width":94.73,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-27.png","element":"img","alt":" Yi :=","inline":true},{"style":{"height":19.93},"width":131.46,"height":49.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-28.png","element":"img","alt":"{yij}Gj=1","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":19.53},"width":235.7,"height":48.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-29.png","element":"img","alt":" yij ∼ πθ(·|xi)","inline":true,"padRight":true},{"text":"i.i.d., and define ","element":"span"},{"style":{"height":17.53},"width":240,"height":43.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/3-30.png","element":"img","alt":" Y := {Yi}Bi=1","inline":true},{"text":". 
Then we can generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"bundles ","element":"span"},{"text":"without overlaps from the ","element":"span"},{"style":{"height":10.8},"width":111.02,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-0.png","element":"img","alt":" G × B","inline":true,"padRight":true},{"text":"responses of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"questions. The gradient can be calculated by","element":"span"}],[{"style":{"width":"76%"},"width":1216,"height":123,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":20},"width":1049.78,"height":50.01,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-2.png","element":"img","alt":" Aj = −(1 + ω)(F −1θ (α) − RBj)+ + g(RBj, F −1θ (α), F −1θ (β))","inline":true,"padRight":true},{"text":"is the bundle-wise advantage under the MVaR objective, ","element":"span"},{"style":{"height":23.32},"width":365.3,"height":58.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-3.png","element":"img","alt":" RBj = ∑Bi=1 R(yiξi,j)","inline":true,"padRight":true},{"text":"is the bundle-wise score, ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-4.png","element":"img","alt":" ξ","inline":true,"padRight":true},{"text":"is a permutation of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, G","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"drawn independently as ","element":"span"},{"style":{"height":16},"width":258.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-5.png","element":"img","alt":" ξi ∼ Unif(SG)","inline":true,"padRight":true},{"text":"for every ","element":"span"},{"style":{"height":13.6},"width":93.24,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-6.png","element":"img","alt":" i, SG","inline":true,"padRight":true},{"text":"is the symmetric group on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"elements, and ","element":"span"},{"style":{"height":15.59},"width":51.16,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-7.png","element":"img","alt":" ξi,j","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th element of the permutation. This construction yields ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"disjoint bundles: the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th bundle uses ","element":"span"},{"style":{"height":21.09},"width":162.26,"height":52.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-8.png","element":"img","alt":" {yiξi,j}Bi=1","inline":true},{"text":", so that for each fixed ","element":"span"},{"style":{"height":20.7},"width":313.78,"height":51.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-9.png","element":"img","alt":" i, {yiξi,1, . . . , yiξi,G}","inline":true,"padRight":true},{"text":"is a permutation of ","element":"span"},{"style":{"height":17.37},"width":212.24,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-10.png","element":"img","alt":"{yi1, . . . 
, yiG}","inline":true},{"text":", i.e., every answer is used only once (without replacement).","element":"span"}],[{"text":"To ensure stable improvement ","element":"span"},{"href":"#id-40","referenceIndex":42,"text":"(Schulman et al., ","element":"a"},{"href":"#id-40","referenceIndex":42,"text":"2017; ","element":"a"},{"href":"#id-41","referenceIndex":41,"text":"2015) ","element":"a"},{"text":"with multiple updates per bundle-wise MVaR objective evaluation, we adopt a trust-region style update with clipping and sequence-level importance sampling ","element":"span"},{"href":"#id-21","referenceIndex":55,"text":"(Zheng et al., ","element":"a"},{"href":"#id-21","referenceIndex":55,"text":"2025)","element":"a"},{"text":". Since the reward in RLVR is only available at the sequence level, i.e., ","element":"span"},{"style":{"height":16.18},"width":31.95,"height":40.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-11.png","element":"img","alt":" yi","inline":true},{"text":", it is natural to define importance weights also at the sequence (response) level and then aggregate them into the bundle objective. 
Formally, given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"problems ","element":"span"},{"style":{"height":17.53},"width":221.27,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-12.png","element":"img","alt":" X = {xi}Bi=1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"responses per problem ","element":"span"},{"style":{"height":19.93},"width":221.02,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-13.png","element":"img","alt":" Yi = {yij}Gj=1","inline":true},{"text":", we independently draw ","element":"span"},{"style":{"height":16},"width":253.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-14.png","element":"img","alt":" ξi ∼ Unif(SG)","inline":true,"padRight":true},{"text":"for each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", yielding ","element":"span"},{"style":{"fontStyle":"italic"},"text":"G ","element":"span"},{"text":"bundles: ","element":"span"},{"style":{"height":21.09},"width":508.98,"height":52.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-15.png","element":"img","alt":" Pj = {yiξi,j}Bi=1, j = 1, . . . , G,","inline":true,"padRight":true},{"text":"where every response is used exactly once. 
We ","element":"span"},{"text":"then construct the clipped MVaR objective at the bundle level, which constitutes the final loss for backpropagation:","element":"span"}],[{"style":{"width":"96%"},"width":1522,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-16.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":36.81},"width":479.02,"height":92.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-17.png","element":"img","alt":" sij(θ) = (πθ(yiξi,j |xi) / πθ′(yiξi,j |xi))^(1/|yiξi,j |)","inline":true,"padRight":true},{"text":"is the sequence-wise importance sampling ratio.","element":"span"}],[{"text":"Every token within the same bundle shares the same MVaR-based advantage ","element":"span"},{"style":{"height":14.18},"width":69.14,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-18.png","element":"img","alt":" A(j)","inline":true},{"text":", ensuring that optimization is aligned with the unit of reward (the bundle score) and directs training toward the left tail of the performance distribution. We track ","element":"span"},{"style":{"height":19.23},"width":131.1,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-19.png","element":"img","alt":" F −1θ (α)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.23},"width":130.11,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-20.png","element":"img","alt":" F −1θ (β)","inline":true,"padRight":true},{"text":"in an online manner. After substituting the tracked quantiles into the advantage and deriving the gradient, we update model parameters accordingly. Therefore, RiskPO can be implemented as a two-timescale stochastic approximation algorithm. 
The pseudocode of the proposed algorithm is provided in Algorithm ","element":"span"},{"href":"#id-42","text":"1.","element":"a"}],[{"id":"id-42","style":{"width":"100%"},"width":1584,"height":683,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/4-21.png","element":"img"}]]},{"heading":"5 ENTROPY MECHANISM FOR RISK-SENSITIVE OBJECTIVE","paragraphs":[[{"text":"GRPO suffers from entropy collapse, where the policy entropy rapidly decreases during training. This premature reduction in entropy limits exploration of alternative reasoning paths, thereby constraining performance improvement and reducing the likelihood of discovering correct solutions. In this section, we conduct a per-step analysis for the change of policy entropy in the optimization and give a theoretical guarantee that our RVaR policy gradient can mitigate the entropy collapse issue.","element":"span"}],[{"text":"Following the standard framework in policy-gradient literature (see, e.g., ","element":"span"},{"href":"#id-43","referenceIndex":2,"text":"Agarwal et al., ","element":"a"},{"href":"#id-43","referenceIndex":2,"text":"2021; ","element":"a"},{"href":"#id-44","referenceIndex":43,"text":"Shani ","element":"a"},{"href":"#id-44","referenceIndex":43,"text":"et al., ","element":"a"},{"href":"#id-44","referenceIndex":43,"text":"2020; ","element":"a"},{"href":"#id-45","referenceIndex":1,"text":"Abbasi-Yadkori et al., ","element":"a"},{"href":"#id-45","referenceIndex":1,"text":"2019)","element":"a"},{"text":", we conduct theoretical analysis under a tabular softmax formulation with deterministic sequence-level rewards. Specifically, we consider an input set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"and an output set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y","element":"span"},{"text":". 
The actor is parameterized by a matrix ","element":"span"},{"style":{"height":14.98},"width":223.03,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-0.png","element":"img","alt":" θ ∈ R|X|×|Y|","inline":true},{"text":", where each entry ","element":"span"},{"style":{"height":15.59},"width":62.26,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-1.png","element":"img","alt":" θx,y","inline":true,"padRight":true},{"text":"is the logit for choosing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". The policy is thus ","element":"span"},{"style":{"height":27.16},"width":432.33,"height":67.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-2.png","element":"img","alt":" πθ(y|x) = exp(zx,y)�u∈Yx exp(zx,u)","inline":true},{"text":", and its conditional","element":"span"}],[{"text":"entropy is ","element":"span"},{"style":{"height":19.18},"width":623.83,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-3.png","element":"img","alt":" H(πθ|x) = − �y πθ(y|x) log πθ(y|x)","inline":true},{"text":". Let ","element":"span"},{"style":{"height":16},"width":140.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-4.png","element":"img","alt":" Aθ(x, y)","inline":true,"padRight":true},{"text":"be the advantage value associated ","element":"span"},{"text":"with the chosen algorithm. We begin with the following proposition, which links entropy dynamics to the covariance between the advantage and the log-probability of the output.","element":"span"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"Proposition 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Fix a prompt ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and a finite set of complete sequences ","element":"span"},{"style":{"height":13.19},"width":44.61,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-5.png","element":"img","alt":" Yx","inline":true},{"style":{"fontStyle":"italic"},"text":". Consider a naturalgradient step, i.e., ","element":"span"},{"style":{"height":15.59},"width":290.24,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-6.png","element":"img","alt":" θk+1 = θk + η∆k","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16.79},"width":317.58,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-7.png","element":"img","alt":" ∆k,x,y = Aθk(x, y)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"otherwise zero, then","element":"span"}],[{"style":{"width":"97%"},"width":1543,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-8.png","element":"img"}],[{"text":"$3f","element":"span"},{"style":{"height":16},"width":477.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-9.png","element":"img","alt":"ψ(r) = E[log πθ(y|x)|R = r]","inline":true,"padRight":true},{"text":"denote the conditional log-likelihood.","element":"span"}],[{"id":"id-47","style":{"fontWeight":"bold"},"text":"Assumption 1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The conditional log-probability of output ","element":"span"},{"style":{"height":16},"width":77.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-10.png","element":"img","alt":" ψ(r)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is non-decreasing for ","element":"span"},{"style":{"height":19.23},"width":204.6,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-11.png","element":"img","alt":" r ≥ F −1θ (β)","inline":true},{"style":{"fontStyle":"italic"},"text":", non-increasing for ","element":"span"},{"style":{"height":19.23},"width":203.33,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-12.png","element":"img","alt":" r ≤ F −1θ (α)","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":19.23},"width":681.08,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-13.png","element":"img","alt":" ψ(F −1θ (α)), ψ(F −1θ (β)) ≥ E[log πθ(y|x)]","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-46","style":{"width":"48%"},"width":767,"height":529,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-14.png","element":"img"}],[{"text":"Figure 3: ","element":"figcaption","subtype":"caption"},{"text":"Log-probabilities as a function of reward quantile levels for DeepSeek-R1-Distill-Qwen-1.5B on DAPOMATH-17K.","element":"figcaption","subtype":"caption"}],[{"text":"Intuitively, this assumption captures the behavior of a pre-trained base model. For relatively easier problems in the upper tail of the reward distribution, the model exhibits well-calibrated confidence, assigning higher probabilities to correct answers (upper-tail monotonicity). 
In contrast, for harder problems in the lower tail, the model often fails systematically: it allocates substantial probability mass to incorrect outputs, effectively making confident but consistently wrong predictions (lower-tail monotonicity). ","element":"span"},{"text":"We empirically validate this assumption using DeepSeek-R1-Distill-Qwen-1.5B on the training set. For each question, we sample 16 responses from the model and compute the mean reward across these responses. Recall that the length-normalized sequence-level log-probability is de-","element":"span"}],[{"text":"fined as ","element":"span"},{"style":{"height":21.2},"width":724.49,"height":52.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/5-15.png","element":"img","alt":" log πθ(y|x) = |y|−1 �|y|t=1 log ˜πθ(yt|yt} − 1{−X>t}�dt","inline":true},{"text":", which holds","element":"span"}],[{"text":"for any real-valued ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":". Multiplying by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"and taking expectations, we may apply the Tonelli–Fubini","element":"span"}],[{"text":"theorem under the integrability condition ","element":"span"},{"style":{"height":16},"width":232.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-3.png","element":"img","alt":" E[|XY |] < ∞","inline":true},{"text":", which yields","element":"span"}],[{"style":{"width":"54%"},"width":857,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-4.png","element":"img"}],[{"text":"Applying the same transformation to ","element":"span"},{"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"X","element":"span"},{"text":"]","element":"span"},{"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"text":"] 
","element":"span"},{"text":"and subtracting, we obtain","element":"span"}],[{"style":{"width":"66%"},"width":1057,"height":94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-5.png","element":"img"}],[{"text":"Changing the integral variable in the second term and using the identity ","element":"span"},{"style":{"height":16},"width":349.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-6.png","element":"img","alt":" − Cov(1 − Z, Y ) =","inline":true}],[{"text":"Cov(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z, Y ","element":"span"},{"text":")","element":"span"},{"text":", we merge the two integrals and obtain the equality ","element":"span"},{"href":"#id-61","text":"(7)","element":"a"},{"text":". The distinction between strict","element":"span"}],[{"text":"and non-strict inequalities only affects a countable set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"value","element":"span"},{"href":"#id-61","text":"s a","element":"a"},{"text":"nd does not change the integral","element":"span"}],[{"text":"under the Lebesgue measure, which completes the proof.","element":"span"}],[{"text":"Next, we present the proof of Theorem ","element":"span"},{"href":"#id-60","text":"2. ","element":"a"},{"text":"For notational clarity, we focus on a single repre-","element":"span"}],[{"text":"sentative output ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"from the model ","element":"span"},{"style":{"height":16},"width":117.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-7.png","element":"img","alt":" πθ(·|x)","inline":true,"padRight":true},{"text":"rather than a bundle. 
","element":"span"},{"text":"The advantage values for the","element":"span"}],[{"text":"MVaR- and mean-based objectives are given by ","element":"span"},{"style":{"height":22.27},"width":775.24,"height":55.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-8.png","element":"img","alt":" AMVaRωα:β = −(1 + ω)(F −1θ (α) − R(y))+ +","inline":true}],[{"style":{"height":19.23},"width":430.08,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-9.png","element":"img","alt":"g(R(y), F −1θ (α), F −1θ (β))","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":423.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-10.png","element":"img","alt":" AMean = R(y) − E[R(y)]","inline":true},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-60","style":{"fontStyle":"italic"},"text":"2. ","element":"a"},{"text":"Recall that the positive part funct","element":"span"},{"href":"#id-38","referenceIndex":56,"text":"ion ","element":"a"},{"text":"can be expressed via the layer–cake rep-","element":"span"}],[{"text":"resentation: ","element":"span"},{"href":"#id-60","style":{"height":21.78},"width":455.74,"height":54.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-11.png","element":"img","alt":" (z − a)+ =� +∞a 1{z>t} dt.","inline":true,"padRight":true},{"text":"With Lemma ","element":"span"},{"href":"#id-38","referenceIndex":56,"text":"1, ","element":"a"},{"text":"we can derive the covariances as below","element":"span"}],[{"style":{"width":"91%"},"width":1451,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-12.png","element":"img"}],[{"text":"where, ","element":"span"},{"text":"for notational convenience, ","element":"span"},{"text":"we denote 
","element":"span"},{"style":{"height":16},"width":365.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-13.png","element":"img","alt":" SF := log πθ(y|x)","inline":true},{"text":". ","element":"span"},{"text":"Define ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":":=","element":"span"}],[{"style":{"height":17.68},"width":327.32,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-14.png","element":"img","alt":"Cov(1{R(y)>t}, SF)","inline":true,"padRight":true},{"text":"and recall that the density of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"R","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") ","element":"span"},{"text":"is ","element":"span"},{"style":{"height":14},"width":34.47,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-15.png","element":"img","alt":" fθ","inline":true},{"text":". 
Then, we compute the derivative","element":"span"}],[{"text":"of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"as follows:","element":"span"}],[{"style":{"width":"66%"},"width":1055,"height":537,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-16.png","element":"img"}],[{"text":"Under Assumption ","element":"span"},{"href":"#id-47","text":"1, ","element":"a"},{"style":{"height":16},"width":223.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-17.png","element":"img","alt":" ψ(t) ≥ E[SF]","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":19.23},"width":198.04,"height":48.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-18.png","element":"img","alt":" t ≥ F −1θ (β)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":19.23},"width":199.05,"height":48.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-19.png","element":"img","alt":" t ≤ F −1θ (α)","inline":true},{"text":". 
Consequently, ","element":"span"},{"style":{"height":16},"width":152.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-20.png","element":"img","alt":" k′(t) ≤ 0","inline":true,"padRight":true},{"text":"for","element":"span"}],[{"style":{"height":19.23},"width":197.64,"height":48.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-21.png","element":"img","alt":"t ≥ F −1θ (β)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"href":"#id-47","style":{"height":19.23},"width":198.6,"height":48.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-22.png","element":"img","alt":" t ≤ F −1θ (α)","inline":true},{"text":", which implies that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":") ","element":"span"},{"text":"is non-increasing in both the upper and lower","element":"span"}],[{"style":{"width":"85%"},"width":1358,"height":235,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/15-23.png","element":"img"}],[{"text":"Analogously, we can show that ","element":"span"},{"style":{"height":16},"width":612.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-0.png","element":"img","alt":" limt→−∞ k(t) = E[SF] − E[SF] = 0","inline":true},{"text":". 
By the monotonicity in the","element":"span"}],[{"text":"tails, we obtain ","element":"span"},{"style":{"height":16},"width":149.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-1.png","element":"img","alt":" k(t) ≥ 0","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":19.23},"width":207,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-2.png","element":"img","alt":" t ≥ F −1θ (β)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":149.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-3.png","element":"img","alt":" k(t) ≤ 0","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":19.23},"width":207.98,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-4.png","element":"img","alt":" t ≤ F −1θ (α)","inline":true},{"text":". Therefore, noting that","element":"span"}],[{"style":{"height":19.39},"width":459.2,"height":48.48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-5.png","element":"img","alt":"k(t) = Cov�1{R(y)>t}, SF�","inline":true},{"text":"preserves its sign on both tails, we can further obtain","element":"span"}],[{"style":{"width":"72%"},"width":1144,"height":91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-6.png","element":"img"}],[{"text":"which completes the proof.","element":"span"}],[{"id":"id-48","text":"A.4 ","element":"span"},{"text":"S","element":"span"},{"text":"UPPLEMENTARY ","element":"span"},{"text":"T","element":"span"},{"text":"HEOREM OF ","element":"span"},{"text":"S","element":"span"},{"text":"ECTION ","element":"span"},{"text":"5","element":"span"}],[{"text":"With the different treatment of the tail in the reward distribution, we obtain the following covariance","element":"span"}],[{"text":"result between the resulting advantage value and the 
output log-probability.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If Assumption ","element":"span"},{"href":"#id-47","style":{"fontStyle":"italic"},"text":"1 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"holds with ","element":"span"},{"style":{"height":16},"width":217.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-7.png","element":"img","alt":" E[|SF|] < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", and ","element":"span"},{"style":{"height":10},"width":89.54,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-8.png","element":"img","alt":" g1, g2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are nondecreasing and differen-","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"tiable with ","element":"span"},{"style":{"height":16},"width":218.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-9.png","element":"img","alt":" g′1(t) ≥ g′2(t)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on ","element":"span"},{"href":"#id-47","style":{"height":19.23},"width":451.81,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-10.png","element":"img","alt":" [F −1θ (β), ∞), g1(t) = g2(t)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on ","element":"span"},{"style":{"height":19.23},"width":244.67,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-11.png","element":"img","alt":" (−∞, F −1θ (β)]","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", then we have","element":"span"}],[{"style":{"width":"45%"},"width":713,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-12.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Note that","element":"span"}],[{"style":{"width":"91%"},"width":1443,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-13.png","element":"img"}],[{"text":"where the first equalit","element":"span"},{"href":"#id-60","text":"y ","element":"a"},{"text":"is due to Lemma ","element":"span"},{"href":"#id-38","referenceIndex":56,"text":"1 ","element":"a"},{"text":"and the se","element":"span"},{"href":"#id-47","text":"con","element":"a"},{"text":"d equality is integration by substitution.","element":"span"}],[{"text":"The proof of Theorem ","element":"span"},{"href":"#id-60","text":"2 ","element":"a"},{"text":"implies that under Assumption ","element":"span"},{"href":"#id-47","text":"1, ","element":"a"},{"text":"we have ","element":"span"},{"style":{"height":16},"width":140.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-14.png","element":"img","alt":" k(t) ≥ 0","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":19.23},"width":197.61,"height":48.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-15.png","element":"img","alt":" t ≥ F −1θ (β)","inline":true},{"text":", thus","element":"span"}],[{"style":{"width":"81%"},"width":1298,"height":104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/16-16.png","element":"img"}],[{"text":"which gives the desired result.","element":"span"}],[{"text":"A symmetric conclusion can be derived analogously for the treatment of the other tail.","element":"span"}],[{"id":"id-49","text":"B ","element":"span"},{"text":"E","element":"span"},{"text":"XPERIMENTAL ","element":"span"},{"text":"S","element":"span"},{"text":"ETUP AND ","element":"span"},{"text":"S","element":"span"},{"text":"UPPLEMENTARY ","element":"span"},{"text":"R","element":"span"},{"text":"ESULTS","element":"span"}],[{"text":"B.1 ","element":"span"},{"text":"E","element":"span"},{"text":"XPERIMENTAL 
","element":"span"},{"text":"S","element":"span"},{"text":"ETUP","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Model. ","element":"span"},{"text":"We focus on mathematics reasoning, code generation, and multi-modal reasoning. We","element":"span"}],[{"text":"use DeepSeek-R1-Distill-Qwen-1.5B ","element":"span"},{"href":"#id-1","referenceIndex":19,"text":"(Guo et al., ","element":"a"},{"href":"#id-1","referenceIndex":19,"text":"2025) ","element":"a"},{"text":"as our base model to evaluate different","element":"span"}],[{"text":"algorithms on hard-level mathematics reasoning and code generation. On easy-level mathematics","element":"span"}],[{"text":"reasoning, we use Qwen2.5-Math-1.5B-Instruct ","element":"span"},{"href":"#id-62","referenceIndex":49,"text":"(Yang et al., ","element":"a"},{"href":"#id-62","referenceIndex":49,"text":"2024) ","element":"a"},{"text":"as the base model. On multi-","element":"span"}],[{"text":"modal reasoning, we use Qwen2.5-VL-Instruct-3B ","element":"span"},{"href":"#id-63","referenceIndex":3,"text":"(Bai et al., ","element":"a"},{"href":"#id-63","referenceIndex":3,"text":"2025) ","element":"a"},{"text":"as the base model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Training. ","element":"span"},{"text":"For the hard-level mathematics reasoning tasks. We use the DAPO-math-17k as the","element":"span"}],[{"text":"training set. For the easy-level, we use MATH ","element":"span"},{"href":"#id-64","referenceIndex":21,"text":"(Hendrycks et al., ","element":"a"},{"href":"#id-64","referenceIndex":21,"text":"2021) ","element":"a"},{"text":"and GSM8K ","element":"span"},{"href":"#id-65","referenceIndex":13,"text":"(Cobbe et al.,","element":"a"}],[{"href":"#id-65","referenceIndex":13,"text":"2021)","element":"a"},{"text":". 
For multi-modal reasoning, we use Geometry3K (Geo3K, ","element":"span"},{"href":"#id-66","referenceIndex":34,"text":"Lu et al., ","element":"a"},{"href":"#id-66","referenceIndex":34,"text":"2021)","element":"a"},{"text":". For code genera-","element":"span"}],[{"text":"tion, we train the models on Archer-6K ","element":"span"},{"href":"#id-20","referenceIndex":46,"text":"(Wang et al., ","element":"a"},{"href":"#id-20","referenceIndex":46,"text":"2025a)","element":"a"},{"text":". We set the clipping threshold ","element":"span"},{"style":{"height":10.8},"width":120.3,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-0.png","element":"img","alt":" ϵ = 0.2","inline":true},{"text":".","element":"span"}],[{"text":"KL penalty and entropy regularization are omitted from the loss objective. We use vLLM as the","element":"span"}],[{"text":"inference backend and FSDP as the training backend. We set the temperature to ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"8 ","element":"span"},{"text":"and top-p to","element":"span"}],[{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"0","element":"span"},{"text":", and the maximum output length to ","element":"span"},{"text":"3072","element":"span"},{"text":". We generate ","element":"span"},{"text":"10 ","element":"span"},{"text":"responses for each problem. The batch","element":"span"}],[{"text":"size is ","element":"span"},{"text":"512","element":"span"},{"text":", and the mini-batch size is ","element":"span"},{"text":"128","element":"span"},{"text":". 
For the quantile levels, we set ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-1.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"to ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"8 ","element":"span"},{"text":"respec-","element":"span"}],[{"text":"tively. The bundle size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"is set to ","element":"span"},{"text":"5","element":"span"},{"text":". The mixing parameter of MVaR is ","element":"span"},{"style":{"height":11.2},"width":132.38,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-3.png","element":"img","alt":" ω = 0.5","inline":true},{"text":". All training","element":"span"}],[{"text":"procedures are carried out on a Linux server equipped with 8 NVIDIA H20 GPUs, each providing","element":"span"}],[{"text":"96 GB of memory.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Evaluation. 
","element":"span"},{"text":"For hard-level mathematics reasoning, We evaluate on six math reasoning datasets:","element":"span"}],[{"text":"AIME24 ","element":"span"},{"href":"#id-67","referenceIndex":36,"text":"(MAA, ","element":"a"},{"href":"#id-67","referenceIndex":36,"text":"2024) ","element":"a"},{"text":"and AIME25 ","element":"span"},{"href":"#id-68","referenceIndex":37,"text":"(MAA, ","element":"a"},{"href":"#id-68","referenceIndex":37,"text":"2025) ","element":"a"},{"text":"with ","element":"span"},{"text":"30 ","element":"span"},{"text":"problems from the American Invita-","element":"span"}],[{"text":"tional Mathematics Examination, both targeting advanced pre-collegiate reasoning; AMC23 ","element":"span"},{"href":"#id-69","referenceIndex":35,"text":"(MAA,","element":"a"}],[{"href":"#id-69","referenceIndex":35,"text":"2023) ","element":"a"},{"text":"with ","element":"span"},{"text":"83 ","element":"span"},{"text":"problems from the American Mathematics Competitions, testing creative algebraic,","element":"span"}],[{"text":"geometric, and number-theoretic skills; MATH-500 ","element":"span"},{"href":"#id-70","referenceIndex":31,"text":"(Lightman et al., ","element":"a"},{"href":"#id-70","referenceIndex":31,"text":"2023) ","element":"a"},{"text":"with 500 graduate-level","element":"span"}],[{"text":"problems from the original MATH dataset covering algebra, geometry, and number theory; Minerva","element":"span"}],[{"text":"Math ","element":"span"},{"href":"#id-71","referenceIndex":29,"text":"(Lewkowycz et al., ","element":"a"},{"href":"#id-71","referenceIndex":29,"text":"2022) ","element":"a"},{"text":"with ","element":"span"},{"text":"272 ","element":"span"},{"text":"undergraduate-level quantitative reasoning problems; and","element":"span"}],[{"text":"OlympiadBench ","element":"span"},{"href":"#id-72","referenceIndex":20,"text":"(He et al., ","element":"a"},{"href":"#id-72","referenceIndex":20,"text":"2024) ","element":"a"},{"text":"with 
","element":"span"},{"text":"675 ","element":"span"},{"text":"Olympiad-style problems. For easy-level math tasks and","element":"span"}],[{"text":"the multi-modal task, we follow the train-test split in the original datasets. For code generations, we","element":"span"}],[{"text":"evaluate trained models on LiveCodeBench v5 (LCB, ","element":"span"},{"href":"#id-73","referenceIndex":24,"text":"Jain et al., ","element":"a"},{"href":"#id-73","referenceIndex":24,"text":"2024)","element":"a"},{"text":".","element":"span"}],[{"text":"B.2 ","element":"span"},{"text":"A","element":"span"},{"text":"BLATION ON THE ","element":"span"},{"text":"M","element":"span"},{"text":"IXING ","element":"span"},{"text":"P","element":"span"},{"text":"ARAMETER","element":"span"}],[{"text":"We fix the quantile level, and perturb the ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-4.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"to investigate how different attention on the lower tail","element":"span"}],[{"text":"influences the performance. The ablation of ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-5.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"is shown in Table ","element":"span"},{"href":"#id-74","referenceIndex":150,"text":"5. 
","element":"a"},{"text":"Setting ","element":"span"},{"style":{"height":10.8},"width":130.43,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-6.png","element":"img","alt":" ω = 0.0","inline":true,"padRight":true},{"text":"would reduce the","element":"span"}],[{"text":"MVaR objective, ","element":"span"},{"style":{"height":19.85},"width":198.13,"height":49.63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-7.png","element":"img","alt":" JMVaRωα:β(θ)","inline":true},{"text":", to ","element":"span"},{"style":{"height":17.68},"width":185.07,"height":44.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-8.png","element":"img","alt":" JRVaR0:β(θ)","inline":true},{"text":", which does not have extra attention on the lower tail","element":"span"}],[{"text":"even though it is still a risk-averse objective. The variant with ","element":"span"},{"style":{"height":10.8},"width":132.58,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-9.png","element":"img","alt":" ω = 0.0","inline":true,"padRight":true},{"text":"has the largest performance","element":"span"}],[{"text":"decrease, indicating the significance of extra attention on the lower tail. When setting ","element":"span"},{"style":{"height":10.8},"width":133.22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-10.png","element":"img","alt":" ω = 1.0","inline":true},{"text":", the","element":"span"}],[{"text":"variant also suffers from a mild performance drop. 
Overall, the phenomena suggest that the level of","element":"span"}],[{"text":"risk aversion needs to be properly tuned; both an indifferent level and excessive focus would lead to","element":"span"}],[{"id":"id-74","text":"undesirable performance.","element":"span"}],[{"text":"Table 5: Ablation of the mixing parameter.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"51%"},"width":818,"height":205,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/17-11.png","element":"img"}],[{"text":"B.3 ","element":"span"},{"text":"E","element":"span"},{"text":"XTENSIVE ","element":"span"},{"text":"A","element":"span"},{"text":"VG","element":"span"},{"text":"@","element":"span"},{"text":"K AND ","element":"span"},{"text":"P","element":"span"},{"text":"ASS","element":"span"},{"text":"@","element":"span"},{"text":"K ","element":"span"},{"text":"R","element":"span"},{"text":"ESULTS","element":"span"}],[{"text":"Since both AIME2024 and AIME2025 contain only 30 questions, the Pass@k metric exhibits high","element":"span"}],[{"text":"variance and fluctuates significantly. To obtain a more stable evaluation, we report the Avg@k re-","element":"span"}],[{"text":"sults in Figure ","element":"span"},{"href":"#id-75","referenceIndex":167,"text":"7. ","element":"a"},{"text":"Across both datasets and different ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"values, RiskPO consistently outperforms","element":"span"}],[{"text":"GRPO, achieving higher Avg@k scores and demonstrating more stable improvements during train-","element":"span"}],[{"text":"ing. The advantage of RiskPO is especially pronounced in the later training stages, where it con-","element":"span"}],[{"text":"tinues to increase while GRPO tends to plateau. 
These results further confirm the effectiveness of","element":"span"}],[{"text":"our risk-sensitive optimization in enhancing reasoning performance on small-scale but challenging","element":"span"}],[{"text":"benchmarks like AIME.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-76","text":"8 ","element":"a"},{"text":"reports Pass@k for ","element":"span"},{"style":{"height":16},"width":233.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/18-0.png","element":"img","alt":" k ∈ {1, 8, 16}","inline":true},{"text":". Across both datasets and all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", RiskPO consistently","element":"span"}],[{"text":"outperforms GRPO throughout training. The margin is modest but stable at ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"(especially on","element":"span"}],[{"text":"Minerva, where variance is higher), and becomes clearly larger for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 8","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"16","element":"span"},{"text":". This widening","element":"span"}],[{"text":"gap at larger ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"indicates that RiskPO not only improves the best single prediction, but also spreads","element":"span"}],[{"text":"probability mass over a broader set of valid solution paths, thereby increasing the likelihood that at","element":"span"}],[{"text":"least one sampled response is correct. 
In effect, the risk-sensitive objective enhances coverage and","element":"span"}],[{"text":"diversity of reasoning, pushing the success frontier on problems that initially have low correctness","element":"span"}],[{"id":"id-75","text":"probability and yielding larger gains at higher ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"58%"},"width":928,"height":820,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/18-1.png","element":"img"}],[{"text":"Figure 7: Avg@k learning curves on AIME2024 and AIME2025 datasets.","element":"figcaption","subtype":"caption"}],[{"id":"id-76","style":{"width":"89%"},"width":1414,"height":821,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/18-2.png","element":"img"}],[{"text":"Figure 8: Pass@k learning curves on Minerva and Olympiad datasets.","element":"figcaption","subtype":"caption"}],[{"text":"B.4 ","element":"span"},{"text":"J","element":"span"},{"text":"USTIFICATION OF ","element":"span"},{"text":"A","element":"span"},{"text":"SSUMPTION ","element":"span"},{"href":"#id-47","text":"1","element":"a"}],[{"text":"Figure ","element":"span"},{"href":"#id-77","referenceIndex":178,"text":"9 ","element":"a"},{"text":"presents the output log probability stratified by reward ranges across evaluation datasets.","element":"span"}],[{"text":"On Minerva and Olympiad datasets, the patterns closely align with Assumption ","element":"span"},{"href":"#id-47","text":"1: ","element":"a"},{"text":"the output log","element":"span"}],[{"text":"probability is monotone with respect to reward in both the lower- and upper-tail regions, approxi-","element":"span"}],[{"text":"mately on ","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"3) ","element":"span"},{"text":"and 
","element":"span"},{"text":"(0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"7","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1)","element":"span"},{"text":". Results on AMC show a similar monotone trend, although fluctua-","element":"span"}],[{"text":"tions appear in the mid-reward ranges, suggesting mixed difficulty and solution modes. For MATH,","element":"span"}],[{"text":"which is comparatively easier, the upper tail exhibits strong monotonicity, while the lower tail is less","element":"span"}],[{"text":"pronounced—likely due to a scarcity of truly difficult items that would populate that region. Im-","element":"span"}],[{"text":"portantly, these evaluation sets are not used for model training; the observed regularities therefore","element":"span"}],[{"text":"provide additional evidence that Assumption ","element":"span"},{"href":"#id-47","text":"1 ","element":"a"},{"text":"holds broadly for the pretrained base model across","element":"span"}],[{"id":"id-77","text":"diverse benchmarks.","element":"span"}],[{"style":{"width":"90%"},"width":1429,"height":1134,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2510.00911/images/19-0.png","element":"img"}],[{"text":"Figure 9: The output log probability on various evaluation datasets.","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]