2d:[[["$","$L30","0",{"heading":"Abstract","index":0,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimizes the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:1"}]}],["$","$1","2",{"children":"extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. 
Our code is available at "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:3"}]}],["$","$1","4",{"children":"."}]]}]]}],["$","$L30","1",{"heading":"1 Introduction","index":1,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Large Language Models (LLMs) have achieved substantial advancements, progressively narrowing the gap towards achieving Artificial General Intelligence (AGI) ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:1"}]}],["$","$1","2",{"children":"; "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:3"}]}],["$","$1","4",{"children":"; "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:5"}]}],["$","$1","6",{"children":"; "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:7"}]}],["$","$1","8",{"children":"; "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:9"}]}],["$","$1","10",{"children":"; "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:11"}]}],["$","$1","12",{"children":"]. 
Recently, LLMs exemplified by OpenAI o1 ["}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:13"}]}],["$","$1","14",{"children":"] and DeepSeek R1 ["}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:15"}]}],["$","$1","16",{"children":"], have adopted a strategy of generating intermediate reasoning steps before producing final answers. This approach has markedly improved their efficacy in domain-specific tasks ["}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:17"}]}],["$","$1","18",{"children":"; "}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:19"}]}],["$","$1","20",{"children":"; "}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:21"}]}],["$","$1","22",{"children":"; "}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:23"}]}],["$","$1","24",{"children":"; "}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:25"}]}],["$","$1","26",{"children":"; "}],["$","$1","27",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:27"}]}],["$","$1","28",{"children":"], such as mathematical reasoning. 
The remarkable success of this technology is mainly attributed to the Reinforcement Fine-Tuning (RFT) method ["}],["$","$1","29",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:29"}]}],["$","$1","30",{"children":"; "}],["$","$1","31",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:31"}]}],["$","$1","32",{"children":"; "}],["$","$1","33",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:33"}]}],["$","$1","34",{"children":"; "}],["$","$1","35",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:35"}]}],["$","$1","36",{"children":"; "}],["$","$1","37",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:37"}]}],["$","$1","38",{"children":"]. Through the application of RFT, the models allocate additional time to “deliberate” prior to generating answers, thereby constructing intricate reasoning chains and subsequently enhancing overall model performance."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In contrast to Supervised Fine-Tuning (SFT), which involves training models on fixed input-output pairs to mimic correct responses, RFT introduces an iterative process that incentivizes models to generate coherent and logically structured reasoning paths. 
RFT leverages RL techniques, such as Proximal Policy Optimization (PPO) ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:1"}]}],["$","$1","2",{"children":"] and GRPO ["}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:3"}]}],["$","$1","4",{"children":"], to optimize decision-making during the generation of intermediate steps. Specifically, PPO ensures stability by constraining policy updates, preventing new strategies from deviating significantly from established behaviours. In contrast, GRPO enhances this process by evaluating performance across groups of actions, encouraging consistent improvements in reasoning quality. This dynamic and feedback-driven approach enables models to engage in deeper and more adaptive thinking, resulting in nuanced answers that better handle complex reasoning tasks compared to the more rigid and label-dependent training of SFT."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Despite its significant success in enhancing reasoning quality, PPO still suffers from enormous resource consumption during training. 
PPO necessitates the development and integration of both a critic model and a reference model, which not only complicates the training"}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/1-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:3:0"}]]}]}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 1: Performance comparison on unimodal reasoning tasks, with extended validation on multimodal reasoning. ("}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:1:style","children":"Top"}]}],["$","$1","2",{"children":") GPG achieves substantial performance gains over state-of-the-art (SOTA) baselines across diverse mathematical benchmarks, demonstrating its core effectiveness for linguistic reasoning. ("}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:3:style","children":"Bottom"}]}],["$","$1","4",{"children":") The method also generalizes robustly to multi-modal settings, outperforming other RL methods and further validating its broad applicability. 
Extended experiments are detailed in Section "}],["$","$1","5",{"children":"3."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"process but also substantially increases computational demands. Consequently, there is a growing trend toward simplifying the PPO method. For instance, ReMax ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:1"}]}],["$","$1","2",{"children":"] removes the critic model by introducing a baseline value, which reduces GPU memory usage during training and accelerates the training process. Similarly, GRPO eliminates the need for a critic model and utilizes normalized rewards within a sample group."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In addition to these methods for improving efficiency and stability, a very recent concurrent work, Dr.GRPO ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:6:1"}]}],["$","$1","2",{"children":"], studies the details of reward and loss normalization and states that GRPO tends to generate more tokens. 
However, although it reveals the reward bias in the advantage function, we observe that it does not significantly outperform GRPO."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"A thorough examination of the evolution within the Reinforcement Learning (RL) community reveals that the widely adopted PPO algorithm essentially functions as a conservative surrogate for the original RL problem ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:7:1"}]}],["$","$1","2",{"children":"]. All existing enhancements to PPO continue to focus on optimizing the surrogate loss function. Consequently, two fundamental questions remain unresolved:"}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Is it feasible to transcend this intermediate strategy and directly optimize the original problem? • If this is achievable, to what extent can the learning strategy be streamlined?"}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"This paper endeavors to provide a comprehensive exploration of these critical questions. 
In summary, our key contributions are as follows:"}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• We revisit the design of policy gradient algorithms ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:10:1"}]}],["$","$1","2",{"children":"] and propose a simple RL method that retains minimal RL components. Unlike conventional approaches, our method directly optimizes the objective function rather than relying on surrogate loss."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Our approach eschews the necessity for both a critic model and a reference model. Moreover, it imposes no distributional constraints. These characteristics confer substantial advantages for potential scalability."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• We have analyzed and demonstrated the reward bias inherent in existing advantage functions and revealed the limitations of simplistic debiasing methods. Our exploration of the gradient estimate bias phenomenon has led us to propose a simple yet accurate gradient estimation (AGE) technique. 
To mitigate the potential issue of large variance in gradient estimation when the proportion of valid samples is excessively small, we introduce a simple thresholding mechanism to ensure that a minimum proportion of valid samples is maintained, followed by resampling."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Extensive experiments demonstrate that GPG achieves SOTA results across various unimodal and multimodal visual tasks."}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Our code and implementation details are open-sourced."}]]}]]}],["$","$L30","2",{"heading":"2 Method","index":2,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:0:0:style","children":"2.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:0:1:style","children":"Preliminary and Task Formulation"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"RL is a computational 
approach to learning through interaction, where an agent seeks to maximize cumulative rewards by selecting optimal actions within an environment. The RL problem is typically defined by a policy "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:1"}]]}]}],["$","$1","2",{"children":", which maps states to actions, and aims to optimize the expected return. The core idea behind policy gradient methods is to use gradient ascent to iteratively adjust the policy parameters. The learning objective is maximizing the return "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:3"}]]}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/2-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:2:0"}]]}]}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The policy gradient theorem ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:1"}]}],["$","$1","2",{"children":"] proves that the above problem can be converted into estimating the gradient,"}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/2-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:0"}]]}]}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:1"}]]}]}],["$","$1","2",{"children":"is the action-value function, representing the expected return when taking action "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:3"}]]}]}],["$","$1","4",{"children":"and following policy "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:5"}]]}]}],["$","$1","6",{"children":"thereafter."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To reduce the variance, the advantage function "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:6:1"}]]}]}],["$","$1","2",{"children":"is often used leading to the policy gradient update 
rule:"}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/2-8.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:0"}]]}]}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:8:0:style","children":"One-step advantage estimation "}]}],["$","$1","1",{"children":"can be mathematically formulated as "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:8:2"}]}],["$","$1","3",{"children":":"}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/2-9.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:0"}]]}]}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:1"}]]}]}],["$","$1","2",{"children":"is a function of "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:3"}]]}]}],["$","$1","4",{"children":". In principle, "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:5"}]]}]}],["$","$1","6",{"children":"can take any functional form. 
One commonly employed function is the value function, which represents the expected return when starting from state "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:7"}]]}]}],["$","$1","8",{"children":"and following policy "}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:9"}]]}]}],["$","$1","10",{"children":". While GAE ["}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:11"}]}],["$","$1","12",{"children":"] offers a more sophisticated approach to balance bias and variance in advantage estimation, we find that in the context of model reasoning, one-step estimation is sufficiently effective for achieving good performance. This simplicity is particularly advantageous in scenarios where computational efficiency is paramount."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Given a sequence of questions and instructions, the model is tasked with generating corresponding answers. Subsequently, rewards are returned based on predefined reward models or hand-crafted rules. 
Our objective is to leverage these reward signals to optimize our policy, thereby enhancing the model’s ability to generate accurate and contextually appropriate responses."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"However, designing or obtaining accurate rewards for intermediate steps is nontrivial ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:1"}]}],["$","$1","2",{"children":"]. To address this challenge, we simplify our problem as follows. Given a question and prompt "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:3:style","children":"s"}]}],["$","$1","4",{"children":", we sample an action "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:5:style","children":"a "}]}],["$","$1","6",{"children":"from policy "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:7"}]]}]}],["$","$1","8",{"children":"and obtain a final reward signal "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:9:style","children":"r"}]}],["$","$1","10",{"children":". 
Note that the policy distribution "}],["$","$1","11",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:12:11"}]]}]}],["$","$1","12",{"children":"is modeled in an autoregressive manner. In this setting, we can leverage policy gradient methods to optimize the policy."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:13:0:style","children":"2.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:13:1:style","children":"Group Policy Gradient"}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Our proposed method, Group Policy Gradient (GPG), is designed to address the issue of high variance in policy gradient estimation in the absence of a value model. By leveraging group-level rewards, GPG stabilizes learning and enhances the robustness of reinforcement learning training. Specifically, GPG utilizes the mean reward within each group to normalize the rewards, thereby effectively reducing variance. This approach eliminates the need for a traditional value model, thereby simplifying the training process and enhancing computational efficiency. 
The name \"Group Policy Gradient\" reflects our method’s core mechanism of utilizing group-level mean rewards to stabilize and optimize learning."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The core objective of GPG is defined as:"}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/3-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:16:0"}]]}]}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:17:1"}]]}]}],["$","$1","2",{"children":"represents the individual responses in the group 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:17:3:style","children":"G"}]}],["$","$1","4",{"children":", and the advantage of the "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:17:5:style","children":"i"}]}],["$","$1","6",{"children":"-th response is calculated by normalizing the group-level rewards "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:17:7"}]]}]}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/3-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:18:0"}]]}]}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:19:0"}]]}]}],["$","$1","1",{"children":"is an "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:19:2:style","children":"optional "}]}],["$","$1","3",{"children":"normalization technique, which is commonly applied in conjunction with reward clipping to mitigate the impact of unexpected outlier values. One widely adopted practice is to employ standard variance normalization within a training batch ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:19:4"}]}],["$","$1","5",{"children":"; "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:19:6"}]}],["$","$1","7",{"children":"]. This approach helps stabilize the training process by reducing the variance of the reward signal, which is particularly important when dealing with environments where the magnitude of rewards can vary significantly, such as in different Atari games. By normalizing the reward signal, the model becomes less sensitive to extreme values, thereby improving the robustness and convergence of the training algorithm. However, in the reasoning tasks involving large models, the reward is typically well-defined and does not suffer from the same variance issues observed in other environments. 
For math reasoning problems, it is common practice to reward a correct answer with 1.0 and an incorrect answer with 0.0."}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We utilize a basic Math Reasoning setting "}],["$","$1","1",{"children":"1 "}],["$","$1","2",{"children":"of SimpleRL from open-r1 "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:20:3"}]}],["$","$1","4",{"children":", using only the MATH-lighteval dataset to facilitate rapid experimental validation. Specifically, we remove the format reward and enable only the accuracy reward for simplicity."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/3-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:21:0"}]]}]}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 1: Math reasoning results on the Qwen2.5-Math-7B model. 
"}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:22:1"}]]}]}],["$","$1","2",{"children":": reproduced using the released code."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The critical component, "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:23:1"}]]}]}],["$","$1","2",{"children":", has been underexplored in prior research on reasoning. This gap in the literature highlights the need for further investigation of the role and impact of "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:23:3"}]]}]}],["$","$1","4",{"children":"within reasoning tasks. 
Two problems remain unresolved."}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:0:style","children":"The "}]}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:1"}]]}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:2:style","children":"should not introduce reward bias"}]}],["$","$1","3",{"children":". Otherwise, the biased objective deviates from the original problem formulation. 
GRPO ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:4"}]}],["$","$1","5",{"children":"] formulates it as "}],["$","$1","6",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:6"}]]}]}],["$","$1","7",{"children":", which is essentially a function of "}],["$","$1","8",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:8"}]]}]}],["$","$1","9",{"children":"in Equation "}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:10"}]}],["$","$1","11",{"children":"and explicitly introduces reward bias. Since we aim to solve the original problem, we do not want to introduce a surrogate or a bias. However, as shown in Table "}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:12"}]}],["$","$1","13",{"children":"if we remove this bias term, i.e. 
"}],["$","$1","14",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:14"}]]}]}],["$","$1","15",{"children":", it (43.9%) "}],["$","$1","16",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:16:style","children":"cannot clearly outperform "}]}],["$","$1","17",{"children":"GRPO (43.7%), which contradicts the observation of the concurrent work Dr. GRPO "}],["$","$1","18",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:24:18"}]}],["$","$1","19",{"children":"."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:25:0:style","children":"Examples from groups whose responses are all right or all wrong introduce bias into the gradient estimation"}]}],["$","$1","1",{"children":". 
Given a training batch of size "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:25:2:style","children":"B"}]}],["$","$1","3",{"children":", let the gradient of the "}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:25:4:style","children":"i"}]}],["$","$1","5",{"children":"-th sample be denoted as "}],["$","$1","6",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:25:6"}]]}]}],["$","$1","7",{"children":"Without loss of generality, assume that the first "}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:25:8:style","children":"M "}]}],["$","$1","9",{"children":"examples within the batch come from groups whose responses are all right or all wrong. 
The standard backpropagation (BP) algorithm estimates the gradient as:"}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/4-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:26:0"}]]}]}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"contribute zero gradient. 
Therefore, the more accurate gradient estimation (AGE) can be written as:"}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/4-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:28:0"}]]}]}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"It should be noted that the value "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:29:1"}]]}]}],["$","$1","2",{"children":"is not a constant and it varies across different sample batches. 
We also plot "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:29:3"}]]}]}],["$","$1","4",{"children":"across training steps in Figure "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:29:5"}]}],["$","$1","6",{"children":"which indicates the necessity of gradient correction. For multi-GPU training, a more accurate gradient calculation would gather all non-zero-gradient samples across all GPUs and average their gradients uniformly. This can be implemented through a custom gradient aggregation function, but it increases communication overhead. Instead, we derive an equivalent form that incurs no extra cost; the proof is provided in section "}],["$","$1","7",{"children":"A. 
"}],["$","$1","8",{"children":"Therefore, given a batch sample, the objective can be written as"}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/4-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:30:0"}]]}]}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-25","style":"$undefined","children":"As shown in Table "}]}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:31:1"}]}],["$","$1","2",{"children":"our method achieves an average score of 47.8%, being equipped with AGE."}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/4-6.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:32:0"}]]}]}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 2: ("}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:33:1:style","children":"Left"}]}],["$","$1","2",{"children":") The proportion of easy problems with all rewards are 0, hard problems with all rewards are 1 within a rollout group. ("}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:33:3:style","children":"Right"}]}],["$","$1","4",{"children":") The standard deviation of reward across steps."}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In a scenario where we reject the "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:34:1:style","children":"M "}]}],["$","$1","2",{"children":"examples and resample responses in a manner similar to the approach presented in a recent work ["}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:34:3"}]}],["$","$1","4",{"children":"] until 
"}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:34:5:style","children":"M "}]}],["$","$1","6",{"children":"equals 0, "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:34:7"}]]}]}],["$","$1","8",{"children":"is set to 1. However, this particular setting is not training efficient. The reason is that the training time is constrained by the worker that takes the longest to collect the desired examples. In contrast, our proposed method demonstrates superior efficiency. Moreover, it has the capability to automatically adjust the loss based on the performance of the sample batch."}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We also evaluate a setting of reward normalization of GRPO, where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:35:1"}]]}]}],["$","$1","2",{"children":"show the result in Table "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:35:3"}]}],["$","$1","4",{"children":"It outperforms "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:35:5"}]]}]}],["$","$1","6",{"children":"average score. This motivates us to dive into where the improvement comes from. We plot the std of the reward in Figure "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:35:7"}]}],["$","$1","8",{"children":"Note the std is calculated by averaging the std of each group, whose value ranges from 0.10 to 0.35. And "}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:35:9"}]]}]}],["$","$1","10",{"children":"varies from 1.5 to 4.0. 
The reward normalization of GRPO divides by the within-group std, a mechanism that has some gradient correction effect."}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/4-11.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:36:0"}]]}]}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 2: Comparison of reinforcement learning algorithms (in reasoning) with various components."}]]}],["$","$La","38",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:38:0:style","children":"Thresholding the minimal proportion of valid samples and resampling to reduce variance. "}]}],["$","$1","1",{"children":"While our approach provides an unbiased estimation of the gradient, it may encounter issues with high variance when the proportion of valid samples is excessively low. 
To mitigate this, we introduce a threshold "}],["$","$1","2",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:38:2"}]]}]}],["$","$1","3",{"children":"for the proportion of valid samples. When this proportion falls below the given value, we "}],["$","$1","4",{"children":"accumulate the valid samples into the resampled subsequent batch until the proportion exceeds the threshold. This strategy effectively reduces the variance of the gradient estimation, thereby enhancing the stability and convergence rate of the model training process. It is important to highlight that this strategy can further enhance the performance, as demonstrated in Table "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:38:5"}]}]]}],["$","$La","39",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"RL algorithms vary significantly in their approaches to tackling variance and optimizing policies. Two key components in many RL algorithms are surrogate loss and policy constraints. 
We summarize the main comparisons among various frameworks in Table "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:39:1"}]}],["$","$1","2",{"children":"Our method stands out by preserving the simplest form, which not only ensures ease of implementation but also maintains high efficiency and effectiveness."}]]}]]}],["$","$L30","3",{"heading":"3 Experiments","index":3,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/5-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:0"}]]}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:0:style","children":"3.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:1:style","children":"Experimental 
Setup"}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:0:style","children":"Dataset and Benchmarks "}]}],["$","$1","1",{"children":"In the unimodal scenario, we use datasets from multiple sources, including open-s1, open-rs ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:2"}]}],["$","$1","3",{"children":"], and MATH-lighteval ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:4"}]}],["$","$1","5",{"children":"] for training. Specifically, we train the DeepSeek-R1-Distill-Qwen-1.5B base model with the open-s1 dataset, resulting in the GPG-RS1 model. Similarly, training with the open-rs dataset produces the GPG-RS3 model. Furthermore, we conduct ablations by training the Qwen2.5-Math-7B base model on the MATH-lighteval dataset. 
To compare the overall performance on the 7B model, we utilize the dataset from ["}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:6"}]}],["$","$1","7",{"children":"] and the detailed setting is shown in section "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:8"}]}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"These datasets encompass a wide range of problem types and difficulty levels. To assess the reasoning capabilities of the models, we employ five distinct mathematics-focused benchmark datasets: AIME24, MATH-500 "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:1"}]}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:2"}]}],["$","$1","3",{"children":", AMC23, Minerva "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:4"}]}],["$","$1","5",{"children":", and OlympiadBench "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:6"}]}],["$","$1","7",{"children":"."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In the multimodal case, we handle a variety of tasks. 
Specifically, for the visual reasoning task, we utilize approximately "}],["$","$1","1",{"children":"12"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:2:style","children":", "}]}],["$","$1","3",{"children":"000 "}],["$","$1","4",{"children":"samples from the SAT dataset ["}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:5"}]}],["$","$1","6",{"children":"] for training and perform evaluations on the CV-Bench dataset ["}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:7"}]}],["$","$1","8",{"children":"]. For the geometry reasoning task, following R1-V ["}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:9"}]}],["$","$1","10",{"children":"], we train on around "}],["$","$1","11",{"children":"8"}],["$","$1","12",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:12:style","children":", "}]}],["$","$1","13",{"children":"000 "}],["$","$1","14",{"children":"samples from the GEOQA training set ["}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:15"}]}],["$","$1","16",{"children":"] and subsequently evaluate performance on the GEOQA test set ["}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:17"}]}],["$","$1","18",{"children":"]. 
For both the classification and reasoning grounding tasks, we follow Visual-RFT to conduct few-shot classification training on Flower102 ["}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:19"}]}],["$","$1","20",{"children":"], Pets37 ["}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:21"}]}],["$","$1","22",{"children":"], FGVCAircraft ["}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:23"}]}],["$","$1","24",{"children":"], Car196 ["}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:25"}]}],["$","$1","26",{"children":"], respectively. Additionally, training is conducted on "}],["$","$1","27",{"children":"239 "}],["$","$1","28",{"children":"samples from the LISA training set ["}],["$","$1","29",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:29"}]}],["$","$1","30",{"children":"]. All evaluations are carried out using the corresponding test sets associated with these training sets."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:0:style","children":"Implementation Details "}]}],["$","$1","1",{"children":"Our approach is broadly applicable across a wide range of reinforcement learning tasks. 
To demonstrate its versatility and efficacy, we have conducted experiments encompassing both unimodal and multimodal scenarios. These experiments are performed on NVIDIA H20 GPUs and NPUs from China. For each experiment, we adhere strictly to the implementation of the original code base, ensuring consistent training and evaluation procedures. The GPG method is described in Algorithm"}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:2"}]}],["$","$1","3",{"children":"and more detailed settings are given in Appendix "}],["$","$1","4",{"children":"B."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/5-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:6:0"}]]}]}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/6-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:0"}]]}]}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 3: The zero-shot pass@1 performance of the 1.5B models distilled by DeepSeek-R1 across five mathematical reasoning benchmarks. "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:8:1"}]]}]}],["$","$1","2",{"children":": reproduced results using the released code. 
"}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:8:3"}]]}]}],["$","$1","4",{"children":": results from "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:8:5"}]}],["$","$1","6",{"children":"."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/6-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:9:0"}]]}]}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 4: The zero-shot pass@1 performance of the 7B models across five mathematical reasoning benchmarks. "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:1"}]]}]}],["$","$1","2",{"children":": reproduced results using the released code. 
"}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:3"}]]}]}],["$","$1","4",{"children":": results from "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:7"}]]}]}],["$","$1","8",{"children":": results from "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:9"}]}],["$","$1","10",{"children":"."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-50","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:11:0:style","children":"3.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:11:1:style","children":"Unimodal Task 
Evaluation"}]}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To evaluate our method, we select two models: a strong 1.5B distilled SFT model (DeepSeek-R1-Distill-Qwen-1.5B) and a 7B base model."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:0:style","children":"Mathematical Reasoning using the 1.5B model (a strong SFT model). "}]}],["$","$1","1",{"children":"Compared with other 1.5B distilled models ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:2"}]}],["$","$1","3",{"children":"; "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:4"}]}],["$","$1","5",{"children":"], our models exhibit superior performance, with GPG-RS1 reaching an average accuracy of 55.7%, as illustrated in Table "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:6"}]}],["$","$1","7",{"children":"Additionally, GPG-RS1 and GPG-RS3 show strong results on AMC23 with scores of 77.5% and 80.0%, clearly surpassing Open-RS at 67.5% and 70.0%. 
Both GPG-RS1 and GPG-RS3 demonstrate competitive performance across various benchmarks, particularly excelling in MATH-500 ["}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:8"}]}],["$","$1","9",{"children":"; "}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:10"}]}],["$","$1","11",{"children":"] with scores of 87.6% and 85.0%, and OlympiadBench ["}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:13:12"}]}],["$","$1","13",{"children":"] with scores of 50.5% and 52.4%."}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:14:0:style","children":"Mathematical Reasoning using the 7B model. "}]}],["$","$1","1",{"children":"As illustrated in Table 
"}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:14:2"}]}],["$","$1","3",{"children":"GPG-7B achieves an average score of 57.7% and outperforms other baselines by clear margins."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/6-7.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:15:0"}]]}]}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:16:0:style","children":"3.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:16:1:style","children":"Multimodal Task Evaluation"}]}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We further evaluate our method on several very recent multimodal benchmarks, most of which report results based on 
GRPO."}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:18:0:style","children":"Visual Reasoning. "}]}],["$","$1","1",{"children":"We initially evaluate the GPG method using the CV-Bench ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:18:2"}]}],["$","$1","3",{"children":"] visual reasoning dataset, strictly adhering to the parameter settings of VisualThinker-R1-Zero. As illustrated in Table "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:18:4"}]}],["$","$1","5",{"children":"the GPG method demonstrates a significant improvement in performance. 
Specifically, it attains a"}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/7-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:19:0"}]]}]}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 7: Visual reasoning results on CV-Bench ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:20:1"}]}],["$","$1","2",{"children":"], which shows GPG training on base model has overall better performance over GRPO training and the base model."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 8: Reasoning grounding results on LISA ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:21:1"}]}],["$","$1","2",{"children":"]. 
GPG surpasses GRPO in reasoning grounding with 239 training images."}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"score of 69.11% on CV-Bench, representing an increase of 9.64 percentage points compared to the 59.47% score achieved by GRPO."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:23:0:style","children":"Geometry Reasoning. "}]}],["$","$1","1",{"children":"In addition to visual reasoning, MLLMs exhibit notable proficiency in geometry reasoning. To evaluate the efficacy of the GPG method in this domain, we employ an experimental setup similar to that used in R1-V ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:23:2"}]}],["$","$1","3",{"children":"] using the GEOQA ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:23:4"}]}],["$","$1","5",{"children":"] dataset. The results, presented in Table "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:23:6"}]}],["$","$1","7",{"children":"indicate that the GPG method achieves a score of 50.80%, surpassing GRPO’s score of 47.48% by 3.32 percentage points. 
This demonstrates the superior performance of the GPG method in addressing complex geometric reasoning tasks."}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:24:0:style","children":"Classification. "}]}],["$","$1","1",{"children":"Beyond the evaluation of reasoning tasks, we also assess the improvement of the GPG method over GRPO in image perception tasks. As shown in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:24:2"}]}],["$","$1","3",{"children":"the GPG method achieves an average score of 86.0% across four classification datasets, surpassing GRPO by 4.1 percentage points. Additionally, our method consistently produces improvements across all four classification datasets, underscoring its superiority in image perception tasks."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:25:0:style","children":"Reasoning Grounding. "}]}],["$","$1","1",{"children":"The final critical aspect of evaluating MLLMs involves precisely identifying objects according to user requirements. 
To this end, we employ the Qwen2-VL-2B model for grounding tasks using the LISA dataset ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:25:2"}]}],["$","$1","3",{"children":"], with the results presented in Table "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:25:4"}]}],["$","$1","5",{"children":"In comparison to the GRPO method, the GPG approach demonstrates a substantial improvement, raising all metrics by over 10 percentage points. This gain underscores the superiority of the GPG method in object localization, leading to considerable advancements in reasoning and perception capabilities."}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-45","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:26:0:style","children":"User: "}]}],["$","$1","1",{"children":"Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: \"Therefore, the final answer is: . 
I hope it is correct\" (without quotes) where "}],["$","$1","2",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:26:2"}]]}]}],["$","$1","3",{"children":" is just the final number or expression that solves the problem."}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:0"}]]}]}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:1"}]]}]}],["$","$1","2",{"children":"such that "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:3"}]]}]}],["$","$1","4",{"children":"is a rhombus whose diagonals "}],["$","$1","5",{"children":"intersect at the origin. 
Find the greatest real number that is less than "}],["$","$1","6",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:6"}]]}]}],["$","$1","7",{"children":"for all such rhombi. "}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:8:style","children":"Assistant: "}]}],["$","$1","9",{"children":"Let's solve this step by step. "}],["$","$1","10",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:27:10:style","children":"Ground Truth: "}]}],["$","$1","11",{"children":"480"}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/7-6.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:28:0"}]]}]}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 3: Comparison of GPG and GRPO in mathematical reasoning task based on DeepSeek-R1-Distill-Qwen-1.5B model trained on Open-r1 dataset: 
a test case from the AIME24 dataset."}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:30:0:style","children":"3.4 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:30:1:style","children":"Ablation Study and Discussion"}]}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:31:0:style","children":"Case Study and Training Analysis. "}]}],["$","$1","1",{"children":"We present the reasoning processes of GPG and GRPO, as illustrated in Fig. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:31:2"}]}],["$","$1","3",{"children":"Compared to GRPO, the GPG approach demonstrates a more comprehensive and accurate reasoning capability, whereas GRPO exhibits errors in formula analysis. Consequently, GPG arrives at the correct solution, while GRPO produces an incorrect result. In Fig. 
"}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:31:4"}]}],["$","$1","5",{"children":"we present a range of real-time training metrics to illustrate the effectiveness of GPG as a straightforward yet strong RL algorithm."}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:32:0:style","children":"Sensitivity to Group Size. "}]}],["$","$1","1",{"children":"We study the effect of the number of generations within a group. As shown in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:32:2"}]}],["$","$1","3",{"children":"increasing the group size from 2 to 16 leads to progressive improvements across most metrics. Specifically, the average performance improves steadily with larger group sizes. We choose a group size of 8 to achieve a good tradeoff between training cost and performance."}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:33:0:style","children":"KL constraint. "}]}],["$","$1","1",{"children":"In principle, our method is designed to optimize the original reinforcement learning (RL) problem directly, so it imposes no distribution constraint, which may seem unusual. 
Despite this, we conducted an ablation study to evaluate the impact of adding a distribution constraint. The results are presented in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:33:2"}]}],["$","$1","3",{"children":"Our findings indicate that incorporating such a constraint negatively impacts performance."}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:34:0:style","children":"Comparison with Various RL Methods. "}]}],["$","$1","1",{"children":"We attempt to explain the differences between GPG and other RL methods in the simplest way. As shown in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:34:2"}]}],["$","$1","3",{"children":"it can be seen that the loss of GPG does not include the “CLIP term” and the “KL divergence”. 
Its form and calculation are the simplest, and as discussed in Section "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:34:4"}]}],["$","$1","5",{"children":"its performance is better than other methods."}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/8-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:35:0"}]]}]}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:36:0:style","children":"3.5 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:36:1:style","children":"Broader Impact"}]}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Achieving advanced general intelligence critically depends on augmenting the reasoning capabilities of models, with 
efficient and scalable reinforcement learning methods serving as a cornerstone. Our proposed approach investigates a minimalist strategy that aims to enhance reasoning capacity through simplicity and efficiency, thereby potentially facilitating the development of scalable systems. However, given the constraints of our computational budget, we have not evaluated our method on extremely large models."}]]}]]}],["$","$L30","4",{"heading":"4 Related Work","index":4,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:0:style","children":"Large Model Reasoning. "}]}],["$","$1","1",{"children":"Recent advancements in both LLM and Multimodal Large Language Model (MLLM) have increasingly focused on enabling models to simulate human-like, stepwise reasoning processes. 
In the field of LLMs, researchers have pioneered methods such as Chain-of-Thought (CoT) prompting ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:2"}]}],["$","$1","3",{"children":"; "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:4"}]}],["$","$1","5",{"children":"; "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:6"}]}],["$","$1","7",{"children":"; "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:8"}]}],["$","$1","9",{"children":"; "}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:10"}]}],["$","$1","11",{"children":"; "}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:12"}]}],["$","$1","13",{"children":"; "}],["$","$1","14",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:14"}]}],["$","$1","15",{"children":"], Tree-of-Thought ["}],["$","$1","16",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:16"}]}],["$","$1","17",{"children":"], Monte Carlo Tree Search ["}],["$","$1","18",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:18"}]}],["$","$1","19",{"children":"; "}],["$","$1","20",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:20"}]}],["$","$1","21",{"children":"; 
"}],["$","$1","22",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:22"}]}],["$","$1","23",{"children":"], and the construction of complex SFT datasets ["}],["$","$1","24",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:24"}]}],["$","$1","25",{"children":"], to enhance performance in reasoning tasks. Notably, approaches such as DeepSeek-R1 ["}],["$","$1","26",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:26"}]}],["$","$1","27",{"children":"] have employed large-scale RL with format-specific and result-oriented reward functions, guiding LLMs toward self-emerging, human-like, complex CoT reasoning with significant performance improvements in challenging reasoning tasks. Meanwhile, MLLMs, which convert inputs from various modalities into a unified LLM vocabulary representation space for processing, have exhibited superior performance in vision understanding tasks ["}],["$","$1","28",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:28"}]}],["$","$1","29",{"children":"; "}],["$","$1","30",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:30"}]}],["$","$1","31",{"children":"; "}],["$","$1","32",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:32"}]}],["$","$1","33",{"children":"; "}],["$","$1","34",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:34"}]}],["$","$1","35",{"children":"; "}],["$","$1","36",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:36"}]}],["$","$1","37",{"children":"; 
"}],["$","$1","38",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:38"}]}],["$","$1","39",{"children":"; "}],["$","$1","40",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:40"}]}],["$","$1","41",{"children":"; "}],["$","$1","42",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:42"}]}],["$","$1","43",{"children":"; "}],["$","$1","44",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:44"}]}],["$","$1","45",{"children":"]. Building on the advancements in LLM "}],["$","$1","46",{"children":"reasoning, there has been a collective effort within the research community to apply the DeepSeek-R1 methodology to MLLMs to enhance their visual reasoning capabilities, yielding remarkable progress "}],["$","$1","47",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:47"}]}],["$","$1","48",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:48"}]}],["$","$1","49",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:49"}]}],["$","$1","50",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:50"}]}],["$","$1","51",{"children":"."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:0:style","children":"Re
inforcement Learning. "}]}],["$","$1","1",{"children":"RL has driven significant progress in sequential decision-making, with policy gradient methods being fundamental to optimizing stochastic policies. The REINFORCE algorithm ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:2"}]}],["$","$1","3",{"children":"] established early principles for gradient-based policy updates in trajectory-driven tasks. However, its high variance has posed challenges for scalability. To address this, subsequent research has focused on stabilizing policy optimization processes. Trust Region Policy Optimization (TRPO) ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:4"}]}],["$","$1","5",{"children":"] introduced constrained updates via quadratic approximations to ensure monotonic improvement. This was further refined by PPO ["}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:6"}]}],["$","$1","7",{"children":"], which employed clipped objective functions to simplify the optimization process. Subsequent studies have sought to enhance the PPO algorithm ["}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:8"}]}],["$","$1","9",{"children":"; "}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:10"}]}],["$","$1","11",{"children":"] or elaborate on its implementation ["}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:12"}]}],["$","$1","13",{"children":"]. PPO has achieved widespread use in language model alignment and robotic control. 
However, the algorithm’s dependence on conservative policy updates or heuristic clipping thresholds can undermine its exploration potential in favor of stability, which poses a significant challenge in complex domains requiring dynamic strategy adaptation."}]]}]]}],["$","$L30","5",{"heading":"5 Conclusion","index":5,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In this paper, we introduce GPG, which effectively addresses the critical challenges inherent in reinforcement fine-tuning approaches such as PPO and GRPO. By directly incorporating group-based decision dynamics into the standard PG method, GPG simplifies the training process and significantly reduces computational overhead without sacrificing reasoning quality. This breakthrough provides a more efficient framework for training advanced LLMs capable of complex reasoning, thereby contributing to more resource-effective and scalable artificial intelligence systems."}]]}]]}],["$","$L30","6",{"heading":"References","index":6,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-74","style":"$undefined","children":"[1] "}]}],["$","$1","1",{"children":"Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Leonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters for on-policy deep actor-critic methods? a large-scale study. 
In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:0:2:style","children":"International conference on learning representations"}]}],["$","$1","3",{"children":", 2021."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-3","style":"$undefined","children":"[2] "}]}],["$","$1","1",{"children":"Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:1:2:style","children":"arXiv preprint arXiv:2502.13923"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-34","style":"$undefined","children":"[3] "}]}],["$","$1","1",{"children":"Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. 
Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning, 2022."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-33","style":"$undefined","children":"[4] "}]}],["$","$1","1",{"children":"Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:3:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:3:3"}]}],["$","$1","4",{"children":", 2025. Accessed: 2025-02-02."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-62","style":"$undefined","children":"[5] "}]}],["$","$1","1",{"children":"Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:4:2:style","children":"arXiv preprint arXiv:2412.05271"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-71","style":"$undefined","children":"[6] "}]}],["$","$1","1",{"children":"Xiangxiang Chu. Policy optimization with penalized point probability distance: An alternative to proximal policy optimization. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:5:2:style","children":"arXiv preprint arXiv:1807.00442"}]}],["$","$1","3",{"children":", 2018."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-6","style":"$undefined","children":"[7] "}]}],["$","$1","1",{"children":"Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:6:2:style","children":"arXiv preprint"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[8] "}],["$","$1","1",{"children":"Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:7:2:style","children":"arXiv preprint arXiv:2502.01456"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-27","style":"$undefined","children":"[9] "}]}],["$","$1","1",{"children":"Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t, 2025."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-73","style":"$undefined","children":"[10] "}]}],["$","$1","1",{"children":"Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. 
Implementation matters in deep rl: A case study on ppo and trpo. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:9:2:style","children":"International conference on learning representations"}]}],["$","$1","3",{"children":", 2019."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-22","style":"$undefined","children":"[11] "}]}],["$","$1","1",{"children":"Hugging Face. Open r1: A fully open reproduction of deepseek-r1. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:10:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:10:3"}]}],["$","$1","4",{"children":", 2025. Accessed: January 2025."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-58","style":"$undefined","children":"[12] "}]}],["$","$1","1",{"children":"Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 
Alphazero-like tree-search can guide large language model decoding and training, 2024."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-8","style":"$undefined","children":"[13] "}]}],["$","$1","1",{"children":"Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, et al. Omni-math: A universal olympiad level mathematic benchmark for large language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:12:2:style","children":"arXiv preprint arXiv:2410.07985"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-63","style":"$undefined","children":"[14] Google. Gemini: A family of highly capable multimodal models, 2023."}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[15] "}],["$","$1","1",{"children":"Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:14:2:style","children":"arXiv preprint arXiv:2501.04519"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-2","style":"$undefined","children":"[16] "}]}],["$","$1","1",{"children":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:15:2:style","children":"arXiv preprint arXiv:2501.12948"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-28","style":"$undefined","children":"[17] "}]}],["$","$1","1",{"children":"Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:16:2:style","children":"arXiv preprint arXiv:2103.03874"}]}],["$","$1","3",{"children":", 2021."}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-75","style":"$undefined","children":"[18] "}]}],["$","$1","1",{"children":"Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:17:2:style","children":"arXiv preprint arXiv:2009.10897"}]}],["$","$1","3",{"children":", 2020."}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-17","style":"$undefined","children":"[19] "}]}],["$","$1","1",{"children":"Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models, 2025."}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[20] "}],["$","$1","1",{"children":"Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:19:2:style","children":"arXiv preprint arXiv:2503.24290"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-9","style":"$undefined","children":"[21] "}]}],["$","$1","1",{"children":"Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, and Richong Zhang. Mmgenbench: Fully automatically evaluating lmms from the text-to-image generation perspective, 2025."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-12","style":"$undefined","children":"[22] "}]}],["$","$1","1",{"children":"Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:21:2:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","3",{"children":", 37:19209–19253, 2024."}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-7","style":"$undefined","children":"[23] "}]}],["$","$1","1",{"children":"LI Jia, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, et al. Numinamath, 2024."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-52","style":"$undefined","children":"[24] "}]}],["$","$1","1",{"children":"Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:23:2:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","3",{"children":", volume 35, pages 22199–22213. 
Curran Associates, Inc., 2022."}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-38","style":"$undefined","children":"[25] "}]}],["$","$1","1",{"children":"Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:24:2:style","children":"Proceedings of the IEEE international conference on computer vision workshops"}]}],["$","$1","3",{"children":", pages 554–561, 2013."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-10","style":"$undefined","children":"[26] "}]}],["$","$1","1",{"children":"Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. 
In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:25:2:style","children":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition"}]}],["$","$1","3",{"children":", pages 9579–9589, 2024."}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-30","style":"$undefined","children":"[27] "}]}],["$","$1","1",{"children":"Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:26:2:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","3",{"children":", 35:3843–3857, 2022."}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-16","style":"$undefined","children":"[28] "}]}],["$","$1","1",{"children":"Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 
Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models, 2024."}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-11","style":"$undefined","children":"[29] "}]}],["$","$1","1",{"children":"Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:28:2:style","children":"The Twelfth International Conference on Learning Representations"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-61","style":"$undefined","children":"[30] "}]}],["$","$1","1",{"children":"Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:29:2:style","children":"Advances in neural information processing systems"}]}],["$","$1","3",{"children":", 36:34892–34916, 2023."}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-18","style":"$undefined","children":"[31] "}]}],["$","$1","1",{"children":"Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:30:2:style","children":"arXiv preprint arXiv:2503.20783"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-67","style":"$undefined","children":"[32] "}]}],["$","$1","1",{"children":"Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:31:2:style","children":"arXiv preprint arXiv:2503.01785"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-37","style":"$undefined","children":"[33] "}]}],["$","$1","1",{"children":"Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Finegrained visual classification of aircraft. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:32:2:style","children":"arXiv preprint arXiv:1306.5151"}]}],["$","$1","3",{"children":", 2013."}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-21","style":"$undefined","children":"[34] "}]}],["$","$1","1",{"children":"Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:33:2:style","children":"International conference on machine learning"}]}],["$","$1","3",{"children":", pages 1928–1937. 
PMLR, 2016."}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-55","style":"$undefined","children":"[35] "}]}],["$","$1","1",{"children":"Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:34:2:style","children":"arXiv preprint arXiv:2501.19393"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-35","style":"$undefined","children":"[36] "}]}],["$","$1","1",{"children":"Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:35:2:style","children":"2008 Sixth Indian conference on computer vision, graphics & image processing"}]}],["$","$1","3",{"children":", pages 722–729. 
IEEE, 2008."}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-64","style":"$undefined","children":"[37] OpenAI. Gpt-4v(ision) system card, 2023."}]}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-1","style":"$undefined","children":"[38] OpenAI. Learning to reason with llms, 2024."}]}]]}],["$","$La","38",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-36","style":"$undefined","children":"[39] "}]}],["$","$1","1",{"children":"Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:38:2:style","children":"2012 IEEE conference on computer vision and pattern recognition"}]}],["$","$1","3",{"children":", pages 3498–3505. 
IEEE, 2012."}]]}],["$","$La","39",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-31","style":"$undefined","children":"[40] "}]}],["$","$1","1",{"children":"Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Spatial aptitude training for multimodal language models, 2024."}]]}],["$","$La","40",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-53","style":"$undefined","children":"[41] "}]}],["$","$1","1",{"children":"Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:40:2:style","children":"Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems"}]}],["$","$1","3",{"children":", CHI EA ’21, New York, NY, USA, 2021. Association for Computing Machinery."}]]}],["$","$La","41",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-40","style":"$undefined","children":"[42] "}]}],["$","$1","1",{"children":"RUCAIBox STILL Team. 
Still-3-1.5b-preview: Enhancing slow thinking abilities of small models through reinforcement learning. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:41:2:style","children":"URL"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","42",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-70","style":"$undefined","children":"[43] "}]}],["$","$1","1",{"children":"John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:42:2:style","children":"International conference on machine learning"}]}],["$","$1","3",{"children":", pages 1889–1897. PMLR, 2015."}]]}],["$","$La","43",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-20","style":"$undefined","children":"[44] "}]}],["$","$1","1",{"children":"John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 
High-dimensional continuous control using generalized advantage estimation, 2018."}]]}],["$","$La","44",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-13","style":"$undefined","children":"[45] "}]}],["$","$1","1",{"children":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:44:2:style","children":"arXiv preprint arXiv:1707.06347"}]}],["$","$1","3",{"children":", 2017."}]]}],["$","$La","45",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-14","style":"$undefined","children":"[46] "}]}],["$","$1","1",{"children":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:45:2:style","children":"arXiv preprint arXiv:2402.03300"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","46",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[47] "}],["$","$1","1",{"children":"Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:46:2:style","children":"arXiv preprint arXiv:2409.19256"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","47",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-19","style":"$undefined","children":"[48] "}]}],["$","$1","1",{"children":"Richard S Sutton, Andrew G Barto, et al. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:47:2:style","children":"Reinforcement learning: An introduction"}]}],["$","$1","3",{"children":", volume 1. 
MIT press Cambridge, 1998."}]]}],["$","$La","48",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-32","style":"$undefined","children":"[49] "}]}],["$","$1","1",{"children":"Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms, 2024."}]]}],["$","$La","49",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-60","style":"$undefined","children":"[50] "}]}],["$","$1","1",{"children":"Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:49:2:style","children":"Nature"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","50",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-65","style":"$undefined","children":"[51] "}]}],["$","$1","1",{"children":"Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 
Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:50:2:style","children":"arXiv preprint"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","51",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-51","style":"$undefined","children":"[52] "}]}],["$","$1","1",{"children":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:51:2:style","children":"Advances in neural information processing systems"}]}],["$","$1","3",{"children":", 35:24824–24837, 2022."}]]}],["$","$La","52",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-69","style":"$undefined","children":"[53] "}]}],["$","$1","1",{"children":"Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:52:2:style","children":"Machine learning"}]}],["$","$1","3",{"children":", 8:229–256, 1992."}]]}],["$","$La","53",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-4","style":"$undefined","children":"[54] "}]}],["$","$1","1",{"children":"Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:53:2:style","children":"arXiv preprint arXiv:2412.10302"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","54",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-59","style":"$undefined","children":"[55] "}]}],["$","$1","1",{"children":"Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. 
Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024."}]]}],["$","$La","55",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[56] "}],["$","$1","1",{"children":"An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:55:2:style","children":"arXiv preprint arXiv:2409.12122"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","56",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-57","style":"$undefined","children":"[57] "}]}],["$","$1","1",{"children":"Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:56:2:style","children":"Advances in neural information processing systems"}]}],["$","$1","3",{"children":", 36:11809–11822, 2023."}]]}],["$","$La","57",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-5","style":"$undefined","children":"[58] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu "}]}],["$","$1","1",{"children":"Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:57:2:style","children":"arXiv preprint arXiv:2408.01800"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","58",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-56","style":"$undefined","children":"[59] "}]}],["$","$1","1",{"children":"Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. 
Limo: Less is more for reasoning, 2025."}]]}],["$","$La","59",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-15","style":"$undefined","children":"[60] "}]}],["$","$1","1",{"children":"Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, and Mingxuan Wang. Dapo: An open-source llm reinforcement learning system at scale, 2025."}]]}],["$","$La","60",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-54","style":"$undefined","children":"[61] "}]}],["$","$1","1",{"children":"Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Alice H. 
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:60:2:style","children":"Advances in Neural Information Processing Systems"}]}],["$","$1","3",{"children":", 2022."}]]}],["$","$La","61",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"[62] "}],["$","$1","1",{"children":"Weihao Zeng, Yuzhen Huang, Wei Liu, Keqing He, Qian Liu, Zejun Ma, and Junxian He. 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:61:2"}]}],["$","$1","3",{"children":", 2025. Notion Blog."}]]}],["$","$La","62",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-66","style":"$undefined","children":"[63] "}]}],["$","$1","1",{"children":"Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:6:paragraphs:62:2:style","children":"arXiv preprint arXiv:2503.12937"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","63",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-72","style":"$undefined","children":"[64] "}]}],["$","$1","1",{"children":"Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, and Xuanjing Huang. Secrets of rlhf in large language models part i: Ppo, 2023."}]]}],["$","$La","64",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-68","style":"$undefined","children":"[65] "}]}],["$","$1","1",{"children":"Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. 
R1-zero’s \"aha moment\" in visual reasoning on a 2b non-sft model, 2025."}]]}]]}],["$","$L30","7",{"heading":"A Analysis of Distributed Gradient Averaging with Invalid Samples","index":7,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:0:0:style","children":"A.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:0:1:style","children":"Problem Formulation"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Consider a distributed training setup where:"}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• A batch of "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:2:1:style","children":"B "}]}],["$","$1","2",{"children":"samples is evenly distributed across "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:2:3:style","children":"N "}]}],["$","$1","4",{"children":"GPUs, with each GPU processing 
"}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:2:5:style","children":"K "}]}],["$","$1","6",{"children":"= "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:2:7:style","children":"B/N "}]}],["$","$1","8",{"children":"samples."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• For the "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:3:1:style","children":"i"}]}],["$","$1","2",{"children":"-th GPU, the first "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:3:3"}]]}]}],["$","$1","4",{"children":"samples produce zero gradients (invalid samples), while the remaining "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:3:5"}]]}]}],["$","$1","6",{"children":"samples generate valid 
gradients."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Let "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:4:1"}]]}]}],["$","$1","2",{"children":"denote the total number of invalid samples, and "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:4:3"}]]}]}],["$","$1","4",{"children":"the number of effective "}],["$","$1","5",{"children":"valid samples."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Let "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:5:1"}]]}]}],["$","$1","2",{"children":"represent the gradient of the "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:5:3:style","children":"j"}]}],["$","$1","4",{"children":"-th valid 
sample on the "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:5:5:style","children":"i"}]}],["$","$1","6",{"children":"-th GPU. We define the valid gradient sum for GPU "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:5:7:style","children":"i "}]}],["$","$1","8",{"children":"as:"}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:6:0"}]]}]}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The conventional distributed averaging approach in PyTorch computes a gradient estimate:"}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-6.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:8:0"}]]}]}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"whereas the theoretically correct gradient should be:"}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-7.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:10:0"}]]}]}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:11:0:style","children":"Proof. 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:11:1:style","children":"Step 1: Conventional Approach Derivation"}]}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Each GPU calculates its local mean using the "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:12:1:style","children":"assigned "}]}],["$","$1","2",{"children":"sample count "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:12:3:style","children":"K "}]}],["$","$1","4",{"children":"(not valid samples):"}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-8.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:13:0"}]]}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Global 
averaging then gives:"}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-9.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:15:0"}]]}]}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:16:0:style","children":"Step 2: True Gradient Computation"}]}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The correct gradient averages over only valid samples:"}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-10.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:18:0"}]]}]}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Observe the proportional relationship:"}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/14-11.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:7:paragraphs:20:0"}]]}]}]]}]]}],["$","$L30","8",{"heading":"B More Experiment Details","index":8,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-29","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:0:0:style","children":"B.1 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:0:1:style","children":"Experiment Settings"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/15-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:1:0"}]]}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:2:0:style","children":"B.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:2:1:style","children":"More Ablation Experiment Results"}]}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/15-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:3:0"}]]}]}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 11: Ablation on different group sizes (w/o AGE) using Qwen2.5 Math 7B."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:5:0:style","children":"Reward Normalization. "}]}],["$","$1","1",{"children":"We study the role of reward normalization and show the results in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:5:2"}]}],["$","$1","3",{"children":"Normalization within a batch is common practice in the RL training process ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:5:4"}]}],["$","$1","5",{"children":"]. 
The experimental "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"id-48","style":"$undefined","children":"results show that reward normalization within a group outperforms normalization within a batch."}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/15-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:6:0"}]]}]}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 12: Ablation on reward normalization using Qwen2.5 Math 7B."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:8:0:style","children":"B.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:8:1:style","children":"Prompt and Reward 
Function"}]}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:9:0:style","children":"Prompt for Reasoning. "}]}],["$","$1","1",{"children":"In the process of reinforcement fine-tuning, specific instructions are incorporated into the system prompt. These instructions encourage the model to generate intermediate reasoning steps, thereby facilitating the reasoning capabilities of the model. An example of this approach is provided below "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:9:2"}]}],["$","$1","3",{"children":":"}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:10:0:style","children":"Reward Function. "}]}],["$","$1","1",{"children":"For most tasks, we use the accuracy and formatting reward functions. For the grounding task, the Intersection over Union (IoU) reward function is utilized. 
For the Qwen 7B setting, we only use the accuracy reward."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• Accuracy: If the model’s output is consistent with the ground truth, a reward of 1.0 is awarded. • Formatting: If the format of the model output is “<think>…</think><answer>…</answer>”, a reward of 1.0 is granted."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"• IoU: Consistent with Visual-RFT ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:12:1"}]}],["$","$1","2",{"children":"], the reward value is derived from the calculated scores of the bounding boxes generated by the model."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:13:0:style","children":"B.4 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:13:1:style","children":"More Experiment 
Settings"}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To evaluate the unimodal reasoning capabilities of our proposed method, we utilize two publicly available code repositories: Open-r1 ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:14:1"}]}],["$","$1","2",{"children":"] and Open-rs ["}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:14:3"}]}],["$","$1","4",{"children":"]. These repositories are selected due to their"}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:15:0"}]]}]}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. 
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>"}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:17:0"}]]}]}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n<|im_start|>user\\nKelly can read five pages of her fiction book or two pages of her history textbook in seven minutes. "}],["$","$1","1",{"children":"If Kelly wants to read thirty pages of each book, for how many minutes in total must Kelly read?\\nPlease reason step by step, and put your final answer within \\boxed{}.<|im_end|>\\n<|im_start|>assistant\\n\""}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"extensive coverage of various reasoning scenarios and their ability to present substantial challenges that effectively assess the reasoning capabilities of advanced models. 
The DeepSeek-R1-Distill-Qwen-1.5B model is trained for 100 and 50 global steps using the open-s1 and open-rs datasets, as reported in the repository "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:19:1"}]}],["$","$1","2",{"children":", resulting in the GPG-RS1 and GPG-RS3 models, respectively."}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"For multimodal tasks, we have selected three renowned frameworks as our code base: VisualThinker-R1-Zero ["}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:1"}]}],["$","$1","2",{"children":"], R1-V ["}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:3"}]}],["$","$1","4",{"children":"], and Visual-RFT ["}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:5"}]}],["$","$1","6",{"children":"]. These frameworks cover a variety of tasks, including visual reasoning, geometric reasoning, and image perception. The use of distinct code bases enables a comprehensive assessment of the performance enhancements achieved by our method across different tasks. Specifically, for the VisualThinker-R1-Zero framework, we evaluate the results of the GPG approach on the CV-Bench ["}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:7"}]}],["$","$1","8",{"children":"]. 
Additionally, we evaluate the GPG results on the GEOQA dataset ["}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:9"}]}],["$","$1","10",{"children":"] based on R1-V. Finally, for tasks related to image perception, such as classification ["}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:11"}]}],["$","$1","12",{"children":"; "}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:13"}]}],["$","$1","14",{"children":"; "}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:15"}]}],["$","$1","16",{"children":"; "}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:17"}]}],["$","$1","18",{"children":"] and reasoning grounding ["}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:19"}]}],["$","$1","20",{"children":"], we examine the performance of GPG using the Visual-RFT framework."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:21:0"}]]}]}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 4: Comparison of GPG (blue curves) and GRPO (gray curves) in terms of training loss, rewards, and completion length. Experiments are based on DeepSeek-R1-Distill-Qwen-1.5B, same as Table "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:22:1"}]}]]}]]}],["$","$L30","9",{"heading":"C More Related Work","index":9,"length":10,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:0:style","children":"Proximal Policy Optimization"}]}],["$","$1","1",{"children":". PPO ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:2"}]}],["$","$1","3",{"children":"] addresses the inherent optimization instability of Trust Region Policy Optimization (TRPO) ["}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:4"}]}],["$","$1","5",{"children":"] through a clipped surrogate objective. 
Formally, let the probability ratio between the updated policy "}],["$","$1","6",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:6"}]]}]}],["$","$1","7",{"children":"and the previous policy "}],["$","$1","8",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:8"}]]}]}],["$","$1","9",{"children":"be defined as"}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:0"}]]}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:2:1"}]]}]}],["$","$1","2",{"children":"and "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:2:3"}]]}]}],["$","$1","4",{"children":"denote the action and state at timestep "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:2:5:style","children":"t"}]}],["$","$1","6",{"children":", respectively. While TRPO maximizes the surrogate objective"}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-8.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:3:0"}]]}]}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"under a Kullback-Leibler (KL) divergence constraint, PPO reformulates this via a clipped mechanism. 
Here, "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:4:1"}]]}]}],["$","$1","2",{"children":"represents the estimated advantage function quantifying the relative value of action "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:4:3"}]]}]}],["$","$1","4",{"children":"in state "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:4:5"}]]}]}],["$","$1","6",{"children":". 
The PPO objective is defined as:"}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/16-12.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:5:0"}]]}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where the clip operator restricts "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:6:1"}]]}]}],["$","$1","2",{"children":"to the interval "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:6:3"}]]}]}],["$","$1","4",{"children":", with "}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:6:5"}]]}]}],["$","$1","6",{"children":"being a hyperparameter controlling the policy update magnitude. This constraint prevents excessive policy deviations that could degrade performance."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To further stabilize training and promote exploration, the composite objective incorporates three components: 1) Clipped policy gradient term "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:7:1"}]]}]}],["$","$1","2",{"children":", 2) Value function loss:"}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-4.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:8:0"}]]}]}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:1"}]]}]}],["$","$1","2",{"children":"is the state-value function estimator and "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:3"}]]}]}],["$","$1","4",{"children":"denotes the target value computed via temporal-difference methods, 3) Entropy regularization:"}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-7.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:10:0"}]]}]}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"with "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:11:1:style","children":"A "}]}],["$","$1","2",{"children":"being the action space, which prevents premature policy convergence by encouraging stochasticity."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The complete objective integrates these terms as:"}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-8.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:13:0"}]]}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:14:1"}]]}]}],["$","$1","2",{"children":"and "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:14:3"}]]}]}],["$","$1","4",{"children":"are coefficients balancing policy optimization, value estimation accuracy, and exploration. 
Crucially, PPO replaces TRPO’s computationally intensive second-order KL constraint with first-order optimization of a clipped probability-ratio objective, enabling efficient large-scale implementations while approximately retaining the monotonic policy improvement behavior motivated by surrogate objective monotonicity analysis "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:14:5"}]}],["$","$1","6",{"children":"."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:15:0:style","children":"Group Relative Policy Optimization"}]}],["$","$1","1",{"children":". GRPO ["}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:15:2"}]}],["$","$1","3",{"children":"] establishes a policy gradient framework that eliminates dependency on explicit value function approximation through comparative advantage estimation within response groups. The method operates by sampling multiple candidate outputs for each input question and constructing advantage signals based on relative rewards within these groups. 
For a given question "}],["$","$1","4",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:15:4"}]]}]}],["$","$1","5",{"children":", the algorithm generates "}],["$","$1","6",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:15:6"}]]}]}],["$","$1","7",{"children":"from the current policy "}],["$","$1","8",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:15:8"}]]}]}],["$","$1","9",{"children":", then computes token-level advantages using intra-group reward comparisons."}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The advantage term "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:16:1"}]]}]}],["$","$1","2",{"children":"for the 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:16:3:style","children":"t"}]}],["$","$1","4",{"children":"-th token in the "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:16:5:style","children":"i"}]}],["$","$1","6",{"children":"-th response is defined as the deviation from the group average reward:"}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-15.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:17:0"}]]}]}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:18:1"}]]}]}],["$","$1","2",{"children":"denotes the reward model’s evaluation. 
This design inherently aligns with the comparative training paradigm of reward models, which typically learn from pairwise response rankings."}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The optimization objective integrates clipped probability ratios with explicit KL regularization. Defining the token-level probability ratio as:"}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-17.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:20:0"}]]}]}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"the clipped surrogate objective constrains policy updates through:"}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-18.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:22:0"}]]}]}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Diverging from PPO’s implicit KL control via reward shaping, GRPO directly regularizes policy divergence using an unbiased KL estimator:"}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-19.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:24:0"}]]}]}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The complete objective combines these components with a regularization coefficient "}],["$","$1","1",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:25:1"}]]}]}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.02546/images/17-21.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.93,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:26:0"}]]}]}]]}]]}]],["$","$L34",null,{"paper":"$c:props:children:props:children:0:props:product"}]]