36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2509.24494","publisher":"arxiv","paperJSON":{"title":"Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO","paperID":"2509.24494","avgLineHeight":10.96,"imgScale":4,"sections":[{"heading":"ABSTRACT","paragraphs":[[{"text":"$3c","element":"span"}]]},{"heading":"1 INTRODUCTION","paragraphs":[[{"text":"The DeepSeek-R1 model demonstrates that reinforcement learning (RL), specifically through Group Relative Policy Optimization (GRPO) ","element":"span"},{"href":"#id-0","referenceIndex":30,"text":"Shao et al. ","element":"a"},{"href":"#id-0","referenceIndex":30,"text":"(2024)","element":"a"},{"text":", is effective for training Chain-of-Thought (CoT) reasoning. This method involves prompting the Large Language Model (LLM) to generate a reasoning trace before producing a final answer, which is then reinforced via a reward signal to enhance the model’s reasoning capabilities. Subsequently, methods such as DAPO ","element":"span"},{"href":"#id-1","referenceIndex":39,"text":"Yu et al. ","element":"a"},{"href":"#id-1","referenceIndex":39,"text":"(2025)","element":"a"},{"text":", Dr. GRPO ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Liu et al. ","element":"a"},{"href":"#id-2","referenceIndex":21,"text":"(2025)","element":"a"},{"text":", and GPG ","element":"span"},{"href":"#id-3","referenceIndex":7,"text":"Chu et al. ","element":"a"},{"href":"#id-3","referenceIndex":7,"text":"(2025) ","element":"a"},{"text":"have improved upon GRPO’s loss function from various perspectives, achieving more stable training curves and better results on mathematical problems. Beyond textual reasoning tasks, the GRPO paradigm has also been extended to multi-modal scenarios ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Chen et al. 
","element":"a"},{"href":"#id-4","referenceIndex":6,"text":"(2025b)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":31,"text":"Shen et al. ","element":"a"},{"href":"#id-5","referenceIndex":31,"text":"(2025)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":13,"text":"Huang et al. ","element":"a"},{"href":"#id-6","referenceIndex":13,"text":"(2025b)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":10,"text":"Feng et al. ","element":"a"},{"href":"#id-7","referenceIndex":10,"text":"(2025)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":33,"text":"Song et al. ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"(2025)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":17,"text":"Kim et al. ","element":"a"},{"href":"#id-9","referenceIndex":17,"text":"(2025)","element":"a"},{"text":". Compared to refinements of the algorithm itself, these multi-modal applications have largely adopted task-specific reward functions to improve performance on specific objectives, such as adding a temporal video reward in Video-R1 ","element":"span"},{"href":"#id-7","referenceIndex":10,"text":"Feng et al. ","element":"a"},{"href":"#id-7","referenceIndex":10,"text":"(2025) ","element":"a"},{"text":"or a trajectory distance reward in ManipVLM-R1 ","element":"span"},{"href":"#id-8","referenceIndex":33,"text":"Song et al. ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"(2025)","element":"a"},{"text":". These studies have shown that CoT, trained with verifiable reward functions in RL, significantly enhances multi-modal reasoning. Despite these successes, GRPO still suffers from inherent challenges that limit training stability, efficiency, and overall effectiveness. 
These challenges manifest in several distinct yet interrelated aspects of the GRPO framework—namely, gradient coupling between thoughts and answers, inefficiencies in generation, and instability in advantage estimation.","element":"span"}],[{"text":"A well-known issue is the mismatch between reasoning traces and final answers: the reasoning may be valid while the final answer is wrong, or conversely, a flawed reasoning may still yield a correct answer. This phenomenon can be observed in both pure textual reasoning tasks ","element":"span"},{"href":"#id-10","referenceIndex":32,"text":"Simoni et al. ","element":"a"},{"href":"#id-10","referenceIndex":32,"text":"(2025)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","referenceIndex":19,"text":"Lin et al. ","element":"a"},{"href":"#id-11","referenceIndex":19,"text":"(2025a)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":27,"text":"Paul et al. ","element":"a"},{"href":"#id-12","referenceIndex":27,"text":"(2024)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":35,"text":"Turpin et al. ","element":"a"},{"href":"#id-13","referenceIndex":35,"text":"(2023) ","element":"a"},{"text":"and multi-modal tasks ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Chen et al. ","element":"a"},{"href":"#id-4","referenceIndex":6,"text":"(2025b)","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":2,"text":"Balasubramanian et al. ","element":"a"},{"href":"#id-14","referenceIndex":2,"text":"(2025) ","element":"a"},{"text":"including our experiments ","element":"span"},{"href":"#id-15","text":"5.4. ","element":"a"},{"text":"Since the gradients of thoughts and answers are inherently coupled in GRPO, such inconsistencies can distort the gradient direction and consequently undermine training effectiveness. Although GRPO-CARE ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Chen et al. 
","element":"a"},{"href":"#id-4","referenceIndex":6,"text":"(2025b) ","element":"a"},{"text":"introduces a consistency reward to alleviate this, it risks reward hacking and is difficult to apply when semantic consistency is ill-defined (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", it is difficult to judge the consistency between a CoT and the numerical coordinates of a predicted bounding box).","element":"span"}],[{"text":"$3d","element":"span"}],[{"text":"The third challenge concerns variance in advantage estimation. From a probabilistic modeling perspective, a “good” thought is fundamentally one that increases the likelihood of generating a “good” answer. It follows that the most robust method for evaluating a thought’s quality would be to assess the overall distribution of multiple answers sampled from it. The GRPO, in its current design, estimates a thought’s advantage based on a single sampled answer. A consequence of this singlesample estimation, particularly when combined with the stochasticity of high-temperature sampling in LLMs and VLMs, is the potential for increased variance in the advantage estimation. Crucially, more accurate estimation of thought advantages is not only beneficial for reducing training instability, but also for guiding the model to internalize what constitutes a genuinely “good” thought. This, in turn, enables the model to generate higher-quality answers more reliably.","element":"span"}],[{"text":"In this paper, we propose ","element":"span"},{"style":{"fontWeight":"bold"},"text":"GRPO-MA (GRPO with Multi-Answer)","element":"span"},{"text":". For each of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"thoughts, we sample ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"answers. 
A thought’s value is the average reward of its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"answers, which is used to derive its advantage relative to other thoughts, while each of the ","element":"span"},{"style":{"height":10.8},"width":129.55,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/1-0.png","element":"img","alt":" K × M","inline":true,"padRight":true},{"text":"answers also receives its own advantage. These two advantages are then used to update thought and answer tokens separately. Our theoretical analysis, based on the delta method ","element":"span"},{"href":"#id-16","referenceIndex":26,"text":"Oehlert ","element":"a"},{"href":"#id-16","referenceIndex":26,"text":"(1992)","element":"a"},{"text":", reveals the distinct effects of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"on the variance of the thought’s advantage. The analysis shows that as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"increases, the variance monotonically decreases towards zero. In contrast, increasing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"only reduces the variance to a non-zero constant. This design brings three benefits: (1) It reduces gradient coupling from noisy thought–answer mismatches by basing thought updates on an averaged reward. (2) The multi-answer estimate of a thought’s value has lower variance, leading to more stable advantage estimation. 
(3) It is computationally efficient by amortizing the cost of generating ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"thoughts across ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"answers each, avoiding the higher cost of generating ","element":"span"},{"style":{"height":10.8},"width":126.44,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/1-1.png","element":"img","alt":" K × M","inline":true,"padRight":true},{"text":"full reasoning responses, while still preserving a diverse and informative set of reward signals.","element":"span"}],[{"text":"We evaluate the effectiveness of GRPO-MA on Code, Math, several distinct vision tasks (Object Detection, Affordance Prediction, Trajectory Prediction, Demand Prediction, OCR-based VQA) and a simulator-based visual manipulation task. Compared to a GRPO baseline with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"responses, GRPO-MA yields clear gains with only a marginal increase in training time. Compared to a baseline with ","element":"span"},{"style":{"height":10.8},"width":128.79,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/1-2.png","element":"img","alt":" K × M","inline":true,"padRight":true},{"text":"responses, it achieves similar or slightly better performance using only about 60% of the training time, highlighting improved sample efficiency from more stable gradient estimation. In the visual manipulation task with extremely sparse rewards, GRPO-MA substantially outperforms the standard GRPO algorithm. 
Our ablation studies further show that increasing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"generally leads to improved model performance, and that the stability gained from more reliable thought advantage estimation may play an even more critical role than the sheer richness of reward signals. Finally, we also compare gradient spikes during training and find that GRPO-MA produces fewer gradient spikes, reflecting greater stability in the training progress.","element":"span"}],[{"text":"In summary, our contributions are as follows:","element":"span"}],[{"text":"• We propose the GRPO-MA algorithm, a simple but effective and general improvement strategy for GRPO that is compatible with other mainstream enhancements like DAPO.","element":"span"}],[{"text":"• We provide a theoretical analysis showing that our method can improve the stability of advantage estimation, leading to more stable gradients.","element":"span"}],[{"text":"• Across multiple distinct tasks, GRPO-MA consistently achieves performance gains over the baseline GRPO.","element":"span"}]]},{"heading":"2 RELATED WORK","paragraphs":[[{"text":"The GRPO algorithm has inspired several works to enhance its stability and efficiency by refining its loss function and sampling strategies. DAPO ","element":"span"},{"href":"#id-1","referenceIndex":39,"text":"Yu et al. ","element":"a"},{"href":"#id-1","referenceIndex":39,"text":"(2025) ","element":"a"},{"text":"introduces several “tricks” to stabilize training, such as Clip-Higher for exploration, Dynamic Sampling to filter uninformative samples, and a Token-Level Policy Gradient Loss to properly weight complex reasoning chains. Dr. GRPO ","element":"span"},{"href":"#id-2","referenceIndex":21,"text":"Liu et al. 
","element":"a"},{"href":"#id-2","referenceIndex":21,"text":"(2025) ","element":"a"},{"text":"corrects inherent response length bias and question difficulty bias by removing specific normalization terms from the loss and advantage calculation, leading to more stable training. Generative Policy Gradient (GPG) ","element":"span"},{"href":"#id-3","referenceIndex":7,"text":"Chu et al. ","element":"a"},{"href":"#id-3","referenceIndex":7,"text":"(2025) ","element":"a"},{"text":"simplifies the GRPO objective and introduces a gradient rescaling method to counteract “zero-gradient” samples, ensuring more effective policy updates. Further research has focused on improving efficiency, with CPPO ","element":"span"},{"href":"#id-17","referenceIndex":20,"text":"Lin et al. ","element":"a"},{"href":"#id-17","referenceIndex":20,"text":"(2025b) ","element":"a"},{"text":"pruning low-impact samples to reduce computational cost and Off-Policy GRPO ","element":"span"},{"href":"#id-18","referenceIndex":24,"text":"Mroueh et al. ","element":"a"},{"href":"#id-18","referenceIndex":24,"text":"(2025) ","element":"a"},{"text":"using stale data to improve sample efficiency. Other works enhance stability, such as GSPO ","element":"span"},{"href":"#id-19","referenceIndex":41,"text":"Zheng et al. ","element":"a"},{"href":"#id-19","referenceIndex":41,"text":"(2025)","element":"a"},{"text":", which realigns importance sampling at the sequence level; GMPO ","element":"span"},{"href":"#id-20","referenceIndex":40,"text":"Zhao et al. ","element":"a"},{"href":"#id-20","referenceIndex":40,"text":"(2025)","element":"a"},{"text":", which uses a geometric mean to mitigate sensitivity to outliers ; and GTPO ","element":"span"},{"href":"#id-10","referenceIndex":32,"text":"Simoni et al. ","element":"a"},{"href":"#id-10","referenceIndex":32,"text":"(2025)","element":"a"},{"text":", which resolves gradient conflicts and prevents policy collapse through trajectory analysis. 
Additionally, specialized solutions like Spectral Policy Optimization ","element":"span"},{"href":"#id-21","referenceIndex":5,"text":"Chen et al. ","element":"a"},{"href":"#id-21","referenceIndex":5,"text":"(2025a) ","element":"a"},{"text":"create learning signals for “all-negative” sample groups using AI feedback.","element":"span"}]]},{"heading":"3 PRELIMINARY: GRPO","paragraphs":[[{"text":"The GRPO algorithm ","element":"span"},{"href":"#id-0","referenceIndex":30,"text":"Shao et al. ","element":"a"},{"href":"#id-0","referenceIndex":30,"text":"(2024) ","element":"a"},{"text":"is a Proximal Policy Optimization (PPO) variant ","element":"span"},{"href":"#id-22","referenceIndex":29,"text":"Schulman ","element":"a"},{"href":"#id-22","referenceIndex":29,"text":"et al. ","element":"a"},{"href":"#id-22","referenceIndex":29,"text":"(2017)","element":"a"},{"text":". As a model-free method, GRPO omits a value model and instead calculates advantage values by directly normalizing the rewards obtained from generated responses.","element":"span"}],[{"text":"For a given prompt ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", GRPO samples ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"responses ","element":"span"},{"style":{"height":16},"width":315.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-0.png","element":"img","alt":" O = {o1, . . . , oK}","inline":true,"padRight":true},{"text":"from ","element":"span"},{"style":{"height":9.19},"width":37.7,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-1.png","element":"img","alt":" πθ","inline":true},{"text":", obtain rewards ","element":"span"},{"style":{"height":16},"width":237.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-2.png","element":"img","alt":"{R1, . . . 
, RK}","inline":true},{"text":", and compute the advantage as ","element":"span"},{"style":{"height":24.43},"width":395.14,"height":61.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-3.png","element":"img","alt":" A(oi) = (Ri − Mean({Rk})) / Std({Rk})","inline":true,"padRight":true},{"text":". In general, a response ","element":"span"},{"style":{"height":9.19},"width":30.32,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-4.png","element":"img","alt":" oi","inline":true,"padRight":true},{"text":"consists of a thought ","element":"span"},{"style":{"height":13.19},"width":48.34,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-5.png","element":"img","alt":" thi","inline":true,"padRight":true},{"text":"and an answer ","element":"span"},{"style":{"height":9.19},"width":74.66,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-6.png","element":"img","alt":" ansi","inline":true},{"text":".","element":"span"}],[{"text":"The GRPO objective is built upon a clipped surrogate objective that maximizes the expected advantage while regularizing the policy change. Formally, given a generated sequence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", the clipped objective is defined as:","element":"span"}],[{"id":"id-23","style":{"width":"93%"},"width":1477,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-7.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":26.01},"width":403.47,"height":65.02,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/2-8.png","element":"img","alt":" rt(θ) = πθ(yt|p,y ","element":"span"},{"text":"1","element":"span"},{"text":". 
For instance, T4A4 signifies a process of generating 4 thoughts, with each thought producing 4 answers, resulting in a total of 16 responses.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"More details (datasets, hyperparameters, training settings) are in the appendix ","element":"span"},{"href":"#id-39","style":{"fontWeight":"bold"},"text":"A.3 ","element":"a"},{"style":{"fontWeight":"bold"},"text":"and ","element":"span"},{"href":"#id-40","style":{"fontWeight":"bold"},"text":"A.4.","element":"a"}],[{"text":"5.1 ","element":"span"},{"text":"T","element":"span"},{"text":"EXT AND ","element":"span"},{"text":"V","element":"span"},{"text":"ISION ","element":"span"},{"text":"T","element":"span"},{"text":"ASK","element":"span"}],[{"text":"5.1.1 ","element":"span"},{"text":"T","element":"span"},{"text":"ASK ","element":"span"},{"text":"S","element":"span"},{"text":"ETTING AND ","element":"span"},{"text":"M","element":"span"},{"text":"ETRIC","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Math ","element":"span"},{"text":"Given a math problem, output the correct solution. Metric: pass@10/32 ","element":"span"},{"href":"#id-41","referenceIndex":4,"text":"Chen et al. ","element":"a"},{"href":"#id-41","referenceIndex":4,"text":"(2021)","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Code ","element":"span"},{"text":"Given a programming problem, output the solution code. Metric: pass@10/32.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Object Detection ","element":"span"},{"text":"Given an image and a specified object name, output bounding boxes. 
Metric: proportion of predictions with Intersection-over-Union (IoU) above a threshold.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Affordance Prediction ","element":"span"},{"text":"Given an image and an affordance (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", grasping, holding), output 2D coordinates. Metric: proportion of matched points.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Trajectory Prediction ","element":"span"},{"text":"Given an image and a manipulation instruction, output the 2D end-effector trajectory. ","element":"span"},{"text":"Metrics: Discrete Fréchet Distance (DFD) ","element":"span"},{"href":"#id-42","referenceIndex":9,"text":"Eiter et al. ","element":"a"},{"href":"#id-42","referenceIndex":9,"text":"(1994)","element":"a"},{"text":", Hausdorff Distance (HD) ","element":"span"},{"href":"#id-43","referenceIndex":14,"text":"Huttenlocher et al. ","element":"a"},{"href":"#id-43","referenceIndex":14,"text":"(2002)","element":"a"},{"text":", Root Mean Square Error (RMSE), Endpoint Distance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Demand Prediction ","element":"span"},{"text":"Given an image and a human demand instruction, output 2D coordinates of the demanded object. Metric: proportion of correct points.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"OCR-based VQA ","element":"span"},{"text":"Given an image and a question requiring text understanding (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", infographics, scene text, documents), output the answer. Metric: Average Normalized Levenshtein Similarity (ANLS) ","element":"span"},{"href":"#id-34","referenceIndex":3,"text":"Biten et al. 
","element":"a"},{"href":"#id-34","referenceIndex":3,"text":"(2019)","element":"a"},{"text":".","element":"span"}],[{"text":"Please note that, unlike the ","element":"span"},{"text":" ","element":"span"},{"text":"and ","element":"span"},{"text":" ","element":"span"},{"text":"tags used in GRPO, we design a distinct structured output format for ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Math ","element":"span"},{"text":"consisting of three tags: ","element":"span"},{"text":"","element":"span"},{"text":", ","element":"span"},{"text":"","element":"span"},{"text":", and ","element":"span"},{"text":"","element":"span"},{"text":". Multi-sampling is applied to both ","element":"span"},{"text":" ","element":"span"},{"text":"and ","element":"span"},{"text":"","element":"span"},{"text":". For other tasks, Multi-sampling is applied only to the ","element":"span"},{"text":"","element":"span"},{"text":".","element":"span"}],[{"text":"We track the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Gradient Spike Score (GSS) ","element":"span"},{"href":"#id-44","referenceIndex":12,"text":"Huang et al. ","element":"a"},{"href":"#id-44","referenceIndex":12,"text":"(2025a) ","element":"a"},{"text":"to measure gradient stability, defined as GSS","element":"span"},{"style":{"height":20.29},"width":253.11,"height":50.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/5-0.png","element":"img","alt":"(gi) = |gi|1","inline":true}],[{"text":"the number of spikes above 10 (GSS@10), where smaller is better. For all tasks, we also report the per-step training time (s).","element":"span"}],[{"text":"5.1.2 ","element":"span"},{"text":"B","element":"span"},{"text":"ASELINES","element":"span"}],[{"text":"We adopt models from the Qwen2.5-VL-Instruct series (3B, 7B, and 72B) ","element":"span"},{"href":"#id-37","referenceIndex":1,"text":"Bai et al. 
","element":"a"},{"href":"#id-37","referenceIndex":1,"text":"(2025) ","element":"a"},{"text":"as baselines to evaluate the performance of general-purpose models on our tasks. In addition, we train Qwen2.5-VL-3B-Instruct with real labels using supervised fine-tuning (SFT) to compare against GRPO, denoted as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"SFT ","element":"span"},{"text":"in the results. Finally, we compare our proposed GRPO-MA with GRPO","element":"span"}],[{"id":"id-48","style":{"width":"96%"},"width":1536,"height":524,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/6-0.png","element":"img"}],[{"text":"Figure 2: A case study comparing the baseline GRPO with our proposed GRPO-MA on a referring expression grounding task. The prompt is to locate the “purple bottled beverage”. The baseline model, GRPO (T4A1), recognizes the target’s existence but its reasoning is distracted by other salient objects (the snacks), leading to a failure in grounding. In contrast, our GRPO-MA (T4A4) correctly reasons about the scene’s context, focuses on the target object held by the robotic arm, and successfully provides the precise bounding box. This demonstrates the superior robustness of GRPO-MA in complex scene understanding and reasoning.","element":"figcaption","subtype":"caption"}],[{"id":"id-45","text":"Table 1: Combined Results for Math and Code Generation Benchmarks. 
","element":"figcaption","subtype":"caption"},{"text":"TN: The number of thoughts; AN: The number of answers per thought; S/S: Second/Step during training; Bold indicates the best performance among the GRPO variants.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1576,"height":424,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/6-1.png","element":"img"}],[{"text":"under different numbers of responses to demonstrate the superiority of GRPO-MA in terms of training efficiency and performance.","element":"span"}],[{"text":"5.1.3 ","element":"span"},{"text":"M","element":"span"},{"text":"AIN ","element":"span"},{"text":"R","element":"span"},{"text":"ESULTS","element":"span"}],[{"text":"The experimental results are presented in Tab. ","element":"span"},{"href":"#id-45","text":"1 ","element":"a"},{"text":"(Math and Code Problem), Tab. ","element":"span"},{"href":"#id-46","text":"2 ","element":"a"},{"text":"(Object Detection, Affordance Prediction and Demand Prediction), and Tab. ","element":"span"},{"href":"#id-47","text":"3 ","element":"a"},{"text":"(OCR-based VQA and Trajectory Prediction). Across multiple visual tasks, our proposed GRPO-MA outperforms both the GRPO and SFT under various settings, demonstrating its excellent versatility across diverse tasks. Compared to T4A1, T4A4 achieves significant performance gains with only about a 15% increase in training time. 
Compared to T16A1, T4A4 achieves comparable or even slightly better performance with about a 40% reduction in training time, which demonstrates that GRPO-MA does not involve a trade-off between training efficiency and training performance, but rather enhances both simultaneously.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gradient Stability ","element":"span"},{"text":"In most experiments, T4A4 achieves the lowest GSS@10, indicating the best gradient stability during training. This is consistent with our theoretical analysis: the advantage value is a crucial component of the gradient magnitude, so estimating it more stably also contributes to greater gradient stability.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Case Study ","element":"span"},{"text":"As illustrated in Fig. ","element":"span"},{"href":"#id-48","text":"2, ","element":"a"},{"text":"we present a case study to contrast the reasoning processes of T4A4 and T4A1 for the object detection task. T4A4 focuses on the general vicinity of the target object and its surrounding context. Conversely, T4A1 fails to detect the target, instead focusing its attention on the central region of the image. Additional case studies are provided in the appendix ","element":"span"},{"href":"#id-49","text":"A.5.","element":"a"}],[{"id":"id-46","text":"Table 2: Combined Results for Object Detection, Affordance, and Demand Prediction. TN: The ","element":"figcaption","subtype":"caption"},{"text":"number of thoughts; AN: The number of answers per thought; S/S: Second/Step during training; UMD: UMD Part Affordance Dataset; AGD20K: AGD20K Dataset; Bold indicates the best performance among models of the same size.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1579,"height":295,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/7-0.png","element":"img"}],[{"id":"id-47","text":"Table 3: Combined Results for OCR-based VQA and Trajectory Prediction. 
TN: The number of ","element":"figcaption","subtype":"caption"},{"text":"thoughts; AN: The number of answers per thought; S/S: Second/Step during training. Bold indicates the best performance among models of the same size.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1572,"height":326,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/7-1.png","element":"img"}],[{"text":"5.2 ","element":"span"},{"text":"S","element":"span"},{"text":"IMULATOR","element":"span"},{"text":"-","element":"span"},{"text":"BASED ","element":"span"},{"text":"M","element":"span"},{"text":"ANIPULATION ","element":"span"},{"text":"T","element":"span"},{"text":"ASK","element":"span"}],[{"text":"5.2.1 ","element":"span"},{"text":"T","element":"span"},{"text":"ASK ","element":"span"},{"text":"S","element":"span"},{"text":"ETTING","element":"span"}],[{"text":"We adapt most of the experimental settings introduced in ManipLLM ","element":"span"},{"href":"#id-36","referenceIndex":18,"text":"Li et al. ","element":"a"},{"href":"#id-36","referenceIndex":18,"text":"(2024)","element":"a"},{"text":"$3e","element":"span"}],[{"text":"5.2.2 ","element":"span"},{"text":"B","element":"span"},{"text":"ASELINES","element":"span"}],[{"text":"We adopt some of the same baselines used in the visual tasks and add several more: ManipLLM-7B, CoT-SFT, and GRPO-NoThink.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"ManipLLM-7B ","element":"span"},{"text":"ManipLLM collects a large number of successful samples in the simulator, constructs multiple task-specific question-answer pairs from them, and trains on these pairs with SFT. 
We fine-tune their weights under the new settings.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"CoT-SFT ","element":"span"},{"text":"We collect successful samples of GRPO-MA-T4A4 (including the chain of thoughts and answers), then fine-tune Qwen2.5-VL-3B using SFT.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"GRPO-NoThink ","element":"span"},{"text":"We employ GRPO to train Qwen2.5-VL-3B, but we do not require the model to generate a thought process; instead, it directly produces the answers.","element":"span"}],[{"text":"5.2.3 ","element":"span"},{"text":"M","element":"span"},{"text":"AIN ","element":"span"},{"text":"R","element":"span"},{"text":"ESULTS","element":"span"}],[{"text":"The experimental results are presented in Tab. ","element":"span"},{"href":"#id-50","text":"4. ","element":"a"},{"text":"A direct comparison reveals that the performance of T4A4 is significantly superior to that of T4A1. This outcome demonstrates that in tasks with","element":"span"}],[{"id":"id-50","text":"Table 4: Manipulating Point Prediction. TN: The number of thoughts; AN: The number of answers ","element":"figcaption","subtype":"caption"},{"text":"per thought; S/S: Second/Step during training.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"48%"},"width":762,"height":396,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-0.png","element":"img"}],[{"text":"extremely sparse rewards, such as multi-modal manipulation, employing a multi-answer sampling strategy leads to a more stable training process and facilitates the sampling of effective responses.","element":"span"}],[{"text":"Furthermore, our experiments provide valuable insights into the indispensable role of the Chain of Thought (CoT) in this context. We observe that the GRPO-NoThink model, which ablates the CoT while sampling an equal number of answers as GRPO-MA-T4A4, suffers a substantial degradation in performance. 
This result, along with the strong performance of the CoT-SFT model, clearly indicates that a high-quality CoT is a critical prerequisite for generating superior answers and effectively tackling such complex tasks.","element":"span"}],[{"text":"5.3 ","element":"span"},{"text":"A","element":"span"},{"text":"BLATION ","element":"span"},{"text":"S","element":"span"},{"text":"TUDY","element":"span"}],[{"id":"id-51","style":{"width":"98%"},"width":1568,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-1.png","element":"img"}],[{"text":"Figure 3: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Ablation Study on Trajectory Prediction ","element":"figcaption","subtype":"caption"},{"text":"While maintaining the number of thoughts ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"K ","element":"figcaption","subtype":"caption"},{"text":"= 4","element":"figcaption","subtype":"caption"},{"text":", we gradually increase the number of responses ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"M ","element":"figcaption","subtype":"caption"},{"text":"per thought from 1 to 8 (i.e., the number of responses is 4, 8, 12...32).","element":"figcaption","subtype":"caption"}],[{"text":"We conduct a detailed ablation study on the trajectory prediction task to analyze the effect of the number of generated answers ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"per thought, as shown in Fig. ","element":"span"},{"href":"#id-51","text":"3. 
","element":"a"},{"text":"The results indicate that as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"increases, all evaluation metrics decrease, although the rate of decline becomes progressively smaller.","element":"span"}],[{"text":"Surprisingly, T4A3 features 4 thoughts and 12 answers, outperforming T16A1’s 16 thoughts and 16 answers across all metrics. One possible explanation for this finding is that the importance of reward signal richness (the number of answers) is less significant than the quality of thoughts; filtering out higher-quality thoughts has a greater impact on the overall training process. Specifically, our method assesses a thought’s quality by averaging the rewards of its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"subsequent answers (","element":"span"},{"style":{"height":22.8},"width":403.76,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-2.png","element":"img","alt":"V (thi) = 1M�Mj=1 Ri,j","inline":true},{"text":"). With ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 3","element":"span"},{"text":", T4A3 obtains a more stable and reliable estimate of ","element":"span"},{"text":"each thought’s value, effectively reducing the noise from any single-answer evaluation. In contrast, T16A1’s approach (","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= 1","element":"span"},{"text":") is far more susceptible to randomness, as a single, potentially noisy reward is used to judge the entire thought.","element":"span"}],[{"id":"id-15","text":"5.4 ","element":"span"},{"text":"I","element":"span"},{"text":"NCONSISTENCY ","element":"span"},{"text":"A","element":"span"},{"text":"NALYSIS","element":"span"}],[{"text":"We quantify the inconsistency between thoughts and answers during training. 
For a thought ","element":"span"},{"style":{"height":13.19},"width":48.37,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-3.png","element":"img","alt":" th_i","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"answers, if ","element":"span"},{"style":{"height":16.79},"width":528.62,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-4.png","element":"img","alt":" sign(A(th_i)) ≠ sign(A(ans_{i,j}))","inline":true},{"text":", we mark that thought-answer pair as inconsistent. The inconsistency rate is defined as InconsistencyRate ","element":"span"},{"style":{"height":22.8},"width":734.33,"height":56.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-5.png","element":"img","alt":" = (1/(KM)) Σ_{i=1}^{K} Σ_{j=1}^{M} 1[A(th_i) A(ans_{i,j}) < 0]","inline":true},{"text":", where ","element":"span"},{"style":{"height":16},"width":56.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/8-6.png","element":"img","alt":" 1[·]","inline":true,"padRight":true},{"text":"denotes ","element":"span"},{"text":"the indicator function, which equals ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the condition inside holds and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise.","element":"span"}],[{"text":"Under the T4A4 setting, the inconsistency rate is ","element":"span"},{"style":{"fontWeight":"bold"},"text":"25.65% ","element":"span"},{"text":"for trajectory prediction and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"24.83% ","element":"span"},{"text":"for object detection. 
Notably, this rate is also indicative of the GRPO baselines (T4A1, T8A1, T16A1): although they sample only a single answer per thought and therefore cannot measure it directly, they share the same generation hyperparameters (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", temperature, top-","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", and top-","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"sampling). This observation further supports our claim that inconsistency is common in GRPO training. Moreover, this inconsistency implicitly undermines model training.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Accuracy reward curves and an analysis of reward-signal richness are provided in Appendix ","element":"span"},{"href":"#id-52","style":{"fontWeight":"bold"},"text":"A.6.","element":"a"}]]},{"heading":"6 CONCLUSION","paragraphs":[[{"text":"We present GRPO-MA, a simple yet theoretically grounded extension of GRPO that tackles three key challenges in training Chain-of-Thought models: unstable advantage estimation, gradient coupling between thoughts and answers, and sparse reward signals under limited sampling. By generating multiple answers per thought, GRPO-MA reduces the variance of advantage estimation, decouples the gradient between thoughts and answers, and densifies reward feedback. Our theoretical analysis further shows that increasing the number of answers per thought is a principled way to stabilize gradients, which is corroborated by experiments on math, code, and multimodal tasks. Together, these results demonstrate that GRPO-MA improves both the stability and efficiency of GRPO-based reinforcement learning.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Limitations ","element":"span"},{"text":"Our study has several limitations. 
First, computational constraints prevented us from running experiments on larger-scale models. Second, our analysis relies on the simplifying assumption that thought values are independent, a condition that may not hold in practice. Finally, the lack of a general-purpose reward model means that our testing is confined to tasks with verifiable rewards.","element":"span"}]]},{"heading":"REFERENCES","paragraphs":[[{"id":"id-37","text":"Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, ","element":"span"},{"text":"Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2502.13923","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-14","text":"Sriram Balasubramanian, Samyadeep Basu, and Soheil Feizi. A closer look at bias and chain-of-","element":"span"},{"text":"thought faithfulness of large (vision) language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2505.23945","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-34","text":"Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, ","element":"span"},{"text":"CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2019 competition on scene text visual question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2019 International Conference on Document Analysis and Recognition (ICDAR)","element":"span"},{"text":", pp. 1563–1570. IEEE, 2019.","element":"span"}],[{"id":"id-41","text":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared ","element":"span"},{"text":"Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2107.03374","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-21","text":"Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, and Tianyi Lin. Spectral policy optimization: Coloring ","element":"span"},{"text":"your incorrect reasoning in grpo. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2505.11595","element":"span"},{"text":", 2025a.","element":"span"}],[{"id":"id-4","text":"Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo- ","element":"span"},{"text":"care: ","element":"span"},{"text":"Consistency-aware reinforcement learning for multimodal reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2506.16141","element":"span"},{"text":", 2025b.","element":"span"}],[{"id":"id-3","text":"Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong ","element":"span"},{"text":"reinforcement learning baseline for model reasoning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.02546","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-29","text":"AgiBot World Colosseum contributors. ","element":"span"},{"text":"Agibot world colosseum. ","element":"span"},{"href":"https://github.com/OpenDriveLab/AgiBot-World","text":"https://github.com/ ","element":"a"},{"href":"https://github.com/OpenDriveLab/AgiBot-World","text":"OpenDriveLab/AgiBot-World","element":"a"},{"text":", 2024.","element":"span"}],[{"id":"id-42","text":"Thomas Eiter, Heikki Mannila, et al. Computing discrete fr´echet distance. 1994.","element":"span"}],[{"id":"id-7","text":"Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, ","element":"span"},{"text":"Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.21776","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-38","text":"Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, ","element":"span"},{"text":"Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICLR","element":"span"},{"text":", 1(2):3, 2022.","element":"span"}],[{"id":"id-44","text":"Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, and Shiwei Liu. Spam: Spike- ","element":"span"},{"text":"aware adam with momentum reset for stable llm training. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2501.06842","element":"span"},{"text":", 2025a.","element":"span"}],[{"id":"id-6","text":"Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and ","element":"span"},{"text":"Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.06749","element":"span"},{"text":", 2025b.","element":"span"}],[{"id":"id-43","text":"Daniel P Huttenlocher, Gregory A. Klanderman, and William J Rucklidge. Comparing images using ","element":"span"},{"text":"the hausdorff distance. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on pattern analysis and machine intelligence","element":"span"},{"text":", 15(9): 850–863, 2002.","element":"span"}],[{"id":"id-32","text":"Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, ","element":"span"},{"text":"Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Computer Vision and Pattern Recognition Conference","element":"span"},{"text":", pp. 1724–1734, 2025.","element":"span"}],[{"id":"id-54","text":"Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, ","element":"span"},{"text":"Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 16384–16393, 2024.","element":"span"}],[{"id":"id-9","text":"Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, and Younggyo Seo. ","element":"span"},{"text":"Robot-r1: Reinforcement learning for enhanced embodied reasoning in robotics. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2506.00070","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-36","text":"Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, ","element":"span"},{"text":"Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition","element":"span"},{"text":", pp. 18061–18070, 2024.","element":"span"}],[{"id":"id-11","text":"Zhenru Lin, Jiawen Tao, Yang Yuan, and Andrew Chi-Chih Yao. Existing llms are not self-consistent ","element":"span"},{"text":"for simple tasks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2506.18781","element":"span"},{"text":", 2025a.","element":"span"}],[{"id":"id-17","text":"Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. 
Cppo: Accelerating the training of group ","element":"span"},{"text":"relative policy optimization-based reasoning models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.22342","element":"span"},{"text":", 2025b.","element":"span"}],[{"id":"id-2","text":"Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, ","element":"span"},{"text":"and Min Lin. ","element":"span"},{"text":"Understanding r1-zero-like training: A critical perspective. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.20783","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-31","text":"Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, and Dacheng Tao. Learning affordance grounding ","element":"span"},{"text":"from exocentric images. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","element":"span"},{"text":", pp. 2252–2261, 2022.","element":"span"}],[{"id":"id-53","text":"Maxwell-Jia. Aime2024. ","element":"span"},{"href":"https://huggingface.co/datasets/Maxwell-Jia/AIME_2024","text":"https://huggingface.co/datasets/Maxwell-Jia/AIME_","element":"a"},{"href":"https://huggingface.co/datasets/Maxwell-Jia/AIME_2024","text":"2024","element":"a"},{"text":", 2024.","element":"span"}],[{"id":"id-18","text":"Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan ","element":"span"},{"text":"Greenewald, Jiri Navratil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2505.22257","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-30","text":"Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 
Affordance detection of ","element":"span"},{"text":"tool parts from geometric features. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICRA","element":"span"},{"text":", 2015.","element":"span"}],[{"id":"id-16","text":"Gary W Oehlert. A note on the delta method. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The American Statistician","element":"span"},{"text":", 46(1):27–29, 1992.","element":"span"}],[{"id":"id-12","text":"Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring ","element":"span"},{"text":"and improving faithfulness of chain-of-thought reasoning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2024","element":"span"},{"text":", pp. 15012–15032, 2024.","element":"span"}],[{"id":"id-27","text":"PrimeIntellect. Synthetic-1: Scaling distributed synthetic data generation for verified reasoning. ","element":"span"},{"href":"https://www.primeintellect.ai/blog/synthetic-1","text":"https://www.primeintellect.ai/blog/synthetic-1","element":"a"},{"text":", 2024.","element":"span"}],[{"id":"id-22","text":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy ","element":"span"},{"text":"optimization algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1707.06347","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-0","text":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, ","element":"span"},{"text":"Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.03300","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-5","text":"Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun ","element":"span"},{"text":"Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2504.07615","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-10","text":"Marco Simoni, Aleksandar Fontana, Giulio Rossolini, and Andrea Saracino. Gtpo: Trajectory-based ","element":"span"},{"text":"policy optimization in large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2508.03772","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-8","text":"Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, ","element":"span"},{"text":"Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, et al. ","element":"span"},{"text":"Maniplvm-r1: Reinforcement learning for reasoning in embodied manipulation with large vision-language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2505.16517","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-35","text":"Rub`en Tito, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2021 ","element":"span"},{"text":"competition on document visual question answering. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Document Analysis and Recognition","element":"span"},{"text":", pp. 635–649. Springer, 2021.","element":"span"}],[{"id":"id-13","text":"Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always ","element":"span"},{"text":"say what they think: Unfaithful explanations in chain-of-thought prompting. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36:74952–74965, 2023.","element":"span"}],[{"id":"id-33","text":"Hongcheng Wang, Peiqi Liu, Wenzhe Cai, Mingdong Wu, Zhengyu Qian, and Hao Dong. Mo-ddn: ","element":"span"},{"text":"A coarse-to-fine attribute-based exploration agent for multi-object demand-driven navigation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 37:64176–64214, 2024.","element":"span"}],[{"id":"id-28","text":"Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz- ","element":"span"},{"text":"Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. Livebench: A challenging, contaminationfree llm benchmark. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2406.19314","element":"span"},{"text":", 4, 2024.","element":"span"}],[{"id":"id-55","text":"Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao ","element":"span"},{"text":"Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","element":"span"},{"text":", June 2020.","element":"span"}],[{"id":"id-1","text":"Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian ","element":"span"},{"text":"Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2503.14476","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-20","text":"Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shao- ","element":"span"},{"text":"han Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2507.20673","element":"span"},{"text":", 2025.","element":"span"}],[{"id":"id-19","text":"Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, ","element":"span"},{"text":"Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2507.18071","element":"span"},{"text":", 2025.","element":"span"}]]},{"heading":"A APPENDIX","paragraphs":[[{"text":"Below is the table of contents for the appendix.","element":"span"}],[{"style":{"width":"90%"},"width":1434,"height":1406,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/12-0.png","element":"img"}],[{"text":"VLM-R1 ","element":"span"},{"href":"#id-5","referenceIndex":31,"text":"Shen et al. ","element":"a"},{"href":"#id-5","referenceIndex":31,"text":"(2025) ","element":"a"},{"text":"applies a general GRPO pipeline to Vision-Language Models, enabling smaller models to achieve competitive performance on complex visual reasoning tasks. Vision-R1 ","element":"span"},{"href":"#id-6","referenceIndex":13,"text":"Huang et al. ","element":"a"},{"href":"#id-6","referenceIndex":13,"text":"(2025b) ","element":"a"},{"text":"generates high-quality multimodal Chain-of-Thought data and uses Progressive Thinking Suppression Training (PTST) to prevent the model from creating overly long reasoning paths. Video-R1 ","element":"span"},{"href":"#id-7","referenceIndex":10,"text":"Feng et al. 
","element":"a"},{"href":"#id-7","referenceIndex":10,"text":"(2025) ","element":"a"},{"text":"introduces Temporal-GRPO (T-GRPO), a novel reward scheme that encourages the model to leverage temporal information in video sequences. ManipLVM-R1 ","element":"span"},{"href":"#id-8","referenceIndex":33,"text":"Song et al. ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"(2025) ","element":"a"},{"text":"employs GRPO for robotic manipulation with new affordance-aware and trajectory matching reward functions to improve the localization of interactive parts and the physical plausibility of actions. Robot-R1 ","element":"span"},{"href":"#id-9","referenceIndex":17,"text":"Kim et al. ","element":"a"},{"href":"#id-9","referenceIndex":17,"text":"(2025) ","element":"a"},{"text":"reframes robot learning as a multiple-choice question answering task, using GRPO to optimize the reasoning for embodied manipulation.","element":"span"}],[{"id":"id-26","style":{"width":"39%"},"width":625,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/12-1.png","element":"img"}],[{"text":"This document provides a full derivation of the approximate variance for the Thought Advantage, ","element":"span"},{"style":{"height":16},"width":112,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/12-2.png","element":"img","alt":"A(thi)","inline":true},{"text":", as presented in the main paper. We first review the multivariate Delta Method, establish the asymptotic normality of our estimators via the Central Limit Theorem (CLT), and finally present the detailed application and gradient calculation.","element":"span"}],[{"style":{"width":"48%"},"width":769,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/12-3.png","element":"img"}],[{"text":"The Delta Method is a fundamental result in statistics used to approximate the moments of a function of one or more random variables. 
The multivariate version is central to our analysis.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"General Formulation. ","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":23.29},"width":367.72,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-0.png","element":"img","alt":"−→V = (V1, V2, . . . , VK)","inline":true,"padRight":true},{"text":"be a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"-dimensional random vector of estimators with a true mean vector ","element":"span"},{"style":{"height":19.26},"width":369.26,"height":48.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-1.png","element":"img","alt":"−→µ = (µ1, µ2, . . . , µK)","inline":true},{"text":". Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"denote the sample size used to compute each estimator ","element":"span"},{"style":{"height":13.19},"width":40.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-2.png","element":"img","alt":" Vk","inline":true},{"text":". To emphasize that these estimators are functions of the sample size, we denote the vector as ","element":"span"},{"style":{"height":21.68},"width":71.85,"height":54.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-3.png","element":"img","alt":"−→V M","inline":true},{"text":". The Delta Method provides the asymptotic distribution of ","element":"span"},{"style":{"height":23.29},"width":130.52,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-4.png","element":"img","alt":" f(−→V M)","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":11.2},"width":146.76,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-5.png","element":"img","alt":" M → ∞","inline":true},{"text":". 
Specifically, if ","element":"span"},{"style":{"height":21.68},"width":71.85,"height":54.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-6.png","element":"img","alt":"−→V M","inline":true,"padRight":true},{"text":"satisfies the condition for the Central Limit Theorem such that:","element":"span"}],[{"style":{"width":"68%"},"width":1084,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-7.png","element":"img"}],[{"text":"where","element":"span"},{"style":{"height":17.04},"width":42.15,"height":42.59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-8.png","element":"img","alt":"d−→","inline":true,"padRight":true},{"text":"denotes convergence in distribution, then the transformed variable ","element":"span"},{"style":{"height":23.29},"width":130.5,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-9.png","element":"img","alt":" f(−→V M)","inline":true,"padRight":true},{"text":"also converges in distribution:","element":"span"}],[{"style":{"width":"80%"},"width":1280,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-10.png","element":"img"}],[{"text":"From this formal result, we derive the practical formula for approximating the variance of ","element":"span"},{"style":{"height":23.29},"width":130.52,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-11.png","element":"img","alt":" f(−→V M)","inline":true,"padRight":true},{"text":"for a large but finite sample size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M","element":"span"},{"text":". The term","element":"span"},{"style":{"height":16},"width":75.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-12.png","element":"img","alt":"√M","inline":true,"padRight":true},{"text":"acts as a scaling factor that ensures the limiting distribution has a finite, non-zero variance. 
The variance of the estimator itself is given by:","element":"span"}],[{"style":{"width":"72%"},"width":1153,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":23.29},"width":168.83,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-14.png","element":"img","alt":" Var(−→V M)","inline":true,"padRight":true},{"text":"is the actual covariance matrix of the estimator vector, which is related to the asymptotic covariance by ","element":"span"},{"style":{"height":24.08},"width":434.35,"height":60.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-15.png","element":"img","alt":" Var(−→V M) ≈ Σasymptotic/M","inline":true},{"text":".","element":"span"}],[{"style":{"width":"76%"},"width":1214,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-16.png","element":"img"}],[{"text":"Before applying the Delta Method, we must first establish that our core estimator, the vector of estimated values ","element":"span"},{"style":{"height":24.36},"width":100.92,"height":60.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-17.png","element":"img","alt":"−−−→V (th)","inline":true},{"text":", satisfies the prerequisite of being asymptotically normal. 
This justification comes from the Central Limit Theorem (CLT).","element":"span"}],[{"text":"For each thought ","element":"span"},{"style":{"height":13.19},"width":54.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-18.png","element":"img","alt":" thk","inline":true},{"text":", its estimated value ","element":"span"},{"style":{"height":16},"width":120.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-19.png","element":"img","alt":" V (thk)","inline":true,"padRight":true},{"text":"is the sample mean of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"i.i.d. random variables, the rewards ","element":"span"},{"style":{"height":19.93},"width":169.26,"height":49.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-20.png","element":"img","alt":" {Rk,j}Mj=1","inline":true},{"text":":","element":"span"}],[{"style":{"width":"61%"},"width":974,"height":121,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-21.png","element":"img"}],[{"text":"The rewards have a finite true mean ","element":"span"},{"style":{"height":10.88},"width":63.04,"height":27.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-22.png","element":"img","alt":" µRk","inline":true,"padRight":true},{"text":"and a finite true variance ","element":"span"},{"style":{"height":19.47},"width":61.76,"height":48.67,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-23.png","element":"img","alt":" σ2Rk","inline":true},{"text":". According to the CLT, as ","element":"span"},{"text":"the sample size ","element":"span"},{"style":{"height":11.2},"width":151.54,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-24.png","element":"img","alt":" M → ∞","inline":true},{"text":", the distribution of the standardized sample mean converges to a normal distribution. 
This is formally stated as:","element":"span"}],[{"style":{"width":"67%"},"width":1077,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-25.png","element":"img"}],[{"text":"We ","element":"span"},{"text":"now ","element":"span"},{"text":"extend ","element":"span"},{"text":"this ","element":"span"},{"text":"to ","element":"span"},{"text":"the ","element":"span"},{"text":"full ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"-dimensional ","element":"span"},{"text":"vector ","element":"span"},{"text":"of ","element":"span"},{"text":"estimators, ","element":"span"},{"style":{"height":24.36},"width":193.83,"height":60.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-26.png","element":"img","alt":"−−−→V (th) =","inline":true},{"style":{"height":16},"width":369.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-27.png","element":"img","alt":"(V (th1), . . . , V (thK))","inline":true},{"text":". Since we have assumed that the estimated values for different thoughts are mutually independent, the joint asymptotic distribution of the vector is also normal. The mean of this limiting distribution is a zero vector, and the covariance matrix is diagonal, composed of the individual variances. Therefore, the entire vector of estimators is asymptotically normal:","element":"span"}],[{"style":{"width":"67%"},"width":1065,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-28.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":19.26},"width":354.66,"height":48.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-29.png","element":"img","alt":"−→µ = (µR1, . . . 
, µRK)","inline":true,"padRight":true},{"text":"is the vector of true means, and ","element":"span"},{"style":{"height":15.59},"width":87.3,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-30.png","element":"img","alt":" Σdiag","inline":true,"padRight":true},{"text":"is the diagonal covariance matrix of the limiting distribution:","element":"span"}],[{"style":{"width":"68%"},"width":1083,"height":217,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-31.png","element":"img"}],[{"text":"This result formally justifies the application of the Multivariate Delta Method to the thought advantage function ","element":"span"},{"style":{"height":24.35},"width":329.33,"height":60.88,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/13-32.png","element":"img","alt":" A(thi) = fi(−−−→V (th))","inline":true},{"text":".","element":"span"}],[{"style":{"width":"69%"},"width":1101,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Verification of Assumptions. ","element":"span"},{"text":"The prerequisites for the Delta Method are satisfied. First, as established above, our estimator vector ","element":"span"},{"style":{"height":24.36},"width":100.92,"height":60.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-1.png","element":"img","alt":"−−−→V (th)","inline":true,"padRight":true},{"text":"is asymptotically normal. 
Second, the advantage function ","element":"span"},{"style":{"height":17.63},"width":455.56,"height":44.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-2.png","element":"img","alt":"A(thi) = (V (thi) − ¯V )/SV","inline":true,"padRight":true},{"text":"is continuously differentiable everywhere except where the denominator ","element":"span"},{"style":{"height":13.19},"width":125.25,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-3.png","element":"img","alt":" SV = 0","inline":true},{"text":", where ","element":"span"},{"style":{"height":21.11},"width":359.39,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-4.png","element":"img","alt":"¯V = (1/K) Σ_{k=1}^{K} V (thk)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":28.8},"width":145.26,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-5.png","element":"img","alt":" SV =","inline":true}],[{"text":"gradient at ","element":"span"},{"style":{"height":18.46},"width":40,"height":46.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-6.png","element":"img","alt":"−→µ","inline":true,"padRight":true},{"text":", where the denominator’s analogue is ","element":"span"},{"style":{"height":11.59},"width":63.11,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-7.png","element":"img","alt":" σµR","inline":true},{"text":". The approximation is thus valid assuming ","element":"span"},{"style":{"height":15.59},"width":139.84,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-8.png","element":"img","alt":"σµR > 0","inline":true},{"text":", i.e., not all thoughts have the same true value.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gradient Calculation. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":24.36},"width":453.96,"height":60.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-9.png","element":"img","alt":"−−−→V (th) = V = (V1, . . . , VK)","inline":true,"padRight":true},{"text":"and define","element":"span"}],[{"style":{"width":"62%"},"width":991,"height":283,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-10.png","element":"img"}],[{"text":"We compute ","element":"span"},{"style":{"height":16},"width":139.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-11.png","element":"img","alt":" ∂fi/∂Vk","inline":true,"padRight":true},{"text":"in steps.","element":"span"}],[{"text":"First,","element":"span"}],[{"style":{"width":"67%"},"width":1065,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-12.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"we have, using ","element":"span"},{"style":{"height":19.37},"width":448.08,"height":48.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-13.png","element":"img","alt":" ∂(Vj − ¯V )/∂Vk = δjk − 1K","inline":true,"padRight":true},{"text":",","element":"span"}],[{"style":{"width":"99%"},"width":1582,"height":493,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-14.png","element":"img"}],[{"text":"Applying the quotient rule yields, for arbitrary ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"89%"},"width":1419,"height":114,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-15.png","element":"img"}],[{"text":"Evaluate at ","element":"span"},{"style":{"height":18.46},"width":125.23,"height":46.16,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-16.png","element":"img","alt":" V = 
−→µ","inline":true,"padRight":true},{"text":"and denote","element":"span"}],[{"style":{"width":"99%"},"width":1582,"height":286,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-17.png","element":"img"}],[{"text":"Finally, by the first-order multivariate Delta method, with ","element":"span"},{"style":{"height":24.79},"width":343.47,"height":61.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-18.png","element":"img","alt":" Var(−→V ) = 1M Σdiag","inline":true,"padRight":true},{"text":"(and ","element":"span"},{"style":{"height":15.59},"width":145.25,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-19.png","element":"img","alt":" Σdiag =","inline":true},{"style":{"height":19.38},"width":332.8,"height":48.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-20.png","element":"img","alt":"diag(σ2R1, . . . , σ2RK)","inline":true},{"text":"),","element":"span"}],[{"style":{"width":"96%"},"width":1526,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/14-21.png","element":"img"}],[{"style":{"width":"68%"},"width":1080,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-0.png","element":"img"}],[{"text":"For a single answer ","element":"span"},{"style":{"height":11.59},"width":97.39,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-1.png","element":"img","alt":" ansi,j","inline":true},{"text":", the advantage is defined as","element":"span"}],[{"style":{"width":"99%"},"width":1571,"height":183,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-2.png","element":"img"}],[{"text":"Using the first-order multivariate Delta method, the variance of ","element":"span"},{"style":{"height":16.79},"width":162.57,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-3.png","element":"img","alt":" 
A(ansi,j)","inline":true,"padRight":true},{"text":"can be approximated as","element":"span"}],[{"style":{"width":"84%"},"width":1344,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.79},"width":337.62,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-5.png","element":"img","alt":" gi,j(R) = A(ansi,j)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.4},"width":28,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-6.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"denotes the vector of reward means.","element":"span"}],[{"text":"Evaluating the gradient at ","element":"span"},{"style":{"height":14},"width":115.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-7.png","element":"img","alt":" R = µ","inline":true,"padRight":true},{"text":"and grouping by thought, we obtain","element":"span"}],[{"style":{"width":"96%"},"width":1532,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.28},"width":167.38,"height":43.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-9.png","element":"img","alt":" δ(k,m),(i,j)","inline":true,"padRight":true},{"text":"is the Kronecker delta, ","element":"span"},{"style":{"height":16.79},"width":404.24,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-10.png","element":"img","alt":" ˜µk = (µRk − µ ¯R)/σµR","inline":true,"padRight":true},{"text":"is the expected advantage of ","element":"span"},{"text":"thought ","element":"span"},{"style":{"height":21.11},"width":396.79,"height":52.78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-11.png","element":"img","alt":" thk, 
µ ¯R = (1/K) Σ_{k=1}^{K} µRk","inline":true},{"text":", and ","element":"span"},{"style":{"height":21.96},"width":520.86,"height":54.89,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-12.png","element":"img","alt":" σ^2_{µR} = (1/(K−1)) Σ_{k=1}^{K} (µRk − µ ¯R)^2","inline":true},{"text":".","element":"span"}],[{"id":"id-25","style":{"width":"51%"},"width":811,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-13.png","element":"img"}],[{"text":"To examine whether the assumption of independence across thoughts (i.e., a diagonal covariance matrix) holds in practice, we conducted numerical simulations and empirically estimated the covariance structure of ","element":"span"},{"style":{"height":23.29},"width":387.96,"height":58.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-14.png","element":"img","alt":"−→V (th) = (V1, . . . , VK)","inline":true},{"text":". Specifically, we generated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"independent replications of the full ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"-dimensional estimator vector, denoted ","element":"span"},{"style":{"height":14.18},"width":76.25,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-15.png","element":"img","alt":" V (n)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, N","element":"span"},{"text":", and computed the empirical covariance matrix:","element":"span"}],[{"style":{"width":"85%"},"width":1351,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-16.png","element":"img"}],[{"text":"We then assessed the degree of diagonal dominance using two diagnostics: row-wise strict diagonal dominance and a Frobenius-norm-based diagonal energy ratio.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Row-wise strict diagonal dominance. ","element":"span"},{"text":"Row ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"of the covariance matrix is said to be strictly diagonally dominant if","element":"span"}],[{"style":{"width":"59%"},"width":939,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-17.png","element":"img"}],[{"text":"We summarize this property by the proportion of rows that satisfy the condition:","element":"span"}],[{"style":{"width":"71%"},"width":1140,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-18.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":73.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-19.png","element":"img","alt":" 1{·}","inline":true,"padRight":true},{"text":"denotes the indicator function. A value of ","element":"span"},{"style":{"height":14},"width":222.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-20.png","element":"img","alt":" prow dom ≈ 1","inline":true,"padRight":true},{"text":"indicates strong diagonal dominance.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Frobenius-norm-based diagonal energy ratio. 
","element":"span"},{"text":"We also consider the proportion of the squared Frobenius norm explained by the diagonal entries:","element":"span"}],[{"style":{"width":"71%"},"width":1139,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-21.png","element":"img"}],[{"text":"Higher values of ","element":"span"},{"style":{"height":10},"width":44.6,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-22.png","element":"img","alt":" ρF","inline":true,"padRight":true},{"text":"indicate that the diagonal terms dominate the overall covariance energy.","element":"span"}],[{"text":"We select 50 samples from the Trajectory Prediction task and, at the 1500-step checkpoint, compute the covariance matrix of the thought-value estimates by performing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"= 10 ","element":"span"},{"text":"independent replications per sample. The empirical results yield ","element":"span"},{"style":{"height":10},"width":165.42,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-23.png","element":"img","alt":" prow dom=","inline":true},{"style":{"fontWeight":"bold"},"text":"63.65% ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":10},"width":81.42,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/15-24.png","element":"img","alt":" ρF =","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"70.71% ","element":"span"},{"text":"averaged over the 50 samples. 
Our theoretical derivations assume that the covariance matrix is diagonal. These diagnostics suggest that the assumption holds approximately in practice: on average, about 64% of rows are strictly diagonally dominant and about 71% of the squared Frobenius norm lies on the diagonal, so the estimated covariance matrices tend toward diagonal dominance without being exactly diagonal.","element":"span"}],[{"id":"id-39","style":{"width":"36%"},"width":581,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/16-0.png","element":"img"}],[{"text":"A.3.1 ","element":"span"},{"text":"M","element":"span"},{"text":"ATH","element":"span"}],[{"text":"We conduct our experiments using problems from the DAPO ","element":"span"},{"href":"#id-1","referenceIndex":39,"text":"Yu et al. ","element":"a"},{"href":"#id-1","referenceIndex":39,"text":"(2025) ","element":"a"},{"text":"training set and evaluate on the AIME2024 test set ","element":"span"},{"href":"#id-53","referenceIndex":23,"text":"Maxwell-Jia ","element":"a"},{"href":"#id-53","referenceIndex":23,"text":"(2024)","element":"a"},{"text":". The Math training set is constructed by randomly sampling 1,000 problems from the DAPO training corpus. The model is trained for a single epoch on these 1,000 training samples. We do not use a validation set; instead, we select the final model parameters saved at the end of training (the last checkpoint) for testing.","element":"span"}],[{"text":"At test time, for each test problem from AIME2024 we generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 100 ","element":"span"},{"text":"independent candidate outputs (“generations”). 
From these ","element":"span"},{"text":"100 ","element":"span"},{"text":"generations we compute the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"pass@k ","element":"span"},{"text":"metrics for ","element":"span"},{"style":{"height":16},"width":208.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/16-1.png","element":"img","alt":" k ∈ {10, 32}","inline":true},{"text":".","element":"span"}],[{"text":"The reward function is designed with two complementary components: a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":". The model is required to generate outputs in a predefined structured format:","element":"span"}],[{"style":{"width":"38%"},"width":617,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/16-2.png","element":"img"}],[{"text":"where the answer is represented as a single integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":". The format reward assigns a value of ","element":"span"},{"text":"1 ","element":"span"},{"text":"if and only if the output strictly follows the required format, and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The accuracy reward is +1 if the predicted answer is identical to the true answer, and 0 otherwise.","element":"span"}],[{"text":"The full prompt template is shown below:","element":"span"}],[{"style":{"width":"85%"},"width":1360,"height":206,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/16-3.png","element":"img"}],[{"text":"Roles and constraints: - : State relevant concepts, theorems, formulas, and solution plan. Do NOT perform numeric calculations or write equations here. - : Perform ALL detailed computations and step-by-step derivations based on the analysis. Show equations and numeric work here. 
- : Output ONLY the final integer (optional sign). No words, units, punctuation (except the sign), or explanations.","element":"span"}],[{"style":{"width":"84%"},"width":1334,"height":253,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/16-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Implementation Note","element":"span"},{"text":": In our multi-sample framework, the sampled content encompasses both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< answer > ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< process > ","element":"span"},{"text":"elements.","element":"span"}],[{"text":"A.3.2 ","element":"span"},{"text":"C","element":"span"},{"text":"ODE","element":"span"}],[{"text":"We conduct our experiments using the Python-code portion of the SYNTHETIC-1 dataset ","element":"span"},{"href":"#id-27","referenceIndex":28,"text":"PrimeIntellect ","element":"a"},{"href":"#id-27","referenceIndex":28,"text":"(2024) ","element":"a"},{"text":"and evaluate on the LiveBench code test set ","element":"span"},{"href":"#id-28","referenceIndex":37,"text":"White et al. ","element":"a"},{"href":"#id-28","referenceIndex":37,"text":"(2024)","element":"a"},{"text":". The Code training set is constructed by randomly sampling 1,000 problems from the SYNTHETIC-1 Python-code corpus. The model is trained for a single epoch on these 1,000 training samples. 
We do not use a validation set; instead, we select the final model parameters saved at the end of training (the last checkpoint) for testing.","element":"span"}],[{"text":"At test time, for each test problem from LiveBench we generate ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"= 100 ","element":"span"},{"text":"independent candidate outputs (“generations”). From these ","element":"span"},{"text":"100 ","element":"span"},{"text":"generations we compute the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"pass@k ","element":"span"},{"text":"metrics for ","element":"span"},{"style":{"height":16},"width":208.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-0.png","element":"img","alt":" k ∈ {10, 32}","inline":true,"padRight":true},{"text":"as described below.","element":"span"}],[{"text":"The reward function is designed with two complementary components: a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"functional (accuracy) reward","element":"span"},{"text":". The model is required to generate outputs in a predefined structured format:","element":"span"}],[{"style":{"width":"32%"},"width":521,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-1.png","element":"img"}],[{"text":"The format reward assigns a value of ","element":"span"},{"text":"1 ","element":"span"},{"text":"if and only if the output strictly follows the required tag structure and the content within ","element":"span"},{"text":" ","element":"span"},{"text":"can be parsed as a syntactically valid Python program. 
Otherwise the format reward is ","element":"span"},{"text":"0","element":"span"},{"text":".","element":"span"}],[{"text":"The functional (accuracy) reward is ","element":"span"},{"text":"+1 ","element":"span"},{"text":"if the program inside ","element":"span"},{"text":" ","element":"span"},{"text":"executes successfully on the official hidden test inputs, terminates without a runtime error, and produces outputs that exactly match the expected outputs for all test cases. Otherwise the accuracy reward is ","element":"span"},{"text":"0","element":"span"},{"text":".","element":"span"}],[{"text":"The prompt used to condition the model for each problem is exactly:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"Question","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"First output the thinking process in tags and then output the final code in tags. ","element":"span"},{"text":"The answer should be a complete Python code solution that solves the given problem. ","element":"span"},{"text":"Make sure your code handles all edge cases and follows the input/output format specified in the problem. ","element":"span"},{"text":"DONOT OUTPUT ANY CODE OR SOLUTION IN THE THINK TAGS.","element":"span"}],[{"style":{"width":"30%"},"width":483,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-2.png","element":"img"}],[{"text":"We conduct our experiments using the ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Agibot World dataset ","element":"span"},{"href":"#id-29","referenceIndex":8,"text":"contributors ","element":"a"},{"href":"#id-29","referenceIndex":8,"text":"(2024)","element":"a"},{"text":". The data is partitioned into training, validation, and test sets based on specific ","element":"span"},{"text":"task id","element":"span"},{"text":"s from the Agibot World dataset. 
Specifically, the training set is constructed from ","element":"span"},{"text":"task id","element":"span"},{"text":"s 424, 480, and 507, comprising a total of 3,000 randomly sampled images. The validation and test sets are derived from ","element":"span"},{"text":"task id ","element":"span"},{"text":"582 and 1352, respectively. For all images, the ground-truth bounding boxes and corresponding object labels are annotated through a crowdsourcing process. The object detection model is trained for a single epoch on the 3,000-image training set.","element":"span"}],[{"text":"After training, we perform model selection by evaluating checkpoints on the designated validation set. The model checkpoint that achieves the highest average ","element":"span"},{"text":"IoU@0.5 ","element":"span"},{"text":"(as defined below) on the validation data is selected for the final evaluation. The performance of this selected model is then reported on the test set.","element":"span"}],[{"text":"We evaluate the model’s performance using an ","element":"span"},{"style":{"fontWeight":"bold"},"text":"IoU rate ","element":"span"},{"text":"metric, which measures the proportion of correctly localized objects based on the Intersection over Union (IoU). 
A detection is considered positive if the IoU between the predicted bounding box (","element":"span"},{"style":{"height":15.59},"width":94.44,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-3.png","element":"img","alt":"Bpred","inline":true},{"text":") and the ground-truth bounding box (","element":"span"},{"style":{"height":15.59},"width":58.77,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-4.png","element":"img","alt":"Bgt","inline":true},{"text":") exceeds a given threshold ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-5.png","element":"img","alt":" τ","inline":true},{"text":".","element":"span"}],[{"text":"The IoU rate at a specific threshold ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-6.png","element":"img","alt":" τ","inline":true},{"text":", denoted as ","element":"span"},{"style":{"height":10.8},"width":116.2,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-7.png","element":"img","alt":" IoU@τ","inline":true},{"text":", is formulated as:","element":"span"}],[{"style":{"width":"71%"},"width":1133,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-8.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the total number of samples in the test set, and ","element":"span"},{"style":{"height":16},"width":65.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-9.png","element":"img","alt":" 1(·)","inline":true,"padRight":true},{"text":"is the indicator function. 
To provide a comprehensive assessment, we report the performance across four different IoU thresholds: ","element":"span"},{"style":{"height":9.6},"width":63.34,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-10.png","element":"img","alt":" τ ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"6","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"7","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"8","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":".","element":"span"}],[{"text":"The reward function is designed with two complementary components: a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":". The model is required to generate outputs in a predefined structured format:","element":"span"}],[{"style":{"width":"46%"},"width":737,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/17-11.png","element":"img"}],[{"text":"where the bounding box is represented as a list of four integers ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"d, d, d, d","element":"span"},{"text":"]","element":"span"},{"text":". 
The format reward assigns a value of ","element":"span"},{"text":"1 ","element":"span"},{"text":"if and only if the output strictly follows the required format, and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The accuracy reward is defined as the IoU between the predicted bounding box and the ground-truth bounding box.","element":"span"}],[{"text":"The full prompt template is shown below:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"Question","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"First output the thinking process in tags and then output the final answer in tags. ","element":"span"},{"text":"Output the final answer in List format. Only output the bounding box using [x min, y min, x max, y max] format in the final answer. ","element":"span"},{"text":"DO NOT OUTPUT ANY ANSWER OR CONCLUSION IN THE THINK TAGS.","element":"span"}],[{"style":{"width":"37%"},"width":595,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-0.png","element":"img"}],[{"text":"The task is defined as affordance prediction, where the model, given an image and a specified affordance (e.g., grasping, holding), is required to predict a pixel-wise mask indicating the corresponding region.","element":"span"}],[{"text":"We primarily use the UMD Part Affordance Dataset ","element":"span"},{"href":"#id-30","referenceIndex":25,"text":"Myers et al. ","element":"a"},{"href":"#id-30","referenceIndex":25,"text":"(2015)","element":"a"},{"text":". The official training split of this dataset is used to construct our training and validation sets. Specifically, we use 3,000 images for training and a held-out portion of the original training split for validation. For evaluation, we use the official test split of the UMD dataset. 
To further assess the model’s generalization capabilities, we also use the entire AGD20K dataset ","element":"span"},{"href":"#id-31","referenceIndex":22,"text":"Luo et al. ","element":"a"},{"href":"#id-31","referenceIndex":22,"text":"(2022) ","element":"a"},{"text":"as an additional, challenging test set.","element":"span"}],[{"text":"The affordance prediction model is trained for a single epoch on the 3,000-image training set. After training, we perform model selection by evaluating checkpoints on the designated validation set. The model checkpoint that achieves the highest Success Rate (as defined below) on the validation data is selected for the final evaluation. The performance of this selected model is then reported on the test sets (UMD test and AGD20K).","element":"span"}],[{"text":"We evaluate the model’s performance using a Success Rate metric. This metric measures the proportion of samples where the predicted point correctly falls within the ground-truth affordance mask. A prediction is considered successful if the pixel value at the predicted 2D coordinate is 1 in the ground-truth binary mask.","element":"span"}],[{"text":"The Success Rate is formulated as:","element":"span"}],[{"style":{"width":"71%"},"width":1135,"height":99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the total number of samples in the test set, ","element":"span"},{"style":{"height":23.75},"width":78.08,"height":59.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-2.png","element":"img","alt":" C(i)pred","inline":true,"padRight":true},{"text":"is the predicted 2D coordinate ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"for the 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th sample, and ","element":"span"},{"style":{"height":22.6},"width":78.68,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-3.png","element":"img","alt":" M (i)gt","inline":true,"padRight":true},{"text":"is the corresponding ground-truth affordance mask. The notation","element":"span"}],[{"style":{"height":23.75},"width":192.72,"height":59.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-4.png","element":"img","alt":"M (i)gt (C(i)pred)","inline":true,"padRight":true},{"text":"represents the value of the mask at the predicted coordinate. ","element":"span"},{"style":{"height":16},"width":65.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-5.png","element":"img","alt":" 1(·)","inline":true,"padRight":true},{"text":"is the indicator function, ","element":"span"},{"text":"which is 1 if the condition is true and 0 otherwise.","element":"span"}],[{"text":"The reward function consists of two complementary components: a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":".","element":"span"}],[{"text":"The model is required to generate outputs in the following structured format:","element":"span"}],[{"style":{"width":"37%"},"width":593,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-6.png","element":"img"}],[{"text":"where the final answer corresponds to a 2D coordinate ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"d, d","element":"span"},{"text":"]","element":"span"},{"text":", with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"denoting an integer. 
The format reward assigns a value of ","element":"span"},{"text":"1 ","element":"span"},{"text":"if and only if the output strictly adheres to this format; otherwise, it is set to ","element":"span"},{"text":"0","element":"span"},{"text":". The accuracy reward evaluates the correctness of the prediction by checking whether the predicted 2D point lies within the ground-truth affordance mask (i.e., a region where the mask value equals ","element":"span"},{"text":"1","element":"span"},{"text":"). If the prediction falls inside the valid region, a reward of 1 is given; otherwise, the reward is 0.","element":"span"}],[{"text":"The full prompt template is shown below:","element":"span"}],[{"style":{"width":"78%"},"width":1243,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/18-7.png","element":"img"}],[{"text":" tags. ","element":"span"},{"text":"Only output one affordance point using [x, y] format. ","element":"span"},{"text":"DO NOT OUTPUT ANY ANSWER OR CONCLUSION IN THE THINK TAGS.","element":"span"}],[{"style":{"width":"36%"},"width":586,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-0.png","element":"img"}],[{"text":"The task is defined as trajectory prediction, where the model, given an image and a manipulation instruction, is required to predict the two-dimensional trajectory of the robotic arm’s end-effector in the image’s pixel coordinate system. The trajectory is represented as a sequence of coordinates, and the predicted path should follow the ground-truth trajectory to successfully complete the instructed manipulation.","element":"span"}],[{"text":"We primarily use the trajectory subset of the BAAI ShareRobot dataset ","element":"span"},{"href":"#id-32","referenceIndex":15,"text":"Ji et al. ","element":"a"},{"href":"#id-32","referenceIndex":15,"text":"(2025)","element":"a"},{"text":". The original dataset is partitioned into training, validation, and test sets. 
Specifically, we use 3,000 images for training, a held-out portion of the training split for validation, and the test split for evaluation. The model is trained for a single epoch on the 3,000-image training set. After training, we perform model selection by evaluating checkpoints on the designated validation set. The checkpoint that achieves the highest reward value (as defined below) on the validation data is selected for the final evaluation. The performance of this selected model is then reported on the held-out test set.","element":"span"}],[{"text":"We evaluate the model’s performance using multiple geometric similarity metrics, following the design in ManipVLM-R1 ","element":"span"},{"href":"#id-8","referenceIndex":33,"text":"Song et al. ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"(2025)","element":"a"},{"text":". These metrics measure how well the predicted trajectory matches the ground truth from different perspectives. Specifically, we use Discrete Fréchet Distance (DFD), Hausdorff Distance (HD), Root Mean Square Error (RMSE), and Endpoint Distance as evaluation criteria.","element":"span"}],[{"text":"The model is required to generate outputs in the following structured format:","element":"span"}],[{"style":{"width":"81%"},"width":1286,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-1.png","element":"img"}],[{"text":"where the final answer corresponds to a variable-length sequence of 2D coordinates ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":"]","element":"span"},{"text":", with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"denoting integers.","element":"span"}],[{"text":"The reward function consists of two complementary components: a 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":". The format reward assigns a value of ","element":"span"},{"text":"1 ","element":"span"},{"text":"if and only if the output strictly adheres to this format; otherwise, it is set to ","element":"span"},{"text":"0","element":"span"},{"text":". To measure how well the predicted trajectory ","element":"span"},{"style":{"height":14.83},"width":29,"height":37.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-2.png","element":"img","alt":"ˆT","inline":true,"padRight":true},{"text":"matches the ground-truth trajectory ","element":"span"},{"style":{"height":10.98},"width":44.83,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-3.png","element":"img","alt":" T ∗","inline":true},{"text":", we adopt an accuracy reward following the design in ManipVLM-R1 ","element":"span"},{"href":"#id-8","referenceIndex":33,"text":"Song et al. ","element":"a"},{"href":"#id-8","referenceIndex":33,"text":"(2025)","element":"a"},{"text":". 
Specifically, the reward is defined as","element":"span"}],[{"style":{"width":"82%"},"width":1304,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.2},"width":183.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-5.png","element":"img","alt":" DDFD, DHD","inline":true},{"text":", and ","element":"span"},{"style":{"height":13.19},"width":108.92,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-6.png","element":"img","alt":" DRMSE","inline":true,"padRight":true},{"text":"denote the Discrete Fréchet Distance, Hausdorff Distance, and Root Mean Square Error between the predicted trajectory ","element":"span"},{"style":{"height":14.83},"width":29,"height":37.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-7.png","element":"img","alt":"ˆT","inline":true,"padRight":true},{"text":"and the ground-truth trajectory ","element":"span"},{"style":{"height":10.98},"width":44.84,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-8.png","element":"img","alt":" T ∗","inline":true},{"text":". The final term enforces endpoint accuracy by penalizing the distance between the predicted endpoint ","element":"span"},{"style":{"height":14},"width":47.04,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-9.png","element":"img","alt":" ˆpN","inline":true,"padRight":true},{"text":"and the ground-truth endpoint ","element":"span"},{"style":{"height":15.38},"width":52.04,"height":38.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/19-10.png","element":"img","alt":" p∗M","inline":true},{"text":".","element":"span"}],[{"text":"The model is guided by a carefully designed prompt that specifies both the reasoning and the answer requirements. 
The full prompt template is shown below:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"Question","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"First output the thinking process in tags and then output the final answer in tags. ","element":"span"},{"text":"Output the final answer in the following JSON format: ","element":"span"},{"text":"[[x1, y1], [x2, y2], ..., [xn, yn]]. ","element":"span"},{"text":"Where each coordinate pair represents a point in the image’s pixel space and the center of the end effector needs to follow the coordinates to complete the task. ","element":"span"},{"text":"Each hand trajectory includes unknown number of [x, y] coordinate pairs. DO NOT OUTPUT ANY ANSWER OR CONCLUSION IN THE THINK TAGS.","element":"span"}],[{"style":{"width":"32%"},"width":518,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-0.png","element":"img"}],[{"text":"The task is defined as demand prediction, where the model, given an image and a human demand instruction (e.g., “I am thirsty”), is required to output a two-dimensional coordinate corresponding to an object in the image that fulfills the demand (e.g., a water bottle or a juice box). A prediction is considered correct if the predicted point lies inside the ground-truth segmentation mask of the demanded object.","element":"span"}],[{"text":"We construct the dataset for this task based on MO-DDN ","element":"span"},{"href":"#id-33","referenceIndex":36,"text":"Wang et al. ","element":"a"},{"href":"#id-33","referenceIndex":36,"text":"(2024)","element":"a"},{"text":", which requires robots to ground a natural demand instruction to objects in the environment. MO-DDN itself is built upon the HSSD scene dataset ","element":"span"},{"href":"#id-54","referenceIndex":16,"text":"Khanna et al. 
","element":"a"},{"href":"#id-54","referenceIndex":16,"text":"(2024)","element":"a"},{"text":", together with a custom demand–object dataset. To build our data, we randomly sample a demand instruction and pair it with a scene containing a target object that satisfies the demand. We then crop and store the corresponding image, resulting in instruction–image pairs.","element":"span"}],[{"text":"Following the original MO-DDN splits, we collect data separately from the training and testing tasks. Specifically, we use ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000 ","element":"span"},{"text":"instruction–image pairs as the training set and ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000 ","element":"span"},{"text":"pairs as the validation set, both sampled from the training tasks. For evaluation, we construct a test set of ","element":"span"},{"text":"5","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000 ","element":"span"},{"text":"instruction–image pairs sampled from the testing tasks.","element":"span"}],[{"text":"We train the model for a single epoch on the training set and perform model selection based on validation accuracy. The checkpoint achieving the highest validation performance is then used for testing, and we report results on the test set.","element":"span"}],[{"text":"We evaluate the model’s performance using a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Success Rate ","element":"span"},{"text":"metric, defined as the proportion of samples where the predicted coordinate falls within the ground-truth mask of the demanded object. 
Formally:","element":"span"}],[{"style":{"width":"72%"},"width":1142,"height":103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the number of samples in the test set, ","element":"span"},{"style":{"height":23.75},"width":78.1,"height":59.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-2.png","element":"img","alt":" C(i)pred","inline":true,"padRight":true},{"text":"denotes the predicted 2D coordinate ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th sample, and ","element":"span"},{"style":{"height":22.6},"width":78.71,"height":56.5,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-3.png","element":"img","alt":" M (i)gt","inline":true,"padRight":true},{"text":"is the ground-truth binary mask of the demanded object. The notation ","element":"span"},{"style":{"height":23.75},"width":192.72,"height":59.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-4.png","element":"img","alt":"M (i)gt (C(i)pred)","inline":true,"padRight":true},{"text":"indicates the mask value at the predicted location. 
","element":"span"},{"style":{"height":16},"width":65.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-5.png","element":"img","alt":" 1(·)","inline":true,"padRight":true},{"text":"is the indicator function that ","element":"span"},{"text":"equals ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the condition holds and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise.","element":"span"}],[{"text":"The reward function for training consists of two complementary components: a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":". The model must output predictions in the following structured format:","element":"span"}],[{"style":{"width":"37%"},"width":593,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/20-6.png","element":"img"}],[{"text":"where the final answer corresponds to a 2D coordinate ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"d, d","element":"span"},{"text":"]","element":"span"},{"text":", with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"denoting an integer. The format reward is assigned ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the output strictly follows this structure, and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The accuracy reward is assigned if and only if the predicted coordinate lies within the ground-truth object mask. These two rewards jointly ensure syntactically valid outputs and semantic correctness.","element":"span"}],[{"text":"The model is guided by a prompt template that specifies both the thinking process and the final answer format. 
The full prompt is given below:","element":"span"}],[{"text":"You are completing a navigation task where you need to detect objects from the image that fulfill a user’s demand. The user’s demand is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"Question","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". ","element":"span"},{"text":"First output the thinking process in tags and then output the final answer in tags. ","element":"span"},{"text":"Only output one point using [x, y] format that represents the target demanded object. ","element":"span"},{"text":"DO NOT OUTPUT ANY ANSWER OR CONCLUSION IN THE THINK TAGS.","element":"span"}],[{"text":"A.3.7 ","element":"span"},{"text":"OCR-","element":"span"},{"text":"BASED ","element":"span"},{"text":"VQA","element":"span"}],[{"text":"The task is defined as OCR-based Visual Question Answering (VQA), where the model, given an image containing textual information and a natural language question, is required to output a short natural language answer. The answer must be grounded in the image content and can involve both text extraction and reasoning over visual elements.","element":"span"}],[{"text":"We construct the dataset by combining three OCR-based VQA benchmarks: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Document VQA ","element":"span"},{"href":"#id-35","referenceIndex":34,"text":"Tito ","element":"a"},{"href":"#id-35","referenceIndex":34,"text":"et al. ","element":"a"},{"href":"#id-35","referenceIndex":34,"text":"(2021)","element":"a"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Infographics VQA ","element":"span"},{"href":"#id-35","referenceIndex":34,"text":"Tito et al. 
","element":"a"},{"href":"#id-35","referenceIndex":34,"text":"(2021)","element":"a"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Scene Text VQA ","element":"span"},{"href":"#id-34","referenceIndex":3,"text":"Biten et al. ","element":"a"},{"href":"#id-34","referenceIndex":3,"text":"(2019)","element":"a"},{"text":". Document VQA focuses on answering questions asked over document images, which may contain printed, typewritten, and handwritten content (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", letters, memos, reports). The answers are typically text spans taken verbatim from the document. Infographics VQA considers questions over infographic images containing charts, diagrams, or other structured visual data, where answers are not always explicitly extracted text but can include inferred information. Scene Text VQA consists of natural scene images with embedded text (","element":"span"},{"style":{"fontStyle":"italic"},"text":"e.g.","element":"span"},{"text":", storefronts, street signs). The model must jointly leverage OCR reading and visual understanding to answer the questions.","element":"span"}],[{"text":"From each of the three training sets, we randomly select ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000 ","element":"span"},{"text":"samples, resulting in a combined training set of ","element":"span"},{"text":"9","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000 ","element":"span"},{"text":"samples. 
Additionally, we construct a validation set of ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"500 ","element":"span"},{"text":"samples (also drawn from the training splits), while the official validation sets of each benchmark are used as our test set.","element":"span"}],[{"text":"The model is trained for a single epoch on the ","element":"span"},{"text":"9","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"},{"text":"000","element":"span"},{"text":"-sample mixed training set. Model selection is performed based on validation performance, and the checkpoint achieving the highest validation score is reported on the test sets.","element":"span"}],[{"text":"The evaluation metric is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Average Normalized Levenshtein Similarity ","element":"span"},{"text":"(ANLS), which measures the string-level similarity between the predicted and ground-truth answers. ANLS accounts for OCR errors by softly penalizing recognition mistakes. A threshold ","element":"span"},{"style":{"height":11.2},"width":128.59,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-0.png","element":"img","alt":" τ = 0.5","inline":true,"padRight":true},{"text":"is applied to determine whether a predicted answer is considered valid. 
Formally, ANLS is defined as:","element":"span"}],[{"style":{"width":"78%"},"width":1243,"height":263,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"is the number of questions, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is the number of ground-truth answers per question, ","element":"span"},{"style":{"height":11.59},"width":45.32,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-2.png","element":"img","alt":" aij","inline":true,"padRight":true},{"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th ground-truth answer for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th question ","element":"span"},{"style":{"height":10},"width":28.77,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-3.png","element":"img","alt":" qi","inline":true},{"text":", and ","element":"span"},{"style":{"height":11.59},"width":44.94,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-4.png","element":"img","alt":" oqi","inline":true,"padRight":true},{"text":"is the predicted answer. ","element":"span"},{"style":{"height":16},"width":106.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-5.png","element":"img","alt":" NL(·)","inline":true,"padRight":true},{"text":"denotes the normalized Levenshtein distance.","element":"span"}],[{"text":"The reward function consists of a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"format reward ","element":"span"},{"text":"and an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"accuracy reward","element":"span"},{"text":". 
The model must output answers in the following structured format:","element":"span"}],[{"style":{"width":"32%"},"width":521,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-6.png","element":"img"}],[{"text":"The format reward is ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the output strictly follows this structure, and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The accuracy reward corresponds to the ANLS score of the predicted answer for the current question.","element":"span"}],[{"text":"The model is guided by the following prompt template:","element":"span"}],[{"style":{"width":"87%"},"width":1388,"height":280,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/21-7.png","element":"img"}],[{"text":"The task is defined as a simulator-based visual manipulation problem where, given a single RGB ","element":"span"},{"text":"observation of a manipulation scene, the model must specify a contact point ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":") ","element":"span"},{"text":"on the object at which a sucker should attempt to manipulate. The model’s output must be grounded in the visual observation and may require reasoning about object geometry, affordances, and reachable contact locations.","element":"span"}],[{"text":"We construct the dataset and evaluation splits based on the PartNet Mobility dataset ","element":"span"},{"href":"#id-55","referenceIndex":38,"text":"Xiang et al. ","element":"a"},{"href":"#id-55","referenceIndex":38,"text":"(2020) ","element":"a"},{"text":"and the ManipLLM experimental setup (","element":"span"},{"style":{"fontWeight":"bold"},"text":"A crucial point is that we have followed their setting by using suckers as the end effectors for the robotic arms.","element":"span"},{"text":"). 
For training, we adopt the same 20 training categories as ManipLLM, consisting of 1,043 object instances. Training scenes are generated following the SAPIEN simulator ","element":"span"},{"href":"#id-55","referenceIndex":38,"text":"Xiang et al. ","element":"a"},{"href":"#id-55","referenceIndex":38,"text":"(2020) ","element":"a"},{"text":"setup and ManipLLM scene configurations. For testing, we use the open-sourced ManipLLM test set, which contains approximately 1,830 successful test samples spanning both Seen and Unseen objects. To better evaluate model generalization to novel viewpoints, we further construct a camera-perturbed test set by modifying each test sample: the camera orientation vector ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"0] ","element":"span"},{"text":"is replaced by ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y, z","element":"span"},{"text":"] ","element":"span"},{"text":"where each of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y, z ","element":"span"},{"text":"is sampled uniformly from the signed interval ","element":"span"},{"style":{"height":16},"width":172.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-0.png","element":"img","alt":" ±[0.2, 0.6]","inline":true},{"text":". This perturbation preserves other scene properties while intentionally stressing viewpoint robustness. 
In order to simplify control and isolate contact selection, the sucker approach direction in all experiments is fixed to be the surface normal at the chosen manipulation point ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, y","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"text":"The required output must follow a strict format consisting of a reasoning trace and a final contact point, written as:","element":"span"}],[{"style":{"width":"35%"},"width":569,"height":77,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-1.png","element":"img"}],[{"text":"The evaluation metric is Success Rate, following ManipLLM’s criterion based on the manipulated object’s displacement after the scripted sucker motion. Formally, given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"trials,","element":"span"}],[{"style":{"width":"96%"},"width":1525,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":73.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-3.png","element":"img","alt":" 1{·}","inline":true,"padRight":true},{"text":"is the indicator function. We report Success Rate on the camera-perturbed test sets, and further provide breakdowns by Seen vs. Unseen objects.","element":"span"}],[{"text":"The reward function during GRPO training consists of a format reward and a task reward. The format reward is ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the output strictly follows the required structure and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. 
The task reward is ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the manipulation attempt succeeds according to ManipLLM’s displacement criterion and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. The overall reward is defined as","element":"span"}],[{"style":{"width":"23%"},"width":375,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-4.png","element":"img"}],[{"text":"so that only properly formatted and successful outputs receive credit. This ensures that malformed answers cannot be rewarded even if the manipulation itself succeeds.","element":"span"}],[{"text":"All experiments are conducted with Qwen2.5-VL-3B as the base model. We train using GRPO for 4,000 optimization steps, selecting checkpoints based on validation success rate. The validation set is constructed by sampling held-out scenes from the same 20 training categories without overlap with the test split. The prompt used in training is as follows:","element":"span"}],[{"text":"\"system\": ","element":"span"},{"text":"\"You are an intelligent manipulator. ","element":"span"},{"text":"A conversation between User and Assistant. ","element":"span"},{"text":"The user asks a question, and the Assistant solves it. ","element":"span"},{"text":"The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. ","element":"span"},{"text":"The reasoning process and answer are enclosed within and tags, respectively, i.e. 
","element":"span"},{"text":" reasoning process here answer here .\"","element":"span"}],[{"style":{"width":"70%"},"width":1123,"height":208,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/22-5.png","element":"img"}],[{"text":"Table 5: Hyperparameters for GRPO training.","element":"figcaption","subtype":"caption"}],[{"id":"id-56","style":{"width":"89%"},"width":1425,"height":1207,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/23-0.png","element":"img"}],[{"id":"id-40","text":"We summarize the key hyperparameters used in our GRPO training experiments in Tab. ","element":"span"},{"href":"#id-56","text":"5. ","element":"a"},{"text":"The settings are organized into general, training, and LoRA-related categories for clarity.","element":"span"}],[{"text":"A.4.2 ","element":"span"},{"text":"SFT D","element":"span"},{"text":"ETAILS","element":"span"}],[{"text":"For all Supervised Fine-Tuning (SFT) baselines, we train for 5 epochs. All other settings are kept consistent with GRPO, including the dataset, model selection criteria, and metric calculation.","element":"span"}],[{"style":{"width":"36%"},"width":577,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/23-1.png","element":"img"}],[{"text":"Our model is trained using the Hugging Face ","element":"span"},{"text":"transformers ","element":"span"},{"text":"library (version 4.51.3). During inference, we customize the decoding strategy via the ","element":"span"},{"text":"GenerationConfig ","element":"span"},{"text":"class. Specifically, we set ","element":"span"},{"text":"temperature=1.0 ","element":"span"},{"text":"and ","element":"span"},{"text":"do_sample=True ","element":"span"},{"text":"to enable stochastic sampling. We also define ","element":"span"},{"text":"stop_strings=[\"\", \"\"] ","element":"span"},{"text":"only when generating thoughts. 
All remaining parameters keep their default values.","element":"span"}],[{"id":"id-49","style":{"width":"51%"},"width":811,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/23-2.png","element":"img"}],[{"text":"To provide a more intuitive analysis of our model’s performance, this section presents curated case studies and visualizations. The examples cover object detection (Fig. ","element":"span"},{"href":"#id-57","text":"4) ","element":"a"},{"text":"and trajectory prediction (Fig. ","element":"span"},{"href":"#id-58","text":"5 ","element":"a"},{"text":"and Fig. ","element":"span"},{"href":"#id-59","text":"6)","element":"a"},{"text":". We use these concrete scenarios to examine the model’s behavior, decision-making logic, strengths, and limitations.","element":"span"}],[{"text":"Specifically, in the simulator-based visual manipulation task, we visualize the distribution of the target operation points over multiple sampling attempts in Fig. ","element":"span"},{"href":"#id-60","text":"7. ","element":"a"},{"text":"Green points indicate successful manipulations, while red points represent failures. This visualization illustrates the robustness of our model.","element":"span"}],[{"id":"id-66","text":"Table 6: NoZeroRate on different tasks. TN: the number of thoughts; AN: the number of answers ","element":"figcaption","subtype":"caption"},{"text":"per thought. 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Bold ","element":"figcaption","subtype":"caption"},{"text":"indicates the best performance and ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"italics ","element":"figcaption","subtype":"caption"},{"text":"indicate the second-best performance.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"95%"},"width":1506,"height":385,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-0.png","element":"img"}],[{"id":"id-52","text":"In this section, we further present some experimental results, including the accuracy reward curves ","element":"span"},{"text":"during training, and an analysis of the richness of reward signal.","element":"span"}],[{"style":{"width":"39%"},"width":624,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-1.png","element":"img"}],[{"text":"We present the accuracy reward curves for five visual tasks in Object Detection (Fig. ","element":"span"},{"href":"#id-61","text":"8)","element":"a"},{"text":", Affordance Prediction (Fig. ","element":"span"},{"href":"#id-62","text":"9)","element":"a"},{"text":", Demand Prediction (Fig. ","element":"span"},{"href":"#id-63","text":"10)","element":"a"},{"text":", OCR-based VQA (Fig. ","element":"span"},{"href":"#id-64","text":"11) ","element":"a"},{"text":"and Trajectory Prediction (Fig. ","element":"span"},{"href":"#id-65","text":"12)","element":"a"},{"text":". During the curve plotting process, we smooth the curve using a moving average method with a window size of 200. 
The curves show that T4A4 (red) performs comparably to T16A1 (blue) in most cases and is occasionally slightly better.","element":"span"}],[{"style":{"width":"43%"},"width":688,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-2.png","element":"img"}],[{"text":"For tasks with binary (0-1) rewards, such as Code, Math, Affordance Prediction, Demand Prediction and Simulator-based Visual Manipulation, we compute the proportion of samples whose total reward is positive, which we refer to as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NoZeroRate","element":"span"},{"text":". Formally, it is defined as","element":"span"}],[{"style":{"width":"77%"},"width":1227,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":73.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-4.png","element":"img","alt":" 1{·}","inline":true,"padRight":true},{"text":"denotes the indicator function, which equals ","element":"span"},{"text":"1 ","element":"span"},{"text":"if the condition inside holds and ","element":"span"},{"text":"0 ","element":"span"},{"text":"otherwise. 
Here, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is the total number of time steps, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"indexes a specific time step, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"is the number of thoughts, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"is the number of answers per thought, and ","element":"span"},{"style":{"height":20.49},"width":128.38,"height":51.23,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/24-5.png","element":"img","alt":" AccR^{t}_{i,j}","inline":true,"padRight":true},{"text":"denotes the accuracy reward associated with the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th answer under the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th thought at time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". A higher NoZeroRate indicates a lower proportion of advantage collapses (where collapse means all advantage values become zero) and therefore a larger share of steps that contribute effective gradient information.","element":"span"}],[{"text":"The statistical results are presented in Tab. ","element":"span"},{"href":"#id-66","text":"6. ","element":"a"},{"text":"We observe that T4A4 achieves the second-highest proportion of non-zero accuracy rewards across all tasks, behind only T16A1. On the one hand, this indicates that under the T4A4 setting, the answers generated by each thought are largely different. 
On the other hand, it suggests that the diversity of generated answers can be substantially improved by generating additional answers per thought, as shown by the comparison between T4A4 and T4A1.","element":"span"}],[{"text":"A.7 ","element":"span"},{"text":"U","element":"span"},{"text":"SAGE OF ","element":"span"},{"text":"LLM","element":"span"},{"text":"S","element":"span"}],[{"text":"We employ a Large Language Model (LLM) to refine the manuscript, with a focus on correcting grammatical errors and enhancing overall readability.","element":"span"}],[{"id":"id-57","style":{"width":"96%"},"width":1536,"height":524,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/25-0.png","element":"img"}],[{"text":"Figure 4: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Case Study on Object Detection ","element":"figcaption","subtype":"caption"},{"text":"Green text indicates key reasoning content.","element":"figcaption","subtype":"caption"}],[{"id":"id-58","style":{"width":"96%"},"width":1536,"height":533,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/25-1.png","element":"img"}],[{"text":"Figure 5: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Case Study on Trajectory Prediction ","element":"figcaption","subtype":"caption"},{"text":"Green text indicates key reasoning content.","element":"figcaption","subtype":"caption"}],[{"id":"id-59","style":{"width":"96%"},"width":1536,"height":807,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/25-2.png","element":"img"}],[{"text":"Figure 6: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Case Study on Trajectory Prediction ","element":"figcaption","subtype":"caption"},{"text":"Green text indicates key reasoning 
content.","element":"figcaption","subtype":"caption"}],[{"id":"id-60","style":{"width":"95%"},"width":1520,"height":750,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/26-0.png","element":"img"}],[{"text":"Figure 7: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Visualization on Simulator-based Visual Manipulation ","element":"figcaption","subtype":"caption"},{"text":"Red dots indicate failures, while green dots represent successes. We can observe that most GRPO-MA-T4A4 points are located on the object. In contrast, GRPO-T4A1 frequently misses the object, resulting in a lower success rate.","element":"figcaption","subtype":"caption"}],[{"id":"id-61","style":{"width":"98%"},"width":1566,"height":1104,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/26-1.png","element":"img"}],[{"text":"Figure 8: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Accuracy Reward Curve on Object Detection","element":"figcaption","subtype":"caption"}],[{"id":"id-62","style":{"width":"98%"},"width":1566,"height":1103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/27-0.png","element":"img"}],[{"text":"Figure 9: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Accuracy Reward Curve on Affordance Prediction","element":"figcaption","subtype":"caption"}],[{"id":"id-63","style":{"width":"98%"},"width":1566,"height":1103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/27-1.png","element":"img"}],[{"text":"Figure 10: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Accuracy Reward Curve on Demand Prediction","element":"figcaption","subtype":"caption"}],[{"id":"id-64","style":{"width":"98%"},"width":1566,"height":1103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/28-0.png","element":"img"}],[{"text":"Figure 11: 
","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Accuracy Reward Curve on OCR-based VQA","element":"figcaption","subtype":"caption"}],[{"id":"id-65","style":{"width":"98%"},"width":1566,"height":1103,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2509.24494/images/28-1.png","element":"img"}],[{"text":"Figure 12: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Accuracy Reward Curve on Trajectory Prediction","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]