2d:[[["$","$L30","0",{"heading":"Abstract","index":0,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing “pseudo reasoning paths” imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, and incorrect reasoning. To systematically study this effect, we introduce "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:1:style","children":"VLAA-Thinking"}]}],["$","$1","2",{"children":", a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewrite and verification, "}],["$","$1","3",{"children":"VLAA-Thinking "}],["$","$1","4",{"children":"comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on the Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. 
Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:0:paragraphs:0:5:style","children":"top-1 "}]}],["$","$1","6",{"children":"performance on Open LMM Reasoning Leaderboard"}],["$","$1","7",{"children":"1 "}],["$","$1","8",{"children":"among 4B scale LVLMs, surpassing the previous state-of-the-art by 1.8%. We hope our findings provide valuable insights in developing reasoning-capable LVLMs and can inform future research in this area."}]]}]]}],["$","$L30","1",{"heading":"1 Introduction","index":1,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Large Language Models (LLMs) with strong reasoning capability have recently gained wide attention with the emergence of OpenAI’s o1/o3 and Deepseek-R1 ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:3"}]}],["$","$1","4",{"children":"; "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:5"}]}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:8"}]}],["$","$1","9",{"children":"). 
A common practice to empower models with reasoning abilities comprises two steps: supervised fine-tuning (SFT) on reasoning data, followed by reinforcement learning (RL) to further boost performance. This successful paradigm has inspired efforts to extend these strengths beyond textual domains to Large Vision-Language Models (LVLMs) ("}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:10"}]}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:11"}]}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:13"}]}],["$","$1","14",{"children":"; "}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:15"}]}],["$","$1","16",{"children":", "}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:17"}]}],["$","$1","18",{"children":"; "}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:19"}]}],["$","$1","20",{"children":", "}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:21"}]}],["$","$1","22",{"children":"; "}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:23"}]}],["$","$1","24",{"children":", "}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:25"}]}],["$","$1","26",{"children":"; 
"}],["$","$1","27",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:27"}]}],["$","$1","28",{"children":", "}],["$","$1","29",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:0:29"}]}],["$","$1","30",{"children":")."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/1-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:1:0"}]]}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 1: Examples from LVLMs trained with different strategies for reasoning "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:1:style","children":"Left"}]}],["$","$1","2",{"children":": response from a model trained with SFT, showing "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:3:style","children":"pseudo reasoning traces "}]}],["$","$1","4",{"children":"and a number of 
"}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:5:style","children":"pseudo self-reflective cues "}]}],["$","$1","6",{"children":"("}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:7:style","children":"i.e"}]}],["$","$1","8",{"children":"., aha-moments) imitated from R1. "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:9:style","children":"Right"}]}],["$","$1","10",{"children":": response from a model trained with RL, showing "}],["$","$1","11",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:11:style","children":"native reasoning ability "}]}],["$","$1","12",{"children":"and "}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:2:13:style","children":"authentic aha-moments "}]}],["$","$1","14",{"children":"emerged from RL training. "}],["$","$1","15",{"children":"Wrong reasoning steps "}],["$","$1","16",{"children":"are colored red and aha-moments are highlighted."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In this work, we take a step further and examine whether the widely adopted “SFT then RL” paradigm similarly benefits the development of reasoning-capable LVLMs. 
Specifically, we ask: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:3:1:style","children":"1) What are the distinct effects of SFT and RL in multimodal reasoning? and 2) Is this two-stage paradigm truly necessary for reasoning in LVLMs? "}]}],["$","$1","2",{"children":"To systematically explore these questions, we curate "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:3:3:style","children":"VLAA-Thinking"}]}],["$","$1","4",{"children":", the first comprehensive and high-quality image-text reasoning dataset explicitly designed to support both SFT and RL. Unlike prior datasets, "}],["$","$1","5",{"children":"VLAA-Thinking "}],["$","$1","6",{"children":"includes detailed, step-by-step reasoning traces derived from the R1-style “think-then-speak” intermediate reasoning. We construct a dedicated SFT split featuring multimodal chain-of-thought (CoT) examples suitable for visual instruction tuning, alongside a more challenging RL split curated from the same source to encourage deeper and more adaptive reasoning behaviors. To effectively transfer reasoning capabilities from text-only models to the multimodal domain, we construct our dataset through a six-stage pipeline: metadata collection, image captioning, R1-based distillation, answer rewriting, verification, and split curation. Specifically, we input image captions and visual questions into DeepSeek-R1 to generate initial reasoning traces. 
These outputs are then rewritten for improved fluency and verified for correctness using a GPT-based verifier, resulting in a high-quality multimodal reasoning dataset for SFT and RL."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Next, we carefully ablate the role of SFT, RL and their combinations in multimodal reasoning using our "}],["$","$1","1",{"children":"VLAA-Thinking "}],["$","$1","2",{"children":"dataset. To better understand the role of SFT, we perform a detailed analysis, systematically examining the impact of SFT data type ("}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:3:style","children":"e.g"}]}],["$","$1","4",{"children":"., with and without the self-reflective “aha moments”), dataset scale, and model capacity. To explore the potential of RL in the vision-language context, we design a novel mixed reward function within the Group Relative Policy Optimization (GRPO) ("}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:7"}]}],["$","$1","8",{"children":") framework that involves both perception and cognition rewards to incentivize the model to produce well-reasoned answers. Specifically, our mixed reward signal blends 2 types of reward with 5 types of functions. 
For "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:9:style","children":"rule-based questions"}]}],["$","$1","10",{"children":", there are functions for "}],["$","$1","11",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:11:style","children":"digit"}]}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:13:style","children":"multiple-choice"}]}],["$","$1","14",{"children":", "}],["$","$1","15",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:15:style","children":"math "}]}],["$","$1","16",{"children":"and "}],["$","$1","17",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:17:style","children":"bounding box "}]}],["$","$1","18",{"children":"outputs. 
For "}],["$","$1","19",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:19:style","children":"open-ended questions"}]}],["$","$1","20",{"children":", we adopt a competent reward model, "}],["$","$1","21",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:21:style","children":"XComposer-2.5-RM "}]}],["$","$1","22",{"children":"("}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:23"}]}],["$","$1","24",{"children":", "}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:4:25"}]}],["$","$1","26",{"children":"), along with a reference-based reward method to score an answer. We then closely investigate the effects of different reward functions, base models, and the interaction between SFT and GRPO to further optimize reasoning capabilities."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Our extensive experiments comparing SFT and RL reveal several noteworthy insights. First, we probe the contribution of SFT and RL in multimodal reasoning: while SFT improves performance on standard tasks over the base model, it falls short in enhancing complex reasoning. 
Merely imitating an expert’s thinking through SFT often induces “"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:1:style","children":"pseudo reasoning paths"}]}],["$","$1","2",{"children":"”, a superficial reasoning pattern which may contain “"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:3:style","children":"pseudo aha moments"}]}],["$","$1","4",{"children":"” (superficial self-reflective cues), as illustrated in Figure "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:5"}]}],["$","$1","6",{"children":". We show that these imitated reasoning patterns can hinder genuine reasoning advancement, "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:5:7:style","children":"i.e"}]}],["$","$1","8",{"children":"., 47% relative performance drop on 7B models. 
This observation is also in line with recent studies highlighting the need for"}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/2-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:6:0"}]]}]}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 2: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:7:1:style","children":"Data generation pipeline. "}]}],["$","$1","2",{"children":"We first generate initial reasoning traces by feeding detailed captions and visual questions into DeepSeek-R1. These outputs are then rewritten for improved fluency and verified for correctness using a GPT-based verifier. 
The resulting data is split into "}],["$","$1","3",{"children":"VLAA-Thinking-SFT "}],["$","$1","4",{"children":"and "}],["$","$1","5",{"children":"VLAA-Thinking-RL"}],["$","$1","6",{"children":"."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"feedback and exploration signals to drive advanced reasoning behaviors ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:8:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:8:3"}]}],["$","$1","4",{"children":"). Additionally, our ablations show that for rule-based rewards, math and multiple-choice are more beneficial than others, and that a combination of both rule-based and open-ended rewards yields the best performance."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"While prior work suggests that SFT followed by RL in LVLMs offers the best of both worlds ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:3"}]}],["$","$1","4",{"children":"; "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:5"}]}],["$","$1","6",{"children":", 
"}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:7"}]}],["$","$1","8",{"children":"; "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:9"}]}],["$","$1","10",{"children":", "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:11"}]}],["$","$1","12",{"children":"), first mimicking good reasoning format and then refining via RL feedback, we find that "}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:13:style","children":"applying SFT before GRPO "}]}],["$","$1","14",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:14:style","children":"hurts "}]}],["$","$1","15",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:15:style","children":"performance on aligned models"}]}],["$","$1","16",{"children":", with an average 12.7% drop, and even smaller-scale SFT leads to a similar decline. Regarding model size, larger models are not immune to the degradation caused by SFT, as 7B models suffer almost the same performance drop as their smaller counterparts. 
Finally, examining the training procedure, we observe little correlation between response length, reward, and performance: SFT-ed models obtain higher initial rewards and produce longer responses yet underperform RL-trained ones, in contrast to the previous observation that better models usually produce longer answers with higher RL reward ("}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:17"}]}],["$","$1","18",{"children":", "}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:19"}]}],["$","$1","20",{"children":"; "}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:21"}]}],["$","$1","22",{"children":", "}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:9:23"}]}],["$","$1","24",{"children":")."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To summarize, while SFT helps unaligned models follow instructions, it limits exploration during RL by promoting imitative reasoning. In contrast, learning directly from reward signals yields more effective and adaptable thinking behavior. Empirically, direct RL proves superior. 
Our model, "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:10:1:style","children":"VLAA-Thinker-Qwen2.5VL-3B"}]}],["$","$1","2",{"children":", achieves the "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:1:paragraphs:10:3:style","children":"top-1 "}]}],["$","$1","4",{"children":"performance on the Open LMM Reasoning Leaderboard among 4B-scale LVLMs, surpassing the previous state-of-the-art by 1.8%. Our case study further emphasizes these gains with more concise, effective reasoning traces presented in model answers."}]]}]]}],["$","$L30","2",{"heading":"2 The VLAA-Thinking Dataset","index":2,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To systematically evaluate the “SFT then RL” paradigm for developing reasoning capabilities in LVLMs, we construct "}],["$","$1","1",{"children":"VLAA-Thinking"}],["$","$1","2",{"children":", a dataset that consists of two parts: 1) "}],["$","$1","3",{"children":"VLAA-Thinking"}],["$","$1","4",{"children":"-SFT which captures step-by-step reasoning grounded in visual inputs for SFT, and 2) "}],["$","$1","5",{"children":"VLAA-Thinking"}],["$","$1","6",{"children":"-RL which contains challenging samples designed specifically for RL. Our data generation pipeline is designed to transfer reasoning capabilities from a powerful text-only model to the multimodal domain through a structured, multi-stage process. 
The entire pipeline, as illustrated in Figure "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:0:7"}]}],["$","$1","8",{"children":", consists of six key components:"}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:0:style","children":"#1: Metadata Collection "}]}],["$","$1","1",{"children":"We collect metadata from 9 vision-language datasets featuring either closed- or open-ended questions. Specifically, we sample data containing unique images from CLEVR-Math ("}],["$","$1","2",{"children":"Lindstr¨"}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:3"}]}],["$","$1","4",{"children":", "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:5"}]}],["$","$1","6",{"children":"), Math PUMA ("}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:7"}]}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:9"}]}],["$","$1","10",{"children":"), ArxivQA ("}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:11"}]}],["$","$1","12",{"children":", 
"}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:13"}]}],["$","$1","14",{"children":"), DocVQA ("}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:15"}]}],["$","$1","16",{"children":", "}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:17"}]}],["$","$1","18",{"children":"), VizWiz ("}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:19"}]}],["$","$1","20",{"children":", "}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:21"}]}],["$","$1","22",{"children":"), and ALLaVA ("}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:23"}]}],["$","$1","24",{"children":", "}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:25"}]}],["$","$1","26",{"children":"), and process them through our complete data pipeline. 
In addition, we directly adopt COCO and VisualGenome data from LLaVA-CoT ("}],["$","$1","27",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:1:27"}]}],["$","$1","28",{"children":","}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/3-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:2:0"}]]}]}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 1: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:1:style","children":"Data statistics of "}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:2:style","children":"VLAA-Thinking"}]}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:3:3:style","children":". 
"}]}],["$","$1","4",{"children":"We present the original volume of metadata (#Ori.), the data size after the distillation pipeline (#Pipeline), the size of sampled examples for SFT (#Final SFT) and RL (#Final RL), respectively. Note that we only use GeoQA170K with verifiable answers for the RL split."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:0"}]}],["$","$1","1",{"children":"). An exception is GeoQA170K ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:4"}]}],["$","$1","5",{"children":"), which we include only in the RL split due to persistent hallucination issues during captioning. Detailed statistics are in Table "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:4:6"}]}],["$","$1","7",{"children":"."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:0:style","children":"#2: Visual Input and Additional Information "}]}],["$","$1","1",{"children":"Each sample begins with an image, question, and its corresponding answer. 
To bridge the gap between the visual modality and language reasoning, we use GPT-4o to generate a detailed image caption describing the content in structured and semantically rich language (detailed prompts in Appendix "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:2"}]}],["$","$1","3",{"children":"). During this process, we take full advantage of the provided knowledge in the data beyond just the GPT captions. In detail, we provide the following dataset-specific information: (1) CLEVRMath: instructions for synthesizing the image from CLEVR ("}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:4"}]}],["$","$1","5",{"children":", "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:5:6"}]}],["$","$1","7",{"children":"); (2) Math PUMA: textual descriptions of the math problems in the image, taken from the dataset itself; (3) ALLaVA-LAION: fine-grained and verified GPT-4V captions from the original dataset."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:6:0:style","children":"#3: Reasoning Answer Distillation "}]}],["$","$1","1",{"children":"We use a strong text-only reasoning model, DeepSeek-R1, to generate reasoning rationales and final answers. The model is provided with the image caption, the visual question, and additional information from certain datasets. 
It responds using a structured reasoning format that is between "}],["$","$1","2",{"children":" "}],["$","$1","3",{"children":"and "}],["$","$1","4",{"children":" "}],["$","$1","5",{"children":"tags and contains a sequence of logical steps leading to the final answer."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:0:style","children":"#4: Answer and Rewriting "}]}],["$","$1","1",{"children":"To enhance consistency and eliminate modality-specific artifacts, the raw reasoning answers generated by R1 are passed through a rewriting module ("}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:2:style","children":"i.e"}]}],["$","$1","3",{"children":"., GPT-3.5-turbo ("}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:4"}]}],["$","$1","5",{"children":", "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:6"}]}],["$","$1","7",{"children":") in our experiment). This module removes unnecessary phrases ("}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:7:8:style","children":"e.g"}]}],["$","$1","9",{"children":"., references to “caption”), and ensures the answer adheres to a clean, instruction-following format based on the image. 
We further filter out samples whose sentence length gap exceeds 15 words to ensure minimal modification in this process."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:8:0:style","children":"#5: Automated Verification "}]}],["$","$1","1",{"children":"To assess whether the generated reasoning answers are correct with respect to the ground-truth answers, we implement an automated verifier. This verifier compares the rewritten reasoning answer to the ground truth of the visual question and labels each output as correct or incorrect. Only the examples that are verified as correct are retained as the final training data."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:0:style","children":"#6: Curating Splits for SFT and RL "}]}],["$","$1","1",{"children":"The last step of our data generation pipeline is to curate two non-overlapping training sets for SFT and RL, respectively. 
Inspired by "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:3"}]}],["$","$1","4",{"children":"("}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:5"}]}],["$","$1","6",{"children":") which finds that RL is particularly effective in encouraging deeper reasoning on challenging cases, we aim to select more challenging samples for the RL split. To achieve this, we propose using the presence of "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:7:style","children":"self-reflective cues "}]}],["$","$1","8",{"children":"("}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:9:style","children":"i.e"}]}],["$","$1","10",{"children":"., the “aha moments”) in the distilled answers as an indicator of a sample’s difficulty level (details are in Appendix "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:11"}]}],["$","$1","12",{"children":"). For the SFT split, we exclude samples "}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:9:13:style","children":"with “aha moments”"}]}],["$","$1","14",{"children":", as such samples may be too complex to fully imitate through finetuning. 
On the other hand, the harder examples with “aha moments” form the RL split, on which reward-driven learning may be better suited to elicit meaningful reflection."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Following these steps, our dataset adheres to the format "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:1:style","children":"{"}]}],["$","$1","2",{"children":"image, question, reasoning, answer"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:2:paragraphs:10:3:style","children":"}"}]}],["$","$1","4",{"children":", with reasoning and answer generated by DeepSeek-R1. We construct a high-quality multimodal reasoning dataset with 126,413 samples for SFT and 25,195 samples for RL."}]]}]]}],["$","$L30","3",{"heading":"3 Investigating The Role of SFT for Multimodal Reasoning","index":3,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"SFT has become the de-facto approach for training LLMs. 
Recent studies aim to extend the strengths of SFT to empower LVLMs with reasoning abilities by training on specially formatted data. Unlike prior methods that incorporate standalone textual descriptions of images ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:3"}]}],["$","$1","4",{"children":"), this direct strategy enables the model to develop grammatically coherent reasoning abilities, allowing it to “think before speak.” In recent vision-language reasoning systems, there is a notable trend of complementing or even replacing SFT with RL to enhance complex reasoning abilities ("}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:7"}]}],["$","$1","8",{"children":"; "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:9"}]}],["$","$1","10",{"children":", "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:11"}]}],["$","$1","12",{"children":"). We follow this line and take it further by probing the underlying cause of this shift. Our findings suggest that self-reflective thinking (“aha moments”) learned through SFT is overloaded with excessive and irrelevant reasoning, becoming what we call “pseudo aha moments” and ultimately hurting performance. 
In this section, we explore "}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:13:style","children":"1) "}]}],["$","$1","14",{"children":"how the model performs when SFT-ed on data with aha-moments and "}],["$","$1","15",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:0:15:style","children":"2) "}]}],["$","$1","16",{"children":"the effect of SFT data size on model performance."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-51","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:0:style","children":"3.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:1:1:style","children":"Experiment Setup"}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To investigate the effect of SFT training with aha-moments, we collect the distilled VQA pairs whose distilled answers contain aha-moments, totaling 55K samples. To study the effect of SFT with different sizes of training sets, we use perplexity (PPL) filtering to obtain a smaller SFT dataset. 
Specifically, we compute the PPL score of each answer in "}],["$","$1","1",{"children":"VLAA-Thinking-SFT-126K "}],["$","$1","2",{"children":"using Qwen2-VL-2B and Qwen2.5-VL-3B, and sort all samples by their average PPL scores over the two models. We keep the samples with high PPLs to obtain a total of 25K SFT samples, as these harder examples push models to learn more effectively and efficiently ("}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:3"}]}],["$","$1","4",{"children":", "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:5"}]}],["$","$1","6",{"children":"; "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:7"}]}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:2:9"}]}],["$","$1","10",{"children":")."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We select four models for training: Qwen2VL (2B and 7B)"}],["$","$1","1",{"children":"2"}],["$","$1","2",{"children":", Qwen2.5VL (3B and 7B). Each model is trained with a batch size of 128 and their vision encoder frozen. 
We evaluate model performance with VLMEvalKit ("}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:3"}]}],["$","$1","4",{"children":", "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:5"}]}],["$","$1","6",{"children":") on 6 math reasoning benchmarks hosted in Open LMM Reasoning Leaderboard, which contains 6 challenging math reasoning benchmarks including MathVista ("}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:7"}]}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:9"}]}],["$","$1","10",{"children":"), MathVision ("}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:11"}]}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:13"}]}],["$","$1","14",{"children":"), MathVerse ("}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:15"}]}],["$","$1","16",{"children":", "}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:17"}]}],["$","$1","18",{"children":"), DynaMath ("}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:19"}]}],["$","$1","20",{"children":", "}],["$","$1","21",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:21"}]}],["$","$1","22",{"children":"), WeMath 
("}],["$","$1","23",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:23"}]}],["$","$1","24",{"children":", "}],["$","$1","25",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:25"}]}],["$","$1","26",{"children":"), LogicVista ("}],["$","$1","27",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:27"}]}],["$","$1","28",{"children":", "}],["$","$1","29",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:29"}]}],["$","$1","30",{"children":"). We present the percentage of relative performance drop of different models in Figure "}],["$","$1","31",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:3:31"}]}],["$","$1","32",{"children":". Detailed training and evaluation setup are in Appendix "}],["$","$1","33",{"children":"B"}],["$","$1","34",{"children":"."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-45","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:0:style","children":"3.2 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:4:1:style","children":"Findings"}]}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/4-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:5:0"}]]}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 2: "}],["$","$1","1",{"children":"Average performance over 6 reasoning benchmarks of Qwen-2.5-VL-3B SFT-ed on different sizes of SFT data and on data containing only examples with aha moment ("}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:6:2:style","children":"aha"}]}],["$","$1","3",{"children":"-55K)."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:0:style","children":"SFT with 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:1:style","children":"Aha Moments "}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:2:style","children":"Degrades Performance. "}]}],["$","$1","3",{"children":"We present results for the Qwen-2.5-VL-3B model trained under three different settings using our SFT data in Table "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:4"}]}],["$","$1","5",{"children":". Somewhat unexpectedly, the model fine-tuned on 55K examples containing the "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:6:style","children":"aha moment "}]}],["$","$1","7",{"children":"performs significantly worse than the base model, with an average drop of 10.5%. This suggests that chasing the "}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:8:style","children":"aha moment "}]}],["$","$1","9",{"children":"through SFT is unreliable, as SFT merely teaches the model to mimic rather than to generalize genuine self-reflective reasoning. 
Additionally, the table shows evidence that straightforward SFT using multimodal reasoning data also degrades performance, "}],["$","$1","10",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:7:10:style","children":"e.g"}]}],["$","$1","11",{"children":"., we observe an average drop of 10.2% and 19.1% when fine-tuning on 25K and 126K samples, respectively."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/5-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:8:0"}]]}]}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 3: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:9:1:style","children":"Delta percentage performance change "}]}],["$","$1","2",{"children":"of different models trained with supervised fine-tuning (SFT) 
only."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:0:style","children":"More SFT Data, Worse Performance. "}]}],["$","$1","1",{"children":"Counterintuitively, even a five-fold increase in the supervised dataset (from 25K to 126K instances) often fails to improve performance and in most cases actually "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:2:style","children":"harms "}]}],["$","$1","3",{"children":"it. Models trained with 126K SFT samples suffer an average relative performance drop of over 14% compared to their 25K-trained counterparts across all model and task settings ("}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:4:style","children":"e.g"}]}],["$","$1","5",{"children":"., 25K: 32.2% "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:6:style","children":"vs"}]}],["$","$1","7",{"children":". 126K: 47.0%). This degradation is particularly evident on complex datasets such as WeMath and DynaMath, where the relative decrease reaches as high as 97.9%, averaged over the Qwen2.5-VL models. 
Even on mid-difficulty benchmarks like MathVision and MathVerse ("}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:10:8:style","children":"i.e"}]}],["$","$1","9",{"children":"., where model performance is relatively higher), the 126K SFT models underperform, with an average drop of 28.6% relative to the base models, averaged over the four models. These results suggest that simply scaling up SFT data does not boost generalizable reasoning skills of LLMs, and may instead suppress the model’s capacity on various reasoning tasks."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:11:0:style","children":"Larger Models Are Not Immune to SFT Degeneration. "}]}],["$","$1","1",{"children":"Contrary to expectations, scaling up model size does not mitigate the adverse effects of excessive SFT: under heavier SFT, larger models also exhibit pronounced drops on the most challenging evaluations. The larger 7B models fine-tuned on 126K examples experience drops nearly identical in magnitude to their smaller 2B or 3B counterparts: 47.2% for smaller models "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:11:2:style","children":"vs"}]}],["$","$1","3",{"children":". 45.4% for larger models compared with base models. 
Notably, despite the strong performance of the Qwen2.5-VL-7B model ("}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:3:paragraphs:11:4:style","children":"e.g"}]}],["$","$1","5",{"children":"., 68.1% on MathVista), it also suffers an average decline of 52.5% on these reasoning tasks when SFT-ed with 126K data."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"These findings highlight the limitations of SFT as a tool for enhancing multimodal reasoning. While it may be suitable for learning reasoning formats, it falls short of expectations for fostering inherent self-reflection. Rather than simply scaling supervision data, our results suggest a shift toward more advanced training methods like RL."}]]}]]}],["$","$L30","4",{"heading":"4 Improving Multimodal Reasoning with Mixed Rewards","index":4,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The previous section shows that SFT is insufficient to transfer R1’s ability to LVLMs on vision-language tasks. Therefore, it is crucial to seek other post-training methods to elicit the reasoning ability of LVLMs. 
Since reinforcement learning (RL) is effective in enhancing reasoning ability ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:3"}]}],["$","$1","4",{"children":"; "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:5"}]}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:7"}]}],["$","$1","8",{"children":"), and GRPO has recently been proven more effective and efficient on textual math reasoning task ("}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:9"}]}],["$","$1","10",{"children":", "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:11"}]}],["$","$1","12",{"children":"; "}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:0:13"}]}],["$","$1","14",{"children":","}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/6-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:1:0"}]]}]}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 4: The proposed "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:2:1:style","children":"Mixed Reward Module "}]}],["$","$1","2",{"children":"for GRPO training, comprising 2 reward formats (rule-based and open-ended) and 5 types of verifiable rewards (digit, MCQ, math, IoU and general reasoning)."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:3:0"}]}],["$","$1","1",{"children":") than other methods like PPO ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:3:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:3:4"}]}],["$","$1","5",{"children":"), it motivates us to apply GRPO training for vision-language reasoning 
tasks."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Mathematically, let "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:4:1:style","children":"q "}]}],["$","$1","2",{"children":"be a query and "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:4:3"}]]}]}],["$","$1","4",{"children":"be a group of "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:4:5:style","children":"G "}]}],["$","$1","6",{"children":"sampled outputs from the old "}],["$","$1","7",{"children":"policy model "}],["$","$1","8",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:4:8"}]]}]}],["$","$1","9",{"children":", GRPO maximizes the following objective:"}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/6-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:5:0"}]]}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"where "}],["$","$1","1",{"children":"ˆ"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:2:style","children":"A"}]}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:3:style","children":"i"}]}],["$","$1","4",{"children":","}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:5:style","children":"t "}]}],["$","$1","6",{"children":"is the estimated advantage, "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:7"}]]}]}],["$","$1","8",{"children":"is the KL penalty coefficient and "}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:6:9"}]]}]}],["$","$1","10",{"children":"current, old, and reference policies, respectively."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:7:0:style","children":"4.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:7:1:style","children":"GRPO with Mixed Reward"}]}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To better adapt GRPO to multimodal reasoning, in addition to adopting the rule-based reward similar to the textual GRPO training, it is necessary to consider additional characteristics introduced by the vision modality. 
Inspired by ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:3"}]}],["$","$1","4",{"children":") which benchmarks LVLMs by "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:5:style","children":"perception "}]}],["$","$1","6",{"children":"and "}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:7:style","children":"cognition "}]}],["$","$1","8",{"children":"(reasoning), we propose a "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:9:style","children":"mixed reward framework "}]}],["$","$1","10",{"children":"for GRPO training, as illustrated in Figure "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:8:11"}]}],["$","$1","12",{"children":". 
The reward system comprises five types of verifiable rewards with two formats, encompassing both visual perception and visual reasoning tasks."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:0:style","children":"Rule-Based Reward "}]}],["$","$1","1",{"children":"There are 4 types of rule-based rewards: digit matching, option letter matching, math expression matching, and Intersection over Union (IoU) for bounding boxes. For digit matching, the model is asked to answer counting questions from CLEVR-Math whose ground truth is a single digit. For option letter matching, the model is required to answer an MCQ. For math expression matching, the model is asked to solve a math question, such as finding a function expression or the volume of a cone, and output its answer in LaTeX format. We use the "}],["$","$1","2",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:9:2"}]]}]}],["$","$1","3",{"children":"package to check for correctness. 
For bounding boxes, the model is prompted to output the bounding box coordinates of an object in the image, and an IoU score (ranging from 0 to 1) is computed as the reward."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:0:style","children":"Open-ended Reward "}]}],["$","$1","1",{"children":"We leverage InternLM-XComposer2.5-Reward ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:4"}]}],["$","$1","5",{"children":") as the scorer, denoted as "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:6:style","children":"S"}]}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:7"}]]}]}],["$","$1","8",{"children":", which takes an image and a QA pair as input, and outputs a reward score. 
Following "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:9"}]}],["$","$1","10",{"children":"("}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:11"}]}],["$","$1","12",{"children":"), the reward for a sampled response ˆ"}],["$","$1","13",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:13:style","children":"y "}]}],["$","$1","14",{"children":"is computed as "}],["$","$1","15",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:15:style","children":"R"}]}],["$","$1","16",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:16:style","children":"open "}]}],["$","$1","17",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:17"}]]}]}],["$","$1","18",{"children":"if "}],["$","$1","19",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:19:style","children":"f"}]}],["$","$1","20",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:20"}]]}]}],["$","$1","21",{"children":"else 0, where 
"}],["$","$1","22",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:22:style","children":"S"}]}],["$","$1","23",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:23"}]]}]}],["$","$1","24",{"children":"is the score of the reference answer, and "}],["$","$1","25",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:10:25"}]]}]}],["$","$1","26",{"children":"is a smoothing hyperparameter. Note that the open-ended reward is normalized into [0,1], which is consistent with the scale of rule-based reward, partially avoiding reward hacking during training."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:0:style","children":"Implicit Format Reward "}]}],["$","$1","1",{"children":"Unlike "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:2"}]}],["$","$1","3",{"children":"("}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:4"}]}],["$","$1","5",{"children":") and 
its subsequent works, which use a separate reward term for format correctness, we discard this format reward term and make the format reward supersede all other rewards. Namely, whenever we are unable to extract a valid response from the raw answer, the reward is 0. We empirically find that by specifying the output format in the system prompt, the model is able to generate answers with correct formats through trial and error. The implicit format reward design simplifies the reward computation. Further, it may yield better performance since fewer restrictions are imposed on the exploration process ("}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:11:8"}]}],["$","$1","9",{"children":")."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:12:0:style","children":"4.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:12:1:style","children":"Effect of SFT on GRPO Training"}]}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/7-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:13:0"}]]}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 3: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:14:1:style","children":"Benchmark results of models trained with GRPO on different backbones. "}]}],["$","$1","2",{"children":"SFT+GRPO yields performance degradation, indicating that SFT is NOT compatible with GRPO in multimodal reasoning."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:15:0:style","children":"SFT is NOT Compatible with GRPO in Multimodal Reasoning. "}]}],["$","$1","1",{"children":"Although we reveal in Section "}],["$","$1","2",{"children":"3 "}],["$","$1","3",{"children":"that SFT alone leads to a performance drop in multimodal reasoning, it is still unclear whether SFT plays a crucial role in aiding GRPO, like the golden key in DeepSeek-R1. We experiment with different backbones for GRPO training. 
Specifically, we adopt Qwen2VL-7B-Base and Qwen2VL-7B-Inst, and perform SFT on them with 25K samples, followed by GRPO training."}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"From Table "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:16:1"}]}],["$","$1","2",{"children":", we observe that models that undergo SFT before GRPO training perform worse than those trained with GRPO alone, with an average drop of 8.9% across Qwen2VL-Base and Qwen2VL-Inst compared to their non-SFT counterparts. We also find that SFT degrades instruction models more than base models without instruction-following capabilities. For instance, Qwen2VL-Inst suffers a 7.7% larger performance drop than Qwen2VL-Base post-SFT, suggesting that SFT can compromise the instruction-following ability crucial for effective GRPO training. 
Taken together, these results suggest that SFT is currently incompatible with GRPO in the context of multimodal reasoning, impairing both base and instruction-tuned LVLMs."}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/7-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:17:0"}]]}]}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 5: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:18:1:style","children":"Impact of SFT with 5K and 10K samples before GRPO. "}]}],["$","$1","2",{"children":"Smaller SFT datasets still jeopardize GRPO performance."}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:19:0:style","children":"Smaller SFT Dataset Still Jeopardizes GRPO Performance. 
"}]}],["$","$1","1",{"children":"Since we reveal in Section "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:19:2"}]}],["$","$1","3",{"children":"that more SFT data yields lower performance, we try to investigate the effect of downsizing "}],["$","$1","4",{"children":"the SFT training set. Following the PPL filtering method in Section "}],["$","$1","5",{"children":"3"}],["$","$1","6",{"children":", we select top-10K and top-5K samples from "}],["$","$1","7",{"children":"VLAA-Thinking-SFT-126K "}],["$","$1","8",{"children":"to finetune Qwen2.5-VL-3B, followed by GRPO training. For comparison, we also conduct GRPO training without SFT."}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We present the performance of Qwen2.5-VL-3B on each task in Figure "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:20:1"}]}],["$","$1","2",{"children":". A clear observation is that applying SFT on 5K examples prior to GRPO significantly degrades performance compared to using GRPO alone, showing an average drop of 13.5%. Moreover, scaling up SFT data to 10K yields only a marginal improvement of 0.8%. 
These results further support that SFT before GRPO can hinder the model’s learning capability."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/8-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:21:0"}]]}]}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 6: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:22:1:style","children":"Response length (left) and reward (right) during training. 
"}]}],["$","$1","2",{"children":"Training with only GRPO yields the lowest response length and yet the highest final reward and best benchmark performance, indicating that response length, reward, and model performance are NOT necessarily related."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:0:style","children":"Response Length, Reward, and Model Performance are NOT Necessarily Related. "}]}],["$","$1","1",{"children":"Prior work in RL suggests that longer responses often correlate with better reasoning and higher RL rewards ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:4"}]}],["$","$1","5",{"children":"; "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:8"}]}],["$","$1","9",{"children":"; "}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:10"}]}],["$","$1","11",{"children":", "}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:12"}]}],["$","$1","13",{"children":"). 
However, our findings in Figure "}],["$","$1","14",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:14"}]}],["$","$1","15",{"children":"reveal that response length and reward in GRPO are not reliable indicators of reasoning ability. For instance, the 10K SFT+GRPO model produces the longest responses but ends up with lower rewards than the GRPO-only model ("}],["$","$1","16",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:16"}]]}]}],["$","$1","17",{"children":"0.35 vs. "}],["$","$1","18",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:23:18"}]]}]}],["$","$1","19",{"children":"0.5) after training. 
Similarly, the 5K SFT+GRPO variant shows moderate length and reward but still underperforms on downstream tasks."}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Interestingly, both SFT-ed models start with higher initial rewards ("}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:1:style","children":"e.g"}]}],["$","$1","2",{"children":"., "}],["$","$1","3",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:3"}]]}]}],["$","$1","4",{"children":"0.20 for 10K SFT+GRPO "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:5:style","children":"vs"}]}],["$","$1","6",{"children":". "}],["$","$1","7",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:7"}]]}]}],["$","$1","8",{"children":"0.05 for GRPO-only), which is likely due to their early learning experience with supervision since SFT and GRPO data share the same distribution. However, they exhibit limited reward improvement during training, whereas the GRPO-only model rapidly surpasses them. 
These trends further reveal that SFT solely provides a higher “lower bound” for RL training, yet it may lower the “upper bound” since the reasoning SFT data constrains the model’s exploration paths. Therefore, "}],["$","$1","9",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:9:style","children":"reasoning is a native emerging ability that is more likely to be developed through RL, not SFT"}]}],["$","$1","10",{"children":". While SFT-ed models may appear to reason, their behavior is closer to pattern imitation — "}],["$","$1","11",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:24:11:style","children":"a form of pseudo-reasoning that lacks the generalizable reasoning skills"}]}],["$","$1","12",{"children":"."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:25:0:style","children":"4.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:25:1:style","children":"GRPO Training "}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:25:2:style","children":"without 
"}]}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:25:3:style","children":"SFT"}]}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Following the findings in the previous section, we directly conduct GRPO training which yields four models: VLAA-Thinker-Qwen2-VL-2B, VLAA-Thinker-Qwen2-VL-7B, VLAA-Thinker-Qwen2.5-VL-3B, VLAA-Thinker-Qwen2.5-VL-7B. We also train on a base model of Qwen2-VL-7B, and the resulting model is named VLAA-Thinker-Qwen2-7B-Zero."}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We sample 4 times for each query with temperature 0.8. Rollout and training batch size are set as 512 and 256, respectively. We train our model for 1 episode (outer loop) and 1 epoch per episode (inner loop) on 8*H100 GPUs with 49 steps. More details of training setup are in Appendix "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:27:1"}]}],["$","$1","2",{"children":". We follow the identical evaluation setup as described in Section "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:27:3"}]}],["$","$1","4",{"children":". 
We present evaluation results in Table "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:27:5"}]}],["$","$1","6",{"children":"and list our main findings below."}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:28:0:style","children":"Direct GRPO Training Boosts Model Performance. "}]}],["$","$1","1",{"children":"Models trained directly with GRPO on the VL-Thinking RL split consistently outperform their respective base models. For example,"}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/9-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:29:0"}]]}]}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 4: Evaluation results on 6 math reasoning benchmarks from the Open LMM Leaderboard. 
VLAA-Thinker models significantly outperform baselines and other models."}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"at the 7B scale, two models trained on VL-Thinking achieve an average score of 36.5%, marking a 2.0% improvement over their base model average of 34.5%. Moreover, our best-performing 7B model consistently outperforms other similarly sized LVLMs ("}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:31:1:style","children":"e.g"}]}],["$","$1","2",{"children":"., InternVL2.5-8B, LLaVA-OneVision-7B), while our 3B model surpasses the recent reasoning-focused model, VLM-R1-Math, by 1.1% on average. These results once again demonstrate that GRPO significantly enhances reasoning capabilities, even without additional SFT."}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:32:0:style","children":"Stronger Instruction Model Leads to Better Post-GRPO Reasoning. "}]}],["$","$1","1",{"children":"An interesting observation is that models with better instruction tuning generally perform better. 
The instruction-aligned Qwen2-7B model, after GRPO, outperforms its unaligned counterpart VLAA-Thinker-Qwen2-7B-Zero by 1.8% on average (31.3% "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:32:2:style","children":"vs"}]}],["$","$1","3",{"children":". 29.5%), with notable gains on harder tasks like DynaMath (5.0%) and WeMath (3.1%). Moreover, using a stronger instruction-tuned model for GRPO further improves performance across both 3B and 7B scales: VLAA-Thinker-Qwen2.5 surpasses VLAA-Thinker-Qwen2 by 12.6% on average, confirming that higher-quality instruction tuning leads to more effective post-RL reasoning."}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/9-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:33:0"}]]}]}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 7: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:34:1:style","children":"Heatmap of different “aha” expressions "}]}],["$","$1","2",{"children":"generated by VLAA-Thinker models during 
training."}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:0:style","children":"Emergence of Authentic "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:1:style","children":"Aha Moments"}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:2:style","children":". "}]}],["$","$1","3",{"children":"To show that our GRPO training can induce an authentic self-reflection process, we plot the frequency of four aha expressions (“alternatively”, “double-check”, “i should check”, “wait”) for each VLAA-Thinker model in Figure "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:4"}]}],["$","$1","5",{"children":". Since all models are trained using GRPO without being SFT-ed on distilled reasoning paths, all aha moments emerge from the GRPO process, demonstrating the model’s self-developed reflective ability. 
Another finding is that the number of "}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:6:style","children":"aha moments "}]}],["$","$1","7",{"children":"does not directly correlate with overall model performance, as more "}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:35:8:style","children":"aha moments "}]}],["$","$1","9",{"children":"do not necessarily translate to higher reasoning scores."}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:36:0:style","children":"4.4 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:36:1:style","children":"Ablations"}]}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/10-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:37:0"}]]}]}]]}],["$","$La","38",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 5: "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:38:1:style","children":"Ablation of Mixed Reward "}]}],["$","$1","2",{"children":"on MVi: MathVision, MVs: MathVerse and WM: WeMath. A combination of rule-based and open-ended rewards yields significant boost in performance."}]]}],["$","$La","39",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:39:0:style","children":"Mixed Reward. "}]}],["$","$1","1",{"children":"To demonstrate the effectiveness of our mixed reward strategy, we perform an ablation study on Qwen2.5-VL-3B by selectively disabling individual reward components and evaluating performance across three math reasoning benchmarks, as shown in Table "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:39:2"}]}],["$","$1","3",{"children":". 
The model trained with "}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:39:4:style","children":"Mixed Reward "}]}],["$","$1","5",{"children":"achieves the best overall performance, with an average improvement of 6.2% over the baseline, demonstrating the effectiveness of our reward design. Using only rule-based rewards (All Rule-Based) also yields consistent gains ("}],["$","$1","6",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:39:6:style","children":"e.g"}]}],["$","$1","7",{"children":"., 29.1% vs. 25.3% baseline), while removing specific components, especially MCQ (w/o MCQ), leads to substantial drops. These results highlight the critical role of rule-based rewards in GRPO for multimodal reasoning tasks."}]]}],["$","$La","40",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/10-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:40:0"}]]}]}]]}],["$","$La","41",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Table 6: "}],["$","$1","1",{"children":"Ablation on 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:41:2:style","children":"LR "}]}],["$","$1","3",{"children":"and "}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:41:4:style","children":"KL Coef. "}]}],["$","$1","5",{"children":"on MVs: MathVerse, DM: DynaMath and LV: LogicVista."}]]}],["$","$La","42",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:0:style","children":"Hyperparameters "}]}],["$","$1","1",{"children":"To search for better hyperparameters, we experiment with different "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:2:style","children":"learning rates "}]}],["$","$1","3",{"children":"(LR) and "}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:4:style","children":"KL divergence "}]}],["$","$1","5",{"children":"settings on Qwen2.5-VL-3B. We start with a basic setting where LR anneals to zero following a cosine scheduler with no KL constraint. Results are shown in Table "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:6"}]}],["$","$1","7",{"children":". 
LR1 uses a minimum learning rate of 8"}],["$","$1","8",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:8:style","children":"e"}]}],["$","$1","9",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:9"}]]}]}],["$","$1","10",{"children":"with warmup ratio 10%, whereas LR2 uses a minimum learning rate of 5"}],["$","$1","11",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:11:style","children":"e"}]}],["$","$1","12",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:12"}]]}]}],["$","$1","13",{"children":"with warmup ratio 3%. Since LR2 performs slightly better than LR1, we compare two KL settings on top of LR2. 
KL1 uses an initial KL of 1"}],["$","$1","14",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:14:style","children":"e"}]}],["$","$1","15",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:42:15"}]]}]}]]}],["$","$La","43",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"and a target KL of 5"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:1:style","children":"e"}]}],["$","$1","2",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:2"}]]}]}],["$","$1","3",{"children":", whereas KL2 uses an initial KL coefficient of 1"}],["$","$1","4",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:4:style","children":"e"}]}],["$","$1","5",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:5"}]]}]}],["$","$1","6",{"children":"and 
a target KL of 5"}],["$","$1","7",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:7:style","children":"e"}]}],["$","$1","8",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"1ch","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[null,["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:43:8"}]]}]}],["$","$1","9",{"children":". We find that introducing KL constraints significantly improves the performance on MathVerse and DynaMath by 1.1% and 3.2%, respectively, and that using a smaller KL can encourage the model to evolve."}]]}],["$","$La","44",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:44:0:style","children":"4.5 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:44:1:style","children":"Case Study"}]}]]}],["$","$La","45",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We provide an example showcasing the improvement of VLAA-Thinker over the original model in Appendix "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:4:paragraphs:45:1"}]}],["$","$1","2",{"children":". 
Qwen2.5VL-7B generates a lengthy response with incorrect reasoning traces. Although it outputs some self-reflective patterns like “re-evaluate”, the final answer remains wrong. On the other hand, VLAA-Thinker-Qwen2.5VL-7B is able to reason on the right track, with only a minor mistake near the end of its thinking process. Nevertheless, the high-level idea and reasoning process are correct overall, demonstrating a strong capability for solving complex reasoning tasks."}]]}]]}],["$","$L30","5",{"heading":"5 Related Work","index":5,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:0:style","children":"Vision-Language Reasoning Models. "}]}],["$","$1","1",{"children":"Recent advances in vision-language (VL) reasoning models build on the success of text-only reasoning systems like OpenAI’s o1 ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:4"}]}],["$","$1","5",{"children":") and DeepSeek-R1 ("}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:6"}]}],["$","$1","7",{"children":", "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:8"}]}],["$","$1","9",{"children":"). 
Earlier VL methods, such as few-shot prompting and chain-of-thought (CoT), offered limited visual reasoning ("}],["$","$1","10",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:10"}]}],["$","$1","11",{"children":", "}],["$","$1","12",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:12"}]}],["$","$1","13",{"children":"; "}],["$","$1","14",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:14"}]}],["$","$1","15",{"children":", "}],["$","$1","16",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:16"}]}],["$","$1","17",{"children":"). Recently, LLaVA-CoT ("}],["$","$1","18",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:18"}]}],["$","$1","19",{"children":", "}],["$","$1","20",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:20"}]}],["$","$1","21",{"children":") adopts an SFT approach with 4-step structured outputs to enhance the model’s reasoning, yet it lacks flexibility due to its rigid output format. More recently, newer models incorporate more natural reasoning traces and reinforcement learning. 
VLM-R1 ("}],["$","$1","22",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:22"}]}],["$","$1","23",{"children":", "}],["$","$1","24",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:24"}]}],["$","$1","25",{"children":") and R1-V ("}],["$","$1","26",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:26"}]}],["$","$1","27",{"children":", "}],["$","$1","28",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:28"}]}],["$","$1","29",{"children":") align multimodal LLMs using step-by-step reasoning and policy optimization. VisualThinker-R1-Zero ("}],["$","$1","30",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:30"}]}],["$","$1","31",{"children":", "}],["$","$1","32",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:32"}]}],["$","$1","33",{"children":") goes further by training a 2B model via pure RL from scratch, achieving emergent inner reasoning. LMM-R1 ("}],["$","$1","34",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:34"}]}],["$","$1","35",{"children":", "}],["$","$1","36",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:36"}]}],["$","$1","37",{"children":") transfers CoT skills from language to vision through staged RL. 
Vision-R1 ("}],["$","$1","38",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:38"}]}],["$","$1","39",{"children":", "}],["$","$1","40",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:0:40"}]}],["$","$1","41",{"children":") combines reasoning trace supervision and RL with correctness and format rewards to train a strong 7B VL reasoner. Different from these concurrent works, we propose a high-quality multimodal reasoning dataset with R1-like reasoning traces for both SFT and RL, and provide a comprehensive study on training paradigms."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:0:style","children":"Reward Modeling in Reinforcement Learning. "}]}],["$","$1","1",{"children":"Reward design plays a central role in reasoning-oriented RL. 
While model-based rewards offer flexibility ("}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:2"}]}],["$","$1","3",{"children":", "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:4"}]}],["$","$1","5",{"children":"; "}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:6"}]}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:7"}]}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:9"}]}],["$","$1","10",{"children":"; "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:11"}]}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:13"}]}],["$","$1","14",{"children":"), they are prone to reward hacking ("}],["$","$1","15",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:15"}]}],["$","$1","16",{"children":", "}],["$","$1","17",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:17"}]}],["$","$1","18",{"children":"; "}],["$","$1","19",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:19"}]}],["$","$1","20",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:20"}]}],["$","$1","21",{"children":", 
"}],["$","$1","22",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:22"}]}],["$","$1","23",{"children":"; "}],["$","$1","24",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:24"}]}],["$","$1","25",{"children":", "}],["$","$1","26",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:26"}]}],["$","$1","27",{"children":"), making them risky for reasoning tasks. Recent VL models prefer binary correctness rewards ("}],["$","$1","28",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:28"}]}],["$","$1","29",{"children":", "}],["$","$1","30",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:30"}]}],["$","$1","31",{"children":"; "}],["$","$1","32",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:32"}]}],["$","$1","33",{"children":", "}],["$","$1","34",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:34"}]}],["$","$1","35",{"children":") for math or QA tasks, directly reinforcing accurate outputs. 
Others apply rule-based rewards, enforcing structured formats or logic chains ("}],["$","$1","36",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:36"}]}],["$","$1","37",{"children":", "}],["$","$1","38",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:38"}]}],["$","$1","39",{"children":"; "}],["$","$1","40",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:40"}]}],["$","$1","41",{"children":", "}],["$","$1","42",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:42"}]}],["$","$1","43",{"children":"). While recent studies deploy strong reward models for enhancing LVLM reasoning, they are grounded by specific domains or simpler tasks ("}],["$","$1","44",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:44"}]}],["$","$1","45",{"children":", "}],["$","$1","46",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:46"}]}],["$","$1","47",{"children":"; "}],["$","$1","48",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:48"}]}],["$","$1","49",{"children":", "}],["$","$1","50",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:50"}]}],["$","$1","51",{"children":"). 
GRPO-style methods use relative ranking within output batches to guide optimization without value critics ("}],["$","$1","52",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:52"}]}],["$","$1","53",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:53"}]}],["$","$1","54",{"children":", "}],["$","$1","55",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:55"}]}],["$","$1","56",{"children":"; "}],["$","$1","57",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:57"}]}],["$","$1","58",{"children":", "}],["$","$1","59",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:5:paragraphs:1:59"}]}],["$","$1","60",{"children":"). Our Mixed Reward objective combines model-based and rule-based rewards in four complex rewarding scenarios, yielding better performance than existing approaches."}]]}]]}],["$","$L30","6",{"heading":"6 Conclusion","index":6,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"This work provides a comparative analysis of the effectiveness of leveraging SFT or RL (more specifically, GRPO) to build LVLMs with strong reasoning ability. We show through extensive experiments that distilling reasoning data and performing SFT is a deficient way to transfer reasoning ability across modalities. We then extend our dataset to GRPO training with a proposed mixed reward objective, which yields substantial improvement over the baseline models. 
We present several findings regarding combining SFT and GRPO and the correlation between reward, response length, and final performance. These results indicate that reasoning is a native emerging ability acquired from RL, rather than SFT, which merely equips the model with ‘pseudo-reasoning’ ability."}]]}]]}],["$","$L30","7",{"heading":"Acknowledgement","index":7,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs."}]]}]]}],["$","$L30","8",{"heading":"References","index":8,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-25","style":"$undefined","children":"Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L Leavitt, "}]}],["$","$1","1",{"children":"and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:0:2:style","children":"arXiv preprint arXiv:2405.20541"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-22","style":"$undefined","children":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla "}]}],["$","$1","1",{"children":"Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:1:2:style","children":"NeurIPS"}]}],["$","$1","3",{"children":", 2020."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-16","style":"$undefined","children":"Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi "}]}],["$","$1","1",{"children":"Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for lite vision-language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:2:2:style","children":"arXiv preprint arXiv:2402.11684"}]}],["$","$1","3",{"children":", 2024a."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-3","style":"$undefined","children":"Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super gen- "}]}],["$","$1","1",{"children":"eralization ability in vision-language models with less than $3. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:3:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:3:3"}]}],["$","$1","4",{"children":", 2025a. Accessed: 2025-02-02."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-63","style":"$undefined","children":"Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng "}]}],["$","$1","1",{"children":"Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: Disentangled reward mitigates hacking in rlhf. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:4:2:style","children":"arXiv preprint arXiv:2402.07319"}]}],["$","$1","3",{"children":", 2024b."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-48","style":"$undefined","children":"Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, "}]}],["$","$1","1",{"children":"Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An empirical study on eliciting and improving r1-like reasoning models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:5:2:style","children":"arXiv preprint arXiv:2503.04548"}]}],["$","$1","3",{"children":", 2025b."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-23","style":"$undefined","children":"Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, "}]}],["$","$1","1",{"children":"Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:6:2:style","children":"arXiv preprint arXiv:2501.17161"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-66","style":"$undefined","children":"Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, and Yu Kang. Boosting the "}]}],["$","$1","1",{"children":"generalization and reasoning of vision language models with curriculum reinforcement learning. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:7:2:style","children":"arXiv preprint arXiv:2503.07065"}]}],["$","$1","3",{"children":", 2025a."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-4","style":"$undefined","children":"Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. Open- "}]}],["$","$1","1",{"children":"vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:8:2:style","children":"arXiv preprint arXiv:2503.17352"}]}],["$","$1","3",{"children":", 2025b."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-27","style":"$undefined","children":"Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi "}]}],["$","$1","1",{"children":"Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:9:2:style","children":"Proceedings of the 32nd ACM International Conference on Multimedia"}]}],["$","$1","3",{"children":", pp. 11198–11201, 2024."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-62","style":"$undefined","children":"Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvi- "}]}],["$","$1","1",{"children":"jotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? reward model ensembles mitigate but do not eliminate reward hacking. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:10:2:style","children":"arXiv preprint arXiv:2312.09244"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-40","style":"$undefined","children":"Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, "}]}],["$","$1","1",{"children":"Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024. URL "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:11:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:11:3"}]}],["$","$1","4",{"children":"."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-64","style":"$undefined","children":"Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. Reward "}]}],["$","$1","1",{"children":"shaping to mitigate reward hacking in rlhf. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:12:2:style","children":"arXiv preprint arXiv:2502.18770"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-18","style":"$undefined","children":"Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing "}]}],["$","$1","1",{"children":"Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:13:2:style","children":"arXiv preprint arXiv:2312.11370"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-61","style":"$undefined","children":"Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju "}]}],["$","$1","1",{"children":"Wang, and Yi Wu. On designing effective rl reward at training time for llm reasoning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:14:2:style","children":"arXiv preprint arXiv:2410.15115"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-0","style":"$undefined","children":"Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, "}]}],["$","$1","1",{"children":"Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:15:2:style","children":"arXiv preprint arXiv:2501.12948"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-15","style":"$undefined","children":"Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo "}]}],["$","$1","1",{"children":"Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:16:2:style","children":"Proceedings of the IEEE conference on computer vision and pattern recognition"}]}],["$","$1","3",{"children":", pp. 
3608–3617, 2018."}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-78","style":"$undefined","children":"Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. "}]}],["$","$1","1",{"children":"Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:17:2:style","children":"arXiv preprint arXiv:2405.11143"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-58","style":"$undefined","children":"Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and "}]}],["$","$1","1",{"children":"Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:18:2:style","children":"arXiv preprint arXiv:2503.06749"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-1","style":"$undefined","children":"Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, "}]}],["$","$1","1",{"children":"Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:19:2:style","children":"arXiv preprint arXiv:2412.16720"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-38","style":"$undefined","children":"Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, Tianming Liu, and Wei Zhang. "}]}],["$","$1","1",{"children":"Unveiling the mathematical reasoning in deepseek models: A comparative study of large language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:20:2:style","children":"arXiv preprint arXiv:2503.10573"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-21","style":"$undefined","children":"Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, "}]}],["$","$1","1",{"children":"and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. "}],["$","$1","2",{"children":"In "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:21:3:style","children":"Proceedings of the IEEE conference on computer vision and pattern recognition"}]}],["$","$1","4",{"children":", pp. 2901–2910, 2017."}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-37","style":"$undefined","children":"Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, "}]}],["$","$1","1",{"children":"Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:22:2:style","children":"arXiv preprint arXiv:2310.06452"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-59","style":"$undefined","children":"Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with "}]}],["$","$1","1",{"children":"language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:23:2:style","children":"arXiv preprint arXiv:2303.00001"}]}],["$","$1","3",{"children":", 2023."}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-13","style":"$undefined","children":"Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. "}]}],["$","$1","1",{"children":"Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:24:2:style","children":"arXiv preprint arXiv:2403.00231"}]}],["$","$1","3",{"children":", 2024a."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-26","style":"$undefined","children":"Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, "}]}],["$","$1","1",{"children":"and Tianyi Zhou. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:25:2:style","children":"arXiv preprint arXiv:2402.00530"}]}],["$","$1","3",{"children":", 2024b."}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-11","style":"$undefined","children":"Adam Dahlgren Lindstr"}]}],["$","$1","1",{"children":"¨om and Savitha Sam Abraham. Clevr-math: A dataset for compositional language, visual and mathematical reasoning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:26:2:style","children":"arXiv preprint arXiv:2208.05358"}]}],["$","$1","3",{"children":", 2022."}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-65","style":"$undefined","children":"Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua "}]}],["$","$1","1",{"children":"Lin, and Jiaqi Wang. "}],["$","$1","2",{"children":"Visual-rft: Visual reinforcement fine-tuning. "}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:27:3:style","children":"arXiv preprint arXiv:2503.01785"}]}],["$","$1","4",{"children":", 2025."}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-28","style":"$undefined","children":"Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao "}]}],["$","$1","1",{"children":"Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. 
In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:28:2:style","children":"International Conference on Learning Representations (ICLR)"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-14","style":"$undefined","children":"Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa "}]}],["$","$1","1",{"children":"on document images. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:29:2:style","children":"Proceedings of the IEEE/CVF winter conference on applications of computer vision"}]}],["$","$1","3",{"children":", pp. 2200–2209, 2021."}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-42","style":"$undefined","children":"Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, and "}]}],["$","$1","1",{"children":"Xueliang Zhang. Quality-driven curation of remote sensing vision-language data via learned scoring models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:30:2:style","children":"arXiv preprint arXiv:2503.00743"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-2","style":"$undefined","children":"Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai "}]}],["$","$1","1",{"children":"Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:31:2:style","children":"arXiv preprint arXiv:2503.07536"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-32","style":"$undefined","children":"Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma "}]}],["$","$1","1",{"children":"GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. "}],["$","$1","2",{"children":"We-math: Does your large multimodal model achieve human-like mathematical reasoning? 
"}],["$","$1","3",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:32:3:style","children":"arXiv preprint arXiv:2407.01284"}]}],["$","$1","4",{"children":", 2024."}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-39","style":"$undefined","children":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal "}]}],["$","$1","1",{"children":"policy optimization algorithms. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:33:2:style","children":"arXiv preprint arXiv:1707.06347"}]}],["$","$1","3",{"children":", 2017."}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-7","style":"$undefined","children":"Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, "}]}],["$","$1","1",{"children":"Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:34:2:style","children":"arXiv preprint arXiv:2402.03300"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-5","style":"$undefined","children":"Haozhan Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng "}]}],["$","$1","1",{"children":"Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:35:2"}]}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:35:3"}]}],["$","$1","4",{"children":", 2025. Accessed: 2025-02-15."}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-67","style":"$undefined","children":"Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, and Cihang Xie. Vilbench: A "}]}],["$","$1","1",{"children":"suite for vision-language process reward modeling. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:36:2:style","children":"arXiv preprint arXiv:2503.20271"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-60","style":"$undefined","children":"Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, "}]}],["$","$1","1",{"children":"Senjie Jin, Enyu Zhou, Chenyu Shi, et al. Secrets of rlhf in large language models part ii: Reward modeling. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:37:2:style","children":"arXiv preprint arXiv:2401.06080"}]}],["$","$1","3",{"children":", 2024a."}]]}],["$","$La","38",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-29","style":"$undefined","children":"Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, "}]}],["$","$1","1",{"children":"and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:38:2:style","children":"The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track"}]}],["$","$1","3",{"children":", 2024b. 
URL "}],["$","$1","4",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:38:4"}]}],["$","$1","5",{"children":"."}]]}],["$","$La","39",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-57","style":"$undefined","children":"Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V "}]}],["$","$1","1",{"children":"Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:39:2:style","children":"NeurIPS"}]}],["$","$1","3",{"children":", 2022."}]]}],["$","$La","40",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-33","style":"$undefined","children":"Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical "}]}],["$","$1","1",{"children":"reasoning benchmark in visual contexts. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:40:2:style","children":"arXiv preprint arXiv:2407.04973"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","41",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-17","style":"$undefined","children":"Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision "}]}],["$","$1","1",{"children":"language models reason step-by-step, 2024. URL "}],["$","$1","2",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:41:2"}]}],["$","$1","3",{"children":"."}]]}],["$","$La","42",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-36","style":"$undefined","children":"Haoyan Yang, Ting Hua, Shangqian Gao, Binfeng Xu, Zheng Tang, Jie Xu, Hongxia Jin, and "}]}],["$","$1","1",{"children":"Vijay Srinivasan. Dynamic noise preference optimization for llm self-improvement via synthetic data. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:42:2:style","children":"arXiv preprint arXiv:2502.05400"}]}],["$","$1","3",{"children":", 2025a."}]]}],["$","$La","43",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-6","style":"$undefined","children":"Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, "}]}],["$","$1","1",{"children":"Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:43:2:style","children":"arXiv preprint arXiv:2503.10615"}]}],["$","$1","3",{"children":", 2025b."}]]}],["$","$La","44",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-8","style":"$undefined","children":"Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi "}]}],["$","$1","1",{"children":"Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, et al. Internlm-xcomposer2. 5-reward: A simple yet effective multi-modal reward model. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:44:2:style","children":"arXiv preprint arXiv:2501.12368"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","45",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-43","style":"$undefined","children":"Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. "}]}],["$","$1","1",{"children":"Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:45:2:style","children":"arXiv preprint arXiv:2503.18892"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","46",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-30","style":"$undefined","children":"Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun "}]}],["$","$1","1",{"children":"Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:46:2:style","children":"European Conference on Computer Vision"}]}],["$","$1","3",{"children":", pp. 169–186. 
Springer, 2024."}]]}],["$","$La","47",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-47","style":"$undefined","children":"Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui "}]}],["$","$1","1",{"children":"Hsieh. R1-zero’s “aha moment” in visual reasoning on a 2b non-sft model. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:47:2:style","children":"arXiv preprint arXiv:2503.05132"}]}],["$","$1","3",{"children":", 2025."}]]}],["$","$La","48",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-12","style":"$undefined","children":"Wenwen Zhuang, Xin Huang, Xiantao Zhang, and Jin Zeng. Math-puma: Progressive "}]}],["$","$1","1",{"children":"upward multimodal alignment to enhance mathematical reasoning. 
"}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:48:2:style","children":"arXiv preprint arXiv:2408.08640"}]}],["$","$1","3",{"children":", 2024."}]]}],["$","$La","49",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-31","style":"$undefined","children":"Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: "}]}],["$","$1","1",{"children":"A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. "}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:8:paragraphs:49:2:style","children":"arXiv preprint arXiv:2411.00836"}]}],["$","$1","3",{"children":", 2024."}]]}]]}],["$","$L30","9",{"heading":"A Data Generation","index":9,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-20","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:0:style","children":"A.1 
"}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:0:1:style","children":"Prompt"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We show the prompts for captioning (Figure "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:1"}]}],["$","$1","2",{"children":"), R1 answer distillation (Figure "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:3"}]}],["$","$1","4",{"children":"), rewriting "}],["$","$1","5",{"children":["$","span",null,{"tabIndex":-1,"id":"id-68","style":"$undefined","children":"(Figure "}]}],["$","$1","6",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:6"}]}],["$","$1","7",{"children":") and verification (Figure "}],["$","$1","8",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:1:8"}]}],["$","$1","9",{"children":")."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/16-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:2:0"}]]}]}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 8: Prompt for captioning with GPT-4-Turbo."}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/16-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:4:0"}]]}]}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 9: Prompt for distillation with 
Deepseek-R1."}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-24","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:6:0:style","children":"A.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:6:1:style","children":"Aha-Moment Filtering"}]}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We use the following list of keywords to identify aha moments: "}],["$","$1","1",{"children":"wait"}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":"again"}],["$","$1","4",{"children":", "}],["$","$1","5",{"children":"double-check"}],["$","$1","6",{"children":", "}],["$","$1","7",{"children":"hmm"}],["$","$1","8",{"children":", "}],["$","$1","9",{"children":"mistake"}],["$","$1","10",{"children":", "}],["$","$1","11",{"children":"alternatively"}],["$","$1","12",{"children":", "}],["$","$1","13",{"children":"check"}],["$","$1","14",{"children":", "}],["$","$1","15",{"children":"i should confirm"}],["$","$1","16",{"children":". 
All answers are matched with the logic: "}],["$","$1","17",{"children":"has_aha = any([aha in text.lower() for aha in ahas])"}],["$","$1","18",{"children":"."}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:8:0:style","children":"A.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:8:1:style","children":"Sample Demonstration for "}]}],["$","$1","2",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:8:2:style","children":"VLAA-Thinking-SFT-126K"}]}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We show several examples from "}],["$","$1","1",{"children":"VLAA-Thinking-SFT-126K "}],["$","$1","2",{"children":"in Figure "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:3"}]}],["$","$1","4",{"children":", Figure "}],["$","$1","5",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:5"}]}],["$","$1","6",{"children":", Figure "}],["$","$1","7",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:7"}]}],["$","$1","8",{"children":", Figure 
"}],["$","$1","9",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:9"}]}],["$","$1","10",{"children":"and Figure "}],["$","$1","11",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:9:11"}]}],["$","$1","12",{"children":"."}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/17-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:10:0"}]]}]}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 10: Prompt for answer rewriting with GPT-4-Turbo."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/17-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:9:paragraphs:12:0"}]]}]}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 11: Prompt for verification with GPT-3.5-Turbo."}]]}]]}],["$","$L30","10",{"heading":"B Details of SFT Experiments","index":10,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:0:0:style","children":"B.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:0:1:style","children":"Training"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"To enhance the instruction following ability, we append task-specific instructions ("}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:1:1:style","children":"i.e"}]}],["$","$1","2",{"children":"., MCQ, short answer) to questions. 
The system prompt shown in Figure "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:1:3"}]}],["$","$1","4",{"children":"is used. We use a global batch size of 128. Models are trained for 190 steps on 25K samples and 985 steps on 126K samples. All experiments are run on 8×H100 GPUs."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Interestingly, we observe loss spikes during 25K SFT training on Qwen2-VL-7B, which cause model collapse. We therefore rerun this setting until we obtain a normal loss curve, and use that checkpoint for evaluation."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/17-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:3:0"}]]}]}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:4:0:style","children":"a question, and you 
should try to solve it. You should first think about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer"}]}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:5:0:style","children":"are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>."}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 12: System Prompt used for training and evaluation."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-79","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:7:0:style","children":"B.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:7:1:style","children":"Evaluation"}]}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We adopt VLMEvalKit 
("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:8:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:10:paragraphs:8:3"}]}],["$","$1","4",{"children":") for all evaluation experiments. "}],["$","$1","5",{"children":"We set "}],["$","$1","6",{"children":"use_custom_prompt "}],["$","$1","7",{"children":"to "}],["$","$1","8",{"children":"False "}],["$","$1","9",{"children":"following the settings of most models in the toolkit. For higher efficiency, we set "}],["$","$1","10",{"children":"max_pixels "}],["$","$1","11",{"children":"to 256*32*32, and "}],["$","$1","12",{"children":"max_new_tokens "}],["$","$1","13",{"children":"to 800. We also set the system prompt to the one used for training, for consistent behavior between training and testing. All other hyperparameters keep the toolkit defaults."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We report the following dataset splits and metrics:"}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"1. MathVista: The Test Mini split of the MathVista dataset; overall accuracy."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"2. 
MathVision: The Full test set of MathVision; overall accuracy."}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"3. MathVerse: The Test Mini split of MathVerse; accuracy of “Vision Only”."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"4. DynaMath: The Full test set of DynaMath; overall accuracy."}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"5. WeMath: The Test Mini split of WeMath; “Score (Strict)”."}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"6. 
LogicVista: The Full test set of LogicVista; overall accuracy."}]]}]]}],["$","$L30","11",{"heading":"C Details of GRPO Experiments","index":11,"length":12,"content":[["$","$La","0",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-50","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:0:style","children":"C.1 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:0:1:style","children":"Training"}]}]]}],["$","$La","1",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We adapt our code from the OpenRLHF framework ("}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:1:1"}]}],["$","$1","2",{"children":", "}],["$","$1","3",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:1:3"}]}],["$","$1","4",{"children":"). To deploy a reward model on the same machine, we offload the reward model to CPU and move it to GPU only when performing rollouts and scoring. 
This design saves valuable GPU memory, which accelerates the training process."}]]}],["$","$La","2",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We also inspect each dataset and find issues in several of them. For example, although ArxivQA contains only MCQs, its answer formats include “A”, “A)”, “(a)”, "}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:2:1:style","children":"etc"}]}],["$","$1","2",{"children":". In the synthesis subset of Math PUMA, some solutions contain only the values of the solved unknowns even though the questions ask for the entire function expression. We fix these issues with rule-based filtering and GPT-assisted rewriting to improve the quality of the VL-Thinking dataset."}]]}],["$","$La","3",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:3:0:style","children":"C.2 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:3:1:style","children":"Evaluation"}]}]]}],["$","$La","4",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We evaluate our models with an 
identical setting described in Appendix "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:4:1"}]}],["$","$1","2",{"children":"."}]]}],["$","$La","5",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"id-56","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:5:0:style","children":"C.3 "}]}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:5:1:style","children":"Case Study"}]}]]}],["$","$La","6",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"We present a case demonstrating the improvement of VLAA-Thinker-Qwen2.5VL-7B over its backbone in Figure "}],["$","$1","1",{"children":["$","$L31",null,{"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:6:1"}]}],["$","$1","2",{"children":"."}]]}],["$","$La","7",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/19-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:7:0"}]]}]}]]}],["$","$La","8",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 13: A case from MathVerse testmini (sample index 20). Markdown is rendered for illustration purposes. Wrong reasoning paths are colored red."}]]}],["$","$La","9",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/20-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:9:0"}]]}]}]]}],["$","$La","10",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 14: A VL-Thinking sample from 
GeoQA170K."}]]}],["$","$La","11",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/21-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:11:0"}]]}]}]]}],["$","$La","12",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 15: A VL-Thinking sample from Math PUMA (subset Synthesis)."}]]}],["$","$La","13",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/21-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:13:0"}]]}]}]]}],["$","$La","14",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:14:0"}]]}]}]]}],["$","$La","15",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:15:0:style","children":" "}]}],["$","$1","1",{"children":"Okay, let's see. The user is asking to subtract all balls and determine how many objects are left. First, I need to figure out which objects are considered \"balls\" in the image. The image shows a large blue metal sphere. A sphere is a three-dimensional circle, which is essentially a ball shape. 
So the blue metal sphere is definitely a ball."}]]}],["$","$La","16",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:16:0"}]]}]}]]}],["$","$La","17",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Next, the surrounding objects are two small red metal cylinders and a single small brown metal cylinder in the foreground. Cylinders are not balls; they have a different shape—circular bases and straight sides. So"}]]}],["$","$La","18",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"those would not be subtracted. 
The question says to subtract all balls, so only the sphere (ball) needs to be removed."}]]}],["$","$La","19",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:19:0"}]]}]}]]}],["$","$La","20",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Originally, there's the sphere (1 object), two red cylinders (2 objects), and one brown cylinder (1 object). Total objects before subtraction: 1 + 2 + 1 = 4. 
Subtracting the one ball leaves 3 objects: the two red cylinders and the brown cylinder."}]]}],["$","$La","21",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:21:0"}]]}]}]]}],["$","$La","22",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Wait, but I should make sure there are no other balls mentioned. The image only refers to the sphere as a ball-like object. The cylinders are clearly not balls. So subtracting the sphere leaves three cylinders. 
The answer should be 3."}]]}],["$","$La","23",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-4.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:23:0"}]]}]}]]}],["$","$La","24",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"In the scene, there is one large blue sphere (a ball-shaped object) centrally positioned. The remaining objects consist of two small red cylinders flanking it symmetrically and a single brown cylinder in the foreground. 
Removing the sphere leaves only the three cylindrical objects visible."}]]}],["$","$La","25",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/22-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:25:0"}]]}]}]]}],["$","$La","26",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 16: A VL-Thinking sample from CLEVR-Math."}]]}],["$","$La","27",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:27:0"}]]}]}]]}],["$","$La","28",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"intensity for the 1030 nm wavelength? Choose from the following options A) The ratio increases as the repetition rate increases. B) The ratio decreases as the repetition rate increases. C) The ratio remains constant as the repetition rate increases. D) The ratio fluctuates unpredictably with changes in repetition rate."}]]}],["$","$La","29",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:29:0"}]]}]}]]}],["$","$La","30",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:30:0:style","children":" "}]}],["$","$1","1",{"children":"Okay, so I'm looking at figure (b) here, and I need to figure out how the repetition rate affects the ratio of 9th to 11th harmonic intensity for the 1030 nm wavelength. Let me start by understanding what's in the image."}]]}],["$","$La","31",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:31:0"}]]}]}]]}],["$","$La","32",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"First, the image is a 
scatter plot with the x-axis being the repetition rate in kHz and the y-axis being the ratio of 9th to 7th and 11th harmonic intensity on a logarithmic scale. There are two sets of data points: red dots"}]]}],["$","$La","33",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"and blue squares. The red dots are probably representing the 1030 nm wavelength since in panel (a), the red curve was labeled as 1030 nm. The blue squares would then correspond to the 515 nm wavelength."}]]}],["$","$La","34",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:34:0"}]]}]}]]}],["$","$La","35",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The question specifically asks about the 1030 nm wavelength, so I'll focus on the red dots. 
Looking at the scatter plot, as the repetition rate increases from left to right on the x-axis, I need to see what happens to the y-axis values for the red dots."}]]}],["$","$La","36",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-4.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:36:0"}]]}]}]]}],["$","$La","37",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"From what I can see, as the repetition rate goes up, the red dots trend downward. That means the ratio of 9th to 11th harmonic intensity is decreasing. There's no indication of an increase or a constant ratio; instead, it's a clear downward trend. 
The red dots don't seem to fluctuate unpredictably either; they follow a decreasing"}]]}],["$","$La","38",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:38:0"}]]}]}]]}],["$","$La","39",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"So, putting it all together, the ratio decreases as the repetition rate increases for the 1030 nm wavelength. That would correspond to option B. 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:39:1:style","children":""}]}]]}],["$","$La","40",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-6.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:40:0"}]]}]}]]}],["$","$La","41",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"The image shows that the ratio of 9th to 11th harmonic intensity for the 1030 nm wavelength decreases as the repetition rate increases."}]]}],["$","$La","42",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& 
img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/23-7.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:42:0"}]]}]}]]}],["$","$La","43",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 17: A VL-Thinking sample from ArxivQA."}]]}],["$","$La","44",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-0.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:44:0"}]]}]}]]}],["$","$La","45",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"central given its prominent placement and the fact that it has a dedicated Facebook page marker. It appears to be situated near major roads and commercial entities like KFC, Harvey Norman, and Dan Murphy's, which indicates it is likely a well-known establishment, possibly a retail store or a social venue, that serves as a local landmark or gathering place. 
The specific type of store cannot be determined from the map alone, but its central location among various retailers and close to a park suggests it may cater to a wide range of customers and possibly holds a key place in the local community."}]]}],["$","$La","46",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-1.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:46:0"}]]}]}]]}],["$","$La","47",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:47:0:style","children":" "}]}],["$","$1","1",{"children":"Okay, let's see. The user is asking about the significance of Preston Store in the local area based on its proximity to other landmarks. First, I need to visualize the map shown. 
The Preston Store is labeled prominently with a red marker and a Facebook 'f' logo, so it's probably a key location."}]]}],["$","$La","48",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-2.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:48:0"}]]}]}]]}],["$","$La","49",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Looking at the nearby landmarks, there's KFC, Harvey Norman, Dan Murphy's, and Max Kirwan Mazda. Those are all big-name retailers and services. Then there's Target, Myer Northland, and Anaconda Preston. These are major stores, which suggests the area is a commercial hub. 
The store being near these places"}]]}],["$","$La","50",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-3.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:50:0"}]]}]}]]}],["$","$La","51",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Also, there's TW Blake Park and Darebin Creek Forest Park nearby. Green areas mean that Preston Store might be a place people visit before or after outdoor activities. Maybe it's a convenience store or a community spot where people gather. The mix of retail, dining, and parks indicates it's a multifunctional area, so Preston Store could serve both daily shopping needs and act as a meeting point. The Facebook marker suggests they have an online presence, maybe promoting local events or services, enhancing community engagement. Since it's near major roads like Bell St and Murray Rd, accessibility is good, increasing its importance as a central landmark. 
"}],["$","$1","1",{"children":["$","span",null,{"tabIndex":-1,"id":"$undefined","style":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:51:1:style","children":""}]}]]}],["$","$La","52",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-4.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:52:0"}]]}]}]]}],["$","$La","53",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":"Preston Store appears centrally positioned within a bustling commercial and recreational hub, making it a likely focal point for local activity. Its prominent placement near major retailers like Target, Myer Northland, and Harvey Norman suggests it could serve as a practical anchor for shoppers—perhaps offering essentials or services that complement these larger stores. The proximity to fast food (KFC) and leisure destinations (Darebin Creek Forest Park, TW Blake Park) positions it as a potential pitstop for families or individuals transitioning between errands, dining, and outdoor activities. The red Facebook marker implies community engagement, possibly acting as an information node or local business with social media visibility. 
Its accessibility via intersecting roads like Bell St and Murray Rd further underscores its role as a convenient landmark in the area's daily flow."}]]}],["$","$La","54",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":{"compact":"left","expanded":"justify"},"typography":{"compact":"paperBody2","expanded":"paperBody1"}},"children":[["$","$1","0",{"children":["$","$Lb",null,{"component":"span","sx":{"verticalAlign":"middle","px":"$undefined","& img":{"imageRendering":"-webkit-optimize-contrast"}},"children":[["$","$L32",null,{"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2504.11468/images/24-5.png","alt":"$undefined"}],["$","$L33",null,{"imgScale":4,"avgLineHeight":10.99,"fragment":"$c:props:children:props:children:3:1:props:paperJSON:sections:11:paragraphs:54:0"}]]}]}]]}],["$","$La","55",{"component":"p","variant":"paperBody1","overflow":"hidden","sx":{"pb":2,"textWrap":"pretty","textAlign":"center","color":"var(--secondary-color)","typography":"paperBody2"},"children":[["$","$1","0",{"children":"Figure 18: A VL-Thinking sample from ALLaVA-LAION."}]]}]]}]],["$","$L34",null,{"paper":"$c:props:children:props:children:0:props:product"}]]