Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

