Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling