SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning