SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning | Read Paper on Bytez
