Learning to Reason under Off-Policy Guidance