J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization