Elo Uncovered: Robustness and Best Practices in Language Model Evaluation