Global Convergence in Training Large-Scale Transformers