Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training