First Attentions Last: Better Exploiting First Attentions for Efficient Parallel Training