Understanding Differential Transformer Unchains Pretrained Self-Attentions