Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training