An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling