Efficient Vision-Language Pre-training by Cluster Masking