Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data