With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You