Cross-Modal Fine-Tuning: Align then Refine