Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding