Vision-Language Models Create Cross-Modal Task Representations