Embodied Scene Understanding for Vision Language Models via MetaVQA