ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks