OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction