Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action