ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts