LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding