Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding