Distilling Vision-Language Models on Millions of Videos