FILS: Self-Supervised Video Feature Prediction In Semantic Language Space