VicTR: Video-conditioned Text Representations for Activity Recognition