A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions