Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

