An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching

