Aligning Audio-Visual Joint Representations with an Agentic Workflow