MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning