From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Read Paper on Bytez