Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention | Read Paper on Bytez