Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances