Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances
2018·Arxiv