To have AI systems navigate and carry out instructions in a visual world, an agent needs to extract a semantically meaningful representation of natural language by mapping it to visual elements in the environment. We simulate the problem of Task Oriented Grounding by training an agent to take natural language instructions and learn to navigate the virtual environment introduced by [7]. Consider the scenario depicted in Fig. 1, which shows the egocentric view that the agent sees at some time step.
Figure 1. Overview of the problem.
The agent receives a natural language instruction at the beginning of every episode and pixel-level visual information at every time step, based on which it must carry out the navigational task specified by the instruction. To carry out the task with high accuracy, the agent has to draw semantic correspondences between the visual and textual modalities in order to learn a policy. This problem poses several challenges: 1) the agent must be able to recognize the objects indicated by the instruction; 2) it needs some notion of memory of previous observations in order to explore the environment, since the object concerned may not be in the field of view; 3) it has to ground each concept of the instruction in the environment and reason about the semantics, e.g. instructions containing a superlative such as 'Go to the tallest torch'; and 4) it has to learn a policy so that it can navigate to the correct object while avoiding the incorrect ones. The main contribution of this work is a state processing module that uses a Dynamic Attention architecture for multi-modal fusion to generate an informative and robust definition of state for the policy learning module. To demonstrate the significance of our Dynamic Attention Network we use the Gated Attention model of [7] as the baseline. For a fair comparison we adopt the same overall architecture and environment settings as the baseline.
The main contributions of this paper are as follows:
• We propose a novel Dynamic Attention Network for generating attention to improve response in Task Oriented Language Grounding.
• We propose 1D convolution in place of the Hadamard product as a method for multi-modal fusion, to achieve faster convergence of the network.
• We present experimental results showing the effect of Dynamic Attention on accuracy and convergence rate, comparing several similar architectures that differ subtly but produce significant changes in overall performance.
• We show visualizations of the attention masks generated by the Dynamic Attention Network (DAN) to demonstrate its robustness. We compare Zero-Shot (ZS) and Multi-Task (MT) generalization accuracy with the baseline for three modes of task difficulty.
The task of grounding is well studied in computer vision and natural language processing. Early work on image description [3, 14, 23] grounds natural language concepts in images, and [43, 45, 21, 47, 13, 9, 20, 48] generate descriptive sentences from images with the help of deep networks. Generating engaging questions about an image is known as Visual Question Generation (VQG) [33, 19, 38], and generating a similar question from a given question is known as paraphrase question generation [39]. A variety of methods [29, 26, 1, 41, 28, 34] ground natural language questions in images to solve the Visual Question Answering (VQA) task, including attention-based methods [50, 15, 16, 46, 27, 42, 36, 40]. There have also been many works on Visual Dialog, which grounds a sequence of questions and answers in an image [10, 11, 12, 44, 37].
Grounding of natural language instructions has been studied in video; for example, [6] and [25] look at grounding concepts through human-robot interaction. [17], [5] and [4] aim to ground navigational instructions, with a focus on grounding verbs such as follow, go, move and pick up. [8] learn a navigational policy in a 2D maze-like environment by using a semantic parser. [2] and [31] ground natural language instructions by mapping them to action sequences. [30] map navigational instructions to action sequences by representing the state using bag-of-words features. [49] train a model to navigate a 2D maze environment. [35] study zero-shot generalization in a 3D environment. In a similar line of work, [31] address joint reasoning over linguistic and visual inputs for the task of moving blocks in a 2D environment. They use the raw image from the 2D grid, processed by a Convolutional Neural Network (CNN) [24], and an instruction representation obtained through an LSTM [18], which are then combined through concatenation.
[7] propose a Gated-Attention architecture for Task Oriented Language Grounding and evaluate their approach on an environment built on VizDoom [22]. We use their work as the baseline for comparison. They generate an attention vector as a function of the instruction embedding and fuse it with the image representation through a Hadamard product. The problem with this approach is that the attention vector remains static throughout an episode, since it is conditioned on the instruction alone. It also does not leverage the continuity of the observation space of the 3D Doom scenario. In this paper we propose a novel method for multi-modal fusion in the form of Dynamic Attention and show its effectiveness over the existing benchmarks. Since most robotic tasks take place in the real world, where the observation space is continuous, our model can serve as a general framework for robotics applications.
We conduct our experiments on the environment introduced by [7]. The environment is built on top of the VizDoom API [22], based on Doom, a classic first-person shooter game. The game generates the first-person view of the agent at every time step, and the agent interacts with the environment by choosing one of three actions: Turn Right, Turn Left, Move Forward. Each episode starts in a confined room where the agent and various Doom objects are spawned at random locations, and an instruction of the form "Go to the tall green pillar" is chosen at random from a corpus. The objects have various visual attributes such as color, shape and size. Associated with every instruction is a set of correct objects. Each time an instruction is selected, the environment generates a random combination of one correct and 4 incorrect objects, and they are spawned at locations on the map that depend on the difficulty level. The levels are Easy: the agent is spawned at a fixed location and the objects are spawned in a straight line in the agent's field of view; Medium: the objects are spawned at random locations but are still in the agent's field of view; Hard: the objects and the agent are spawned randomly and the agent can have any initial orientation, so the agent might need to explore the environment to see all the objects. The objective of the agent is to navigate to the correct object while avoiding the incorrect ones.
Figure 2. Overall architecture.
Our method consists of four modules, as illustrated in Fig. 2:
1. We obtain embeddings for the inputs, the current frame image and the instruction, using a CNN and a Gated Recurrent Unit (GRU) network respectively. We term this the Representation module.
2. We obtain an attention vector conditioned on the current frame image and the instruction embedding through the Dynamic Attention Module.
3. We apply the generated attention vector to the current frame through 1D convolution to generate a state vector.
4. Finally, our policy module takes the state as input and outputs a probability distribution over the action space and a scalar value.
4.1. Image Processing Module:
It consists of a 3-layer Convolutional Neural Network [24] that generates a feature representation of the image. The representation has size d × H × W, where d denotes the number of feature maps of the CNN output and H × W is the size of each feature map.
4.2. Instruction Processing Module:
It consists of a Gated Recurrent Unit (GRU) that encodes the instruction as a vector of dimension l, the dimension of the language encoding.
4.3. Dynamic Attention Module
We introduce a Dynamic Attention Network to generate an attention vector over the current image frame. It takes as input the current frame image "attended" with the attention vector of the previous time step, concatenated with the instruction encoding, and produces the attention vector for the next time step as output. We model the Dynamic Attention as the cell-state of an LSTM.
The attention is computed by the gate equations of the LSTM, whose weight and bias parameters are learned. Fig. 3 shows our Dynamic Attention Module unrolled over time. At every time step, the attended image (obtained by applying the attention vector generated at the previous time step) and the instruction representation are concatenated and fed to an LSTM cell as input. The updated cell-state is the attention vector for the next time step.
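For reference, the recurrence described above can be written with the standard LSTM update. The symbols are assumptions of this sketch and not necessarily the original notation: $\alpha_t$ is the attention vector applied at time step $t$, $v_t$ the attended image features, $e$ the instruction encoding, and $W_\ast$, $b_\ast$ the gate weights and biases.

```latex
\begin{aligned}
x_t &= [\, v_t \,;\, e \,] \\
i_t &= \sigma\!\left(W_i [x_t ; h_{t-1}] + b_i\right) \\
f_t &= \sigma\!\left(W_f [x_t ; h_{t-1}] + b_f\right) \\
o_t &= \sigma\!\left(W_o [x_t ; h_{t-1}] + b_o\right) \\
\tilde{c}_t &= \tanh\!\left(W_c [x_t ; h_{t-1}] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
\alpha_{t+1} &= c_t
\end{aligned}
```

Under this reading, the forget gate $f_t$ controls how much of the previous attention is retained and the input gate $i_t$ controls how much new information enters it.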
4.4. State Generation Unit
We generate a state for the policy learner by performing a 1D convolution of the attention vector over the feature maps of the current frame, as shown in Fig. 4. This is in contrast with [7], who use a Hadamard product to apply the attention vector to the feature maps. We achieve a faster convergence rate owing to the reduced number of network parameters.
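The difference between the two fusion schemes can be illustrated with a minimal sketch. It assumes that the "1D convolution" acts as a 1x1 convolution across channels with the attention vector as the kernel, which is one reading consistent with the 1 x 8 x 17 output size reported in the implementation setup. The function names are hypothetical and the toy tensors are far smaller than the real feature maps.

```python
# Sketch only: reading "1D convolution" as a 1x1 convolution across channels
# (attention vector used as the kernel) is an assumption, not the confirmed method.

def hadamard_fusion(features, attention):
    """Gated-attention style fusion: scale each feature map by its attention weight.

    features: list of d maps, each H x W (nested lists); attention: list of d floats.
    The output keeps all d maps, so its shape is d x H x W.
    """
    return [[[a * v for v in row] for row in fmap]
            for a, fmap in zip(attention, features)]


def conv1d_fusion(features, attention):
    """Attention-weighted sum over the channel axis at every spatial location.

    The d maps collapse into a single H x W map.
    """
    height, width = len(features[0]), len(features[0][0])
    return [[sum(a * fmap[i][j] for a, fmap in zip(attention, features))
             for j in range(width)]
            for i in range(height)]


def flatten(fmap):
    """Flatten an H x W map into a vector of length H * W."""
    return [v for row in fmap for v in row]


# Toy example with d = 2 feature maps of size 2 x 3.
features = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
]
attention = [1.0, 2.0]

gated = hadamard_fusion(features, attention)    # shape 2 x 2 x 3
state_map = conv1d_fusion(features, attention)  # shape 2 x 3
state = flatten(state_map)                      # length 6
```

In this reading the Hadamard product leaves a d x H x W tensor that still has to be reduced before it reaches the policy, whereas the channel-wise convolution directly yields an H x W state.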
4.5. Policy Learning Module
The policy learning module is an Asynchronous Advantage Actor Critic (A3C) network [32]. The A3C module produces a probability distribution over the action space and a scalar value. The action for the current time step is sampled from the distribution and executed to obtain the reward. The policy and value losses are then back-propagated through the network to update the parameters.
Figure 3. Dynamic Attention Module. The figure shows the Dynamic Attention Module unrolled in time.
Figure 4. State Generation. The state representation is generated by applying a 1D convolution of the attention vector to the image feature activation maps.
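Sampling an action from the policy output can be sketched as follows. The logits are made-up numbers standing in for the network output, and the helper names are hypothetical; only the three action names come from the environment description.

```python
import math
import random

ACTIONS = ["Turn Left", "Turn Right", "Move Forward"]


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Sample an action index from the softmax distribution over the logits."""
    probs = softmax(logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


rng = random.Random(0)  # seeded for reproducibility
action_index, probs = sample_action([0.1, -0.3, 1.2], rng)
chosen_action = ACTIONS[action_index]
```

Sampling, rather than always taking the most probable action, keeps the agent exploring during training.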
4.6. Cost Function
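Actor-critic algorithms follow an approximate policy gradient:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^w(s, a)\right]$$

where $Q^w(s, a)$ is the estimate of the value function for taking action $a$ in state $s$, $\pi_\theta$ is the policy, $J(\theta)$ is the expected total reward, and $\theta$ denotes the parameters of the policy network.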
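The policy gradient above is commonly turned into per-episode losses as in the following sketch. It uses an advantage estimate (discounted return minus predicted value) in place of the action-value estimate, as is usual for A3C. The discount factor, the 0.5 value-loss weight and the function names are assumptions of this sketch, not settings reported here.

```python
import math


def discounted_returns(rewards, bootstrap_value, gamma):
    """Discounted return at every step, computed backwards from the final step."""
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def actor_critic_losses(log_probs, values, rewards, bootstrap_value=0.0, gamma=0.99):
    """Policy and value losses for one rollout.

    log_probs: log pi(a_t | s_t) of the actions taken; values: critic estimates V(s_t).
    The advantage is treated as a constant when weighting the log-probabilities.
    """
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    policy_loss = 0.0
    value_loss = 0.0
    for log_prob, value, ret in zip(log_probs, values, returns):
        advantage = ret - value
        policy_loss += -log_prob * advantage
        value_loss += 0.5 * advantage ** 2
    return policy_loss, value_loss, returns


# Toy rollout: the agent reaches the correct object on the third step.
log_probs = [math.log(0.5)] * 3
policy_loss, value_loss, returns = actor_critic_losses(
    log_probs, values=[0.0, 0.0, 0.0], rewards=[0.0, 0.0, 1.0], gamma=0.5)
```

With a discount of 0.5 the returns for this rollout are 0.25, 0.5 and 1.0, so earlier actions receive less credit for the final reward.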
We evaluate the proposed method, DAN, through several experiments with quantitative and qualitative analysis. The quantitative analysis includes an ablation analysis of similar variants of the model and the performance of each (Section 5.3). A comparison of our proposed method with various state-of-the-art models is provided in Section 5.4. Section 6 presents a qualitative analysis through visualization of the attention maps and a study of their properties and failure cases.
Figure 5. Policy Learning Module. The state representation is given as input to a standard A3C learning module that learns a mapping from states to actions.
Table 1. Ablation analysis. Comparison of the ZS and MT performance of various models similar to the DAN cell-state model.
5.1. Environment Setup
Experiments are performed on all three difficulty modes. During training, instructions are drawn from a training set of 55 instructions, and 15 instructions pertaining to unseen attribute-object combinations are held out as a test set for zero-shot evaluation. At each time step the agent is presented with a state definition generated by our state processing module, based on which it takes one of the three actions. The episode ends when one of three events occurs: the agent reaches the correct object, the agent reaches an incorrect object, or the number of time steps reaches the maximum episode length of T = 30. At the end of each episode the agent receives a reward of 1 for reaching the correct object, -0.2 for reaching an incorrect object, and 0 if the episode times out. The evaluation metric is accuracy, defined as the fraction of episodes in which the agent reaches the correct object.
The agent is tested on two scenarios suggested by [7]. Multi-Task Generalization: the agent is evaluated on unseen maps having unseen combinations of objects at random locations, with instructions from the training set. Zero-Shot Generalization: the agent is evaluated on unseen test instructions.
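The reward scheme and the accuracy metric described above can be sketched as follows. The outcome labels and function names are hypothetical; the reward values and the episode limit are those stated in the text.

```python
MAX_EPISODE_LENGTH = 30  # T, the maximum number of time steps per episode

# Terminal reward for each way an episode can end.
REWARDS = {
    "correct": 1.0,     # agent reached the correct object
    "incorrect": -0.2,  # agent reached an incorrect object
    "timeout": 0.0,     # episode hit the maximum length
}


def episode_reward(outcome):
    """Reward received at the end of an episode with the given outcome."""
    return REWARDS[outcome]


def accuracy(outcomes):
    """Fraction of episodes in which the agent reached the correct object."""
    if not outcomes:
        return 0.0
    return sum(1 for outcome in outcomes if outcome == "correct") / len(outcomes)


outcomes = ["correct", "incorrect", "timeout", "correct"]
total_reward = sum(episode_reward(outcome) for outcome in outcomes)
episode_accuracy = accuracy(outcomes)
```

Note that a timeout and a wrong object both count against accuracy, although only the wrong object is penalized in the reward.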
5.2. Implementation Setup
At every time step, the agent receives the screen buffer image of the environment as a first-person view. The image features are extracted using a 3-layer CNN. The image feature has 64 channels, each of dimensions 8 x 17. The instruction representation is generated by a GRU of size 256. The attention vector is obtained from the Dynamic Attention Module LSTM and applied to the image by means of a 1D convolution, resulting in an attended image representation of size 1 x 8 x 17. The attended image is then flattened into a vector of size 8 x 17 = 136, which is the state representation given as input to the A3C module. The attention is updated for the next time step by feeding the Dynamic Attention LSTM the concatenation of the current state representation and the instruction representation. The A3C module produces a probability distribution over the action space and a value. The action for this time step is sampled from the distribution and executed to obtain the reward for the time step. The policy and value losses are then back-propagated through the network to update the parameters.
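The per-step data flow described above can be sketched in plain Python with toy sizes. This is a minimal illustration under stated assumptions: the weights are random, the frames are random stand-ins for CNN features, the initial attention is set to all ones, and the fusion is read as an attention-weighted sum over channels. None of these choices is confirmed by the setup above, and all helper names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm_params(input_size, hidden_size, rng):
    """Small random weights for the input (i), forget (f), candidate (g), output (o) gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in "ifgo"}


def affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(params, x, h_prev, c_prev):
    """One standard LSTM step; returns the new hidden state and the new cell state."""
    z = x + h_prev  # list concatenation of the input and the previous hidden state
    i = [sigmoid(v) for v in affine(*params["i"], z)]
    f = [sigmoid(v) for v in affine(*params["f"], z)]
    g = [math.tanh(v) for v in affine(*params["g"], z)]
    o = [sigmoid(v) for v in affine(*params["o"], z)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def attend(features, attention):
    """Attention-weighted sum over channels at every location, flattened to H * W values."""
    height, width = len(features[0]), len(features[0][0])
    return [sum(a * fmap[r][col] for a, fmap in zip(attention, features))
            for r in range(height) for col in range(width)]


# Toy sizes; the setup in the text uses 64 channels of 8 x 17 and a 256-d instruction.
D, H, W, L = 4, 2, 3, 5
rng = random.Random(0)
params = make_lstm_params(input_size=H * W + L, hidden_size=D, rng=rng)
instruction = [rng.uniform(-1, 1) for _ in range(L)]

attention = [1.0] * D  # assumed initial attention, also the initial cell state
hidden = [0.0] * D
states = []
for _ in range(3):  # three time steps with random stand-in frames
    features = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(D)]
    state = attend(features, attention)  # state representation given to the policy
    # The updated cell state becomes the attention vector for the next time step.
    hidden, attention = lstm_step(params, state + instruction, hidden, attention)
    states.append(state)
```

With the sizes from the text, the LSTM input would have 136 + 256 = 392 entries and the cell state 64 entries, one per feature map.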
5.3. Ablation analysis
Figure 6. Dynamic Attention LSTM Cell-State. The figure compares the Dynamic Attention LSTM-output and the Dynamic Attention Cell-State models with the baseline.
For the experiments, we trained each model 3 times from scratch and plotted the mean accuracy after each epoch to obtain the training curve. In the Gated Attention architecture of [7], the attention vector is a function of the instruction representation alone. Our first hypothesis is that, along with the instruction, the current frame image is also necessary context for generating the attention. As a proof of concept we use the Gated Attention Network of [7] but make the attention vector a function of the concatenation of the instruction encoding and the image convolution features. As Fig. 7-a shows, this model achieves a higher steady state accuracy but converges more slowly.
Table 2. Comparison with baseline. Zero-Shot Generalization of our method (DAN) compared with Gated Attention (GA) and the concatenation method.
The next hypothesis is that the attention vector at a given time step is not independent of those at previous time steps. The attention vectors of successive time steps should be correlated and need not be computed from scratch every time. To verify this, we model the attention with an LSTM. At each time step, we take the attention vector of the previous time step and apply it to the current frame image features. These attended image features, concatenated with the instruction encoding, are fed to an LSTM cell to generate the attention vector for the next time step. We still use the baseline's Hadamard product for applying the attention to the image features. Fig. 7-b compares the accuracy plot of this model with those of the current frame attention and the baseline. We observe a greater steady state accuracy and faster convergence than with the current frame attention.
Figure 7. Accuracy plots of various models similar to the Dynamic Attention Network (cell-state) and their comparison. (a) The green curve shows the training accuracy of Current Frame attention and blue is the baseline. (b) The red curve shows the training accuracy of the Dynamic Attention - Output model. (c) The red curve shows the faster convergence due to the use of 1D convolution over the Gated Attention method (blue is the baseline and green is LSTM-output, both using Gated Attention).
Table 3. Comparison results for Multi-Task Generalization.
To tackle the slower convergence rate, we replace the Gated Attention approach by 1D convolution. The baseline uses a Hadamard product followed by downsizing through an FC layer to generate a state definition for the policy learner. The use of 1D convolution eliminates the need for an FC layer to downsize the state representation and reduces the number of trainable parameters. Fig. 7-c shows the faster convergence rate of 1D convolution over Hadamard product.
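A back-of-the-envelope count illustrates the parameter saving. The feature-map sizes come from the implementation setup; the FC output size of 256 is a hypothetical value chosen only for illustration, and the count assumes that the 1D convolution uses the attention vector itself as the kernel and therefore adds no weights of its own.

```python
def fc_parameters(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


channels, height, width = 64, 8, 17
fc_out = 256  # hypothetical output size of the baseline's downsizing FC layer

# Hadamard product keeps every channel, so the FC layer sees 64 * 8 * 17 inputs.
gated_state_size = channels * height * width
baseline_fc_params = fc_parameters(gated_state_size, fc_out)

# Channel-wise 1D convolution collapses the channels into one 8 x 17 map.
conv_state_size = 1 * height * width
conv_extra_params = 0  # assumption: the attention vector is the kernel
```

Under these assumptions the FC layer alone accounts for 2,228,480 parameters, while the convolved state has only 136 entries and needs no downsizing layer.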
Our next hypothesis is that in continuous observation spaces, the attention tends not to change abruptly. Most of the information is retained in the attention from one time step to the next. Gradually some information is added and removed from the attention vector, as new objects are introduced in the field of view of the agent while some are removed. This behaviour is inherent in the cell-state of an LSTM. Hence to incorporate this inductive bias, we model the attention vector as the cell-state of an LSTM.
Fig. 6 shows the accuracy plot of the Dynamic Attention LSTM Cell-State model (red) compared with the Dynamic Attention LSTM output model (green) and the baseline (blue). The cell-state model shows faster convergence and a more stable training curve, indicating a more robust representation of state. In Table 1 we compare the performance of each model on the Zero-Shot and Multi-Task generalization tasks.
5.4. Results and Comparison with state-of-the-art
We now compare the performance of our Dynamic Attention LSTM - Cell State model with the baseline Gated Attention model in the three difficulty modes and in the Multi-Task and Zero-Shot generalization settings. We also compare our model with [31], which combines the image representation and the language representation through concatenation.
From the curves we observe that the Dynamic Attention model outperforms the baseline in all difficulty modes in terms of rate of convergence and steady state accuracy. Tables 2 and 3 show the comparison of Zero-Shot and Multi-Task generalization performance with the baseline and concatenation approaches for all three modes of difficulty. We see that our model beats the state of the art by significant margins. This shows that our model has learnt to generalize better in unseen scenarios and with unseen instructions.
Figure 8. Visualization of the attention at regular time steps for three instructions. In (a), the first frame is the reference frame and the last frame shows the attended frame for a particular instruction. Similarly, in the last frame of (b) only the green pillar is highlighted and the other objects are suppressed. In (c), the last frame localizes the red object among all the objects, following the instruction.
From the attention visualizations in Fig. 8 we note the following:
• At the start of every episode, the attention quickly shifts to the objects, leaving the background unattended. It is clear that the agent has learned to detect foreground objects and distinguish them from the background.
• Even if the field of view is changing constantly, the attention remains fixated on the objects that were under the agent's attention. This shows the robustness of the grounding. This is also the case with human attention: we tend to fixate our gaze on the objects that we are observing even though the frame that we are currently seeing is not stationary.
• The agent quickly manages to focus on the objects of interest based on color description, shape, etc., as can be seen from the examples. When it fixates on the object(s) of interest, the attention subsides from the other objects. Only the objects very close to the object of interest get some attention, as is expected since the agent has to avoid hitting the wrong objects.
• Fig. 8-b depicts a failure case. In this case the attention focuses on the correct object but the agent does not move towards it. This might be due to the choice of reward function. Since the agent receives a negative reward of -0.2 when it reaches an incorrect object and a reward of 0 when it does not reach any, not taking any action in certain scenarios might statistically give it a greater expected reward even though correct grounding is achieved.
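The reward argument in the failure case can be made concrete with a simplified model. Assume, purely for illustration, that approaching always ends with the agent reaching some object and that it reaches the correct one with probability $p$, while staying put runs into the time limit.

```latex
\begin{aligned}
\mathbb{E}[R \mid \text{approach}] &= p \cdot 1 + (1 - p)\cdot(-0.2) = 1.2\,p - 0.2, \\
\mathbb{E}[R \mid \text{stay}] &= 0, \\
\mathbb{E}[R \mid \text{approach}] > \mathbb{E}[R \mid \text{stay}] &\iff p > \tfrac{0.2}{1.2} = \tfrac{1}{6} \approx 0.167 .
\end{aligned}
```

Under this simplification, not moving is preferable only when the agent's estimated chance of reaching the correct object falls below roughly one in six, for example when navigating past nearby incorrect objects is judged risky.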
In this paper we have proposed a dynamic attention network that can ground natural language instructions to visual elements and actions. We showed the effectiveness of dynamic attention over the static attention model of the gated attention network in terms of convergence rate as well as steady state performance. We have shown, through the performance of A3C as well as the quality of the attention it generates, that the cell-state of an LSTM can be a natural choice for modeling dynamic attention. We demonstrated the effect of using 1D convolution on the rate of convergence of the network. Through visualizations we have shown the robustness and quality of the grounding. Finally, we conclude that the use of dynamic attention helps in grounding instructions to objects and actions and is a natural choice when dealing with continuous observation spaces such as a 3D world.
[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV), 2015.
[2] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 2013.
[3] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 2003.
[4] M. Beetz, U. Klank, I. Kresse, A. Maldonado, L. Mösenlechner, D. Pangercic, T. Rühr, and M. Tenorth. Robotic roommates making pancakes. In 2011 11th IEEE-RAS International Conference on Humanoid Robots, pages 529–536, Oct 2011.
[5] Mario Bollini, Stefanie Tellex, Tyler Thompson, Nicholas Roy, and Daniela Rus. Interpreting and Executing Recipes with a Cooking Robot, pages 481–495. Springer International Publishing, Heidelberg, 2013.
[6] C. Chao, M. Cakmak, and A. L. Thomaz. Towards grounding concepts for transfer in goal learning from demonstration. In 2011 IEEE International Conference on Development and Learning (ICDL), volume 2, pages 1–6, Aug 2011.
[7] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
[8] David L. Chen and Raymond J. Mooney. Learning to interpret natural language navigation instructions from observations. pages 859–865, August 2011.
[9] Xinlei Chen and C Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2422–2431, 2015.
[10] Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
[11] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[12] Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. In Proc. of CVPR, 2017.
[13] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
[14] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In European conference on computer vision, pages 15–29. Springer, 2010.
[15] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
[16] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
[17] S. Guadarrama, L. Riano, D. Golland, D. Gohring, Y. Jia, D. Klein, P. Abbeel, and T. Darrell. Grounding spatial relations for human-robot interaction. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1640–1647, Nov 2013.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[19] Unnat Jain, Ziyu Zhang, and Alexander Schwing. Creativity: Generating diverse questions using variational autoencoders. arXiv preprint arXiv:1704.03493, 2017.
[20] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4565–4574, 2016.
[21] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
[22] Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. Vizdoom: A Doom-based AI research platform for visual reinforcement learning. CoRR, abs/1605.02097, 2016.
[23] Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR. Citeseer, 2011.
[24] Yann Lecun and Yoshua Bengio. Convolutional networks for images, speech, and time-series. MIT Press, 1995.
[25] Séverin Lemaignan, Raquel Ros, E. Akin Sisbot, Rachid Alami, and Michael Beetz. Grounding the interaction: Anchoring situated discourse in everyday human-robot interaction. International Journal of Social Robotics, 4(2):181–199, Apr 2012.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[27] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
[28] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[29] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NIPS), 2014.
[30] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. CoRR, abs/1506.04089, 2015.
[31] Dipendra Kumar Misra, John Langford, and Yoav Artzi. Mapping instructions and visual observations to actions with reinforcement learning. CoRR, abs/1704.08795, 2017.
[32] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1928–1937, 2016.
[33] Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. arXiv preprint arXiv:1603.06059, 2016.
[34] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 30–38, 2016.
[35] Junhyuk Oh, Satinder P. Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. CoRR, abs/1706.05064, 2017.
[36] Badri Patro and Vinay P. Namboodiri. Differential attention for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[37] Badri N. Patro, Anupriy, and Vinay P. Namboodiri. Probabilistic framework for solving visual dialog. ArXiv, abs/1909.04800, 2019.
[38] Badri Narayana Patro, Sandeep Kumar, Vinod Kumar Kurmi, and Vinay Namboodiri. Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4002–4012. Association for Computational Linguistics, 2018.
[39] Badri Narayana Patro, Vinod Kumar Kurmi, Sandeep Kumar, and Vinay Namboodiri. Learning semantic sentence embeddings using sequential pair-wise discriminator. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2715–2729, 2018.
[40] Badri N. Patro, Mayank Lunayach, Shivansh Patel, and Vinay P. Namboodiri. U-cam: Visual explanation using uncertainty based class activation maps. In arXiv preprint arXiv:1908.06306, 2019.
[41] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems (NIPS), pages 2953–2961, 2015.
[42] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4613–4621, 2016.
[43] Richard Socher, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association of Computational Linguistics, 2(1):207–218, 2014.
[44] Florian Strub, Harm De Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423, 2017.
[45] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[46] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision, pages 451–466. Springer, 2016.
[47] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[48] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
[49] Haonan Yu, Haichao Zhang, and Wei Xu. A deep compositional framework for human-like language acquisition in virtual environment. CoRR, abs/1703.09831, 2017.
[50] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004, 2016.