Embodied Question Answering (EmbodiedQA, EQA), proposed by [8], is the task of training an embodied agent that must navigate intelligently in a simulated environment and gather visual information to answer questions. For example, as shown in Figure 1, the agent is initialized at a random location and asked a question. To answer it, the agent has to explore the environment and find a visually grounded answer. Unlike previous vision-language tasks such as Image Captioning and Visual QA, embodied means that the environment is part of the cognitive system [29] and influences the mind. This indicates that cognition and action are based on an understanding of the dynamics of the environment as well as on common sense knowledge that is gradually collected from daily experience. Therefore, a mental model [15, 16] with an imagery function [10, 19, 33] that models the environment is crucial to building an embodied agent.
Many deep reinforcement learning based methods [8, 9, 11] have been proposed for the EQA task. They are designed to achieve the ultimate objective by learning to select primitive actions without long-term planning. None of them explicitly models the mental imagery function of the agent, even though mental imagery is important to long-term planning and is closely related to many valuable high-level meta-skills such as generalization and interpretation.
When humans solve a task, they do not act solely on current observations. There is a mental imagery model behind every decision. The model comes from our everyday experiences; it predicts the dynamics of the environment and forms mental images (i.e. mental imagery) in our mind. These mental images can be viewed as short-term subgoals that provide a path to a more concrete solution. Take building a house with Lego as an example. First, we may imagine the foundation of the house and build the foundation based on the imaginary ‘sketch’. Then we may imagine the main body and the roof of the house in turn, and finally build the whole house according to the imagery in our minds. In this case, we are not building the house directly without planning, but dividing the task into several sub-stages (i.e. subgoals) and imagining the scene of each individual stage. Mental imagery facilitates the understanding of our physical world; it can predict the future without our having to experience the outcome directly. It is a meta-skill that helps us to achieve different tasks across different domains more efficiently.
Motivated by the role of mental imagery in human decision making, we propose a novel Mental Imagery eNhanceD (MIND) module for the embodied agent, as well as a deep reinforcement learning framework for training it. The MIND module is composed of a Mental Autoencoder and an Imagery Model. Together they explicitly model the dynamics of the environment. An agent with the MIND module has several advantages:
• Faster Convergence: The MIND module helps the agent to build a better understanding of the environment. Such knowledge makes the agent a faster and better learner that can locate a feasible policy with only a few trials.
• Better Generalizability: Since the MIND module explicitly models the dynamics of the environment, the learned module is transferable across different tasks, as long as the dynamics of the environment remain the same. This is especially useful for learning in unknown scenes when only a few training examples are available.
• Better Planning Efficiency: The MIND module can generate mental images that are treated as short-term subgoals by our proposed deep reinforcement learning framework. In this new RL framework, we design a special reward (i.e. the planned reward) that encourages our agent to learn to form more task-related and objective-related short-term subgoals. These subgoals are easy to achieve and reusable.
• Better Behavioral Interpretation: The mental images generated at planning time visualize the agent’s intentions in a way that humans can understand, which makes real-time behavioral interpretation or even correction feasible.
2.1 Embodied Question Answering
Recently, several deep reinforcement learning based hierarchical architectures for EmbodiedQA have been proposed by Gordon et al. [11] in the AI2-THOR environments [18], and by Das et al. [8, 9] in the House3D environments [32]. These approaches decompose the control problem into multiple levels and consist of a factorized set of modules [3, 24, 26]. Gordon et al. [11] propose the Hierarchical Interactive Memory Network (HIMN), consisting of a high-level planner and several low-level controllers, allowing the agent to operate at multiple levels of temporal abstraction. The high-level planner chooses the task to be performed and the specified low-level controller executes the task. Das et al. [8] divide the EmbodiedQA agent into four modules (vision, language, navigation and answering), and the navigation module (PACMAN) decomposes navigation into a planner, which selects actions to perform, and a controller, which performs these actions. Das et al. [9] later propose a hierarchical Neural Modular Controller (NMC), consisting of a master policy and several sub-policies. These approaches all ignore the importance of the mental imagery model for embodied cognition, which results in poor generalizability and low planning efficiency. More specifically, they do not consider the dynamics of the environments, which makes it harder to generalize to new scenes. Besides, most approaches just execute primitive actions over long time horizons. They cannot plan to complete a sequence of short-term subgoals and finally answer the questions. Although NMC [9] contains a master policy to choose high-level subgoals, it requires additional training data annotated with a series of subgoals to train the master policy, and the types of subgoals are pre-defined, which limits its usefulness in practice.
In comparison, the MIND module first models the environment dynamics, which enhances its generalizability, and predicts imaginary short-term subgoals, which improves the agent’s planning efficiency and provides a way to visualize the agent’s intentions.
2.2 Mental Imagery
Mental imagery (varieties of which are sometimes colloquially referred to as ‘visualizing’, ‘seeing in the mind’s eye’, ‘hearing in the head’, etc.) is quasi-perceptual experience [10, 19, 33]; it resembles perceptual experience, but occurs in the absence of the appropriate external stimuli [33]. Numerous experiments carried out over the past twenty years have probed the nature of mental imagery and unlocked its powers [10, 12, 19]. The predictive model in our minds that forms the mental imagery is called the mental model [15, 16]. Ha et al. [12] instantiate the mental model as a ‘world model’ and apply it to some games (Car Racing and VizDoom). However, it cannot do long-term planning: it only predicts the latent representation of the next frame after a primitive action, which is unrelated to the task. In contrast, we design an RL framework that encourages our MIND module to generate task-related and interpretable predictions after several actions.
2.3 Vision-and-Language Navigation
Vision-and-language navigation (VLN) [2] requires the agent to understand natural-language navigation instructions and achieve the ultimate goal in a simulated environment. Natural language command of robots in unstructured environments has been a research goal for several decades [30]. Early approaches [6, 7, 21, 22] simplify the problem of visual perception to some degree. They restrict environments so that they require limited perception, or enumerate all navigation goals or objects, and the navigation goal in these approaches is usually directly annotated in a prior global map. In recent work, Mei et al. [22] propose a neural sequence-to-sequence model to map natural-language navigation instructions to actions. Anderson et al. [2] formulate VLN as a visually grounded sequence-to-sequence transcoding problem, and propose a sequence-to-sequence architecture with an attention mechanism, as well as the Room-to-Room dataset, which is the first benchmark dataset in real buildings. Wang et al. [27] propose a Cross-Modal Matching Critic that reconstructs the language instructions from the trajectories executed by the navigator, with the aim of encouraging global matching between them. Similar to EmbodiedQA, VLN also requires navigating in the environment to achieve some goals. The crucial difference between them is how the goals are specified. VLN explicitly provides a sequence of instructions and specifies the target. The VLN agent just needs to map the natural-language navigation instructions to actions. In contrast, EmbodiedQA does not provide instructions to the agent, and the ultimate goals are implicit in natural-language questions. The instructions in VLN can be seen as short-term subgoals, which the agent in EmbodiedQA has to plan by itself.
The MIND module has two components: the Mental Autoencoder and the Imagery Model. The mental autoencoder receives raw RGB images through a single egocentric RGB camera and learns a compressed spatial and temporal representation of the environment. As the agent explores the environment, it receives a series of images and forms their mental representations. The imagery model can then use the sequences of mental representations to form useful hypotheses about how the environment works and predict the future without having to experience the outcome directly.
3.1 Mental Autoencoder
The Mental Autoencoder is composed of a mental encoder and a mental decoder. The mental encoder is aimed at extracting spatial and temporal information from the environment and using it to form mental representations. We use β-VAE [5, 13, 17] to discover disentangled latent factors. This means that each dimension of the inferred latent representation represents one single generative factor (e.g., room direction, scale) and is relatively invariant to the other dimensions. Such a disentangled representation is interpretable and generalizes easily to a variety of tasks. It enables the MIND module to control imagery generation in a more interpretable way (e.g., increasing the dimension that controls the room direction to generate an image that turns left from the current view).
Formally, let $x$ denote the $224 \times 224 \times 3$ raw RGB image, $z$ the mental representation encoded by the mental encoder, and $\hat{x}$ the mental image produced by the mental decoder.
As shown in Figure 2, the encoder is a Convolutional Neural Network (CNN) [20] that takes $x$ as input and passes it through 4 convolutional layers to encode it into low-dimensional vectors $\mu$ and $\sigma$ of the same size. The mental representation $z$ is then sampled from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2 I)$. The decoder is also instantiated as a neural network that learns to reconstruct the image given $z$. The loss function of β-VAE is defined as:
$$\mathcal{L}(\theta, \phi; x, z, \beta) = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + \beta \, D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
where $\phi$ denotes the parameters of the encoder and $\theta$ denotes the parameters of the decoder.
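To make the two terms of the β-VAE loss concrete, the following is a minimal sketch in plain Python. It assumes a diagonal Gaussian posterior, a standard normal prior and a squared-error reconstruction term; the function names are hypothetical and this is not the authors' implementation.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def beta_vae_loss(x, x_recon, mu, sigma, beta):
    """Reconstruction error (squared error is an assumption) plus beta * KL."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_standard_normal(mu, sigma)
```

A larger `beta` puts more weight on matching the prior, which is the pressure that encourages disentangled latent dimensions.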
Figure 2: The encoder outputs low-dimensional vectors which are the parameters of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2 I)$. The decoder receives the mental representation $z$ sampled from $\mathcal{N}(\mu, \sigma^2 I)$ and uses it to reconstruct the original image.
3.2 Imagery Model
Through the mental encoder, we get a series of robust and disentangled mental representations as the agent explores the environment. The imagery model is aimed at compressing temporal information over time and predicting the next mental representation given a specific action. We can then decode the predicted representation $\hat{z}_{t+1}$ with the mental decoder to get the mental image $\hat{x}_{t+1}$, which reflects the imagery in mind. We use long short-term memory (LSTM) [14] for time series modelling and combine it with a Mixture Density Network (MDN) [4] as the output layer. Let $\hat{z}_{t+1}$ denote the prediction of the next mental representation, distinguished from the real mental representation $z_{t+1}$ encoded by the mental encoder at time $t+1$. Instead of a deterministic prediction of $\hat{z}_{t+1}$, the MDN outputs the parameters of a mixture distribution and uses it to sample a prediction of the next mental representation $\hat{z}_{t+1}$. Importantly, its mechanism for generating the next mental representation $\hat{z}_{t+1}$ is similar to that of the mental encoder (both output the parameters of a Gaussian distribution and sample a mental representation from it).
Formally, let $a_t$ denote the action the agent will take, and $h_t$ denote the LSTM’s hidden state at time $t$. The imagery model predicts $\hat{z}_{t+1}$ as follows:
$$h_t = \mathrm{LSTM}(z_t, a_t, h_{t-1}), \qquad \hat{z}_{t+1} \sim \mathrm{MDN}(h_t)$$
More specifically, the MDN takes the LSTM’s output as its input, outputs the parameters of a mixture of Gaussian distributions, and then samples $\hat{z}_{t+1}$ from this distribution. Thus we can consider that the LSTM’s hidden state contains the spatial and temporal information of the environment. When we combine the MIND module with the navigation model, we use the LSTM’s hidden state directly during planning. The details are described in Section 4.
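The sampling step of the MDN output layer can be illustrated with a short sketch. It assumes one shared set of mixture weights for the whole latent vector and diagonal Gaussian components; the text does not specify this layout, and `sample_from_mdn` is a hypothetical name.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_from_mdn(logits, means, sigmas, rng):
    """Pick a mixture component, then sample every latent dimension from it.

    logits: K unnormalized mixture weights
    means, sigmas: K lists holding the per-dimension Gaussian parameters
    """
    weights = softmax(logits)
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return [rng.gauss(m, s) for m, s in zip(means[k], sigmas[k])]
```

Because the output is a distribution rather than a point estimate, repeated calls with the same parameters can yield different imagined outcomes for the same action.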
Figure 3 showcases the internal process of the MIND module, which contains the mental encoder and the imagery model. Given the egocentric $224 \times 224$ RGB image, our mental encoder first encodes it into the mental representation $z_t$, which is passed to our imagery model as input. The imagery model takes $z_t$, the previous hidden state $h_{t-1}$ and an action $a_t$ as inputs, and then predicts the next $\hat{z}_{t+1}$. For example, as shown in Figure 3, at time $t-1$ we know that the agent is facing a staircase from the observation $x_{t-1}$ it receives. The imagery model receives the ‘turn left’ action, so it predicts the next mental representation $\hat{z}_t$ before actually executing the ‘turn left’ action. The mental image $\hat{x}_t$ at time $t-1$ is reconstructed by the mental decoder using $\hat{z}_t$. It is similar to the observation $x_t$ at time $t$, which is the real scene that the agent faces after turning left. From $\hat{x}_t$, we can see that our MIND module tries to imagine the scene that the agent will face after turning left. This means that, without having to actually perform an action $a_t$, the imagery model can predict the outcome, which can help the agent to select a better action to perform. Moreover, the predictive mental images visualize the short-term goals of the agent, which makes our method interpretable.
Figure 3: An example of the MIND Module. Mental Encoder outputs the mental representation of the current observation. Given the current mental representation and the next action, Imagery Model predicts the next mental representation. In this example, we use mental decoder to reconstruct the future observation which is the consequence of the given action.
3.3 Training Procedure
In this subsection, we describe how to train our MIND module. Importantly, our MIND module is pretrained independently using only the expert demonstrations in the EQA dataset, without any additional annotated data. In this task, the agent may be spawned at a random location in a 3D environment and may not immediately ‘see’ the scene containing the answer to the visual question. The expert demonstrations are trajectories following the shortest paths from the agent’s initial location to the target (more details in the experiment section). We first use these demonstrations to train our mental encoder to learn a mental representation $z$ of each frame and to reconstruct the frame from $z$. We minimize the difference between the original frame $x$ and the reconstructed frame $\hat{x}$ produced by the decoder from the mental representation $z$.
After that, we use our trained mental encoder and the expert demonstrations to train our imagery model. Given the mental representation $z_t$ encoded by the mental encoder and the action $a_t$ that the agent performed, the imagery model predicts the next $\hat{z}_{t+1}$. We minimize the difference between $\hat{z}_{t+1}$ and the real $z_{t+1}$. Notice that $z_{t+1}$ does not correspond to the next frame in the trajectory after executing an atomic action. We expect our MIND module to predict the further outcome of several actions instead of just an atomic action, which is more helpful for generating imaginary subgoals.
After pre-training the MIND module, we can apply it to navigation. Our navigation model is based on PACMAN [8], a hierarchical model that decomposes the navigator into a planner and a controller.
4.1 PACMAN Navigator
The PACMAN navigator contains a planner and a controller. The planner selects actions (i.e. forward, turn-left, turn-right, stop) and the controller decides how many times to perform the primitive action. More specifically, the planner first selects an action and gives control to the controller. The controller outputs 0 or 1: 0 means to stop and return control to the planner, and 1 means to execute the action that the planner has chosen once. Besides, the controller must return control to the planner after five consecutive 1s. One forward action is equivalent to 0.25 meters, and one turn-left or turn-right action is equivalent to a 9° change in viewing angle. The planner is instantiated as an LSTM, and the controller is instantiated as a multilayer perceptron. Formally, following [8], the planner produces an action $a_t$ as follows:
$$a_t, h_t \leftarrow \mathrm{PLNR}(h_{t-1}, I_t^0, Q, a_{t-1})$$
where $h_t$ here is the planner’s hidden state, $Q$ is the encoding of the question, and $I_t^n$ is the encoding of the observed image at $t$-th planner-time and $n$-th controller-time.
The controller produces 0 or 1 as follows:
$$\{0, 1\} \ni c_t^n \leftarrow \mathrm{CTRL}(h_t, a_t, I_t^n)$$
where 0 means to stop and return control to the planner, and 1 means to execute the action that the planner has chosen once.
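The hand-off between planner and controller described above can be sketched as a small control loop. The three callables are hypothetical stand-ins for the planner LSTM, the controller MLP and the simulator; only the five-repeat limit and the meaning of 0 and 1 come from the text.

```python
MAX_REPEATS = 5  # the controller must return control after five consecutive 1s

def run_planner_step(select_action, keep_going, execute):
    """One planner step: choose an action, then let the controller repeat it.

    select_action() -> action chosen by the planner
    keep_going(action, n) -> 1 to execute the action once more, 0 to return control
    execute(action) -> applies one primitive action in the environment
    Returns the chosen action and the number of times it was executed.
    """
    action = select_action()
    repeats = 0
    while repeats < MAX_REPEATS and keep_going(action, repeats) == 1:
        execute(action)
        repeats += 1
    return action, repeats
```

With a controller that always outputs 1, a single planner step executes the chosen action exactly five times before the planner is consulted again.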
4.2 Plan in MIND
In our approach, the planner takes the imagery model’s hidden state as an extra input and selects an action to perform. The MIND module executes at each planner time step to help the planner choose a better action. At controller-time, it is the same as the PACMAN model. Specifically, at each planner-time step, the planner selects an action based on the question encoding, the current observation, and its hidden state. Further, it takes into consideration the mental images generated by the MIND module. As illustrated in Figure 4, the mental images above the MIND module are the imaginary consequences of performing a sequence of specific actions, and they visualize the short-term subgoals of our agent. For example, at time t+3, our agent has just entered the room, and the mental image predicts the consequence of performing several forward actions. The mental image shows the imaginary scene of the room’s front door, which indicates that the short-term goal of our agent at time t+3 is to go to the front door of the room.
Figure 4: The overview of the MIND agent. Before deciding on an action to execute, it predicts some short-term subgoals in the mind, which yields better planning efficiency and visualizes the agent’s intentions in a way that humans can understand. The images above the MIND module are the mental images generated by the MIND module.
With the mental imagery enhanced module, our agent can generate imaginary short-term subgoals that are related to the final objective, which improves its planning efficiency. The MIND module also gives the agent a basic understanding of the environment, which enhances its generalizability to unseen buildings and helps maintain its performance when training data is scarce. For example, with such knowledge, our agent may know that it can only leave the room through the door, so it does not need to hit the wall many times to learn how to get out of the room. Therefore, we combine the planner in PACMAN with the MIND module.
4.3 Reinforcement Learning of the MIND Agent
In this subsection, we describe in detail how to train the MIND module and the navigator together. We first use imitation learning to warm-start our MIND agent and then use A3C [23] for fine-tuning.
Imitation Learning to Warm-start: In the EQA dataset, there are four kinds of questions (more details are given in the next section) and each question has a target object. The question is about properties (e.g. location, color) of the target object, so expert demonstrations for imitation learning can be generated using the shortest path from the agent’s initial position to the target object. We use these expert demonstrations to warm-start our agent, training it to mimic the expert demonstration using behavior cloning. More specifically, given the current observation and the question, our agent is trained to select the action taken in the expert demonstration. We find it hard to let the agent learn the expert demonstrations directly, because the initial position in the expert demonstrations is too far from the target object. Similar to Das et al. [8], we use distance-based curriculum learning to train our model. First, we initialize our agent five steps away from the target object along the expert demonstration and let it mimic the remaining actions in the expert trajectory. After it learns the remaining actions successfully, we backtrack five additional steps. Finally, our agent can mimic expert demonstrations after 20 epochs. The objective function $J_{bc}$ can be written as:
$$J_{bc}(\theta) = \sum_{t} \log \pi_\theta(a_t^* \mid s_t)$$
where $\pi_\theta$ is the agent’s policy, $a_t^*$ is the demonstration action and $s_t$ is the state, containing the current frame, mental image, question encoding, and the navigation history. The training objective is to maximize this function.
Reinforcement Learning to Fine-tune: After behavior cloning, we use reinforcement learning to endow the agent with the ability to recover from wrong actions and to encourage our MIND module to generate more task-related mental imagery.
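The distance-based curriculum used during imitation learning can be sketched as follows. The stage counter and both function names are assumptions made for illustration; the text only states that the agent starts five steps from the target and backs up five more steps after each successful stage.

```python
def curriculum_start_index(num_expert_actions, stage, step=5):
    """Index into the expert trajectory at which the agent is spawned.

    Stage 1 starts `step` actions before the end; every later stage backs up
    `step` more actions until the whole demonstration is covered.
    """
    return max(0, num_expert_actions - step * stage)

def remaining_expert_actions(expert_actions, stage, step=5):
    """Actions the agent has to imitate at the given curriculum stage."""
    start = curriculum_start_index(len(expert_actions), stage, step)
    return expert_actions[start:]
```

For a 12-action demonstration, stage 1 imitates the last 5 actions, stage 2 the last 10, and stage 3 the full trajectory.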
Inspired by [8], we propose an actor-critic RL framework based on three kinds of rewards, namely the final reward $r_{final}$, the progressive reward $r_{prog}$ and the planned reward $r_{plan}$, which encourage our agent to reach the target location efficiently and give a correct answer. Importantly, the planned reward $r_{plan}$ encourages our agent to learn to form more task-related and objective-related short-term subgoals.
The final goal of our agent is to answer the questions correctly. Therefore, we define the final reward to reflect whether the question is answered correctly. We use the same question-answering model as [8]. The question-answering model is called when our agent chooses to stop. It receives the question and the image features from the last five frames along the navigation path, and then computes an image-question similarity between the question encoding and the features of each image. These similarities are used as attention weights to combine the five image features with the question encoding. The final question-image features are passed through a softmax classifier to predict a distribution over 172 answers. Let $T$ denote the last time step. The final reward is defined as:
$$r_{final}(s_T) = \begin{cases} 1 + \alpha \max(N_{max} - n, 0), & \text{if the answer is correct} \\ 0, & \text{otherwise} \end{cases}$$
Different from Das et al. [8], we set the final reward to one plus the weighted maximum of zero and the difference between the maximum number of actions $N_{max}$ and the actual number of actions $n$ that our agent executes if the answer is correct, and to zero otherwise. The second term in the final reward is aimed at encouraging our agent to perform fewer actions to answer the question, which improves the efficiency of navigation. We can adjust the weight parameter $\alpha$ to balance between answer correctness and navigation efficiency.
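The verbal definition of the final reward translates directly into a few lines of code. In this sketch the argument names are hypothetical and the default `weight` is only a placeholder; `max_actions=80` mirrors the maximum number of actions reported in the setup.

```python
def final_reward(answer_correct, num_actions, max_actions=80, weight=0.0):
    """1 + weight * max(max_actions - num_actions, 0) if correct, else 0."""
    if not answer_correct:
        return 0.0
    return 1.0 + weight * max(max_actions - num_actions, 0)
```

With a positive `weight`, a correct answer reached in fewer actions earns a larger reward, while exceeding the action budget still yields the base reward of one.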
The progressive reward is an intermediate reward that encourages our agent to get closer to the target object. Let $d(s_t)$ denote the distance between the location at state $s_t$ and the target location. Then the progressive reward after taking action $a_t$ at state $s_t$ is defined as:
$$r_{prog}(s_t, a_t) = d(s_t) - d(s_{t+1})$$
If the distance to the target location becomes smaller after taking action $a_t$, our agent gets a positive reward that reflects how much the distance to the target has been reduced. Conversely, if our agent moves further away from the target, it is punished with a negative reward.
The progressive reward only considers the immediate effect of an action and ignores its impact on the future. For example, in order to get closer to the target, the agent may go to the corner of a room instead of leaving the room, which temporarily reduces the distance to the target location. However, it can never reach the target location without leaving the room.
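The short-sightedness of the progressive reward is easy to see in a small sketch that computes it as the per-step reduction in distance to the target. The function name and the example distances are illustrative assumptions.

```python
def progressive_rewards(distances):
    """distances[t] is the distance to the target at step t.

    Returns one reward per action: positive when the action reduced the
    distance, negative when it increased it.
    """
    return [before - after for before, after in zip(distances, distances[1:])]
```

The rewards telescope, so their sum equals the initial distance minus the final distance. Walking into a corner that is nearer to the target is therefore rewarded in the short term even though it cannot lead to the target.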
To account for this, we define an intermediate reward called the planned reward. As mentioned above, the mental images produced by the MIND module reflect the imaginary short-term subgoals in our agent’s mind, and we can use them to inspect whether the agent’s next few actions are beneficial to the final objective. Therefore, we define the planned reward as the improvement in the probability of the correct answer. More specifically, let $I_t^{-k}$ denote the $k$-th last frame at $t$-th planner-time (e.g. $I_t^{-3}$ is the third last frame) and $\hat{x}_t$ denote the mental image at $t$-th planner-time. Let $P(a^* \mid \cdot)$ denote the probability of the correct answer produced by the question-answering model, where $a^*$ is the correct answer among 172 candidates. The planned reward is written as:
$$r_{plan}(t) = P(a^* \mid Q, I_t^{-4}, \ldots, I_t^{-1}, \hat{x}_t) - P(a^* \mid Q, I_t^{-5}, \ldots, I_t^{-1})$$
At each planner-time step, we compute the probability of the correct answer based on the current mental image and the last four frames, and the probability based on the last five frames. We compare them to inspect whether the current subgoal of the agent is beneficial for answering the question. If the probability of the correct answer increases with the mental image, our MIND agent has formed a task-related and objective-related short-term subgoal. If not, the agent is punished with a negative reward. At test time, we call the answering model only once, when our agent stops.
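The comparison behind the planned reward can be sketched as below. Here `answer_prob` is a hypothetical stand-in for the question-answering model, and the choice to append the mental image after the four most recent frames is an assumption about frame ordering.

```python
def planned_reward(answer_prob, question, last_five_frames, mental_image, correct_answer):
    """Gain in correct-answer probability when the mental image is used.

    answer_prob(question, frames, answer) -> probability in [0, 1].
    The mental image replaces the oldest of the five frames (an assumption).
    """
    imagined = last_five_frames[1:] + [mental_image]
    with_imagery = answer_prob(question, imagined, correct_answer)
    without_imagery = answer_prob(question, last_five_frames, correct_answer)
    return with_imagery - without_imagery
```

A positive value means the imagined subgoal makes the correct answer more likely than the frames already seen, and a negative value penalizes imagery that moves away from the answer.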
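So far, we have three kinds of rewards. They encourage an efficient navigation path to the target location and a correct answer to the given natural language question. We add the three rewards together as the total reward function:
$$r = r_{final} + r_{prog} + r_{plan}$$
Then, we use A3C [23] with the generalized advantage estimator (GAE) [25] to optimize our policy. The gradient of the objective $J(\theta)$ can be written as:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{GAE}_t\right]$$
where $A^{GAE}_t$ is the generalized advantage estimator, computed from the rewards and $V(s_t)$, the estimated value of state $s_t$ produced by the critic for the policy $\pi_\theta$.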
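For reference, the generalized advantage estimator [25] used with A3C can be computed with a single backward pass over an episode. This is a generic sketch of the standard recursion, not the authors' code, and the default `gamma` and `lam` values are illustrative.

```python
def generalized_advantages(rewards, values, gamma=0.99, lam=1.0):
    """Compute GAE advantages for one episode.

    rewards[t] is the total reward received after step t.
    values has len(rewards) + 1 entries; the last one is the bootstrap value
    of the final state (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces the estimator to the one-step TD error, while `lam=1` with a zero critic recovers the discounted reward-to-go.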
5.1 EQA Dataset
Statistics: The EQA dataset contains about 9000 questions in 774 environments. More specifically, there are 7129 training questions in 648 environments, 853 validation questions in 68 environments and 905 test questions in 58 environments, and no environment is shared between the splits. Thus, performance on the test set directly reflects generalizability to novel environments.
Question Form: In EQA dataset, there are 4 kinds of questions as shown below:
5.2 Evaluation Metric
In order to answer the question, the agent must move from a random initial position to the target location that contains the answer to the visual question. For instance, to answer ‘What room is the shoe rack located in?’, the agent must perceive the environment and perform a sequence of correct actions to move to the room with the shoe rack. In the EQA dataset, the target location is marked by humans. Let $d_0$ denote the initial distance to the target, $d_T$ the final distance to the target (how far the agent is from the goal when it stops) and $d_\Delta = d_0 - d_T$ the change in distance. We use $d_\Delta$ to evaluate navigation performance and spawn the agent 10, 30, or 50 primitive actions away from the target, denoted as $T_{-10}$, $T_{-30}$ and $T_{-50}$. A larger $d_\Delta$ indicates that the agent is better able to find the target location containing the visual answer. We use accuracy to evaluate the answering performance.
Table 1: Evaluation of EmbodiedQA agents on navigation and answering metrics for the EQA test set.
** This approach requires additional training data annotated with a sequence of subgoals, which is not available in the EQA dataset.
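The navigation and answering metrics of this subsection amount to the following small computations. The function names and dictionary keys are hypothetical and chosen to mirror the distance symbols of the EQA benchmark.

```python
def navigation_metrics(initial_distance, final_distance):
    """Return the initial distance, the final distance and their difference."""
    return {
        "d_0": initial_distance,
        "d_T": final_distance,
        "d_delta": initial_distance - final_distance,
    }

def answer_accuracy(predicted, correct):
    """Fraction of questions whose predicted answer matches the ground truth."""
    matches = sum(p == c for p, c in zip(predicted, correct))
    return matches / len(correct)
```

A negative `d_delta` means the agent ended further from the target than where it was spawned.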
5.3 Setup
The latent space dimension of the β-VAE is 128. Training a Mixture Density Network easily runs into numerical instability. To avoid it, we use the log-sum-exp trick and gradient clipping, and replace the exponential function with ELU(1, x) + 1. We have explored several different structures of the imagery model, and the best version we have uses 1 LSTM layer, 5 Gaussians and 512 hidden units. We use the Adam optimizer with a learning rate of 1e-4 to train the mental encoder and a learning rate of 1e-5 to train the imagery model. We set the maximum number of actions to 80, and the batch size is 20. During A3C fine-tuning, we set
99,
00 and the learning rate is 1e-4.
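The two stabilization tricks mentioned for MDN training can be illustrated on a one-dimensional mixture. This sketch assumes that ELU with α = 1 plus one is used to produce positive standard deviations, which the text does not state explicitly, and all function names are hypothetical.

```python
import math

def elu_plus_one(x):
    """ELU(x) + 1 with alpha = 1: positive output, linear growth for x > 0."""
    return x + 1.0 if x > 0 else math.exp(x)

def log_sum_exp(values):
    """log(sum(exp(v))) computed without overflowing for large v."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mdn_negative_log_likelihood(target, logits, means, raw_sigmas):
    """NLL of a scalar target under a 1-D Gaussian mixture, in log space."""
    log_norm = log_sum_exp(logits)  # normalizer of the mixture weights
    log_terms = []
    for logit, mean, raw in zip(logits, means, raw_sigmas):
        sigma = elu_plus_one(raw)
        log_gauss = (-0.5 * ((target - mean) / sigma) ** 2
                     - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
        log_terms.append(logit - log_norm + log_gauss)
    return -log_sum_exp(log_terms)
```

Working in log space avoids multiplying very small component densities, and the linear branch of `elu_plus_one` keeps the scale parameter from exploding the way a plain exponential can.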
5.4 Results & Ablation Analysis
Overall Analysis: We compare our MIND agent with PACMAN [8], NMC [9], and Blindfold [1]. Blindfold is a question-only BoW baseline. Although it achieves the best QA accuracy by leveraging the biases in the dataset, its navigation accuracy is poor. As shown in Table 1, our MIND(BC+A3C) achieves better QA accuracy at all distances compared with PACMAN and NMC. This suggests that our MIND agent has stronger generalizability. This gain mainly comes from the MIND module’s ability to model the dynamics of the environment and to plan short-term goals instead of just executing primitive actions over long time horizons. Even though MIND(BC) isn’t fine-tuned using A3C, it performs better than PACMAN(BC + REINFORCE) in
. This fact demonstrates the effectiveness of our MIND module. Comparing MIND(BC) with MIND(BC+A3C), we can see that our RL framework significantly boosts answering accuracy. NMC achieves the best navigation performance at
, since it has additional annotated, well-designed subgoals (e.g., find a specific object or room, exit a specific room), which are crucial for long-term planning.
To fully investigate the effectiveness of the three different rewards, we conduct an ablation analysis on these rewards. We only use
and
to train three agents separately, and compare them with our best model. MIND(BC+A3C)
means that we train the agent without
. As shown in Figure 5, without intermediate rewards
MIND(BC+A3C)
performs even worse than MIND(BC), which indicates that plain A3C training without intermediate rewards cannot improve performance.
Figure 5: Results of the MIND agent with different rewards. MIND(BC+A3C) means that we train the agent without
Figure 6: Learning curves of MIND and PACMAN agent.
Comparing MIND(BC+A3C) with MIND(BC+A3C)
, we find that the gain from planned reward
is higher than progressive reward
, because the planned reward encourages our MIND module to generate more task-related imagery, which enhances navigation performance.
Generalizability & Convergence Speed: To evaluate the ability to generalize from a few samples, we use part of the validation data to train the MIND agent and PACMAN, and then compare their generalizability on the test data. The validation set (853 questions in 68 environments) is only about one-tenth the size of the training set (7129 questions in 648 environments). Our MIND module is pretrained independently using the training data, and it has never ‘seen’ any environment in the validation or test data. No environment is shared between the splits, so this experiment strictly tests the agent’s generalizability to unseen environments with little training data. We train the agents on different numbers of validation examples and compare their performance on navigation and question answering. For a fair comparison, PACMAN and MIND are both trained using behavior cloning to warm-start and REINFORCE [28] to fine-tune. We also show their learning curves when we train them using REINFORCE.
As shown in Figure 8, at each size of training data, MIND agent performs better than PACMAN both in navigation performance and answering accuracy. We can see that PACMAN almost does not work when the number of training data is 200. From Figure 6, we can see that our MIND agent converges faster and better.
Figure 7: Example trajectories executed by PACMAN, MIND and the human expert. The global trajectories are shown in a top-down view (the top-down view is not available to the agents). The black areas represent obstacles, which can not be directly passed by the agents. The red trajectory is the human demonstration. The blue trajectory and green trajectory are executed by MIND and PACMAN, respectively.
Figure 8: The navigation performance and answering accuracy at
Planning Efficiency & Behavioral Interpretation: To demonstrate our method’s superior performance on route planning and behavioral interpretation, we carry out a case study. As shown in Figure 7, the agent is spawned in the kitchen at the top right of the top-down view and asked a question. We can see that the PACMAN agent first goes to the left. Although its direction is correct, it cannot walk out of the kitchen through the wall, and it needs several attempts to get out of the room. This is because it lacks a basic understanding of the environment and short-term planning. It just wants to get close to the target object, but it does not know that it has to get out of the kitchen first. Therefore, it has low navigation efficiency. In contrast, our MIND agent acts more like a human. It plans short-term goals in mind, performs a sequence of actions to achieve them, and finally reaches the target location and answers the question. From the image (d) in Figure 5, we can see that the MIND agent is entering the corridor. The mental image (c) visualizes its short-term subgoal: it intends to go straight and get to the end of the corridor, which is confirmed by MIND’s trajectory in the top-down view. When it gets to the end of the corridor, we can see that it plans to turn right and get close to the bedroom, which is shown in mental image (a). These two cases suggest that the MIND agent holds mental images in its mind, which serve as its short-term goals and make its actions more interpretable and planned.
Further Discussion: Recently, Wu et al. [31] proposed a simple baseline that can be trained end-to-end and is competitive with the state of the art. Their empirical results indicate that the QA bottleneck stems from poor navigation ability and that current approaches are far from satisfactory. Wu et al. also introduced an easier and more practical setting for EmbodiedQA: they propose a proxy task in which the agent explores a new environment by randomly placing some markers, which helps the agent adapt the learned model to that environment. This practical setting can be readily applied to other approaches to improve their generalizability to new scenes. In addition, the text-only baseline [1] inspires us to create less biased QA pairs in the future.
In this paper, we propose the Mental Imagery eNhanceD (MIND) module for EmbodiedQA and a deep reinforcement learning framework for the MIND agent. The MIND module models the environment dynamics and predicts mental images related to the final goal, which endows our agent with strong generalizability and interpretability and improves its planning efficiency. The experimental results and further analysis show that the agent with the MIND module is superior to its counterparts not only in EQA performance but also in other aspects such as route planning, behavioral interpretation, and the ability to generalize from a few examples.
This work has been supported in part by the National Key Research and Development Program of China (SQ2018AAA010010), NSFC (No.61751209, U1611461), Hikvision-Zhejiang University Joint Research Center, Zhejiang University-Tongdun Technology Joint Laboratory of Artificial Intelligence, Zhejiang University iFLYTEK Joint Research Center, Chinese Knowledge Center of Engineering Science and Technology (CKCEST), and Engineering Research Center of Digital Library, Ministry of Education.
[1] Ankesh Anand, Eugene Belilovsky, Kyle Kastner, Hugo Larochelle, and Aaron Courville. 2018. Blindfold baselines for embodied qa. arXiv preprint arXiv:1811.05013 (2018).
[2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sunderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Jun 2018). https://doi.org/10.1109/cvpr.2018.00387
[3] Jacob Andreas, Dan Klein, and Sergey Levine. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 166–175.
[4] Christopher M Bishop. 1994. Mixture density networks. Technical Report. Citeseer.
[5] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. 2018. Understanding disentangling in β-VAE. arXiv:cs.LG/1804.03599
[6] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Ruslan Salakhutdinov. 2017. Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv:cs.LG/1706.07230
[7] David L. Chen and Raymond J. Mooney. 2011. Learning to Interpret Natural Language Navigation Instructions from Observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011). San Francisco, CA, USA.
[8] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied Question Answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Neural modular control for embodied question answering. arXiv preprint arXiv:1810.11181 (2018).
[10] Ronald A Finke. 1989. Principles of mental imagery. The MIT Press.
[11] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4089–4098.
[12] David Ha and Jürgen Schmidhuber. 2018. World Models. arXiv:cs.LG/1803.10122
[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, Vol. 3.
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[15] Philip N Johnson-Laird. 1995. Mental models, deductive reasoning, and the brain. The cognitive neurosciences 65 (1995), 999–1008.
[16] Natalie Jones, Helen Ross, Timothy Lynam, Pascal Perez, and Anne Leitch. 2011. Mental models: an interdisciplinary synthesis of theory and methods. (2011).
[17] Diederik P Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. arXiv:cs.LG/1312.6114
[18] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. 2017. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474 (2017).
[19] SM Kosslyn and Zenon Pylyshyn. 1994. Image and brain: The resolution of the imagery debate. Nature 372, 6503 (1994), 289–289.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[21] Matt Macmahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the Talk: Connecting Language, Knowledge, Action in Route Instructions. In Proc. of the Nat. Conf. on Artificial Intelligence (AAAI). 1475–1482.
[22] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2015. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences. arXiv:cs.CL/1506.04089
[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning. 1928–1937.
[24] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. 2017. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2661–2670.
[25] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
[26] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. 2017. A deep hierarchical approach to lifelong learning in minecraft. In Thirty-First AAAI Conference on Artificial Intelligence.
[27] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2018. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. arXiv preprint arXiv:1811.10092 (2018).
[28] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.
[29] Margaret Wilson. 2002. Six views of embodied cognition. Psychonomic bulletin & review 9, 4 (2002), 625–636.
[30] Terry Winograd. 1971. Procedures as a representation for data in a computer program for understanding natural language. Technical Report. MASSACHUSETTS INST OF TECH CAMBRIDGE PROJECT MAC.
[31] Yu Wu, Lu Jiang, and Yi Yang. 2019. Revisiting EmbodiedQA: A Simple Baseline and Beyond. arXiv:cs.CV/1904.04166
[32] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. 2018. Building Generalizable Agents with a Realistic and Rich 3D Environment. arXiv:cs.LG/1801.02209
[33] Edward N Zalta, Uri Nodelman, Colin Allen, and John Perry. 2003. Stanford encyclopedia of philosophy.