As AI becomes more widespread in the real world, and the drive towards universal AI gains more traction, the need for interpretable and general agents increases. Since more systems are making automated decisions, humans require those systems to explain their behavior and expect them to work in unknown scenarios.
In this paper, we investigate whether it is possible to train a Reinforcement Learning (RL) agent to operate in a virtual environment while being interpretable in its ‘intentions’, and how its interpretability helps in finding more compositional solutions. More specifically, while training a neural agent to follow some navigation instructions, we require it to spell out what is, at each time-step, its current objective. To accomplish that, we use the recently introduced diagnostic classifier (Hupkes et al., 2018), a linear classifier which assesses the presence of some specific information in a neural network by trying to predict it from its hidden states. In our case, we use it at training time to predict the current objective of the RL agent.
Our approach is inspired by how humans learn. While in traditional RL, the objective is defined in terms of a single goal, expressed through some reward function (Sutton and Barto, 2018), when we teach humans to follow instructions, not only do we check for accurate execution, but we also make sure that the instruction, usually expressed in natural language, is correctly understood. Are all word meanings in the instruction known? Is it clear how to segment the instruction such that it can be decomposed into sub-tasks, encouraging efficient sub-task separation (Gopalan et al., 2017)? In this paper, we account for some of this extra supervision and measure its impact on learning efficiency.
2.1 Following language instructions
One of the first attempts at following language instructions is SHRDLU (Winograd, 1971). It was designed to understand natural language by relating to a physical world. However, its apparent success stemmed from handwritten rules in a finite grammar, which is unsustainable in natural language. In attempts to deal with incomplete information, probabilistic methods extract cues from the instruction to improve the agent’s learning capabilities (Kollar et al., 2010; Vogel and Jurafsky, 2010; Tellex et al., 2011; Dzifcak et al., 2009).
Recently, following language instructions has been actively researched, with the introduction of artificial environments (Bisk et al., 2016; Hermann et al., 2017; Wu et al., 2018). Now, the trend has been leaning towards deep reinforcement learning agents, hoping to fulfill the promise of generalizable agents that exploit the instruction (Misra et al., 2017; Bahdanau et al., 2018; Yu et al., 2018).
We aim to recover a more human-like learning environment; as humans provide linguistic and non-linguistic cues about how to segment instructions (e.g., during execution, by asking the learner to explain some of her actions), we probe the artificial learner for its focus using information from the language instruction.
2.2 Compositionality
Given the abstract information that a natural-language instruction provides for humans (Werning et al., 2012), artificial agents should intuitively also be able to benefit from the compositionality of language. If the agent is instructed to perform a known action on an object that it has never seen before, it could reuse its knowledge of the action and require less training before it can successfully complete the instruction.
In the context of following navigation commands, Lake and Baroni (2018) introduced the SCAN task, which is designed to test for compositional abilities in neural networks. The authors show how sequence-to-sequence models are generally able to learn navigation commands, but, as soon as they are tested on instructions which require compositional generalization, they fail miserably.
Additionally, intrinsic motivation aids in scaling RL agents through the use of an internal supervisory signal, representing a reward from performing “interesting” actions (e.g., Chentanez et al., 2005; Şimşek and Barto, 2006). These signals, obtained during unsupervised traversal of the environment, are used to help the agent form a set of skills by exploration. Later, these skills can be reused and employed when optimizing for a task. Intrinsic motivation has been successfully applied in cases such as efficient learning with sparse rewards (Pathak et al., 2017), or in the development of an embodied robotics actor (Frank et al., 2014).
Finally, compositionality is also at the core of curriculum learning for RL (e.g., Narvekar et al., 2017; Florensa et al., 2017). The idea is to design and solve a sequence of tasks with increasing complexity and reuse the skills acquired in these tasks to solve the target task. However, designing the curricula may be as hard as, or even harder than, directly solving the target tasks (when prior knowledge is unavailable). It is crucial that an appropriate design is chosen, but experiments show that curriculum learning can be beneficial in scaling the training of RL agents (Wu and Tian, 2016; Gupta et al., 2017).
Our approach can be seen as an instance of curriculum learning where the prior knowledge is the task instruction and the curriculum leverages the sequential structure of the task.
2.3 Understanding black box models
Presently, deep neural networks are mostly black boxes, and creating an understanding of their internal mechanisms remains a shot in the dark. Fortunately, recent work in explainable AI (XAI) attempts to increase the transparency of these models. An overview by Biran and Cotton (2017) distinguishes between two notions of explainability: justification and interpretability. Justifications are reasons for decisions an agent might make, but are not necessarily connected to the workings of the agent itself. This means they can be generated for non-interpretable systems, and require no retraining of the original model. Interpretations, on the other hand, are inherent to the agent, and should reflect how the agent arrived at its decision through its internal workings.
Recent developments for generating interpretations include using t-SNE plots to visualise the latent space of agents (Zahavy et al., 2016; Jaderberg et al., 2018), examining the attention patterns when agents make decisions (Greydanus et al., 2017), including a human in the loop to help a model’s interpretability (Lage et al., 2018), and using ‘diagnostic classifiers’ to decode which specific information is encoded in the network (Hupkes et al., 2018).
In this work, we encourage the agent to develop a more interpretable policy, which, at any time step, is able to report its current objective. Additionally, we investigate the compositionality after training the interpretable policy.
This section describes the environment, model and setup used in the experiments.
3.1 BabyAI game
As a testbed for the learning process we make use of the BabyAI platform (Chevalier-Boisvert et al., 2018), which consists of a grid world environment in which the agent is presented with a structured language instruction. The platform contains different levels, which increase in complexity through a combination of distractors, composite instructions, and sparse rewards. The observation presented to the agent is a 7x7 grid, a 2D representation of the agent’s surroundings. This egocentric view contains a symbolic representation of objects, walls, doors and their colors. The agent has access to actions such as picking up objects and walking around. The compact representation of the grid world allows for fast processing of the observations.
The instruction is given in the Baby Language, a well-defined subset of the English language, which is simple yet diverse. For all our experiments, we develop customized levels which spin off from the original GoTo level. We choose GoTo because it is the least complex instruction and therefore the easiest to learn. The atomic instruction is formed by selecting a color and object type at random, specifying a target for the agent. Optionally, the modifier twice or thrice can be added to an atomic instruction, much like in the SCAN dataset (Lake and Baroni, 2017). Example atomic instructions include go to the red ball, go to a blue box twice and go to the yellow key thrice.
In the case an agent is instructed to visit an object multiple times, upon arriving at a target object the objects are shuffled around the environment. The agent has to visit the same object respectively one or two more times in order to complete the instruction correctly. To prevent infinite length episodes, every instruction has an associated maximum number of steps, corresponding to the complexity of the instruction.
Atomic instructions are subsequently combined through the use of various task connectors. By means of these operators, a compound instruction can be made consisting of two atomic instructions. We consider the following task connectors:
• Before: Complete the first instruction before completing the second. If the agent completes the second instruction before the first, the compound instruction fails, and no reward is given.
• After: Complete the first instruction after completing the second. If the agent completes the first instruction before the second, the compound instruction fails, and no reward is given.
Besides combining atomic instructions, the connectors apply to complex instructions as well. In this case, the connectors are left-associative.
For example, a compound instruction is go to the blue box twice before go to the yellow key. An overview of all levels considered in this work is given in Table 1, and a visual example of the setup is given in Figure 1.
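To make the instruction structure described above concrete, the following is a minimal sketch of how atomic instructions, modifiers and the two connectors could be represented and checked. It is not the BabyAI implementation: the vocabulary lists and the function names `atomic_instruction`, `compound_instruction` and `episode_outcome` are hypothetical, and only compounds of two atomic instructions are handled.

```python
import random

# Hypothetical vocabulary; the real colors and object types come from BabyAI.
COLORS = ["red", "green", "blue", "yellow"]
OBJECTS = ["ball", "box", "key"]
REPEATS = {"": 1, "twice": 2, "thrice": 3}


def atomic_instruction(rng):
    """Sample one atomic GoTo instruction as (text, target, required_visits)."""
    color = rng.choice(COLORS)
    obj = rng.choice(OBJECTS)
    modifier = rng.choice(list(REPEATS))
    text = f"go to the {color} {obj}" + (f" {modifier}" if modifier else "")
    return text, (color, obj), REPEATS[modifier]


def compound_instruction(first, second, connector):
    """Join two atomic instructions; return (text, subtasks in execution order).

    "before" keeps the written order, "after" reverses it.
    """
    if connector not in ("before", "after"):
        raise ValueError(f"unknown connector: {connector}")
    text = f"{first[0]} {connector} {second[0]}"
    ordered = [first, second] if connector == "before" else [second, first]
    return text, [(target, visits) for _, target, visits in ordered]


def episode_outcome(subtasks, visited_targets):
    """Return "success", "failure" or "incomplete" for a sequence of visits."""
    index, remaining = 0, subtasks[0][1]
    for target in visited_targets:
        if target != subtasks[index][0]:
            # Reaching a later target too early fails the whole instruction.
            if any(target == later for later, _ in subtasks[index + 1:]):
                return "failure"
            continue  # Other objects are ignored in this sketch.
        remaining -= 1
        if remaining == 0:
            index += 1
            if index == len(subtasks):
                return "success"
            remaining = subtasks[index][1]
    return "incomplete"
```

With the example from the text, visiting the blue box twice and then the yellow key yields "success", whereas visiting the yellow key after a single visit to the blue box yields "failure".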
3.2 Model
For our base agent, we select the Small BabyAI model, originally introduced by Chevalier-Boisvert et al. (2018). This model combines the language instruction and world representation in an Actor-Critic architecture (Szepesvári, 2010). The instruction is encoded by a GRU over a fixed vocabulary, after which it is combined with the observation through two FiLM (Perez et al., 2018) layers. The output generated by these layers is passed into an LSTM to allow for temporal feedback connections. Ultimately, the LSTM’s output is used in an actor network to generate actions and a critic network to generate state values. The agent is optimized using Proximal Policy Optimization (PPO; Schulman et al., 2017), a sample-efficient actor-critic approach.
3.3 Diagnostic classification
As an extension to the base model, the model is made interpretable through the addition of a diagnostic classifier. This classifier is tasked with providing an intuitive explanation of the agent’s behavior when asked, making it more interpretable for humans. It does so by generating, at every time step, the current target for the agent. While this does not directly give a justification for individual moves, it does give an idea of the current focus of the agent. Since we consider complex instructions, there are at least two subtasks to be completed, and through the classification the agent signifies its current objective (e.g. I’m trying to complete ). By means of this extra task, we aim to make the agent aware of the compositional nature of the instruction. The agent now has access to a signal that indicates the separation between two objects in its environment, and it is up to the agent to learn to compose previously learned behavior, and become more efficient.
Table 1: Overview of all levels. The last column denotes any special feature in each level.
To create the labels for the diagnostic classifier, we exploit the temporal relation between the sub-tasks. This way, the agent is trained to visit the objectives in order, and the focus of the agent should follow this same order. In the levels, there are N unique object type/color combinations, as can be generated by the Baby Language. By enumerating all N combinations, a mapping can be created. Subsequently, the labeler takes the language instruction and the current status of visits (e.g. whether the agent has already visited the first target), and uses this mapping to generate a label. Since only the final label, and not the grammar or task status, is exposed to the agent, we avoid providing further external information.
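The labeling procedure above can be sketched as follows. This is an illustration under assumptions rather than the authors' labeler: the vocabulary, the `CLASS_INDEX` mapping and the function `diagnostic_label` are hypothetical names, and the visit status is simplified to a count of correct target visits so far.

```python
from itertools import product

# Hypothetical vocabulary; N = len(COLORS) * len(OBJECTS) combinations.
COLORS = ["red", "green", "blue", "yellow"]
OBJECTS = ["ball", "box", "key"]

# Enumerate all color/object combinations once to obtain a fixed class index.
CLASS_INDEX = {combo: i for i, combo in enumerate(product(COLORS, OBJECTS))}


def diagnostic_label(ordered_subtasks, visits_done):
    """Return the class index of the agent's current target.

    ordered_subtasks: list of ((color, object), required_visits) in execution order.
    visits_done: number of correct target visits completed so far in the episode.
    """
    for target, required in ordered_subtasks:
        if visits_done < required:
            return CLASS_INDEX[target]
        visits_done -= required
    # All subtasks are finished; keep reporting the final target.
    return CLASS_INDEX[ordered_subtasks[-1][0]]
```

For go to the blue box twice before go to the yellow key, the label stays on the blue box until the second visit is completed and then switches to the yellow key. Only this integer label would be exposed to the agent's training signal.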
Finally, the labels are used to train the diagnostic classifier, which is a linear mapping from the LSTM’s hidden state to the N unique object/color combinations. Because of the classification task, a cross-entropy term, weighted by a coefficient, is added to the PPO reward function. This results in Equation 1, which takes the class labels and the classifier’s output probabilities as inputs.
Note that this differs from regularized approaches for RL where the regularization term is computed w.r.t. the current policy estimate (e.g., Neu et al., 2017). This regularization term can be interpreted as a form of reward shaping (Ng et al., 1999).
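As a hedged sketch of the combined objective described above, it can be written in the following general form. The notation is an assumption made for illustration and is not the paper's own: $L_{\text{PPO}}$ denotes the PPO term, $\lambda$ the weighting coefficient, $y$ the one-hot class label and $\hat{y}$ the classifier's output probabilities over the N combinations. The expression is written as a loss to be minimized; if the PPO term is instead expressed as an objective to be maximized, the cross-entropy term enters with the opposite sign.

```latex
\begin{equation*}
L_{\text{total}}(\theta) = L_{\text{PPO}}(\theta) + \lambda \, L_{\text{CE}}(y, \hat{y}),
\qquad
L_{\text{CE}}(y, \hat{y}) = -\sum_{c=1}^{N} y_c \log \hat{y}_c
\end{equation*}
```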
Below, four experiments are outlined. The experiments are designed to quantify how the additional classifier in the agent is affecting its interpretability, and to check whether it has impacted the agent’s compositionality.
As a measure of the agent’s performance over time, different metrics are used. These metrics show how proficient an agent is in completing the overall instruction, or how consistently it can complete levels. The following are used:
• Diagnostic accuracy: The average accuracy of the diagnostic object prediction.
• Success rate: The fraction of all episodes that end with a positive reward. In other words: the ratio of episodes that ended within the maximum number of steps and did not end in a failure.
• Episode length: The average number of steps required for the completion of a level. At most, this is the maximum number of steps defined for each level.
• Failure rate: The ratio of episodes that end with the agent failing a task. Since there is a temporal ordering in the connectors, the agent is not allowed to visit the target objects out of order. Similarly, if the agent fails to obey the twice/thrice modifier, the agent can fail the task by arriving at the next object too early.
• Timeout rate: The ratio of episodes that end without the agent completing the whole instruction, reaching the maximum number of steps.
Figure 1: Visual overview of the environment. The light gray area is currently in view for the agent, represented by the red triangle. The green ball and green square are the two objectives. Every small arrow is a future action taken by the agent, while simultaneously providing an object classification. For the white arrows, the correct label is “green ball.” For the blue arrows, the correct label is “green box.”
Unless otherwise specified, we report the mean and standard deviation over at least three different seeds to account for randomness factors in network initialization, the environment generation and the optimization process.
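The rate metrics and the aggregation over seeds can be sketched as follows. The episode record format (a dict with the keys "outcome" and "steps") and the function names are assumptions for illustration. For simplicity, the sketch averages episode length over all episodes.

```python
from statistics import mean, stdev


def level_metrics(episodes):
    """Aggregate episode records into the rates described above.

    Each episode is a dict with hypothetical keys:
    "outcome" in {"success", "failure", "timeout"} and "steps" (int).
    """
    total = len(episodes)

    def rate(outcome):
        return sum(1 for e in episodes if e["outcome"] == outcome) / total

    return {
        "success_rate": rate("success"),
        "failure_rate": rate("failure"),
        "timeout_rate": rate("timeout"),
        "episode_length": mean([e["steps"] for e in episodes]),
    }


def over_seeds(per_seed_metrics):
    """Report (mean, standard deviation) of every metric across seeds."""
    return {
        key: (
            mean([m[key] for m in per_seed_metrics]),
            stdev([m[key] for m in per_seed_metrics]),
        )
        for key in per_seed_metrics[0]
    }
```

Note that `stdev` is the sample standard deviation and needs at least two seeds; whether the reported numbers use the sample or population variant is not stated in the text.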
4.1 Diagnostic training
In this initial experiment, we add the diagnostic classifier to the agent (Aware model), and look at differences in how the training of the two models (Baseline and Aware) develops. For the Aware model, we also record the diagnostic classifier’s accuracy during training.
Furthermore, we perform an offline training test to check whether the hidden states of the agent are affected by diagnostic classification. Both the Baseline and Aware converged models are put in inference mode, and run for a fixed number of episodes. For all frames in these episodes, the hidden states and the correct diagnostic target are recorded. Together, they form an offline dataset, which we can use to train a new classifier, identical to the one used in the RL training. We then compare performance of the new classifier trained on the Baseline- vs. Aware-generated datasets.
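The offline test amounts to fitting a fresh linear classifier on recorded (hidden state, target label) pairs. A minimal sketch in plain Python is given below; the optimizer (plain SGD), the hyperparameters and all function names are assumptions, and in practice the states would be the recorded LSTM hidden states rather than toy vectors.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, bias, state):
    """Linear map from a hidden state to class probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, state)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def train_probe(states, labels, num_classes, epochs=50, lr=0.1, seed=0):
    """Fit a linear softmax classifier with plain SGD on cross-entropy."""
    rng = random.Random(seed)
    dim = len(states[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    order = list(range(len(states)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = predict_proba(weights, bias, states[i])
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit c is (p_c - y_c).
                grad = probs[c] - (1.0 if c == labels[i] else 0.0)
                bias[c] -= lr * grad
                for d in range(dim):
                    weights[c][d] -= lr * grad * states[i][d]
    return weights, bias


def accuracy(weights, bias, states, labels):
    """Fraction of states whose most probable class matches the label."""
    hits = 0
    for state, label in zip(states, labels):
        probs = predict_proba(weights, bias, state)
        hits += int(probs.index(max(probs)) == label)
    return hits / len(states)
```

Running `train_probe` once on the Baseline-generated dataset and once on the Aware-generated dataset, and comparing `accuracy` on held-out frames, would mirror the comparison described above.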
4.2 Source-level performance
Here, we observe the performance of the two models on the levels they have been trained on. Since the Aware model has the added task of making its hidden states explainable for a small classifier, convergence might take longer than for the Baseline model. Furthermore, the base performance on the two novel complex levels can be examined using the source-level performance.
4.3 Zero-shot generalization
Next, we check whether the Aware agent can use the extra training signal for separating subtasks. By introducing an unseen characteristic to objects in the environment, the agent now has to identify which object it does know, and generalize learned behavior to the unknown object. Being able to isolate single objects in the environment should help the agent in this type of generalization.
Specifically, we consider the following cases:
• Color: One object’s color is replaced with an unknown color.
• Object: One object’s type is replaced with an unknown type.
• ColorObject: One object’s type and color are both replaced with unknowns.
In all cases, we only change a single object in the environment, such that the agent should be able to deduce which object is altered. This aids the agent in completing the given instruction, whereas changing multiple objects could lead to the agent visiting the objects in the wrong order more often.
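The three cases can be illustrated with a small helper that alters exactly one object in a list of (color, type) pairs. The function name and the placeholder values for the unseen color and type are hypothetical, and the sketch leaves open how the instruction itself refers to the altered object.

```python
def zero_shot_variant(objects, index, case,
                      unseen_color="unseen_color", unseen_type="unseen_type"):
    """Return a copy of `objects` with exactly one object altered.

    objects: list of (color, object_type) tuples present in the environment.
    case: "color", "object" or "color_object".
    """
    color, obj_type = objects[index]
    if case == "color":
        replacement = (unseen_color, obj_type)
    elif case == "object":
        replacement = (color, unseen_type)
    elif case == "color_object":
        replacement = (unseen_color, unseen_type)
    else:
        raise ValueError(f"unknown case: {case}")
    altered = list(objects)  # Leave the original environment untouched.
    altered[index] = replacement
    return altered
```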
4.4 Sparse classification
Lastly, an attempt is made to make the guiding signal more realistic. In the original setting, we ask the agent for its current objective at every timestep in the environment. For a human learner, however, this is intuitively too frequent; such a question would only be asked occasionally.
Therefore, we lower the frequency of the diagnostic classification. Instead of at every frame, a classification is asked for at most three times per game episode. The extra signal is now much weaker: the classifier might need more time to reach convergence, and the feedback on the agent’s hidden states is less dominant. We explore whether this is beneficial to the agent, considering the criteria from before. Because training the agent takes significantly longer in this setting, only the BEFORE and MIXED-2 levels are considered.
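One way to realize such a sparse signal is to select at most three time steps per recorded episode and to compute the cross-entropy only at those steps. The sketch below assumes uniform random selection over a finished rollout; the text does not specify how the query moments are chosen, so both the selection rule and the function names are assumptions.

```python
import math
import random


def sparse_query_steps(episode_length, max_queries=3, rng=None):
    """Pick the time steps at which the agent is asked for its objective."""
    rng = rng or random.Random()
    count = min(max_queries, episode_length)
    return sorted(rng.sample(range(episode_length), count))


def masked_cross_entropy(probabilities, labels, query_steps):
    """Average cross-entropy over the queried steps only.

    probabilities[t][c] is the predicted probability of class c at step t,
    labels[t] is the correct class index at step t.
    """
    if not query_steps:
        return 0.0
    losses = [-math.log(probabilities[t][labels[t]]) for t in query_steps]
    return sum(losses) / len(losses)
```

Steps outside `query_steps` contribute no classification gradient, which is why the influence on the hidden states is expected to be weaker than in the dense setting.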
Below, we give an overview of the results per experiment. First, the performance of the diagnostic classification task in general is presented. Second, all zero-shot experiments are shown. Finally, we elaborate on observations made during the sparse classification task.
5.1 Diagnostic training
See Figure 2. The Baseline model only has access to a classifier trained on the offline collected dataset, while the Aware model was evaluated at two different stages: once after training with RL, and once after retraining on the offline dataset.
Across all cases, the Aware model is able to predict the correct objective consistently. In the RL stage, the classifier is successfully trained, which indicates that the agent is still able to converge to a stable optimum. Furthermore, the subsequent difference between the offline trained classifiers shows that the hidden states are positively affected by the training process: the Aware model’s states are better suited for retraining the same classifier using a restricted dataset, and thus are more easily interpretable than the Baseline. Since this dataset is only a fraction of the number of frames that the agent observed during the RL stage, performance is slightly lower. Still, this shows that only a limited dataset is required before a classifier can be trained for the Aware model.
Figure 2: Diagnostic classification accuracy (and standard deviations) of the Baseline and Aware model on all levels.
5.2 Source-level performance
See Table 2. In the two simplest levels, there is little difference between the models, as both models agree on a seemingly optimal policy. However, on the complex levels, the two models show different behavior. Especially on the BEFORE (REPEAT) level, the Aware model is able to reach a faster policy. Repeating a subtask is easier if the agent learns to disentangle objects better, and the increase in success rate shows that the Aware model is able to complete the compound instruction more often.
In Figure 3, the training progress is plotted over the episode length metric for the two simple levels. Even though both models reach the same performance, there is a slight difference in their speed. Instead of the Aware model taking longer because of the classification task, it can exploit the additional signal to learn slightly faster.
5.3 Zero-shot generalization
See Table 3. In this case, we see an improvement for the Aware model in the two simple levels. Both in episode length and in success rate, the Aware model outperforms the baseline. Here, the MIXED-2 level shows a larger difference than the easier BEFORE level. This is evidence for the need for complexity before the agent is able to exploit the language instruction fully.
Figure 3: Training progress for the two models over the episode length metric, for two different levels. The dashed line indicates the Baseline, the solid indicates the Aware model. Each line is an average over multiple individual runs.
However, for the complex levels, this difference is less visible, although the Aware model still holds up to the baseline. When presented with only a new color, the Aware agent is significantly faster, but in all other cases performance is comparable. Interestingly, the Aware model fails the whole instruction less often, but instead times out in both levels. This shift in termination reason is most likely due to the agent understanding that the known object in the level should not be visited yet, while failing to identify the unknown object. Upon inspection of the learned policy, the agent actively avoids the known object, but does not reach the other object in most cases. This shows that the training procedure did aid the agent in understanding its environment better: previously seen objects are more successfully identified, and the agent seems to know about their visiting order.
5.4 Sparse classification
Lastly, we present the results for the sparse diagnostic classification in Figure 4, Table 4 and Table 5.
In comparison with the Baseline and Aware model, learning a policy for traversing the two simple levels does not take longer and reaches the same optimum as before (see Table 4). This is because the RL agent itself is unaffected by the changes in the classification procedure. However, the impact on the hidden states is considerably lower, as can be seen in Figure 4. Here, the offline retrained classifier is not as easy to train as for the standard Aware model. Still, compared to the earlier Baseline results, the Sparse classification is able to instigate some changes to the latent space.
Table 2: Performance of trained models in the source levels.
Table 3: Performance of a trained model on the source levels, applied in the new transfer learning setting. Here, there are three new scenarios: 1) a novel color, 2) a new type of object, 3) a combination of both.
Figure 4: Diagnostic classification results for the standard and sparse versions of the Aware model. All values are averaged over at least two runs, with a standard deviation under 0.05.
Table 4: Intra-level results for the standard and sparse versions of the Aware model. BA is the MIXED-2 level. The policies learned by all agents were comparable and did not differ significantly over multiple runs.
Table 5: Zero-shot performance of the sparsely trained models, compared to the standard Aware model.
In the zero-shot experiments, there are some slight improvements in episode lengths and success rates. The hidden states may now be in balance between interpretability, as they can be organized by the retrained classifier to a certain degree, and efficiency, as the agent generalizes them to unseen situations.
In this paper, we explored the addition of a simple classification task to a complex instruction-following RL problem. Through this addition, the agent was intended to become both more interpretable, and more aware of the compositional nature of the instructions. The results indicate that the agent is able to provide its current objective consistently, with minimal impact on the policy itself. Furthermore, these modified agents can be shown to be more general in zero-shot settings, suggesting that the added training signal helps in disentangling objects.
Future research should focus on expanding the level set that the agent was trained and evaluated on. Other types of instructions from the BabyAI environment, such as Pick up or Put next, add more complexity to the task that the agent has to accomplish, and could also benefit from the improvements in object disentanglement. Additionally, adding obstacles such as separate rooms connected by doors, or distractor objects, can interfere with the current setup. These situations form an interesting case for testing the diagnostic classification.
Finally, creating a more explicit hierarchical structure for the agent could make it more efficient in composing learned skills (e.g. Sutton et al., 1999). Such a hierarchical approach could use the training signal to train elementary skills and compose them more efficiently than in the current model.
Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, and Edward Grefenstette. 2018. Learning to understand goal specifications by modelling reward.
Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 workshop on explainable AI (XAI), volume 8, page 1.
Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 751–761.
Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. 2005. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281–1288.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2018. Babyai: First steps towards grounded language learning with a human in the loop. arXiv preprint arXiv:1810.08272.
Juraj Dzifcak, Matthias Scheutz, Chitta Baral, and Paul Schermerhorn. 2009. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In 2009 IEEE International Conference on Robotics and Automation, pages 4163–4168. IEEE.
Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. 2017. Reverse curriculum generation for reinforcement learning. In CoRL, volume 78 of Proceedings of Machine Learning Research, pages 482–495. PMLR.
Mikhail Frank, Jürgen Leitner, Marijn Stollenga, Alexander Förster, and Jürgen Schmidhuber. 2014. Curiosity driven reinforcement learning for motion planning on humanoids. Frontiers in neurorobotics, 7:25.
Nakul Gopalan, Michael L Littman, James MacGlashan, Shawn Squire, Stefanie Tellex, John Winder, Lawson LS Wong, et al. 2017. Planning with abstract markov decision processes. In Twenty-Seventh International Conference on Automated Planning and Scheduling.
Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. 2017. Visualizing and understanding atari agents. arXiv preprint arXiv:1711.00138.
Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83. Springer.
Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer, David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al. 2017. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551.
Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.
Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. 2018. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
Thomas Kollar, Stefanie Tellex, Deb Roy, and Nicholas Roy. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE international conference on Human-robot interaction, pages 259–266. IEEE Press.
Isaac Lage, Andrew Ross, Samuel J Gershman, Been Kim, and Finale Doshi-Velez. 2018. Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems, pages 10159–10168.
Brenden M Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
Brenden M. Lake and Marco Baroni. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2879–2888.
Dipendra Misra, John Langford, and Yoav Artzi. 2017. Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795.
Sanmit Narvekar, Jivko Sinapov, and Peter Stone. 2017. Autonomous task sequencing for customized curriculum design in reinforcement learning. In IJCAI, pages 2536–2542. ijcai.org.
Gergely Neu, Anders Jonsson, and Vicenç Gómez. 2017. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798.
Andrew Y Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287.
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. 2017. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17.
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Özgür Şimşek and Andrew G Barto. 2006. An intrinsic reward mechanism for efficient exploration. In Proceedings of the 23rd international conference on Machine learning, pages 833–840. ACM.
Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211.
Csaba Szepesvári. 2010. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103.
Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 806–814. Association for Computational Linguistics.
Markus Werning, Wolfram Hinzen, and Edouard Machery. 2012. The Oxford handbook of compositionality. Oxford Handbooks in Linguistics.
Terry Winograd. 1971. Procedures as a representation for data in a computer program for understanding natural language. Technical report, Massachusetts Institute of Technology, Cambridge, Project MAC.
Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. 2018. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209.
Yuxin Wu and Yuandong Tian. 2016. Training agent for first-person shooter game with actor-critic curriculum learning.
Haonan Yu, Haichao Zhang, and Wei Xu. 2018. Interactive grounded language acquisition and generalization in a 2d world. arXiv preprint arXiv:1802.01433.
Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. 2016. Graying the black box: Understanding dqns. In International Conference on Machine Learning, pages 1899–1908.