Brown University, Department of Computer Science. This writeup is intended as a Final Project submission for CSCI 1951M: The Great Ideas in Computer Science, Fall 2019.

At a high level, ‘Imitation Learning’ attempts to learn a general skill after observing some demonstrations of an expert performing the skill. Crucially, this learned skill is expected to generalize to situations where the learner has not observed the expert’s actions. (Material in this section is adapted from Levine [11].)

More formally, let $s_t$ represent a set of observations obtained from sensors at time $t$ that constitute the state. Let $a_t$ represent a set of actions (e.g., voltages sent to the motors of a robot) taken by the expert at time $t$. Let $T$ represent a trajectory, that is, a set of $n$ state, expert-action pairs [$(s_t, a_t)$ for $t \in \{1, \dots, n\}$]. Imitation Learning assumes it is given $m$ of these trajectories as input. The expected output is a policy $\pi_\theta(a_t \mid s_t)$ (where $\theta$ represents a set of policy parameters) that maps states to actions. Specifically, it is taken to be a probability distribution over the actions to be taken, conditioned on the state observed at each time step. Crucially, Imitation Learning hopes that the learned policy’s distribution of actions will match the expert’s distribution of actions (i.e., the distribution that the $a_t$ are drawn from).
Given the substantial research into and success of Machine Learning methods, the most natural thing to do might be to frame the Imitation Learning problem as a ‘Supervised Learning’ problem. This can be done rather simply: use any Supervised Learning algorithm to learn a function mapping states ($s_t$) to actions ($a_t$) after training on the collected trajectories of observed states and expert actions [$(s_t, a_t)$ for $t \in \{1, \dots, n\}$]. This is the essence of Pomerleau’s approach [12] to using Imitation Learning for Autonomous Driving.
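The Supervised Learning framing can be illustrated with a minimal sketch. Here a linear least-squares model stands in for the neural network, and the expert, its gains, and the two state features are hypothetical values chosen only for illustration; none of them come from Pomerleau's system.

```python
import numpy as np

def fit_behavior_cloning(states, actions):
    """Least-squares fit of a linear map from states to expert actions."""
    # Append a constant column so the model can also learn a bias term.
    X = np.hstack([states, np.ones((states.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W

def policy(W, state):
    """Predict an action for a single state with the fitted weights."""
    return np.append(state, 1.0) @ W

# Hypothetical expert: steers against lateral offset and heading error.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 2))           # (lateral offset, heading error)
true_gain = np.array([[-0.8], [-0.3]])       # assumed expert gains
actions = states @ true_gain + 0.01 * rng.normal(size=(200, 1))

W = fit_behavior_cloning(states, actions)
predicted = policy(W, np.array([1.0, 0.0]))  # steering command for one state
```

The fit only ever sees states that the expert visited, which is the property that later sections contrast with methods that interact with the environment.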
As can be seen from Fig. 1, Pomerleau used a Deep Neural Network (DNN) to perform his Supervised Learning of the expert actions. The network took two inputs: video and data from a laser range finder mounted on the car. It used these to produce a 45-unit vector representing the curvature that the vehicle would need to travel along to reach the center of the road. It also produced an output indicating how much the road visually contrasted with its surroundings. This output was fed back into the network as an input to improve its prediction accuracy.
Unlike the authors of many contemporary DNNs, Pomerleau did not train his network on a real-world dataset because it was infeasible for him to collect, store and process the necessary amount of data. Instead, he created a ‘road simulator’ that would produce realistic images of roads with added noise and varying lighting conditions. Pomerleau’s simulator also generated laser range-finder input corresponding to these images. Finally, the simulator would produce ‘expert’ actions based on the known ground-truth curvature of the road it produced.
Pomerleau’s approach exhibited rather impressive performance. His trained DNN was able to drive CMU’s NAVLAB (a modified Chevy van equipped with sensors and computers) at a speed of 0.5 meters per second through a 400-meter area of CMU’s campus under sunny conditions. This performance was comparable to that achieved by the state-of-the-art hand-engineered vision and navigation algorithms of the time. Given that this happened in 1989 [12], a time when most computers had approximately 4 MB of RAM and DNNs were widely believed to be useless, Pomerleau’s results are truly remarkable.

Fig. 1: An image from Pomerleau’s paper depicting his neural network.
Even though Pomerleau only tested his algorithm on one small environment in optimal weather conditions, his results are still noteworthy. After having developed his road simulator, Pomerleau was able to accomplish in “30 minutes of training time” what took CMU’s Vision and Navigation groups months of hand-tuning their algorithms. This work was one of Imitation Learning’s first major successes and it seems to have catalyzed a strong research interest in the field [3] [4] [2].
Inspired by the success of Pomerleau [12] and others in using Imitation Learning to solve complex problems, Abbeel and Ng [2] sought to view the Imitation Learning problem through a different lens. While almost all previous work had framed Imitation Learning as a Supervised Learning problem, their paper leveraged Reinforcement Learning (RL).
Roughly, RL attempts to learn a policy that maps states that an agent could be in ($s$) to actions to be taken from that state ($a$) at every time step. Crucially, the policy is learned such that the actions it takes maximize some reward function $R(s, a)$. At the time that this paper was written, a number of RL algorithms existed to learn this policy.
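As one classical example of how a policy can be obtained from a reward function, the sketch below runs value iteration on a tiny, fully known environment. It is not necessarily the algorithm used in the paper. The three-state chain, its rewards, and the assumption that the transition model is known are all hypothetical and chosen only for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s2] are transition probabilities, R[s, a] are rewards.

    Returns the greedy policy (one action index per state) and state values.
    """
    n_states, _ = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s2 of P[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new

# Hypothetical chain of 3 states; action 0 moves left, action 1 moves right.
n_states = 3
P = np.zeros((2, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0

R = np.zeros((n_states, 2))
R[2, :] = 1.0  # being in the rightmost state is rewarded

greedy_policy, values = value_iteration(P, R)
```

With this reward, the greedy policy moves right from every state, which is the behavior the reward was designed to encourage.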
It is important to note that RL algorithms require a reward function $R(s, a)$ to produce the policy $\pi$. Abbeel and Ng’s key idea in this paper is that such an $R$ can be derived from the expert trajectories that Imitation Learning assumes as input. This is what they call ‘Inverse Reinforcement Learning’ (IRL). Once this reward function ($R$) is obtained, one can use any RL algorithm to obtain a policy $\pi$ that maximizes this reward function. If we assume that the expert was attempting to maximize $R$, then the policy $\pi$ returned by the RL algorithm will (roughly) match the expert’s policy. In this way, the Imitation Learning problem can be solved by first using IRL to derive a reward function and then using RL to obtain a policy.
Fig. 2: A visual illustration of how Imitation Learning can be performed by the combination of RL and IRL, obtained from [13]. Since RL uses a reward function as input and outputs optimal behavior, the method that takes an expert’s optimal behavior as input and outputs a reward function is termed ‘Inverse’ RL.
Unfortunately, Abbeel’s IRL method cannot take only the expert trajectories as input. It also requires some set of features over states that it can use to learn the reward function. In the case of driving a car, these features might be specific aspects we want the car to optimize, such as whether it has just collided with another car, whether it is in the middle of a lane, etc. Thus, Abbeel’s method requires this extra human-specified input in addition to the expert trajectories, whereas Pomerleau’s method [12] operates on the trajectories directly.
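To make the role of these features concrete, the following sketch assumes a reward that is linear in hand-specified state features, and compares the discounted feature averages of an expert with those of another driver. The two features, the toy trajectories, and the single weight update shown are illustrative assumptions; this is not the complete algorithm of Abbeel and Ng [2].

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Average discounted sum of feature vectors over a list of state sequences."""
    totals = [
        sum((gamma ** t) * phi(s) for t, s in enumerate(traj))
        for traj in trajectories
    ]
    return np.mean(totals, axis=0)

def linear_reward(w, phi, state):
    """Reward assumed to be a weighted sum of the state features."""
    return float(w @ phi(state))

# Hypothetical state: (lane index, collided flag); lane 1 is the middle lane.
def phi(state):
    lane, collided = state
    return np.array([float(lane == 1), float(collided)])

expert_trajs = [[(1, 0), (1, 0), (1, 0)]]   # stays in the middle, never collides
learner_trajs = [[(0, 0), (0, 1), (0, 1)]]  # wrong lane, then collisions

mu_expert = feature_expectations(expert_trajs, phi)
mu_learner = feature_expectations(learner_trajs, phi)

# One illustrative step: point the reward weights from the learner's
# feature expectations towards the expert's.
w = mu_expert - mu_learner
w = w / np.linalg.norm(w)

good = linear_reward(w, phi, (1, 0))  # middle lane, no collision
bad = linear_reward(w, phi, (0, 1))   # wrong lane, collision
```

The resulting weights can be read off directly (positive for staying in the middle lane, negative for collisions), which is the kind of inspection an explicit reward function allows.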
Abbeel is able to use his method to train a car to perform various complicated behaviors in simulation. He demonstrates 5 different learned behaviors, with policies ranging from simply avoiding all other cars on the road, to driving only within the right lane and going off-road to overtake other cars, to intentionally crashing into the first car detected.
Overall, this paper introduced a different way of viewing the Imitation Learning problem than that used by Pomerleau [12] and most others before it. It is important to note that this paper did not claim that using IRL and RL to perform Imitation Learning is necessarily better than using Supervised Learning; it is simply different. One aspect of Abbeel’s method that was appealing to many in the community is that it learns an explicit reward function. By inspecting this function, one can determine the quantities the agent is attempting to optimize and thus gain some understanding of why the policy takes the actions it does. In this manner, Abbeel’s method is more human-understandable and explainable than the Supervised Learning approaches that came before it.
In the years after the publication of the IRL paper by Abbeel and Ng [2], significant research interest developed in using IRL methods to learn complex, real-world skills. IRL methods were found to be less error-prone than Supervised Learning methods for Imitation Learning. Since Supervised Learning methods are trained to mimic the decisions the expert made at each time step, small deviations made early on in a trajectory can lead to compounding errors later in the trajectory. IRL methods, on the other hand, learn the expert’s reward function and are thus able to correct small errors simply by optimizing for the maximum reward. Intuitively, IRL methods have an explicit notion of the expert’s goal and can thus optimize for it, while direct Supervised Learning methods simply attempt to copy what the expert did and generalize slightly [10].
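The compounding effect can be illustrated with a deliberately pessimistic toy model, which is an assumption made here and not a result from the papers discussed. Suppose the learner errs with probability `eps` at each step while it is still on states the expert visited, and that after its first error it never recovers, so every later step also counts as a mistake.

```python
def expected_mistakes_independent(eps, horizon):
    """Expected mistakes if every step fails independently with probability eps."""
    return eps * horizon

def expected_mistakes_compounding(eps, horizon):
    """Expected mistakes if the learner never recovers after its first error.

    Step t counts as a mistake whenever the first error happened at or
    before step t, which has probability 1 - (1 - eps) ** t.
    """
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))

independent = expected_mistakes_independent(0.01, 100)
compounding = expected_mistakes_compounding(0.01, 100)
```

Under these assumptions, the same per-step error rate yields far more expected mistakes over a long trajectory when errors compound than when they are independent.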
However, IRL methods were found to be slow and inefficient with data. This is because they need to iteratively estimate the expert’s reward function and run an RL algorithm to convergence. RL algorithms are notorious for requiring a large number of environment steps to converge. This is especially crippling when attempting to learn complex behaviors in large, high-dimensional environments. Ho and Ermon’s GAIL paper [10] attempts to remedy precisely this efficiency issue.
Ho and Ermon’s paper is riddled with mathematical intricacies, but at a high level their key insight is this: the process of performing IRL then RL implicitly seeks to produce a policy ($\pi$) whose distribution of state-action pairs is similar to the state-action pair distribution of the expert policy ($\pi_E$). This process of training a policy to match the state-action distribution of an expert policy can be done using a Generative Adversarial Network (GAN) [8], which is a specific Neural Network architecture built to learn to match an arbitrary distribution. Performing this distribution matching with a GAN instead of with the combination of IRL and RL is significantly more data-efficient because it does not need to run an RL algorithm to convergence during each training iteration. Using a GAN in this way to more efficiently perform IRL followed by RL is what the authors refer to as ‘Generative Adversarial Imitation Learning’ (GAIL).
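For reference, the adversarial game can be written compactly. The expression below sketches the saddle-point objective in the form given by Ho and Ermon [10]. The symbols $D$ (a discriminator that scores state-action pairs), $H(\pi)$ (the causal entropy of the policy) and $\lambda \geq 0$ (a weight on the entropy term) are not defined elsewhere in this writeup and are introduced here only for this illustration.

```latex
\min_{\pi} \; \max_{D} \;\;
  \mathbb{E}_{\pi}\!\left[ \log D(s, a) \right]
  + \mathbb{E}_{\pi_E}\!\left[ \log\bigl(1 - D(s, a)\bigr) \right]
  - \lambda H(\pi)
```

The discriminator is trained to tell the learner’s state-action pairs apart from the expert’s, while the policy is updated so that its pairs become harder to distinguish from the expert’s.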
Fig. 3: An image from a presentation by one of the authors. GAIL was able to learn to make a humanoid and an ant walk in simulation with only approximately 50 timesteps of state-action pairs from a trajectory. The humanoid has approximately 17 joints and the ant has approximately 8 joints that the policy must learn to control individually to generate the desired behavior.
The paper experimentally demonstrates that GAIL is much more efficient with respect to expert data than ‘Behavior Cloning’ methods on a number of high-dimensional tasks in simulation. ‘Behavior Cloning’ methods are simply the direct Supervised Learning methods for Imitation Learning used by Pomerleau [12] and discussed in Section III. Furthermore, GAIL is compared with two versions of state-of-the-art IRL algorithms, including a version of the algorithm from Abbeel and Ng [2]. GAIL is shown to achieve superior performance given significantly fewer expert trajectories.
In summary, Ho and Ermon [10] introduced a novel method called GAIL that can induce the same policy that IRL methods would, but in a more data-efficient manner. Importantly, even though GAIL uses a DNN to learn its policy, it still requires access to the environment just as IRL methods do. This contrasts with the Supervised Learning methods (such as Pomerleau’s) that can be trained only on the expert demonstrations without needing access to the environment. Furthermore, GAIL is more efficient than IRL methods only in terms of expert demonstration data. The authors note that it is not very efficient in terms of the number of interactions required with the environment.
Ho and Ermon [10] experimentally demonstrated that GAIL is more data-efficient than direct Supervised Learning (or Behavior Cloning). However, they did not provide any strong theoretical justification for why GAIL is able to do this. In fact, as mentioned at the end of Section IV, IRL methods are simply different from direct Supervised Learning methods. Aside from some high-level intuitions, it was theoretically unclear why they might offer more robust performance. The recent paper by Ghasemipour et al. [7] sets out to understand the differences between the various approaches to Imitation Learning at a theoretical level, to build a unifying perspective amongst them and, hopefully, to use this perspective to develop novel methods.
Section V introduced the insight of Ho and Ermon [10] that IRL methods are really just attempting to learn a policy whose state-action distribution matches the state-action distribution of the expert’s policy. Ghasemipour builds on this by proving that all Imitation Learning methods are simply attempting to minimize some measure of divergence between the expert’s state-action distribution and the learned policy’s state-action distribution. Let’s use $\rho_{\pi_E}(s, a)$ to denote the distribution of states and actions encountered when following the expert’s policy. Similarly, let $\rho_{\pi}(s, a)$ denote the distribution of states and actions encountered when following the learned policy. Ghasemipour specifically shows that GAIL and related methods seek to minimize the divergence between $\rho_{\pi_E}(s, a)$ and $\rho_{\pi}(s, a)$, whereas direct Supervised Learning methods seek to minimize the divergence between $\pi_E(a_t \mid s_t)$ and $\pi(a_t \mid s_t)$, where $a_t$ denotes actions taken at time $t$ and $s_t$ denotes the state at time $t$. Ghasemipour experimentally demonstrates that it is precisely this difference - that direct Supervised Learning methods attempt to match the expert’s distribution of actions conditioned on states while GAIL and other IRL-based methods attempt to match the expert’s joint distribution of actions and states - that makes GAIL perform better than the direct Supervised Learning methods on complex tasks in high-dimensional state spaces.
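The distinction between matching conditional and joint distributions can be made concrete with a small numerical sketch. The 2-state, 2-action distributions below are hypothetical, and the state marginals are set by hand purely to show the two quantities side by side; in a real environment the state distribution is induced by the policy and the dynamics together.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions given as arrays of the same shape."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical numbers: both policies share the same action conditionals ...
cond = np.array([[0.8, 0.2],
                 [0.3, 0.7]])          # row s holds pi(a | s)
# ... but are assumed to visit the two states with different frequencies.
d_expert = np.array([0.9, 0.1])
d_learner = np.array([0.5, 0.5])

rho_expert = d_expert[:, None] * cond    # joint distribution over (s, a)
rho_learner = d_learner[:, None] * cond

# Conditional mismatch, averaged over the states the expert visits.
conditional_div = sum(d_expert[s] * kl(cond[s], cond[s]) for s in range(2))
# Joint mismatch under two different divergence measures.
joint_kl = kl(rho_expert, rho_learner)
joint_js = js(rho_expert, rho_learner)
```

Here the conditional divergence is zero while both joint divergences are positive, so an objective defined on joint distributions registers a mismatch that a purely conditional objective does not.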
Additionally, given the insight that all Imitation Learning methods just seek to match state-action distributions, Ghasemipour raises the question of whether we should even infer these distributions from expert demonstrations. Instead, he introduces a variant of a state-of-the-art IRL method (Fu et al. [6]) that he calls FAIRL and then uses a version of this to learn to perform tasks that are directly specified. Specifically, instead of providing expert demonstrations, Ghasemipour hand-specifies the distribution of states and actions he wants the learned policy to follow.
Fig. 4: An image from Ghasemipour’s paper showing a visualization of experiments performed on learning a policy to match a hand-specified distribution. Subfigures (a) through (d) are visualizations of the distributions for movement actions Ghasemipour specified and corresponding visualizations of the distributions the learned policies had. The images on the far left showcase the simulation environments for the ‘Fetch’ (top) and ‘Pusher’ (bottom) tasks respectively.
To summarize, this paper presented a theoretical justification for why recent IRL-based methods (such as GAIL) achieve higher performance on complex Imitation Learning tasks when trained on far fewer expert trajectories than direct Supervised Learning methods. After experimentally verifying their theoretical claim, the authors introduce a new method to perform Imitation Learning without expert trajectories by directly learning to match some hand-specified state-action distribution. In domains where it is easy to hand-specify such distributions (such as simple pushing or movement domains), the authors’ method is potentially easier than performing Imitation Learning with expert demonstrations. However, it is unlikely that such hand-specification is easy, or even possible, in all domains of interest.
There has been much progress with Imitation Learning methods over the past 30 years. Such methods have gone from being able to learn simple, specific tasks given ideal conditions and hand-engineered features (such as Pomerleau [12]’s early success with steering a car) to learning complex tasks in high-dimensional simulation environments (such as block-pushing or teaching an ant to walk) from very few expert demonstrations. Furthermore, we as a community have developed a wide array of methods - ranging from direct Supervised Learning to using Reinforcement Learning - to solve the Imitation Learning problem. These different methods have demonstrated different advantages and disadvantages, and we have recently developed a theoretical understanding of why such differences exist.
Despite all this progress, many open questions remain to be answered and much work remains to be done. Recent work has shown impressive experimental results in simulation domains; however, it remains to be seen whether such methods will transfer well to complex tasks in the real world. This might be especially difficult because, as noted by Ho and Ermon [10], GAIL and related state-of-the-art methods are not very efficient in terms of the interaction with the environment that they need to successfully learn new skills. Furthermore, as Ghasemipour et al. [7] point out, expert demonstrations may not be the easiest way to induce policies for certain tasks. Thus, the features of a task that make it easily amenable to solving via Imitation Learning with expert demonstrations should be studied. Additionally, novel ways of providing state-action distributions for a learner to match (for example, via language commands) can be explored. Finally, while the work of Ghasemipour et al. [7] helped deepen our theoretical understanding of Imitation Learning methods, there is much more to be understood, such as the exact effect of the chosen measure of divergence (for example, KL versus JS) on Imitation Learning’s performance. Such understanding could help develop novel methods that can solve increasingly complex tasks in real-world settings.
[1] Supervised learning, Oct 2019. URL https://en.wikipedia.org/wiki/Supervised_learning.
[2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 1–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. doi: 10.1145/1015330.1015430. URL http://doi.acm.org/10.1145/1015330.1015430.
[3] Aijaz A. Baloch and Allen M. Waxman. Visual learning, adaptive expectations, and behavioral conditioning of the mobile robot mavin. Neural Networks, 4(3):271–302, 1991. ISSN 0893-6080. doi: https://doi.org/10.1016/0893-6080(91)90067-F. URL http://www.sciencedirect.com/science/article/pii/089360809190067F.
[4] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016. URL http://arxiv.org/abs/1604.07316.
[5] Xin-Qiang Cai, Yao-Xiang Ding, Yuan Jiang, and Zhi-Hua Zhou. Expert-level atari imitation learning from demonstrations only, 2019.
[6] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. CoRR, abs/1710.11248, 2017. URL http://arxiv.org/abs/1710.11248.
[7] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods, 2019.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
[9] Daniel Conrad Halbert. Programming by Example. PhD thesis, 1984. AAI8512843.
[10] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. CoRR, abs/1606.03476, 2016. URL http://arxiv.org/abs/1606.03476.
[11] Sergey V Levine. CS285. URL http://rail.eecs.berkeley.edu/deeprlcourse/.
[12] Dean A. Pomerleau. Advances in neural information processing systems 1. chapter ALVINN: An Autonomous Land Vehicle in a Neural Network, pages 305–313. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1989. ISBN 1-558-60015-9. URL http://dl.acm.org/citation.cfm?id=89851.89891.
[13] Jesus Rodriguez. What’s new in deep learning research: Openai and deepmind join forces to achieve superhuman..., Nov 2018. URL https://towardsdatascience.com/whats-new-in-deep-learning-research-openai-and-deepmind-join-forces-to-achieve-superhuman-48e7d1accf85.
[14] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, USA, 2018. ISBN 0262039249, 9780262039246.
[15] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pages 1433–1438. AAAI Press, 2008. ISBN 978-1-57735-368-3. URL http://dl.acm.org/citation.cfm?id=1620270.1620297.