A New Framework for Query Efficient Active Imitation Learning

2019·Arxiv

ABSTRACT

ABSTRACT

We seek to align agent policy with human expert behavior in a reinforcement learning (RL) setting, without any prior knowledge about dynamics, reward function, and unsafe states. There is a human expert knowing the rewards and unsafe states based on his preference and objective, but querying that human expert is expensive. To address this challenge, we propose a new framework for imitation learning (IL) algorithm that actively and interactively learns a model of the users reward function with efficient queries. We build an adversarial generative model of states and a successor feature (SR) model trained over transition experience collected by learning policy. Our method uses these models to select state-action pairs, asking the user to comment on the optimality or safety, and trains a adversarial neural network to predict the rewards. Different from previous papers, which are almost all based on uncertainty sampling, the key idea is to actively and efficiently select state-action pairs from both on-policy and off-policy experience, by discriminating the queried (expert) and unqueried (generated) data and maximizing the efficiency of value function learning. We call this method adversarial reward query with successor representation. We evaluate the proposed method with simulated human on a state-based 2D navigation task, robotic control tasks and the image-based video games, which have high-dimensional observation and complex state dynamics. The results show that the proposed method significantly outperforms uncertainty-based methods on learning reward models, achieving better query efficiency, where the adversarial discriminator can make the agent learn human behavior more efficiently and the SR can select states which have stronger impact on value function. Moreover, the proposed method can also learn to avoid unsafe states when training the reward model.

1 INTRODUCTION

In reinforcement learning problem, the objective of RL agent is always set by reward functions (Sutton et al., 1998). Such reward functions are usually hand coded and, unfortunately, defining rewards for real robot tasks manually is challenging even for relatively well-understood problems such as robotic grasping. Recent learning from demonstrations (LfD) (Argall et al., 2009) algorithms aim to provide an intuitive interface that allows end-users to program robots without the use of code or expert knowledge. Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) is a form of LfD that focuses on recovering the underlying reward function that generates an experts demonstrations (Abbeel & Ng, 2004). Reinforcement learning in conjunction with this optimized reward function always shows better generalization capability, compared with supervised policy learning, such as Behavior Cloning (Bain & Sommut, 1999), because GAIL captures the underlying intention of the demonstrator and attenuates the compounding error caused by covariate shift (Ross & Bagnell, 2010).

A significant problem for existing IL algorithms is that they often require a large number of demonstrations from human expert, while it is also difficult for human expert to determine what kinds of demonstrations are most informative to train the RL agent. To deal with this problem, active IL (Lopes et al., 2009; Cohn et al., 2011; Cui & Niekum, 2018) have been proposed that estimate the uncertainty and information gain to select queries that are expected to be the most informative under certain criteria. However, previous papers on active IL are all based on uncertainty sampling. Uncertainty estimation can be difficult in problems with high-dimensional observations and complex state dynamics. Besides, diversity is always neglected in query selection by these papers, which may lead to select similar query states and reduce efficiency. Moreover, uncertainty-based methods are vulnerable to outliers.

In this paper, we propose an active IL method which learns a low dimensional latent space from queried and unqueried data using Wasserstein Autoencoders (WAE) (Tolstikhin et al., 2017). Based on this learned latent space, we propose two methods to select data from the unqueried pool for human labeling. The first method selects data that are sufficiently different in the latent space from queried data. It is performed by an adversarial network which classifies which pool the data instances belong to (queried or unqueried) and does not depend on the task or policies for which are trying to collect human labels. The second method selects bottleneck states for human labeling, which are prototypical representatives of well-connected regions in state space and can hence have stronger impact on learning value function than other states. Instead of constructing a graph of the state space, it leverages Successor Representations (SR) (Dayan, 1993) to learn the bottleneck states. Since the temporal structure among states can be captured by the SR inherently, a reasonable proxy for the actual graph can be formed implicitly. By using the latent space learned by WAE as state features, the SR can be learned with temporal-difference (TD) learning (Sutton et al., 1998) and, as we discuss below, it can be seen as implicitly estimating the transition dynamics of the environment. Moreover, some recent papers have integrated SR with neural network function approximators (Kulkarni et al., 2016; Barreto et al., 2017), allowing us to implicitly form the graphical structure of high-dimensional state space in an efficient way.

2 PRELIMINARIES

In this section, we provide a review of some established models adopted in this paper.

2.1 GENERATIVE ADVERSARIAL IMITATION LEARNING

The base model of imitation learning (IL) in this paper is Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016). GAIL is the-state-of-art IL model which has shown success in many difficult problems. It builds on GAN (Goodfellow et al., 2014), which is a popluar generative model and shows success in many real-world tasks. GAN uses two networks, a generator G and a discriminator D. The generator tries to cheat the discriminator and generate samples that are indistinguishable from real data. The job of the discriminator is then to tell apart the data and the samples, predicting 1 with high probability if the sample is real and 0 otherwise. More specifically, GANs optimize the following objective function

Inspired from GAN (Goodfellow et al., 2014), GAIL is a model-free adversarial imitation learning framework. The reward given by the discriminator D drives the policy to imitate the expert (human) policy behavior by minimizing the Jensen-Shannon divergence between the state-action distributions generated by and the expert state-action distribution by via the learning objective

where is a hyper-parameter and is the entropy regularization term for policy . Here the discriminator D performs the binary classification to distinguish between samples generated by learning policy and target policy .

Behavior Cloning (BC) (Pomerleau, 1991; Rusu et al., 2015; Duan et al., 2017) is another popular imitation learning method. It can perform well when expert demonstrations are plentiful. However, the performance of BC can degenerate without abundant data (Ross & Bagnell, 2010; Ross et al., 2011). When using BC, even small errors in mimicking the expert demonstration behavior can be quickly accumulated as the policy is unrolled. A good policy should correct for mistakes made previously, but for BC to achieve this, the corrective behaviors have to appear frequently in the training data.

2.2 WASSERSTEIN AUTOENCODER

In order to learn generative models and latent representation for high-dimensional observations in IRL problems, we adopted the recently proposed Wasserstein Autoencoder (WAE) (Tolstikhin et al., 2017; Rubenstein et al., 2018). VAE (Kingma & Welling, 2013) is a dominant model in learning latent representation learning and generative modeling, which focuses on minimizing KL divergence. But it can produce samples which are far from the true data manifold, especially true for structured high-dimensional datasets such as natural images. Different from popular models such as VAE (Kingma & Welling, 2013), WAEs switche to minimize the optimal transport distance between the input data distribution and output generation distribution . Given any non-negative cost function between two input instances, denoting latent code as Z in latent space Z, WAEs minimize the following objective in terms of parameters of the encoder Q(Z|X) and decoder

G(X|Z),

where is the prior distribution of latent code Z, and is the aggregated posterior distribution, is any divergence in latent space Z and is the regularization hyper-parameter. In this paper, we find an adversarial divergence works well in image-based imitation learning problems.

2.3 SUCCESSOR REPRESENTATION

The Successor Representation (SR) represents a state in terms of its successors (Dayan, 1993). Originally, the SR for s is defined as a vector of |S| with i-th index equal to the discounted future occupancy for state given the agent starting at s. Since the SR counts the visitation of successor states, it is directly dependent on the policy and the transition dynamics . Specifically, the original definition of SR can be written as

where I denotes the indicator function. The SR can also be updated by a temporal difference (TD) (Tesauro, 1995) approach by writing it in terms of the SR of the next state:

where is the estimated SR being updated, and is the one-hot vector of state with respect to all other states. However, the original definition of SR cannot be applied to MDP with high-dimensional states, where the number of all states is overwhelming. Assuming state is represented by K-dimensional features , some recent papers (Kulkarni et al., 2016; Barreto et al., 2017) extent SR to high-dimensional problems by using deep neural network as approximators

3 METHODOLOGY

In this section, we are going to introduce the proposed model and query selection strategies. Here we propose both on-policy and off-policy query strategies. The on-policy strategy is to select state-action pairs from the learning policy which are different enough from the expert policy, so as to avoid unsafe actions during exploration. The off-policy strategy is to pick up states that are bottleneck states or form the core-set from the large amount state-action pairs in the replay buffer, so that the value function can be updated more efficiently. Here we denote as the latent of input state , and as the successor representation of state .

3.1 MODEL

The base model is AIRL (Fu et al., 2017), a state-of-art model for IRL achieving many success in large-scale problems. Different from previous work, here we consider the active IRL in large-scale problems with high-dimensional observations. So, we introduce a Wasserstein Autoencoder (WAE) to learn a low-dimensional latent space of the input states, and train the discriminator and successor representation on that latent space. Besides, for large-scale image-based IRL problems, in addition to the divergence regularizing term in regular WAE objective equation 2, we introduce another adversarial regularization term, measuring the distance in the learned latent space between the state-action pairs collected by the learning policy and expert policy.

Figure 1: The proposed model. The empty circle here denotes the stop gradient during training. SR is short for successor representation.

We use the same definition of discriminator D as equation ??. Regarding WAE (Tolstikhin et al., 2017), based on empirical study, we choose to use deterministic encoder. Specifically, in equation 2, the prior of latent codes is unit multivariate Gaussian N(0, I). The cost function is chosen to be l-2 norm loss. We choose the divergence distance to be maximum mean discrepancy (MMD) (Gretton et al., 2012), where the Radial Basis Function (RBF) kernel and Rational Quadratic (RQ) Kernel are both evaluated. Denote the encoder output of WAE as and decoder output as G(z), the training objective of WAE is as below

where the sets and must have the same size, and MMDis the empirical MMD where integral is approximated by discrete summation (Tolstikhin et al., 2017). In addition, we find an adversarial regularizer can improve the practical performance a lot, which is the empirical MMD between state latent from learning policy and expert policy, i.e., MMD, where and are states sampled from replay buffer and expert demonstrations, and they have the same size. Therefore, assuming the sets and denote current states, actions, and next states given the transitions sampled from learning policy and expert demonstrations, fixing learning policy , the overall training objective of the proposed model is to minimize the loss as below:

where the parameters of encoder, decoder and discriminator are all omitted. Moreover, the SR is learned by TD updates, whose training loss is defined as,

where is the target network of SR, and it is updated with periodically. Note that the encoder, or state feature network, , is not updated during the training of SR.

3.2 ON-POLICY QUERY: SAFE EXPLORATION

Our first query strategy is to select states which are sufficiently different in latent space compared with those in expert dataset, to maximize the performance of the latent representation learned on the newly queried data. We aim to choose states whose optimality are uncertain enough and are not well represented in the queried (expert) dataset. It is agnostic to the imitation learning task. This selection for query has to be performed by an adversarial network. So we can directly utilize the discriminator in GAIL, trained by same number of transitions from learning policy and expert policy .

Fortunately, it can be also regarded as a safety classifier (Zhang & Cho, 2017), returning label indicating whether the action given by the learning policy is likely to deviate from that by the expert policy without querying it. So we can also have safe exploration while conducting this query selection in an on-policy way. However, the safety classifier in (Zhang & Cho, 2017) is trained by binary logistic regression. Here the discriminator D is trained in an adversarial way. We are going to show that the adversarial learning approach is more query-efficient to learn safety strategy than that in (Zhang & Cho, 2017).

The safe strategy adopted here is that at every time point, the discriminator D determines if it is safe to take the action from the leaning policy given current state. In training iterations, we can maintain a threshold , measuring the safety of the state-action pair (s, a) suggested by the learning policy . If , it shows that (s, a) is too different from that the expert behavior and it is not safe to take action suggested by . In this case, the agent will query the expert and take the action from . If , it is safe to follow and take the action a.

The threshold can be fixed initially, tuned empirically in specific environments. It can also be updated in every iteration. For example, by the end of every iteration, can be set to the -quantile of the discriminator outputs of all state-action pairs taken by in this iteration. This can be very small, such as 0.02 or 0.05, so that the selected states for query are sufficiently different from those in expert dataset.

3.3 OFF-POLICY QUERY: A CORE-SET ON SUCCESSOR REPRESENTATIONS

Our second query strategy is to select landmark states, which are that are representatives of well-connected regions in the state space. These states always have stronger impact on value function learning than others. Inspired by the batch acquisition for active learning (Sener & Savarese, 2017; Pinsler et al., 2019; Kirsch et al., 2019), we propose a core-set approach for query selection, which aims to find a small subset of the whole unqueried states in the replay buffer, so that the value function learned over the subset of states is competitive over the whole set of unqueried states.

Here we formulate the query selection as a K-center problem (Lim et al., 2005; Sener & Savarese, 2017), which is to cluster states and find the core-set states based on the learnt successor representations (SR) (Kulkarni et al., 2016; Barreto et al., 2017) of the unqueried states in replay buffer. This process is conducted in an off-policy way. Here the core-set is found by k-medians algorithm for computational simplicity. Since the SR captures temporally close-by states efficiently, the generated clusters are spread across the state space, with each cluster assigned to a set of states that are densely connected.

3.4 OVERALL ALGORITHM

The whole algorithm is shown as below. In the beginning, the expert dataset is initialized with state-action pairs from human demonstration behavior. In order to show the effect of active queries, the expert dataset can only have small number of data instances, or even be empty. The update of threshold in on-policy query can be periodical and specific to the exact environment. For example, we can choose to cover the lowest 5% discriminator scores of state-action pairs in the replay buffer. The on-policy query is conducted at Line 6, and the off-policy query is from Line 15 to 16. Empirically, we find both on-policy and off-policy queries cannot be conducted too frequently. Because queries can change the expert dataset, which may negatively influence the training stability of GAIL.

4 EXPERIMENT

4.1 MAZE GAME

We first evaluate the proposed algorithm on maze, which is a 2D navigation game. The map divided into 10x10 equal-sized blocks. The map and policy loss learning curve are shown in Figure 3. The agent and target occupy one block. The agent can start in any empty grid and the target is at the lower-right corner of the map. The action space consists of four directional movements. The maze architecture is kept fixed for the purpose of this work. If an agent moves, but hits a wall, it will stay in the original grid. The rewards in this game is 10 for reaching the target and -1 for every movement. The benchmarks are random query and uncertainty-based query. The former is to randomly select state-action pairs for query from replay buffer. The latter is based on uncertainty estimation of the Q-value of state action pairs. The model builds on bootstrapped DQN (Osband et al., 2016), where the number of heads K is set to 10. The expert dataset only has one expert trajectory initially.

Figure 2: The proposed method evaluated on 2D Navigation Maze Game.

4.2 CAR RACING GAME

We also evaluated the proposed algorithm on a 2D driving game, to test its performance on tasks with high-dimensional states. Here the state is a top-down view of the car. The action controls gas, steering and break. In each query, the expert returns to the RL agent the optimal action according to human behavior under current state. The benchmarks are same as the section above.

Regarding the architecture, since this game has high-dimensional states, we use WAE to learn the latent representation of the state. The architecture of encoder and decoder of WAE are chosen to be same as that in (Barreto et al., 2017). The SR is learned by using the encoder output of WAE as state features.

Figure 3: The proposed method evaluated on Car Racing Game.

5 CONCLUSION

In this paper, we propose a new framework for imitation learning. It is based on two novel query selection strategies, on-policy and off-policy query. The first one builds on the adversarial network, which selects states that are sufficiently different from the queried states stored in expert dataset. The other utilizes the representation power of successor representation (SR), which can pick most representative states on policy’s occupancy measure and improve the efficiency of learning of value function. The evaluation on practical games validates the advantage of the proposed algorithm.

REFERENCES

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1. ACM, 2004.

Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.

Michael Bain and Claude Sommut. A framework for behavioural claning. Machine intelligence, 15 (15):103, 1999.

Andr´e Barreto, Will Dabney, R´emi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.

Robert Cohn, Edmund Durfee, and Satinder Singh. Comparing action-query strategies in semi- autonomous agents. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.

Yuchen Cui and Scott Niekum. Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6907–6914. IEEE, 2018.

Peter Dayan. Improving generalization for temporal difference learning: The successor representa- tion. Neural Computation, 5(4):613–624, 1993.

Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in neural information processing systems, pp. 1087–1098, 2017.

Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse rein- forcement learning. arXiv preprint arXiv:1710.11248, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Sch¨olkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573, 2016.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Andreas Kirsch, Joost van Amersfoort, and Yarin Gal. Batchbald: Efficient and diverse batch acqui- sition for deep bayesian active learning. arXiv preprint arXiv:1906.08158, 2019.

Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

Andrew Lim, Brian Rodrigues, Fan Wang, and Zhou Xu. k-center problems with minimum cover- age. Theoretical Computer Science, 332(1-3):1–17, 2005.

Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 31–46. Springer, 2009.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034, 2016.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and Jos´e Miguel Hern´andez-Lobato. Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, pp. 6356–6367, 2019.

Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.

St´ephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661–668, 2010.

St´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and struc- tured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.

Paul K Rubenstein, Bernhard Schoelkopf, and Ilya Tolstikhin. On the latent space of wasserstein auto-encoders. arXiv preprint arXiv:1802.03761, 2018.

Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirk- patrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.

Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3): 58–68, 1995.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end simulated driv- ing. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.