Provably Efficient Imitation Learning from Observation Alone

2019·Arxiv

Abstract

Abstract

We study Imitation Learning (IL) from Observations alone (ILFO) in large-scale MDPs. While most IL algorithms rely on an expert to directly provide actions to the learner, in this setting the expert only supplies sequences of observations. We design a new model-free algorithm for ILFO, Forward Adversarial Imitation Learning (FAIL), which learns a sequence of time-dependent policies by minimizing an Integral Probability Metric between the observation distributions of the expert policy and the learner. FAIL is the first provably efficient algorithm in ILFO setting, which learns a near-optimal policy with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The resulting theory extends the domain of provably sample efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings or settings that require access to a near-optimal reset distribution. We also investigate the extension of FAIL in a model-based setting. Finally we demonstrate the efficacy of FAIL on multiple OpenAI Gym control tasks.

1. Introduction

Imitation Learning (IL) is a sample efficient approach to policy optimization (Ross et al., 2011; Ross & Bagnell, 2014; Sun et al., 2017) that has been extensively used in real applications, including Natural Language Processing (Daum´e III et al., 2009; Chang et al., 2015b;a), game playing (Silver et al., 2016; Hester et al., 2017), system identification (Venkatraman et al., 2015; Sun et al., 2016), and robotics control tasks (Pan et al., 2018). Most previous IL work considers settings where an expert can directly provide action signals. In these settings, a general strategy is to directly learn a policy that maps from state to action, via supervised learning approaches (e.g., DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014), THOR (Sun et al., 2018a), Behaviour Cloning (Syed & Schapire, 2010)). Another popular strategy is to learn a policy by minimizing some divergence between the policy’s state-action distribution and the expert’s state-action distribution. Popular divergences include Forward KL (i.e., Behaviour Cloning), Jensen Shannon Divergence (e.g., GAIL (Ho & Ermon, 2016)).

Here, we consider a more challenging IL setting, where experts’ demonstrations consist only of observations, no action or reward signals are available to the learner, and no reset is allowed (e.g., a robot learns a task by just watching an expert performing the task). We call this setting Imitation Learning from Observations alone (ILFO). Under this setting, without access to expert actions, approaches like DAgger, AggreVaTe, GAIL, and Behaviour Cloning by defi-nition cannot work. Although recently several model-based approaches, which learn an inverse model that predicts the actions taken by an expert (Torabi et al., 2018; Edwards et al., 2018) based on successive observations, have been proposed, these approaches can suffer from covariate shift (Ross et al., 2011). While we wish to train a predictor that can infer an expert’s actions accurately under the expert’s observation distribution, we do not have access to actions generated by the expert conditioned on the expert’s observation (See Section 6 for a more detailed discussion). An alternative strategy is to handcraft cost functions that use some distance metric to penalize deviation from the experts’ trajectories (e.g., Liu et al. (2018); Peng et al. (2018)), which is then optimized by Reinforcement Learning (RL). These methods typically involve hand-designed cost functions that sometimes require prior task-specific knowledge (Peng et al., 2018). The quality of the learned policy is therefore completely dependent on the hand-designed costs which could be widely different from the true cost. Ideally, we would like to learn a policy that minimizes the unknown true cost function of the underlying MDP.

In this work, we explicitly consider learning near-optimal policies in a sample and computationally efficient manner. Specifically, we focus on large-scale MDPs where the number of unique observations is extremely large (e.g., highdimensional observations such as raw-pixel images). Such large-scale MDP settings immediately exclude most existing sample efficient RL algorithms, which are often designed for small tabular MDPs, whose sample complexities have a polynomial dependency on the number of observations and hence cannot scale well. To solve large-scale MDPs, we need to design algorithms that leverage function approximation for generalization. Specifically, we are interested in algorithms with the following three properties: (1) near-optimal performance guarantees, i.e., we want to search for a policy whose performance is close to the expert’s in terms of the expected total cost of the underlying MDP (and not a hand-designed cost function); (2) sample efficiency, we require sample complexity that scales polynomially with respect to all relevant parameters (e.g., horizon, number of actions, statistical complexity of function approximators) except the cardinality of the observation space—hence excluding PAC RL algorithms designed for small tabular MDPs; (3) computational efficiency: we rely on the notion of oracle-efficiency (Agarwal et al., 2014) and require the number of efficient oracle calls to scale polynomially— thereby excluding recently proposed algorithms for Contextual Decision Processes which are not computationally efficient (Jiang et al., 2016; Sun et al., 2018b). To the best of our knowledge, the desiderata above requires designing new algorithms.

With access to experts’ trajectories of observations we introduce a model-free algorithm, called Forward Adversarial Imitation Learning (FAIL), that decomposes ILFO into H independent two-player min-max games, where H is the horizon length. We aim to learn a sequence of time-dependent policies from h = 1 to H, where at any time step h, the policy is learned such that the generated observation distribution at time step h + 1, conditioned on being fixed, matches the expert’s observation distribution at time step h + 1, in terms of an Integral Probability Metric (IPM) (M¨uller et al., 1997). IPM is a family of divergences that can be understood as using a set of discriminators to distinguish two distributions (e.g., Wasserstein distance is one such special instance). We analyze the sample complexity of FAIL and show that FAIL can learn a near-optimal policy in sample complexity that does not explicitly depend on the cardinality of observation space, but rather only depends on the complexity measure of the policy class and the discriminator class. Hence FAIL satisfies the above mentioned three properties. The resulting theory extends the domain of provably sample efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings (e.g., Dann & Brunskill (2015)) or settings that require access to a near-optimal reset distribution (e.g., Kakade & Langford (2002); Bagnell et al. (2004); Munos & Szepesv´ari (2008)). We also demonstrate that learning under ILFO can be exponentially more sample efficient than pure RL.We also study FAIL un- der three specific settings: (1) Lipschitz continuous MDPs, (2) Interactive ILFO where one can query the expert at any time during training, and (3) state abstraction. Finally, we demonstrate the efficacy of FAIL on multiple continuous control tasks.

2. Preliminaries

We consider an episodic finite horizon Decision Process that consists of , where for [H] is a time-dependent observation space,1 A is a discrete action space such that is the horizon. We assume the cost function is only defined at the last time step H (e.g., sparse cost), and the initial observation distribution, and P is the transition model i.e., for . Note that here we assume the cost function only depends on observations. We assume that for all is discrete, but is extremely large and hence any sample complexity that has polynomial dependency on be considered as sample inefficient, i.e., one cannot afford to visit every unique observation. We assume that the cost is bounded, i.e., for any sequence of observations, (e.g., zero-one loss at the end of each episode). For a time-dependent policy with , the value function is defined as:

and state-action function is defined as with . We denote as the distribution over at time step policy classes , the goal is to learn a with , which minimizes the expected cost:

Denote for . We define a Bellman Operator associated with the expert at time step h as where for any

Notation For a function , we denote as the Lipschitz constant: , with being the metric in space X.2 We consider a Reproducing Kernel Hilbert Space (RKHS) H defined with a positive definite kernel such that H is the span of , and we have f(x) = , with . For any , we define . We denote U(A) as a uniform distribution over action set A. For , we denote

Integral Probability Metrics (IPM) (M¨uller, 1997) is a family of distance measures on distributions: given two distributions and over X, and a function class F containing real-value functions and symmetric (e.g., ), IPM is defined as:

By choosing different class of functions F, various popular distances can be obtained. For instance, IPM with F = {f : recovers Total Variation distance, IPM with recovers Wasserstein distance, and IPM with with RKHS H reveals maximum mean discrepancy (MMD).

2.1. Assumptions

We first assume access to a Cost-Sensitive oracle and to a Linear Programming oracle.

Cost-Sensitive Oracle The Cost-Sensitive (CS) oracle takes a class of classifiers , a dataset consisting of pairs of feature x and cost vector , i.e., , as inputs, and outputs a classifier that minimizes the average expected classification cost:

Efficient cost sensitive classifiers exist (e.g., Beygelzimer et al. (2005)) and are widely used in sequential decision making tasks (e.g., Agarwal et al. (2014); Chang et al. (2015b)).

Linear Programming Oracle A Linear Programming (LP) oracle takes a class of functions G as inputs, optimizes a linear functional with respect to :

When G is in RKHS with bounded norm, the linear functional becomes , from which one can obtain the closed-form solution. Another example is when G consists of all functions with bounded Lipschitiz constant, i.e., for , Sriperum- budur et al. (2012) showed that be solved by Linear Programming with n many constraints, one for each pair Appendix E, we provide a new reduction to Linear Programming for

The second assumption is related to the richness of the function class. We simply consider a time-dependent policy class , and we assume realizability:

Assumption 2.1 (Realizability and Capacity of Function Class). We assume and contains and , i.e., . Further assume that for all is symmetric, and is finite in size.

Note that we assume and to be discrete (but could be extremely large) for analysis simplicity. As we will show later, our bound scales only logarithmically with respect to the size of function class.

3. Algorithm

Our algorithm, Forward Adversarial Imitation Learning (FAIL), aims to learn a sequence of policies such that its value is close to . Note that does not necessarily mean that the state distribution of is close to . FAIL learns a sequence of policies with this property in a forward training manner. The algorithm learns starting from i = 1. When learning , the algorithm fixes , and solves a min-max game to compute ; it then proceeds to time step i + 1. At a high level, FAIL decomposes a sequential learning problem into H-many independent two-player min-max games, where each game can be solved efficiently via no-regret online learning. Below, we first consider how to learn conditioned on being fixed. We then present FAIL by chaining H min-max games together.

3.1. Learning One Step Policy via a Min-Max Game

Throughout this section, we assume are learned already and fixed. The sequence of policies for determines a distribution over observation space at time step h. Also expert policy naturally induces a sequence of observation distributions for . The problem we consider in this section is to learn a policy , such that the resulting observation distribution from at time step h + 1 is close to the expert’s observation distribution at time step h + 1.

We consider the following IPM minimization problem. Given the distribution , and a policy , via the Markov property, we have the observation distribution at time step h + 1 conditioned on and as for any . Recall that the expert observation distribution at time step h + 1 is denoted as . The IPM with between is defined as:

Note that is parameterized by , and our goal is to minimize with respect to over . However, is not measur- able directly as we do not have access to but only samples from . To estimate , we draw a dataset such that , , together with observation set resulting from expert , we form the follow- ing empirical estimation of , via importance weighting (recall

where recall that K = |A| and the importance weight is used to account for the fact that we draw ac- tions uniformly from A but want to evaluate . Though due to the max operator, is not an unbiased estimate of , in Appendix A, we show that indeed is a good approximation of via an application of the standard Bernstein’s inequality and a union bound over . Hence we can approximately minimize with respect to resulting in a two-player min-max game. Intuitively, we can think of as a generator, such that, conditioned on , it generates next-step samples that are similar to the expert samples from , via fooling discriminators

Note that the above formulation is a two-player game, with the utility function for defined as:

Algorithm 1 solves the minmax game using no-regret online update on both f and . At iteration n, player f plays the best-response via (Line 3) and player plays the Follow-the-Regularized Leader (FTRL) (Shalev-Shwartz et al., 2012) as being con- vex regularization (Line 5). Note that other no-regret online learning algorithms (e.g., replacing FTRL by incremental online learning algorithms like OGD (Zinkevich, 2003) can also be used to approximately optimize the above min-max formulation. After the end of Algorithm 1, we output a policy among all computed policies such that is minimized (Line 7).

Regarding the computation efficiency of Algorithm 1, the best response computation on f in Line 3 can be computed by a call to the LP Oracle, while FTRL on can be implemented by a call to the regularized CS Oracle. Regarding the statistical performance, we have the following theorem:

The proof of the above theorem is included in Appendix A, which combines standard min-max theorem and uniform convergence analysis. The above theorem essentially shows that Algorithm 1 successfully finds a policy whose resulting IPM is close to the smallest possible IPM one could achieve if one had access to directly, up to error. Intuitively, from Theorem 3.1, we can see that if —the observation distribution resulting from fixed policies , is similar to , then we guarantee to learn a policy , such that the new sequence of policies will generate a new distribution that is close to , in terms of IPM with . The algorithm introduced below is based on this intuition.

3.2. Forward Adversarial Imitation Learning

Theorem 3.1 indicates that conditioned on being fixed, Algorithm 1 finds a policy such that it approximately minimizes the divergence—measured under IPM with , between the observation distribution resulting from , and the corresponding distribution from expert.

With Algorithm 1 as the building block, we now introduce our model-free algorithm—Forward Adversarial Imitation Learning (FAIL) in Algorithm 2. Algorithm 2 integrates Algorithm 1 into the Forward Training framework (Ross & Bagnell, 2010), by decomposing the sequential learning problem into H many independent distribution matching problems where each one is solved using Algorithm 1 independently. Every time step h, FAIL assumes that have been correctly learned in the sense that the resulting observation distribution is close to from expert. Therefore, FAIL is only required to focus on learning correctly conditioned on being fixed, such that is

close to , in terms of the IPM with . Intuitively, when one has a strong class of discriminators, and the two-player game in each time step is solved near optimally, then by induction from h = 1 to H, FAIL should be able to learn a sequence of policies such that is close to for all (for the base case, we simply have

3.3. Analysis of Algorithm 2

The performance of FAIL crucially depends on the capacity of the discriminators. Intuitively, discriminators that are too strong cause overfitting (unless one has extremely large number of samples). Conversely, discriminators that are too weak will not be able to distinguish from . This dilemma was studied in the Generative Adversarial Network (GAN) literature already by Arora et al. (2017). Below we study this tradeoff explicitly in IL.

To quantify the power of discriminator class for all h, we use inherent Bellman Error (IBE) with respect to

The Inherent Bellman Error is commonly used in approximate value iteration literature (Munos, 2005; Munos & Szepesv´ari, 2008; Lazaric et al., 2016) and policy evaluation literature (Sutton, 1988). It measures the worst possible projection error when projecting to function space Intuitively increasing the capacity of

Using a restricted function class F potentially introduces , hence one may tend to set to be infinitely powerful discriminator class such as function class consisting of all bounded functions (recall IPM becomes total variation in this case). However, using makes efficient learning impossible. The following proposition excludes the possibility of sample efficiency with discriminator class being

Theorem 3.2 (Infinite Capacity F does not generalize). There exists a MDP with H = 2, a policy set an expert policy with (i.e., is realizable), such that for datasets with , with , and with , as long as M = O(log(|X|)), we must have:

Namely, denote as the empirical distribution of a dataset D by assigning probability 1/|D| to any sample, we have:

The above theorem shows by just looking at the samples generated from and , and comparing them to the samples generated from the expert policy (IPM becomes Total variation here), we cannot distinguish from , as they both look similar to , i.e., none of the three datasets overlap with each other, resulting the TV distances between the empirical distributions become constants, unless the sample size scales

Theorem 3.2 suggests that one should explicitly regularize discriminator class so that it has finite capacity (e.g., bounded VC or Rademacher Complexity). The restricted discriminator class F has been widely used in practice as well such as learning generative models (i.e., Wasserstain GANs (Arjovsky et al., 2017)). Denote and . The following theorem shows that the learned time-dependent policies ’s performance is close to the expert’s performance:

Theorem 3.3 (Sample Complexity of FAIL). Under As- sumption 2.1, for any , set , , with probability at least Algorithm 2) outputs , such that,

by using

average inherent Bellman Error

Note that the average inherent Bellman error defined above is averaged over the state distribution of the learned policy and the state distribution of the expert, which is guaranteed to be smaller than the classic inherent Bellman error used in RL literature (i.e., (4)) which uses infinity norm over X. The proof of Theorem 3.3 is included in Ap- pendix C. Regarding computational complexity of Algo- rithm 2, we can see that it requires polynomial number of calls (with respect to parameters ) to the efficient oracles (Regularized CS oracle and LP oracle). Since our analysis only uses uniform convergence analysis and standard concentration inequalities, extension to continuous and F with complexity measure such as VC-dimension, Rademacher complexity, and covering number is standard. We give an example in Section 5.1.

4. The Gap Between ILFO and RL

To quantify the gap between RL and ILFO, below we present an exponential separation between ILFO and RL in terms of sample complexity to learn a near-optimal policy. We assume that the expert policy is optimal.

Proposition 4.1 (Exponential Separation Between RL and ILFO). Fix and . There exists a family of MDP with deterministic dynamics, with horizon H,