b

DiscoverSearch
About
My stuff
Provably Efficient Imitation Learning from Observation Alone
2019·arXiv
Abstract
Abstract

We study Imitation Learning (IL) from Observations alone (ILFO) in large-scale MDPs. While most IL algorithms rely on an expert to directly provide actions to the learner, in this setting the expert only supplies sequences of observations. We design a new model-free algorithm for ILFO, Forward Adversarial Imitation Learning (FAIL), which learns a sequence of time-dependent policies by minimizing an Integral Probability Metric between the observation distributions of the expert policy and the learner. FAIL is the first provably efficient algorithm in ILFO setting, which learns a near-optimal policy with a number of samples that is polynomial in all relevant parameters but independent of the number of unique observations. The resulting theory extends the domain of provably sample efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings or settings that require access to a near-optimal reset distribution. We also investigate the extension of FAIL in a model-based setting. Finally we demonstrate the efficacy of FAIL on multiple OpenAI Gym control tasks.

Imitation Learning (IL) is a sample efficient approach to policy optimization (Ross et al., 2011; Ross & Bagnell, 2014; Sun et al., 2017) that has been extensively used in real applications, including Natural Language Processing (Daum´e III et al., 2009; Chang et al., 2015b;a), game playing (Silver et al., 2016; Hester et al., 2017), system identification (Venkatraman et al., 2015; Sun et al., 2016), and robotics control tasks (Pan et al., 2018). Most previous IL work considers settings where an expert can directly provide action signals. In these settings, a general strategy is to directly learn a policy that maps from state to action, via supervised learning approaches (e.g., DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014), THOR (Sun et al., 2018a), Behaviour Cloning (Syed & Schapire, 2010)). Another popular strategy is to learn a policy by minimizing some divergence between the policy’s state-action distribution and the expert’s state-action distribution. Popular divergences include Forward KL (i.e., Behaviour Cloning), Jensen Shannon Divergence (e.g., GAIL (Ho & Ermon, 2016)).

Here, we consider a more challenging IL setting, where experts’ demonstrations consist only of observations, no action or reward signals are available to the learner, and no reset is allowed (e.g., a robot learns a task by just watching an expert performing the task). We call this setting Imitation Learning from Observations alone (ILFO). Under this setting, without access to expert actions, approaches like DAgger, AggreVaTe, GAIL, and Behaviour Cloning by defi-nition cannot work. Although recently several model-based approaches, which learn an inverse model that predicts the actions taken by an expert (Torabi et al., 2018; Edwards et al., 2018) based on successive observations, have been proposed, these approaches can suffer from covariate shift (Ross et al., 2011). While we wish to train a predictor that can infer an expert’s actions accurately under the expert’s observation distribution, we do not have access to actions generated by the expert conditioned on the expert’s observation (See Section 6 for a more detailed discussion). An alternative strategy is to handcraft cost functions that use some distance metric to penalize deviation from the experts’ trajectories (e.g., Liu et al. (2018); Peng et al. (2018)), which is then optimized by Reinforcement Learning (RL). These methods typically involve hand-designed cost functions that sometimes require prior task-specific knowledge (Peng et al., 2018). The quality of the learned policy is therefore completely dependent on the hand-designed costs which could be widely different from the true cost. Ideally, we would like to learn a policy that minimizes the unknown true cost function of the underlying MDP.

In this work, we explicitly consider learning near-optimal policies in a sample and computationally efficient manner. Specifically, we focus on large-scale MDPs where the number of unique observations is extremely large (e.g., highdimensional observations such as raw-pixel images). Such large-scale MDP settings immediately exclude most existing sample efficient RL algorithms, which are often designed for small tabular MDPs, whose sample complexities have a polynomial dependency on the number of observations and hence cannot scale well. To solve large-scale MDPs, we need to design algorithms that leverage function approximation for generalization. Specifically, we are interested in algorithms with the following three properties: (1) near-optimal performance guarantees, i.e., we want to search for a policy whose performance is close to the expert’s in terms of the expected total cost of the underlying MDP (and not a hand-designed cost function); (2) sample efficiency, we require sample complexity that scales polynomially with respect to all relevant parameters (e.g., horizon, number of actions, statistical complexity of function approximators) except the cardinality of the observation space—hence excluding PAC RL algorithms designed for small tabular MDPs; (3) computational efficiency: we rely on the notion of oracle-efficiency (Agarwal et al., 2014) and require the number of efficient oracle calls to scale polynomially— thereby excluding recently proposed algorithms for Contextual Decision Processes which are not computationally efficient (Jiang et al., 2016; Sun et al., 2018b). To the best of our knowledge, the desiderata above requires designing new algorithms.

With access to experts’ trajectories of observations we introduce a model-free algorithm, called Forward Adversarial Imitation Learning (FAIL), that decomposes ILFO into H independent two-player min-max games, where H is the horizon length. We aim to learn a sequence of time-dependent policies from h = 1 to H, where at any time step h, the policy  πhis learned such that the generated observation distribution at time step h + 1, conditioned on π1, . . . , πh−1being fixed, matches the expert’s observation distribution at time step h + 1, in terms of an Integral Probability Metric (IPM) (uller et al., 1997). IPM is a family of divergences that can be understood as using a set of discriminators to distinguish two distributions (e.g., Wasserstein distance is one such special instance). We analyze the sample complexity of FAIL and show that FAIL can learn a near-optimal policy in sample complexity that does not explicitly depend on the cardinality of observation space, but rather only depends on the complexity measure of the policy class and the discriminator class. Hence FAIL satisfies the above mentioned three properties. The resulting theory extends the domain of provably sample efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings (e.g., Dann & Brunskill (2015)) or settings that require access to a near-optimal reset distribution (e.g., Kakade & Langford (2002); Bagnell et al. (2004); Munos & Szepesv´ari (2008)). We also demonstrate that learning under ILFO can be exponentially more sample efficient than pure RL.We also study FAIL un- der three specific settings: (1) Lipschitz continuous MDPs, (2) Interactive ILFO where one can query the expert at any time during training, and (3) state abstraction. Finally, we demonstrate the efficacy of FAIL on multiple continuous control tasks.

We consider an episodic finite horizon Decision Process that consists of  {Xh}Hh=1, A, c, H, ρ, P, where  Xhfor  h ∈[H] is a time-dependent observation space,1 A is a discrete action space such that  |A| = K ∈ N+, His the horizon. We assume the cost function  c : XH → Ris only defined at the last time step H (e.g., sparse cost), and  ρ ∈ ∆(X1) isthe initial observation distribution, and P is the transition model i.e.,  P : Xh × A → ∆(Xh+1)for  h ∈ [H − 1]. Note that here we assume the cost function only depends on observations. We assume that  Xhfor all  h ∈ [H]is discrete, but  |Xh|is extremely large and hence any sample complexity that has polynomial dependency on  |Xh| shouldbe considered as sample inefficient, i.e., one cannot afford to visit every unique observation. We assume that the cost is bounded, i.e., for any sequence of observations,  cH ≤ 1(e.g., zero-one loss at the end of each episode). For a time-dependent policy  π = {π1, . . . , πH}with  πh : Xh →∆(A), the value function  V πh : Xh → [0, 1]is defined as:

image

and state-action function  Qπh (xh, ah)is defined as Qπh (xh, ah) = Exh+1∼Pxh,ah�V πh+1(xh+1)�with V πH (x) = c(x). We denote  µπhas the distribution over  Xhat time step  h following π. Given Hpolicy classes {Π1, . . . , ΠH}, the goal is to learn a  π = {π1, . . . , πH}with  πh ∈ Πh, which minimizes the expected cost:

image

Denote  Fh ⊆ {f : Xh → R}for  h ∈ [H]. We define a Bellman Operator  Γhassociated with the expert  π⋆hat time step h as  Γh : Fh+1 → {f : Xh → R}where for any xh ∈ Xh, f ∈ Fh+1,

image

Notation For a function  f : X → R, we denote  ∥f∥∞ =supx∈X |f(x)|, ∥f∥Las the Lipschitz constant:  ∥f∥L =supx1,x2,x1̸=x2(f(x1) − f(x2))/d(x1, x2), with  d : X ×X → R+being the metric in space X.2 We consider a Reproducing Kernel Hilbert Space (RKHS) H defined with a positive definite kernel  k : X × X → [0, 1]such that H is the span of  {k(x, ·) : x ∈ X}, and we have f(x) = ⟨f, k(x, ·)⟩, with  ⟨k(x1, ·), k(x2, ·)⟩ ≜ k(x1, x2). For any f ∈ H, we define  ∥f∥2H = ⟨f, f⟩. We denote U(A) as a uniform distribution over action set A. For  N ∈ N+, we denote  [N] ≜ {1, 2, . . . , N}.

Integral Probability Metrics (IPM) (uller, 1997) is a family of distance measures on distributions: given two distributions  P1and  P2over X, and a function class F containing real-value functions  f : X → Rand symmetric (e.g.,  ∀f ∈ F, we have −f ∈ F), IPM is defined as:

image

By choosing different class of functions F, various popular distances can be obtained. For instance, IPM with F = {f : ∥f∥∞ ≤ 1}recovers Total Variation distance, IPM with F = {f : ∥f∥L ≤ 1}recovers Wasserstein distance, and IPM with  F = {f : ∥f∥H ≤ 1}with RKHS H reveals maximum mean discrepancy (MMD).

2.1. Assumptions

We first assume access to a Cost-Sensitive oracle and to a Linear Programming oracle.

Cost-Sensitive Oracle The Cost-Sensitive (CS) oracle takes a class of classifiers  Π ≜ {π : X → ∆(A)}, a dataset consisting of pairs of feature x and cost vector c ∈ RK, i.e.,  D = {xi, ci}Ni=1, as inputs, and outputs a classifier that minimizes the average expected classification cost: �Ni=1 π(·|xi)⊤ci/N.

Efficient cost sensitive classifiers exist (e.g., Beygelzimer et al. (2005)) and are widely used in sequential decision making tasks (e.g., Agarwal et al. (2014); Chang et al. (2015b)).

Linear Programming Oracle A Linear Programming (LP) oracle takes a class of functions G as inputs, optimizes a linear functional with respect to  g ∈ G: ming∈G�Ni=1 αig(xi).

When G is in RKHS with bounded norm, the linear functional becomes  maxg:∥g∥≤c⟨g, �ni=1 αiφ(xi)⟩, from which one can obtain the closed-form solution. Another example is when G consists of all functions with bounded Lipschitiz constant, i.e.,  G = {g : ∥g∥L ≤ c}for  c ∈ R+, Sriperum- budur et al. (2012) showed that  maxg∈G�ni=1 αig(xi) canbe solved by Linear Programming with n many constraints, one for each pair  (αi, xi). InAppendix E, we provide a new reduction to Linear Programming for  G = {g : ∥g∥L ≤c1, ∥g∥∞ ≤ c2} for c1, c2 ∈ R+.

The second assumption is related to the richness of the function class. We simply consider a time-dependent policy class  Πh and Fh for h ∈ [H], and we assume realizability:

Assumption 2.1 (Realizability and Capacity of Function Class). We assume  Πhand  Fhcontains  π⋆hand  V ⋆h, i.e., π⋆h ∈ Πh and V ⋆h ∈ Fh, ∀h ∈ [H]. Further assume that for all  h, Fhis symmetric, and  Πh and Fhis finite in size.

Note that we assume  Πhand  Fhto be discrete (but could be extremely large) for analysis simplicity. As we will show later, our bound scales only logarithmically with respect to the size of function class.

Our algorithm, Forward Adversarial Imitation Learning (FAIL), aims to learn a sequence of policies  π ={π1, . . . , πH}such that its value  J(π)is close to  J(π⋆). Note that  J(π) ≈ J(π⋆)does not necessarily mean that the state distribution of  πis close to  π⋆. FAIL learns a sequence of policies with this property in a forward training manner. The algorithm learns  πistarting from i = 1. When learning  πi, the algorithm fixes  {π1, . . . , πi−1}, and solves a min-max game to compute  πi; it then proceeds to time step i + 1. At a high level, FAIL decomposes a sequential learning problem into H-many independent two-player min-max games, where each game can be solved efficiently via no-regret online learning. Below, we first consider how to learn  πhconditioned on  {π1, . . . , πh−1}being fixed. We then present FAIL by chaining H min-max games together.

3.1. Learning One Step Policy via a Min-Max Game

Throughout this section, we assume  {π1, . . . , πh−1}are learned already and fixed. The sequence of policies {π1, . . . , πh−1}for  h ≥ 2determines a distribution  νh ∈∆(Xh)over observation space  Xhat time step h. Also expert policy  π⋆naturally induces a sequence of observation distributions  µ⋆h ∈ ∆(Xh)for  h ∈ [H]. The problem we consider in this section is to learn a policy πh ∈ Πh, such that the resulting observation distribution from  {π1, . . . , πh−1, πh}at time step h + 1 is close to the expert’s observation distribution  µ⋆h+1 at time step h + 1.

We consider the following IPM minimization problem. Given the distribution  νh ∈ ∆(Xh), and a policy  π ∈ Πh, via the Markov property, we have the observation distribution at time step h + 1 conditioned on  νhand  πas νh+1(x) ≜ �xh,ah νh(xh)π(ah|xh)P(x|xh, ah)for any x ∈ Xh+1. Recall that the expert observation distribution at time step h + 1 is denoted as  µ⋆h+1. The IPM with  Fh+1between  νh+1 and µ⋆h+1 is defined as:

image

image

Note that  dFh+1(π|νh, µ⋆h+1)is parameterized by  π, and our goal is to minimize  dFh+1(π|νh, µ⋆h+1)with respect to πover  Πh. However,  dFh+1(π|νh, µ⋆h+1)is not measur- able directly as we do not have access to  µ⋆h+1but only samples from  µ⋆h+1. To estimate  dFh+1, we draw a dataset D =�(xih, aih, xih+1)�Ni=1such that  xih ∼ νh, aih ∼ U(A), xih+1 ∼ P(·|xih, aih), together with observation set resulting from expert  D⋆ = {˜xih+1}N ′i=1iid∼ µ⋆h+1, we form the follow- ing empirical estimation of  dFh+1 for any π, via importance weighting (recall  aih ∼ U(A)):

image

where recall that K = |A| and the importance weight Kπ(aih|xih)is used to account for the fact that we draw ac- tions uniformly from A but want to evaluate  π. Though due to the max operator, �dFh+1(π|νh, µ⋆h+1)is not an unbiased estimate of  dFh+1(π|νh, µ⋆h+1), in Appendix A, we show that �dFh+1indeed is a good approximation of  dFh+1via an application of the standard Bernstein’s inequality and a union bound over  Fh+1. Hence we can approximately minimize �dFh+1with respect to  π: minπ∈Π �dFh+1(π|νh, µ⋆h+1),resulting in a two-player min-max game. Intuitively, we can think of  πas a generator, such that, conditioned on  νh, it generates next-step samples  xh+1that are similar to the expert samples from  µ⋆h+1, via fooling discriminators  Fh+1.

Note that the above formulation is a two-player game, with the utility function for  π and fdefined as:

image

Algorithm 1 solves the minmax game  minπ maxf u(π, f)using no-regret online update on both f and  π. At iteration n, player f plays the best-response via  fn =arg maxf u(πn, f)(Line 3) and player  πplays the Follow-the-Regularized Leader (FTRL) (Shalev-Shwartz et al., 2012) as  πn+1 = �nt=1 u(π, f t) + φ(π) with φbeing con- vex regularization (Line 5). Note that other no-regret online learning algorithms (e.g., replacing FTRL by incremental online learning algorithms like OGD (Zinkevich, 2003) can also be used to approximately optimize the above min-max formulation. After the end of Algorithm 1, we output a policy  πamong all computed policies  {πi}Ti=1such that �dFh+1(π|νn, µ⋆n+1)is minimized (Line 7).

Regarding the computation efficiency of Algorithm 1, the best response computation on f in Line 3 can be computed by a call to the LP Oracle, while FTRL on  πcan be implemented by a call to the regularized CS Oracle. Regarding the statistical performance, we have the following theorem:

image

The proof of the above theorem is included in Appendix A, which combines standard min-max theorem and uniform convergence analysis. The above theorem essentially shows that Algorithm 1 successfully finds a policy  πwhose resulting IPM is close to the smallest possible IPM one could achieve if one had access to  dFh+1(π|νh, µ⋆h+1)directly, up to  ϵerror. Intuitively, from Theorem 3.1, we can see that if νh—the observation distribution resulting from fixed policies  {π1, . . . , πh−1}, is similar to  µ⋆h, then we guarantee to learn a policy  π, such that the new sequence of policies {π1, . . . , πh−1, π}will generate a new distribution  νh+1that is close to  µ⋆h+1, in terms of IPM with  Fh+1. The algorithm introduced below is based on this intuition.

3.2. Forward Adversarial Imitation Learning

Theorem 3.1 indicates that conditioned on  {π1, . . . , πh−1}being fixed, Algorithm 1 finds a policy  π ∈ Πhsuch that it approximately minimizes the divergence—measured under IPM with  Fh+1, between the observation distribution  νh+1resulting from  {π1, . . . , πh−1, π}, and the corresponding distribution  µ⋆h+1 from expert.

With Algorithm 1 as the building block, we now introduce our model-free algorithm—Forward Adversarial Imitation Learning (FAIL) in Algorithm 2. Algorithm 2 integrates Algorithm 1 into the Forward Training framework (Ross & Bagnell, 2010), by decomposing the sequential learning problem into H many independent distribution matching problems where each one is solved using Algorithm 1 independently. Every time step h, FAIL assumes that  π1, . . . , πh−1have been correctly learned in the sense that the resulting observation distribution  νh from{π1, . . . , πh−1}is close to  µ⋆h from expert. Therefore, FAIL is only required to focus on learning  πhcorrectly conditioned on  {π1, . . . , πh−1}being fixed, such that  vh+1is

image

close to  µ⋆h+1, in terms of the IPM with  Fh+1. Intuitively, when one has a strong class of discriminators, and the two-player game in each time step is solved near optimally, then by induction from h = 1 to H, FAIL should be able to learn a sequence of policies such that  νhis close to  µ⋆hfor all h ∈ [H](for the base case, we simply have  ν1 = µ⋆1 = ρ).

3.3. Analysis of Algorithm 2

The performance of FAIL crucially depends on the capacity of the discriminators. Intuitively, discriminators that are too strong cause overfitting (unless one has extremely large number of samples). Conversely, discriminators that are too weak will not be able to distinguish  νhfrom  µ⋆h. This dilemma was studied in the Generative Adversarial Network (GAN) literature already by Arora et al. (2017). Below we study this tradeoff explicitly in IL.

To quantify the power of discriminator class  Fhfor all h, we use inherent Bellman Error (IBE) with respect to  π⋆:

image

The Inherent Bellman Error is commonly used in approximate value iteration literature (Munos, 2005; Munos & Szepesv´ari, 2008; Lazaric et al., 2016) and policy evaluation literature (Sutton, 1988). It measures the worst possible projection error when projecting  Γhgto function space  Fh.Intuitively increasing the capacity of  Fh reduces ϵbe.

Using a restricted function class F potentially introduces ϵbe, hence one may tend to set  Fhto be infinitely powerful discriminator class such as function class consisting of all bounded functions  {f : ∥f∥∞ ≤ c}(recall IPM becomes total variation in this case). However, using Fh ≜ {f : ∥f∥∞ ≤ c}makes efficient learning impossible. The following proposition excludes the possibility of sample efficiency with discriminator class being  {f : ∥f∥∞ ≤ c}.

Theorem 3.2 (Infinite Capacity F does not generalize). There exists a MDP with H = 2, a policy set  Π = {π, π′},an expert policy  π⋆with  π = π⋆(i.e.,  Πis realizable), such that for datasets  D⋆ = {˜xi2}Mi=1with  ˜xi2 ∼ µ⋆2, D = {xi2}Mi=1with  xi2 ∼ µπ2, and  D′ = {x′(i)2 }Mi=1with x′(i)2 ∼ µπ′2 , as long as M = O(log(|X|)), we must have:

image

Namely, denote ˆDas the empirical distribution of a dataset D by assigning probability 1/|D| to any sample, we have:

image

The above theorem shows by just looking at the samples generated from  πand  π′, and comparing them to the samples generated from the expert policy  π⋆ using {f : ∥f∥∞ ≤ c}(IPM becomes Total variation here), we cannot distinguish πfrom  π′, as they both look similar to  π⋆, i.e., none of the three datasets overlap with each other, resulting the TV distances between the empirical distributions become constants, unless the sample size scales  Ω(poly(|X|)).

Theorem 3.2 suggests that one should explicitly regularize discriminator class so that it has finite capacity (e.g., bounded VC or Rademacher Complexity). The restricted discriminator class F has been widely used in practice as well such as learning generative models (i.e., Wasserstain GANs (Arjovsky et al., 2017)). Denote  |Π| = maxh |Πh|and  |F| = maxh |Fh|. The following theorem shows that the learned time-dependent policies  π’s performance is close to the expert’s performance:

Theorem 3.3 (Sample Complexity of FAIL). Under As- sumption 2.1, for any  ϵ, δ ∈ (0, 1], set  T = Θ( Kϵ2 ), n = n′ = Θ( K log(|Π||F|H/δ)ϵ2 ), with probability at least 1 − δ, FAIL (Algorithm 2) outputs  π, such that,

image

by using ˜O�HKϵ2 log�|Π||F|δ ��

average inherent Bellman Error  ϵ′be:

image

Note that the average inherent Bellman error  ϵ′bedefined above is averaged over the state distribution of the learned policy  πand the state distribution of the expert, which is guaranteed to be smaller than the classic inherent Bellman error used in RL literature (i.e., (4)) which uses infinity norm over X. The proof of Theorem 3.3 is included in Ap- pendix C. Regarding computational complexity of Algo- rithm 2, we can see that it requires polynomial number of calls (with respect to parameters  H, K, 1/ϵ) to the efficient oracles (Regularized CS oracle and LP oracle). Since our analysis only uses uniform convergence analysis and standard concentration inequalities, extension to continuous  Πand F with complexity measure such as VC-dimension, Rademacher complexity, and covering number is standard. We give an example in Section 5.1.

To quantify the gap between RL and ILFO, below we present an exponential separation between ILFO and RL in terms of sample complexity to learn a near-optimal policy. We assume that the expert policy is optimal.

Proposition 4.1 (Exponential Separation Between RL and ILFO). Fix  H ∈ N+and  ϵ ∈ (0,�1/8). There exists a family of MDP with deterministic dynamics, with horizon H, 2H − 1many states, two different actions, such that for any RL algorithm, the probability of outputting a policy  ˆπ withJ(ˆπ) ≤ J(π⋆) + ϵafter collecting T trajectories is at most 2/3 for all T ≤ O(2H/ϵ2). On the other hand, for the same MDP, given one trajectory of observations  {˜xh}Hh=1from the expert policy  π⋆, there exists an algorithm that deterministically outputs  π⋆after collecting O(H) trajectories.

Proposition 4.1 shows having access to expert’s trajectories of observations allows us to efficiently solve some MDPs that are otherwise provably intractable for any RL algorithm (i.e., requiring exponentially many trajectories to find a near optimal policy). This kind of exponential gap previously was studied in the interactive imitation learning setting where the expert also provides action signals (Sun et al., 2017) and one can query the expert’s action at any time step during training. To the best of our knowledge, this is the first exponential gap in terms of sample efficiency between ILFO and RL. Note that our theorem applies to a specific family of purposefully designed MDPs, which is standard for proving information-theoretical lower bounds.

In this section, we study three settings where inherent Bellman Error will disappear even under restricted discriminator class : (1) Lipschitz Continuous MDPs (e.g., (Kakade et al., 2003)), (2) Interactive Imitation Learning from Observation where expert is available to query during training, and (3) state abstraction.

5.1. Lipschitz Continuous MDPs

We consider a setting where cost functions, dynamics and π⋆h are Lipschitz continuous in metric space (X, d):

image

for the known metric d and Lipschitz constants  LP , Lπ. Under this condition, the Bellman operator with respect to  π⋆is Lipschitz continuous in the metric space (X, d): |Γhf(x1) − Γhf(x2)| ≤ (∥f∥∞(LP + Lπ))d(x1, x2), where we applied Holder’s inequality. Hence, we can design the function class  Fh for all h ∈ [H]as follows:

image

which will give us  ϵbe = 0and  V ⋆h ∈ Fhdue to the as- sumption on the cost function. Namely  Fhis the class of functions with bounded Lipschitz constant and bounded value. This class of functions is widely used in practice for learning generative models (e.g., Wasserstain GAN). Note that this setting was also studied in (Munos & Szepesv´ari, 2008) for the Fitted Value Iteration algorithm.

Denote  L ≜ LP + Lπ. For  F = {f : ∥f∥L ≤L, ∥f∥∞ ≤ 1}we show that we can evaluate the empirical IPM  supf∈F��Ni=1 f(xi)/N − �N ′i=1 f(x′i)/N ′�by reducing it to Linear Programming, of which the details are deferred to Appendix E. Regarding the generalization ability, note that our function class F is a subset of all functions with bounded Lipschitz constant, i.e.,  F ⊂ {f : ∥f∥L ≤ L}. The Rademacher complexity for bounded Lipschitz function class grows in the order of  O(N −1/cov(X))(e.g., see (Luxburg & Bousquet, 2004; Sriperumbudur et al., 2012)), with cov(X) being the covering dimension of the metric space  (X, d).4 ExtendingTheorem 3.3 to Lipschitz continuous MDPs, we have the following corollary.

Corollary 5.1 (Sample Complexity of FAIL for Lipschitz Continuous MDPs). With the above set up on Lipschitz continuous MDP and  Fh for h ∈ [H] (5), given ϵ, δ ∈ (0, 1],set  T = Θ( Kϵ2 ), n = n′ = ˜Θ( K(LK)cov(X) log(|Π|/δ)ϵ2+cov(X) ), then with probability at least  1 − δ, FAIL (Algorithm 2) outputs a policy with  J(π) − J(π⋆) ≤ O(H2ϵ)using at most ˜O�HK(KL)cov(X)ϵ(2+cov(X)) log�|Π|δϵ��many trajectories.

The proof of the above corollary is deferred to Appendix F which uses a standard covering number argument over  Fhwith norm  ∥ · ∥∞. Note that we get rid of IBE here and hence as the number of sample grows, FAIL approaches to the global optimality. Though the bound has an exponential dependency on the covering dimension, note that the covering dimension cov(X) is completely dependent on the underlying metric space (X, d) and could be much smaller than the real dimension of X. Note that the above theorem also serves an example regarding how we can extend Theo- rem 3.3 to settings where F contains infinitely many functions but with bounded statistical complexity (similar techniques can be used for  Π as well).

5.2. Interactive Imitation Learning from Observations

We can avoid IBE in an interactive learning setting, where we assume that we can query expert during training. But different from previous interactive imitation learning setting such as AggreVaTe, LOLS (Ross & Bagnell, 2014; Chang et al., 2015b), and DAgger (Ross et al., 2011), here we do not assume that expert provides actions nor cost signals. Given any observation x, we simply ask expert to take over for just one step, and observe the observation at the next step, i.e.,  x′ ∼ P(·|x, a)with  a ∼ π⋆(·|x). Note that compared to the non-interactive setting, interactive setting assumes a much stronger access to expert. In this setting, we can use arbitrary class of discriminators with bounded complexity. Due to space limit, we defer the detailed description of the interactive version of FAIL (Algorithm 5) to Appendix G. The following theorem states that we can avoid IBE:

Theorem 5.2. Under Assumption 2.1 and the existence of an interactive expert, for any  ϵ ∈ (0, 1]and  δ ∈ (0, 1], set

T = Θ( Kϵ2 ), n = Θ( K log(|Π||F|H)/δϵ2 ), with probability at least  1 − δ,Algorithm 5) outputs a policy  πsuch that:

image

by using at most ˜O�HKϵ2 log�|Π||F|δ ��many trajectories.

Compare to the non-interactive setting, we get rid of IBE, at the cost of a much stronger expert.

5.3. State Abstraction

Denote  φ : X → Sas the abstraction that maps X to a discrete set S. We assume that the abstraction satisfies the Bisimulation property (Givan et al., 2003): for any  x, x′ ∈X, if φ(x) = φ(x′), we have that  ∀s ∈ S, a ∈ A, h ∈ [H]:

image

In this case, one can show that the  V ⋆is piece-wise constant under the partition induced from  φ, i.e.,  V ⋆(x) =V ⋆(x′)if  φ(x) = φ(x′). Leveraging the abstraction  φ, we can then design  Fh = {f : ∥f∥∞ ≤ 1, f(x) =f(x′), ∀x, x′, s.t., φ(x) = φ(x′)}. Namely,  Fhcontains piece-wise constant functions with bounded values. Under this setting, we can show that the inherent Bellman Error is zero as well (see Proposition 9 from Chen & Jiang (2019)). Also  supf∈Fh u(π, f)can be again computed via LP and Fhhas Rademacher complexity scales  O(�|S|/N)with N being the number of samples. Details are in Appendix I.

Some previous works use the idea of learning an inverse model to predict actions (or latent causes) (Nair et al., 2017; Torabi et al., 2018) from two successive observations and then use the learned inverse model to generate actions using expert observation demonstrations. With the inferred actions, it reduces the problem to normal imitation learning. We note here that learning an inverse model is ill-defined. Specifically, simply by the Bayes rule, the inverse model P(a|xh, xh+1)—the probability of action a was executed at  xhsuch that the system generated  xh+1, is equivalent to P(a|xh, xh+1) ∝ P(xh+1|xh, a)P(a|xh), i.e., an inverse model  P(a|xh, xh+1)is explicitly dependent on the action generation policy  P(a|xh). Unlike P(xh+1|xh, a), withoutthe policy  P(a|xh), the inverse model is ill-defined by itself alone. This means that if one wants to learn an inverse model that predicts expert actions along the trajectory of observations generated by the expert, one would need to learn an inverse model, denoted as  P ⋆(a|xh, xh+1), such that  P ⋆(a|xh, xh+1) ∝ P(xh+1|xh, a)π⋆h(a|xh), which in- dicates that one needs to collect actions from  π⋆h. An inverse model makes sense when the dynamics is deterministic and bijective. Hence replying on learning such an inverse model P ⋆(a|x, x′)will not provide any performance guarantees in general, unless we have actions collected from  π⋆.

We test FAIL on three simulations from openAI Gym (Brockman et al., 2016): Swimmer, Reacher, and the Fetch Robot Reach task (FetchReach). For Swimmer we set H to be 100 while for Reacher and FetchReach, H is 50 in default. The Swimmer task has dense reward (i.e.,., reward at every time step). For reacher, we try both dense reward and sparse reward (i.e., success if it reaches to the goal within a threshold). FetchReach is a sparse reward task. As our algorithm is presented for discrete action space, for all three tasks, we discrete the action space via discretizing each dimension into 5 numbers and applying categorical distribution independently for each dimension.5 6

image

Figure 1. Performance of expert, FAIL, and GAIL (without actions) on dense reward tasks (Reacher and Hopper). For (a) and (b), we fix the number of training samples while varying the number of expert demonstrations (6, 12, 25). For (c) and (d), we fix the number of expert trajectories, while varying the training samples.

For baseline, we modify GAIL (Ho & Ermon, 2016), a model-free IL algorithm, based on the implementation from OpenAI Baselines, to make it work for ILFO. We delete the input of actions to discriminators in GAIL to make it work for ILFO. Hence the modified version can be understood as using RL (the implementation from OpenAI uses TRPO (Schulman et al., 2015)) methods to minimize the divergence between the learner’s average state distribution and the expert’s average state distribution.

For FAIL implementation, we use MMD as a special IPM, where we use RBF kernel and set the width using the common median trick (e.g.,(Fukumizu et al., 2009)) without any future tuning. All policies are parameterized by one-layer neural networks. Instead of using FTRL, we use ADAM as an incremental no-regret learner, with all default parameters (e.g., learning rate) (Kingma & Ba, 2014). The total number of iteration T in Algorithm 1 is set to 1000 without any future tuning. During experiments, we stick to default hyperparameters for the purpose of best reflecting the algorithmic contribution of FAIL.7 All the results below are averaged over ten random trials with seeds randomly generated between [1, 1e6]. The experts are trained via a reinforcement learning algorithm (TRPO (Schulman et al., 2015)) with multiple millions of samples till convergence.

Figure 1 shows the comparison of expert, FAIL, and GAIL

image

Figure 2. Performance of expert, FAIL, and GAIL (without actions) on two sparse control tasks (Reacher Sparse and FetchReach). We fix the number of training samples while varying the number of expert demonstrations (6, 12, 25).

on two dense reward tasks with different number of expert demonstrations, under fixed total number of training samples (one million). We report mean and standard error in Figure 1. We observe GAIL outperforms FAIL in Swimmer on some configurations, while FAIL outperforms GAIL on Reacher (Dense reward) for all configuration.

Figure 2 shows the comparison of expert, FAIL, and GAIL on two sparse reward settings. We observe that FAIL sig-nificantly outperforms GAIL on these two sparse reward tasks. For sparse reward, note that what really matters is to reach the target at the end, FAIL achieves that by matching expert’s state distribution and learner’s state distribution one by one at every time step till the end, while GAIL (without actions) loses the sense of ordering by focusing on the average state distributions.

The above experiments also indicates that FAIL in general can work well for shorter horizon (e.g., H = 50 for Reacher and Fetch), while shows much less improvement over GAIL on longer horizon task. We think this is because of the nature of FAIL which has to learn a sequence of time-dependent policies along the entire horizon H. Long horizon requires larger number of samples. While method like GAIL learns a single stationary policy with all training data, and hence is less sensitive to horizon increase. We leave extending FAIL to learning a single stationary policy as a future work.

Due to the possible non-zero inherent Bellman error, FAIL in general cannot guarantee to find the globally optimal policy. So the remaining question is that:

In ILFO, under a realizability assumption, does there exist an algorithm that can learn an  ϵ near-optimal policy with sample complexity scales polynomially with respect to horizon, number of actions,  1/ϵ, and statistical complexities of function classes, with high probability?

While we are not able to answer this question in the model- free setting considered in this work, we study a model-based algorithm which we show can achieve the above sample complexity, though it is computationally not efficient.

Rather than starting with a realizable policy class  Π, our model-based algorithm starts with a class of models P with P ∈ P, i.e., the model class contains the true model P (realizability). Similarly, we assume we have a set of discriminators  {Fh}Hh=1 such that V ∗h ∈ Fh. The above is the realizability assumption in the model-based setting.

Note that intuitively we use discriminators to approximate expert’s value functions. For any ˆP ∈ P, f ∈ Fh+1, defineQˆP ,fh (x, a) ≜ Ex′∼ ˆPx,af(x′) for any (x, a). Denote Qh as:

image

Note that due to the assumption that  V ∗h+1 ∈ Fh+1and P ∈ P, we must have  Q⋆h = QP,V ⋆h+1h ∈ Qh, i.e., Qhis realizable. Each  Qhinduces a policy  πQh (x) =arg mina Qh(x, a). Denote a policy class �Πhas follows:

image

Note that we have����Πh��� = |P| |Fh+1|. Note that since Q⋆h ∈ Qh, we must have  π⋆h ∈ �Πh, i.e., the policy class is realizable. Now we expand the discriminator classes as follows. We set �FH = FH. We know that �FHis realizable, i.e.,  V ⋆H ∈ �FH. Given a realizable  �Fh+1, we design  �Fhas follows:

image

�Fh ≜Fh∪

image

Namely we explicitly expand  Fhby applying a potential Bellman operator (defined using a pair  ( ˆP, π)) to a discriminator at �Fh+1. Via the induction assumption, since V ⋆h+1 ∈ �Fh+1and  P ∈ P, π⋆h ∈ �Πh, we must have ΓhV ⋆h+1 ∈ �Fhby construction. Hence  �Fhis realizable. We can also recursively compute the size of �Fh. Define F ≜ maxh |Fh|. Starting with��� �FH��� = F, we can show that  log���� �Fh����= O (poly(H)(|P| F)) for h ∈ [H].

Model-based FAIL takes P and  {Fh}has inputs, and performs the two procedures shown in Figure 3.

In Figure 3, note that model-based FAIL does not require any real-world samples in the first step. Real-world samples are only required in the second step where we call FAIL. As we show that  {�Πh}hand  { �Fh}hare realizable, and

1. Construct  {�Πh}h and { �Fh}has shown in (8) and (9);

2. Run FAIL (Algorithm 2) with  {�Πh} and { �Fh}.

Figure 3. Model-based FAIL

IBE is zero due to the construction of �Fh, we can simply invoke Theorem 3.3 here and reach the following corollary:

Corollary 8.1 (Model-based FAIL sample Complexity). Given a pair  (ϵ, δ), Pand  {Fh}hwith  P ∈ P, and V ⋆h ∈ Fhfor all  h ∈ [H], the algorithm shown in Fig- ure 3 can learn an  ϵnear-optimal policy with number of samples and number of expert’s samples both scale in the order of poly  (H, K, (1/ϵ), log(|P|), log(F), log(1/δ)), with probability at least  1 − δ.

The above corollary shows statistically, from a model-based perspective, realizability alone is sufficient to achieve the polynomial sample complexity. However, the above corollary does not show if realizability is sufficient for model-free methods. Also, though it is statistically efficient, the computational complexity of model-based FAIL is still questionable (the naive implementation in Figure 3 has computational complexity  Θ(F H)). Investigating sufficient and necessary conditions for achieving polynomial sample and polynomial computation complexity under ILFO setting is an interesting future work.

We study Imitation Learning from Observation (ILFO) setting and propose an algorithm, Forward Adversarial Imitation Learning (FAIL), that achieves sample efficiency and computational efficiency. FAIL decomposes the sequential learning tasks into independent two-player min-max games of which is solved via general no-regret online learning. In addition to the algorithmic contribution, we present the first exponential gap in terms of sample complexity between ILFO and RL, demonstrating the potential benefit from expert’s observations. A key observation is that one should explicitly regularize the class of discriminators to achieve sample efficiency and design discriminators to decrease the inherent Bellman Error. Experimentally, while GAIL can be used to solve the ILFO problem by removing action inputs to the discriminators, FAIL works just as well in problems with dense reward. Our analysis of FAIL provides the first strong theoretical guarantee for solving ILFO, and FAIL significantly outperforms GAIL on sparse reward MDPs, which are common in practice.

WS is supported in part by Office of Naval Research contract N000141512365. WS thanks Nan Jiang and Akshay Krishnamurthy for valuable discussions. We thank the first anonymous reviewer for carefully reviewing the proofs.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.

Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Gener- alization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.

Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. Policy search by dynamic programming. In Advances in neural information processing systems, pp. 831–838, 2004.

Beygelzimer, A., Dani, V., Hayes, T., Langford, J., and Zadrozny, B. Error limiting reductions between classi-fication tasks. In Proceedings of the 22nd international conference on Machine learning, pp. 49–56. ACM, 2005.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Chang, K.-W., He, H., Daum´e III, H., and Langford, J. Learning to search for dependencies. arXiv preprint arXiv:1503.05615, 2015a.

Chang, K.-w., Krishnamurthy, A., Agarwal, A., Daume, H., and Langford, J. Learning to search better than your teacher. In ICML, 2015b.

Chen, J. and Jiang, N. Information-theoretic considera- tions in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.

Daum´e III, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine learning, 2009.

Edwards, A. D., Sahni, H., Schroeker, Y., and Isbell, C. L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.

Fukumizu, K., Gretton, A., Lanckriet, G. R., Sch¨olkopf, B., and Sriperumbudur, B. K. Kernel choice and classifiabil-ity for rkhs embeddings of probability distributions. In Advances in neural information processing systems, pp. 1750–1758, 2009.

Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., DulacArnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In NIPS, 2016.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low bellman rank are pac-learnable. arXiv preprint arXiv:1610.09512, 2016.

Kakade, S. and Langford, J. Approximately optimal approx- imate reinforcement learning. In ICML, 2002.

Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krishnamurthy, A., Agarwal, A., and Langford, J. Pac rein- forcement learning with rich observations. In Advances in Neural Information Processing Systems, pp. 1840–1848, 2016.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.

Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. IEEE, 2018.

Luxburg, U. v. and Bousquet, O. Distance-based classi- fication with lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.

M¨uller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29 (2):429–443, 1997.

M¨uller, K., Smola, A., and R¨atsch, G. Predicting time series with support vector machines. Artificial Neural Networks ICANN’9, 1327:999–1004, 1997. doi: 10. 1007/BFb0020283. URL http://link.springer. com/chapter/10.1007/BFb0020283.

Munos, R. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, pp. 1006. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.

Munos, R. and Szepesv´ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (May):815–857, 2008.

Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., and Levine, S. Combining self-supervised learning and imitation for vision-based rope manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2146–2153. IEEE, 2017.

Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. Agile autonomous driving using end-to-end deep imitation learning. Proceedings of Robotics: Science and Systems. Pittsburgh, Pennsylvania, 2018.

Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.

Ross, S. and Bagnell, J. A. Efficient reductions for imitation learning. In AISTATS, pp. 661–668, 2010.

Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.

Ross, S., Gordon, G. J., and Bagnell, J. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, pp. 1889–1897, 2015.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends R⃝in Machine Learning, 2012.

Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.

Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Sch¨olkopf, B., Lanckriet, G. R., et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

Sun, W., Venkatraman, A., Boots, B., and Bagnell, J. A. Learning to filter with predictive state inference machines. In ICML, 2016.

Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply aggrevated: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.

Sun, W., Bagnell, J. A., and Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018a.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018b.

Sutton, R. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

Syed, U. and Schapire, R. E. A reduction from appren- ticeship learning to classification. In Advances in neural information processing systems, pp. 2253–2261, 2010.

Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. AAAI, 2015.

Zinkevich, M. Online Convex Programming and General- ized Infinitesimal Gradient Ascent. In ICML, 2003.

A. Proof of Theorem 3.1

Before proving the theorem, we introduce some notations and useful lemmas.

Given a pair of  πand f, denote random variable  vi = Kπ(aih|xih)f(xih+1) − f(˜xih+1), recall the definition of the utility function  u(π, f):

image

Denote  v = Ex∼νh,a∼π(·|x),x′∼Px,a[f(x′)]−Ex∼µ⋆h+1[f(x)]. It is easy to verify that  Eivi = v. We also have  |vi −v| ≤ 4K.We can further bound the variance of  vi − v as:

image

where we used the fact that  |f(x)| ≤ 1, ∀x, π(a|x) ≤ 1, ∀x, a, and the last inequality uses the fact that  aih is sampled from a uniform distribution over A. With that, we can apply Bernstein’s inequality to  {vi}together with a union bound over  Π andF, we will have the following lemma:

Lemma A.1. Given dataset  D = {xih, aih, xih+1}with  xih ∼ νh, aih ∼ U(A), xih+1 ∼ Pxih,aih, and  De = {˜xih+1}with ˜xih+1 ∼ µ⋆h+1, for any pair  π ∈ Π, f ∈ F, with probability at least  1 − δ,

image

Let us define two loss functions for  π and f:

For any  f, g : X × A × X → R, define  ⟨f, g⟩ = E(x,a)∼Dx,af(x, a)g(x, a), where we overload the notation and denote D as the empirical distribution over the dataset D (i.e., put probability 1/|D| over each data point in D), and  Dx,aas the marginal distribution over x, a. With this notation, we can see that  ℓt(π)can be written as a linear functional with respect to π:

image

where  Kftis defined such that  Kf t(x, a) = K �Ni=1 1[x = xih, a = aih]f t(xih+1). Under this definition of inner product, we have:

image

It is easy to verify that Algorithm 1 is running Best Response on loss  {ct(f)}tand running FTRL on loss  {ℓt(π)}t. Usingthe no-regret guarantee from FTRL, for  {πt}, we have:

image

Denote  ˆπ⋆ and ˆf ⋆ as the minimizer and maximizer of Eqn 2, i.e.,

image

where  ˆπ⋆, ˆf ⋆ is defined in (13).

Proof. Using the definition of  ℓtand the no-regret property on  {πt}, we have:

image

Since  f t = arg maxf∈F ct(f), we have:

image

We also have:

where the first inequality uses the definition of  minπ∈Π, the second inequality uses the fact that the maximum is larger than the average, and the last inequality uses the fact that ˆf ⋆ is the maximizer with respect to  ˆπ⋆.

Combining the above results, we have:

image

Denote  π⋆ and f ⋆ as

image

Now we are ready to prove Theorem 3.1

Define ˆf ′ = maxf∈F( 1N�Ni=1 Kπ⋆(aih|xih)f(xih+1) − 1N�Ni=1 f(˜xih+1)). Combine the above inequalities together, we have:

image

where the first equality uses the definition of ¯f ⋆, the second inequality uses Lemma A.2, the third inequality uses the fact that  ˆπ⋆ and ˆf ⋆ are the min-max solution of (13), and the fifth inequality uses the fact that  f ⋆ is the maximizer of (14) given π⋆. Hence, we prove the theorem.

image

B. Proof of Theorem 3.2

Lemma B.1. There exists a distribution  D ∈ ∆(X), such that for any two datasets  S1 = {x1, . . . , xM}and  S2 ={x′1, . . . , x′M} where xi and x′i are drawn i.i.d from D, as long as M = O(log(|X|)), then:

image

Proof. We simply set D to be a uniform distribution of X. Denote |X| = N, and M = O(log(N)). The probability of  S1and  S2does not have any overlap samples can be easily computed as:

image

Note that the probability that  S1does not have repeated samples can be computed as:

image

When  N → ∞ and M = O(log N), we have:

Now, conditioned on the event that  S1does not contain repeated samples, we have:

image

Take  N → ∞, we know that  limx→∞(1 − 1/x)x = 1/e and limN→∞ M 2/N = 0, hence we have:

image

Hence we prove the lemma by coming two results above.

We construct the MDP using the above lemma. The MDP has H = 2, two actions  {a, a⋆}, and the initial distribution  ρassigns probability one to a unique state  ˆx ∈ X. The expert policy  π⋆is designed to be  π⋆(a⋆|ˆx) = 1, i.e., the expert’s action at time step  h = 1 is a⋆. We split the state space X into half and half, denoted as  X1 and X2, such that  X1 ∩ X2 = ∅and  |X1| = |X2| = N/2. We design the MDP’s dynamics such that  P(·|ˆx, a)assigns probability 2/N to each state in  X1and assigns probability 0 to any other state in  X2. We design  P(·|ˆx, a⋆)such that it assigns probability 2/N to each state in X2and zero to each state in  X1.

Denote  D⋆ = {˜x(i)2 }Mi=1as the states generated from the expert by executing  a⋆at  ˆx. For any two policies  πand  π′, such that  π(a⋆|ˆx) = 1 and π′(a|ˆx) = 1, denote D = {xi2}Mi=1 as the dataset sampled from executing  π at ˆx Mmany times, and D′ = {x′(i)2 }Mi=1 as the dataset sampled from executing  π′ at ˆx Mmany times. From Lemma B.1, we know that

image

Hence asymptotically either  D nor D′ will overlap with  D⋆, unless M = Ω(poly(N)) = Ω(poly(|X|)).

C. Proof of Theorem 3.3

We first present some extra notations and useful lemmas below.

Lemma C.1 (Performance Difference Lemma (Kakade & Langford, 2002)). Consider a policy  π = {π1, . . . , πH}and π⋆ = {π⋆1, . . . , π⋆H}. We have:

image

Note that under our setting, i.e., the cost function does not depend on actions, the above equation can be simplified to:

image

where we use Bellman equations, i.e.,  Q⋆h(x, a) = c(x)+Ex′∼Px,aV ⋆h+1(x′) and V ⋆h (x) = c(x)+Ea∼π⋆h,x′∼Px,aV ⋆h+1(x′).8

Note that for any h, we have:

where the first inequality comes from the triangle inequality, the second inequality comes from the fact that  V ⋆h+1 ∈ Fh+1,and in the third inequality, we denote  ∆h = maxf∈F��Ex∼µ⋆h[f(x)] − Ex∼µπh [f(x)]��, and ϵbeis introduced because  ΓhV ⋆h+1might not in  Fh.

Now we are ready to prove the main theorem.

Proof of Theorem 3.3. We consider the h’th iteration. Let us denote  π = {π1, . . . , πh−1}and  µπhas the observation distribution at time step h of following policies  πstarting from the initial distribution  ρ. Denote  µ⋆h+1as the observation distribution of the expert policy  π⋆ at time step h + 1 starting from the initial distribution  ρ. Note that the dataset  {˜x(i)h+1}ni=1is generated from distribution  µ⋆h+1. The data at D is generated i.i.d by first drawing sample  x(i)hfrom  µπh(i.e., executing π1, . . . πh−1), and then sample action  a(i)h ∼ U(A), and then sample  x(i)h+1 from the real system  Px(i)h ,a(i)h .

Mapping the above setup to the setup in Theorem 3.1, i.e., set

Algorithm 1 will output a policy  πhsuch that with probability at least  1 − δ:

image

Recall the definition of refined inherent Bellman Error  ϵ′be with respect to  Fh and π⋆:

image

Denote ˆf as:

image

and  ˆg as:

Now we upper bound  minπ∈Πh dFh+1(π|µπh , µ⋆h+1)as follows.

where the first inequality comes from the realizable assumption that  π⋆h ∈ Π, the second inequality comes from an application of triangle inequality, and the third inequality comes from the definition of  ϵbeand the fact that  ˆg ∈ Fh.

After learn  πh, πis updated to  π = {π1, . . . , πh}. For ∆h+1, we have:

image

Define  ∆0 = maxf |Ex∼ρ[f(x)] − Ex∼ρ[f(x)]| = 0, we have for any h,

image

Using Performance Difference Lemma (Lemma C.1), we know that:

image

D. Proof of Proposition 4.1

We first show the construction of the MDP below. The MDP has horizon  H, 2H − 1many states, and two actions {l, r} standing for go left and go right respectively. All states are organized in a perfect balanced binary tree, with  2H−1many leafs at level h = H, and the first level h = 1 contains only a root. The transition is deterministic such that at any internal state, taking action l leads to the state’s left child, and taking action r leads to the state’s right child. Each internal node has cost zero, and all leafs will have nonzero cost which we will specify later. Note that in such MDP, any sequence of actions  {a1, . . . , aH−1} with ai ∈ {l, r}deterministically leads to one and only one leaf, and the total cost of the sequence of actions is only revealed once the leaf is reached.

The first part of the proposition is proved by reducing the problem to Best-arm identification in multi-armed bandit (MAB) setting. We use the following lower bound of best-arm identification in MAB from (Krishnamurthy et al., 2016):

Lemma D.1 (Lower bound for best arm identification in stochastic bandits from (Krishnamurthy et al., 2016)). For any K ≥ 2 and ϵ ∈ (0,�1/8], and any best-arm identification algorithm, there exists a multi-armed bandit problem for which the best arm  i⋆ is ϵbetter than all others, but for which the estimate ˆiof the best arm must have  P(ˆi ̸= i⋆) ≥ 1/3unless the number of samples collected T is at least  K/(72ϵ2).

Given any MAB problem with K arms, without loss of generality, let us assume  K = 2H − 1 for some H ∈ N+. Any suchMAB problem can be reduced to the above constructed binary tree MDP with horizon  H, and 2H − 1leafs. Each arm in the original MAB will be encoded by a unique sequence of actions  {ah}H−1h=1 with ah ∈ {l, r}, and its corresponding leaf. We assign each leaf the cost distribution of the corresponding arm. The optimal policy in the MDP, i.e., the sequence of actions leading to the leaf that has the smallest expected cost, is one-to-one corresponding to the best arm, i.e., the arm that has the smallest expected cost in the MAB. Hence, without any further information about the MDP, any RL algorithm that aims to find the near-optimal policy must suffer the lower bound presented in Lemma D.1, as otherwise one can solve the original MAB by first converting the MAB to the MDP and then running an RL algorithm. Hence, we prove the first part of the proposition.

For the second part, let us denote the sequence of the observations from the expert policy as  {˜xh}Hh=1, i.e., the sequence of states corresponding to the optimal sequence of actions where the last state  ˜xHhas the smallest expected cost. We design an IL algorithm as follows.

We initialize a sequence of actions  a = ∅. At every level h, staring at h = 1, we try any sequence of actions with prefix a ◦ l (a ◦ ameans we append action a to end of the sequence a), record the observed observation  xlh+1; we then reset and try any sequence of actions with prefix  a ◦ r, and record the observed observation  xlh+1. If xlh+1 = ˜xh+1, then we append l to a, i.e.,  a = a ◦ l, otherwise, we append r, i.e.,  a = a ◦ r. We continue the above procedure until  h = H − 1, and we output the final action sequence a.

Due to the deterministic transition, by induction, it is easy to verify that the outputted sequence of actions a is exactly equal to the optimal sequence of actions executed by the expert policy. Note that in each level h, we only generate two trajectories from the MDP. Hence the total number trajectories before finding the optimal sequence of actions is at most  2(H − 1). Hence we prove the proposition.

Let us denote a set  {yi}2Ni=1such that  {y1, . . . , yN} = {x1, . . . , xN}, and  {yN+1 . . . , y2N} = {x′1, . . . , x′N}. Denote di,j = D(yi, yj) for i ̸= j, and ci = 1/N for i ∈ [N] and ci = −1/N for i ∈ [N + 1, 2N]. We formulate the following LP with 2N variables and  O(N 2)many constraints:

image

Denote the solution of the above LP as  α⋆i . We will have the following claim:

Claim E.1 (LP Oracle). Given F in (5),  {xi}Ni=1, and  {x′i}Ni=1, denote  {α⋆i }2Ni=1as the solution of the LP from (20), we have:  supf∈F��Ni=1 f(xi)/N − �Ni=1 f(x′i)/N�= �2Ni=1 ciα⋆i .

Proof of Claim E.1. Given the solutions  {α⋆i }2Ni=1, we first are going to construct a function  ˆf : X → R, such that for any  yi,we have ˆf(yi) = α⋆i, and  ˆf ∈ F. Denote  L⋆ = maxi̸=j |α⋆i − α⋆j|/di,j. Note that  L⋆ ≤ L. The function  ˆfis constructed

as:

First of all, we show that for any  yi, we have ˆf(yi) = α⋆i . For any j ̸= i, we have:

image

where the first inequality uses the definition of  L⋆. Also we know that  −1 ≤ α⋆i ≤ 1. Hence we have that for  yi, max(−1, min(1, minj∈[2N] L⋆D(yj, yi) + α⋆j)) = α⋆i .

Now we need to prove that ˆfis L-Lipschitz continuous. Note that we just need to prove that  mini L⋆D(yi, x) + α⋆iis L-Lipschitz continuous, since for any L-Lipschitz continuous function  f(x), we have max(1, f(x)) and min(−1, f(x)) tobe L-Lipschitz continuous as well.

Consider any two points  x and x′ such that x ̸= x′. Denote ˆi as arg mini L⋆D(yi, x) + α⋆i and ˆi′ = arg mini L⋆D(yi, x′) +α⋆i . We have:

image

where for the first inequality we used the definition of ˆi, and the second inequality uses the triangle inequality. Similarly, one can show that

image

Combine the above two inequalities and the fact that  L⋆ ≤ L, we conclude that ˆf is L-Lipschitiz continuous.

Now we have constructed ˆfsuch that ˆf(yi) = α⋆ifor all  i ∈ [2N]and  ˆf ∈ F. Now suppose that there exists a function f ′ ∈ F, such that  | �Ni=1 f ′(xi)/N − �Ni=1 f ′(x′i)/N| > | �Ni=1 ˆf(xi)/N − �Ni=1 ˆf ′(x′i)/N|, then we must have for some  i ∈ [2N], f ′(yi) ̸= α⋆i . However, since  f ′ ∈ F, we must have that  {f ′(yi)}2Ni=1 satisfies all constrains in the LP in (20). Hence the assumption that �2Ni=1 cif ′(yi) > �2Ni=1 ciα⋆i contradicts to the fact that  {αi}2Ni=1 is the maximum solution of the LP formulation in (20). Hence we prove the claim.

image

F. Proof of Corollary 5.1

Since in this setting,  Fhfor all  h ∈ [H]contains infinitely many functions, we need to discretize  Fhbefore we can apply the proof techniques from the proof of Theorem 3.3. We use covering number.

Denote  N(X, ϵ, D)as the  ϵ-cover of the metric space (X, D). Namely, for any  x ∈ X, there exists a  x′ ∈ N(X, ϵ, D)such that  D(x′, x) ≤ ϵ. Consider any function class  F = {f : X → R, ∥f∥L ≤ L, ∥f∥∞ ≤ 1}with  L ∈ R+. Below we construct the  ϵ-cover over F.

For any  f ∈ F, denote ¯f ∈ R|N (X,ϵ,D)| with the i-th element ¯fibeing the function value  f(xi)measured at the i-th element xi from N(X, ϵ, D). Hence ¯F ≜ { ¯f : f ∈ F} ∈ R|N (X,ϵ,D)|, and ∥ ¯f∥∞ ≤ C for any ¯f ∈ ¯F. Denote ¯N( ¯F, α, ∥ · ∥∞) asthe  α-cover of ¯F. Let us denote the set  N ≜ {f ∈ F : ¯f ∈ ¯N( ˜F, α, ∥ · ∥∞)}.

Claim F.1. With the above set up, for  F’s (α + 2Lϵ)-cover, i.e.,  N(F, α + 2Lϵ, ∥f∥∞), we have

image

Proof. By definition, we know that for any ¯f ∈ ¯F, we have that there exists a ¯f ⋆ ∈ ¯Fsuch that  ∥ ¯f − ¯f ⋆∥∞ ≤ α. Now

consider  ∥f − f ⋆∥∞. Denote x⋆ = arg maxx |f(x) − f ⋆(x)| and x⋆′ is its closest point in  N(X, ϵ, D). We have:

image

where the last inequality comes from the fact that  f, f ⋆are L-Lipschitz continuous,  D(x⋆, x⋆′) ≤ ϵ, ∥ ¯f − ¯f ⋆∥∞ ≤ αand x⋆′ ∈ N(X, ϵ, D). Hence, we just identify a subset of F such that it forms a  α + 2Lϵ cover for Funder norm  ∥ · ∥∞.

Note that ¯N is a α-cover with  ∥ · ∥∞ for ¯Fwhich is a subset of  |N(X, ϵ, D)|-dimension space. By standard discretization along each dimension, we prove the claim.

From the above claim, setting up  α and ϵproperly, we have:

Extending the analysis of Theorem 3.3 simply results extending the concentration result in Lemma A.1. Specifically via Bernstein’s inequality and a union bound over  Π×N(F, ϵ/K, ∥·∥∞), we have that for any  π ∈ Π, ˜f ∈ N(F, ϵ/K, ∥·∥∞),with probability at least  1 − δ,

image

Now using the fact that  N(F, ϵ/K, ∥ · ∥∞)is an  ϵ-cover under norm  ∥ · ∥∞, we have that for any  π ∈ Π, f ∈ F, with probability at least  1 − δ,������1N

image

The rest of the proof is the same as the proof of Theorem 3.3.

Recall that with  {π1, . . . , πh−1}being fixed, we denote  νhas resulting observation distribution resulting at time step h. The interactiveness comes from the ability we can query expert to generate next observation conditioned on states sampled from νh—the states that would be visited by learner at time step h. Let us define  d(π|νh, π⋆h) as:

image

Note that different from  dFh+1(π|νh, µ⋆h+1), in  dFh+1(π|νh, π⋆h), the marginal distributions on x are the same for both  πand  π⋆hand we directly access  π⋆hto generate epxert observations at h + 1 rather than thorough the expert observation distribution  µ⋆h+1. In other words, we use IPM to compare the observation distribution at time step h + 1 after applying  πand the observation distribution at time step h + 1 after applying  π⋆, conditioned on the distribution  νhgenerated by the previously learned policies  {π1, . . . , πh−1}. In Algorithm 5, at every time step h, to find a policy  πhthat approximately minimizes  dFh+1(π|νh, π⋆h), we replace expectations in  dFh+1(π|νh, π⋆h)by proper samples (line ?? and line ??), and then call Algorithm 1 (line ??).

image

Proof. Recall the definition of  d(π|νh, π⋆h),

with  νhbeing the distribution over  Xhresulting from executing policies  {π1, . . . , πh−1}.

We will use Lemma A.1, Lemma A.2, and Lemma C.1 below.

The Performance Difference Lemma (Lemma C.1) tells us that:

where the last inequality comes from the realizable assumption that  V ⋆h ∈ Fh.

At every time step h, mapping to Theorem 3.1 with  νh = µπh,  T = 4K2/ϵ2, n = K log(|Π||F|/δ)/ϵ2, we have that with probability at least  1 − δ:

image

Note that  minπ∈Πh dFh+1(π|µπh , π⋆h) ≤ dFh+1(π⋆h|µπh , π⋆h) = 0, since  π⋆h ∈ Πhby the realizable assumption. Hence, we have that:

image

Hence, using (21), and a union bound over all time steps  h ∈ [H], we have that with probability at least  1 − δ,

image

with  T = 4K2/ϵ2, and N = K log(H|Π||F|/δ)/ϵ2. Since in every round h, we need to draw N many trajectories, hence, the total number of trajectories we need is at most  HK log(H|Π||F|/δ)/ϵ2.

image

Our theoretical results presented so far rely on the realizable assumption (Assumption 2.1). While equipped with recent advances in powerful non-linear function approximators (e.g., deep neural networks), readability can be ensured, in this section, we relax the realizable assumption and show that our algorithms’ performance only degenerates mildly. We relax Assumption 2.1 as follows:

Assumption H.1 (Approximate Realizability). We assume  Πand F is approximate realizable in a sense that for any h ∈ [H], we have minπ∈Πh maxx,a ∥π(a|x) − π⋆h(a|x)∥ ≤ ϵΠ and minf∈Fh ∥f − V ⋆h ∥∞ ≤ ϵF.

The above assumption does not require F and  Πto contain the exact  V ⋆hand  π⋆h, but assumes F and  Πare rich enough to contain functions that can approximate  V ⋆hand  π⋆huniformly well (i.e.,  ϵFand  ϵΠare small). Without any further modification of Algorithm 2 and Algorithm 5 for non-interactive and interactive setting, we have the following corollary.

Corollary H.2. Under Assumption H.1, for  ϵ ∈ (0, 1) and δ ∈ (0, 1), with T = Θ(K/ϵ2), n = Θ(K log(|Π||F|H/δ)/ϵ2),with probability at least  1 − δ, (1) for non-interactive setting, FAIL (Algorithm 2) outputs a policy  πwith  J(π) −J(π⋆) ≤ O�H2(ϵbe + ϵ) + H(ϵF + ϵΠ)�, and (2) for interactive setting, IFAIL (Algorithm 5) outputs a policy  πwith J(π) − J(π⋆) ≤ O (Hϵ + HϵF + HϵΠ), by using at most ˜O((HK/ϵ2) log(|Π||F|/δ))many trajectories under both settings.

The proof is deferred to Appendix H.1.

H.1. Proof of Corollary H.2

Proof of Corollary H.2. For any  h, denote gh as

Below we prove the first bullet in Corollary H.2, i.e., the results for non-interactive setting.

Non-Interactive Setting Using PDL (Lemma C.1), we have

Now repeat the same recursive analysis for  dFh+1(πh|µπh , µ⋆h)as we did in proof of Theorem 3.3 in Appendix C, we can prove the first bullet in the corollary.

Now we prove the second bullet in Corollary H.2, i.e., the results for interactive setting.

Interactive Setting Again, using Performance Difference Lemma (Lemma C.1), we have

image

Now repeat the same steps from the proof of Theorem 5.2 after (21) in proof of Theorem 5.2, we can prove the second bullet in the corollary.

image

We consider the bisimulation model from (6). The following proposition summarizes the conclusion in this section.

Proposition I.1. Assume Bisimulation holds (Eq. 6) and set  Fh = {f : ∥f∥∞ ≤ 1, f(x) = f(x′), ∀x, x′ s.t. φ(x) = φ(x′)} , ∀h ∈[H] to be piece-wise constant functions over the partitions induced from  φ. We have:

1.  V ⋆h is a piece-wise constant function for all  h ∈ [H],

2.  ϵbe = 0,

3.  supf∈Fh(�Ni=1 f(xi)/N − �Ni=1 f(x′i)/N)can be solved by LP, for all  h ∈ [H],

4. given any {xi}Ni=1, the Rademacher complexity of Fhis in the order of O(�|S|/N), i.e., (1/N)Eσ[supf∈Fh�Ni=1 σif(xi)] = O(�|S|/N), with σibeing a Rademacher number.

The above proposition states that by leveraging the abstraction, we can design discriminators to be piece-wise constant functions over the partitions induced by  φ, such that inherent Bellman error is zero, and the discriminator class has bounded statistical complexity. Below we prove the above proposition. The first two points in the above proposition were studied in (Chen & Jiang, 2019). For completeness, we simply prove all four points below.

Piece-wise constant  V ⋆First, we show that  V ⋆h (x)is piece-wise constant over the partitions induced from  φ. Starting from H, via (6), we know that  V ⋆H(x) = c(x), which is piece-wise constant over the partitions induced from  φ. Then let us assume that for  h + 1, we have V ⋆h+1(x) = V ⋆h+1(x′) for any x, x′ s.t. φ(x) = φ(x′). At time step h, via Bellman equation, we know:

image

Hence for any two  x, x′ with φ(x) = φ(x′), we have:

where the second and the last equality use (6). In the third equality above, we abuse the notation  V ⋆h+1(s)for  s ∈ Sto denote the value of  V ⋆h+1(x) for any x such that φ(x) = s.

Inherent Bellman Error With  Fh = {f : ∥f∥∞ ≤ 1, f(x) = f(x′), ∀x, x′ s.t. φ(x) = φ(x′)} , ∀h ∈ [H], we can show ϵbe = 0as follows. For any  x, x′ ∈ X with φ(x) = φ(x′), f ∈ Fh+1, we have:

image

where again we abuse the notation f(s) to denote that value f(x) for any x such that  φ(x) = s. Namely,  Γhfis also a piece-wise constant over the partitions induced from  φ. Since ∥f∥∞ ≤ 1, it is also easy to see that  ∥Γhf∥∞ ≤ 1. Hence wehave  Γhf ∈ Fh.

Reduction to LP Regarding evaluating  supf∈Fh��Ni=1 f(xi)/N − �Ni=1 f(x′i)/N�, we can again reduce it an LP.

Denote  α ∈ [−1, 1]|S|, where the i-th entry in  αcorresponds to the i-th element in S. We denote  αsas the entry in  αthat corresponds to the state  s in S. Take {xi}Ni=1, and compute  cs = �Ni=1 1(φ(xi) = s) for every s ∈ S (i.e., csis the number of points mapped to  s). Take {x′i}Ni=1 and compute  c′s = �Ni=1 1(φ(x′i) = s). We solve the following LP:

image

Denote the solution of the above LP as  α⋆. Then f ⋆(x) = α⋆φ(x).

Complexity of Discriminators  FhRegarding the complexity of  Fh, note that  Fhessentially corresponds to a |S|-dim box: [−1, 1]|S|. Again, consider a dataset  {xi}Ni=1 and the counts  {cs}s∈S. For any f, and Rademacher numbers  σ ∈ {−1, 1}N,we have

image

Now, we can show that the Rademacher complexity of  Fhis bounded as follows:

image

image

When we design the utility in (3), we sample actions from U(A) and then perform importance weighting. This ensures that in analysis the variance will be bounded by K. In practice, we can use any reference policy to generate actions, and then perform importance weighting accordingly. Assume that we have a dataset  D = {xih, aih, pih, xih+1}Ni=1and the expert’s dataset  D⋆ = {˜xih+1}N ′i=1, where pih is the probability of action  aih being chosen at  xih. We can form the utility as follows:

image

As long as the probability of choosing any action at any state is lower bounded, then the variance of the above estimator is upper bounded. This formulation also immediately extends FAIL to continuous action space setting. For a parameterized policy  πθ, given any f, we can compute  ∇θu(πθ, f)easily. If  ah ∼ πθ(·|x)(i.e., on-policy samples), then for any fixed f, the policy gradient  ∇θu(πθ, f)can be estimated using the REINFORCE trick:

image

With the form of  ∇θu(πθ, f), we can perform the min-max optimization in Alg. 1 by iteratively finding the maximizer f n = arg maxf u(πθn, f)using LP oracle, and then perform gradient descent update  θn+1 = θn − ηn∇θu(πθn, f n). See Algorithm 4 below. Note that in Algorithm 4 the dataset  D = {xih, aih, pih, xih+1} contains pih which is the probability of  aih being chosen at  xih. We can integrate Algorithm 4 into the forward training framework.

image

In Algorithm 2, at every time step h, we execute the current sequence of policies  π = {π1, . . . , πh−1}to collect samples at time step h, i.e.,  xh. We then throw away all generated samples  {x1, . . . , xh−1}except  xh. While this simplifies the analysis, in practice, we could leverage these samples  {x1, . . . , xh−1}as well, especially now we can form the utility with on-policy samples and compute the corresponding policy gradient ((23)). This leads us to Alg. 5. Namely, in Algorithm 5, when training  πh, we also incrementally update  π1, . . . , πh−1using their on-policy samples (Line 11 Algorithm 5).

image

Figure 4. Performance of expert, FAIL∗ (Algorithm 5), FAIL(Algorithm 2), and GAIL (without actions) on three control tasks. For the top line, we fix the number of training samples while varying the number of expert demonstrations (6, 12, 25). For the bottom line, we fix the number of expert demonstrations, while varying the number of training samples. All results are averaged over 10 random seeds.

We test Algorithm 5 on the same set of environments (Figure 4) under 10 random rand seeds, with all default parameters. We observe that FAIL∗can be more sample efficient especially in small data setting (e.g., 0.25 million training samples). Implementation of Algorithm 5 and scripts for reproducing results can be found in https://github.com/wensun/ Imitation-Learning-from-Observation/tree/fail_star.


Designed for Accessibility and to further Open Science