Imitation Learning (IL) is a sample efficient approach to policy optimization (Ross et al., 2011; Ross & Bagnell, 2014; Sun et al., 2017) that has been extensively used in real applications, including Natural Language Processing (Daum´e III et al., 2009; Chang et al., 2015b;a), game playing (Silver et al., 2016; Hester et al., 2017), system identification (Venkatraman et al., 2015; Sun et al., 2016), and robotics control tasks (Pan et al., 2018). Most previous IL work considers settings where an expert can directly provide action signals. In these settings, a general strategy is to directly learn a policy that maps from state to action, via supervised learning approaches (e.g., DAgger (Ross et al., 2011), AggreVaTe (Ross & Bagnell, 2014), THOR (Sun et al., 2018a), Behaviour Cloning (Syed & Schapire, 2010)). Another popular strategy is to learn a policy by minimizing some divergence between the policy’s state-action distribution and the expert’s state-action distribution. Popular divergences include Forward KL (i.e., Behaviour Cloning), Jensen Shannon Divergence (e.g., GAIL (Ho & Ermon, 2016)).
Here, we consider a more challenging IL setting, where experts’ demonstrations consist only of observations, no action or reward signals are available to the learner, and no reset is allowed (e.g., a robot learns a task by just watching an expert performing the task). We call this setting Imitation Learning from Observations alone (ILFO). Under this setting, without access to expert actions, approaches like DAgger, AggreVaTe, GAIL, and Behaviour Cloning by defi-nition cannot work. Although recently several model-based approaches, which learn an inverse model that predicts the actions taken by an expert (Torabi et al., 2018; Edwards et al., 2018) based on successive observations, have been proposed, these approaches can suffer from covariate shift (Ross et al., 2011). While we wish to train a predictor that can infer an expert’s actions accurately under the expert’s observation distribution, we do not have access to actions generated by the expert conditioned on the expert’s observation (See Section 6 for a more detailed discussion). An alternative strategy is to handcraft cost functions that use some distance metric to penalize deviation from the experts’ trajectories (e.g., Liu et al. (2018); Peng et al. (2018)), which is then optimized by Reinforcement Learning (RL). These methods typically involve hand-designed cost functions that sometimes require prior task-specific knowledge (Peng et al., 2018). The quality of the learned policy is therefore completely dependent on the hand-designed costs which could be widely different from the true cost. Ideally, we would like to learn a policy that minimizes the unknown true cost function of the underlying MDP.
In this work, we explicitly consider learning near-optimal policies in a sample and computationally efficient manner. Specifically, we focus on large-scale MDPs where the number of unique observations is extremely large (e.g., highdimensional observations such as raw-pixel images). Such large-scale MDP settings immediately exclude most existing sample efficient RL algorithms, which are often designed for small tabular MDPs, whose sample complexities have a polynomial dependency on the number of observations and hence cannot scale well. To solve large-scale MDPs, we need to design algorithms that leverage function approximation for generalization. Specifically, we are interested in algorithms with the following three properties: (1) near-optimal performance guarantees, i.e., we want to search for a policy whose performance is close to the expert’s in terms of the expected total cost of the underlying MDP (and not a hand-designed cost function); (2) sample efficiency, we require sample complexity that scales polynomially with respect to all relevant parameters (e.g., horizon, number of actions, statistical complexity of function approximators) except the cardinality of the observation space—hence excluding PAC RL algorithms designed for small tabular MDPs; (3) computational efficiency: we rely on the notion of oracle-efficiency (Agarwal et al., 2014) and require the number of efficient oracle calls to scale polynomially— thereby excluding recently proposed algorithms for Contextual Decision Processes which are not computationally efficient (Jiang et al., 2016; Sun et al., 2018b). To the best of our knowledge, the desiderata above requires designing new algorithms.
With access to experts’ trajectories of observations we introduce a model-free algorithm, called Forward Adversarial Imitation Learning (FAIL), that decomposes ILFO into H independent two-player min-max games, where H is the horizon length. We aim to learn a sequence of time-dependent policies from h = 1 to H, where at any time step h, the policy is learned such that the generated observation distribution at time step h + 1, conditioned on
being fixed, matches the expert’s observation distribution at time step h + 1, in terms of an Integral Probability Metric (IPM) (M¨uller et al., 1997). IPM is a family of divergences that can be understood as using a set of discriminators to distinguish two distributions (e.g., Wasserstein distance is one such special instance). We analyze the sample complexity of FAIL and show that FAIL can learn a near-optimal policy in sample complexity that does not explicitly depend on the cardinality of observation space, but rather only depends on the complexity measure of the policy class and the discriminator class. Hence FAIL satisfies the above mentioned three properties. The resulting theory extends the domain of provably sample efficient learning algorithms beyond existing results, which typically only consider tabular reinforcement learning settings (e.g., Dann & Brunskill (2015)) or settings that require access to a near-optimal reset distribution (e.g., Kakade & Langford (2002); Bagnell et al. (2004); Munos & Szepesv´ari (2008)). We also demonstrate that learning under ILFO can be exponentially more sample efficient than pure RL.We also study FAIL un- der three specific settings: (1) Lipschitz continuous MDPs, (2) Interactive ILFO where one can query the expert at any time during training, and (3) state abstraction. Finally, we demonstrate the efficacy of FAIL on multiple continuous control tasks.
We consider an episodic finite horizon Decision Process that consists of , where
for
[H] is a time-dependent observation space,1 A is a discrete action space such that
is the horizon. We assume the cost function
is only defined at the last time step H (e.g., sparse cost), and
the initial observation distribution, and P is the transition model i.e.,
for
. Note that here we assume the cost function only depends on observations. We assume that
for all
is discrete, but
is extremely large and hence any sample complexity that has polynomial dependency on
be considered as sample inefficient, i.e., one cannot afford to visit every unique observation. We assume that the cost is bounded, i.e., for any sequence of observations,
(e.g., zero-one loss at the end of each episode). For a time-dependent policy
with
, the value function
is defined as:
and state-action function is defined as
with
. We denote
as the distribution over
at time step
policy classes
, the goal is to learn a
with
, which minimizes the expected cost:
Denote for
. We define a Bellman Operator
associated with the expert
at time step h as
where for any
Notation For a function , we denote
as the Lipschitz constant:
, with
being the metric in space X.2 We consider a Reproducing Kernel Hilbert Space (RKHS) H defined with a positive definite kernel
such that H is the span of
, and we have f(x) =
, with
. For any
, we define
. We denote U(A) as a uniform distribution over action set A. For
, we denote
Integral Probability Metrics (IPM) (M¨uller, 1997) is a family of distance measures on distributions: given two distributions and
over X, and a function class F containing real-value functions
and symmetric (e.g.,
), IPM is defined as:
By choosing different class of functions F, various popular distances can be obtained. For instance, IPM with F = {f : recovers Total Variation distance, IPM with
recovers Wasserstein distance, and IPM with
with RKHS H reveals maximum mean discrepancy (MMD).
2.1. Assumptions
We first assume access to a Cost-Sensitive oracle and to a Linear Programming oracle.
Cost-Sensitive Oracle The Cost-Sensitive (CS) oracle takes a class of classifiers , a dataset consisting of pairs of feature x and cost vector
, i.e.,
, as inputs, and outputs a classifier that minimizes the average expected classification cost:
Efficient cost sensitive classifiers exist (e.g., Beygelzimer et al. (2005)) and are widely used in sequential decision making tasks (e.g., Agarwal et al. (2014); Chang et al. (2015b)).
Linear Programming Oracle A Linear Programming (LP) oracle takes a class of functions G as inputs, optimizes a linear functional with respect to :
When G is in RKHS with bounded norm, the linear functional becomes , from which one can obtain the closed-form solution. Another example is when G consists of all functions with bounded Lipschitiz constant, i.e.,
for
, Sriperum- budur et al. (2012) showed that
be solved by Linear Programming with n many constraints, one for each pair
Appendix E, we provide a new reduction to Linear Programming for
The second assumption is related to the richness of the function class. We simply consider a time-dependent policy class , and we assume realizability:
Assumption 2.1 (Realizability and Capacity of Function Class). We assume and
contains
and
, i.e.,
. Further assume that for all
is symmetric, and
is finite in size.
Note that we assume and
to be discrete (but could be extremely large) for analysis simplicity. As we will show later, our bound scales only logarithmically with respect to the size of function class.
Our algorithm, Forward Adversarial Imitation Learning (FAIL), aims to learn a sequence of policies such that its value
is close to
. Note that
does not necessarily mean that the state distribution of
is close to
. FAIL learns a sequence of policies with this property in a forward training manner. The algorithm learns
starting from i = 1. When learning
, the algorithm fixes
, and solves a min-max game to compute
; it then proceeds to time step i + 1. At a high level, FAIL decomposes a sequential learning problem into H-many independent two-player min-max games, where each game can be solved efficiently via no-regret online learning. Below, we first consider how to learn
conditioned on
being fixed. We then present FAIL by chaining H min-max games together.
3.1. Learning One Step Policy via a Min-Max Game
Throughout this section, we assume are learned already and fixed. The sequence of policies
for
determines a distribution
over observation space
at time step h. Also expert policy
naturally induces a sequence of observation distributions
for
. The problem we consider in this section is to learn a policy
, such that the resulting observation distribution from
at time step h + 1 is close to the expert’s observation distribution
at time step h + 1.
We consider the following IPM minimization problem. Given the distribution , and a policy
, via the Markov property, we have the observation distribution at time step h + 1 conditioned on
and
as
for any
. Recall that the expert observation distribution at time step h + 1 is denoted as
. The IPM with
between
is defined as:
Note that is parameterized by
, and our goal is to minimize
with respect to
over
. However,
is not measur- able directly as we do not have access to
but only samples from
. To estimate
, we draw a dataset
such that
,
, together with observation set resulting from expert
, we form the follow- ing empirical estimation of
, via importance weighting (recall
where recall that K = |A| and the importance weight is used to account for the fact that we draw ac- tions uniformly from A but want to evaluate
. Though due to the max operator,
is not an unbiased estimate of
, in Appendix A, we show that
indeed is a good approximation of
via an application of the standard Bernstein’s inequality and a union bound over
. Hence we can approximately minimize
with respect to
resulting in a two-player min-max game. Intuitively, we can think of
as a generator, such that, conditioned on
, it generates next-step samples
that are similar to the expert samples from
, via fooling discriminators
Note that the above formulation is a two-player game, with the utility function for defined as:
Algorithm 1 solves the minmax game using no-regret online update on both f and
. At iteration n, player f plays the best-response via
(Line 3) and player
plays the Follow-the-Regularized Leader (FTRL) (Shalev-Shwartz et al., 2012) as
being con- vex regularization (Line 5). Note that other no-regret online learning algorithms (e.g., replacing FTRL by incremental online learning algorithms like OGD (Zinkevich, 2003) can also be used to approximately optimize the above min-max formulation. After the end of Algorithm 1, we output a policy
among all computed policies
such that
is minimized (Line 7).
Regarding the computation efficiency of Algorithm 1, the best response computation on f in Line 3 can be computed by a call to the LP Oracle, while FTRL on can be implemented by a call to the regularized CS Oracle. Regarding the statistical performance, we have the following theorem:
The proof of the above theorem is included in Appendix A, which combines standard min-max theorem and uniform convergence analysis. The above theorem essentially shows that Algorithm 1 successfully finds a policy whose resulting IPM is close to the smallest possible IPM one could achieve if one had access to
directly, up to
error. Intuitively, from Theorem 3.1, we can see that if
—the observation distribution resulting from fixed policies
, is similar to
, then we guarantee to learn a policy
, such that the new sequence of policies
will generate a new distribution
that is close to
, in terms of IPM with
. The algorithm introduced below is based on this intuition.
3.2. Forward Adversarial Imitation Learning
Theorem 3.1 indicates that conditioned on being fixed, Algorithm 1 finds a policy
such that it approximately minimizes the divergence—measured under IPM with
, between the observation distribution
resulting from
, and the corresponding distribution
from expert.
With Algorithm 1 as the building block, we now introduce our model-free algorithm—Forward Adversarial Imitation Learning (FAIL) in Algorithm 2. Algorithm 2 integrates Algorithm 1 into the Forward Training framework (Ross & Bagnell, 2010), by decomposing the sequential learning problem into H many independent distribution matching problems where each one is solved using Algorithm 1 independently. Every time step h, FAIL assumes that have been correctly learned in the sense that the resulting observation distribution
is close to
from expert. Therefore, FAIL is only required to focus on learning
correctly conditioned on
being fixed, such that
is
close to , in terms of the IPM with
. Intuitively, when one has a strong class of discriminators, and the two-player game in each time step is solved near optimally, then by induction from h = 1 to H, FAIL should be able to learn a sequence of policies such that
is close to
for all
(for the base case, we simply have
3.3. Analysis of Algorithm 2
The performance of FAIL crucially depends on the capacity of the discriminators. Intuitively, discriminators that are too strong cause overfitting (unless one has extremely large number of samples). Conversely, discriminators that are too weak will not be able to distinguish from
. This dilemma was studied in the Generative Adversarial Network (GAN) literature already by Arora et al. (2017). Below we study this tradeoff explicitly in IL.
To quantify the power of discriminator class for all h, we use inherent Bellman Error (IBE) with respect to
The Inherent Bellman Error is commonly used in approximate value iteration literature (Munos, 2005; Munos & Szepesv´ari, 2008; Lazaric et al., 2016) and policy evaluation literature (Sutton, 1988). It measures the worst possible projection error when projecting to function space
Intuitively increasing the capacity of
Using a restricted function class F potentially introduces , hence one may tend to set
to be infinitely powerful discriminator class such as function class consisting of all bounded functions
(recall IPM becomes total variation in this case). However, using
makes efficient learning impossible. The following proposition excludes the possibility of sample efficiency with discriminator class being
Theorem 3.2 (Infinite Capacity F does not generalize). There exists a MDP with H = 2, a policy set an expert policy
with
(i.e.,
is realizable), such that for datasets
with
,
with
, and
with
, as long as M = O(log(|X|)), we must have:
Namely, denote as the empirical distribution of a dataset D by assigning probability 1/|D| to any sample, we have:
The above theorem shows by just looking at the samples generated from and
, and comparing them to the samples generated from the expert policy
(IPM becomes Total variation here), we cannot distinguish
from
, as they both look similar to
, i.e., none of the three datasets overlap with each other, resulting the TV distances between the empirical distributions become constants, unless the sample size scales
Theorem 3.2 suggests that one should explicitly regularize discriminator class so that it has finite capacity (e.g., bounded VC or Rademacher Complexity). The restricted discriminator class F has been widely used in practice as well such as learning generative models (i.e., Wasserstain GANs (Arjovsky et al., 2017)). Denote and
. The following theorem shows that the learned time-dependent policies
’s performance is close to the expert’s performance:
Theorem 3.3 (Sample Complexity of FAIL). Under As- sumption 2.1, for any , set
,
, with probability at least
Algorithm 2) outputs
, such that,
by using
average inherent Bellman Error
Note that the average inherent Bellman error defined above is averaged over the state distribution of the learned policy
and the state distribution of the expert, which is guaranteed to be smaller than the classic inherent Bellman error used in RL literature (i.e., (4)) which uses infinity norm over X. The proof of Theorem 3.3 is included in Ap- pendix C. Regarding computational complexity of Algo- rithm 2, we can see that it requires polynomial number of calls (with respect to parameters
) to the efficient oracles (Regularized CS oracle and LP oracle). Since our analysis only uses uniform convergence analysis and standard concentration inequalities, extension to continuous
and F with complexity measure such as VC-dimension, Rademacher complexity, and covering number is standard. We give an example in Section 5.1.
To quantify the gap between RL and ILFO, below we present an exponential separation between ILFO and RL in terms of sample complexity to learn a near-optimal policy. We assume that the expert policy is optimal.
Proposition 4.1 (Exponential Separation Between RL and ILFO). Fix and
. There exists a family of MDP with deterministic dynamics, with horizon H,
many states, two different actions, such that for any RL algorithm, the probability of outputting a policy
after collecting T trajectories is at most
. On the other hand, for the same MDP, given one trajectory of observations
from the expert policy
, there exists an algorithm that deterministically outputs
after collecting O(H) trajectories.
Proposition 4.1 shows having access to expert’s trajectories of observations allows us to efficiently solve some MDPs that are otherwise provably intractable for any RL algorithm (i.e., requiring exponentially many trajectories to find a near optimal policy). This kind of exponential gap previously was studied in the interactive imitation learning setting where the expert also provides action signals (Sun et al., 2017) and one can query the expert’s action at any time step during training. To the best of our knowledge, this is the first exponential gap in terms of sample efficiency between ILFO and RL. Note that our theorem applies to a specific family of purposefully designed MDPs, which is standard for proving information-theoretical lower bounds.
In this section, we study three settings where inherent Bellman Error will disappear even under restricted discriminator class : (1) Lipschitz Continuous MDPs (e.g., (Kakade et al., 2003)), (2) Interactive Imitation Learning from Observation where expert is available to query during training, and (3) state abstraction.
5.1. Lipschitz Continuous MDPs
We consider a setting where cost functions, dynamics and are Lipschitz continuous in metric space (X, d):
for the known metric d and Lipschitz constants . Under this condition, the Bellman operator with respect to
is Lipschitz continuous in the metric space (X, d):
, where we applied Holder’s inequality. Hence, we can design the function class
as follows:
which will give us and
due to the as- sumption on the cost function. Namely
is the class of functions with bounded Lipschitz constant and bounded value. This class of functions is widely used in practice for learning generative models (e.g., Wasserstain GAN). Note that this setting was also studied in (Munos & Szepesv´ari, 2008) for the Fitted Value Iteration algorithm.
Denote . For
we show that we can evaluate the empirical IPM
by reducing it to Linear Programming, of which the details are deferred to Appendix E. Regarding the generalization ability, note that our function class F is a subset of all functions with bounded Lipschitz constant, i.e.,
. The Rademacher complexity for bounded Lipschitz function class grows in the order of
(e.g., see (Luxburg & Bousquet, 2004; Sriperumbudur et al., 2012)), with cov(X) being the covering dimension of the metric space
Theorem 3.3 to Lipschitz continuous MDPs, we have the following corollary.
Corollary 5.1 (Sample Complexity of FAIL for Lipschitz Continuous MDPs). With the above set up on Lipschitz continuous MDP and set
, then with probability at least
Algorithm 2) outputs a policy with
using at most
many trajectories.
The proof of the above corollary is deferred to Appendix F which uses a standard covering number argument over with norm
. Note that we get rid of IBE here and hence as the number of sample grows, FAIL approaches to the global optimality. Though the bound has an exponential dependency on the covering dimension, note that the covering dimension cov(X) is completely dependent on the underlying metric space (X, d) and could be much smaller than the real dimension of X. Note that the above theorem also serves an example regarding how we can extend Theo- rem 3.3 to settings where F contains infinitely many functions but with bounded statistical complexity (similar techniques can be used for
5.2. Interactive Imitation Learning from Observations
We can avoid IBE in an interactive learning setting, where we assume that we can query expert during training. But different from previous interactive imitation learning setting such as AggreVaTe, LOLS (Ross & Bagnell, 2014; Chang et al., 2015b), and DAgger (Ross et al., 2011), here we do not assume that expert provides actions nor cost signals. Given any observation x, we simply ask expert to take over for just one step, and observe the observation at the next step, i.e., with
. Note that compared to the non-interactive setting, interactive setting assumes a much stronger access to expert. In this setting, we can use arbitrary class of discriminators with bounded complexity. Due to space limit, we defer the detailed description of the interactive version of FAIL (Algorithm 5) to Appendix G. The following theorem states that we can avoid IBE:
Theorem 5.2. Under Assumption 2.1 and the existence of an interactive expert, for any and
, set
, with probability at least
Algorithm 5) outputs a policy
such that:
by using at most many trajectories.
Compare to the non-interactive setting, we get rid of IBE, at the cost of a much stronger expert.
5.3. State Abstraction
Denote as the abstraction that maps X to a discrete set S. We assume that the abstraction satisfies the Bisimulation property (Givan et al., 2003): for any
, we have that
In this case, one can show that the is piece-wise constant under the partition induced from
, i.e.,
if
. Leveraging the abstraction
, we can then design
. Namely,
contains piece-wise constant functions with bounded values. Under this setting, we can show that the inherent Bellman Error is zero as well (see Proposition 9 from Chen & Jiang (2019)). Also
can be again computed via LP and
has Rademacher complexity scales
with N being the number of samples. Details are in Appendix I.
Some previous works use the idea of learning an inverse model to predict actions (or latent causes) (Nair et al., 2017; Torabi et al., 2018) from two successive observations and then use the learned inverse model to generate actions using expert observation demonstrations. With the inferred actions, it reduces the problem to normal imitation learning. We note here that learning an inverse model is ill-defined. Specifically, simply by the Bayes rule, the inverse model —the probability of action a was executed at
such that the system generated
, is equivalent to
, i.e., an inverse model
is explicitly dependent on the action generation policy
the policy
, the inverse model is ill-defined by itself alone. This means that if one wants to learn an inverse model that predicts expert actions along the trajectory of observations generated by the expert, one would need to learn an inverse model, denoted as
, such that
, which in- dicates that one needs to collect actions from
. An inverse model makes sense when the dynamics is deterministic and bijective. Hence replying on learning such an inverse model
will not provide any performance guarantees in general, unless we have actions collected from
We test FAIL on three simulations from openAI Gym (Brockman et al., 2016): Swimmer, Reacher, and the Fetch Robot Reach task (FetchReach). For Swimmer we set H to be 100 while for Reacher and FetchReach, H is 50 in default. The Swimmer task has dense reward (i.e.,., reward at every time step). For reacher, we try both dense reward and sparse reward (i.e., success if it reaches to the goal within a threshold). FetchReach is a sparse reward task. As our algorithm is presented for discrete action space, for all three tasks, we discrete the action space via discretizing each dimension into 5 numbers and applying categorical distribution independently for each dimension.5 6
Figure 1. Performance of expert, FAIL, and GAIL (without actions) on dense reward tasks (Reacher and Hopper). For (a) and (b), we fix the number of training samples while varying the number of expert demonstrations (6, 12, 25). For (c) and (d), we fix the number of expert trajectories, while varying the training samples.
For baseline, we modify GAIL (Ho & Ermon, 2016), a model-free IL algorithm, based on the implementation from OpenAI Baselines, to make it work for ILFO. We delete the input of actions to discriminators in GAIL to make it work for ILFO. Hence the modified version can be understood as using RL (the implementation from OpenAI uses TRPO (Schulman et al., 2015)) methods to minimize the divergence between the learner’s average state distribution and the expert’s average state distribution.
For FAIL implementation, we use MMD as a special IPM, where we use RBF kernel and set the width using the common median trick (e.g.,(Fukumizu et al., 2009)) without any future tuning. All policies are parameterized by one-layer neural networks. Instead of using FTRL, we use ADAM as an incremental no-regret learner, with all default parameters (e.g., learning rate) (Kingma & Ba, 2014). The total number of iteration T in Algorithm 1 is set to 1000 without any future tuning. During experiments, we stick to default hyperparameters for the purpose of best reflecting the algorithmic contribution of FAIL.7 All the results below are averaged over ten random trials with seeds randomly generated between [1, 1e6]. The experts are trained via a reinforcement learning algorithm (TRPO (Schulman et al., 2015)) with multiple millions of samples till convergence.
Figure 1 shows the comparison of expert, FAIL, and GAIL
Figure 2. Performance of expert, FAIL, and GAIL (without actions) on two sparse control tasks (Reacher Sparse and FetchReach). We fix the number of training samples while varying the number of expert demonstrations (6, 12, 25).
on two dense reward tasks with different number of expert demonstrations, under fixed total number of training samples (one million). We report mean and standard error in Figure 1. We observe GAIL outperforms FAIL in Swimmer on some configurations, while FAIL outperforms GAIL on Reacher (Dense reward) for all configuration.
Figure 2 shows the comparison of expert, FAIL, and GAIL on two sparse reward settings. We observe that FAIL sig-nificantly outperforms GAIL on these two sparse reward tasks. For sparse reward, note that what really matters is to reach the target at the end, FAIL achieves that by matching expert’s state distribution and learner’s state distribution one by one at every time step till the end, while GAIL (without actions) loses the sense of ordering by focusing on the average state distributions.
The above experiments also indicates that FAIL in general can work well for shorter horizon (e.g., H = 50 for Reacher and Fetch), while shows much less improvement over GAIL on longer horizon task. We think this is because of the nature of FAIL which has to learn a sequence of time-dependent policies along the entire horizon H. Long horizon requires larger number of samples. While method like GAIL learns a single stationary policy with all training data, and hence is less sensitive to horizon increase. We leave extending FAIL to learning a single stationary policy as a future work.
Due to the possible non-zero inherent Bellman error, FAIL in general cannot guarantee to find the globally optimal policy. So the remaining question is that:
In ILFO, under a realizability assumption, does there exist an algorithm that can learn an optimal policy with sample complexity scales polynomially with respect to horizon, number of actions,
, and statistical complexities of function classes, with high probability?
While we are not able to answer this question in the model- free setting considered in this work, we study a model-based algorithm which we show can achieve the above sample complexity, though it is computationally not efficient.
Rather than starting with a realizable policy class , our model-based algorithm starts with a class of models P with
, i.e., the model class contains the true model P (realizability). Similarly, we assume we have a set of discriminators
. The above is the realizability assumption in the model-based setting.
Note that intuitively we use discriminators to approximate expert’s value functions. For any
Note that due to the assumption that and
, we must have
, i.e.,
is realizable. Each
induces a policy
. Denote a policy class
as follows:
Note that we have. Note that since
, we must have
, i.e., the policy class is realizable. Now we expand the discriminator classes as follows. We set
. We know that
is realizable, i.e.,
. Given a realizable
, we design
as follows:
Namely we explicitly expand by applying a potential Bellman operator (defined using a pair
) to a discriminator at
. Via the induction assumption, since
and
, we must have
by construction. Hence
is realizable. We can also recursively compute the size of
. Define
. Starting with
, we can show that
Model-based FAIL takes P and as inputs, and performs the two procedures shown in Figure 3.
In Figure 3, note that model-based FAIL does not require any real-world samples in the first step. Real-world samples are only required in the second step where we call FAIL. As we show that and
are realizable, and
1. Construct as shown in (8) and (9);
2. Run FAIL (Algorithm 2) with
Figure 3. Model-based FAIL
IBE is zero due to the construction of , we can simply invoke Theorem 3.3 here and reach the following corollary:
Corollary 8.1 (Model-based FAIL sample Complexity). Given a pair and
with
, and
for all
, the algorithm shown in Fig- ure 3 can learn an
near-optimal policy with number of samples and number of expert’s samples both scale in the order of poly
, with probability at least
The above corollary shows statistically, from a model-based perspective, realizability alone is sufficient to achieve the polynomial sample complexity. However, the above corollary does not show if realizability is sufficient for model-free methods. Also, though it is statistically efficient, the computational complexity of model-based FAIL is still questionable (the naive implementation in Figure 3 has computational complexity ). Investigating sufficient and necessary conditions for achieving polynomial sample and polynomial computation complexity under ILFO setting is an interesting future work.
We study Imitation Learning from Observation (ILFO) setting and propose an algorithm, Forward Adversarial Imitation Learning (FAIL), that achieves sample efficiency and computational efficiency. FAIL decomposes the sequential learning tasks into independent two-player min-max games of which is solved via general no-regret online learning. In addition to the algorithmic contribution, we present the first exponential gap in terms of sample complexity between ILFO and RL, demonstrating the potential benefit from expert’s observations. A key observation is that one should explicitly regularize the class of discriminators to achieve sample efficiency and design discriminators to decrease the inherent Bellman Error. Experimentally, while GAIL can be used to solve the ILFO problem by removing action inputs to the discriminators, FAIL works just as well in problems with dense reward. Our analysis of FAIL provides the first strong theoretical guarantee for solving ILFO, and FAIL significantly outperforms GAIL on sparse reward MDPs, which are common in practice.
WS is supported in part by Office of Naval Research contract N000141512365. WS thanks Nan Jiang and Akshay Krishnamurthy for valuable discussions. We thank the first anonymous reviewer for carefully reviewing the proofs.
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
Arora, S., Ge, R., Liang, Y., Ma, T., and Zhang, Y. Gener- alization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
Bagnell, J. A., Kakade, S. M., Schneider, J. G., and Ng, A. Y. Policy search by dynamic programming. In Advances in neural information processing systems, pp. 831–838, 2004.
Beygelzimer, A., Dani, V., Hayes, T., Langford, J., and Zadrozny, B. Error limiting reductions between classi-fication tasks. In Proceedings of the 22nd international conference on Machine learning, pp. 49–56. ACM, 2005.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Chang, K.-W., He, H., Daum´e III, H., and Langford, J. Learning to search for dependencies. arXiv preprint arXiv:1503.05615, 2015a.
Chang, K.-w., Krishnamurthy, A., Agarwal, A., Daume, H., and Langford, J. Learning to search better than your teacher. In ICML, 2015b.
Chen, J. and Jiang, N. Information-theoretic considera- tions in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.
Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.
Daum´e III, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine learning, 2009.
Edwards, A. D., Sahni, H., Schroeker, Y., and Isbell, C. L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
Fukumizu, K., Gretton, A., Lanckriet, G. R., Sch¨olkopf, B., and Sriperumbudur, B. K. Kernel choice and classifiabil-ity for rkhs embeddings of probability distributions. In Advances in neural information processing systems, pp. 1750–1758, 2009.
Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., DulacArnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. In NIPS, 2016.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low bellman rank are pac-learnable. arXiv preprint arXiv:1610.09512, 2016.
Kakade, S. and Langford, J. Approximately optimal approx- imate reinforcement learning. In ICML, 2002.
Kakade, S., Kearns, M. J., and Langford, J. Exploration in metric state spaces. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 306–312, 2003.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishnamurthy, A., Agarwal, A., and Langford, J. Pac rein- forcement learning with rich observations. In Advances in Neural Information Processing Systems, pp. 1840–1848, 2016.
Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.
Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125. IEEE, 2018.
Luxburg, U. v. and Bousquet, O. Distance-based classi- fication with lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.
M¨uller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29 (2):429–443, 1997.
M¨uller, K., Smola, A., and R¨atsch, G. Predicting time series with support vector machines. Artificial Neural Networks ICANN’9, 1327:999–1004, 1997. doi: 10. 1007/BFb0020283. URL http://link.springer. com/chapter/10.1007/BFb0020283.
Munos, R. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, pp. 1006. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.
Munos, R. and Szepesv´ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (May):815–857, 2008.
Nair, A., Chen, D., Agrawal, P., Isola, P., Abbeel, P., Malik, J., and Levine, S. Combining self-supervised learning and imitation for vision-based rope manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2146–2153. IEEE, 2017.
Pan, Y., Cheng, C.-A., Saigol, K., Lee, K., Yan, X., Theodorou, E., and Boots, B. Agile autonomous driving using end-to-end deep imitation learning. Proceedings of Robotics: Science and Systems. Pittsburgh, Pennsylvania, 2018.
Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.
Ross, S. and Bagnell, J. A. Efficient reductions for imitation learning. In AISTATS, pp. 661–668, 2010.
Ross, S. and Bagnell, J. A. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014.
Ross, S., Gordon, G. J., and Bagnell, J. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In ICML, pp. 1889–1897, 2015.
Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends Rin Machine Learning, 2012.
Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Sch¨olkopf, B., Lanckriet, G. R., et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
Sun, W., Venkatraman, A., Boots, B., and Bagnell, J. A. Learning to filter with predictive state inference machines. In ICML, 2016.
Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply aggrevated: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.
Sun, W., Bagnell, J. A., and Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018a.
Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018b.
Sutton, R. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
Syed, U. and Schapire, R. E. A reduction from appren- ticeship learning to classification. In Advances in neural information processing systems, pp. 2253–2261, 2010.
Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. AAAI, 2015.
Zinkevich, M. Online Convex Programming and General- ized Infinitesimal Gradient Ascent. In ICML, 2003.
A. Proof of Theorem 3.1
Before proving the theorem, we introduce some notations and useful lemmas.
Given a pair of and f, denote random variable
, recall the definition of the utility function
Denote . It is easy to verify that
. We also have
We can further bound the variance of
where we used the fact that , and the last inequality uses the fact that
is sampled from a uniform distribution over A. With that, we can apply Bernstein’s inequality to
together with a union bound over
F, we will have the following lemma:
Lemma A.1. Given dataset with
, and
with
, for any pair
, with probability at least
Let us define two loss functions for
For any , define
, where we overload the notation and denote D as the empirical distribution over the dataset D (i.e., put probability 1/|D| over each data point in D), and
as the marginal distribution over x, a. With this notation, we can see that
can be written as a linear functional with respect to
where is defined such that
. Under this definition of inner product, we have:
It is easy to verify that Algorithm 1 is running Best Response on loss and running FTRL on loss
the no-regret guarantee from FTRL, for
Denote as the minimizer and maximizer of Eqn 2, i.e.,
where is defined in (13).
Proof. Using the definition of and the no-regret property on
Since
We also have:
where the first inequality uses the definition of , the second inequality uses the fact that the maximum is larger than the average, and the last inequality uses the fact that
is the maximizer with respect to
Combining the above results, we have:
Denote
Now we are ready to prove Theorem 3.1
Define . Combine the above inequalities together, we have:
where the first equality uses the definition of , the second inequality uses Lemma A.2, the third inequality uses the fact that
are the min-max solution of (13), and the fifth inequality uses the fact that
is the maximizer of (14) given
. Hence, we prove the theorem.
B. Proof of Theorem 3.2
Lemma B.1. There exists a distribution , such that for any two datasets
and
are drawn i.i.d from D, as long as M = O(log(|X|)), then:
Proof. We simply set D to be a uniform distribution of X. Denote |X| = N, and M = O(log(N)). The probability of and
does not have any overlap samples can be easily computed as:
Note that the probability that does not have repeated samples can be computed as:
When
Now, conditioned on the event that does not contain repeated samples, we have:
Take , we know that
, hence we have:
Hence we prove the lemma by coming two results above.
We construct the MDP using the above lemma. The MDP has H = 2, two actions , and the initial distribution
assigns probability one to a unique state
. The expert policy
is designed to be
, i.e., the expert’s action at time step
. We split the state space X into half and half, denoted as
, such that
and
. We design the MDP’s dynamics such that
assigns probability 2/N to each state in
and assigns probability 0 to any other state in
. We design
such that it assigns probability 2/N to each state in
and zero to each state in
Denote as the states generated from the expert by executing
at
. For any two policies
and
, such that
as the dataset sampled from executing
many times, and
as the dataset sampled from executing
many times. From Lemma B.1, we know that
Hence asymptotically either will overlap with
C. Proof of Theorem 3.3
We first present some extra notations and useful lemmas below.
Lemma C.1 (Performance Difference Lemma (Kakade & Langford, 2002)). Consider a policy and
Note that under our setting, i.e., the cost function does not depend on actions, the above equation can be simplified to:
where we use Bellman equations, i.e.,
Note that for any h, we have:
where the first inequality comes from the triangle inequality, the second inequality comes from the fact that and in the third inequality, we denote
is introduced because
might not in
Now we are ready to prove the main theorem.
Proof of Theorem 3.3. We consider the h’th iteration. Let us denote and
as the observation distribution at time step h of following policies
starting from the initial distribution
. Denote
as the observation distribution of the expert policy
at time step h + 1 starting from the initial distribution
. Note that the dataset
is generated from distribution
. The data at D is generated i.i.d by first drawing sample
from
(i.e., executing
), and then sample action
, and then sample
from the real system
Mapping the above setup to the setup in Theorem 3.1, i.e., set
Algorithm 1 will output a policy such that with probability at least
Recall the definition of refined inherent Bellman Error with respect to
Denote
and
Now we upper bound as follows.
where the first inequality comes from the realizable assumption that , the second inequality comes from an application of triangle inequality, and the third inequality comes from the definition of
and the fact that
After learn is updated to
Define , we have for any h,
Using Performance Difference Lemma (Lemma C.1), we know that:
D. Proof of Proposition 4.1
We first show the construction of the MDP below. The MDP has horizon many states, and two actions {l, r} standing for go left and go right respectively. All states are organized in a perfect balanced binary tree, with
many leafs at level h = H, and the first level h = 1 contains only a root. The transition is deterministic such that at any internal state, taking action l leads to the state’s left child, and taking action r leads to the state’s right child. Each internal node has cost zero, and all leafs will have nonzero cost which we will specify later. Note that in such MDP, any sequence of actions
deterministically leads to one and only one leaf, and the total cost of the sequence of actions is only revealed once the leaf is reached.
The first part of the proposition is proved by reducing the problem to Best-arm identification in multi-armed bandit (MAB) setting. We use the following lower bound of best-arm identification in MAB from (Krishnamurthy et al., 2016):
Lemma D.1 (Lower bound for best arm identification in stochastic bandits from (Krishnamurthy et al., 2016)). For any , and any best-arm identification algorithm, there exists a multi-armed bandit problem for which the best arm
better than all others, but for which the estimate
of the best arm must have
unless the number of samples collected T is at least
Given any MAB problem with K arms, without loss of generality, let us assume MAB problem can be reduced to the above constructed binary tree MDP with horizon
leafs. Each arm in the original MAB will be encoded by a unique sequence of actions
, and its corresponding leaf. We assign each leaf the cost distribution of the corresponding arm. The optimal policy in the MDP, i.e., the sequence of actions leading to the leaf that has the smallest expected cost, is one-to-one corresponding to the best arm, i.e., the arm that has the smallest expected cost in the MAB. Hence, without any further information about the MDP, any RL algorithm that aims to find the near-optimal policy must suffer the lower bound presented in Lemma D.1, as otherwise one can solve the original MAB by first converting the MAB to the MDP and then running an RL algorithm. Hence, we prove the first part of the proposition.
For the second part, let us denote the sequence of the observations from the expert policy as , i.e., the sequence of states corresponding to the optimal sequence of actions where the last state
has the smallest expected cost. We design an IL algorithm as follows.
We initialize a sequence of actions . At every level h, staring at h = 1, we try any sequence of actions with prefix
means we append action a to end of the sequence a), record the observed observation
; we then reset and try any sequence of actions with prefix
, and record the observed observation
, then we append l to a, i.e.,
, otherwise, we append r, i.e.,
. We continue the above procedure until
, and we output the final action sequence a.
Due to the deterministic transition, by induction, it is easy to verify that the outputted sequence of actions a is exactly equal to the optimal sequence of actions executed by the expert policy. Note that in each level h, we only generate two trajectories from the MDP. Hence the total number trajectories before finding the optimal sequence of actions is at most . Hence we prove the proposition.
Let us denote a set such that
, and
. Denote
. We formulate the following LP with 2N variables and
many constraints:
Denote the solution of the above LP as . We will have the following claim:
Claim E.1 (LP Oracle). Given F in (5), , and
, denote
as the solution of the LP from (20), we have:
Proof of Claim E.1. Given the solutions , we first are going to construct a function
, such that for any
we have
, and
. Denote
. Note that
. The function
is constructed
as:
First of all, we show that for any
where the first inequality uses the definition of . Also we know that
. Hence we have that for
,
Now we need to prove that is L-Lipschitz continuous. Note that we just need to prove that
is L-Lipschitz continuous, since for any L-Lipschitz continuous function
be L-Lipschitz continuous as well.
Consider any two points
where for the first inequality we used the definition of , and the second inequality uses the triangle inequality. Similarly, one can show that
Combine the above two inequalities and the fact that , we conclude that
-Lipschitiz continuous.
Now we have constructed such that
for all
and
. Now suppose that there exists a function
, such that
, then we must have for some
. However, since
, we must have that
satisfies all constrains in the LP in (20). Hence the assumption that
contradicts to the fact that
is the maximum solution of the LP formulation in (20). Hence we prove the claim.
F. Proof of Corollary 5.1
Since in this setting, for all
contains infinitely many functions, we need to discretize
before we can apply the proof techniques from the proof of Theorem 3.3. We use covering number.
Denote as the
-cover of the metric space (X, D). Namely, for any
, there exists a
such that
. Consider any function class
with
. Below we construct the
-cover over F.
For any with the i-th element
being the function value
measured at the i-th element
the
. Let us denote the set
Claim F.1. With the above set up, for -cover, i.e.,
Proof. By definition, we know that for any , we have that there exists a
such that
. Now
consider is its closest point in
where the last inequality comes from the fact that are L-Lipschitz continuous,
and
. Hence, we just identify a subset of F such that it forms a
under norm
Note that -cover with
which is a subset of
-dimension space. By standard discretization along each dimension, we prove the claim.
From the above claim, setting up properly, we have:
Extending the analysis of Theorem 3.3 simply results extending the concentration result in Lemma A.1. Specifically via Bernstein’s inequality and a union bound over , we have that for any
with probability at least
Now using the fact that is an
-cover under norm
, we have that for any
, with probability at least
The rest of the proof is the same as the proof of Theorem 3.3.
Recall that with being fixed, we denote
as resulting observation distribution resulting at time step h. The interactiveness comes from the ability we can query expert to generate next observation conditioned on states sampled from
—the states that would be visited by learner at time step h. Let us define
Note that different from , in
, the marginal distributions on x are the same for both
and
and we directly access
to generate epxert observations at h + 1 rather than thorough the expert observation distribution
. In other words, we use IPM to compare the observation distribution at time step h + 1 after applying
and the observation distribution at time step h + 1 after applying
, conditioned on the distribution
generated by the previously learned policies
. In Algorithm 5, at every time step h, to find a policy
that approximately minimizes
, we replace expectations in
by proper samples (line ?? and line ??), and then call Algorithm 1 (line ??).
Proof. Recall the definition of
with being the distribution over
resulting from executing policies
We will use Lemma A.1, Lemma A.2, and Lemma C.1 below.
The Performance Difference Lemma (Lemma C.1) tells us that:
where the last inequality comes from the realizable assumption that
At every time step h, mapping to Theorem 3.1 with ,
, we have that with probability at least
Note that , since
by the realizable assumption. Hence, we have that:
Hence, using (21), and a union bound over all time steps , we have that with probability at least
with . Since in every round h, we need to draw N many trajectories, hence, the total number of trajectories we need is at most
Our theoretical results presented so far rely on the realizable assumption (Assumption 2.1). While equipped with recent advances in powerful non-linear function approximators (e.g., deep neural networks), readability can be ensured, in this section, we relax the realizable assumption and show that our algorithms’ performance only degenerates mildly. We relax Assumption 2.1 as follows:
Assumption H.1 (Approximate Realizability). We assume and F is approximate realizable in a sense that for any
The above assumption does not require F and to contain the exact
and
, but assumes F and
are rich enough to contain functions that can approximate
and
uniformly well (i.e.,
and
are small). Without any further modification of Algorithm 2 and Algorithm 5 for non-interactive and interactive setting, we have the following corollary.
Corollary H.2. Under Assumption H.1, for with probability at least
, (1) for non-interactive setting, FAIL (Algorithm 2) outputs a policy
with
, and (2) for interactive setting, IFAIL (Algorithm 5) outputs a policy
with
, by using at most
many trajectories under both settings.
The proof is deferred to Appendix H.1.
H.1. Proof of Corollary H.2
Proof of Corollary H.2. For any
Below we prove the first bullet in Corollary H.2, i.e., the results for non-interactive setting.
Non-Interactive Setting Using PDL (Lemma C.1), we have
Now repeat the same recursive analysis for as we did in proof of Theorem 3.3 in Appendix C, we can prove the first bullet in the corollary.
Now we prove the second bullet in Corollary H.2, i.e., the results for interactive setting.
Interactive Setting Again, using Performance Difference Lemma (Lemma C.1), we have
Now repeat the same steps from the proof of Theorem 5.2 after (21) in proof of Theorem 5.2, we can prove the second bullet in the corollary.
We consider the bisimulation model from (6). The following proposition summarizes the conclusion in this section.
Proposition I.1. Assume Bisimulation holds (Eq. 6) and set [H] to be piece-wise constant functions over the partitions induced from
1. is a piece-wise constant function for all
2.
3. can be solved by LP, for all
4. given any , the Rademacher complexity of
is in the order of
, i.e.,
being a Rademacher number.
The above proposition states that by leveraging the abstraction, we can design discriminators to be piece-wise constant functions over the partitions induced by , such that inherent Bellman error is zero, and the discriminator class has bounded statistical complexity. Below we prove the above proposition. The first two points in the above proposition were studied in (Chen & Jiang, 2019). For completeness, we simply prove all four points below.
Piece-wise constant First, we show that
is piece-wise constant over the partitions induced from
. Starting from H, via (6), we know that
, which is piece-wise constant over the partitions induced from
. Then let us assume that for
. At time step h, via Bellman equation, we know:
Hence for any two
where the second and the last equality use (6). In the third equality above, we abuse the notation for
to denote the value of
Inherent Bellman Error With , we can show
as follows. For any
where again we abuse the notation f(s) to denote that value f(x) for any x such that . Namely,
is also a piece-wise constant over the partitions induced from
, it is also easy to see that
have
Reduction to LP Regarding evaluating , we can again reduce it an LP.
Denote , where the i-th entry in
corresponds to the i-th element in S. We denote
as the entry in
that corresponds to the state
, and compute
is the number of points mapped to
and compute
. We solve the following LP:
Denote the solution of the above LP as
Complexity of Discriminators Regarding the complexity of
, note that
essentially corresponds to a |S|-dim box:
. Again, consider a dataset
and the counts
, and Rademacher numbers
we have
Now, we can show that the Rademacher complexity of is bounded as follows:
When we design the utility in (3), we sample actions from U(A) and then perform importance weighting. This ensures that in analysis the variance will be bounded by K. In practice, we can use any reference policy to generate actions, and then perform importance weighting accordingly. Assume that we have a dataset and the expert’s dataset
is the probability of action
being chosen at
. We can form the utility as follows:
As long as the probability of choosing any action at any state is lower bounded, then the variance of the above estimator is upper bounded. This formulation also immediately extends FAIL to continuous action space setting. For a parameterized policy , given any f, we can compute
easily. If
(i.e., on-policy samples), then for any fixed f, the policy gradient
can be estimated using the REINFORCE trick:
With the form of , we can perform the min-max optimization in Alg. 1 by iteratively finding the maximizer
using LP oracle, and then perform gradient descent update
. See Algorithm 4 below. Note that in Algorithm 4 the dataset
which is the probability of
being chosen at
. We can integrate Algorithm 4 into the forward training framework.
In Algorithm 2, at every time step h, we execute the current sequence of policies to collect samples at time step h, i.e.,
. We then throw away all generated samples
except
. While this simplifies the analysis, in practice, we could leverage these samples
as well, especially now we can form the utility with on-policy samples and compute the corresponding policy gradient ((23)). This leads us to Alg. 5. Namely, in Algorithm 5, when training
, we also incrementally update
using their on-policy samples (Line 11 Algorithm 5).
Figure 4. Performance of expert, FAILAlgorithm 5), FAIL(Algorithm 2), and GAIL (without actions) on three control tasks. For the top line, we fix the number of training samples while varying the number of expert demonstrations (6, 12, 25). For the bottom line, we fix the number of expert demonstrations, while varying the number of training samples. All results are averaged over 10 random seeds.
We test Algorithm 5 on the same set of environments (Figure 4) under 10 random rand seeds, with all default parameters. We observe that FAILcan be more sample efficient especially in small data setting (e.g., 0.25 million training samples). Implementation of Algorithm 5 and scripts for reproducing results can be found in https://github.com/wensun/ Imitation-Learning-from-Observation/tree/fail_star.