oIRL: Robust Adversarial Inverse Reinforcement Learning with Temporally Extended Actions

Explicit engineering of reward functions for given environments has been a major hindrance to reinforcement learning methods. While Inverse Reinforcement Learning (IRL) is a solution to recover reward functions from demonstrations only, these learned rewards are generally heavily entangled with the dynamics of the environment and therefore not portable or robust to changing environments. Modern adversarial methods have yielded some success in reducing reward entanglement in the IRL setting. In this work, we leverage one such method, Adversarial Inverse Reinforcement Learning (AIRL), to propose an algorithm that learns hierarchical disentangled rewards with a policy over options. We show that this method has the ability to learn generalizable policies and reward functions in complex transfer learning tasks, while yielding results in continuous control benchmarks that are comparable to those of the state-of-the-art methods.

Reinforcement learning (RL) has been able to learn policies in complex environments but it usually requires designing suitable reward functions for successful learning. This can be difficult and may lead to learning sub-optimal policies with unsafe behavior (Amodei et al., 2016) in the case of poor engineering. Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000; Abbeel & Ng, 2004) can facilitate such reward engineering through learning an expert’s reward function from expert demonstrations.

IRL, however, comes with many difficulties and the problem is not well-defined because, for a given set of demonstrations, the number of optimal policies and corresponding rewards can be very large, especially for high dimensional complex tasks. Also, many IRL algorithms learn reward functions that are heavily shaped by environmental dynamics. Policies learned on such reward functions may not remain optimal with slight changes in the environment. Adversarial Inverse Reinforcement Learning (AIRL) (Fu et al., 2018) generates more generalizable policies with disentangled reward functions that are invariant to the environmental dynamics. The reward and value function are learned simultaneously to compute the reward function in a state-only manner. This is an instance of a transfer learning problem with changing dynamics, where the agent learns an optimal policy in one environment and then transfers it to an environment with different environmental dynamics. A practical example of this transfer learning problem would be teaching a robot to walk with some mechanical structure, and then generalize this knowledge to perform the task with different sized structural components.

Other methods have been developed to learn demonstrations in environments with differing dynamics and then exploit this knowledge while performing tasks in a novel environment. As complex tasks in different environments often come from several reward functions, methods such as Maximum Entropy IRL and GAN-Guided Cost Learning (GAN-GCL) tend to overgeneralize (Finn et al., 2016b). One way to help solve the problem of over-fitting is to break down a policy into small option (temporally extended action)-policies that solve various aspects of an overall task. This method has been shown to create a policy that is more generalizable (Sutton et al., 1999; Taylor & Stone, 2009). Methods such as Option-Critic have implemented modern RL architectures with a policy over options and have shown improvements for generalization of policies (Bacon et al., 2017). OptionGAN (Henderson et al., 2018) also proposed an IRL framework for a policy over options and is shown to have some improvement in one-shot transfer learning tasks, but does not return disentangled rewards.

In this paper, we introduce Option-Inverse Reinforcement Learning (oIRL), to investigate transfer learning with options. Following AIRL, we propose an algorithm that com- putes disentangled rewards to learn joint reward-policy options, with each option-policy having rewards that are disentangled from environmental dynamics. These policies are shown to be heavily portable in transfer learning tasks with differences in environments. We evaluate this method in a variety of continuous control tasks in the Open AI Gym environment using the MuJoCo simulator (Brockman et al., 2016; Todorov et al.) and GridWorlds with MiniGrid (Chevalier-Boisvert et al., 2018). Our method shows improvements in terms of reward on a variety of transfer learning tasks while still performing better than benchmarks for standard continuous control tasks.

Markov Decision Processes (MDP) are defined by a tuple ⟨S, A, P, R, γ⟩where S is a set of states, A is the set of actions available to the agent, P is the transition kernel giving a probability over next states given the current state and action,  R : S × A → [0, Rmax]is a reward function and γ ∈ [0, 1]is a discount factor.  stand  atare respectively the state and action of the expert at time instant t. We de-fine a policy  πas the probability distribution over actions conditioned on the current state;  π : S × A → [0, 1]. A pol-icy is modeled by a Gaussian distribution  πθ ∼ N(µ, σ2)where  θis the policy parameters. The value of a policy is defined as  Vπ(s) = Eπ[�∞t=0 γtrt+1|s], where E denotes the expectation. An agent follows a policy  πand receives reward from the environment. A state-action value function is  Qπ(s, a) = Eπ[�∞t=0 γtrt+1|s, a]. The advantage is Aπ(s, a) = Qπ(s, a)−Vπ(s). r(s, a)represents a one-step reward.

Options (ω ∈ Ω) are defined as a triplet (Iω, πω, βω), whereπωis a policy over options,  Iω ∈ Sis the initiation set of states and  βω : S → [0, 1]is the termination function. The policy over options is defined by  πΩ. An option has a reward rωand an option policy  πω.

The policy over options is parameterized by  ζ, the intraoption policies by  αfor each option, the reward approximator by  θ, and the option termination probabilities by δ.

In the one-step case, selecting an option using the policy-over-options can be viewed as a mixture of completely specialized experts. This overall policy can be defined as πΘ(a|s) = �ω∈Ω πΩ(ω|s)πω(a|s).

Disentangled Rewards are formally defined as a reward function  r∗θ(s, a, s′)that is disentangled with respect to (w.r.t.) a ground-truth reward and a set of environmental dynamics T such that under all possible dynamics  T ∈ T, the optimal policy computed w.r.t. the reward function is the same.

Generative Adversarial Networks (GANs) learn the generator distribution  pgand discriminator  DθD(x). They use a prior distribution over input noise variables p(z). Given these input noise variables the mapping  Gθg(z)is learned, which maps these input noise variables to the data set space. G is a neural network. Another neural network,  DθD(x), learns to estimate the probability that x came from the data set and not the generator  pg.

In our two-player adversarial training procedure, D is trained to maximize the probability of assigning the correct labels to the dataset and the generated samples. G is trained to minimize the objective log(1 − DθD(GθG(z))), which causes it to generate samples that are more likely to fool the discriminator.

Policy Gradient methods optimize a parameterized policy πθusing a gradient ascent. Given a discounting term, the objective to be optimized is  p(θ, s0) = E[�∞t=0 γtrθ(st)|s0]. Proximal policy optimization (PPO) (Schulman et al., 2017) is a policy gradient method that uses policy gradient theorem, which states ∂p(θ,s0)∂θ = �s�∞t=0 γtP(st =s|s0) �a∂π(a|s)∂θ Qπθ(s, a). PPO has been adapted for the option-critic architecture (PPOC) (Klissarov et al., 2017).

Inverse Reinforcement Learning (IRL) is a form of imitation learning 1, where the expert’s reward is estimated from demonstrations and then forward RL is applied to that estimated reward to find the optimal policy. Generative Adversarial Imitation Learning (GAIL) directly extracts optimal policies from expert’s demonstrations (Ho & Ermon, 2016). IRL infers a reward function from expert demonstrations, which is then used to optimize a generator policy.

In IRL, an agent observes a set of state-action trajectories from an expert demonstrator D. We let  TD ={τ E1 , τ E2 , . . . , τ En }be the state-action trajectories of the ex- pert,  τ Ei ∼ τD where τ Ei = {s0, a0, s1, a1 . . . , sk, ak}. Wewish to find the reward function r(s, a) given the set of demonstrations  TD. It is assumed that the demonstrations are drawn from the optimal policy  π∗(a|s). The Maximum Likelihood Estimation (MLE) objective in the IRL problems is therefore:


with pθ(τ) ∝ p(s0) �Tt=0 p(st+1|st, at) exp (γtrθ(st, at)).

Adversarial Inverse Reinforcement Learning (AIRL) is based on GAN-Guided Cost Learning (Finn et al., 2016a), which casts the MLE objective as a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) optimization problem over trajectories. In AIRL (Fu et al., 2018), the discriminator probability  Dθis evaluated using the state-action pairs from the generator (agent), as given by


The agent tries to maximize  R(s, a) = log(1 − Dθ(s, a)) −log(Dθ(s, a)) where fθ(s, a)is a learned function and  π ispre-computed. This formula is similar to GAIL but with a recoverable reward function since GAIL outputs 0.5 for the reward of all states and actions at optimality. The discriminator function is then formulated as  fθ,Φ(s, a, s′) =gθ(s, a) + γhΦ(s′) − hΦ(s)given shaping function  hΦ andreward approximator  gθ. Under deterministic dynamics, it is shown in AIRL that there is a state-only reward approximator (  f ∗(s, a, s′) = r∗(s)+γV ∗(s′)−V ∗(s) = A∗(s, a)where the reward is invariant to transition dynamics and is disentangled.

Hierarchical Inverse Reinforcement Learning learns policies with high level temporally extended actions using IRL. OptionGAN (Henderson et al., 2018) provides an adversarial IRL objective function for the discriminator with a policy over options. It is formulated such that  Lregdefines the regularization terms on the mixture of experts so that they converge to options. The discriminator objective in OptionGAN takes state-only input and is formulated as:




In Directed-Info GAIL (Sharma et al., 2019) implements GAIL in a policy over options framework.

Work such as (Krishnan et al., 2016) solves this hierarchical problem of segmenting expert demonstration transitions by analyzing the changes in local linearity w.r.t a kernel function. It has been suggested that decomposing the re- ward function is not enough (Henderson et al., 2018). Other works have learned the latent dimension along with the policy for this task (Hausman et al., 2017; Wang et al., 2017). In this formulation, the latent structure is encoded in an unsupervised manner so that the desired latent variable does not need to be provided. This work parallels many hierarchical IRL methods but with recoverable robust rewards.

Let  (s0, a0, . . . sT , aT ) ∈ τ Eibe an expert trajectory of state-action pairs. Denote by  (s0, a0, ω0 . . . sT , aT , ωT ) ∈τπΘ,ta novice trajectory generated by policy over options πΘ,tof the generator at iteration t.

Given a trajectory of state-action pairs, we first define an option transition probability given a state and an option. Similar transition probabilities given state, action or option information are defined in (Appendix A).

P(st+1, ωt+1 | st, ωt)


We can similarly define a discounted return recursively. Consider the policy over options based on the probabilities of terminating or continuing the option policies given a reward approximator  ˆrθ(s, a)for the state-action reward.


ω0is selected according to  πζ,Ω(ω|s). The expressions for all relevant discounted returns appearing in the analysis are given in Appendix B. A suitable parameterization of the discounted return R can be found by maximizing the causal entropy  Eτ∼D[log(pθ(τ))]w.r.t parameter  θ. We then have for a trajectory  τ with Ttime-steps:


4.1. MLE Derivative

Similar to (Fu et al., 2018) and (Finn et al., 2016a), we define the MLE objective for the generator  pθ as


Note that we may or may not know the option trajectories in our expert demonstrations, instead they are estimated according to the policy over options. The gradient of (7) w.r.t  θ(See Appendix B for detailed derivations) is given by:


We define  pθ,t(st, at) =�st′ ̸=t,at′ ̸=t pθ(τ)dst′dat′as the state action marginal at time t. This allows us to examine the trajectory from step t as defined similarly in (Fu et al., 2018). Consequently, we have


Since  pθis difficult to draw samples from, we estimate it using importance sampling distribution over the generator density. Then, we compute an importance sampling estimate of a mixture policy  µt,w(τ)for each option w as follows.

We sample a mixture policy  µω(a|s) defined as 12πω(a|s) +

2 ˆpω(a|s)and  ˆpω(a|s)is a density estimate trained on the demonstrations. We wish to minimize  DKL(πw(τ)|pω(τ))to reduce the importance sampling distribution variance, where  DKL is theKullback–Leibler divergence metric (Kull- back & Leibler, 1951) between two probability distributions. Applying the aforementioned density estimates in (8), we can express the gradient of the MLE objective J follows:




In this section we formulate the discriminator, parameterized by  θ, as the odds ratio between the policy and the exponentiated reward distribution for option  ω. We have a discriminator  Dθ,ωfor each option  ωand a sample generator option policy  πw, defined as follows:


5.1. Recursive Loss Formulation

The discriminator  Dθ,ωis trained by minimizing the cross-entropy loss between expert demonstrations and generated examples assuming we have the same number of options in the generated and expert trajectories. We define the loss function  Lθas follows:


The parameterized total loss for the entire trajectory, Lθ,α,δ(s, a, ω), can be expressed recursively as follows by taking expectations over the next options and states:


The agent wishes to minimize  Lθ,α,δto find its optimal policy.

5.2. Optimization Criteria

For a given option  ω, define the reward function ˆRθ,δ(s, ω, a), which is to be maximised. We then write a negative discriminator loss (−LD) to turn our loss minimization problem into a maximization problem, as follows:


We set a mixture of experts and novice as  ¯µobservations in our gradient. We then wish to take the derivative of the inverse discriminator loss as,


We can multiply the top and bottom of the fraction in the mixture expectation by the state marginal  πω(st) =�a∈A πω(st, at). This allows us to write  ˆpθ,t,ω(st, at) =exp(Lθ,δ(st, ω, at))πω,t(st). Using this, we can derive an importance sampling distribution in our loss,


The gradient of this parametrized reward function corresponds to the inverse of the discriminator’s objective:


See Appendix C for the detailed derivations of the terms appearing in (20). Substituting (20) into (19) one can see that (9) (derivative of MLE objective) and (10) are of the same form as of (19) (derivative of the discriminator objective and (20)).

In this section, we provide our main algorithm for learning robust rewards with options. Similar to AIRL, we implement our algorithm with a discriminator update that considers the rollouts of a policy over options. We perform this update with  (s, a, s′)triplets and a discriminator function in the form of  fθ,ω(s, a, s′)as given in (21). This allows us to formulate the discriminator with state-only rewards in terms of option-value function estimates. We can then compute an option-advantage estimate. Since the reward function only requires state, we learn a reward function and corresponding policy that is disentangled from the environmental transition dynamics.

fω,θ(s, a, s′) = ˆrω,θ(s) + γ ˆVΩ(s′) − ˆVΩ(s) = ˆA(s, a, ω)

(21) Where Q(s, ω) = �a∈A πω,α(a|s)[rω,θ(s, a) +γ �s′∈S P(s′|s, a)((1 − βδ,ω(s′))Q(s′, ω) +βδ,ω(s′)VΩ(s′))] and VΩ(s) = �ω∈Ω πΩ,ζ(ω|s)Q(s, ω).

Our discriminator model must learn a parameterization of the reward function and the value function for each option, given the total loss function in (37). These parameterized models are learned with a multi-layer perceptron. For each option, the termination functions  βω,δand option-policies πω,αare learned using PPOC.

6.1. The main algorithm: Option-Adversarial Inverse Reinforcement Learning (oIRL)

Our main algorithm, oIRL, is given by Algorithm 1. Here, we iteratively train a discriminator from expert and novice sampled trajectories using the derived discriminator objective. This allows us to obtain reward function estimates for each option. We then use any policy optimization method for a policy over options given these estimated rewards.

We can also have discriminator input of state-only format as described in (21). It is important to note that in our recursive loss, we recursively simulate a trajectory to compute the loss a finite number of times (and return if the state is terminal). We show the adversarial architecture of this algorithm in Appendix D.

In this section we explain the gist of the analysis of con- vergence of oIRL. The detailed proofs can be found in Appendix E and F.

We first show that the actual reward function is recovered (up to a constant) by the reward estimators. We show that for each option’s reward estimator  gθ,ω(s), we have  g∗ω(s) =r∗(s) + cω, where  cωis a finite constant. Using the fact that  gθ,ω(s) → g∗ω(s) = r∗(s) + cω, and by using Cauchy- Schwarz inequality of sup-norm, we prove that the update of the TD-error is a contraction, i.e.,


In order to prove asymptotic convergence to the optimal option-value  Q∗, we show using the contraction argument that  gθ,ω(s) + γQ(s′, ω)converges to  Q∗by establishing the following inequality:


oIRL learns disentangled reward functions for each option policy, which facilitates policy generalizability and is instrumental in transfer learning.

Transfer learning can be described as using information learned by solving one problem and then applying it to a different but related problem. In the RL sense, it means taking a policy trained on one environment and then using the policy to solve a similar task in a different previously unseen environment.

We run experiments in different environments to address the following questions:

Does learning a policy over options with the AIRL framework improve policy generalization and reward robustness in transfer learning tasks where the environmental dynamics are manipulated?

Can the policy over options framework match or exceed benchmarks for imitation learning on complex continuous control tasks?


To answer these questions, we compare our model against AIRL (the current state of the art for transfer learning) in a transfer task by learning in an ant environment and modifying the physical structure of the ant and compare our method on various benchmark IRL continuous control tasks. We wish to see if learning disentangled rewards for sub-tasks through the options framework is more portable.

We train a policy using each of the baseline methods and our method on these expert demonstrations for 500 time steps on the gait environments and 500 time steps on the hierarchical ones. Then we take the trained policy (the parameterized distribution) and use this policy on the transfer environments and observe the reward obtained. Such a method of transferring the policy is called a direct policy transfer.

8.1. Gait Transfer Learning Environments

For the transfer learning tasks we use Transfer Environments for MuJoCo (Chu & Arnold, 2018), a set of gym environments for studying potential improvements in transfer learning tasks. The task involves an Ant as an agent which optimizes a gait to crawl sideways across the landscape. The expert demonstrations are obtained from the optimal policy in the basic Ant environment. We disable the agent ant in two ways for two transfer learning tasks. In BigAnt tasks, the length of all legs is doubled, no extra joints are added though. The Amputated Ant task modifies the agent by shortening a single leg to disable it. These transfer tasks require the learning of a true disentangled reward of walking sideways instead of directly imitating and learning the reward specific to the gait movements. These


Table 1. The mean reward obtained (higher is better) over 100 runs for the Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.


manipulations are shown in Figure 1.


Figure 1. MuJoCo Ant Gait transfer learning task environments. When the ant is disabled, it must position itself correctly to crawl forward. This requires a different initial policy than the original environment where the ant must only crawl sideways.

Table 1 shows the results in terms of reward achieved for the ant gait transfer tasks. As we can see, in both experiments our algorithm performs better than AIRL. Remark that the ground truth is obtained with PPO after 2 million iterations (therefore much less sample efficient than IRL).

8.2. Maze Transfer Learning Tasks

We also create transfer learning environments in a 2D Maze environment with lava blockades. The goal of the agent is to go through the opening in a row of lava cells and reach a goal on the other end. For the transfer learning task, we train the agent on an environment where the "crossing" path requires the agent to go through the middle for (LavaCrossing-M) and then the policy is directly transferred and used on a GridWorld of the same size where the crossing is on the right end of the room (LavaCrossing-R). An additional task would be changing a blockade in a Maze (FlowerMaze-(R,T)). The two environments are shown in Figure 2. We can think of two sub-tasks in this environment, going to the lava crossing and then going to the goal.

In all of these environments, the rewards are sparse. The agent receives a non-zero reward only after completing the mission, and the magnitude of the reward is  1−0.9·n/nmax,where n is the length of the successful episode and  nmaxis the maximum number of steps that we allowed for completing the episode, different for each mission.


Figure 2. The MiniGrid transfer learning task set 1. Here the policy is trained on (a or c) using our method and the baseline methods and then transferred to be used on environment (b or d). The green cell is the goal.

We show the mean reward after 10 runs using the direct policy transfers on the environments in Table 2. The 4 option oIRL achieved the highest reward on the LavaCrossing tasks. The FlowerMaze task was quite difficult with most algorithms obtaining very low reward. Options still result in a large improvement.

Table 2. The mean reward obtained (higher is better) over 10 runs for the Maze transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.


8.3. Hierarchical Transfer Learning Tasks

In addition, we adopt more complex hierarchical environments that require both locomotion and object interaction. In the first environment, the ant must interact with a large movable block. This is called the Ant-Push environment (Duan et al., 2016). To reach the goal, the ant must complete two successive processes: first, it must move to the left of the block and then push the block right, which clears the path towards the target location. There is a maximum of 500 timesteps. These can be thought of as hierarchical tasks with pushing to the left, pushing to the right and going to the goal as sub-goals.

We also utilize an Ant-Maze environment (Florensa et al., 2017) where we have a simple maze with a goal at the end. The agent receives a reward of +1 if it reaches the goal and 0 elsewhere. The ant must learn to make two turns in the maze, the first is down the hallway for one step and then a turn towards the goal. Again, we see hierarchical behavior in this task: we can think of sub-goals consisting of learning to exit the first hall of the maze, then making the turn and finally going down the final hall towards the goal. The two complex environments are shown in Figure 3.


Figure 3. MuJoCo Ant Complex Gait transfer learning task environments. We perform these transfer learning tasks with the Big Ant and the Amputated Ant.

Table 3 shows that oIRL performs better than AIRL in all of the complex hierarchical transfer tasks. In some tasks such as the Maze environment, AIRL fails to have any or very few successful runs while our method achieves reasonably high reward. In the BigAnt push task, AIRL achieves only very minimal reward where oIRL succeeds to perform the task in some cases.

Table 3. The mean reward obtained (higher is better) over 100 runs for the MuJoCo Ant Complex Gait transfer learning tasks. We also show the results of PPO optimizing the ground truth reward.


Figure 4. MuJoCo Continuous control locomotion tasks showing the mean reward (higher is better) achieved over 500 iterations of the benchmark algorithms for 10 random seeds. The shaded area represents the standard deviation.

8.4. MuJoCo Continuous Control Benchmarks

We also test our algorithm on a number of robotic continuous control benchmark tasks. These tasks do not involve transfer.

We show the plots of the average reward for each iteration during training in Figure 4. Achieving a higher reward in fewer iterations is better for these experiments. We examine the Ant, the Half Cheetah, the and Walker MuJoCo gait/locomotion tasks. We run these experiments with 10 random seeds. The results are quite similar between the benchmarks. Using a policy over options shows reasonable improvements in each task.

This work presents Option-Inverse Reinforcement Learning (oIRL), the first hierarchical IRL algorithm with disentangled rewards. We validate oIRL on a wide variety of tasks, including transfer learning tasks, locomotion tasks, complex hierarchical transfer RL environments and GridWorld transfer navigation tasks and compare our results with the state-of-the-art algorithm. Combining options with a disentangled IRL framework results in highly portable policies. Our empirical studies show clear and significant improvements for transfer learning. The algorithm is also shown to perform well in continuous control benchmark tasks.

For future work, we wish to test other sampling methods (e.g., Markov-chain Monte Carlo) to estimate the implicit discriminator-generator pair’s distribution in our GAN, such as Metropolis-Hastings GAN (Turner et al., 2019). We also wish to investigate methods to reduce the computational complexity for the step of computing the recursive loss function, which requires simulating some short trajectories, lowering the variance. Analyzing our algorithm using physical robotic tests for tasks that require multiple sub-tasks would be an interesting future course of research.

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. in ICML, 2004.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul- man, J., and Mané, D. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.

Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. in AAAI, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalis- tic gridworld environment for openai gym. https:// github.com/maximecb/gym-minigrid, 2018.

Chu, E. and Arnold, S. Transfer environments for mujoco. GitHub, 2018. URL https://github.com/ seba-1511/shapechanger.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control, 2016.

Finn, C., Christiano, P., Abbeel, P., and Levine, S. A con- nection between generative adversarial networks, inverse reinforcement learning, and energy-based models. in NeurIPS, 2016a.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. in ICML, 2016b.

Florensa, C., Held, D., Wulfmeier, M., Zhang, M., and Abbeel, P. Reverse curriculum generation for reinforcement learning. in CoRL, 2017.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. in ICLR, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NeurIPS, 2014.

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G., and Lim, J. J. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. in NeurIPS, 2017.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. in AAAI, 2018.

Ho, J. and Ermon, S. Generative adversarial imitation learn- ing. in NeurIPS, 2016.

Klissarov, M., Bacon, P., Harb, J., and Precup, D. Learnings options end-to-end for continuous action tasks. CoRR, abs/1712.00004, 2017.

Krishnan, S., Garg, A., Liaw, R., Miller, L., Pokorny, F. T., and Goldberg, K. Y. Hirl: Hierarchical inverse reinforcement learning for long-horizon tasks with delayed rewards. ArXiv, abs/1604.06508, 2016.

Kullback, S. and Leibler, R. A. On information and suf- ficiency. The Annals of Mathematical Statistics, 22(1): 79–86, 1951. ISSN 00034851.

Ng, A. Y. and Russell, S. Algorithms for inverse reinforce- ment learning. in ICML, 2000.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharma, M., Sharma, A., Rhinehart, N., and Kitani, K. M. Directed-info GAIL: Learning hierarchical policies from unsegmented demonstrations using directed information. in ICLR, 2019.

Sutton, R., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

Taylor, M. E. and Stone, P. Transfer learning for reinforce- ment learning domains: A survey. J. Mach. Learn. Res., 2009.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

Turner, R., Hung, J., Frank, E., Saatchi, Y., and Yosinski, J. Metropolis-hastings generative adversarial networks. ICML, 2019.

Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., and Heess, N. Robust imitation of diverse behaviors. in NeurIPS, 2017.

It is useful to redefine transition probabilities in terms of options. Since at each step we have a additional consideration, we can continue following the policy of the current option we are in or terminate the option with some probability, sample a new option and follow that option’s policy from a stochastic policy dependent on states. We have


We can define a discounted return recursively for a policy over options in a similar manner to the transition probabilities. Consider the policy over options based on the probabilities of terminating or continuing the option policies given a reward approximator  ˆrθ(s, a)for the state-action reward.


These formulations of the reward function account for option transition probabilities, including the probability of terminating the current option and therefore selecting a new one according to the policy over options.

With  ω0selected according to  πζ,Ω(ω|s), we can define a parameterization of the discounted return R in the style of a maximum causal entropy RL problem with objective  maxθ Eτ∼D[log(pθ(τ))], where


MLE Derivative

We can write out our MLE objective for our generator. We may or may not know the option trajectories in our expert demonstrations, but they are estimated below according to the policy over options. This is defined similarly in (Fu et al., 2018) and (Finn et al., 2016a) as  J(θ) = Eτ∼τ E[�Tt=0 RΩζ,θ,δ(st, at)] − Epθ[�Tt=0�ω∈Ω πζ,Ω(ω|st)Rθ,δ(st, ω, at)]. The

full derivation is shown as (with generator  pθ):

We go from Line 4 to 5 seeing  RΩζ,θ,δ(st, at) = �ω∈Ω πζ,Ω(ω|st)Rθ,δ(st, ω, at).

Now, we take the gradient of the MLE objective w.r.t  θ yields,

We perform importance sampling over the hard to estimate generator density. We make an importance sampling distribution µt,w(τ)for option w.

We sample a mixture policy  µω(a|s)defined as 12πω(a|s) + 12 ˆpω(a|s)and  ˆpω(a|s)is a rough density estimate trained on the demonstrations. We wish to minimize the  DKL(πw(τ)|pω(τ)). KL refers to the Kullback–Leibler divergence metric between two probability distributions. Our new gradient is:


Taking the derivative of the discounted option return results in



We formulate the discriminator as the odds ratio between the policy and the exponentiated reward distribution for option  ωas in AIRL parameterized by  θ. We have a discriminator for each option  ωand generator option policy  πw,


C.1. Recursive Loss Formulation

We minimize the cross-entropy loss between expert demonstrations and generated examples assuming we have the same number of options in the generated and expert trajectories. We define the loss function  Lθas follows:


The total loss for the entire trajectory can be expressed recursively as follows by taking expectations over the next options or states:


The agent wishes to minimize  Lθ,δto find its optimal policy.

We can let cost function  fθ,w(s, a) = Lθ,δ(s, ω, a)as shown in AIRL and we have:


C.2. Optimization Criteria

For a given option  ω, we can write the reward function ˆRθ,δ(s, ω, a)to be maximised, as follows. Note that  θparameterizes the state-action reward function estimate for option  ω. −LDis the negative discriminator loss. We therefore turn our minimization problem into a maximization problem. We define our objective similar to the GAN objective from AIRL:


Now we can write out our reward function in terms of the optimal discriminator


The derivative of this reward function can now be computed as follows:


Writing out our discriminator objective yields:


We set a mixture of experts and novice as  ¯µobservations.

We can take the derivative w.r.t  θ(state-action reward function estimate parameter):


We can multiply the top and bottom of the fraction in the mixture expectation by the state marginal  πω(st) =�a∈A πω(st, at).This allows us to write  ˆpθ,t,ω(st, at) = exp(Lθ,δ(st, ω, at))πω,t(st). Now we have an importance sampling.


It is now easy to see we have the same form as our MLE objective loss function, our loss (the function we approximate with the GAN) is the discounted reward for a state action pair with the expectation over options. We change the loss functions to reward functions to show this, as they are defined equivalently.


In addition, we can decompose the reward into a state-action reward and a future discounted sum of rewards considering the policy over options as follows:


e are given a mixture of experts and generated policies as  ¯µtand perform importance sampling with respect to this distribution.

The architecture for our GAN-IRL framework is described in Figure 5.

A substantial amount of this proof is derived from (Fu et al., 2018).

Lemma 1:  fθ,ω(s, a)recovers the advantage.

Proof: It is known that when  πω = πEω , we have achieved the global min of the discriminator objective. The discriminator must then output 0.5 for all state action pairs. This results in exp(fθ,ω(s, a)) = πEω (a|s). Equivalently we have  f ∗ω(s, a) =log  πEω (a|s) = A∗(s, a, ω).

Definition 1: Decomposability condition. We first define 2 states  s1, s2as 1-step linked under dynamics  T(s′|s, a) if there


Figure 5. Architecture of GAN-IRL framework.

exists a state s that can reach  s1and  s2with non-zero probability in one timestep. The transitivity property holds for the linked relationship. We can say that if  s1 and s2and linked,  s2 and s3are linked then  s1 and s3must also be linked.

The Decomposability condition for transition dynamics T holds if all states in the MDP are linked with all other states.

Lemma 2: For an MDP, where the decomposability condition holds for all dynamics. For arbitrary functions a(s), b(s), c(s), d(s), if for all  s and s′


and for all s


where constsis a constant dependent with respect to state s.

Proof: If we rearrange Equation 48, we can obtain the quality  a(s) − c(s) = b(s′) − d(s′).

Now we define  f(s) = a(s) − c(s). Given our equality, we have  f(s) = a(s) − c(s) = b(s′) − d(s′). This holds for some function dependent on s.

To represent this,  b(s′) − d(s′)must be equal to a constant (with the constant’s value dependent on the state s) for all one-step successor states  s′ from s.

Now, under decomposability, all one step successor states (s′) from s must be equal through the transitivity property so b(s′) − d(s′)must be a constant with respect to state s. Therefore, we can write  a(s) = c(s) + const+sfor an arbitrary state s and functions b and d.

Substituting this into the Equation 48, we can obtain  b(s) = d(s) + consts. This completes our proof.


where  S(k)is the k-th successor state reached in k time-steps from the current state. Let us denote by  T π,(k)(s, S(k))the probability of transitioning from state  s to S(k) in ksteps using policy  π. Then, we can express  T π,(k)(s, S(k))recursively

as follows:

where  T π(s′, S(k))is the one-step transition probability from state  s′ to state S(k) (by definition of the Bellman operator).


as follows:


where  µis the state-distribution.

The unbiased estimator  ˆs(k) of an unknown successor state  S(k) is given by:


where  P(S(k))is given in (53).

Now, replacing  S(k) in (51) with its unbiased estimator  ˆs(k) as given by (54), we have


for some function f, where (a) holds since  ˆs(k) depends only on k. Thus, we get a(s) = c(s)+const. and b(s) = d(s)+const. where the constant is with respect to the state s.

Theorem 1: Suppose we have, for a MDP where the decomposability condition holds,


where  hΦis a shaping term. If we obtain the optimal  f ∗θ,ω(s, a, s′), with a reward approximator  g∗ω(s, a). Under deterministic dynamics the following holds



Proof: We know  f ∗ω(s, a, s′) = A∗(s, a, ω) = Q∗(s, a, ω) − V ∗Ω(s) = r∗ω(s) + γV ∗Ω(s′) − V ∗Ω(s). We can substitute the definition of  f ∗ω(s, a, s′)to obtain our Theorem.

Where Q(s, ω) = �a∈A πω,α(a|s)[rω,θ(s, a) + γ �s′∈S P(s′|s, a) ((1 − βδ,ω(s′))Q(s′, ω) + βδ,ω(s′)VΩ(s′))]and VΩ(s) = �ω∈Ω πΩ,ζ(ω|s)Q(s, ω)

Q(s, a, ω) = πω,α(a|s)[rω,θ(s, a) + γ �s′∈S P(s′|s, a) ((1 − βδ,ω(s′))Q(s′, ω) + βδ,ω(s′)VΩ(s′))]

which holds for all  s and s′. Now we apply Lemma 2. We say that  a(s) = g∗ω(s) − h∗Φ(s), b(s′) = γh∗Φ(s′), c(s) = r(s) −V ∗Ω(s) and d(s′) = γV ∗Ω(s′)and rearrange according to Lemma 2. We therefore have our results that  g∗ω(s) = rω(s) + cω.Where  cωis a constant.

Definition 2: Reward Approximator Error. From Theorem 1, we can see that our reward approximator  g∗ω(s) =rω(s) + cω. We define a reward approximator error over all options as  δr = �ω∈Ω πΩ(ω)|g∗ω(s) − r∗(s)|. This error is bounded by


By definition of  g∗ω(s).

Lemma 3: The Bellman operator for options in the IRL problem is a contraction.

Proof: We prove this by Cauchy-Schwarz and the definition of the sup-norm. We must define this inequality in terms of the IRL problem where we have a reward estimator  ˆgθω(s)under our learned parameter  θand an optimal reward estimator r∗(s).


This is given by Lemma 3 and (Sutton et al., 1999) [Theorem 3].

Giving our results  maxs′′,ω′′ |QπΩ,t(s, ω) − Q∗(s, ω)| ≤ ϵ + maxω∈Ω cω. For ϵ ∈ R>0

Theorem 2:  gθ(s) + γQ(s′, ω)converges to  Q∗.

where (a) follows from Lemma 3 and (b) holds since �s′∈S P(s′|s, a) ≤ 1.

G.1. MuJoCo Tasks

For these experiments, we use PPO to obtain an optimal policy given our ground truth rewards for 2 million iterations and 20 million on the complex tasks. This is used to obtain the expert demonstrations. We sample 50 expert trajectories. PPOC is used for the policy optimization step for the policy over options. We tune the deliberation cost hyper-parameter via cross-validation. The optimal deliberation cost found was 0.1 for PPOC. We also use state-only rewards for the policy transfer tasks. The hyperparameters for our policy optimization are given in Table 4.

Our discriminator is a neural network with the optimal architecture of 2 linear layers of 50 hidden states, each with ReLU activation followed by a single node linear layer for output. We also tried a variety of hidden states including 100 and 25 and tanh activation during our hyperparameter optimization step using cross-validation.

The policy network has 2 layers of 64 hidden states. A batch size of 64 or 32 is used for 1 and any number of options greater than 1 respectively. No mini-batches are used in the discriminator since the recursive loss must be computed. There are 2048 timesteps per batch. Generalized Advantage Estimation is used to compute advantage estimates. We list additional network parameters in the next section. The output of the policy network gives the Gaussian mean and the standard deviation. This is the same procedure as in (Schulman et al., 2017).

Table 4. Policy Optimization parameters for MuJoCo Parameter Value


G.2. MuJoCo Continuous Control Tasks

In this section, we describe the structure of the objects that gait in the continuous control benchmarks and the reward functions. For the transfer learning tasks, we use the same reward function described here for the Ant.

Walker: The walker is a planar biped. There are 7 rigid links comprised of legs, a torso. This includes 6 actuated joints. This task is particularly prone to falling. The state space is of 21 dimensions. The observations in the states include joint angles, joint velocities, the center of mass’s coordinates. The reward function is  r(s, a) = vx − 0.005||a||22. The termination condition occurs when  zbody < 0.8, zbody > 2.0 or ||θy|| > 1.0.

Half-Cheetah: The half-cheetah is a planar biped also like the Walker. There are 9 rigid links comprised of 9 actuated joints, a leg and a torso. The state space is of 20 dimensions. The observations include joint angles, the center of mass’s coordinates, and joint velocities. The reward function is  r(s, a) = vx − 0.005||a||22. There is no termination condition.

Ant: The ant has four legs with 13 rigid links in its structure. The legs have 8 actuated joints. The state space is of 125 dimensions. This includes joint angles, joint velocities, coordinates of the center of mass, the rotation matrix for the body, and a vector of contact forces. The function is  r(s, a) = vx − 0.005||a||22 − Ccontact + 0.05, where  Ccontactis a penalty for contacts to the ground. This is  5 × 10−4||Fcontact||22. Fcontact is the contact force. It’s values are clipped to be between 0 and 1. The termination condition occurs  zbody < 0.2 or zbody > 1.0.


Figure 6. Architecture of the actor-critic policies on MiniGrid. Conv is Convolutional Layer and filter sized is described below. FC is a fully connected layer.

G.3. MiniGrid Tasks

For experiments, we used the PPOC algorithm with parallelized data collection and GAE. 0.1 is the optimal deliberation cost. Each environment is run with 10 random network initialization. As before, in Table 5, we show some of the policy optimization parameters for MiniGrid Tasks. We rely on an actor-critic network architecture for these tasks. Since the state space is relatively large and spatial features are relevant, we use 3 convolutional layers in the network. The network architecture is detailed in Figure 6. n and m are defined by the grid dimensions.

The discriminator network is again an neural network with the optimal architecture of 3 linear layers of 150 hidden states, each with ReLU activation followed by a single node linear layer for output.

Table 5. Policy optimization parameters for benchmark tasks in MiniGrid


Designed for Accessibility and to further Open Science