Learning Reusable Options for Multi-Task Reinforcement Learning

2020·Arxiv

Abstract

Abstract

Reinforcement learning (RL) has become an increasingly active area of research in recent years. Although there are many algorithms that allow an agent to solve tasks efficiently, they often ignore the possibility that prior experience related to the task at hand might be available. For many practical applications, it might be unfeasible for an agent to learn how to solve a task from scratch, given that it is generally a computationally expensive process; however, prior experience could be leveraged to make these problems tractable in practice. In this paper, we propose a framework for exploiting existing experience by learning reusable options. We show that after an agent learns policies for solving a small number of problems, we are able to use the trajectories generated from those policies to learn reusable options that allow an agent to quickly learn how to solve novel and related problems.

1 Introduction

Reinforcement learning (RL) techniques have experienced much of their success in simulated environments, such as video games [Mnih et al., 2015] or board games [Silver et al., 2016; Tesauro, 1995]. One of the main reasons why RL has worked so well in these applications is that we are able simulate millions of interactions with the environment in a relatively short period of time, allowing the agent to experience a large number of different situations in the environment and learn the consequences of its actions.

In many real world applications, however, where the agent interacts with the physical world, it might not be easy to generate such a large number of interactions. The time and cost associated with training such systems could render RL an unfeasible approach for training in large scale.

As a concrete example, consider training a large number of humanoid robots (agents) to move quickly, as in the Robocup competition [Farchy et al., 2013]. Although the agents have similar dynamics, subtle variations mean that a single policy shared across all agents would not be an effective solution. Furthermore, learning a policy from scratch for each agent is too data-inefficient to be practical. As shown by Farchy et al. (2013), this type of problem can be addressed by leveraging the experience obtained from solving a related task (e.g., walking) to quickly learn a policy for each individual agent that is tailored to a new task (e.g., running).

The situation where agents might need to solve many related, but unique, tasks also occurs in industry; an example would be robots (agents) tasked with sorting items in fulfill-ment centers. A simple approach, like using PD controllers, would fail to adapt to the forces generated from picking up objects with different weight distributions, causing the agent to drop the objects. RL is able to mitigate this problem by learning a policy for each agent that is able to make corrections quickly, which is tailored to the robot’s dynamics. However, training a new policy for each agent would be far too costly to be a practical solution.

In these scenarios, it is possible to use a small number of policies learned by a subset of the agents, and then leverage the experience obtained from learning those policies to allow the remaining agents to quickly learn their corresponding policies. This approach can turn problems that are prohibitively expensive to solve into relatively simple problems.

To make use of prior experience and improve learning on new related problems in RL, several lines of work, which are complementary to each other, have been proposed and are actively being studied. Transfer learning, [Taylor and Stone, 2009] refers to the problem of adapting information acquired while solving one task to another. One might consider learning a mapping function that allows for a policy learned in one task to be used in a different task, [Ammar et al., 2015], or simply learn a mapping of the value function learned in one task to another, [Taylor et al., 2007]. These techniques can be quite effective, but are also limited in that they consider mapping information from a source task to a target task; that is, they do not learn a general transfer strategy for many related tasks.

Another approach to reusing prior knowledge is through meta learning or learning to learn [Schmidhuber, 1995; Schmidhuber et al., 1998]. In the context of RL, the goal under this framework is usually for an agent to be exposed to a number of tasks where it can learn some general behavior that generalizes to new tasks. For example, Finn et al. (2017), showed that an agent who learns how to walk forward is able to find a general policy that can quickly be adapted to learn to walk backwards [Finn et al., 2017].

One last technique to leverage prior experience, and the one this paper focuses on, is through temporally extended actions or temporal abstractions [McGovern and Sutton, 1998; Sutton et al., 1999]. While in the standard RL framework the agent has access to a set of primitive actions (i.e., actions that last for one time step), temporally extended actions allow an agent to execute actions that last for several time-steps. They introduce a bias in the behavior of the agent which, if appropriate for the problem at hand, results in dramatic improvements in how quickly the agent learns to solve a new task compared to only using primitive actions [McGovern and Sutton, 1998].

A popular representation for temporally extended actions is the options framework (formally introduced in the next section), which is the focus of this work. It has been shown that options learned in a specific task or set of tasks, can be reused to improve learning on new tasks [Machado et al., 2017; Bacon et al., 2017]; however, this often requires knowledge from the user about which options or how many options are appropriate for the type of problems the agent will face.

In this paper, we propose learning reusable options for a set of related tasks with minimal information provided by the user. We consider the scenario where the agent must solve a large numbers of tasks and show that after learning a well-performing policy for a small number of problems, we can learn an appropriate number of options that facilitates learning in a remaining set of tasks. Ideally the trajectories used to learn options would be obtained from optimal policies, but for many learning algorithms it cannot be guaranteed that the learned policies are actually optimal. We propose learning a set of options that minimize the expected number of decisions needed to represent trajectories generated from the policies learned by the agent for a small number of problems, while also maximizing the probability of generating those trajectories. Our experiments show that after learning to solve a small number of tasks, the learned options allow the agent to much more quickly solve the remaining tasks.

2 Background and Notation

A Markov decision process (MDP) is a tuple, M = , where S is the set of possible states of the environment, A is the set of possible actions that the agent can take, is the probability that the environment will transition to state if the agent executes action in state is the expected reward received after taking action a in state s and transitioning to state is the initial state distribution, and is a discount factor for rewards received in the future. We use t to index the time-step and write , and to denote the state, action, and reward at time t. A policy, , provides a conditional distribution over actions given each possible state: . We denote a trajectory of length t as , that is, is defined as a sequence of states, actions and rewards observed after following some policy for t time-steps.

This work focuses on learning temporally extended actions—actions lasting for multiple time-steps—that can be used for a set of related tasks. We consider the setting where

an agent must solve a set of related tasks, where each task is an MDP, ; that is, each task is an MDP with its own transition function, reward function and initial state distribution, with shared state and action sets. Specifically, our work focuses on learning reusable options [Sutton and Precup, 1998; Sutton et al., 1999] for a set of related tasks. An option, , is a tuple in which is the set of states in which option o can be executed (the initiation set), is a policy that governs the behavior of the agent while executing o, and is a termination function that determines the probability that o terminates in a given state. We assume that for all options o; that is, the options are available at every state. The options framework does not dictate how an agent should choose between available options or how options should be discovered. A common approach to selecting between options is to a learn a policy over options, which is defined by the probability of choosing an option in a particular state. Two recent popular approaches to option discovery are eigenoptions [Machado et al., 2017] and the option-critic architecture [Bacon et al., 2017].

The eigenoptions [Machado et al., 2017] of an MDP are the optimal policies for a set of implicitly defined reward functions called eigenpurposes. Eigenpurposes are defined in terms of proto-value functions [Mahadevan, 2005], which are in turn derived from the eigenvectors of a modified adjacency matrix over states for the MDP. The intuition is that no matter the true reward function, the eigenoptions allow an agent to quickly traverse the transition graph, resulting in better exploration of the state space and faster learning. However, there are two major downsides: 1) the adjacency matrix is often not known a priori, and may be difficult or impossible to construct for large MDPs, and 2) for each eigenpurpose, constructing the corresponding eigenoption requires solving a new MDP.

The option-critic architecture [Bacon et al., 2017] is a more direct approach that learns options and a policy over options simultaneously. The option policies and their termination functions are trained using policy gradient methods, while the policy over options may be trained using any technique. One issue that often arises within this framework is that the termination functions of the learned options tend to collapse to “always terminate”. In a later publication, the authors built on this work to consider the case where there is a cost associated with switching options [Harb et al., 2018]. This method resulted in the agent learning to use a single option while it was appropriate and terminate when an option switch was needed, allowing it to discover improved policies for a particular task. The authors argue that minimizing the use of the policy over options may be desirable, as the cost of choosing an option may be greater than the cost of choosing a primitive action when using an option—e.g., when a planner is used to select an option. Work recently presented by Harutyunyan et al. (2019) approaches the aforementioned termination problem by explicitly optimizing the termination function of options to focus on small regions of the state space. However, while all of these methods can be effective in learning a policy for the task at hand, they do not explicitly take into consideration that the agent might face related, but different, tasks in the future. In contrast, our method discovers options that are useful for a variety tasks.

We build on the idea that minimizing the number of decisions an agent must make will lead to the discovery of generally useful temporal abstractions, and propose an offline technique where options are learned after solving a small number of tasks. The options can then be leveraged to quickly solve new related problems the agent will face in the future. We use the trajectories generated by the agent when learning policies for a small number of problems, and learn an appropriate set of options by directly minimizing the expected number of decisions the agent makes while simultaneously maximizing the probability of generating the observed trajectories.

3 Learning Reusable Options from Experience

In this section, we formally introduce the objective we use to learn a set of options that are reusable for a set of related tasks. Our algorithm introduces one option at a time until introducing a new option does not improve the objective further. This procedure results in a natural way of learning an adequate number of options without having to pre-define it; a new option is included only if it is able to improve the probability of generating optimal behavior while minimizing the number of decisions made by the agent.

3.1 Problem Formulation

In the options framework, at each time-step, t, the agent chooses an action, , based on the current option, . Let be a Bernoulli random variable, where if the previous option, , terminated at time t, and otherwise. If is chosen using the policy over options, . If , then the previous option continues, that is, . To ensure that we can represent any trajectory, we consider primitive actions to be options which always select one specific action and then terminate; that is, for an option, o, corresponding to a primitive, a, for all , the termination function would be given by , and the policy by if and 0 otherwise.

Let denote a set of options, , where refers to the set of options corresponding to primitive actions and to the set of options corresponding to temporal abstractions. Furthermore, let H be a random variable denoting a trajectory of length |H| generated by a well-performing policy, and let be a random variable denoting the sub-trajectory of H up to the state encountered at time-step t. We seek to find a set, , that maxi- mizes the following objective:

where is a regularizer that encourages a diverse set of options, and is a scalar hyper-parameter. If we are also free to learn the parameters of , then

Intuitively, we seek to find options that terminate as infrequently as possible while still generating well-performing trajectories with high probability. Notice that minimizing the number of terminations is the same as minimizing the number of decisions made by the policy over options, as each termination requires the policy to subsequently choose a new option. Given a set of options, a policy over options, and a sample trajectory, we can calculate the joint probability for a trajectory exactly. Therefore, we can obtain an accurate estimate of Equation 1 by averaging over a set of sample trajectories. In the next section, we present a slight modification to our objective that results in a practical optimization problem.

3.2 Optimization Objective for Learning Options

Given that the agent must learn the corresponding policy for a set of tasks, we can use the experienced gathered from solving a subset of tasks to obtain trajectories demonstrating the optimal behavior learned for these problems. Given a set, H, of trajectories generated from an initial subset of tasks, we can now estimate the expectation in (1) to learn options that can be leveraged in the remaining problems.

Because the probability of generating any trajectory approaches 0 as the length of the trajectory increases, we make a slight modification to the original objective that leads to better numerical stability. We explain these modifications after introducing the objective , which we optimize in practice:

The objective in (2) is derived from J with the following modifications: 1) the sum of the two first terms replaces a product of two terms obtained from computing the joint probability in J, 2) the summation over terminations for a trajectory (second term) is normalized by the length of the trajectory, and 3) we introduce a scalar weight to balance the contribution of each term to . Although this is not an unbiased estimator of J, we derived from J with the introduction of some modifications for numerical stability. A more detailed discussion on how we arrived to this objective is provided in Appendix A.

We can express (2) entirely in terms of the policy over options , options and the transition function, P. When the transition function is not known, we can estimate it from data by assuming a family of distributions and fitting the parameters. The following theorems show how to calculate the first two terms in (2) from known quantities, allowing us to efficiently maximize the proposed objective.

Theorem 1. Given a set of options, O, and a policy, , over options, the expected number of terminations for a trajectory h is given by:

where we use as a shorthand notation for ,

and , π, , o).

Proof. See Appendix B.

Theorem 2. Given a set of options O and a policy over options, the probability of generating a trajectory h of length |h| is given by:

where f is a recursive function defined as:

Proof. See Appendix C.

Given a parametric representation of the option policies and termination functions for each and for the policy over options, we use Theorems 1 and 2 to differentiate the objective in (2) with respect to their parameters and optimize with any standard optimization technique.

3.3 Learning Options Incrementally

One common issue in option discovery is identifying how many options are needed for a given problem. Oftentimes this number is predefined by the user based on intuition. In such a scenario, one could learn options by simply randomly initializing the parameters of a number of options and optimizing the proposed objective in (2). Instead, we propose not only learning options, but also the number of options needed, by the procedure shown in Algorithm 1. This algorithm introduces one option at a time and optimizes the objective with respect to the policy over options , with parameters , and the newly introduced option, , with parameters and , for N epochs. Optimizing both and allows us to estimate how much we can improve given that we keep any previously introduced option fixed.

After the new option is trained, we measure how much has improved; if it fails to improve above some threshold, , the procedure terminates. This results in a natural way of obtaining an appropriate number of options, as options stop being added once a new option no longer improves the ability to represent the demonstrated behavior.

4 Experimental Results

This section describes experiments used to evaluate the proposed offline option learning approach. We show results in the “four rooms” domain to allow us to visualize and understand the options produced by the approach, and to show empirically that these options produce a clear improvement in learning. We compare against options generated by the

(a) Visualization of loss for sampled trajectories over 200 training epochs. Every 50 training epochs a new option is introduced. For a given set of sampled trajectories, the decreasing average number of decisions made by is shown in blue and the increasing probability of generating the observed trajectories is shown in red.

Figure 1: Results on four-room environment showing the improvement of the training loss (left), and learning curves on test problems (right).

option-critic architecture [Bacon et al., 2017] and eigenoptions [Machado et al., 2017]. We then extend our experiments to assess the performance of the technique in a few selected problems from the Atari 2600 emulator provided by OpenAI Gym [Brockman et al., 2016]. These experiments demonstrate that when an agent faces a large number of related tasks, by using the trajectories obtained from solving a small subset of tasks, our approach is able to discover options that significantly improve the learning ability of the agent in the tasks it has yet to solve.

4.1 Experiments on Four Rooms Environment

We tested our approach in the four rooms domain: a gridworld of size , in which the agent is placed in a randomly selected start state and needs to reach a randomly selected goal state. At each time-step, the agent executes one of four available actions: moving left, right, up or down, and receives a reward of . After taking a particular action, the agent moves in the intended direction with probability 0.9 and in any other direction with probability 0.1. Upon reaching the goal state, the agent receives a reward of +10. We generated 30 different task variations (by changing the goal and start location) and collected six sample trajectories from optimal policies, learned using Q-learning, from six different start and goal configurations. We evaluated our method on the remaining 24 tasks.

Each option was represented as a two-layer neural network, with 32 neurons in each layer, and two output layers: a softmax output layer over the four possible actions representing , and a separate sigmoid layer representing . We used the tabular form of Q-learning as the learning algorithm with -greedy exploration.

Figure 1a shows the change in the average expected number of terminations and average probability of generating the observed trajectories while learning options, as new options are introduced and adapted to the sampled trajectories. Options were learned over the six sampled optimal trajectories and every 50 epochs a new option was introduced to the op-

tion set, for a total of 4 options. For every new option, the change in probability of generating the observed trajectories as well as the change in expected number of decisions reaches a plateau after 30 or 40 training epochs. When a new option is introduced, there is a large jump in the loss because a new policy, , is initialized arbitrarily to account for the new option set being evaluated. However, after training the new candidate option, the overall loss improves beyond what it was possible before introducing the new option.

In Figure 1b, we compare the performance of Q-learning on 24 novel test tasks (randomly selected start and goal states) using options discovered from offline option learning (with and without regularization using KL divergence), eigenoptions, and option critic. We allowed each competing method to learn options from the same six training tasks and, to ensure a fair comparison, we used the original code provided by the authors. As baselines, we also compare against primitive actions and randomly initialized options. It might seem surprising that both eigenoptions and the option-critic failed to reach an optimal policy when they were shown to work well in this type of problem; for that we offer the following explanation. Our implementation of four rooms is defined in a much larger state space than the ones where these methods were originally tested, making each individual room much larger. Since the options identified by these methods tend to lead the agent from room to room, it is possible that, once in the correct room, the agent executes an option leading to a different room before it had the opportunity to find the goal. When testing our approach in the smaller version of the four room problem, we found no clear difference to the performance of the competing methods. In this setting, the options learned by our method found an optimal policy in all testing tasks in the allotted number of episodes. We set the threshold for introducing a new option to 10% of at the previous iteration and the hyper-parameter . When adding KL regularization, we set .

To understand the reason behind the improvement in performance resulting from offline option learning, we turn the

Figure 2: Visualization of our framework for learning options in four rooms domain. A novel task is seen in the top left, where the agent (red) has to navigate to a goal (green). On the top right, we show the solution found by the agent. The three rows below show how the options were learned and exploited in the new task. The highlighted area in the top row show a sample trajectory and the color corresponds to the probability that the option would take the demonstrated action. The middle row shows the probability of each option being executed at each state, while the bottom row shows the corresponding probability of termination.

reader’s attention to Figure 2. The figure is a visualization of the policy learned by the agent on a particular task: navigate from a specific location in the bottom-left room to a location in the top-right room in a small “four-room” domain of size . 1

The new task to solve is shown in the top-left figure, while the solution found is shown in the top-right figure. Each of the remaining rows of images shows how each option was learned and used in the new task. The first row show the options learned after training; the highlighted path depicts one of the sample trajectories used for training, the colors correspond to the probability that the options would take the demonstrated action, and the arrows indicate the most likely action to be taken by the option.

The middle row depicts a heatmap indicating how each option was used to solve this specific task. It shows the probability that would execute each option at any given state. Finally, the last row depicts a heatmap indicating the proba-

bility of termination for each option given the state.

Looking at the learned options from these different perspectives provides some insight into how they are being exploited. For example, option 1 is generally useful to navigate towards the top rooms and, since the goal in this task is in the top-right room, the option is mainly called in the bottom rooms. Also notice that the option is likely to terminate in the top left and bottom right rooms in the states that would lead the agent to get “trapped” against a wall. These options, when used in combination in specific regions, allow the agent to efficiently tackle problems it has not encountered before.

4.2 Experiments using Atari 2600 Games

We evaluated the quality of the options learned by our framework in two different Atari 2600 games: Breakout and Amidar. We trained the policy over options using A3C [Mnih et al., 2016] with grayscale images as inputs. Options were represented by a two layer convolutional neural network, and were given the previous two frames as input. For each task variation we randomly picked an integer between 1 and 20, and let the agent act randomly for that number of time-steps

Figure 3: Average returns on Breakout comparing primitives (blue), options before training (orange) and learned options for different values of . The shaded region indicates the standard error.

before learning (changing the initial state distribution), we sampled a number in the range (0, 10] and use it to scale the reward the agent received (changing the reward function), and allowed for a number of frames to be skipped after taking each action (changing the transition function). For used three different tasks for training for each game, and sampled 12 trajectories for training; we used five new tasks for testing. Each trajectory lasted until a life was lost, not for the entire duration of the episode. The options were represented by a two-layer neural network, where the input was represented by gray scale images of the last two frames. We ran 32 training agents in parallel on CPUs, the learning rate was set to 0.0001 and the discount factor was set to 0.99.

Figures 3 and 4 show the performance of the agent as a function of training time in Breakout and Amidar, respectively. The plots show that given good choices of hyperparameters, the learned options led to a clear improvement in performance during training. For both domains, we found that led to a reasonable trade-off between the first two term in , and report results with three different values for the regularization term: (no regularization), and . Note that our results do not necessarily show that the options result in a better final policy, but they improve exploration early in training and enable the agent to learn more effectively.

Figure 5 depicts the behavior for one of the learned options on Breakout. The option efficiently catches the ball after it bounces off the left wall, and then terminates with high probability before the ball has to be caught again. Bear in mind that the option remains active for many time-steps, signifi-cantly reducing the number of decisions made by the policy over options. However, it does not maintain control for so long that the agent is unable to respond to changing circumstances. Note that the option is only useful in specific case; for example, it was not helpful in returning a ball bounced off the right wall. That is to say, the option specialized in a spe-cific sub-task within the larger problem: a highly desirable property for generally useful options.

Figure 6 shows the selection of two of the options learned for Amidar when starting a new game. At the beginning of

Figure 4: Average returns on Amidar comparing primitives (blue), options before training (orange) and learned options for different values of . The shaded region indicates the standard error.

Figure 5: Visualization of a learned option executed until termination on Breakout. The option learned to catch the ball bouncing off the left wall and terminates with high probability before the ball bounces a wall again (ball size increased for visualization).

the game, option 1 is selected, which takes the agent to a specific intersection before terminating. The agent then selects option 2, which chooses a direction at the intersection, follows the resulting path, and terminates at the next intersection. Note that the agent does not need to repeatedly select primitive actions in order to simply follow a previously chosen path. Having access to these types of options enables an agent to easily replicate known good behaviors, allowing for faster and more meaningful exploration of the state space.

5 Conclusion and Future Work

In this work we presented an optimization objective for learning options from demonstrations obtained from learned policies on a set of tasks. Optimizing the objective results in a set of options that allows an agent to reproduce the behavior while minimizing the number of decisions made by the policy over options, which are able to improve the learning ability of the agent on new tasks.

There are some clear directions for future development. While we have shown that our method is capable of discovering powerful options, properly tuning the hyperparameters, and , is necessary for learning appropriate options. In complex environments, this is not an easy task. Future work could study methods for finding the right balance between hyperparameters automatically or, if possible, eliminate the need for such hyperparameters altogether. Another possible dimension of improvement is to study how to extend the proposed ideas to the online setting; an agent may be able to sample trajectories as it is learning a task and progressively use them to continuously improve its option set.

Figure 6: Visualization of two learned options on Amidar. The agent is shown in yellow and enemies in pink. Option 1 learned to move up, at the beginning of the game, and turn left until getting close to an intersection. Option 2 learned to turn in that intersection and move up until reaching the next one.

We provided results showing how options adapt to the trajectories provided and showed, through several experiments, that the identified options are capable of significantly improving the learning ability of an agent. The resulting options encode meaningful abstractions that help the agent interact with and learn from its environment more efficiently.

References

[Ammar et al., 2015] Haitham Bou Ammar, Eric Eaton, Paul Ruvolo, and Matthew E. Taylor. Unsupervised crossdomain transfer in policy gradient reinforcement learning via manifold alignment. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2504–2510. AAAI Press, 2015.

[Bacon et al., 2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, 2017.

[Brockman et al., 2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. CoRR, 2016.

[Farchy et al., 2013] Alon Farchy, Samuel Barrett, Patrick MacAlpine, and Peter Stone. Humanoid robots learning to walk faster: From the real world to simulation and back. In Proc. of 12th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.

[Finn et al., 2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[Harb et al., 2018] Jean Harb, Pierre-Luc Bacon, Martin Klissarov, and Doina Precup. When waiting is not an option: Learning options with a deliberation cost. In AAAI, 2018.

[Machado et al., 2017] Marlos C. Machado, Marc G. Bellemare, and Michael Bowling. A Laplacian Framework for Option Discovery in Reinforcement Learning. CoRR, 2017.

[Mahadevan, 2005] Sridhar Mahadevan. Proto-value func- tions: Developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML-2005), pages 553–560. ACM, 2005.

[McGovern and Sutton, 1998] A. McGovern and R. Sutton. Macro actions in reinforcement learning: An empirical analysis. Technical report, University of Massachusetts -Amherst, Massachusetts, USA, 1998.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.

[Schmidhuber et al., 1998] J¨urgen Schmidhuber, Jieyu Zhao, and Nicol N. Schraudolph. Learning to learn. chapter Reinforcement Learning with Self-modifying Policies, pages 293–309. Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[Schmidhuber, 1995] J¨urgen Schmidhuber. On learning how to learn learning strategies. Technical report, 1995.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.

[Sutton and Precup, 1998] Richard S. Sutton and Doina Pre- cup. Intra-option learning about temporally abstract actions. In In Proceedings of the 15th International Conference on Machine Learning (ICML-1998), 1998.

[Sutton et al., 1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.

[Taylor and Stone, 2009] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, December 2009.

[Taylor et al., 2007] Matthew E. Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings

for temporal difference learning. J. Mach. Learn. Res., 8:2125–2167, December 2007.

[Tesauro, 1995] Gerald Tesauro. Temporal difference learn- ing and td-gammon. Commun. ACM, 38(3):58–68, March 1995.

A Appendix

The following list defines the notation used in all derivations:

1. : random variable denoting action taken at step t.

2. : random variable denoting state at step t.

3. : random variable denoting history up to step .

4. : random variable denoting the event that the option used at step terminates at state .

5. : policy over options.

6. P: transition function. denotes the probability of transitioning to state by taking action a in state s

7. : random variable denoting the option selected for execution at state .

8. o: option defined as , where is the option policy for option and is the termination function.

9. Assume primitives are options that perform only 1 action and last for 1 time-step.

10. O: set of available options.

We can compute the probability of an option terminating at state and generating a trajectory as:

To compute the proposed objective J we need to find an expression for and in terms of known quantities.

A.1 Appendix A - Derivation of

Recall , ignoring the regularization term. Assuming access to a set H of sample

trajectories, we start by estimating J from sample averages and derive the objective as follows:

It can easily be seen that to maximize the above expression should be minimized while Pr(H = should be maximized. Given that for long trajectories the expected number of terminations increases while the probability of generating the trajectories goes to 0, we normalize the number of terminations by the lenght of the trajectory, |h|, and adjust a hyperparameter, , to prevent one term from dominating the other during optimization. Based on this observation we propose optimizing the following objective:

This objective allow us to control a trade-off, through , of how much we care about the options reproducing the demonstrated trajectories vs. how much we want the agent to minimize the number of decisions.

A.2 Appendix B - Proof of Theorem 1

Theorem 1 Given a set of options O and a policy over options, the expected number of terminations for a trajectory h of length |h| is given by:

and , π, , o).

Proof. Notice that , so if we find an expression for , we can calculate the expectation exactly. We define for ease of derivation even though there is no option to terminate at .

We are left with finding an expression in terms of known probabilities for .

Given that by convention, , we are now left with figuring out how to calculate

where Using the recursive function , the expected number of terminations for a given trajectory is given by:

A.3 Appendix C - Proof of Theorem 2

Theorem 2 Given a set of options O and a policy over options, the probability of generating a trajectory h of length |h| is given by: , where f is a recursive

function defined as:

1, if i = t βπ, o, a, o, i + 1)

, a, o, i + 1) , otherwise

Proof. We define to be the history from time i to time t, that is, ), where i < t. If i = t, the history would contain a single state.

We now need to find an expression to calculate . Consider the probability of seeing history given the previous state, s, the previous option, o, and the previous action, a:

Even though the equation above might seem complicated, there are only two cases we need to consider: either the current option terminates and a new one must be selected (the first term), or the current option does not terminate (the second term). Let’s consider each of them separately.

Case 1 - option terminates: If we terminate, we sum over new options:

Note that the expanded probability has the same form as . Case 2 - option does not terminate: This tells us that , so we may drop the dependency on the terms:

Plugging these two cases back into our earlier equation yields:

Note that each term contains an expression of the same form, . We can therefore compute the probability recursively. Our recursion will terminate when we consider i = t, as contains a single state, and we adopt the convention of its probability to be 1. Notice that for every recursive step, both inner terms will produce a term. Consider the result when we factor every recursive term to the front of the equation. We define the following recursive function:

1, if i = t βπ, o, a, o, i + 1)

, a, o, i + 1) , otherwise

. Notice that this is the recursive probability described above, but with the terms factored out. We now see that:

Plugging this all the back into our original equation for gives us the desired result:

A.4 Appendix D - Empirical Validation of Derived Equations

To double check the derivation of the proposed objective and make sure the implementation was correct, we conducted a simple empirical test to compared the calculated expected number of decisions in a trajectory and the probability of generating each trajectory for a set of 10 trajectories on 10 MDPs. The MDPs are simple chains of 7 states with different transition functions. We randomly initialized four options and a policy over options, and estimated the probability of generating each trajectory and the expected number of terminations, for each sampled trajectory, by Montecarlo sampling 10, 000 trials. Table 1 presents results for the 10 trajectories verifying empirically that the equations were correctly derived and implemented. The table compares the empirical and true probability of generating a given trajectory, and , respectively, and the empirical and true sum of expected number of decisions an agent has to make to generate those trajectories, and , respectively.

Table 1: Validation of equations and implementation.

Note that the cases with largest discrepancy between the estimated and calculated number of terminations occur when the probability of generating a trajectory is low. This happens because, since the trajectory is unlikely to be generated, the Monte Carlo sampling is not able to produce enough samples of the trajectory.