Learning State Abstractions for Transfer in Continuous Control

2020·Arxiv

Abstract

Abstract

Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take “simple learning algorithm” to be tabular Q-Learning, the “good representations” to be a learned state abstraction, and “challenging problems” to be continuous control tasks. Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning. We provide theory showing that learned abstractions maintain a bounded value loss, and we report experiments showing that the abstractions empower tabular Q-Learning to learn efficiently in unseen tasks.

1. Introduction

Finding the right representation is critical for effective reinforcement learning (RL). A domain like Backgammon can be tractable given a simple representation (Tesauro, 1995), but replace this typical input stream with a 3D point cloud of a Backgammon board, and the problem becomes intractable (Konidaris, 2019). In this context, representation learning could be thought of as a learning process that enables faster learning in future. In this paper we explore the limits of this reasoning. Specifically, we study whether unseen continuous control problems can be solved by simple tabular RL algorithms using a state-abstraction function learned from previous problems.

Identifying useful abstractions has long been a goal of AI and RL; indeed, understanding abstraction’s role in intelligence was one of the areas of inquiry of the 1956 Dartmouth Summer AI Workshop (McCarthy et al., 1955), and has been an active area of study since (Brooks, 1991; Dayan & Hin- ton, 1993; Singh et al., 1995; Sutton et al., 1999; Parr & Rus- sell, 1998; Dietterich, 2000; Li et al., 2006). We narrow this study by concentrating on state abstractions that translate

Figure 1. Learned state abstractions (right) in the Puddle World domain (left), where goal location is one of the four corners. The abstraction function is trained using three randomly chosen problems, and is then tested to solve the fourth problem. Notice that states that generally have the same optimal policy are clustered together in a common abstract state.

a continuous state space into a small discrete one, suitable for use with “tabular” learning algorithms which are simple and well-understood (Sutton & Barto, 2018). Prior work has studied a similar setting leading to tree-based methods for abstracting continuous state spaces (Moore, 1994; Uther & Veloso, 1998; Feng et al., 2004; Menashe & Stone, 2018), algorithms for finding state abstractions suitable for transfer (Walsh et al., 2006; Cobo et al., 2011; 2012; Abel et al., 2018), and methods for learning succinct representations of continuous states (Whiteson et al., 2007; Jonschkowski & Brock, 2015).

More concretely, we introduce and analyze an algorithm for learning a state abstraction. We then transfer the learned abstraction to help perform efficient reinforcement learning in unseen problems. Notably, this algorithm is well suited to continuous domains—after training, it outputs a state abstraction that maps continuous states into a small, finite state space, suitable for use with tabular RL algorithms. We present initial analysis and support for this abstractionlearning algorithm, emphasizing its ability to enable downstream tabular RL algorithms to solve new problems not seen during the training of the abstraction.

We provide a theorem on the sample complexity of learning the abstraction function, allowing us to relate the value loss under the learned abstraction function to both the number of samples used for training the abstraction function and to the Rademacher complexity of the set of abstraction functions in the hypothesis space. Moreover, our central experimental finding is that the learned state abstraction enables tabular Qlearning to perform sample efficient reinforcement learning in problems with continuous state spaces.

An example of a learned abstraction function in the Puddle World domain is shown in Figure 1. In section 4 we detail the algorithm used to learn these abstractions.

2. Background

We begin with background on RL and state abstraction.

We take the standard treatment of RL (Sutton & Barto, 2018): An agent learns to make decisions that maximize reward in an environment through interaction alone. We make the usual assumption that the environment can be modeled by a Markov Decision Process (MDP) (Puterman, 2014).

State abstraction describes methods for reducing the size of an MDP’s state space, typically by aggregating states together in the environmental MDP. With a smaller state space that preserves some characteristics, RL algorithms can often learn to make good decisions from fewer samples and with less computation.

More formally, a state abstraction is a function, that maps each ground state, , into an abstract state, . Usually, such state abstractions are chosen so that , thus the state space is more tractable to work with (either by making exploration, planning, or other aspects of decision making easier).

When the environment’s underlying state space is continuous however, a natural family of relevant abstractions are those that translate the continuous state space into a discrete one. Such functions can dramatically simplify problems that are otherwise intractable for certain types of RL algorithms (Moore, 1994; Uther & Veloso, 1998; Lee & Lau, 2004), like traditional Q-Learning (Watkins & Dayan, 1992) and TD-Learning (Sutton, 1988).

In order to study the sample complexity of our abstraction learning algorithm we use a tool from statistical learning theory referred to as Rademacher complexity (Bartlett & Mendelson, 2002; Mohri et al., 2018). Consider a function , and an arbitrarily large set of such functions F. We define Rademacher complexity of this set, Rad(F), as follows:

where , referred to as Rademacher random variables, are drawn uniformly at random from . One could think of these variables as independent and identically distributed noise. With this perspective, the average can be thought of as the covariance between and noise, or in other words, how well matches the noise. If for any realization of noise there exists an for which this average is large, then we can accurately learn noise, and the Rademacher complexity is large.

We can extend the previous Rademacher definition to vectorvalued functions that map to . Imagine a function where . Define the set of such functions G. We similarly define the Rademacher complexity of G as follows:

where are drawn uniformly randomly from

3. Related Work

We now summarize relevant prior literature.

We study state abstractions that map from continuous states to discrete ones—in this sense, we offer a method for learning to discretize the states of a continuous MDP. Prior literature has introduced algorithms for the same purpose. Moore (1994) introduced the Parti-Game algorithm, which uses a decision tree to dynamically partition a continuous state space based on the need for further exploration. That is, as data about the underlying environment is collected, state partitions are refined depending on a minimax score with respect to an adversary that prevents the learning algorithm from reaching the goal (and knows the current partitioning scheme). For Parti-Game to be applied, we must assume 1) the transition function is deterministic, 2) the MDP is goalbased and the goal state is known, and 3) a local greedy controller is available.

Feng et al. (2004) also make use of a tree-based approach— this time, kd-trees (Friedman et al., 1976)—to dynamically partition a continuous MDP’s state space into discrete regions. In contrast to Parti-Game, partitions are chosen based on value equivalence, thereby enabling a form of closure under the Bellman Equation.

Chapman & Kaelbling (1991) study tree-based partitioning as a means of generalizing knowledge in RL, leading to the development of the G algorithm. The G algorithm constructs a data-dependent tree of Q-value partitions based on which Q value can adequately summarize different regions of the state space. Over time, the tree will grow to sufficiently represent the needed distinctions in states. Further work uses decision trees of different forms to partition complex (and often continuous) state spaces into discrete models (McCallum, 1996; Uther & Veloso, 1998).

Menashe & Stone (2018) also introduced an algorithm for abstracting continuous state, with the goal of inducing a small, tractable decision problem. They present RCAST (Recursive Cluster-based Abstraction Synthesis Technique), a technique for constructing a state abstraction that maps continuous states to discrete ones. Like the other tree-based methods discussed, RCAST uses kd-trees to partition the state space. The key insight at the core of RCAST is to partition based on the ability to predict different factors that characterize the state space. This is the main departure between the approach of RCAST and our own algorithm, which is not tailored to factored representations. Krose & Van Dam (1992), who present an adaptive quantization for continuous state spaces, to be used with simple RL algorithms like TD-learning. Lee & Lau (2004) offer a similar technique based on vector quantization that performs partitioning based directly on data gathered by TD learning algorithm. Whiteson et al. (2007) introduce a method for adaptive tile coding, to be used in value function approximation. Liang et al. (2016) presented evidence that “shallow” learning algorithms can achieve competitive performance to many Deep RL algorithms in Atari games. Their approach constructs features that are well suited to the structure of Atari games, including properties like relative object and color locations. The main result of the work shows that with a well crafted set of features, even a simple learning algorithm can achieve competitive scores in Atari games.

We are not the first to explore transferring state abstractions. Indeed, Walsh et al. (2006) studied the process of transferring state abstractions across MDPs drawn from the same distribution. Abel et al. (2018) builds on this work by introducing PAC abstractions, which are guaranteed to retain their usefulness over a distribution of tasks. Notably, similar to our main theoretical result, PAC abstractions also retain value with respect to a distribution of MDPs with high probability based on the value loss of the abstraction family (Abel et al., 2016). The crucial difference is that, again, PAC abstractions are tailored to discrete state spaces, and do not extend to continuous ones. Cobo et al. (2011) and Cobo et al. (2012) study methods for finding state abstractions based on a demonstrator’s behavior. Similarly to our approach, their method is based on finding abstract states that can be used to predict what a demonstrator will do in those clusters. The key differentiating factor is that our algorithm targets continuous state spaces, while theirs focuses on discrete state spaces.

Majeed & Hutter (2018) discussed Q-Learning convergence in non-Markov decision problems. In particular, Theorem 7 of their work reports that Q-Learning converges in some non-MDPs, under relatively mild assumptions. They go on to characterize the necessary and sufficient conditions for Q-Learning convergence, which does in fact extend to some non-MDPs. In this sense, their work builds on Theorem 4 from Li et al. (2006) which states that Q-Learning, using a policy-based abstraction (similar to what we learn), can sometimes converge to a policy that is sub-optimal in the original problem. Indeed, in experiments, we find this to not be the case – it is still an open important question as to how certain abstractions effect both the estimation error and sample complexity of different learning algorithms.

A separate but equally relevant body of literature investigates learning state representations in the context of deep neural networks, typically for use in deep RL. For instance, Jonschkowski & Brock (2015) proposed learning state representations through a set of well chosen ontological priors, catered toward robotics tasks, including a simplicity prior and a causality prior (among others). These priors are then encoded into an optimization problem that seeks to jointly optimize over each of their prescribed properties. They conduct experiments in a continuous grid domain similar to Puddle World, comparing the use of all priors to each one individually, showcasing a consistent performance increase in learning with all priors. (Karl et al., 2017) developed a variational Bayes method for learning a latent state-space representation of a Markov model, given high dimensional observations. Critically, this state space is of a simple Markov model, and does not involve decision making or rewards, which are critical aspects of learning state representations in MDPs (Oh et al., 2017). For a full survey of recent state representation schemes for deep RL, see Lesort et al. (2018).

Naturally, many active areas of recent literature explore transfer for RL. For instance, work on learning to learn (Thrun & Pratt, 1998), or meta RL (Finn et al., 2017), investigate how to explicitly improve sample efficiency across a collection of tasks. Recent work in meta RL has studied how to improve model-based RL (Sæmunds- son et al., 2018) and how to learn a new RL algorithm explicitly (Wang et al., 2016). Other areas of research have explored transferring shaping functions (Konidaris & Barto, 2006; Konidaris et al., 2012), successor features (Barreto et al., 2017), skills (Konidaris & Barto, 2007; Da Silva et al., 2012; Brunskill & Li, 2014), hierarchies (Wilson et al., 2007; Mehta et al., 2008; Tessler et al., 2017), behavior (Taylor & Stone, 2005), and transfer for deep RL (Higgins et al., 2017; Parisotto et al., 2015; Teh et al., 2017). For a survey of transfer in RL see Taylor & Stone (2009).

4. Algorithm

In this section, we formulate an optimization problem and propose an algorithm for learning state abstractions for MDPs with continuous states. Our approach builds on a type of abstraction studied by Li et al. (2006) in the tabular setting. Specifically, the abstraction functions attempts to cluster together the states for which the optimal policy is similar and map these states to a common abstract state. Our goal will be to learn such an abstraction function on a set of training problems, and then transfer and use this representa-

tion for reinforcement learning in unseen problems.

We define a stochastic abstraction function as a function that maps from ground states to a probability distribution over abstract states C (Singh et al., 1995). We focus on problems where S is infinite, but C is finite (problems with continuous state space and an abstraction that maps to a discrete one). Our goal is to find a state abstraction that enables simple learning algorithms to perform data-efficient reinforcement learning on new tasks.

To learn an abstraction function, we introduce an objective that measures the probability of trajectories provided by our learned policy . The goal will be to maximize this probability if we were to use the abstraction function and a policy, , over abstract states. We formulate the optimization problem as follows:

where:

The second sum is not a function of the optimization variables, and could be dropped:

More importantly, we consider the case where our training set consists of K different MDPs. In this case, we solve for a generalization of the above optimization problem, namely:

If the solution to the optimization problem is accurate enough, then states that are clustered into a single abstract state generally have a similar policy in the MDPs used for training.

Although it is possible to jointly solve this optimization problem, for simplicity we assume is fixed and provided to the learner. We parameterize the abstraction function by a vector representing the weights of the network . We use softmax activation to ensure that a probability distribution.

A good setting of can be found by performing stochastic gradient ascent on the objective above, as is standard when optimizing neural networks (LeCun et al., 2015):

θ

In our experiments we used the Adam optimizer (Kingma & Ba, 2015) which can be thought of as an advanced extension of vanilla stochastic gradient ascent.

5. Theory

Given a learned state abstraction, it is natural to ask if the policy over the learned abstract state space can achieve a reasonable value. In this section we answer this question positively by providing a bound on the value loss of the abstract policy relative to the demonstrator policy.

We now outline our proof at a high level. The first step shows the abstraction learning algorithm described in Section (4) has bounded policy difference on expectation under states in the training set. We then use Rademacher complexity to generalize this policy difference to any set of states drawn from the same distribution. Given the generalization bound, we use the following lemma to conclude that the abstraction has bounded value loss.

Lemma 1. (Corollary of Lemma 2 in Abel et al. (2019)) Consider two stochastic policies, and on state space S, and a fixed probability distribution over S, p(s). If, for some

then:

To satisfy the assumption of the lemma, we first show that the objective function can be rewritten using KullbackLeibler (KL) divergence. For the rest of this section, we assume that each cluster assigns probability 1 to the

action a, allowing us to simplify notation:

Now, suppose that our training procedure has a bounded KL loss formalized below:

Using Pinsker’s inequality, we have:

That is, we have a bounded policy difference. Critically, this bound is only valid given set of states seen in the training data, but to leverage Lemma 1, we need to bound the generalization error. We get this result by introducing a theorem. Lemma below is used in the theorem.

Lemma 2. (Corollary 4 in (Maurer, 2016)) Assume a set of vector valued functions . Assume n L-Lipschitz functions . Then, the following inequality holds:

Theorem. With probability at least

Proof. We build on techniques provided by Bartlett & Mendelson (2002). First, note that

We can bound the expected value of

We do so as follows:

(Due to Lemma (2) and being 1-Lipschitz)

Further, it is easy to show that

Applying MacDiarmid’s inequality, for any

Combining (3) with (2), we can write:

Now recall from (1) that:

we can conclude the proof by claiming that

with probability at least

It is clear to see that, given the above bound, Lemma 1 ensures bounded value loss with probability at least

6. Experiments

We now present our experimental findings. At a high level, we conducted two types of experiments:

1. Single Task: We collected an initial data set be used to train the state abstraction based on MDP M. Then, we evaluate the performance of tabular QLearning on M, given this state abstraction. Since each M we test with has continuous state, tabular QLearning cannot be applied. However, we assess the performance of tabular Q-learning given . These experiments provide an initial sanity check as to whether the state abstraction can facilitate learning at all.

2. Multi-Task: We next considered a collection of MDPs . We collected an initial data set of (s, a) tuples from a (strict) subset of MDPs in the collection. We used to construct a state abstraction , which we then gave to Q-Learning to learn on one or many of the remaining MDPs. Critically, we evaluate Q-Learning on MDPs not seen during the training of

For each of the two experiment types, we evaluate in three different domains, Puddle World, Lunar Lander, and Cart Pole. Open-source implementation of Lunar Lander and Cart Pole are available on the web (Brockman et al., 2016). An implementation of Puddle World is included in our code. Full parameter settings and other experimental details are available in our anonymized code, which we make freely available for reproduction and extension.1

6.1. Puddle World

Our first experiment used the Puddle World MDP (Boyan & Moore, 1995), a continuous grid world in which states are represented by two coordinates, and . The agent is initialized at (0.25, 0.6), and is tasked with getting within .0025 of the goal location, which is placed at (1.0,1.0). Two large rectangles are placed in the domain, which produce reward for any time-step in which the agent is in either rectangle. All other transitions receive 0 reward, except for transitions into the goal state, which yield 1 reward. The agent is given four actions, up, down, left, and right. Each action moves the agent .05 in the given direction with a small amount of Gaussian noise.

In the single task Puddle World experiment, we trained the abstraction based on 4000 sampled ples from the puddle instance pictured in Figure 2a with only goal active, and the training policy. The samples are drawn from U(x, y), the joint uniform probability distribution over x and y. Notably, since the domain is continuous, the learning agent will necessarily find itself in states it did not see during training time. We experiment with tabular QLearning paired with the abstraction, denoted Q-Learning-(green), and Q-Learning with a linear function approximator, denoted Linear-Q (blue). We set both algorithms. It is worth noting that since the training of takes place in the same domain that we test Q-Learning-we should anticipate that the resulting pair learn quite well (assuming the is capable of supporting good learning at all).

Results for the single task case are presented in Figure 2d. The figures present the cumulative reward averaged over all 25 runs of the experiment, with 95% confidence intervals. As expected, we find that Q-Learning-consistently learns to navigate to the goal in only a few episodes. Conversely, the linear approach fails to reliably find the goal, but does occasionally learn to avoid running into the puddle, resulting in high variance performance.

In the multi-task Puddle World experiment, we train three out of the four possible goals (each goal defines an independent MDP), located in the four corners. The held-out goal is chosen uniformly randomly from the four possible goals. We leave one goal out of training to be used for testing. All parameters are set as in the single task case, except that the agents now learn for 250 episodes instead of 100. The abstraction learning algorithm was given a budget of 81 abstract states.

Results are presented in Figure 2g. Notably, the only positive reward available comes from reaching the goal, so we can conclude from the low-variance positive slope of the learning curve that Q-Learning with still learns to reach the goal reliably in only a few episodes. Moreover, Q-learning with linear function approximation fails to ever reliably find the goal.

6.2. Lunar

We next experiment with Lunar Lander, pictured in Figure 2b. The agent controls the purple ship by applying

Figure 2. Learning curves for the single task experiments (top) and the transfer experiments( bottom). Each line indicates the average cumulative reward received by the agent, reported with 95% confidence intervals.

thrusters on exactly one of the left, right, bottom, or no sides of the ship. The state is characterized by 8 state variables, 2 dimensional position and velocity information, angle and angular velocity, and two boolean flags that are active only if the corresponding leg touches a surface. The agent receives +100 reward when it lands or -100 when it crashes, and an additional +10 reward for each leg of the ship touching the ground. Critically, the agent receives -.3 reward every timestep it uses a thruster for each of the three thrusters available. The default reward signal in gym implementation includes a shaped reward that gives the agent additional positive reward the closer it gets to the landing zone, positioned at (0,0) on the screen (the flat area with the flags). In each run, the mountainous terrain to the left and right of the landing zone changes.

In the single task case, we train the state abstraction based on 10,000 quadruples, sampled according to the training policy (which reliably either lands or gets high shaped reward). We then give the resulting abstraction (which maps from the continuous 8-dimensional states to a discrete state space), to tabular Q-Learning and evaluate its learning performance.

Results are presented in Figure 2e. Notably, again, QLearning with learns a policy that reliably lands the ship in around 50 episodes. In contrast, the linear Q-Learning agent fails to ever learn a reasonable policy even after collecting 8 times the data. The results suggest that again, is capable of supporting sample efficient learning of a high value policy.

In the transfer case, we train similarly to the single task case. Then, we define a new instance of the Lunar problem with no shaped reward. In this new, non-shaped variant, the agent receives for every use of a thruster, for crashing, +100 for landing safely, and +10 anytime a

Figure 3. The average rate of successful landings over time in Lunar Lander without shaping (left) and the effect of the number of training data on RL performance in Puddle World (right). In the right plot, a point indicates the cumulative reward received by Q-Learning by the end of learning with the trained on the data set of the given size, averaged over 10 runs of the experiment, reported with 95% confidence intervals.

leg touches the ground. This variant of Lunar is extremely challenging, and to our knowledge, no known algorithm has solved this problem from scratch.

Learning curves are presented in Figure 2h, and a plot of the successful landing rate over time is also provided in Figure 3a. Here we find the strongest support for the usefulness of : Q-Learning-learns to land the ship more than half of the time after around 250 episodes. By the end of learning (500 episodes), Q-Learning with has learned to land the ship four out of every five trials.

6.3. Cart Pole

Cart Pole is a classical control problem first studied by Widrow & Smith (1964) in which an agent must learn to balance a pole upright by moving a cart at the base of the pole along a horizontal track. The state of the MDP consists of four variables: , denoting the position, velocity, angle, and angular velocity respectively. The agent is given only two actions: move left and move right. The reward signal is +1 when the pole is balanced (, and otherwise -10. When the pole moves outside of the positive reward region, the agent receives -10 reward and the MDP terminates.

We train based on a data set of 1000 quadruples collected from the training policy, with a learning rate of 0.001. Then, we give Q-Learning to learn on the same MDP for 50 episodes, with a max of 200 steps per episode. We repeat the experiment 20 times and report average learning curves and 95% confidence intervals.

Results are presented in Figure 2f. Here we find that LinearQ is able to efficiently learn a high value policy, along with Q-Learning-. From the set of single task experiments, we conclude that is at least sufficient to support tabular learning in continuous state MDPs in which was trained. In the multi-task case, we train with the same parameters as in the single task experiment. Then, we build a collection of 5 test MDPs where we change gravity from

to each of 5.0, 6.0, 8.0, and 12.0 . We then evaluate Q-Learning-on its average performance across each of these 5 test MDPs. When learning begins, we sample one MDP from the test set and let the agent interact with the sampled MDP for 200 episodes. After the last episode, the agent samples a new MDP uniformly from the same set of 5 and learns, repeating this process for 20 rounds.

Results are presented in Figure 2i. The learning curve is very abrupt; after only a handful of episodes, both learning algorithms again reliably learns to balance the pole, regardless of how much gravity has changed from the original problem. Here we find support for the use of a linear approximator in some settings, but naturally, as the environment becomes complex, the linear approach fails (as in the puddle and lunar experiments)

As a final experiment, we explore the effect of the size of the training data set used to train on the performance of the downstream RL task that will be used for. Our goal is to investigate how many samples are sufficient for the learned state abstraction to reliably produce good behavior in practice. In the experiment, we trained N = 4501 samples (by increments of 500) to convergence. We then gave the resulting to Q-Learning and let it learn in Puddle World with all parameters set as in the single task Puddle experiment, and report the cumulative reward of Q-Learning’s final episode.

Results are presented in Figure 3b. As expected, when the data set used to train only contains a single point, RL performance decays substantially, with Q-Learning receiving 0 reward on average. However, somewhat surprisingly, with only 501 data points, the computed is sufficient for Q-Learning to obtain an average cumulative reward of 0.5. Such an average score indicates that, in about half of the runs, Q-Learning-reliably found the goal, and, in the other half, it avoided the puddle.

7. Conclusion and Future Work

We introduced an algorithm for state abstraction and showed that transferring the learned abstraction to new problems can enable simple RL algorithms to be effective in continuous state MDPs. We further studied our abstraction learning process through the lens of statistical learning theory and Rademacher complexity, leading to the result that our learning algorithm can represent high value abstract policies. We believe that this work takes an important step towards the problem of automatic abstraction learning in RL.

One avenue for future research extends our approach to continuous action spaces; in this case, we can no longer leverage the fact that the number of actions are finite. Moreover, learning with continuous state and action is notoriously difficult and ripe for abstraction to make a big impact. Additionally, there are open questions on learning abstractions in a lifelong setting, in which agents continually refine their abstractions after seeing new problems. Lastly, it is unknown as to how different kinds of abstractions effect the sample complexity of RL (Kakade, 2003). Thus, we foresee connections between continuous state exploration (Pazis & Parr, 2013) and learning abstractions that can lower the sample complexity of RL.

References

Abel, D., Hershkowitz, D. E., and Littman, M. L. Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning, pp. 2915–2923, 2016.

Abel, D., Arumugam, D., Lehnert, L., and Littman, M. L. State abstractions for lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.

Abel, D., Arumugam, D., Asadi, K., Jinnai, Y., Littman, M. L., and Wong, L. L. State abstraction as compression in apprenticeship learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4055–4065, 2017.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Boyan, J. A. and Moore, A. W. Generalization in reinforce- ment learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, pp. 369–376, 1995.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Brooks, R. A. Intelligence without representation. Artificial Intelligence, 47(1-3):139–159, 1991.

Brunskill, E. and Li, L. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 316–324, 2014.

Chapman, D. and Kaelbling, L. P. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International

Joint Conference on Artificial Intelligence, volume 91, pp. 726–731, 1991.

Cobo, L. C., Zang, P., Isbell Jr, C. L., and Thomaz, A. L. Au- tomatic state abstraction from demonstration. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 22, pp. 1243, 2011.

Cobo, L. C., Isbell Jr, C. L., and Thomaz, A. L. Automatic task decomposition and state abstraction from demonstration. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 483– 490. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

Da Silva, B., Konidaris, G., and Barto, A. Learning param- eterized skills. Proceedings of the International Conference on Machine Learning, 2012.

Dayan, P. and Hinton, G. E. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pp. 271–278, 1993.

Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

Feng, Z., Dearden, R., Meuleau, N., and Washington, R. Dynamic programming for structured continuous markov decision problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 154–161. AUAI Press, 2004.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta- learning for fast adaptation of deep networks. Proceedings of the International Conference on Machine Learning, 2017.

Friedman, J. H., Bentley, J. L., and Finkel, R. A. An algo- rithm for finding best matches in logarithmic time. ACM Trans. Math. Software, 3(SLAC-PUB-1549-REV. 2):209– 226, 1976.

Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C. P., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. Proceedings of the International Conference on Machine Learning, 2017.

Jonschkowski, R. and Brock, O. Learning state represen- tations with robotic priors. Autonomous Robots, 39(3): 407–428, 2015.

Kakade, S. M. On the Sample Complexity of Reinforcement Learning. PhD thesis, University of London London, England, 2003.

Karl, M., Soelch, M., Bayer, J., and van der Smagt, P. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. Proceedings of the International Conference on Learning Representations, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic op- timization. Proceedings of the International Conference on Learning Representations, 2015.

Konidaris, G. On the necessity of abstraction. Current Opinion in Behavioral Sciences, 29:1–7, 2019.

Konidaris, G. and Barto, A. Autonomous shaping: Knowl- edge transfer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 489–496, 2006.

Konidaris, G. and Barto, A. G. Building portable options: Skill transfer in reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 7, pp. 895–900, 2007.

Konidaris, G., Scheidwasser, I., and Barto, A. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13(May):1333–1371, 2012.

Krose, B. J. and Van Dam, J. W. Adaptive state space quantisation for reinforcement learning of collision-free navigation. In Intelligent Robots and Systems, 1992., Proceedings of the 1992 lEEE/RSJ International Conference on, volume 2, pp. 1327–1332. IEEE, 1992.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.

Lee, I. S. and Lau, H. Y. Adaptive state space partitioning for reinforcement learning. Engineering applications of artificial intelligence, 17(6):577–588, 2004.

Lesort, T., D´ıaz-Rodr´ıguez, N., Goudou, J.-F., and Filliat, D. State representation learning for control: An overview. Neural Networks, 2018.

Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.

Liang, Y., Machado, M. C., Talvitie, E., and Bowling, M. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493. International Foundation for Autonomous Agents and Multiagent Systems, 2016.

Majeed, S. J. and Hutter, M. On Q-learning convergence for non-Markov decision processes. In IJCAI, pp. 2546– 2552, 2018.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.

McCallum, A. K. Learning to use selective attention and short-term memory in sequential tasks. In From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, volume 4, pp. 315. MIT Press, 1996.

McCarthy, J., Minsky, M. L., Rochester, N., and Shannon, C. E. A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine, 27(4):12, 1955.

Mehta, N., Natarajan, S., Tadepalli, P., and Fern, A. Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3):289, 2008.

Menashe, J. and Stone, P. State abstraction synthesis for discrete models of continuous domains. In 2018 AAAI Spring Symposium Series, 2018.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018.

Moore, A. W. The parti-game algorithm for variable reso- lution reinforcement learning in multidimensional statespaces. In Advances in Neural Information Processing Systems, pp. 711–718, 1994.

Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.

Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342, 2015.

Parr, R. and Russell, S. J. Reinforcement learning with hier- archies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049, 1998.

Pazis, J. and Parr, R. PAC optimal exploration in continuous space Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2013.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Sæmundsson, S., Hofmann, K., and Deisenroth, M. P. Meta reinforcement learning with latent variable gaussian processes. arXiv preprint arXiv:1803.07551, 2018.

Singh, S. P., Jaakkola, T., and Jordan, M. I. Reinforce- ment learning with soft state aggregation. In Advances in Neural Information Processing Systems, pp. 361–368, 1995.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2): 181–211, 1999.

Taylor, M. E. and Stone, P. Behavior transfer for value- function-based reinforcement learning. In Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems, pp. 53–59. ACM, 2005.

Taylor, M. E. and Stone, P. Transfer learning for reinforce- ment learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.

Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506, 2017.

Tesauro, G. Temporal difference learning and TD-gammon. Communications of the ACM, 38(3):58–68, 1995.

Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., and Mannor, S. A deep hierarchical approach to lifelong learning in minecraft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 3, pp. 6, 2017.

Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 1998.

Uther, W. T. and Veloso, M. M. Tree based discretization for continuous state space reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 769–774, 1998.

Walsh, T. J., Li, L., and Littman, M. L. Transferring state abstractions between MDPs. In ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Whiteson, S., Taylor, M. E., and Stone, P. Adaptive tile cod- ing for value function approximation, 2007. AI Technical Report AI-TR-07-339, University of Texas at Austin.

Widrow, B. and Smith, F. W. Pattern-recognizing control systems. Computer and Information Sciences (COINS) Proceedings, 1964.

Wilson, A., Fern, A., Ray, S., and Tadepalli, P. Multi-task re- inforcement learning: A hierarchical bayesian approach. In Proceedings of the International Conference on Machine Learning, pp. 1015–1022. ACM, 2007.