Combating the Compounding-Error Problem with a Multi-step Model

2019·Arxiv

Abstract

Abstract

Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state—a one-step model. This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction errors can get magnified, leading to unacceptable inaccuracy. This compounding-error problem plagues planning and undermines model-based reinforcement learning. In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions. Novel theoretical and empirical results indicate that the multi-step model is more conducive to efficient value-function estimation, and it yields better action selection compared to the one-step model. These results make a strong case for using multi-step models in the context of model-based reinforcement learning.

1 Introduction

The model-based approach to reinforcement learning (RL) offers a unique framework for addressing three important artificial intelligence (AI) problems: understanding the dynamics of the environment through interaction, using the acquired knowledge for planning, and performing sequential decision making. One promise of model-based RL is to enable sample-efficient learning [Sutton, 1990; Deisenroth and Rasmussen, 2011; Levine and Abbeel, 2014]. This advantage is well-understood in theory in settings such as value-function estimation [Azar et al., 2012], exploration in discounted and finite-horizon Markov decision processes (MDPs) [Szita and Szepesv´ari, 2010; Dann and Brunskill, 2015], and exploration in contextual decision processes [Sun et al., 2018].

Planning has long been a fundamental research area in AI [Hart et al., 1972; Russell and Norvig, 2016]. In the context of model-based RL, Sutton and Barto [2018] define planning as any process that goes from a model to a value function or a policy. They describe two notions of planning, namely decision-time planning and background planning. In decision-time planning, the agent utilizes the model during action selection by performing tree search [Kearns and Singh, 2002; Kocsis and Szepesv´ari, 2006; Silver et al., 2016]. An agent is said to be performing background planning if it utilizes the model to update its value or policy network. Examples of background planning include the Dyna Architecture [Sutton, 1990; Sutton et al., 2008], or model-based policy-search [Abbeel et al., 2006; Deisenroth and Rasmussen, 2011; Kurutach et al., 2018]. We call an algorithm model-based if it performs either background or decision-time planning.

A key aspect distinguishing model-based RL from traditional planning is that the model is learned from experience. As such, the model may be imperfect due to ineffective generalization [Abbeel et al., 2006; Nagabandi et al., 2018], inadequate exploration [Brafman and Tennenholtz, 2002; Shyam et al., 2018], overfitting [Asadi et al., 2018b], or irreducible errors in unrealizable settings [Shalev-Shwartz and Ben-David, 2014; Talvitie, 2014]. More generally, a common view across various scientific disciplines is that all models are wrong, though some are still useful [Box, 1976; Wit et al., 2012].

Previous work explored learning models that look a single step ahead. To predict H steps ahead, the starting point of a step is set to the end point of the previous step . Unfortunately, when the model is wrong, this procedure can interfere with successful planning [Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018b]. A significant reason for this failure is that the model may produce a “fake” input, meaning an input that cannot possibly occur in the domain, which is then fed back to the unprepared one-step model. Notice that two sources of error co-exist: The model is imperfect, and the model gets an inaccurate input in all but the first step. The interplay between the two errors leads to what is referred to as the compounding-error problem [Asadi et al., 2018b]. To mitigate the problem, Talvitie [2014] and Venkatraman et al. [2015] provided an approach, called hallucination, that prepares the model for the fake inputs generated by itself. In contrast, our approach to the problem is to avoid feeding such fake inputs altogether by using a multi-step model. Though multi-step algorithms are popular in the model-free setting [Sutton and Barto, 2018; Asis et al., 2017; Singh and Sutton, 1996; Precup, 2000], extension to the model-based setting remains less explored.

Our main contribution is to propose a Multi-step Model for Model-based RL (or simply ) that directly predicts the outcome of executing a sequence of actions. Learning the multi-step model allows us to avoid feeding fake inputs to the model. We further introduce a novel rollout procedure in which the original first state of the rollout will be the starting point across all rollout steps. Our theory shows that, relative to the one-step model, learning the multi-step model is more effective for value estimation. To this end, we study the hardness of learning the multi-step model through the lens of Rademacher complexity [Bartlett and Mendelson, 2002]. Finally, we empirically evaluate the multi-step model and show its advantage relative to the one-step model in the context of background planning and decision-time planning.

2 Background and Notation

To formulate the reinforcement-learning problem, we use finite-horizon Markov decsion processes (MDPs) with continuous states and discrete actions. See Puterman [2014] for a thorough treatment of MDPs, and Sutton and Barto [2018] for an introduction to reinforcement learning.

2.1 Lipschitz Continuity

Following previous work [Berkenkamp et al., 2017; Asadi et al., 2018b; Luo et al., 2018; Gelada et al., 2019] we make assumptions on the smoothness of models, characterized below.

Definition 1. Given metric spaces and is Lipschitz if ,

is finite. Similarly, is uniformly Lipschitz in A if the quantity below is finite:

Table 1: Notation used in the paper.

2.2 Rademacher Complexity

We use Rademacher complexity for sample complexity analysis. We define this measure, but for details see Bartlett and Mendelson [2002] or Mohri et al. [2012]. Also, see Jiang et al. [2015] and Lehnert et al. [2018] for previous applications of Rademacher in reinforcement learning.

Definition 2. Consider , and a set of such functions F. The Rademacher complexity of this set, Rad(F), is defined as:

where , referred to as Rademacher random variables, are drawn uniformly at random from .

The Rademacher variables could be thought of as independent and identically distributed noise. Under this view, the average quantifies the extent to which matches the noise. We have a high Rademacher complexity for a complicated hypothesis space that can accurately match noise. Conversely, a simple hypothesis space has a low Rademacher complexity.

To apply Rademacher to the model-based setting where the output of the model is a vector, we extend Definition 2 to vector-valued functions that map to . Consider a function where . Define the set of such functions G. Then:

where are drawn uniformly from .

2.3 Transition-Model Notation

Our theoretical results focus on MDPs with deterministic transitions. This common assumption [Abbeel et al., 2006; Talvitie, 2017] simplifies notation and analysis. Note that a deterministic environment, like nearly all Atari games [Bellemare et al., 2013], can still be quite complex and challenging for model-based reinforcement learning [Azizzadenesheli et al., 2018]. We introduce an extension of our results to a stochastic setting in the Appendix.

We use an overloaded notation for a transition model (see Table 1). Notice that our definitions support stochastic h-step transitions since these transitions are policy dependent, and that we allow for stochastic policies.

3 M3 – A Multi-step Model for Model-based Reinforcement Learning

is an extension of the one-step model—rather than only predicting a single step ahead, it learns to predict steps ahead using H different functions:

where a. Finally, by , we mean the set of these H functions:

Figure 1: (top) a 3-step rollout using a one-step model. (bottom) a 3-step rollout using a multi-step model Crucially, at each step of the multi-step rollout, the agent uses as the starting point. The output of each intermediate step is only used to compute the next action.

This model is different than the few examples of multi-step models studied in prior work: Sutton [1995] as well as van Seijen and Sutton [2015] considered multi-step models that do not take actions as input, but are implicitly conditioned on the current policy. Similarly, option models are multi-step models that are conditioned on one specific policy and a termination condition [Precup and Sutton, 1998; Sutton et al., 1999; Silver and Ciosek, 2012]. Finally, Silver et al. [2017] introduced a multi-step model that directly predicts next values, but the model is defined for prediction tasks.

We now introduce a new rollout procedure using the multi-step model. Note that by an H-step rollout we mean sampling the next action using the agent’s fixed policy , then computing the next state using the agent’s model, and then iterating this procedure for more times. We now show a novel rollout procedure using the multi-step model that obviates the need for the model to get its own output. To this end, we derive an approximate experession for:

Key to our approach is to rewrite in terms of predictions conditioned on action sequences as shown below:

Observe that given the model introduced above, is actually available—we only need to focus on the quantity Pr. Intuitively, we need to compute the probability of taking a sequence of actions of length H in the next H steps starting from s. This probability is clearly determined by the states observed in the next steps, and could be written as follows:

We can compute Prif we have Pr. Continuing for steps:

which we can compute given the first functions of , namely for .

Finally, to compute a rollout, we sample from by sampling from the policy at each step:

where Notice that, in the above rollout with , we have used the first state s as the starting point of every single rollout step. Crucially, we do not feed the intermediate state predictions to the model as input. We hypothesize that this approach can combat the compounding error problem by removing one source of error, namely feeding the model a noisy input, which is otherwise present in the rollout using the one-step model. We illustrate the rollout procedure in Figure 1 for a better juxtaposition of this new rollout procedure and the standard rollout procedure performed using the one-step model. In the rest of the paper, we test this hypothesis in theory and practice.

4 Value-Function Error Bounds

We argued above that a multi-step model can better deal with the compounding-error problem. We now formalize this claim in the context of policy evaluation. Specifically, we show a bound on value-function estimation error of a fixed policy in terms of the error in the agent’s model, while highlighting similar bounds from prior work [Ross et al., 2011; Talvitie, 2017; Asadi et al., 2018b]. All proofs can be found in the Appendix. Note also that in all expectations below actions and states are distributed according to the agent’s fixed policy and its stationary distribution, respectively.

Theorem 1. Define the H-step value function , then

Moving to the multi-step case with , we have the following result:

Theorem 2.

Note that, by leveraging in Theorem 2, we removed the factor from each summand of the bound in Theorem 1. This result suggests that is more conducive to value-function learning as long as h-step generalization error grows slowly with h. In the next section, we show that this property holds under weak assumptions, concluding that the bound from Theorem 2 is an improvement over the bound from Theorem 1.

5 Analysis of Generalization Error

We now study the sample complexity of learning h-step dynamics. To formulate the problem, consider a dataset . Consider a scalar-valued function , where d is the dimension of the state space, and a set of such functions F. Consider a function and a set of functions

For each h we learn a function such that:

Lemma 1. For any function , training error , and probability at least :

The second and third terms of the bound are not functions of h. Therefore, to understand how generalization error grows as a function of h, we look at the dependence of Rademacher complexity of to h.

Lemma 2. For the hypothesis space shown in Figure 7 (see Appendix):

In our experiments, we represent action sequences with h-hot vector so

. We finally get the desired result:

Theorem 3. Define the constant Lip

the constant Lip. Then:

• with the one-step model,

• with the multi-step model,

Note the reduction of factor H in the coefficient of , which is typically larger than .

6 Discussion

As training the multi-step involves training H different functions, the computational complexity of learning is H-times more than the complexity of learning the one-step model. However, the two rollout procedures shown in Figure 1 require equal computation. So in cases where planning is the bottleneck, the overall computational complexity of the two cases should be similar. Specifically, in experiments we observed that the overall running time under the multi-step model was always less than double the running time under the one-step model.

Previous work has identified various ways of improving one-step models: quantifying model uncertainty during planning [Deisenroth and Rasmussen, 2011; Gal et al., 2018], hallucinating training examples [Talvitie, 2014; Azizzadenesheli et al., 2018; Venkatraman et al., 2015; Oh et al., 2015; Bengio et al., 2015], using ensembles [Kurutach et al., 2018; Shyam et al., 2018], or model regularization [Jiang et al., 2015; Asadi et al., 2018b]. These methods are independent of the multi-step model idea; future work will explore the benefits of combining these advances with .

7 Empirical Results

The goal of our experiments is to investigate if the multi-step model can perform better than the one-step model in several model-based scenarios. We specifically set up experiments in background planning and decision-time planning to test this hypothesis. Note also that we provide code for all experiments in our supplementary material.

7.1 Background Planning

As mentioned before, one use case for models is to enable fast value-function estimation. We compare the utility of the one-step model and the multi-step model for this purpose. For this experiment, we used the all-action variant of actor-critic algorithm, in which the value function (or the critic) is used to estimate the policy gradient [Sutton et al., 2000; Allen et al., 2017]. Note that it is standard to learn the value function model-free:

where . Mnih et al. [2016] generalize this objective to a multi-step target . Crucially, in the model-free case, we compute using the single trajectory observed during environmental interaction. However, because the policy is stochastic, is a random variable with some variance. To reduce variance, we can use a learned model to generate an arbitrary number of rollouts (5 in our experiments), compute for each rollout, and average them. We compared the effectiveness of both one-step and the multi-step models in generating useful rollouts for learning. To ensure meaningful comparison, for each algorithm, we perform the same number of value-function updates. We used three standard RL domains from Open AI Gym [Brockman et al., 2016], namely Cart Pole, Acrobot, and Lunar Lander. Results are summarized in Figures 2 and 3.

To better understand the advantage of the multi-step model, we show per-episode transition error under the two models. Figure 8 (see Appendix) clearly indicates that the multi-step model is more accurate for longer horizons. This is consistent with the theoretical results presented earlier. Note that in this experiment we did not use the model for action selection, and simply queried the policy network and sampled an action from the distribution provided by the network given a state input.

Figure 2: A comparison of actor critic equipped with the learned models (Cart Pole, Acrobot, and Lunar Lander). We set the maximum look-ahead horizon H = 8. Results are averaged over 100 runs, and higher is better. The multi-step model consistently matches or exceeds the one-step model.

Figure 3: Area under the curve, which corresponds to average episode return, as a function of the look-ahead horizon h. Results for all three domains (Cart Pole, Acrobot, and Lunar Lander) are averaged over 100 runs. We add two additional baselines, namely the model-free critic, and a model-based critic trained with hallucination Talvitie [2014]; Venkatraman et al. [2015]

7.2 Decision-time Planning

We now use the model for action selection. A common action-selection strategy is to choose , called the model-free strategy, hereafter. Our goal is to compare the utility of model-free strategy with its model-based counterparts. Our desire is to compare the effectiveness of the one-step model with the multi-step model in this scenario.

A key choice in decision-time planning is the strategy used to construct the tree. One approach is to expand the tree for each action in each observed state [Azizzadenesheli et al., 2018]. The main problem with this strategy is that the number of nodes grow exponentially. Alternatively, using a learned action-value function , at each state s we can only expand the most promising action . Clearly, given the same amount of computation, the second strategy can benefit from performing deeper look aheads. The two strategies are illustrated in Figure 4 (left).

Note that because the model is trained from experience, it is still only accurate up to a certain depth. Therefore, when we reach a specified planning horizon, H, we simply use as an estimate of future sum of rewards from the leaf node . While this estimate can be erroneous, we observed that it is necessary to consider, because otherwise the agent will be myopic in the sense that it only looks at the short-term effects of its actions.

The second question is how to determine the best action given the built tree. One possibility is to add all rewards to the value of the leaf node, and go with the action that maximizes this number. As shown in Figure 4 (right), another idea is to use an ensemble where the final value of the action is computed using the mean of the H different estimates along the rollout. This idea is based on the notion that in machine learning averaging many estimates can often lead to a better estimate than the individual ones [Schapire, 2003; Caruana et al., 2004].

Figure 4: Tree construction (left) and action-value estimation (right) strategies.

Figure 5: A comparison between tree expansion and value-estimation strategies when using the one-step model for action selection (left). Comparison between the one-step model and for action selection (right). x-axis denotes the of agent at that episode, and y-axis denotes performance gain over model-free. Performance is defined as episode return averaged over 20 episodes. Note the inverted-U. Initially, and the model are both bad, so model provides little benefit. Towards the end gets better, so using the model is not beneficial. However, we get a clear benefit in the intermediate episodes because the model is faster to learn than

Figure 6: A comparison between the two models in the context of model-based RL. Action selection with the multi-step model can significantly boost sample efficiency of DQN. All models are trained online from the agent’s experience. Results are averaged over 100 runs, and shaded regions denote standard errors.

The two tree expansion strategies, and the two action-value estimation strategies together constitute four possible combinations. To find the most effective combination, we first performed an experiment in the Lunar Lander setting where, given different pretrained functions, we computed the improvement that the model-based policy offers relative to the model-free policy. We trained these function using the DQN algorithm [Mnih et al., 2013] and stored weights every 100 episodes, giving us 20 snapshots of . The models were also trained using the same amount of data that a particular was trained on. We then tested the four strategies (no learning was performed during testing). For each episode, we took the frozen network of that episode, and compared the performance of different policies given and the trained models. In this case, by performance we mean average episode return over 20 episodes.

Results, averaged over 200 runs, are presented in Figure 5 (left), and show an advantage for the ensemble and optimal-action combination (labeled optimal ensemble). Note that, in all four cases, the model used for tree search was the one-step model, and so this served as an experiment to find the best combination under this model. We then performed the same experiment with the multi-step model as shown in Figure 5 (right) using the best combination (i.e. optimal action expansion with ensemble value computation). We clearly see that is more useful in this scenario as well.

We further investigated whether the superiority in terms of action selection can actually accelerate DQN training as well. In this scenario, we ran DQN under different policies, namely model-free, model-based with the one-step model, and model-based with . In all cases, we chose a random action with probability for exploration. See Figure 6, which again shows the benefit of the multi-step model for decision-time planning in model-based RL.

8 Conclusion

We introduced an approach to multi-step model-based RL and provided results that suggest its promise. We introduced the model along with a new rollout procedure that exploits it. The combination was proven useful from a theoretical and practical point of view. Together, our results made a strong case for multi-step model-based RL.

9 Future Work

An important avenue for future work is to better explore methods appropriate to stochastic environments. (See Appendix for an extension in this direction.) Another thread to explore is an ensemble method enabled by the model. Specifically, we can show that an h-step prediction can be estimated in exponentially many ways, and combining these estimates can lead to a better final estimate (see Appendix).

Finally, large pixel-based domains have proven challenging for model-based methods [Machado et al., 2018; Azizzadenesheli et al., 2018], so a future direction is to investigate the effectiveness of the multi-step model in such domains.

References

Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.

Cameron Allen, Kavosh Asadi, Melrose Roderick, Abdel-rahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. arXiv preprint arXiv:1709.00503, 2017.

Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman. Equivalence between Wasserstein and value-aware model-based reinforcement learning. In ICML workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.

Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based reinforcement learning. In ICML, 2018.

Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton. Multi-step reinforcement learning: A unifying algorithm. CoRR, abs/1703.01327, 2017.

Mohammad Gheshlaghi Azar, R´emi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.

Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.

Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.

Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918, 2017.

George EP Box. Science and statistics. Journal of the American Statistical Association, 1976.

Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In ICML, 2004.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, 1977.

Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving pilco with bayesian neural network dynamics models. In ICML, 2018.

Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2179, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

Peter E Hart, Nils J Nilsson, and Bertram Raphael. Correction to a formal basis for the heuristic determination of minimum cost paths. ACM SIGART Bulletin, pages 28–29, 1972.

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

Levente Kocsis and Csaba Szepesv´ari. Bandit based Monte-Carlo planning. In ECML, 2006.

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.

Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, et al. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1), 1993.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.

Doina Precup and Richard S Sutton. Multi-time models for temporally abstract planning. In Advances in neural information processing systems, pages 1050–1056, 1998.

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

St´ephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

GA Rummery and M Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, U. of Cambridge, Dept. of Engineering, 1994.

Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited,, 2016.

Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. Model-based active exploration. CoRR, abs/1810.12162, 2018.

David Silver and Kamil Ciosek. Compositional planning using optimal option models. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1267–1274. Omnipress, 2012.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.

David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR. org, 2017.

Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22(1-3):123–158, 1996.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neurips, 2000.

Richard S. Sutton, Csaba Szepesv´ari, Alborz Geramifard, and Michael H. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, 2008.

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.

Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995. Elsevier, 1995.

Istv´an Szita and Csaba Szepesv´ari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.

Erik Talvitie. Model regularization for stable sample rollouts. In UAI, 2014.

Erik Talvitie. Self-correcting models for model-based reinforcement learning. In AAAI, 2017.

Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In ICML, 2015.

Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.

Ernst Wit, Edwin van den Heuvel, and Jan-Willem Romeijn. ‘All models are wrong...’: An introduction to model uncertainty. Statistica Neerlandica, 66(3), 2012.

Figure 7: Architecture used to represent a mapping from a state to a single dimension of the next state h steps into the future. We assume . We show a bound on the Rademacher complexity of this network.

10 Appendix

10.1 Proofs

Lemma 2. For the hypothesis space shown in Figure 7 (see Appendix):

Proof. First note that:

So we rather focus on the set of scalar-valued functions F. Moreover, to represent the model, we set F to be neural network with bounded weights as characterized in Figure 7.

Now note that Lip(ReLU) = 1, so by Theorem 4.12 in Ledoux and Talagrand [2013]:

Lemma 3. Asadi et al. [2018a] Define the H-step value function:

and assume a Lipschitz reward function R with constant . Then:

Theorem 1. Define the H-step value function , then

We reach the desired result by expanding the first term for more times.

Theorem 2.

Proof.

Lemma 1. For any function , training error , and probability at least :

Proof. We heavily make use of techniques provided by Bartlett and Mendelson [2002]. First note that, due to the definition of sup, we clearly have:

We define to be the sup in the right hand side of the above bound:

We can bound

Next note that the function satisfies:

So using MacDiarmid’s inequality:

Combining the two previous results, we can write:

Finally, using (1) and (2) we can conclude the proof.

Theorem 3. Define the constant Lip

the constant Lip. Then:

• with the one-step model,

• with the multi-step model,

Proof. Starting from the one-step case, from Theorem 1 we have:

Moving on to the multi-step case, from Theorem 2 we get:

Figure 8: h-step model error (down is good) on the Lunar Lander domain under the one-step (left) and the multi-step (right) model. For each new episode, before using the episode for training, we compute h-step predictions given all observed states and executed action sequences, and compare the predictions with the states observed h steps later in the episode. The multi-step model outperforms the one-step model.

10.2 Future Work: An Stochastic Extension

One limitation of the model so far is that it is deterministic. Here, we introduce a stochastic extension to remove this limitation. Our model is based on the one-step EM model introduced by Asadi et

Figure 9: A starting state (a), and predicted (blue) and true (red) next state (b) given aof a circle denotes the probability of the circle ending up in the state. The EM model accurately captures the two-step dynamics.

al. [2018b]. In this case, for each value of , rather than learning a single function , we train M functions each denoted , to capture different modes of the h-step transition dynamic. The multi-step transition model, , is then defined as

plus h probability distributions each over M functions . Each function is parameterized by a neural network . For a single h, we train these M functions using an Expectation-Maximization (EM) algorithm Dempster et al. [1977]. For more details on the EM algorithm see Asadi et al. [2018b], but, for completeness, we provide the M-step and the E-step of the algorithm. The M-step at iteration t solves:

The E-step updates the posterior q:

giving rise to a new optimization problem to solve in the M-step of iteration t + 1. Finally, our implementation uses:

We apply this algorithm in a variant of the gridworld (mini-Pacman) domain Moerland et al. [2017]. Details of the domain are in supplementary material, but at a high level, the goal is for the agent to start in the bottom-left and get to the top-right corner of the grid, while avoiding a randomly moving ghost. We first show, in Figure 9, that the agent can learn a 2-step model of the environment almost perfectly. We then use the model for action selection in Q-learning Rummery and Niranjan [1994], where instead of a greedy policy with respect to the Q-function, we build and search a tree using our learned transition model. In each case, the model was learned using a dataset from 100 episodes of a random policy interacting in the domain. Having searched the tree up to H = 2, we then select the action with highest utility. In Figure 10, we see a clear advantage for the EM model relative to the baselines.

10.3 Future Work: An Ensemble Extension

A promising idea for future work is a new ensemble technique articulated below. Suppose our goal is to predict the state after following ain . Two obvious paths are:

Figure 10: An evaluation of effectiveness of different transition models for tree search on the gridworld. The deterministic model fails to provide useful predictions as the average prediction is for the ghost to not move at all.

Figure 11: 8-step model accuracy. Accuracy of ensembles increases with more sampled paths, and outperforms the accuracy of as well as . Note that, due to the nature of this task, as the learner gets better at solving the task, it experiences more diverse states, hence the decrease in accuracy in later episodes.

In fact, the number of paths grows exponentially with H: Let C(H) be the number of paths. We have the following distinct paths: First use , then do the remaining steps in any arbitrary path. There are such paths, So:

We can thus use a subset of these paths and average the results together as is common with other ensemble methods Caruana et al. [2004]. Below, we report a preliminary experiment evaluating this idea.

We ran the actor-critic algorithm in the Acrobot domain. We trained the multi-step model with maximum horizon H = 8. For each episode, we computed the 8-step transition error as a function of the number of paths used to compute the prediction. Note that we sampled a path uniformly at random from possible paths. Results are presented in Figure 11, and show that predictions get more accurate as we average over more paths.

Note that prior works on ensemble methods in model-based RL exist Moore and Atkeson [1993]; Kurutach et al. [2018], but only consider combinations of one-step models. Here, we only scratched the surface of this idea, and we leave further exploration for future work.

designed for accessibility and to further open science