Importance of using appropriate baselines for evaluation of data-efficiency in deep reinforcement learning for Atari

2020·arXiv

ABSTRACT

1 INTRODUCTION

Producing fully independent agents that learn optimal behavior and develop over time purely by trial and error interaction with the surrounding environment is one of the prominent dilemmas in the field of artificial intelligence. A mathematical framework that encapsulates the problem of these autonomous systems is reinforcement learning. Over the past few years, exceptional progress has been made in devising artificial agents that can learn and solve problems in a variety of domains using deep RL methods (Mnih et al., 2015; Schulman et al., 2015; Silver et al., 2016). However, these algorithms are perceived as extremely data inefficient. They are thought to require an immense amount of non-optimal interaction with the real environment before they begin to operate acceptably well (Irpan, 2018).

One of the most popular benchmarks for assessing overall performance and data complexity of deep RL algorithms is Atari Learning Environment (Bellemare et al., 2013; Machado et al., 2018). The state-of-the-art approaches, at least in the way they were presented so far, need millions of frames to learn how to play these games acceptably well (Schulman et al., 2017; Hessel et al., 2018). It corresponds to days of play experience using the standard frame rate. However, human players can achieve the same within minutes (Tsividis et al., 2017).

A lot of work has been produced to circumvent these shortcomings. Most studies focused on the visually-rich Atari domain employ model-based strategies inspired by the classical Dyna approach (Sutton, 1991) and action-conditional prediction methods (Oh et al., 2015; Leibfried et al., 2016). Although some of them manage to drastically reduce the amount of data required by the standard algorithms, they do so by highly increasing both conceptual and computational complexity of the models.

In this paper, we argue, and experimentally prove, that already existing techniques can be much more data-efficient than it is assumed. We introduce a simple change to the DQN-based algorithms. In some environments like Pong or Hero, it can achieve the same results given only 5% - 10% of the data it is often presented to need. Furthermore, it results in the same data-efficiency as the recent advancements in the field while often being much more stable, simpler, and requiring much less computation.

Following the introduction, section 2 gives a brief background behind reinforcement learning with the focus on Q-learning and its deep learning equivalents. Section 3 provides an overview of recent studies aimed at improving data efficiency. Section 4 argues that standard DQN-like algorithms can be much more efficient than it tends to be presented and that recently proposed techniques only give an illusion of efficiency. Then, the description and analysis of experiments follow in sections 5 and 6. Finally, section 7 concludes this study.

2 BACKGROUND

Reinforcement learning is a problem of learning a policy that maximises the reward signal for a given task. To define RL setting we need a set of possible environment states S, a set of available actions A, and relations between those. These relations are described by a transition function T : that defines dynamics of transitions from one state to another, and a reward function that defines the real-valued reward signal. Together, T and R constitute the model of the environment. The goal of reinforcement learning is to find a policy that maximises the total cumulative reward over time. One of the most popular reinforcement learning algorithms is Q-learning (Watkins & Dayan, 1992). Q-learning decides on an optimal policy based on the state-action value function that maps state and action performed in that state to the expected total cumulative reward following the action. The algorithm chooses an action that maximizes Q, i.e. argmaxis learned in the process of interacting with the environment. At every agent’s step, tuple is obtained and immediately used to update the Q function. Because state-action combination is often too big or continuous to represent directly in a tabular manner, Q is commonly approximated using different supervised learning algorithms. However, using deep learning to approximate Q is not trivial because Q-learning breaks important assumptions required by neural networks. Namely, Q update is recursive and experience tuples are highly correlated when used sequentially.

Recently introduced DQN (Mnih et al., 2015) bypassed this issue by introducing two concepts: target network and replay buffer. Target network is simply a fixed snapshot of the network that approximates Q value (online network) taken every steps. Instead of updating the online network towards itself, it is updated towards the target network. This approach maintains the logic of Qlearning while stopping the online network from diverging due to recursive updates. Replay buffer, on the other hand, guarantees a much higher level of independence between experience tuples. They are not used immediately, one after another anymore but stored in the replay buffer instead. Then, every steps, a single training step is performed, i.e. a mini-batch of randomly sampled experience from the replay buffer is used to update the online network. It reduces the correlation between experience samples by breaking their ordering.

Rainbow DQN (Hessel et al., 2018) is a combination of several incremental improvements on top of DQN that increased both sample efficiency and the total performance of the algorithm achieving state-of-the-art results. It is an architecture that we use as an example that current model-free deep RL is not as inefficient as it is often stated. Throughout the paper hyperparameters from Hessel et al. (2018) are employed, unless stated otherwise.

3 ADVANCEMENTS IN DATA EFFICIENCY OF REINFORCEMENT LEARNING

The most promising approach to improving data efficiency of deep RL is based on the premise of model-based techniques (Sutton & Barto, 2018). Having access to transition and reward mechanics of the environment would make it possible to construct an artificial simulation where the agent could be trained without performing often costly interactions with the real environment. However, in most scenarios, the agent is not given any prior information about the model of its environment. This issue is often overcome by learning the model instead. Oh et al. (2015) and Leibfried et al. (2016) have shown that it is possible with a very high level of accuracy.

Ability to learn the model of the environment was subsequently leveraged to successfully improve different aspects of deep RL (Racani`ere et al., 2017; Oh et al., 2017; Buesing et al., 2018; Ha & Schmidhuber, 2018). Azizzadenesheli et al. (2018), Holland et al. (2018), and Kaiser et al. (2019), however, focused directly on employing the learned models to increase data efficiency of deep RL algorithms.

Azizzadenesheli et al. (2018) proposed Generative Adversarial Tree Search (GATS). Unlike in the standard approach to learning the environment dynamics, GATS creates two separate models: Generative Dynamics Model (GDM) based on a modified Pix2Pix (Isola et al., 2017) to learn the transition model ; and Reward Predictor (RP), a simple 3-class classification architecture to learn the reward model . Both models learn from experience stored in DQN’s replay buffer and are then used for bounded Monte Carlo tree search as in (Silver et al., 2016). GATS is evaluated primarily on the game Pong where it learns an optimal policy using around 42% of the data required by using standalone model-free agent what is a tiny improvement compared to the methods described next.

Holland et al. (2018) explored the performance of the model-based approach given either perfect model, model pretrained on expert data (pretrained model), or model learned alongside the agent’s value function (online model). Both non-perfect models followed standard architecture for the task (Oh et al., 2015; Leibfried et al., 2016). These models are then used to generate 100 samples of simulated experience for every interaction with the real environment. All three variations outperformed state-of-the-art Rainbow DQN in terms of data efficiency on 5 out of 6 games. Nevertheless, only the results of the online model are used for further discussion to ensure a fair comparison between the algorithms.

Kaiser et al. (2019) introduced Simulated Policy Learning (SimPLe). Similarly to the previous two architectures, it learns the model of the environment using a modified version of Oh et al. (2015). It differs from previous approaches by employing PPO (Schulman et al., 2017) as its RL agent and by using the learned model much more exhaustively. It uses the model similarly to Holland et al. (2018), however it provides at least 800k samples of artificial data after every 6.4k interactions. The approach is then evaluated on a range of 26 different Atari games. It provides results that highly outperform both Holland et al. (2018) and Azizzadenesheli et al. (2018) in terms of data efficiency achieving at least 2x improvement on over half of the games and more than 10x improvement on Freeway. To the best of our knowledge, SimPLe is the state of the art in terms of model-based data-efficient deep reinforcement learning; thus, it will be used as a primary model-based baseline throughout the rest of the paper.

However, model-based methods are not the only approaches for improving data efficiency of RL. Recently proposed Episodic Backward Update (EBU) (Lee et al., 2019) showed that classical version of the DQN algorithm can be incrementally improved by using full episodes in the algorithm’s replay mechanism and propagating rewards from end of the episode to its start, instead of sampling every step independently and uniformly at random.

4 DATA EFFICIENCY OF STANDARD APPROACHES

We argue that classical DQN-like methods are not as data inefficient as they are often portrayed. They are simply used in a very inefficient way. Let us define ratio r describing the number of training steps to the number of interactions with the environment. In the default setting . It means that the algorithm performs a single update of the network for every 4 interactions with the environment, i.e., r = 1/4.

As explained in section 3, both the online-model-based algorithm from Holland et al. (2018) and SimPLe from Kaiser et al. (2019) first learn the approximated model of the environment. Then, this approximation is used to provide simulated samples of experience alongside the real data. Nevertheless, these samples, in the best case, can only provide as much real signal to the agent as was provided in the original data. However, as a byproduct of the agent’s interactions with the learned model, the ratio r significantly increases. Holland et al. (2018) performs 100 simulated steps for each real step causing . SimPLe executes 800k simulated steps after every 6.4k interactions with the real environment. Thus, if SimPLe was using DQN as its model-free component ratio r would be even higher (r = (800k + 6.4k)/6.4k/4 = 126/4 = 31.5).

The issue of inflated r does not only affect model-based methods. Algorithm 2 from Lee et al. (2019) shows that model-free EBU algorithm performs it training step for every interaction with the environment. By itself it would increase defined above ratio to r = 1, however, it is not the end of the story. Single EBU update actually looks T separate transitions where T is the length of sampled episode so in the end we have r = T >= 1.

It seems unfair to allow novel methods to perform more training steps for each gathered data point without letting baselines to do the same. However, from the studies discussed above, only Holland et al. (2018) performed tests allowing DQN for extra updates1. GATS and EBU were compared solely to the standard version of DQN and SimPLe to the standard version of PPO algorithm together with the Rainbow DQN that, as stated in the paper, was hypertuned for sample efficiency (HRainbow). However, hyperparameters for HRainbow were not disclosed. We hypothesize, that the main reason behind improved data efficiency in the results is essentially increased r.

5 EXPERIMENTAL SETUP

To test the above-mentioned hypothesis, we train both standard DQN as described in Mnih et al. (2015) and Rainbow DQN agent, as described in Hessel et al. (2018). Both with only a few small differences to increase ratio r. These slight modifications of the algorithms are referred to as Overtrained DQN (OTDQN) and Overtrained Rainbow (OTRainbow) throughout the rest of the paper. OTDQN is used for a fair evaluation of EBU, as EBU is just an incremental improvement over the pure DQN. Whereas we use OTRainbow for direct comparison with SimPLe because SimPLe claims state of the art results.

To create both OTDQN and OTRainbow we decrease period between updates as much as possible so (thus r = 1). To ensure fainess with respect to EBU, we stick to this r in case of OTDQN (EBU’s r >= 1). However, for OTRainbow, we need to increase r even further. Because it is impossible to do so using existing hyperparameters, we introduce a new parameter k that specifies how many network updates should be performed every steps (similarly to DQN Extra Updates from Holland et al. (2018)). We find that k = 8 for Rainbow produces the best results (hence Rainbow’s r = 8). We also decrease the epsilon decay period to 250K steps and 50K steps, target network update period 5000 and 500, and minimum replay history to 25K and 20K for OTDQN and OTRainbow respectively to make it compatible with low data settings.

Existing code from the Dopamine framework (Castro et al., 2018) was modified, as explained above, to obtain overtrained algorithms. Dopamine was used for two reasons: (i) it allows for quick and easy prototyping of new RL algorithms; (ii) to ensure the same implementation for each version of the DQN (whether it is OTRainbow, HRainbow, or OTDQN). Both algorithms were then evaluated on different subsets of Atari games from the Atari Learning Environment. OTRainbow used the same 26 games as Kaiser et al. (2019), whereas OTDQN used a random subset of 25 games from the whole pool of 49 games used by EBU. We compare the outcomes to multiple different baselines: an agent that always chooses action uniformly at random (Random) and human score as reported by Mnih et al. (2015) (Human). Additionally, OTRainbow is compared directly with scores of SimPLe and HRainbow, as reported by Kaiser et al. (2019). OTDQN, on the other hand, is analyzed against DQN with the hyperparameters from Mnih et al. (2015) (SDQN) and EBU scores reported in Lee et al. (2019).

To further ensure fairness of comparison, we choose to evaluate SimPLe and EBU in the same data settings they used in the original papers. The efficiency of OTRainbow and SimPLe is tested based on a mean score in the low data regime of 100k interactions with the real environment, while OTDQN and EBU use 2.5M interactions (10M frames). On top of that, we compare the overall performance of all models depending on the amount of available data using human normalized performance. I.e., we normalize agent scores on each game such that 0% is the performance of the random agent, and 100% corresponds to human score.

6 ANALYSIS

6.1 SIMPLE

Figure 1: Comparison of SimPLe with OTRainbow. Bars represent the number of interactions required by OTRainbow to reach the same score as SimPLe achieves using exactly 100k interactions. Notice logarithmic scale on X-axis.

Overall, OTRainbow and SimPLe prove to be the best models evaluated in the 100k-interactions-only setting, without the clear winner between the two. Numerical results for this setting are shown in Table 1. Moreover Figure 1 compares OTRainbow and SimPLe directly, using graphical convention similar to Kaiser et al. (2019). However, in this study, we use a logarithmic scale to denote the number of data samples needed to reach SimPLe’s score. Doing so ensures that whether OTRainbow requires n times more experience or n times less, the absolute visual deviation from the SimPLe baseline is the same. Also, results are clipped to the absolute maximum deviation of 5x (i.e., 20k - 500k) as OTRainbow was evaluated on a maximum of 500k interactions due to computational constraints.

We can see that both OTRainbow and SimPLe outperform Random on all 26 games, surprisingly HRainbow did not manage to do the same. However, HRainbow falls behind Random only when playing Kangaroo. OTRainbow produces better scores than HRainbow on all games proving the weakness of HRainbow as a baseline. Interestingly, SimPLe’s manages to beat HRainbow only on 20 out of 26 games. In terms of direct comparison between OTRainbow and SimPLe, they perform very evenly. OTRainbow outperforms SimPLe on exactly half of the games but is dominated by SimPLe on the remaining half. What is fascinating, however, is that the original paper behind SimPLe reported that efficiency on Freeway benefits most from the model-based approach, with SimPLe being 10x more efficient than HRainbow. This result is improved even further by OTRainbow as it managed to score over 8 points higher. It again shows that the improvement was rather an effect of an increased number of network updates than the model-based approach. We also calculate the human normalized performance for each algorithm. Full numerical results of these calculations can be seen in Table 4 in Appendix A. The mean human performance However, the median human performance of OTRainbow beats SimPLe by over 10pp. These results show that even the state-of-the-art model-based approach, highly tuned for achieving the best scores given a small number of interactions with the real environment, is not significantly more data-efficient than slightly modified existing techniques.

Table 1: Mean scores produced by each approach in the low-data regime. Scores for OTRainbow, SimPLe, HRainbow, and Rainbow are obtained after 100k interactions with the real environment. Values in bold represent the top model for the game (ignores Human).

When comparing SimPLe to the variations of Rainbow DQN with respect to computational complexity, SimPLe is orders of magnitude more expensive. As shown in section 4, using SimPLe increases ratio r 126 times, while the most computationally demanding variation of Rainbow - OTRainbow -increases r 32 times. Thus, when taking into account only the reinforcement learning part, SimPLe already requires almost four times more network updates. On top of that, however, SimPLe has to perform expensive training of the world model. As reported by Kaiser et al. (2019), a full version of SimPLe takes more than three weeks on 100k data points to complete the training. Using the same amount of data, OTRainbow can finish within the first 24 hours2.

6.2 EBU

Unlike SimPLe, EBU was evaluated in the data regime of 10M frames (2.5M steps). Table 2 contains numerical results for all tested algorithms, namely EBU, OTDQN, and SDQN. Because this setting allows for much more interactions than previously described low-data regime, none of the algorithms has a problem with surpassing the random agent. All of them have enough time to finish full exploration period and start converging to optimal policies.

Similarly to SimPLe and OTRainbow, EBU and OTDQN compare evenly, with EBU having only a slight advantage. Lee et al. (2019) reported that EBU outperforms SDQN on 20 out of 25 games that are used in our experiments clearly being superior. However, when compared to OTDQN, it is better only on 14 games. What is more, it was stated in the original paper that although EBU does not surpass SDQN in all of the games, large improvements in Atlantis, Breakout, or VideoPinball offset

Figure 2: Graph represent relative score of OTDQN compared to EBU in percents. Both are trained for 10M frames.

these shortcomings. OTDQN highly outperforms EBU in both Atlantis and VideoPinball, showing that these scores were merely caused by an increased ratio r. It can be easily noticed in Figure 2 that visualizes relative performance of both algorithms. What is important to note, OTDQN still performs much less actual updates than EBU. Number of updates for both of the algorithms would be the same if and only if average episode length was 1. In Atari setting, it’s at least order of magnitude larger than that.

Table 2: Mean scores produced by each approach. Scores for OTDQN, SDQN, and EBU, HRain- bow, and Rainbow are obtained after 100k interactions with the real environment. Values in bold represent the top model for the game (ignores Human).

7 CONCLUSION

We presented an intuition why the previous research did not use fair baselines when comparing new advancements with currently existing methods. We suggested the way of using popular version of the DQN algorithm, namely OTDQN and OTRainbow, that leverage DQN’s actual capabilities in terms of data efficiency. We experimentally proved that most of the recent advancement in model-free and model-based approaches show improved performance only due to an increase in ratio of the number of training updates to the number of interactions with the environment and not due to superiority of complicated and often extremely computationally expensive techniques that were proposed. In particular, it shows that the recent work in sample efficient deep reinforcement learning does not produce significant improvements over the existing methods upholding the position of model-free algorithms as the state of the art, both in terms of data efficiency and total performance, at least in visually-rich Atari domain. Through these results, we aim to underline the importance of using appropriate model-free baselines, such as OTRainbow, in the future research that tries to improve data efficiency of deep RL approaches.

REFERENCES

Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C Lipton, and Animashree Anandkumar. Sample-efficient deep rl with generative adversarial tree search. arXiv preprint arXiv:1806.05780, 2018.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ- ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253–279, 2013.

Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. arXiv preprint arXiv:1811.06272, 2018.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. 2018. URL http: //arxiv.org/abs/1812.06110.

David Ha and J¨urgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

G Zacharias Holland, Erin J Talvitie, and Michael Bowling. The effect of planning shape on dyna- style planning in high-dimensional state spaces. arXiv preprint arXiv:1806.01825, 2018.

Alex Irpan. Deep reinforcement learning doesn’t work yet. https://www.alexirpan.com/ 2018/02/14/rl-hard.html, 2018.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374, 2019.

Su Young Lee, Choi Sungik, and Sae-Young Chung. Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems 32, pp. 2112–2121. 2019.

Felix Leibfried, Nate Kushman, and Katja Hofmann. A deep learning approach for joint video frame and reward prediction in atari games. arXiv preprint arXiv:1611.07078, 2016.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle- mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871, 2015.

Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 6118–6128. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7192-value-prediction-network.pdf.

S´ebastien Racani`ere, Th´eophane Weber, David Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdom`enech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pp. 5690–5701, 2017.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Pedro A Tsividis, Thomas Pouncy, Jaqueline L Xu, Joshua B Tenenbaum, and Samuel J Gershman. Human learning in atari. In 2017 AAAI Spring Symposium Series, 2017.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.

A COMPLETE NUMERICAL RESULTS

Table 3: Mean raw scores for each approach. Value in brackets after the name of the method indicates the number of training interactions performed before the evaluation.

Table 4: Mean human normalized score for each approach. Value in brackets after the name of the method indicates the number of training interactions performed before the evaluation.

Designed for Accessibility and to further Open Science