Generative Adversarial Imagination for Sample Efficient Deep Reinforcement Learning

2019·arXiv

Abstract

Acknowledgments

I would like to express my great gratitude to Dr Per Kristian Lehre for his invaluable guidance over the past year. His out-of-the-box suggestion during the planning phase of the project allowed me to put together different ideas that finally led to the work presented in this thesis.

In addition, I would like to thank Dr Lukasz Kaiser from Google Brain for finding the time to answer my questions regarding his recent findings.

1 Introduction

One of the most prominent dilemmas in the field of artificial intelligence is to produce fully independent agents that learn optimal behaviour and develop over time purely by trial and error interaction with the surrounding environment. A mathematical framework that encapsulates the problem of these autonomous systems is reinforcement learning (RL) (Sutton & Barto 1998). Although over the past few years exceptional progress has been made in devising artificial agents that can learn and solve problems in a variety of domains using RL approaches (Arulku- maran et al. 2017), these techniques are still not ideal. They require an immense amount of non-optimal interaction with the real environment before they begin to operate acceptably well and they do not efficiently adapt to new tasks, even within the identical environmental setting (Irpan 2018).

So far RL researchers were concentrating on mastering games like backgammon, chess, go, or various video games. In these settings dynamics of the environment are either entirely known and thus can be simulated (rules of the board games), or they can be queried and reset infinitely many times without any additional costs (video games). It allowed producing infinite amounts of data for the agent to learn. However, these kinds of conditions are rare in the real world. Dynamics of the environment are usually unknown and are too involved to approximate using rule-based methods. Often, we also cannot let the agent do millions of arbitrary trial-and-error live experiments freely.

It is simple to imagine the use of reinforcement learning agent to optimise user experience on the website. Every bad decision in the real environment may result in a loss of an unsatisfied customer. Millions of such choices, before the agent converges to the optimal policy is too big of a risk for any company. Another example can be an autonomous car accustomed to driving in a specific country that suddenly finds itself in another country with a completely different driving culture. A human could quickly adapt to the new reality, but a current state-of-the-art reinforcement learning agent would require enormous amounts of experience first, highly increasing a probability of a severe accident.

Model-based reinforcement learning algorithms promise to solve this obstacle by using known dynamics of the environment to analyse probable scenarios. The agent can imagine various circumstances, and learn or reason based on them, without actually executing expensive trial-and-error exploration in the real world. Use of internal forecasts of the world for decision making and reasoning was deeply examined within the neuroscience community (Tolman 1948, Hassabis et al. 2007, Schacter et al. 2012). It has been demonstrated to exist within the learning process of humans and several animals (Pfeiffer & Foster 2013, Leinweber et al. 2017). Again, it is manageable from the RL perspective when we fully understand the model of the environment as exhibited by the AlphaGo (Silver et al. 2017). However, as discussed earlier, we do not always know the exact specifics of the environment and, frequently, we have no prior knowledge regarding its dynamics at all.

One plausible solution to that difficulty could be learning the model of the environment instead. Some work has been done already on the subject. Most prominently, Oh et al. (2015) and Leibfried et al. (2016) showed that the dynamics of the environment can be modelled with very high accuracy. Nonetheless, although learned ’imaginative’ model helped to improve outcomes in environments that require long-term planning, it did not significantly reduce the size of the system exposure required for training a well-performing agent (Racani`ere et al. 2017). This brings about the following questions:

• Can learning the imaginative model of the environment be more data efficient than learning an optimal policy?

• If so, can the learned imagination fulfil the promise of sample efficient model-based RL in settings where dynamics of the real environment are unknown?

This study subsequently answers both of the questions. It combines a few recent and a few less recent ideas from the field to do so. It hypothesises that recent advancement in the generative adversarial networks architecture (Goodfellow et al. 2014), and inherent to the RL setting Markov property can provide a positive answer to the first question. Furthermore, it considers that the potential use of imagination within the structure similar to the Dyna-Q algorithm (Sutton 1990) may indeed profoundly improve sample efficiency of RL in unknown environments.

Following from the introduction, this thesis is structured as follows: Section 2 provides scientific background and basic theoretical fundamentals necessary for full understanding of the conducted research. Section 3 gives an overview of already existing studies that are relevant to this topic. Section 4 introduces novel Generative Adversarial Imaginative Reinforcement Learning (GAIRL) algorithm developed to test the above-stated hypothesis. Section 5 then goes into details of methods that were used to efficiently create accurate imagination, i.e. learn the model of the environment. Section 6 describes an experimental setting used to evaluate newly proposed GAIRL algorithm and compare it to the current state-of-the-art. Section 7 presents the qualitative results that are further discussed in

section 8. Finally, section 9 summarises the work carried out.

2 Background

2.1 Reinforcement learning

Reinforcement learning is the problem of learning an optimal policy (i.e. behaviour) for a given environment (Sutton & Barto 1998). RL is formalised by Markov decision processes (MDPs). An MDP can be formulated as a tuple (is a (discrete or continuous) set of possible states, A is a (discrete or continuous) set of allowed actions, T is a transition probabilities function, R is a (deterministic or stochastic) reward function, and (0; 1) is a discount factor for future rewards (controls agent’s time preference). At any given time t, the reinforcement learning agent observes an environment state and selects an action . Then, the reward is returned from the environment, and the environment moves to the state with transition probability fully describe the dynamics of the environment.

Figure 1: Reinforcement learning setting

A critical characteristic of MDPs is that it follows the Markov assumption. Namely, at any point in time t, history of encountered states

simplified to the last state without any loss in information, i.e:.

It is an assumption required by RL algorithms. Unfortunately, this property is not always observable in the real world. We can often circumvent its absence by clever preprocessing of the state space S or by translating this lack of information to additional stochasticity on top of the transition and reward functions. However, even with these countermeasures, lack of Markov assumption may limit the range of optimal strategies that our agent can learn.

The agent interacts with an MDP by selecting actions accordingly to the policy that maps states to the probability distribution over A. The goal of obtaining an optimal policy can be formulated as learning a policy that maximises the state value function (Bellman equation):

From which we can also define state-action value function

There is always at least one optimal policy and it satisfies/entails the following:

The setting could potentially be directly solved employing simple linear algebra: ; however, it works exclusively for finite-state MDPs and its time complexity is ) (Coppersmith & Winograd 1990). It is computationally infeasible for large and complex environments that are encountered in a vast majority of conditions. Therefore, it is rarely used, and most of the focus is given to algorithms that leverage sampling of the environment to approximate the optimal solution.

There are three different types of RL algorithms:

• Value-based - explicitly learn state-action value function (3) of the environment and use it to infer optimal policy Q-learning (Watkins & Dayan 1992).

• Policy-based - do not learn state value function explicitly but instead directly learn policy mapping . E.g. REINFORCE (Williams 1992).

• Actor-critic - combines learning policy (actor) and value function (critic). The actor makes choices about actions but it is updated by the feedback from the critic who then directly interacts with the environment. E.g. natural actor-critic (Peters et al. 2005).

Value-based methods are usually the simplest and the fastest. The downside is that they do not work in continuous action spaces or when we want to learn stochastic policy. Policy-based methods have better convergence qualities and can handle more complex policies and action spaces but are usually inefficient and converge to local minima. Actor-critic methods try to connect the advantages of both.

These traditional algorithms allowed for a big advancement in computer science creating artificial agents that have reached human-level performance in multiple games (e.g. backgammon (Tesauro 1995)) without any prior knowledge. They work flawlessly for smaller state and action spaces where state value function (2) and/or state-action value function (3) can be expressed by the look-up table. A tabular representation of these functions, however, is often not viable due to too big or continuous domains.

2.2 Deep reinforcement learning

To allow RL to scale to more involved situations, one can propose using function approximators (such as linear regression, neural networks, or Bayesian methods) to approximate state-action value function . Unfortunately, recursive updates of the functions (2) (3) and lack of independent and identically distributed variables (future states/rewards highly depends on current state and action) in the RL setting break most of the assumptions expected by standard machine learning methods. Therefore, trials of plugging non-linear function approximators into the existing algorithms in the place of a tabular representation were resulting in complete divergence and failure.

Because it limited scalability of RL to many of the real-world problems, considerable amount of work has been done to solve this obstacle. Recently, Mnih et al. (2015) introduced Deep Q-Network (DQN), an RL algorithm that for the first time was capable of leveraging black-box properties of deep learning. They bypassed the dilemma of recursive updates and lack of i.i.d. by proposing two adjustments to the standard Q-learning algorithm (Watkins & Dayan 1992).

Replay buffer - rather than instantly learning from sampled data that is highly correlated, algorithm stores N most recent tuples () in a replay buffer D. When updating the value function Q, it uses a random mini-batch of samples from D to estimate gradients. It reduces the correlation between samples by breaking their ordering.

Target network - instead of updating Q function directly with itself, the algorithm maintains two distinct networks: the online network Q and the target network is simply a fixed snapshot of Q taken every C updates. The agent determines actions in a conventional manner accordingly to the network Q, but Q is updated using a revised Bellman equation:

It stabilises the whole learning process and avoids exploding gradients by partially eliminating recursiveness in network updates.

DQN achieved super-human level performance on Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare, Naddaf, Veness & Bowling 2013) with raw pixels as input alone. This contribution opened an enormous amount of opportunities for RL. A lot of new algorithms and DQN improvements followed.

Schaul et al. (2015) enhanced DQN slightly increasing its data efficiency by applying prioritised experience replay to more frequently replay more informative samples. Van Hasselt et al. (2016) devised Double DQN that extends DQN to the double Q-learning method (Van Hasselt 2010) that addresses an overestimation bias of the standard Q-learning. Wang et al. (2015) introduced a new duelling network architecture specifically for the value-based RL that outperformed conventional supervised machine learning architectures. Bellemare et al. (2017) produced DQN-based algorithm that, alternatively to learning scalar state-action value function Q, learns a categorical distribution of the future returns Z whose expectation is Q proving both theoretically and experimentally that this procedure improves the original DQN algorithm. Hessel et al. (2017) then consolidated all improvements enumerated in this paragraph (and a few smaller ones) into an algorithm called Rainbow DQN. It is the current RL state-of-the-art for discrete action spaces.

However, DQN, as a deep variant of Q-learning (value-based method), does not work in case of continuous actions space. Hence, it was soon merged with deterministic policy gradient methods (Silver et al. 2014) to create the Deep Deterministic Policy Gradient (DDPG) actor-critic algorithm (Lillicrap et al. 2015) solving this issue. Recently, following the success of the Rainbow DQN, multiple incremental refinements suitable to the DDPG were consolidated forming Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm (Barth-Maron et al. 2018). D4PG is the current state-of-the-art algorithm for complicated, continuous action space settings.

Although, the introduction of deep neural networks for the first time enabled RL algorithms to solve extremely complex problems and surpass humans at many levels, all of them suffer from tremendous data inefficiency. Rainbow DQN requires around 15 million frames of interaction with the real environment to match median human performance. It corresponds to over 8 days of constant play at the regular rate of 20 frames per second. It requires a total of full 200M frames to reach its peak performance (over three full months). In contrast, median human performance is defined as a score achieved by a person after merely 15 minutes of training beforehand.

As explained in the introduction, this is not an enormous problem when training agent to master board or video games as computational power is relatively cheap nowadays. Three months of 20 frames per second can be shortened to 1 hour of 000 frames per second given powerful enough infrastructure. However, it makes deep RL inapplicable to any other problem where obtaining samples of experience comes with potential additional costs like losing unsatisfied customers or causing an accident.

2.3 Generative adversarial networks

Recently, Goodfellow et al. (2014) introduced generative adversarial networks (GANs). They became a successful and widespread tool for data generation, profoundly over-performing previously used methods. Contrary to standard approaches that primarily focused on minimising L1 (8) or L2 (9) loss between a generated output and a real output on the individual level, GANs make it possible to work on the data distribution level minimising a difference between a generated

data distribution and a real data distribution instead.

GANs work by defining two separate networks. The generator maps a noise vector coming from the noise distribution onto the data space X, and the discriminator 1] that maps an input from the data space X to the probability of the input coming from the real data distribution . Both networks play a two-player minimax game with the objective:

As theoretically proven in the original study, this minimax game is equivalent to minimising the Jensen-Shanon (JS) divergence between

and thus, modelling to as closely resemble as possible.

Unfortunately, GANs are also known for their lack of stability in training, often causing situations where one of the networks starts to completely overwhelm the other. This results in diminishing gradients for both of the networks. They also tend to ignore certain spectrums of the distribution (mode collapse problem). Therefore, numerous researchers tried to stabilise the original GAN algorithm. One of the most successful improvements was replacing JS divergence with the EarthMover distance (1st Wasserstein metric) creating the Wasserstein GAN (Arjovsky et al. 2017):

Now, the discriminator (in the paper called critic) is no longer trying to predict if the data comes from the real or fake distribution. It is rather providing an actual real-valued distance (as measured by the Earth-Mover metric) between the data generated by the generator and the data coming from the real distribution. Hence, the goal is not longer to balance both networks but to make sure that the critic can converge to the real distance before letting the generator to improve. It has been mathematically proven in the study that the Wasserstein GAN always converges given that the critic is infinitely more powerful than the generator.

The issue with the Wasserstein GAN is that it constructs its minmax value function using the Kantorovich-Rubinstein duality (Villani 2008). Therefore, to be theoretically and practically sound, the critic needs to represent values coming from the set of 1-Lipschitz functions:

The original Wasserstein GAN enforced this limitation by clipping the weights of the critic network to space []. Setting the c hyperparameter is, however, a non-trivial task that introduced new instability problems.

Fortunately, Gulrajani et al. (2017) circumvented this concern by appending a gradient penalty to the final minimax objective instead of clipping weights of the critic:

where is a gradient penalty factor hyperparameter. and

Unfortunately, the choice between the original GAN and the Wasserstein GAN with gradient penalty (WGANGP) is not that trivial. Lucic et al. (2018) showed that WGANGP is not necessarily outperforming standard GAN given suitable hyperparameters configuration. WGANGP also takes much longer to train because the critic, for every step of the generator, has to converge to the appropriate value of the Wasserstein distance fully. However, WGANGP’s hyperparameters are much simpler to optimise. The choice between both usually comes down to the trade-off between computer power consumed by training and researcher/developer time spent on tuning the hyperparameters.

Furthermore, much research has been devoted to conditional GANs where the generator, rather than taking only a random noise z as an input, takes another value y based on which the final output is conditioned (Mirza & Osindero 2014). This enables us to not only generate arbitrary samples that follow the data distribution but also to have an influence on what precise spectrum of the distribution to obtain. This is particularly valuable when fitting GANs to the RL setting where the reward and next state tuple () is conditioned on the current state and chosen action pair (). The state-of-the-art in conditional GANs, especially in the area of computer vision, is PIX2PIX GAN that exhibited extraordinary results in multiple image-to-image translation problems (Isola et al. 2017).

Figure 2: Comparison between GAN and L1-loss-based model on the classical MNIST dataset. Upper row presents results of an L1 loss generation, whereas lower row presents results of a GAN. The GAN produces very realistic images of handwritten data, indistinguishable from the original, due to successful data distribution approximation. L1 model, on the other hand, is only able to find images that minimise its mean error on the level of the individual, thus producing blurry results.

3 Related work

3.1 Model-based reinforcement learning

As briefly discussed in section 1, one of the promises of the model-based reinforcement learning is to drastically improve the sample efficiency of reinforcement learning. The agent with access to the transition matrix T and reward function R (section 2.1) could internally reason about potential scenarios and outcomes without performing risky exploration in the real environment. Unfortunately, the specific characteristics of the RL environment are usually unknown, and hence these methods cannot be directly applied.

3.1.1 Learning the model - imagination

One plausible solution to that difficulty could be learning the model of the environment instead. The idea of learning the model of the environment from the sampled experience when its dynamics are not fully known is one of the most important concepts adapted in this thesis. However, this notion is not new. Current investigation in this area concentrates on application of variational autoencoders (VAEs) (Kingma & Welling 2013), recurrent neural networks (RNNs) (Medsker & Jain 1999), and/or Bayesian methods. Lenz et al. (2015) introduced a novel robot control algorithm leveraging RNN that predicted the position of robot’s parts. Bellemare, Veness & Bowling (2013) factors the state space to decompose model learning into smaller, more manageable sub-problems and applies Bayesian inference to predict future states.

A substantial breakthrough in learning the model of the environment in the RL setting and the algorithm that is most widely applicable was proposed by Oh et al. (2015). The paper presented two novel Encoding-Transformation-Decoding architectures to learn the transition probabilities function T. They first encode high-dimensional state using convolutional network (LeCun et al. 1999), then the transformation conditioned on is performed to convert a high-level encoding of the of the current state to a high-level encoding of the next state, and finally decoding using deconvolutional network (Zeiler et al. 2010) is executed to decode high-level next-state features into the full representation of the next state first architecture employs a simple feed-forward method and takes a fixed history of states as an input. The other takes advantage of LSTM cells (Hochreiter & Schmidhuber 1997) to capture the most relevant features from the past.

This work was later extended through a straightforward modification to support modelling of reward function R, and thus, to able to learn the full model of the environment (Leibfried et al. 2016). It was also incrementally improved by Chiappa et al. (2017) by the alteration of the original architecture and exploration of a few novel ideas.

Nevertheless, these model generation techniques have two major areas for improvement:

• They do not utilise Markov property (1) fully, treating the model learning as a highly sequential and non-stationary problem instead of transforming it into much easier, stationary scenario. It makes capturing all intrinsics of the environment much harder to learn.

• They utilise L1 or L2 objective to train the model and predict future states. L1/L2 loss penalises each difference between prediction and ground truth uniformly. Thus, it struggles to prioritise small but significant features over large but meaningless (e.g. slight ball location change over reduced saturation of a black background in Pong Atari game) and often produces blurry results. This is a well-known problem when applying standard classifica-tion/regression deep learning models for generative tasks.

3.1.2 Applications of imagination

Historically, there has been a significant amount of work devoted to model-based reinforcement learning methods that took advantage of known environment dynamics. Firstly, as already mentioned in section 2.1, having that knowledge allows to directly solve the RL setting using simple linear algebra. If state/action is too intricate for that, we can apply traditional planning techniques. Additionally, Monte Carlo tree search (Browne et al. 2012) can be deployed for more reliable evaluation of the future rewards. Several model-based techniques like DQN-cognate methods coupled with prior knowledge about the intrinsic model of the environment have already yielded outstanding effects (Silver et al. 2016, 2017).

Sadly, using these techniques directly on the imagination (approximated real environment) has been shown to perform very poorly (Talvitie 2014, 2015). It is because approximations, by their nature, introduce a certain level of error. This error quickly compounds over time when performing long environment rollouts into the future. Utilising this approximation to estimate policy using planninglike methods is directly translating this compounded error into estimated policy causing it to be, in the best case, sub-optimal.

One of the first algorithms that managed to do that successfully was Dyna-Q (Sutton 1990). It used a look-up table for imagination modelling. The imagination was later used to simulate and replace the real environment. Unfortunately, it did not scale to more complex problems due to the use of a finite look-up table. The high-level structure of the algorithm proposed in this thesis is inspired by the traditional Dyna-Q and can be partially viewed as its expansion to bigger and/or continuous state spaces.

Most recently, Racani`ere et al. (2017) successfully employed model learning techniques described in section 3.1.1 by combining them with the standard model-free methods in the decision making process. Their results showed improvements in the effective exploration and in the environments where long-term planning is crucial, even, when the imagination of the agent was far from perfect. Nevertheless, it did not significantly reduce the size of the system exposure necessary for training a well-performing agent.

Recent work that most closely resembles this thesis was done by Venkatraman et al. (2016) and Gu et al. (2016). They also try to improve sample efficiency of RL by using Dyna-Q inspired algorithms. However, both of the studies adopt simple, often linear, methods to train the imagination. Hence, they can only be applied to specific domains where such methods are successful. Additionally, their data efficiency is not specifically better than the current model-free state-of-the-art Rainbow DQN (Hessel et al. 2017).

3.2 GANs in reinforcement learning

GANs and RL are usually treated as separate fields of machine learning research by the community. There was not much attention to combining them or employing advances from one domain to improve the other. Recent work by Pfau & Vinyals (2016) presented a deep connection between GANs and actor-critic RL trying to encourage both communities to learn from each other.

A few studies to utilise GAN architecture to devise novel RL algorithms followed. Doan et al. (2018) introduced GAN Q-learning, a model-free distributional alternative to the DQN algorithm, by using generative adversarial architecture to express Bellman update (3) implicitly. Ho & Ermon (2016) obtained significant performance gains in imitation learning (Schaal et al. 2003) by applying a novel Generative Adversarial Imitation Learning algorithm. Henderson et al. (2018) proposed an innovative OptionGAN architecture for inverse reinforcement learning (Ng et al. 2000) improving results in one-shot transfer.

This thesis proposes the use of GANs to learn the dynamics of the environment and form the agent’s imagination. Similar approach was presented by Xiao & Kesineni (2016) and Azizzadenesheli et al. (2018). They both examined learning the model of the environment for Atari games to, similarly to the AlphaGo (Silver et al. 2017), apply a combination of DQN with the Monte Carlo tree search to effectively discover the optimal policy. Both studies reported negative results due to the too short MCTS rollouts as proven in the former. Azizzadenesheli et al. (2018), however, was able to remarkably efficiently learn a very accurate representation of model dynamics using generative adversarial approach.

4 Imaginative framework for data eﬃciency

Building upon previous research, this thesis introduces Generative Adversarial Imaginative Reinforcement Learning (GAIRL) algorithm to answer questions posed in section 1 and to improve overall data efficiency of deep RL. As briefly mentioned in section 3, study by Sutton (1990) inspired the main structure of the novel algorithm. This structure is a focus of this section, without going much into details of the generative adversarial imagination. Transforming the imaginative framework specifically into the GAIRL algorithm is a topic of the next section.

The imaginative core underpinning GAIRL is divided into two separate modules: the model-free module (MFM) and the imagination module (IM). It also makes use of the concept of memory M. The memory is simply an array storing previous real-environment experience, similarly to the replay buffer in the DQN-cognate algorithms that was described in section 2.2.

These modules are utilised across three distinct training phases. First is the model-free phase (MFP) that only makes use of the MFM, followed by the imagination training phase (ITP) where solely the IM is operated, finalised by the imagination-based phase (IBP) that combines both of the modules.

4.1 Model-free module

The MFM consists of a standard model-free reinforcement learning algorithm; in this study, it is the state-of-the-art Rainbow DQN. It is the core decision-making module that models policy mapping of the agent. The advantage of the imaginative framework is that it can leverage any other existing model-free algorithm if it is more suited for a given task (e.g. D4PG in environments with continuous action space) or if a model-free algorithm that overperforms Rainbow DQN is introduced in the future.

Figure 3: Model-free module. It is in the essence a standard RL algorithm working exactly as described in sections 2.1 and 2.1. It represents policy incrementally updated based on the consequences of performed actions (

4.2 Imagination module

The IM is a crucial part of the whole framework. It is a trainable system that can simulate the behaviour of the environment dynamics, i.e. accurately approximate the transition probabilities function T and the reward function R, similarly to the approach presented by Leibfried et al. (2016). Similarly to the MFM, any generative model can be employed in place of the IM. Nonetheless, this thesis focuses on a specially designed IM to optimise its data efficiency. In depth design of the IM to create the GAIRL algorithm can be found in section 5.

Figure 4: Imagination module. Its goal is to replace the real environment by modelling its dynamics (functions T and R). Thus, its input/output exactly resembles input/output of the environment shown in Figure 1.

4.3 Model-free phase

The MFP mostly follows the standard RL procedure taking advantage of the model-free RL algorithm employed within the MFM. Similarly to the standard RL, it is based entirely on the real environment. The only difference is that, in addition to the use of experience samples returned by the environment to improve the policy encapsulated in the MFM, it also stores these samples within the agent’s memory M.

MFP is the only one out of three phases that requires the real environment to sample experience. Therefore, to limit the amount of the real experience that is required, it is also recommended to keep it as short as possible.

Figure 5: Model-free phase. Excluding the memory that stores transition tuples (), the whole concept was already visualised in Figure 1. MFM given a state produces an action . The pair () is fed to the real environment (technically is not produced by the agent nor fed into the environment, environment simply is in the state ). Environment moves into the next state and produces the reward . The pair () is then given as a feedback to the agent and together with the pair () it is saved in the memory. Finally, the next state becames the new current state () and the whole process starts from the beginning.

4.4 Imagination training phase

Imagination training phase is focused on using transitions samples (stored in the memory M to train the IM to accurately follow the dynamics of the real environment (approximate functions T and R) in a purely supervised learning manner.

Length of the ITP does not affect the real environment at all as it exploits past, memorised, experience instead. Hence, it can, and it should to produce more optimal results, run until the IM fully converges to the representation of the data that is stored in the memory.

4.5 Imagination based phase

Having fully trained imagination, we can start leveraging it. The IBP is focused on improving the agent’s policy by letting MFM train and rollout experimental scenarios on the IM instead of making potentially expensive trial and error in the real world. In fact, the MFM is not even aware of the fact that its learning process has been moved onto the ’fake’ environment instead of the real one.

Figure 6: Imagination based phase. The concept works exactly like the one visualised in Figure 1 or in Figure 5 (excluding memory). The only difference is that the real environment was replaced by the imagination module.

IBP, similarly to the ITP, should last until the agent’s policy fully converges to the imaginative environment to extract as much signal, from experience gathered so far, as possible. Nevertheless, in practice, experiments focused on performing only three times more steps in the imagination than in the real environment, even, when the policy did not always fully converged during the imagination based phase.

4.6 Summary

The whole premise of the framework is to extract meaningful signal from the obtained experience as efficiently as possible. Imagination module serves as a way to generalise possible distribution of the world to create a safe, artificial, imaginative environment where the agent can learn and experiment without any risks associated with the actions in the real world.

The three phases work in a loop following Algorithm 1. The loop part is crucial. In the first MFP, the agent usually performs completely random and experimental actions and is very likely to not reach more advanced environment states such as the next level in a video game. Going through the ITP and the IBP can allow it to master the first level of the game. However, it needs to go back to the real environment to explore newly reachable states to be able to imagine them in the second iteration accurately. This process should continue until the agent’s policy fully converges to the given environment.

Convenient characteristic of the whole framework is that it can work for any model-free RL algorithm in place of the MFM and any generative module in place of the IM. Nevertheless, only the framework that specifically leverages GANs and Markov property for the data efficient IM is defined as GAIRL as explained in the next section.

5 Imagination module structure for GAIRL

The framework described in section 4 is based on the assumption that the IM can learn the dynamics of the real environment accurately, and efficiently enough. As discussed in section 3.1.1, there already exist some studies on this topic. Most prominently Oh et al. (2015) used variational autoencoders in combination with recurrent neural networks to create a model that can predict the next state in Atari games conditioned on the current state and the chosen action. Its results were extremely accurate; however, it did not focus on sample efficiency.

Contrary to the previous studies, the IM of the GAIRL algorithm takes into account Markov property simplifying the whole setting to the straightforward supervised learning problem with an entirely stationary mapping is, therefore, theoretically, much more data efficient, especially compared to the standard model-free policy learning that tries to learn a behaviour maximising expected cumulative reward, i.e. the extremely non-stationary sum of all future rewards.

Although any supervised learning model that can fit the above-mentioned description could be used in the framework, certain algorithms are superior to others. As described in section 2.3, standard deep learning models based on the L1 or L2 loss do not perform well in generative tasks. In theory, the best performing architecture to approximate different high-dimensional generative distributions are generative adversarial networks (Goodfellow et al. 2014). GANs should add another advantage over popular variational autoencoder approach. As mentioned in section 3.2, use of the PIX2PIX GAN (Isola et al. 2017) for this setting already yielded remarkably good and sample efficient results in the very recent study (Azizzadenesheli et al. 2018).

Another essential characteristic of GANs is that they take random noise as an input (in addition to the conditional input in case of standard generative models). This quality provides an additional advantage over previously used methods in highly stochastic environments. It should allow GAN, if needed, to produce a set of diverse stochastic outputs for the same conditional input, instead of just a deterministic mean of possible outputs.

Figure 7: GAIRL’s imagination training phase. For the state prediction, a tuple () sampled from the memory serves as a conditional input for both generator and critic. The generator, given random noise z and a conditional input, tries to generate a ’fake’ next state to as closely resemble the real next state possible. Then, the critic, knowing conditional inputs for each of the possible fake and real states, calculates a Wasserstein distance between the generator conditional distribution and the real environment conditional distribution generator aims to reduce. For the reward prediction, a tuple () is the standard input to the MLP. The MLP tries then to predict the following reward. The predicted reward is then combined with the real reward to calculate loss that the MLP tries to minimise.

Although single GAN seems like a perfect match for the IM, in practice, it is easier to construct the IM using two separate deep learning models. One predicting the next state (modelling transition probabilities function T), another predicting the expected reward (modelling reward function R). In the proposed framework, the

Figure 8: The state-predicting generator and the reward-predicting MLP plugged into the IBP. Both generator and MLP work as the imagination module.

next state generation is handled by a GAN because state space is often highly dimensional structure perfectly suited for that architecture. Reward, however, is always represented by a single, real-valued, scalar. Therefore, it can be simplified to a simple regression problem. Hence, GAIRL employs a traditional L1/L2 loss based regression model for the reward generation task. The full internal structure of the IM module during the ITP can be seen in Figure 7. Figure 8 shows how trained imagination is then used in the IBP.

The final crucial step towards a successful use of the IM in the GAIRL setting was deciding on starting states for the IM rollouts. Real environment simply starts in a particular state. IM, on the other hand, is just a combination of supervised learning models, it does not have any inherently associated internal state. States are merely inputs to the machine learning system. Therefore, in the IBP, whenever there is a need to start a new episode, a random state is sampled from the memory. Just then, having these ground truth initial state, the IM is used to simulate possible scenarios into the future in the standard manner.

6 Experimental setup

6.1 Environments

To asses the capabilities of the algorithms, OpenAI Gym (Brockman et al. 2016) was employed. It provides traditional and most popular reinforcement learning environments in an easy to use manner. For the past four years, the most important reinforcement learning benchmarks were Atari games from the Atari Learning Environment (Bellemare, Naddaf, Veness & Bowling 2013) (also provided by the OpenAI gym). Unfortunately, RL algorithms require very extensive hardware to converge to the optimal policies for these games (up to 100GB of RAM per single run of the algorithm).

Therefore, due to high time constraints and limited resources, this study employs slightly simpler benchmarks from the classic control family of the problems. It mainly focuses on the MountainCar (Moore 1990), and Acrobot (Sutton 1996) environments. Although classic control environments are simpler and operate on lower dimensions, they were the standard benchmark in the reinforcement learning community for decades before Atari games were adopted. Nevertheless, in addition to showing the proof of concept of the GAIRL framework on the MountainCar and Acrobot, the plan is to continue the work to analyse the framework on currently adopted test environments.

6.1.1 MountainCar

MountainCar is based on a simple, low-dimensional setting. The agent controls a car that must drive up the steep slope. The car is not able to climb the hill directly. The agent has to learn to leverage the fact that it is situated in a valley and can use potential energy of the opposite slopes.

The state space is represented by two continuous variables: horizontal position of the car , and velocity of the car across the car’s axis Action space is defined by a single choice: drive left, or drive right (

Figure 9: MountainCar environment.

The agent gets a reward 1 for every time step until the episode is terminated (car drives up the right hill). An optimal policy should minimise the time it takes to reach the top of the hill, and thus maximise the cumulative reward obtained during the episode.

To solve the MountainCar environment, the algorithm has to be capable of handling continuous state space. Additionally, what makes it harder to learn is a very sparse reward signal. The agent has no information about the goal until it is reached. A meaningful signal occurs once per from 200 (close-to-optimal policy) to over 5000 (random policy) actions.

6.1.2 Acrobot

Acrobot, although still fairly low-dimensional, is harder and much more difficult task. The agent controls a two-link, under-actuated robot arm. The first (upper) joint, attached to the background, is out of control of the agent. The only controllable part is the torque of the lower joint of the robot. The goal is to balance the whole arm, so the tip of the second link swings above the episode termination line.

Six continuous variables represent the state space: sinus and cosine of the angle between the first link and a horizontal line; sinus and cosine of the angle

Figure 10: Acrobot environment.

between the first and the second link; an angular velocity of the first joint; and an angular velocity of the second joint.

Action space is again defined by a single choice, although slightly modified: apply positive torque, no torque, or negative torque to the second joint (Similarly to the MountainCar, the agent gets a reward 1 for every time step until the episode finishes (tip of the second link swings above the termination line) to encourage policies that take the least amount of time to complete the episode.

6.2 Implementation

All algorithms were implemented using the Tensorflow machine learning library (Abadi et al. 2015) with Python as a front-end programming language. Results and performance of the algorithms were captured and visualised using TensorBoard, a visualisation tool that is a part of the Tensorflow framework.

6.2.1 Model-free reinforcement learning

Implementation started with the standard Deep Q-Network code following the original paper (Mnih et al. 2015). The initial structure consisted of 2 hidden layers with 24 nodes each to accustom for simpler environments. Although the most popular activation function for hidden layers of deep neural networks is rectifier linear unit (ReLU) (f(x) = max(0, x)) introduced by Nair & Hinton (2010), from my experience it fairly often causes ’dead neurons’ problem, i.e. input to the ReLU is negative (thus output is a constant 0) causing lack of gradient flowing through the node during the backpropagation. Therefore, from the beginning, DQN was employing leaky rectifier linear units ()) with default parameter 2 (Maas et al. 2013). Starting configuration was using random minibatches of 32 experience samples from the replay buffer and Adam optimisation technique (Kingma & Ba 2014) to derive an optimal set of network weights.

It was then debugged and hypertuned until it was able to solve both, previously mentioned, environments easily. Most interesting changes performed during the hypertuning regarded the batch size and the optimisation technique. Initially, the agent was not able to even reach close-to-optimal policies. A single change from Adam optimiser to the standard stochastic gradient descent without any additional acceleration (SGD) was enough to fix the issue. Deep reinforcement learning is known to suffer from high instability. Momentum-based optimisation may often exacerbate this problem. Additionally, SGD has been recently shown to have better generalisation properties (Wilson et al. 2017), what may be another reason for its superiority in this setting. Following this change, the agent became able to reach optimal-like behaviour; however, it was quickly forgetting what it has learned and collapsing back to sub-optimal policies. The solution was to increase the batch size, from 32 to 256 strongly. A batch made out of only 32 samples was often not diverse enough to represent the actual signal coming from the environment, thus again intensifying instabilities of deep RL.

Afterwards, the state-of-the-art Rainbow DQN was written inheriting the DQN structure and following improvements described by Hessel et al. (2017). It followed the same debugging and hypertuning procedure. The final Rainbow DQN hyperparameters are presented in Table 1. The same Rainbow DQN was then used both as a baseline and within the model-free module of the GAIRL framework.

Table 1: Final parameters of the Rainbow DQN agent.

6.2.2 Generative models

Following the successful implementation, debugging, and hypertuning reinforcement learning agents, implementation of the second group of necessary algorithms – generative models – has started. All of the neural networks for the generative models were also initially configured to use Leaky ReLU activation for the hidden layers and Adam optimisation for training.

In the first step, I implemented standard GAN based on the original work by Goodfellow et al. (2014). Similarly to the DQN, it was then debugged and hypertuned; however, this time using standard MNIST dataset (LeCun et al. 1999). At first, both generator and discriminator seemed like they entirely lack gradient to progress, yet none of the well known GAN issues occurred. Interestingly, the problem did not lay purely in standard hyperparameters of the networks but in the weights initialisation. Originally, GAN’s weights were initialised according to the same distribution as weights of RL algorithms. In DQN, weights have to be initialised to values with a mean divergence and exploding gradients that can be caused by recursive updates. For GANs, however, these values were too low to produce any meaningful signal. Increasing the standard deviation of initial weights distribution solved the difficulty. Unfortunately, networks imbalance problem that was described in detail in section 2.3 followed. Discriminator started to completely overwhelm generator causing the gradient of both networks to vanish. The solution to that was decreasing the number of discriminator training steps for each generator training step from k = 5 to k = 1.

After the original GAN was implemented, both Wasserstein GAN (WGAN) and Wasserstein GAN with gradient penalty (WGANGP) were built on top. Because tuning weight clipping constant c in WGAN can be extremely hard and mundane, after making sure there are no bugs in the code, focus moved to the WGANGP implementation, without fully hypertuning the standard WGAN. Sadly, the WGANGP did not perfectly work on the dataset with the same hyperparameters as the standard GAN. The reason behind it was that, unlike the discriminator in the standard GAN, the critic in the WGANGP has to be much more powerful than the generator so it can fully converge to the real value of the Wasserstein distance between the real and the generated data. Slightly increasing the network size of the critic and the number of critic steps per single generator step from k = 1 to k = 10 caused the algorithm to perform much better. Nevertheless, it was still suffering from high variance and problematic convergence. As hypothesised and proved by the next runs of the model, this was caused by too high learning rate in both of the networks.

Furthermore, after implementation of the GAN, WGAN, and the WGANGP based on the original papers, they were expanded to allow for a conditional generation as proposed by Mirza & Osindero (2014). Conditional versions of these algorithms did not need any additional hypertuning to learn probabilistic distribution underlying the MNIST dataset adequately.

As described in section 5, an L1/L2 loss based model is also necessary for the expected reward generation. Therefore, in addition to GANs, a standard multilayer perceptron (MLP), optimising for the mean absolute error (L1 loss), has been implemented. L1 objective, instead of the most popular mean squared error (L2), has been chosen because it had been shown to work better for generative tasks (Zhao et al. 2017). The MLP was also initially debugged and hypertuned to fit the MNIST dataset. The comparison between implemented WGANGP and MLP on the MNIST dataset can be seen in figure Figure 2. It is a great example of the superiority of GANs over traditional models in high-dimensional generative tasks.

What is also crucial, the property that all generative models followed equally, both for the MNIST dataset and as a part of the imaginative framework, was normalisation. Namely, all values from the data, both inputs and outputs, were scaled to fit the range [0; 1]. It further optimised their performance but also made it easier to interpret results. The fact that the output space is a range [0; 1] is leveraged by employed evaluation metrics as described in section 6.3.

Final hyperparameters of models that were optimised for the MNIST can be found in Appendix C. Nonetheless, they are not important for the final results described in section 7.

6.2.3 GAIRL

Once generative algorithms and the model-free RL agent had been implemented, putting together the GAIRL agent begun. As explained in sections 4 and 5, the model-free module consists of the RL agent, and two generative models take the place of the imagination.

In place of the MFM, the final version of Rainbow DQN described in section 6.2.1 is used. Precisely the same network structure and hyperparameters were employed for two reasons: Firstly, to entirely make sure that experimental results show actual merit of the GAIRL method, and are not merely caused by varying hyperparameters between the algorithms. Secondly, to test the promise of GAIRL to work as a universal sample efficiency enhancement in a very modular, plug and play manner, that does not require any special tuning of the RL algorithms employed within the MFM.

Although one of the central premises of the research presented in this thesis is the advantage of GANs for the next state generation part of the imagination module, it was plausible that on chosen, less dimensional benchmarks they may actually perform worse than the standard L1/L2 loss approaches due to their higher level of complexity. Therefore, two different versions of the IM were implemented: an MLP-based imagination and a WGANGP-based imagination. The MLP-based version uses an MLP model for both next state and reward generation. Whereas WGANGP-based sticks to the standard GAIRL premise of using a GAN for the next state and an MLP for the reward. Also, this decision rendered a good comparison between Generative Adversarial Imaginative Reinforcement Learning and potentially simpler version of the framework. For both versions, the output of the MLP used for the reward generation is rounded to the closest integer (to accustom for discrete rewards in the experimental environments).

Table 2: Final parameters of the state-generating WGANGP for the WGANGPbased imagination.

As described in section 2.3, an original GAN could be more computationally effi-cient than the WGANGP. However, a hard deadline on the thesis moved the focus on minimising the time necessary to tune hyperparameters, instead of minimising the computational efficiency.

What is more, both environments presented in this study have deterministic dynamics, and so generative models do not require any inherent stochasticity. Therefore the random noise input to the WGANGP was omitted. All of the hyperparameters, both for WGANGP and MLPs, had to also be customised to the new setting. Tables 2 and 3 show final hyperparameter choices for all of the generative models used for the final evaluation.

In practice, the imaginative framework also maintains two separate memory modules: one representing a training set, the other a test set. It is vital to analyse the capabilities of the imagination module thoroughly. The memory used for experiments consisted of 200,000 most recent samples from the environment, 80% of

Table 3: Final parameters of multilayer perceptrons used for the imaginative framework.

which was used for training and 20% was separated purely for assessing the IM. A critical characteristic of the memory was that it artificially increased the number of experience samples that result in terminal states by oversampling (Ling & Li 1998). It was necessary due to a massive imbalance of terminal/non-terminal states in almost every RL setting. To do so, GAIRL keeps track of the mean length of episodes . Once the episode finishes (a transition sample with a terminal state occurs), the obtained sample is replicated many times.

As per algorithm 1, three training phases, MFP, ITP, and IBP continue in a loop starting from the model-free phase. MFP was initially set to operate for 000 steps in the real environment, ITP for 000 stochastic gradient descent imagination training steps, and IBP for 000 steps in the imaginative environment. However, initial experiments showed that longer ITP with 000 steps better generalises on both environments and so the hyperparameter has been changed for the final version of the algorithm.

6.3 Evaluation metrics

Different metrics have been used to test different aspects of the algorithms. Precision (15) and recall (16) is employed to assess the performance of the reward generation (normalised reward in the chosen environment can be either 0 or 1). Precision describes a ratio of true positives (correctly generated 1s) within all generated positives (all generated 1s, no matter if correctly or not). Recall defines how many 1s were correctly generated out of all true 1s. They provide much more information about the real performance of the machine learning model than a standard accuracy metric, especially in situations of high-class imbalance that takes place in chosen environments (reward 0 is extremely more common than 1).

State, on the other hand, follows fully continuous dynamics. Therefore, mean absolute error (MAE) (17) is used for state-generating models. MAE is simply an averaged L1 loss that calculates a mean absolute difference between a generated state and a ground truth state. It has been chosen over more popular mean squared error for the same reasons L1 loss has been chosen over L2 loss (see section 6.2.2). Given that all outputs are normalised between 0 and 1, the value of 1 also be referred to as accuracy.

Finally, to evaluate data efficiency of algorithms (the main focus of this thesis), simply a mean reward from the 100 most recent episodes, in regards to the number of steps performed in the real environment, is used.

7 Results

First, imagination capabilities to accurately approximate real environment dynamics are presented. Then, focus moves onto data efficiency of different variations of the imaginative framework and the state-of-the-art Rainbow DQN. Furthermore, computational efficiency analysis of the proposed framework can be found in Appendix B. All experiments were performed using algorithms and parameters described in section 6.2, the same for both environments. Each of the showcased results is based on 15 independent runs of the algorithm. In all of the plots, lines represent a mean value for the runs. Opaque areas represent a standard deviation around the mean.

7.1 Imagination performance

7.1.1 Reward generation

Figure 11 shows the performance of the reward imagination submodule in terms of precision and recall in both environments. Both metrics, besides recall of Acrobot reward that requires twice as much experience, converge after 120,000 steps. For MountainCar, as it is a simple environment, the MLP can easily reach over 99% recall and precision in a few GAIRL iterations. Acrobot is a bit more challenging and even converged imagination’s precision can drop below 99%. Although results of this magnitude are sufficiently accurate, they could potentially be improved if more time was spent on hypertuning the machine learning model.

Figure 11: Performance of the reward-generating MLP imagination submodule on experimental environments. The x-axis represents training steps performed solely in the ITP (in other phases the performance stays constant). Y-axis shows the value of an appropriate metric.

What can be intriguing, are big jumps about every 60,000 steps in the performance of the network as the training progresses. They are caused by the the main GAIRL algorithm loop. For every GAIRL iteration, agent gathers 20,000 samples of experience from the real environment (000) to then train the imagination for 60,000 gradient descent steps (000). Therefore, each multiple of 60,000 marks a point after which more real data enters the process helping the model to better generalise to unseen samples. I.e. in the first 60,000 steps imagination learns only based on 16,000 samples (80% of 20,000 because another 20% belongs to the test memory), in the 60000 period based on 32,000, in the 12000 based on 48,000, and so on.

7.1.2 State generation

Figure 12: Performance of the state-generating imagination submodule on experimental environments. The x-axis represents training steps performed solely in the ITP (in other phases the performance stays constant). The y-axis shows the value of an appropriate metric.

The effectiveness of the state imagination submodule is presented in Figure 12. Both Figure 12b and 12c show MAE for both WGANGP and MLP state generation. Figure 12a, on the other hand, shows Wasserstein distance as approximated by the critic for WGANGP imagination only.

We can see that WGANGP performs worse than the MLP in terms of MAE. This is, however, expected behaviour. MLP optimises for the L1 loss, that is in essence MAE multiplied by a constant, directly. GANs superiority lays in the fact that they do not optimise towards minimising mean error on the individual level, but towards minimising the difference between data distributions. These plots only showcase that indeed both of the modules seem to model dynamics of the real distribution accurately. Using them for direct comparison of the models would be unfair for the WGANGP.

MLP reaches over 99.9% accuracy on MountainCar and 99.5% accuracy on Acrobot, even without much of hypertuning. Unlike with the reward, very high accuracy of state generation is crucial because the correctness of future states highly depends on the correctness of previous states. When making rollouts into the future using pure imagination, errors may compound. Given accuracy a of the state generation model and rollout of length , final state’s accuracy may drop, in the worst case, to (nevertheless, this is also unlikely).

Even WGANGP that does not optimise for MAE reaches good results of over 97% accuracy for both environments proving its convergence properties. This is also shown by the estimated Wasserstein distance between the generator distribution and the environment distribution that reaches values lower than 0.015 for both environments (even below 0.005 for MountainCar).

These results show a high promise of replacing the real environment with the imagination, especially considering the fact that not much time and resources have been spent on optimising the architecture and parameters used in the experiments.

7.2 Data efficiency

Data efficiency is the main problem targeted in this study. Results in section 7.1 show that deep learning models can efficiently learn the dynamics of the real environment. The remaining question is whether these models are accurate enough to replace the real environment in the RL process.

Figure 13 presents results of both MLP-based and WGANGP-based imaginative framework in comparison to the imagination-free state-of-the-art algorithm. Both imaginative algorithms highly outperform imagination-free Rainbow DQN being

more than twice as data efficient on both environments.

GAIRL requires even as much as over three times fewer experience samples than the framework with MLP-based imagination, and over 6 times less than the current state-of-the-art on the more complex Acrobot environment. Surprisingly, despite the simplicity of MountainCar setting, GAIRL also solves it most optimally.

What is also important, one of the concerns was that imagination-aided agent would be much less stable than the standard model-free algorithm. It is indeed the case on the simple MountainCar, especially for WGANGP-based framework. Remarkably, once the complexity of the environment increases (Acrobot), it is no longer the case.

Figure 13: Performance of the MLP-based and WGANGP-based framework compared to the state-of-the-art baseline. The x-axis represents a size of experience from the real environment. The y-axis represents a mean reward from past 100 episodes. Black dashed line represents the top performance achieved by the state-of-the-art.

In addition, two-tailed Wilcoxon signed rank tests were performed to statistically compare obtained results. There were N = 15 non-zero difference samples in every comparison. The critical value to make sure that the results are statistically significant (

Firstly, the comparison between the imagination-free state of the art, and the mlp-based imaginative algorithm was made. For the MountainCar environment, Wilcoxon test produced rank sums . Similarly, in the Acrobot environment rank sums collected. Again, . Therefore, the null hypothesis stating that the imaginative framework does not improve sample efficiency of the state of the art can be safely rejected.

Then, the focus moved to the analysis of GAIRL benefits over the more primitive MLP-based imaginative framework. Rank sums for the MountainCar environment in this scenario are . Nevertheless, for the Acrobot, . Thus, the superiority of GANs in the generative framework is statistically significant only for the more complicated environment. This result was somewhat expected; the central promise of GANs is to work much better on complex and high dimensional domains. The MLP should not be much worse in an elementary environment like MountainCar.

Full table with the data that was used to calculate test statistics is available in Appendix A.

8 Discussion

Two research questions have been posed at the beginning of this thesis:

• Can learning the imaginative model of the environment be more efficient than learning an optimal policy?

• If so, can the learned imagination fulfil the promise of sample efficient model-based RL in settings where dynamics of the real environment are unknown?

Both of the answers have been subsequently answered: Results presented in section 7 clearly show that simply exploiting Markov property allows imagination to converge with almost order of magnitude less data than it is required for learning an optimal policy. Additionally, the imagination was later successfully used to improve on data efficiency of the state-of-the-art Rainbow DQN algorithm.

Initially, the thesis also hypothesised that the GAIRL framework presented throughout the thesis could produce a positive answer to both of the questions. Although in the end it did, one part of the hypothesis has been shown to be redundant in terms of simply answering these questions. Namely, generative adversarial architecture is not a necessity. More traditional models like multilayer perceptron can be successfully deployed within the imaginative framework as well. At least in simple environments used for the evaluation. Nevertheless, although not necessary, GANs tend to produce better results and should scale better to more complex

settings.

Two main novel contributions of this thesis would be:

• Efficient learning of the environment dynamics (creating imagination) by leveraging Markov property and advantages of generative adversarial networks.

• Successful use of imagination to highly improve state-of-the-art data effi-ciency of deep reinforcement learning through Dyna-Q-inspired algorithm.

Unfortunately, a month before the final submission deadline, Kaiser et al. (2019) introduced SimPLe – a novel deep RL algorithm that follows similar principles to the GAIRL framework. It was also inspired by the Dyna-Q algorithm and produced substantially more data efficient results than the Rainbow DQN. Therefore, it strips the novelty out of the second contribution of this thesis. However, because SimPLe leverages traditional L1/L2 loss model proposed by Oh et al. (2015) with only a few minor modifications, it often suffers from inaccurate generation (blurry images, disappearing small but crucial features). GAIRL’s approach of using GANs promises to circumvent this difficulty. However, more benchmarks in more complex environments should be performed to prove this hypothesis fully. This brings us to the point in the next paragraph.

GAIRL seems to overperform most recent state-of-the-art and even introduce crucial contribution on top of the similar advancement that was produced by the top research institution in parallel. Nevertheless, the set of environments used for testing was limited due to strict time constraints. The framework needs to be evaluated on a higher variety of domains. Each of them should reflect one or more of the following properties: a very high dimensional state space, non-deterministic transition dynamics, or a partially observable Markov decision process (POMDP). GANs architecture promises to handle the first two smoothly, theoretically amplifying the gap between GAIRL and other approaches. However, GAIRL may perform worse in environments following the POMDP properties. Additionally, it should be empirically compared with the recently introduced SimPLe algorithm. Further experiments are the critical next step and are planned to continue after the submission of this thesis.

What is more, more detailed optimisation of GAIRL architecture has been left for future work. Not much focus has been given to hyperparameters. It would be interesting to see a thorough study of the best parameter choices for the GAIRL setting. Potentially, sample efficiency could have been even more significantly improved by decreasing the length of the model-free phase and compensating it with the agent’s training in the imagination-based phase. Additionally, reward and state generation should be combined within a single machine learning model. It is especially essential for the reward generation so it can utilise the benefits of GANs in stochastic environments.

Another promising direction is to modify GAIRL’s memory to follow a similar structure to the prioritised replay buffer (Schaul et al. 2015) that is used in the Rainbow DQN. It could lead to intriguing results. Moreover, current use of Wasserstein GAN could be utilised to guide the exploration-exploitation trade-off. As Azizzadenesheli et al. (2018) mentioned, high Wasserstein distance produced by the critic for certain states could indicate that the agent is unsure of the possible outcomes of such a state. Therefore, the agent should potentially move there to explore the search space better, even if it seems less optimal.

Finally, the length of the imagination training phase could become adaptive. Currently, it takes a fixed constant. However, imagination does not need the same amount of training time at every iteration. The first iteration should take the longest, as the imagination starts without any prior knowledge. However, during the second iteration, it already has a general overview of the world. It just needs to update its model slightly to capture newly gathered data-points. Consequently, by the law of large numbers, it could follow that if n indicates the number of iterations then limof computational power.

One of the main issues raised during the project demonstration session was that the presented comparison of GAIRL with Rainbow DQN is not fair towards the latter. The argument was that the GAIRL framework performs a considerable amount of additional computation in the background. The fact that GAIRL requires more computation is true. Although it has the same time complexity in terms of the big O notation, it performs slower and is less computationally efficient as shown in Appendix B. Naturally, using the real environment directly instead of its imperfect imagination allows the agent to find the optimal policy quicker, even without mentioning the computational power required to learn the imagination.

Nonetheless, data efficiency, not computational efficiency, is one of the most pressing problems in the field (Irpan 2018), and the focus of this thesis. Computational power is cheap and widely accessible. Performing millions of random trial and error actions in the real environment, however, is often either very expensive or even impossible as explained throughout the thesis. Furthermore, this type of comparison is not new to the field. It has been used repeatedly in the literature (Mnih et al. 2015, Silver et al. 2016, 2017, Hessel et al. 2017, Schulman et al. 2017,

Another argument was that the imaginative module would not be able to grasp the dynamics of complex environments. Example of the real world in case of robotics was given. It is a reasonable concern. We cannot know for sure before performing appropriate experiments in such an environment as already mentioned earlier. However, the imagination module requires the same assumptions and the same setting as the current state-of-the-art RL algorithms. In theory, if imagination cannot encapsulate the real environment, then the state-of-the-art is not able to learn a close-to-optimal policy either. Notwithstanding, this question is open for future experimental work.

The last point referred to employed benchmark environments. Namely, what is the point of learning the imaginative simulation for the environment that already is a simulation of a car/robot? Undeniably, it does not make sense in practice. However, the goal of this research was not to find the best solution to the Acrobot or MountainCar. It was about improving the general-use state-of-the-art reinforcement learning. Simulations were used merely as a mean to compare the capabilities of presented RL algorithms. Optimal or close-to-optimal solutions to these environments, much better than any RL algorithm, has been devised decades ago. It did not stop, however, RL researchers to use them for benchmark purposes on conferences such as NeurIPS or ICML.

9 Conclusion

The goal of this thesis was to improve data efficiency of the state-of-the-art reinforcement learning. This has been successfully achieved by introducing an imaginative framework that can accurately and efficiently approximate dynamics of the real environment by making use of Markov property and generative adversarial models.

It presented experimental evidence that supports the superiority of GANs over standard generative models for conditional state prediction within the reinforcement learning setting. It also introduced a way to utilise imperfect approximation of the real world to limit the amount of data needed to train an optimally behaving agent.

Similar advancement regarding sample-efficient reinforcement learning has been released in parallel (Kaiser et al. 2019). The study presented here, however, not only confirms the results obtained by Kaiser et al. (2019) but also adds an important contribution to the field that promises to improve the state-of-the-art even further.

Nevertheless, more experiments are needed to ensure the superiority of the introduced framework, and there is still room open for future work.

Bibliography

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Man´e, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Vi´egas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y. & Zheng, X. (2015), ‘TensorFlow: Large-scale machine learning on heterogeneous systems’. Software available from tensorflow.org. URL: http://tensorflow.org/

Arjovsky, M., Chintala, S. & Bottou, L. (2017), Wasserstein generative adversarial networks, in ‘International Conference on Machine Learning’, pp. 214–223.

Arulkumaran, K., Deisenroth, M. P., Brundage, M. & Bharath, A. A. (2017), ‘A brief survey of deep reinforcement learning’, arXiv preprint arXiv:1708.05866 .

Azizzadenesheli, K., Yang, B., Liu, W., Brunskill, E., Lipton, Z. & Anandkumar, A. (2018), ‘Surprising negative results for generative adversarial tree search’.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N. & Lillicrap, T. (2018), ‘Distributed distributional deterministic policy gradients’, arXiv preprint arXiv:1804.08617 .

Bellemare, M. G., Dabney, W. & Munos, R. (2017), A distributional perspective on reinforcement learning, in ‘International Conference on Machine Learning’, pp. 449–458.

Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. (2013), ‘The arcade learning environment: An evaluation platform for general agents’, Journal of Artificial Intelligence Research 47, 253–279.

Bellemare, M., Veness, J. & Bowling, M. (2013), Bayesian learning of recur- sively factored environments, in ‘International Conference on Machine Learning’, pp. 1211–1219.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. (2016), ‘Openai gym’.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfsha- gen, P., Tavener, S., Perez, D., Samothrakis, S. & Colton, S. (2012), ‘A survey of monte carlo tree search methods’, IEEE Transactions on Computational Intelligence and AI in games 4(1), 1–43.

Chiappa, S., Racaniere, S., Wierstra, D. & Mohamed, S. (2017), ‘Recurrent envi- ronment simulators’, arXiv preprint arXiv:1704.02254 .

Coppersmith, D. & Winograd, S. (1990), ‘Matrix multiplication via arithmetic progressions’, Journal of symbolic computation 9(3), 251–280.

Doan, T., Mazoure, B. & Lyle, C. (2018), ‘Gan q-learning’, arXiv preprint arXiv:1805.04874 .

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014), Generative adversarial nets, in ‘Advances in neural information processing systems’, pp. 2672–2680.

Gu, S., Lillicrap, T., Sutskever, I. & Levine, S. (2016), Continuous deep q-learning with model-based acceleration, in ‘International Conference on Machine Learning’, pp. 2829–2838.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. (2017), Improved training of wasserstein gans, in I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan & R. Garnett, eds, ‘Advances in Neural Information Processing Systems 30’, Curran Associates, Inc., pp. 5767–5777. URL: http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf

Hassabis, D., Kumaran, D., Vann, S. D. & Maguire, E. A. (2007), ‘Patients with hippocampal amnesia cannot imagine new experiences’, Proceedings of the National Academy of Sciences 104(5), 1726–1731.

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J. & Precup, D. (2018), Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. & Silver, D. (2017), ‘Rainbow: Combining improvements in deep reinforcement learning’, arXiv preprint arXiv:1710.02298 .

Ho, J. & Ermon, S. (2016), Generative adversarial imitation learning, in Neural Information Processing Systems’, pp. 4565–4573.

Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural computation 9(8), 1735–1780.

Irpan, A. (2018), ‘Deep reinforcement learning doesn’t work yet’, https://www. alexirpan.com/2018/02/14/rl-hard.html.

Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. (2017), Image-to-image translation with conditional adversarial networks, in ‘2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)’, IEEE, pp. 5967–5976.

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S. et al. (2019), ‘Model-based reinforcement learning for atari’, arXiv preprint arXiv:1903.00374 .

Kingma, D. P. & Ba, J. (2014), ‘Adam: A method for stochastic optimization’, arXiv preprint arXiv:1412.6980 .

Kingma, D. P. & Welling, M. (2013), ‘Auto-encoding variational bayes’, arXiv preprint arXiv:1312.6114 .

LeCun, Y., Haffner, P., Bottou, L. & Bengio, Y. (1999), Object recognition with gradient-based learning, in ‘Shape, contour and grouping in computer vision’, Springer, pp. 319–345.

Leibfried, F., Kushman, N. & Hofmann, K. (2016), ‘A deep learning approach for joint video frame and reward prediction in atari games’, arXiv preprint arXiv:1611.07078 .

Leinweber, M., Ward, D. R., Sobczak, J. M., Attinger, A. & Keller, G. B. (2017), ‘A sensorimotor circuit in mouse cortex for visual flow predictions’, Neuron 95(6), 1420–1432.

Lenz, I., Knepper, R. A. & Saxena, A. (2015), Deepmpc: Learning deep latent features for model predictive control., in ‘Robotics: Science and Systems’.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D.

& Wierstra, D. (2015), ‘Continuous control with deep reinforcement learning’, arXiv preprint arXiv:1509.02971 .

Ling, C. X. & Li, C. (1998), Data mining for direct marketing: Problems and solutions., in ‘Kdd’, Vol. 98, pp. 73–79.

Lucic, M., Kurach, K., Michalski, M., Gelly, S. & Bousquet, O. (2018), Are gans created equal? a large-scale study, in ‘Advances in neural information processing systems’, pp. 698–707.

Maas, A. L., Hannun, A. Y. & Ng, A. Y. (2013), Rectifier nonlinearities improve neural network acoustic models, in ‘Proc. icml’, Vol. 30, p. 3.

Medsker, L. & Jain, L. C. (1999), Recurrent neural networks: design and applications, CRC press.

Mirza, M. & Osindero, S. (2014), ‘Conditional generative adversarial nets’, arXiv preprint arXiv:1411.1784 .

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G. et al. (2015), ‘Human-level control through deep reinforcement learning’, Nature 518(7540), 529.

Moore, A. W. (1990), ‘Efficient memory-based learning for robot control’.

Nair, V. & Hinton, G. E. (2010), Rectified linear units improve restricted boltz- mann machines, in ‘Proceedings of the 27th international conference on machine learning (ICML-10)’, pp. 807–814.

Ng, A. Y., Russell, S. J. et al. (2000), Algorithms for inverse reinforcement learn- ing., in ‘Icml’, pp. 663–670.

Oh, J., Guo, X., Lee, H., Lewis, R. L. & Singh, S. (2015), Action-conditional video prediction using deep networks in atari games, in ‘Advances in neural information processing systems’, pp. 2863–2871.

Peters, J., Vijayakumar, S. & Schaal, S. (2005), Natural actor-critic, Conference on Machine Learning’, Springer, pp. 280–291.

Pfau, D. & Vinyals, O. (2016), ‘Connecting generative adversarial networks and actor-critic methods’, arXiv preprint arXiv:1610.01945 .

Pfeiffer, B. E. & Foster, D. J. (2013), ‘Hippocampal place-cell sequences depict future paths to remembered goals’, Nature 497(7447), 74.

Racani`ere, S., Weber, T., Reichert, D., Buesing, L., Guez, A., Rezende, D. J., Ba- dia, A. P., Vinyals, O., Heess, N., Li, Y. et al. (2017), Imagination-augmented agents for deep reinforcement learning, in ‘Advances in Neural Information Processing Systems’, pp. 5690–5701.

Schaal, S., Ijspeert, A. & Billard, A. (2003), ‘Computational approaches to motor learning by imitation’, Philosophical Transactions of the Royal Society of London B: Biological Sciences 358(1431), 537–547.

Schacter, D. L., Addis, D. R., Hassabis, D., Martin, V. C., Spreng, R. N. & Szpunar, K. K. (2012), ‘The future of memory: remembering, imagining, and the brain’, Neuron 76(4), 677–694.

Schaul, T., Quan, J., Antonoglou, I. & Silver, D. (2015), ‘Prioritized experience replay’, arXiv preprint arXiv:1511.05952 .

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017), ‘Proxi- mal policy optimization algorithms’, arXiv preprint arXiv:1707.06347 .

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. et al. (2016), ‘Mastering the game of go with deep neural networks and tree search’, nature 529(7587), 484.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. & Riedmiller, M. (2014), Deterministic policy gradient algorithms,

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. et al. (2017), ‘Mastering the game of go without human knowledge’, Nature 550(7676), 354.

Sutton, R. S. (1990), Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in ‘Machine Learning Proceedings 1990’, Elsevier, pp. 216–224.

Sutton, R. S. (1996), Generalization in reinforcement learning: Successful exam- ples using sparse coarse coding, in ‘Advances in neural information processing systems’, pp. 1038–1044.

Sutton, R. S. & Barto, A. G. (1998), Introduction to reinforcement learning, Vol. 135, MIT press Cambridge.

Talvitie, E. (2014), Model regularization for stable sample rollouts., pp. 780–789.

Talvitie, E. (2015), Agnostic system identification for monte carlo planning, in ‘Twenty-Ninth AAAI Conference on Artificial Intelligence’.

Tesauro, G. (1995), Td-gammon: A self-teaching backgammon program, plications of Neural Networks’, Springer, pp. 267–285.

Tolman, E. C. (1948), ‘Cognitive maps in rats and men.’, Psychological review 55(4), 189.

Van Hasselt, H. (2010), Double q-learning, in ‘Advances in Neural Information Processing Systems’, pp. 2613–2621.

Van Hasselt, H., Guez, A. & Silver, D. (2016), Deep reinforcement learning with double q-learning., in ‘AAAI’, Vol. 2, Phoenix, AZ, p. 5.

Venkatraman, A., Capobianco, R., Pinto, L., Hebert, M., Nardi, D. & Bagnell, J. A. (2016), Improved learning of dynamics models for control, in ‘International Symposium on Experimental Robotics’, Springer, pp. 703–713.

Villani, C. (2008), Optimal transport: old and new, Vol. 338, Springer Science & Business Media.

Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M. & De Freitas, N. (2015), ‘Dueling network architectures for deep reinforcement learning’, arXiv preprint arXiv:1511.06581 .

Watkins, C. J. & Dayan, P. (1992), ‘Q-learning’, Machine learning 8(3-4), 279–292.

Williams, R. J. (1992), ‘Simple statistical gradient-following algorithms for con- nectionist reinforcement learning’, Machine learning 8(3-4), 229–256.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N. & Recht, B. (2017), The marginal value of adaptive gradient methods in machine learning, in ‘Advances in Neural Information Processing Systems’, pp. 4148–4158.

Xiao, T. & Kesineni, G. (2016), ‘Generative adversarial networks for model based reinforcement learning with tree search’.

Zeiler, M. D., Krishnan, D., Taylor, G. W. & Fergus, R. (2010), ‘Deconvolutional networks’.

Zhao, H., Gallo, O., Frosio, I. & Kautz, J. (2017), ‘Loss functions for image restoration with neural networks’, IEEE Transactions on Computational Imaging 3(1), 47–57.

Appendix A Numerical results for data eﬃciency

Table 4: Algorithms comparison. Runs are grouped by the environment and used random seed. Values in the columns represent the number of samples from the real environment needed before convergence.

Appendix B Computational eﬃciency analysis

Table 5 presents a comparison of the algorithms in regards to the time they required until convergence on a 4CPU instance. For both Acrobot and MountainCar, imaginative framework employing MLP for the imagination module requires approximately 7 times more time to converge to the optimal policy than the imagination-free agent. Full GAIRL algorithms run approximately 2.5 times longer than the MLP-based agent and 17 times longer than the imagination-free agent.

It clearly shows that the state-of-the-art is much more computationally efficient than any of the proposed algorithms. Combining it with the results from section 7, we can observe that there is an inverse proportional relation between data efficiency and computational efficiency.

However, GAIRL performance should scale better with higher computational power. The most demanding part of GAIRL is the imagination training phase. The ITP employs a standard supervised learning process that can easily leverage high parallelism provided by GPUs and TPUs. Highly sequential reinforcement learning, on the other hand, is much harder to parallelise. Naturally, GAIRL can never reach the same efficiency as imagination-free options; however, it may get asymptotically closer when more powerful hardware is provided.

Table 5: Algorithms comparison. Runs are grouped by the environment and used random seed. Values in the columns represent the number of minutes that passed before convergence. Experiments run on Google Cloud Platform, each on the instance with 4 Intel Skylake CPUs and 8GB of RAM.

Appendix C Generative hyperparameters for MNIST

Table 6: Final parameters of the multilyer perceptron used for learning the generative distribution of the MNIST dataset.

Table 7: Final parameters of the Wasserstein GAN with Gradient Penalty used for learning the generative distribution of the MNIST dataset.

Table 8: Final parameters of the original GAN used for learning the generative distribution of the MNIST dataset.

Designed for Accessibility and to further Open Science