My stuff
MIME: Mutual Information Minimisation Exploration

We show that reinforcement learning agents that learn by surprise (surprisal) get stuck at abrupt environmental transition boundaries because these transitions are difficult to learn. We propose a counter-intuitive solution that we call Mutual Information Minimising Exploration (MIME) where an agent learns a latent representation of the environment without trying to predict the future states. We show that our agent performs significantly better over sharp transition boundaries while matching the performance of surprisal driven agents elsewhere. In particular, we show state-of-the-art performance on difficult learning games such as Gravitar, Montezuma’s Revenge and Doom.

Agents trained by reinforcement learning (RL) algorithms perform very well and even exceed human performance in many areas like the game Go [23] and Atari 2600 games [11]. However, the reward function to train an agent is difficult to design and also not scalable as different reward functions are needed for each environment. In addition, some environments only provide extremely sparse rewards. Random exploration agents, and variants such as  ϵ-greedy agents, are not efficient as the agent wastes a lot of time wandering aimlessly over the same states learning nothing, until by chance it hits an extrinsic reward. It makes more sense to add intrinsic rewards that will encourage efficient exploration; in the absence of extrinsic rewards, the agent is guided to seek new states.

The idea that agents should explore by intrinsic reward like curiosity or surprisal can be traced back to early 1990’s. In 1991, Schmidhuber [17] proposed that an agent trained by reinforcement learning can translate mismatches between expectations and reality into curiosity/surprise rewards, also called surprisal. The agents are driven to explore surprising aspects of the world, and hence to explore the environment efficiently. This idea has been inherited and carried forward for the next thirty years, especially in recent years. Thanks to increases in computing power, people have verified this idea in large-scale data and scenarios [1, 15, 4].

All of these recent methods follow the same fundamental idea: a reward-maximising neural policy network  πlearns to generate action sequences. A separate neural network called the world model M learns to predict future states (st+1), given past inputs (st) andactions (at). In the absence of external reward, the policy network πmaximises the same value function that the world model M minimises, that is, surprisal rewards. This motivates policy  π to inventand generate experiments that lead to “novel” situations where the world model M cannot yet predict well.

However, since the world model M is trained to learn environment transitions, if some of those transitions are discontinuous or abrupt, it is difficult for the agent to predict  st+1and therefore the agent is continuously surprised by the transition. This results in an agent that gets stuck on the transition boundary. The length of time stuck on the boundary depends on the magnitude of change in the transition. The greater the transition change, the longer the time spent at the boundary.

Some current methods tend to avoid getting stuck at transition boundaries. Houthooft et al. [8] proposed VIME, which computes Bayesian-surprisal inspired by the idea of maximising information gain. But VIME is difficult to scale up to large-scale environments [1]. Prediction improvement measures [19, 14, 10, 1] compute intrinsic reward by model learning progress and can avoid getting stuck in some situations where surprisal does, for example, when presented with white noise. Even so, [1] shows that prediction improvement measures do not explore as well as surprisal in ordinary sparse reward environments.

We propose Mutual Information Minimising Exploration (MIME) in this paper. MIME-agents can explore as well as surprisal-agents in sparse reward environments and much better in environments that include abrupt state transitions.

In RL, at time step  t ≥ 0, the agent is in state  st ∈ S, takes an action at ∈ A, receives extrinsic reward  ret and transitions to the next state st+1 ∼ P(st+1|st, at). The objective of RL is to find a policy  π tomaximize the discounted cumulative reward:


where  πis a mapping (neural network in this paper) from a state to a distribution over actions so that  at ∼ π(·|st), γ ∈ [0, 1) is thediscount factor.

When environments only provide extremely sparse rewards, it is hard to use (1) to train the policy  π, as almost all  ret = 0. This re-quires us to add intrinsic rewards  rit to encourage the agent to explore efficiently to find the sparse extrinsic rewards. The objective function of RL in (1) can be rewritten as:


where  ηis the trade-off between the intrinsic reward and extrinsic reward. Surprise is a common form of intrinsic reward.

2.1 Count-based exploration

The simplest solution for adding intrinsic reward, is to add an intrinsic reward based on a state visit count function:


where  ntis the number of times agent has visited state  st (countingthe first visit as  nt = 1).

This method is obviously only computationally tractable for environments with relatively small number of discrete states. For continuous state and larger environments something more advanced [13, 3] is required.

2.2 Surprise-driven exploration

2.2.1 Surprisal

Surprise-driven exploration is a class of exploration methods dependent on errors in predicting dynamics [18, 17, 16, 24, 1, 15, 27]. The agent builds a separate neural network called the world model M to predict future state  st+1, given current state  stand current action  at, to approximate the environment transition  P(st+1|st, at)by  PM(M(st, at)|st, at, θ), where θare the parameters of the world model M. If the agent finds its prediction  M(st, at)is different from the true future state  st+1, it gets surprised.

One simple proposed surprise definition is the so-called surprisal [25]:


In practice, if we suppose  PMis a Gaussian distribution, we can compute the intrinsic reward  rit in (2) with surprisal simply by:


as shown in Figure 1.


Figure 1. surprisal-driven V.S. MIME-driven

If we train an agent to explore the environment by surprisal, the world model M is trained with the environment transition P(st+1|st, at). The world model then learns the mapping function from  (st, at) to st+1and makes a prediction  M(st, at). When theenvironment has an area where the transition  P(st+1|st, at) from stto  st+1is discontinuous, the agent cannot predict  st+1well and gets a big surprise. The policy network  πis trained by maximising the surprisal reward, so, at the next iteration, it will generate actions that drive the agent to this area again. Since piecewise smooth functions are hard to learn using continuous activation functions [22], the agent will be stuck at such transition boundaries for a long time.

VIME agents, as introduced by [8] can avoid getting stuck at transition boundaries. VIME agents compute intrinsic rewards based on Bayesian surprise:


inspired by maximising mutual information (MI). Since VIME prefers large changes in model parameters from one state to the next, whether it gets stuck at a transition boundary depends heavily on the particular model. In any case, VIME cannot scale up to large-scale problems that we consider here.

2.3 Prediction improvement measures

Another class of intrinsic exploration involves prediction improvement measures [19, 14, 10, 1], which compute the intrinsic reward by:


where the intrinsic reward is the k-step learning progress at  (st, at).This is somewhat similar to Bayesian surprise and can be easily implemented to large-scale problems, but results in [1] show that such methods do not perform as well as surprisal.

In this section, we present our exploration method named Mutual Information Minimisation Exploration (MIME). Similar to Bayesian surprise and prediction improvement, MIME computes intrinsic reward by mutual information. However, instead of computing mutual information between past and future steps in the trajectory, we compute mutual information between the input and output of the model M on the current state and current action. In other words, the world model simply tries to learn a representation of the world without prediction, but nevertheless incorporating information from the chosen action. Surprisingly, this works just as well as surprisal in continous transition environments, and much better in environments with abrupt transitions.

The mutual information (MI)-based approach has a long history in unsupervised feature learning and the infomax principle [9, 2] for neural networks advocates maximizing the MI between input and output. In our problem, the expected mutual information (information gain) between the input and output of the model M is:


where D is a dataset of tuples sampled from the environment and used for training. Since our purpose is to learn the best feature representation to reconstruct  st, we train the model M by maximising (8):


Intuitively, one can view the world model as generating a notion of familiarity. When the agent is in state  stthat it has visited many times before, and chooses action  atthat has been performed in that state before, it “feels” familiar and comfortable to the agent as the current state and action are well represented. However, the policy network  πtries to minimise the same mutual information and therefore encourages the agent to either choose actions it has not chosen before, or explore unfamiliar states. In other words, the agent is encouraged to get out of its comfort zone by minimising the mutual information in (8):


Minimising equation (10) is equivalent to minimising the pair-wise mutual information between the input and output of the model at each step t:


In summary, we train the model M by maximising the mutual information between its input and output to find the best feature representation. In the absence of extrinsic reward, the policy network πminimises exactly the same function that the model M is maximising. To be consistent with the previous reinforcement learning algorithms, the policy is updated by maximizing a reward function. We define the intrinsic reward  rit as the negative of (11).

Computing the total probability  P(M(st, at))is intractable in practice, so we only choose the first term from (11) as an approximate intrinsic reward:


Similarly, the objective function in equation (9) for training the model M can be rewritten as:


If we suppose  PMis a Gaussian distribution and that M is autoencoder-like, then  rit can be written in a simple way:


The entire training procedure is summarised in Algorithm 1. Figure 1 shows that the structure of surprisal-driven exploration is not changed, just the definition of intrinsic reward.


For illustrative purposes, we begin with two simple experiments and visualise the agent’s movements to show its exploration efficiency. Then we implement MIME-driven exploration in three large-scale experiments: Gravitar, Doom, and Montezuma’s Revenge. These three games have extremely sparse rewards and are a good test of

Table 1. Large-scale games environmental preprocessing.


exploration ability. All large-scale experiments are run three times with different seeds. Table 1 shows how we preprocessed the three large-scale environment. We use TRPO [20] in the two simple experiments and PPO [21] in large-scale games.

In the first two simple experiments, we follow the structure shown in Figure 1 and choose rllab [6] as the platform to run the code. The model is a simple fully-connected neural network that has one hidden layer of 32 units and the number of units in output layer is equal to the dimension of state. The hidden layer has rectified linear unit (ReLU) non-linearities. The policy  πis also a neural network that has one hidden layer of 32 units and tanh nonlinearities. For other hyper-parameter settings please check Table 2, where batch size refers to steps collected at each iteration.

Table 2. Hyperparameter setting for two simple experiments.


In the three large-scale games, since the states are pixel-frames, if we also try the structure in Figure 1, we will compute the surprisal or mutual information from raw pixels. However, recent work [15, 5] shows that if we map the raw pixels to a feature space first and compute the intrinsic reward in this space, we can get a much better result. The mapping can be any feature extractor such as a Variational Autoencoder (VAE) or a Convolution network (CNN) with weights frozen to randomly initialised values. In this paper, we choose the latter for time efficiency reasons.

We use two different methods to compute MI in a CNN feature space (see Figure 2(a) and Figure 2(b)). In Figure 2(a), the world model M is trained on the CNN feature space, but in Figure 2(b)) the world model M is trained from pixels. Both of these two structures use a separate CNN network with frozen layers as a feature extractor so that we can compute intrinsic reward  ∥M(ft, at)−ft∥2 or∥M(st, at)−ft∥2 in feature space. Another approach called random network distillation (RND) [5] was proposed in 2018 (See Figure 2(c)). They chose a self-predictor to predict the features of the next state, the difference between this prediction ˆft+1and the mapped features  ft+1is regarded as the intrinsic reward to train the agent. Here the mapped features  ft+1are produced by a CNN with frozen weights. But similar to the surprisal-driven idea, RND also focuses on future states. We also compare our results with RND in our large-scale experiments.

The feature extractor CNN in Figure 2 has three convolutional layers, which have 32, 64, 64 kernels, 8, 4, 3 kernel size and stride as 4, 2, 1, followed by a fully-connected linear layer with 512 output units. All the parameters in the CNN are fixed (no training). The world model in Figure 2(a) uses the output of the linear layer as its input and predicts future observations in feature space. It has two


Figure 2. Different structures used to compute intrinsic reward.

hidden non-linear layers and one linear output layer. All these fully-connected layers have 512 output units. The world model in Figure 2(b) on the other hand, uses the raw pixels as its input. It uses the same convolutional layer structure as the feature extractor CNN we introduced above, but the parameters are trained during learning. The self-predictor in Figure 2(c) has the same structure as the world model in Figure 2(b). The difference is that the self-predictor only considers observations as its input and ignores actions. All the policy networks in Figure 2 have the same structure. To compare RND as a baseline, we choose the same other hyper-parameter settings as RND in this paper (see Table 3). Note that the number of parameters in 2(a) is much smaller than the other two (3.72 million V.S. 5.14 million).

Table 3. Hyperparameter setting for three large-scale games.


5.1 2DPlane environment

This is a simple 2DPlane environment (S ⊂ R2, A ⊂ R2). Wechoose this environment to show that MIME-agent has similar exploration ability with surprisal-agent. The observation space is a square on the 2D plane  ((x, y) ∈ R2), centred on the origin. The action is its velocity  ( ˙x, ˙y)that satisfies  | ˙x| ≤ 0.01, | ˙y| ≤ 0.01. In this environment, the agent starts at origin (0,0) and the only extrinsic reward can be found at location (1,1). The environment wraps around so that there are no boundaries.

In this experiment, we train one agent and record the observation coordinate (x, y) at each step until it finds the non zero extrinsic reward. Figure 3 shows the heat map of motion tracking for the agent trained without intrinsic reward, with surprisal reward and with MIME reward, respectively. Darker red represents a higher density, which means the agent takes more steps in this area. It is clear that random exploration without intrinsic reward takes a long time (2,059,459 steps) to find the extrinsic reward, whereas surprisal (22,919 steps) and MIME (21,150 steps) do not spend time unnecessarily in random states. We can see that the exploration efficiency is similar between surprisal-agent and MIME-agent, and they are both much more efficient than the random exploration strategy.

5.2 Passing through a wormhole

The wormhole experiment uses a three-dimensional environment with a sharp circular boundary (the wormhole) between an upper rectangular planar environment at z = 1000, and a lower second circular planar environment centred at the origin with radius 0.5 (See Figure 4). The observation space is 3D  ((x, y, z) ∈ R3). The ac-tion is still a two dimensional vector: the velocity  ( ˙x, ˙y)that satisfies �˙x2 + ˙y2 ≤ 0.01. The agent starts from the origin (x = 0, y = 0, z = 0). When the agent crosses the boundary, it immediately transitions from one plane to the other. All the agents explore 5,000,000


Figure 3. Exploration efficiency in 2DPlane environment until chancing upon the reward state.


Figure 4. Pass through a wormhole: environment map


Figure 5. Pass through a wormhole, 5 million steps.

steps. Figure 5(a) to Figure 5(c) show the top view of the environment so that we can also visualise the agent’s movements. We can see both MIME and surprisal agents are attracted by the boundary, but the time that surprisal agents stay at the boundary is much longer than the MIME agent. As can be seen from the Figure 5(a), the random exploration agent is not affected by the boundary, but the exploration efficiency is very low.

5.3 Large-scale games


Figure 6. Mean episodic return of MIME, surprisal, and RND on 3 hard exploration large-scale games. Curves are an average over 3 random seeds, with standard deviation shown in shaded areas. Horizontal axes show numbers of frames.

In this subsection, we test MIME and compare it to surprisal and RND in three large-scale games: Gravitar, Doom, and Montezuma’s Revenge. For MIME, we implement two different structures as shown in Figure 2(a) and Figure 2(b) respectively. All experiments run with 32 parallel environments.

Gravitar is an Atari games which has sparse rewards. There are two modes: the overworld (essentially space); and a sideview landscape when the spaceship enters a planet environment. The agent (spaceship) will be pulled slowly to the star in the overworld, and downward in the side-view levels. We chose this game to show a similar exploration ability between surprisal and MIME. It can be seen from Figure 6(a) that the agent trained by MIME, surprisal and RND performs similar in this game. The MIME agent that has frozen CNN layers also performs as good as the one with trainable CNN layers.

We choose the scenario named ”find my way home” in VizDoom game [26] to train the agent to navigate in surroundings and reach his ultimate goal. The map is a series of connected rooms and one corridor with a dead end. Each room has a different colour and texture. There is a green vest in one of the rooms (the same room every time). The agent is born in a randomly chosen room facing a random direction. When the agent explores in this map and finds the vest (the goal), it gets a 1 point reward. We add a TV (always shows changing frames) on the wall in one room and the experiment is run for 5 million frames. We observe that the surprisal agent gets stuck in the TV room, and only the agent born in a room between the TV room and the goal could find the goal, however, both MIME and RND driven agents can escape from the TV room. As with the previous experiment, there is not much difference in the exploration ability between the agents driven from two different structures of MIME (See Figure 6(b)).

Montezuma’s revenge is an Atari game that is considered very hard. Many RL algorithms [12, 7] that are successful at other Atari games fail at Montezuma’s revenge. It has many rooms for the agent to explore. The agent moves from room to room and scores points along the way. This game has similar game mechanics as the wormhole environment we designed in subsection 5.2: when the agent moves into a new room, the state abruptly changes because the background of each room is different. Surprisal agents get stuck at the boundary between adjacent rooms because of this transition. In Figure 6(c) the surprisal agent can only achieve a score of 400 because it is stuck at the boundary between the first and second rooms. A MIME agent with trainable CNN layers performs somewhat better than RND agents. It is interesting to see that MIME agent with frozen CNN layers only scores 2500. The game offers a substantial reward for agents that return to room 2 with the reward from room 3. The frozen CNN MIME agent discovers this reward, and in the process, the policy encourages the agent to go back to room 2 after room 3. We also observe that the RND agent initially gets stuck in the same undesirable loop as the frozen CNN MIME agent. However, the RND agent does manage to escape from this loop in 2 out of 3 of the trails/seeds. We believe that extending training should allow the frozen CNN MIME agent to escape from this loop.

One limitation of our approach is that when we maximise the mutual information, we maximise the KL divergence, which is theoretically without upper bound. For autoencoder-like world models, if the model learns the identity function, the network will be able to reproduce the input regardless of whether it has seen it before. Importantly, this will result in approximately the same intrinsic reward values for every possible observation, regardless of its novelty. In practice, we find that as long as the model contains a layer with sig-nificantly fewer units than the input dimensionality, the network does not converge on the identity and thus the intrinsic reward values produced are still useful for preventing the agent from getting stuck in situations described in this paper. We also note that since the world model is trained via mutual information, it need not be autoencoder-like, and could produce output that has a significantly different size to the input.

The main difference between MIME agents and other common RL agents, is that MIME agents do not try to predict the future. Rather, they form a measure of how comfortable they are in a given environment. Whereas surprisal agents explore areas where prediction is poor, MIME agents explore areas where it has a poor world model. As a consequence, surprisal agents tend to seek out hard to learn transition boundaries, whilst MIME agents are encouraged to leave their comfort zone. This is a simple idea, is easy to implement and most importantly it overcomes the limitations of surprisal getting stuck at transition boundaries.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.

[1] Joshua Achiam and Shankar Sastry, ‘Surprise-based intrinsic motivation for deep reinforcement learning’, arXiv preprint arXiv:1703.01732, (2017).

[2] Anthony J Bell and Terrence J Sejnowski, ‘An informationmaximization approach to blind separation and blind deconvolution’, Neural computation, 7(6), 1129–1159, (1995).

[3] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos, ‘Unifying count-based exploration and intrinsic motivation’, in Advances in Neural Information Processing Systems 29, pp. 1471–1479, (2016).

[4] Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros, ‘Large-scale study of curiosity-driven learning’, arXiv preprint arXiv:1808.04355, (2018).

[5] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov, ‘Exploration by random network distillation’, arXiv preprint arXiv:1810.12894, (2018).

[6] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel, ‘Benchmarking deep reinforcement learning for continuous control’, in ICML, pp. 1329–1338, (2016).

[7] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver, ‘Rainbow: Combining improvements in deep reinforcement learning’, in Thirty-Second AAAI Conference on Artificial Intelligence, (2018).

[8] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel, ‘Vime: Variational information maximizing exploration’, in Advances in Neural Information Processing Systems 29, pp. 1109–1117. Curran Associates, Inc., (2016).

[9] Ralph Linsker, ‘Self-organization in a perceptual network’, Computer, 21(3), 105–117, (1988).

[10] Manuel Lopes, Tobias Lang, Marc Toussaint, and Pierre-Yves Oudeyer, ‘Exploration in model-based reinforcement learning by empirically estimating learning progress’, in Advances in Neural Information Processing Systems 25, pp. 206–214. Curran Associates, Inc., (2012).

[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, ‘Playing atari with deep reinforcement learning’, arXiv preprint arXiv:1312.5602, (2013).

[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,

Daan Wierstra, Shane Legg, and Demis Hassabis, ‘Human-level control through deep reinforcement learning’, Nature, 518(7540), 529– 533, (2015).

[13] Georg Ostrovski, Marc G. Bellemare, A¨aron van den Oord, and R´emi Munos, ‘Count-based exploration with neural density models’, in Proceedings of the 34th International Conference on Machine Learning -Volume 70, pp. 2721–2730. JMLR.org, (2017).

[14] Pierre-Yves Oudeyer, Frdric Kaplan, and Verena V Hafner, ‘Intrinsic motivation systems for autonomous mental development’, IEEE transactions on evolutionary computation, 11(2), 265–286, (2007).

[15] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell, ‘Curiosity-driven exploration by self-supervised prediction’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, (2017).

[16] Juergen Schmidhuber, ‘Unsupervised minimax: Adversarial curiosity, generative adversarial networks, and predictability minimization’, arXiv preprint arXiv:1906.04493, (2019).

[17] J¨urgen Schmidhuber, ‘Curious model-building control systems’, in Proc. international joint conference on neural networks, pp. 1458– 1463, (1991).

[18] J¨urgen Schmidhuber, ‘A possibility for implementing curiosity and boredom in model-building neural controllers’, in Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, pp. 222–227. MIT Press, (1991).

[19] J¨urgen Schmidhuber, ‘Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts’, Connection Science, 18(2), 173–187, (2006).

[20] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz, ‘Trust region policy optimization’, in ICML, pp. 1889– 1897, (2015).

[21] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, ‘Proximal policy optimization algorithms’, arXiv preprint arXiv:1707.06347, (2017).

[22] Rastko R Selmic and Frank L Lewis, ‘Neural-network approximation of piecewise continuous functions: application to friction compensation’, IEEE transactions on neural networks, 13(3), 745–751, (2002).

[23] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., ‘Mastering the game of go with deep neural networks and tree search’, Nature, 529(7587), 484, (2016).

[24] Bradly C Stadie, Sergey Levine, and Pieter Abbeel, ‘Incentivizing exploration in reinforcement learning with deep predictive models’, arXiv preprint arXiv:1507.00814, (2015).

[25] Myron Tribus, Thermostatics and thermodynamics: an introduction to energy, information and states of matter, with engineering applications, Van Nostrand, 1961.

[26] Marek Wydmuch, Michał Kempka, and Wojciech Ja´skowski, ‘Vizdoom competitions: Playing doom from pixels’, IEEE Transactions on Games, (2018).

[27] Haitao Xu, Brendan McCane, and Lech Szymanski, ‘Vase: Variational assorted surprise exploration for reinforcement learning’, arXiv preprint arXiv:1910.14351, (2019).

Designed for Accessibility and to further Open Science