Actor-critic methods with sparse rewards in model-free deep reinforcement learning typically require a deterministic binary reward function that reflects only two possible outcomes: whether, at each step, the goal has been achieved or not. Our hypothesis is that we can influence an agent to learn faster by applying an external environmental pressure during training, which adversely impacts its ability to get higher rewards. As such, we deviate from the classical paradigm of sparse rewards and add a uniformly sampled reward value to the baseline reward to show that (1) the sample efficiency of the training process can be correlated to the adversity experienced during training, (2) it is possible to achieve higher performance in less time and with fewer resources, (3) we can reduce the performance variability experienced seed over seed, (4) there is a maximum point after which more pressure will not generate better results, and (5) random positive incentives have an adverse effect when using a negative reward strategy, making an agent under those conditions learn poorly and more slowly. These results have been shown to be valid for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known MuJoCo environment, but we argue that they could be generalized to other methods and environments as well.
Keywords: deep reinforcement learning, random reward
The reward function in Deep Reinforcement Learning enables the agent to learn from the environment by providing feedback on the actions that it executes. Rewards can be dense or sparse. They can also originate from the environment (extrinsic) or be generated by the agent itself (intrinsic). When working with extrinsic sparse rewards in actor-critic architectures, the decision of whether to use positive or negative rewards is addressed early on in the engineering process. There are works that seek to answer the question of which one to use. For instance, HER has used negative sparse rewards in the past, leading to interesting results in robotics by forcing the robot to try to reach the goal as quickly as possible, while curiosity-driven experiments have used incremental intrinsic sparse rewards for achieving goals of increasing complexity. However, the magnitude of a single-task reward is not commonly addressed. After all, working with a value of -1 or -10, or any other negative number in this case, becomes an arbitrary reference point for the algorithm to understand that something undesirable is happening. So far, the premise of extrinsic sparse rewards has been to create a reward function that outputs one of two possible values or states in a deterministic fashion: one for when a desirable outcome is reached, and the other at any other time (any state that is not desirable automatically becomes undesirable). We propose the creation of a third state, applied only during training, that is meant to improve performance and sample efficiency. The third state, unlike the other two, is allowed to have a stochastic nature. When enabled, an additional constant reward, a bonus, will be uniformly sampled throughout an episode and may occur at one or multiple steps either when the goal has not been achieved, when the goal has been achieved, or both (Fig 1).
The significance of the third state and its effects on the learning process are the subject of this paper.
Figure 1: Left: normal deterministic reward policy of $-1$ for every time step in which the goal has not been achieved and 0 when it has. In this case, the cumulative expected reward is $-T/2$ for an episode of $T$ steps, assuming it takes half the time to complete the task. Right: the same agent receives additional uniformly sampled +1 rewards 50% of the time. Assuming the agent again takes half the time to complete the task, the expected cumulative reward becomes 0. This strategy in particular creates a learning problem. Throughout this paper many other strategies will be used that have the opposite effect.
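The arithmetic behind the two panels of Figure 1 can be checked with a short sketch. The function name and the episode length of 50 steps are illustrative assumptions, not values taken from the experiments, and the bonus is assumed to be eligible at every step of the episode (both stages).

```python
def expected_cumulative_reward(episode_steps, fraction_not_at_goal,
                               bonus=0.0, bonus_probability=0.0):
    """Expected episode return for a [-1, 0] sparse reward plus a random bonus.

    Assumption: the bonus can be sampled at every step of the episode,
    whether or not the goal has been achieved.
    """
    # Each step spent away from the goal contributes the baseline reward of -1.
    baseline = -1.0 * episode_steps * fraction_not_at_goal
    # Each step contributes the bonus with the given probability.
    expected_bonus = bonus * bonus_probability * episode_steps
    return baseline + expected_bonus


T = 50  # hypothetical episode length, for illustration only

# Left panel: deterministic [-1, 0] reward, half the episode spent off-goal.
left_panel = expected_cumulative_reward(T, 0.5)

# Right panel: same agent, +1 bonus sampled on 50% of the steps.
right_panel = expected_cumulative_reward(T, 0.5, bonus=1.0, bonus_probability=0.5)

print(left_panel, right_panel)
```

Under these assumptions the positive bonus cancels the expected penalty exactly, which is the learning problem described in the caption: on average, the return no longer distinguishes progress from failure.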
1.1 Partially Observable Markov Decision Processes
This work is partially inspired by the subject of Partially Observable Markov Decision Processes (POMDPs), in which the observation made by the agent cannot fully explain the reason a reward is being awarded at all times, and, to a certain extent, by the subject of Corrupt Reward Markov Decision Processes (CRMDPs), in which a reward function is corrupted by an external influence and additional algorithms are created to detect and recover the original unaltered signal in order to achieve a better learning experience. Our work is closer to the theory of POMDPs than CRMDPs; however, we are attempting neither to recover a Hidden Markov Model nor to recover the unaltered reward function. Instead, we seek to explore the agent’s behavior under conditions in which the reward function is not always consistent, or its magnitude is higher or lower during training. For this purpose, the uniform distribution plays a central role in ensuring that the signal cannot be learned and exploited by the agent. It also plays a central role in our ability to carry out the experiment and influence the agent into achieving a higher performance, mostly because we can increase the reward density more gradually and analyze the results.
1.2 Relationship to regularization in supervised learning
While it is not the main objective of this experiment, adding a stochastic component to the reward function may help create a more robust method to train model-free deep reinforcement learning agents, because the randomly sampled values create a barrier to overfitting to a specific policy that maximizes the reward function. This line of research is not new. Previous research has shown the benefits of adding noise to the action space to generalize better, or of using domain randomization, for instance in the field of computer vision, to enable robots to perform the bulk of their training in a simulated environment and yet perform well in real-life conditions. We believe that we achieve a similar effect with the nature of our experiment.
For the purpose of this paper, the terms robot and agent will be used interchangeably. At times we refer to pressure and adversity as a way to indicate that negative rewards are being used in addition to the deterministic reward generated by the environment. This additional reward will, in general, also be called a bonus reward or simply a bonus, and its value can be positive or negative. The value of the bonus reward is not random, but the sampling of it is. Both the value of the bonus reward and the frequency of the random sampling are deterministic hyperparameters that will be explained in later sections.
In simpler terms, the reward is composed of the original baseline reward and a bonus. In our experiments the original reward follows the approach given in the HER paper.
We employ variations of the FetchSlide environment originally created and made available by OpenAI, trained using Deep Deterministic Policy Gradients (DDPG) with the original version of Hindsight Experience Replay (HER) as made available within the OpenAI Baselines. The FetchSlide environment is ideal for our research as it is a complex environment with a very specific set of actions that must be taken in order for the agent to reach the goal and accumulate rewards. Therefore, modifying this environment in the way we describe below can be used to show the effects of random rewards on the learning process. The experiments were run on two different computers, each equipped with an Intel Core i7-8700K 3.70GHz 12 cores CPU, 16 GB of RAM, and a GeForce GTX 1080 Ti/PCIe/SSE2 GPU, running Ubuntu 16.04.6 LTS. For repeatability purposes, both servers ran the same version of the software, including Python 3.7.3 with TensorFlow 1.13.1 and MuJoCo 2.0. Each training sequence was executed on 2 cores for a total duration of 96,000 episodes. Testing sequences were 1,000 episodes long, and were run on pre-trained robots. We verified that experiments running the same seed generate the same results on all versions of the environment.
2.1 Description of the Environment
The FetchSlide environment was specifically modified to generate a different set of rewards. The original reward function could be expressed as:
$$R(s, ag, g) = \begin{cases} 0 & \text{if } ag \text{ is within the tolerance distance of } g \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
where ag is the achieved goal and g is the goal, both given as (x, y, z) coordinates. We introduce:
for which R is the set of original rewards, P is the set of probability levels to obtain the nominal reward without any random bonuses, B is the set of additional rewards, or bonuses, that the agent is subject to, in which B , and N is the stage within each episode for which the bonus B is applied. We define NG as the time span when the goal has not been achieved, G as the time span while the goal has been achieved, and B (both) as the union of the two, which is also equivalent to the entire duration of the episode. P, B, and N are all deterministic hyperparameters for the experiment. Table 1 shows the list of experiments that were run.
Table 1: List of experiments
The pertinent modifications to the environment are made to the function that computes the rewards. We first draw a value $x \sim U(0, 1)$ on each step, and generate the reward as $R$ if $x < P$, and as $R + B$ otherwise. Because of the way HER works, training typically occurs with a mix in which about 20% of the transitions keep the environmental reward and the remaining 80% are modified to have a reward of 0, as described by the HER algorithm. The mix ratio is a hyperparameter and can be modified, but for the purpose of our experiments it wasn’t. Over time, the actual expected value of the reward for each step during training becomes
Notice that the modified reward function calculation is only applied to extrinsic rewards, those coming from the environment, and not those modified using the HER portion of the algorithm. These HER transitions are essential for baseline learning and, from the point of view of the experiment, act as intrinsic rewards that serve the very specific purpose of learning from failed experiences, an outcome that we did not want to alter.
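The reward modification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names, the tolerance value of 0.05, and the use of Python's `random` module are all hypothetical, P is read as the probability of keeping the nominal reward, and HER-relabeled transitions are not modelled because they are left untouched.

```python
import random


def baseline_reward(distance_to_goal, tolerance):
    """Sparse [-1, 0] reward: 0 within the tolerance of the goal, -1 otherwise."""
    return 0.0 if distance_to_goal <= tolerance else -1.0


def training_reward(distance_to_goal, tolerance, p_nominal, bonus, stage,
                    rng=random):
    """Baseline reward plus a randomly sampled constant bonus (training only).

    p_nominal: probability of keeping the nominal reward (P in the text).
    bonus:     constant value added when the bonus is sampled (B in the text).
    stage:     "NG", "G", or "B" (both), i.e. where the bonus may apply (N).
    """
    reward = baseline_reward(distance_to_goal, tolerance)
    goal_reached = reward == 0.0
    in_stage = (
        stage == "B"
        or (stage == "G" and goal_reached)
        or (stage == "NG" and not goal_reached)
    )
    # Uniform draw on [0, 1): the bonus is added with probability 1 - p_nominal.
    if in_stage and rng.random() >= p_nominal:
        reward += bonus
    return reward


# Illustrative run: goal not reached, bonus of -5 on about half of the steps.
rng = random.Random(0)
samples = [
    training_reward(0.2, 0.05, p_nominal=0.5, bonus=-5.0, stage="NG", rng=rng)
    for _ in range(10_000)
]
mean_ng_reward = sum(samples) / len(samples)
print(mean_ng_reward)  # should land near -1 + 0.5 * (-5) = -3.5
```

With `p_nominal=0.0`, `bonus=-1.0` and `stage="B"`, the sketch yields -2 when the goal has not been reached and -1 when it has, which corresponds to a bonus of -1 being applied at every step.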
Putting equations 1 and 2 together we get . Because, as stated earlier, about 80% of the transitions sampled from the experience replay buffer are used for HER and the remaining 20% are used in accordance with the DDPG algorithm, we can calculate what the new average rewards for these states over time would be. When the goal has not been achieved (NG) and the standard reward is -1, the expected average value of the non-goal (NG) reward over time is . When the goal has been achieved (G) and the standard reward is 0, the expected average value over time for the goal reward is . This shows that the randomness of the reward function has much more significance in the NG stage than in the goal stage, as a result of the intrinsic incentives generated by HER sampling. This difference in significance also affects how the critic network in DDPG approximates the Q function, as this approximation will be affected separately for transitions that led to the goal and transitions that did not, which changes how the transitions in either stage are valued, and can either greatly help or hurt the actor’s policy.
Testing uses the unmodified version of the FetchSlide environment that only generates deterministic rewards. For comparability purposes, all training experiments were performed using a common random seed, while testing was done using 5 random seeds not used during training.
Figure 2: Relationship between rewards and success rate during testing. Robots trained with more punitive rewards tend to have a higher success rate and higher rewards during testing (upper right quadrant). Conversely, robots trained with more lenient rewards tend to perform worse during testing.
The results show that it is possible to modify the robot’s performance in a predictable way by applying our methodology during training. More negative sparse rewards induce better performance both in maximizing the accumulated reward (the objective function) and in maximizing the success rate. While these two, rewards and success rate, are typically correlated over time, for this experiment we can see how the correlation changes when modifying the configuration of the reward (fig. 4). Interestingly enough, there is a threshold point after which achieving better results becomes harder, and increasing the absolute value of the negative reward will no longer improve performance. The data also shows that there are several paths to achieve a higher performance. For instance, [100:-1:B] seems to generate one of the best results, but so does [50:-5:NG] or [30:-10:NG]. The reference point, shown in red (fig. 4), corresponds to the nominal HER results. A dotted line runs through it from the origin to show that the success rate was seen to improve, in general, more so than the rewards. Positive bonuses, on the other hand, perform worse. In a sense, this performance is expected. It’s important to remember that we are using a negative reward strategy [-1,0]. As such, positive bonuses should be understood as rewards that cancel out the effects of the underlying reward strategy to the point at which it is not clear whether an action is good or bad.
Figure 3: Upper left: the [xx:-1:B] series shows progressive improvement as the probability of having additional bonus rewards of -1 at each time step increases. Upper right: the [xx:+1:B] series shows the analogous effects for a positive reward; after the probability rises above 50%, the robot arguably stops learning. Lower left: the [xx:-5:NG] series shows what happens when the bonus is further reduced: the improvement stops and performance starts to slowly descend. Lower right: the [xx:-10:NG] series shows a similar pattern in which more negative rewards start to degrade performance, while at the same time showing that a few more negative rewards can have a similar effect to many less negative rewards.
The case for which the robot performs the worst is when the bonus is +1 (fig. 3 upper right), which makes sense as it would fully overlap with the original [-1,0] reward corresponding to the [NG:G] stages. This slowly transforms the rewards into a [0:+1] model without modifying the HER reward, which is fixed at 0. We argue that the learning process wouldn’t be as poor if we changed the HER reward to +1. That said, one of the best performances comes from using the opposite bonus of -1 (fig. 3 upper left), getting incrementally better as the probability of sampling an additional bonus reward on each time step goes higher. When the probability reaches 100% the reward set becomes [-2, -1] for the [NG, G] stages, and 0 for HER. When using an even lower bonus value of -5 (fig. 3 lower left), the success rate starts to increase as it did with -1, but at a probability level of around 50% it reaches a maximum (actually, the best result among all experiments) and then the performance starts to slowly deteriorate. For the purpose of this paper, we didn’t seek to find the exact bonus/probability combination at which we can obtain the best possible performance; instead, we were interested in understanding the general dynamics of the process. As such, we also performed an experiment with even lower bonus rewards, such as -10 (fig. 3 lower right). At this bonus level, the lowest probability we tried (30%) yielded the best result for the series and the second best for the entire project. This result gives credence to the statement we made before about there being several strategies to achieve peak performance, and one of them may very well be having a low probability at all times of getting a relatively lower reward for no good reason (random sampling). This result also suggests something that at this point should be obvious: having multiple levels of sparse rewards could be beneficial under certain conditions, even when they are not linked with specific environmental outcomes (bonus rewards are randomly sampled following a given uniform distribution). This detachment from the environment is important, since linking sparse reward levels to environmental outcomes would be akin to engineering rewards, which requires significant time and domain expertise.
Figure 4: Relationship between rewards and success rate during training. Robots trained under an environment that feeds more negative random rewards (upper left quadrant) tend to outperform those that trained under more positive random rewards (lower right quadrant). The red markers represent the baseline result from a comparable training process with no random rewards.
During testing, we run the trained models using the regular HER algorithm, absent any bonus rewards. It is only during training that the bonus rewards are applied (which is why in previous figures the minimum reward is -60 for each 60 time step episode). However, the training results do provide an interesting insight into what is going on. For instance, the top performers in terms of success rate during testing are also the top performers during training, even if they collect much lower rewards because of the conditions we artificially subject the agent to (fig. 4). Also, the success rate is much lower during training because it averages the entire learning process from zero. That being said, the data does suggest that there is a maximum performance that can be achieved for this specific agent, environment, hyperparameters, and other conditions, physical or otherwise, associated with our test bed. That peak performance can be achieved when the average reward per episode during training is around -200. This value is irrelevant in absolute terms, but it is indicative of how much performance we can expect to extract.
From these experiments we conclude that it is possible to improve the agent’s performance and sample efficiency by structuring a partially deterministic sparse reward function such as the ones we applied in this paper. By doing so, we would more closely approximate the performance boundaries the agent in that environment is capable of reaching, for any given seed.
4.1 Future work
We didn’t analyze the effects of having multiple bonus levels at multiple probabilities at the same time. Such a setup would more closely resemble real life, in which the environment is not always in agreement about what reward should be given at any given time. As valuable as that would be, such a test would require a much larger test bed and more computing resources to draw conclusions about the right combination of factors. We would, however, expect to see a change in performance similar to what we have shown in our experiments. We also anticipate that performance deterioration would be caused by receiving random bonus rewards with a magnitude that is too high, or bonus rewards that make the distinction between when the goal has been achieved and when it hasn’t less clear. Replicating these results on other model-free deep reinforcement learning algorithms is also something we can see happening in the future.
 Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017.
 John B. Lanier, Stephen McAleer, and Pierre Baldi. Curiosity-driven multi-criteria hindsight experience replay, 2019.
 Geoff Hollinger. Partially observable Markov decision processes (POMDPs). Aug 2007.
 Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Aug 2017.
 Jason Mancuso, Tomasz Kisielewski, David Lindner, and Alok Singh. Detecting spiky corruption in Markov decision processes, 2019.
 OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018.
 Xinyi Ren, Jianlan Luo, Eugen Solowjow, Juan Aparicio Ojea, Abhishek Gupta, Aviv Tamar, and Pieter Abbeel. Domain randomization for active pose estimation. 2019 International Conference on Robotics and Automation (ICRA), May 2019.
 Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, and Wojciech Zaremba. Multi-goal reinforcement learning: Challenging robotics environments and request for research. CoRR, abs/1802.09464, 2018.
 Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
 Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines, 2017.
 R. Liu and J. Zou. The effects of memory replay in reinforcement learning. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 478–485, Oct 2018.
 Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, DTIC Document, 1993.
 David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
 Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 Jacob Rafati and David C. Noelle. Learning sparse representations in reinforcement learning, 2019.
Identifying the combination of parameters that achieves the best possible results for a robot is important. In the case of our experiments, we show how the combination of stage, probability and bonus affects the average outcome of the experiments (fig SI-1). The maximum performance is achieved when fully offsetting the rewards by -1 (0:-1:NG), applying various combinations of random rewards of value -5, or very sparsely applying random rewards of -10 (70:-10:NG). At the same time, positive rewards should almost entirely be avoided, especially fully offsetting the reward by +1 (0:+1:B or 0:+1:NG), at which point the robot will encounter serious problems learning a working policy. The full table of results is also presented below (Table SI-1).
Figure SI-1: Average reward per experiment conducted clearly shows the performance difference between the different configurations. The naming convention is P:B:N, in which the P represents the percentage of rewards that maintain the original deterministic reward, B represents the numerical amount that is added to the deterministic reward, and N represents whether for each episode the additional reward B applies to when the robot has not achieved the goal (NG), has achieved it (G), or both (B). Bars reflect the standard deviation.
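The P:B:N naming convention described in the caption can be made concrete with a small parser. This is a sketch only: the `RewardConfig` class and `parse_label` function are hypothetical names introduced for illustration, and P is read as in the caption, that is, the percentage of rewards that keep the original deterministic value.

```python
from dataclasses import dataclass

VALID_STAGES = {"NG", "G", "B"}


@dataclass(frozen=True)
class RewardConfig:
    p_nominal: float  # fraction of rewards that keep the deterministic value (P)
    bonus: float      # amount added to the deterministic reward (B)
    stage: str        # "NG", "G", or "B" for both (N)


def parse_label(label):
    """Parse an experiment label such as "70:-10:NG" or "(0:+1:B)"."""
    # Drop surrounding whitespace and any enclosing brackets or parentheses.
    cleaned = label.strip().strip("[]()")
    p_text, bonus_text, stage = cleaned.split(":")
    if stage not in VALID_STAGES:
        raise ValueError(f"unknown stage {stage!r} in label {label!r}")
    return RewardConfig(float(p_text) / 100.0, float(bonus_text), stage)


config = parse_label("70:-10:NG")
print(config)
```

Read this way, the label 70:-10:NG keeps the deterministic reward on 70% of the steps and adds -10 on the remaining 30%, but only while the goal has not been achieved.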