Reinforcement Learning with Probabilistically Complete Exploration

2020·Arxiv

ABSTRACT

ABSTRACT

Balancing exploration and exploitation remains a key challenge in reinforcement learning (RL). State-of-the-art RL algorithms suffer from high sample complexity, particularly in the sparse reward case, where they can do no better than to explore in all directions until the first positive rewards are found. To mitigate this, we propose Rapidly Randomly-exploring Reinforcement Learning (R3L). We formulate exploration as a search problem and leverage widely-used planning algorithms such as Rapidly-exploring Random Tree (RRT) to find initial solutions. These solutions are used as demonstrations to initialize a policy, then refined by a generic RL algorithm, leading to faster and more stable convergence. We provide theoretical guarantees of R3L exploration finding successful solutions, as well as bounds for its sampling complexity. We experimentally demonstrate the method outperforms classic and intrinsic exploration techniques, requiring only a fraction of exploration samples and achieving better asymptotic performance.

1 INTRODUCTION

Reinforcement Learning (RL) studies how agents can learn a desired behaviour by simply using interactions with an environment and a reinforcement signal. Central to RL is the long-standing problem of balancing exploration and exploitation. Agents must first sufficiently explore the environment to identify high-reward behaviours, before this knowledge can be exploited and refined to maximize long-term rewards. Many recent RL successes have been obtained by relying on well-formed reward signals, that provide rich gradient information to guide policy learning. However, designing such informative rewards is challenging, and rewards are often highly specific to the particular task being solved. Sparse rewards, which carry little or no information besides binary success or failure, are much easier to design. This simplicity comes at a cost; most rewards are identical, so that there is little gradient information to guide policy learning. In this setting, the sample complexity of simple exploration strategies was shown to grow exponentially with state dimension in some cases (Osband et al., 2016b). Intuition behind this phenomenon can be gained by inspecting Figure 1a: exploration in regions where the return surface is flat leads to a random walk type search. This inefficient search continues until non-zero gradients are found, which can then be followed to a local optimum.

Planning algorithms can achieve much better exploration performance than random walk by taking search history into account (Lavalle, 1998). These techniques are also often guaranteed to find a solution in finite time if one exists (Karaman & Frazzoli, 2011). In order to leverage the advantages of these methods, we formulate RL exploration as a planning problem in the state space. Solutions found by search algorithms are then used as demonstrations for RL algorithms, initializing them in regions of policy parameter space where the return surface is not flat. Figure 1b shows the importance of such good initialization; surface gradients can be followed, which greatly facilitates learning.

This paper brings the following contributions. We first formulate RL exploration as a planning problem. This yields a simple and effective method for automatically generating demonstrations without the need for an external expert, solving the planning problem by adapting the classic

Figure 1: Expected returns achieved by linear policy with 2 parameters on Sparse MountainCar domain (background). Gradient is 0 in the dark blue area. Trajectories show the evolution of policy parameters over 1000 iterations of TRPO, with 5 random seeds. Same colors indicate the same random seeds. (a) Random-walk type behaviour observed when parameters are initialized using Glorot initialization (Glorot & Bengio, 2010). (b) Convergence observed when parameters are initialized in a region with gradients (1, 40).

Rapidly-exploring Random Tree algorithm (RRT) (Kuffner & LaValle, 2000). The demonstrations are then used to initialize an RL policy, which can be refined with a classic RL method such as TRPO (Schulman et al., 2015). We call the proposed method Rapidly Randomly-exploring Reinforcement Learning (R3L)1, provide theoretical guarantees for finding successful solutions and derive bounds for its sampling complexity. Experimentally, we demonstrate R3L improves exploration and outperforms classic and recent exploration techniques, and requires only a fraction of the samples while achieving better asymptotic performance. Lastly, we show that R3L lowers the variance of policy gradient methods such as TRPO, and verify that initializing policies in regions with rich gradient information makes them less sensitive to initial conditions and random seed.

The paper is structured as follows: Section 2 analyzes the limitations of classic RL exploration. Section 3 describes R3L and provides theoretical exploration guarantees. Related work is discussed in Section 4, followed by experimental results and comments in Section 5. Finally, Section 6 concludes and gives directions for future work.

2 SPARSE-REWARD RL AS RANDOM WALK

Many recent RL methods are based on a policy gradient optimization scheme. This approach optimizes policy parameters with gradient descent, using a loss function (e.g. expected return) and gradient . Since computing exactly is intractable, it is common to use unbiased empirical estimators , estimated from samples acquired by executing the policy. Optimization of then follows the common stochastic gradient descent (SGD) update-rule (Bottou, 2010; Robbins & Monro, 1951): is the learning rate.

The SGD update rule defines a discrete-time stochastic process (Mandt et al., 2017). Note that is the mean of i.i.d. samples. Following the central limit theorem, the distribution over is approximately is an unbiased estimator of g with covariance . Consequently, the update rule can be rewritten as (Mandt et al., 2017):

Here we assume that , i.e. approximately constant w.r.t. , and factorizes as

SGD is efficient for high-dimensional problems as it offers almost dimension independent convergence rates (Nesterov, 2018). However, SGD requires non-zero gradients to guide the search towards the optimum . In the case of sparse-reward RL problems, such as in Figure 1, much of the loss surface is flat. This leads to inefficient exploration of parameter space as the drift component in Eq. (1) , turning the SGD to a random walk in Random walk is guaranteed to wander to infinity when dimensionality (Pólya, 1921; Kakutani, 1944). However, the probability of it reaching a desired region in , e.g. where , depends heavily on problem specific parameters. The probability of ever reaching a sphere of radius r centered at is (Dvoretzky & Erd˝os, 1951):

In sparse RL problems r < R, thus the probability of reaching a desired region by random walk is smaller than 1, i.e. there are no guarantees of finding any solution, even in infinite time. This is in stark contrast with the R3L exploration paradigm, as discussed in Section 3.5.

3 R3L: RAPIDLY AND RANDOMLY-EXPLORING REINFORCEMENT LEARNING

R3L adapts RRT to the RL setting by formulating exploration as a planning problem in state space. Unlike random walk, RRT encourages uniform coverage of the search space and is probabilistically complete, i.e. guaranteed to find a solution (Kleinbort et al., 2019).

R3L is decomposed into three main steps: (i) exploration is first achieved using RRT to generate a data-set of successful trajectories, described in Sections 3.2 and 3.3, (ii) successful trajectories are converted to a policy using learning from demonstrations (Section 3.4), and (iii) the policy is refined using classic RL methods.

3.1 DEFINITIONS

This work is based on the Markov Decision Process (MDP) framework, defined as a tuple are spaces of states s and actions a respectively. a transition probability distribution so that , where the subscript t indicates the discrete timestep. is a reward function defining rewards associated with transitions is a discount factor. Solving a MDP is equivalent to finding the optimal policy maximizing the expected return for some time horizon H, where actions are chosen according to . Lastly, let S be a Euclidean space, equipped with the standard Euclidean distance metric with an associated norm denoted by . The space of valid states is denoted by

3.2 EXPLORATION AS A PLANNING PROBLEM WITH RRT

The RRT algorithm (LaValle & Kuffner, 2001) provides a principled approach for planning in problems that cannot be solved directly (e.g. using inverse kinematics), but where it is possible to sample transitions. RRT builds a tree of valid transitions between states in F, grown from a root As such, the tree T maintains information over valid trajectories. The exploration problem is defined by the pair . In RL environments with a known goal set (e.g. MountainCar), the exploration problem is defined by

The RRT algorithm starts by sampling a random state , used to encourage exploration in a specific direction in the current iteration. This necessitates the first of two assumptions.

Assumption 1. Random states can be sampled uniformly from the MDP state space S.

Sampled states are not required to be valid, thus sampling a random state is typically equivalent to trivially sampling a hyper-rectangle.

Then, the vertex is found. RRT attempts to expand the tree T from toward by sampling an action according to a steering function . In many planning scenarios, samples randomly from A. A forward step is then simulated from using action a, where f is defined by the transition dynamics. Being able to expand the tree from arbitrary relies on another assumption.

Assumption 2. The environment state can be set to a previously visited state

Although this assumption largely limits the algorithm to simulators, it has previously been used in Florensa et al. (2017); Nair et al. (2018); Ecoffet et al. (2019) for example; see discussion in Section 6 on overcoming the limitation to simulated environments.

is added as a new vertex of T, alongside an edge with edge information a. This process repeats until a sampling budget k is exhausted or the goal set is reached (i.e.

Figure 2: Example of R3L exploration on sparse MountainCar. Green segments are sampled transitions, executed in simulation. A successful solution found by R3L is displayed in red. State dimensions are normalized to

Definition 1. A valid trajectory is a sequence such that is a valid transition and . Whenever a goal set is defined, a successful valid trajectory end state satisfies

Once planning is finished, a successful valid trajectory can easily be generated from T by retrieving all nodes between the leaf and the root. Because tree T is grown from valid transitions between states in F, these trajectories are valid by construction.

3.3 R3L EXPLORATION: ADAPTING RRT TO RL

RRT is not directly applicable to RL problems. This subsection presents the necessary adaptations to the exploration phase of R3L, summed up in Algorithm 1. Figure 2 shows R3L’s typical exploration behaviour.

Local policy learning: In classic planning problems, selecting actions extending towards is easy, as these have a geometric interpretation or there is a known steering function. RL state spaces do not benefit from the same geometric properties, and properly selecting actions can be challenging. We solve this problem by defining a local policy , which models the distribution of actions to transition from a state to a neighbouring goal state. Actions to extend a tree branch are sampled as . We formulate the problem of learning as supervised learning, where inputs are starting states augmented with the difference , and targets are actions. The model is learned using transition data collected from previous tree expansion iterations. Results in this paper use Bayesian linear regression to represent , but any supervised learning model can applied instead.

Unknown dynamics: RRT was designed for problems with known continuous dynamics f, but RL features unknown discrete transition dynamics. In R3L, f is replaced with an environment interaction from , with selected action a, resulting in a new state . Since is a real transition, it must be valid, and can be added to the tree T.

Biasing search with Better exploration efficiency can be achieved if goal information is available. Indeed, the RRT framework allows for biasing exploration towards , often resulting in faster exploration. This is achieved by sampling instead of F with low probability , while the rest of the iteration remains unchanged.

Since R3L uses RRT to explore, the algorithm is most suitable for RL problems that are fully observable, exhibit sparse rewards and have continuous controls. R3L is applicable to other RL problems, but may not perform as well as methods tailored to specific cases.

3.4 POLICY INITIALIZATION FROM R3L DEMONSTRATIONS

Upon completion, R3L exploration yields a successful trajectory , which may not be robust to various starting conditions and/or stochastic transitions. Converting successful trajectories into a policy is crucial to achieve robustness and enable further refinement with RL.

Policy initialization is applied to a set of successful trajectories generated using N runs of R3L exploration with different starting conditions. An imitation policy is learned by supervised learning on transitions from is then refined using traditional RL algorithms like TRPO. As shown in Figure 1, initializing policy parameters in the vicinity of a local optimum is crucial.

3.5 EXPLORATION GUARANTEES

The RL planning environment defines differential constraints of the form:

Therefore, starting at , the trajectory can be generated by forward integrating Eq. (3) with the applied actions. As with many RL problems, a(t) is time-discretized resulting in a piecewise constant control function. This means is constructed of segments of fixed time duration such that the overall trajectory duration is defined as and . Furthermore, as all transitions between states in are known, the trajectory return can be defined as

R3L explores in state-action/trajectory space instead of policy parameter space. Furthermore, it is an effective exploration framework which provides probabilistic completeness (PC):

Definition 2. A probabilistically complete planner finds a feasible solution (if one exists) with a probability approaching 1 in the limit of infinite samples.

With the aforementioned dynamic characteristics, we prove that R3L exploration under the RL setting is PC. This is in stark contrast to the random walk exploration process, discussed in section 2, which is not PC. We begin with the following theorem, a modification of Theorem 2 from Kleinbort et al. (2019), which is applied to kinodynamic RRT where a goal set is defined.

Theorem 1. Suppose that there exists a valid trajectory as defined in definition 1, with a corresponding piecewise constant control. The probability that R3L exploration fails to reach iterations is bounded by , for some constants a, b > 0.

The proof, which is a modification of Theorem 2 from Kleinbort et al. (2019), can be found in Appendix S2. It should be noted that R3L exploration does not require an explicit definition for in order to explore the space. While in some path planning variants of RRT, is used to bias sampling, the main purpose of is to indicate that a solution has been found. Therefore, can be replaced by another implicit success criterion. In the RL setting, this can be replaced by a return-related criterion.

Theorem 2. Suppose that there exists a trajectory with a return . The probability that R3L exploration fails to find a valid trajectory from iterations is bounded by , for some constants

Proof. The proof is straightforward. We augment each state in with the return to reach it from

where . For consistency we modify the distance metric by simply adding a reward distance metric. With the above change in notation, we modify the goal set to , such that there is an explicit criterion for minimal return as a goal. Consequently, the exploration problem can be written for the augmented representation as . Theorem 1 satisfies that R3L exploration can find a feasible solution to this problem within finite time, i.e. PC, and therefore the probability of not reaching after k iterations is upper-bounded by the exponential term , for some constants

We can now state our main result on the sampling complexity of the exploration process.

Theorem 3. If trajectory exploration is probabilistically complete and satisfies an exponential convergence bound, the expected sampling complexity is finite and bounded such that

where

Proof. Theorem 2 provides an exponential bound for the probability the planner will fail in finding a feasible path. Hence, we can compute a bound for the expected number of iterations needed to find a solution, i.e. sampling complexity:

where we used the relation

It is worth noting that while the sample complexity is bounded, the above result implies that the bound varies according to problem-specific properties, which are encapsulated in the value of . Intuitively, depends on the scale of the problem. It grows as becomes smaller or as the length of the solution trajectory becomes longer. depends on the probability of sampling states that will expand the tree in the right direction. It therefore shrinks as the dimensionality of S increases. We refer the reader to Appendix S2 for more details on the meaning of and the derivation of the tail bound in Theorem 1.

4 RELATED WORK

Exploration in RL has been extensively studied. Classic techniques typically rely on adding noise to actions (Mnih et al., 2015; Schulman et al., 2015) or to policy parameters (Plappert et al., 2018). However, these methods perform very poorly in settings with sparse rewards.

Intrinsic motivation tackles this problem by defining a new reward to direct exploration. Many intrinsic reward definitions were proposed, based on information theory (Oudeyer & Kaplan, 2008), state visit count (Lopes et al., 2012; Bellemare et al., 2016; Szita & L˝orincz, 2008; Fox et al., 2018), value function posterior variance (Osband et al., 2016a; Morere & Ramos, 2018), or model prediction error (Stadie et al., 2015; Pathak et al., 2017). Methods extending intrinsic motivation to continuous state and action spaces were recently proposed (Houthooft et al., 2016; Morere & Ramos, 2018). However, these approaches are less interpretable and offer no guarantees for the exploration of the state space.

Offering exploration guarantees, Bayesian optimization was adapted to RL in Wilson et al. (2014), to search over the space of policy parameters in problems with very few parameters. Recent work extends the method to functional policy representations (Vien et al., 2018), but results are still limited to toy problems and specific policy model classes.

Motion planning in robotics is predominantly addressed with sampling-based methods. This type of approach offers a variety of methodologies for exploration and solution space representation (e.g., Probabilistic roadmaps (PRM) (Kavraki et al., 1996), Expansive space trees (ESP) (Hsu et al., 1997) and Rapidly-exploring random tree (RRT) (Kuffner & LaValle, 2000)), which have shown excellent performance in path planning in high-dimensional spaces under dynamic constraints (LaValle & Kuffner, 2001; Hsu et al., 2002; Kavraki et al., 1996).

RL was previously combined with sampling-based planning to replace core elements of planning algorithms, such as PRM’s point-to-point connection (Faust et al., 2018), local RRT steering function (Chiang et al., 2019) or RRT expansion policy (Chen et al., 2019). In contrast, the proposed method bridges the gap in the opposite direction, employing a sampling-based planner to generate demonstrations that kick-start RL algorithms and enhance their performance.

Table 1: Impact of learning local policy and biasing search towards with probability R3L exploration. Results show the mean and standard deviation of successful trajectory length and number of timesteps required, computed over 20 runs.

Accelerating RL by learning from demonstration was investigated in Niekum et al. (2015); Bojarski et al. (2016); Torabi et al. (2018). However, these techniques rely on user-generated demonstrations or a-priori knowledge of environment parameters. In contrast, R3L automatically generates demonstrations, with no need of an external expert.

5 EXPERIMENTS

In this section, we investigate (i) how learning a local policy and biasing search towards probability affects R3L exploration, (ii) whether separating exploration from policy refinement is a viable and robust methodology in RL, (iii) whether R3L reduces the number of exploration samples needed to find good policies, compared with methods using classic and intrinsic exploration, and (iv) how R3L exploration can reduce the variance associated with policy gradient methods. All experiments make use of the Garage (Duan et al., 2016) and Gym (Brockman et al., 2016) frameworks. The experimental setup features the following tasks with sparse rewards: Cartpole Swingup (MountainCar ((), Reacher () Fetch Reach (), and Hand Reach (). The exact environment and reward definitions are described in Appendix S3.

R3L exploration analysis We first analyze the exploration performance of R3L in a limited set of RL environments, to determine the impact that learning policy has on exploration speed. We also investigate whether R3L exploration is viable in environments where no goal information is available. Table 1 shows the results of this analysis. Learning seems to greatly decrease the number of exploration timesteps needed on most environments. However, it significantly increases the number of timesteps on the acrobot and reacher environments. Results also suggest that learning helps R3L to find shorter trajectories on the same environments, which is a desirable property in many RL problems. Biasing R3L exploration towards the goal set helps finding successful trajectories faster, as well as reducing their length. However, R3L exploration without goal bias is still viable in all cases. Although goal information is not given in the classic MDP framework, it is often available in real-world problems and can be easily utilized by R3L. Lastly, successful trajectory lengths have low variance, which suggests R3L finds consistent solutions.

Comparison to classic and intrinsic exploration on RL benchmarks We examine the rates at which R3L learns to solve several RL benchmarks, and compare them with state-of-the-art RL algorithms. Performance is measured in terms of undiscounted returns and aggregated over 10 random seeds, sampled at random for each environment. We focus on domains with sparse rewards, which are notoriously difficult to explore for traditional RL methods. Our experiments focus on the widely-used methods TRPO (Schulman et al., 2015) and DDPG (Lillicrap et al., 2015). R3L-TRPO and R3L-DDPG are compared to the baseline algorithms with Gaussian action noise. As an additional baseline we include VIME-TRPO (Houthooft et al., 2016). VIME is an exploration strategy based on maximizing information gain about the agent’s belief of the environment dynamics. It is included to show that R3L can improve on state-of-the-art exploration methods as well as naive ones, even though the return surface for VIME-TRPO is no longer flat, unlike Figure 1. The exact experimental setup is described in Appendix S3.2. The R3L exploration phase is first run to generate training trajectories for all environments. The number of environment interactions during this phase is accounted for in the results, displayed as an offset with a vertical dashed black line. The average performance achieved

Figure 3: Results for classic control tasks, comparing our proposed method (R3L-TRPO/DDPG), vanilla TRPO/DDPG, and VIME-TRPO. Trendlines are the medians and shaded areas are the interquartile range, taken over 10 randomly chosen seeds. Also shown is the average undiscounted return of successful trajectories generated with R3L exploration. The dashed offset at the start of R3L-TRPO/DDPG reflects the number of timesteps spent on R3L exploration.

by these trajectories is also reported as a guideline, with the exception of Cartpole Swingup where doing so does not make sense. RL is then used to refine a policy pretrained with these trajectories.

Figure 3 shows the median and interquartile range for all methods. R3L is very competitive with both vanilla and VIME baselines. It converges faster and achieves higher performance at the end of the experiment. In most cases, the upper quartile for our method begins well above the minimum return, indicating that R3L exploration and pre-training are able to produce successful though not optimal policies. R3L-DDPG performance, for the majority of problems, starts significantly above the minimum return, plunges due to the inherent instability of DDPG, but eventually recovers, indicating that R3L pre-training can help mitigate the instability. It is worth noting that R3L’s lower quartile is considerably higher than that of baselines. Indeed, for many of the baselines the lower quartile takes a long time to improve on the minimum return, and in some cases it never manages to do so at all. This is a common problem in sparse reward RL, where there is no alternative but to search the parameter space randomly until the first successful trajectory is found, as explained in Section 2. While a few random seeds will by chance find a successful trajectory quickly (represented by the quickly rising upper quartile), others take a long time (represented by the much slower rise of the median and lower quartile). In other words, R3L-TRPO/DDPG is much more robust to random policy initialization and to the random seed than standard RL methods. This is because R3L is able to use automatically generated demonstrations to initialize RL policy parameters to a region with informative return gradients.

6 CONCLUSION

We proposed Rapidly Randomly-exploring Reinforcement Learning (R3L), an exploration paradigm for leveraging planing algorithms to automatically generate successful demonstrations, which can be converted to policies then refined by classic RL methods. We provided theoretical guarantees of R3L finding solutions, as well as sampling complexity bounds. Empirical results show that R3L outperforms classic and intrinsic exploration techniques, requiring only a fraction of exploration samples and achieving better asymptotic performance.

As future work, R3L could be extended to real-world problems by leveraging recent advances on bridging the gap between simulation and reality (Peng et al., 2018). Respecting Assumption 2, a policy would first be trained on a simulator and then transferred to the real-world. Exploration in high-dimensional tasks is also challenging as stated in Theorem 3 and confirmed experimentally by increased R3L exploration timesteps. Exploiting external prior knowledge and/or the structure of the problem can benefit exploration in high-dimensional tasks, and help make R3L practical for problems such as Atari games. Lastly, recent advances in RRT (Chiang et al., 2019) and learning from demonstration (Torabi et al., 2018) could also improve R3L.

REFERENCES

R. Arratia and L. Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 1989.

M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016.

M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Ziebaand. End to end learning for self-driving cars. NIPS Deep Learning Symposium, 2016.

L.N. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, 2010.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

B. Chen, B. Dai, and L. Song. Learning to plan via neural exploration-exploitation trees. arXiv:1903.00070, 2019.

H. Lewis Chiang, J. Hsu, M. Fiser, L. Tapia, and A. Faust. RL-RRT: Kinodynamic motion planning via learning reachability estimators from RL policies. Robotics and Automation Letters, 4, 2019.

Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016.

A. Dvoretzky and P. Erd˝os. Some problems on random walk in space. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1951.

A. Ecoffet, J Huizinga, J Lehman, K. O. Stanley, and J. Clune. Go-explore: a new approach for hard-exploration problems. arXiv:1901.10995, 2019.

A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson. PRM-RL: Long- range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In International Conference on Robotics and Automation, 2018.

C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, 2017.

L. Fox, L. Choshen, and Y. Loewenstein. DORA the explorer: Directed outreaching reinforcement action-selection. In International Conference on Learning Representations, 2018.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.

R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

D. Hsu, J.C. Latombe, and R. Motwani. Path planning in expansive configuration spaces. In International Conference on Robotics and Automation, 1997.

D. Hsu, R. Kindel, J.C. Latombe, and S. Rock. Randomized kinodynamic motion planning with moving obstacles. The International Journal of Robotics Research, 2002.

S. Kakutani. On brownian motions in n-space. Proceedings of the Imperial Academy, 1944.

S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 2011.

L. E. Kavraki, P. Svestka, J. C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. Transactions on Robotics and Automation, 1996.

M. Kleinbort, K. Solovey, Z. Littlefield, K. E. Bekris, and D. Halperin. Probabilistic completeness of RRT for geometric and kinodynamic planning with forward propagation. Robotics and Automation Letters, 2019.

J. J. Kuffner and S. M. LaValle. RRT-connect: An efficient approach to single-query path planning. In International Conference on Robotics and Automation, 2000.

S. M. Lavalle. Rapidly-exploring random trees: A new tool for path planning. Technical report, Department of Computer Science. Iowa State University., 1998.

S. M. LaValle and J. J. Kuffner. Randomized kinodynamic planning. The International Journal of Robotics Research, 2001.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, et al. Continuous control with deep reinforcement learning. arXiv:1509.02971, 2015.

M. Lopes, T. Lang, M. Toussaint, and P.Y. Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems, 2012.

S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 18, 2017.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Ried- miller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

P. Morere and F. Ramos. Bayesian RL for goal-only rewards. In Conference on Robot Learning, 2018.

A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In International Conference on Robotics and Automation, 2018.

Y. Nesterov. Lectures on Convex Optimization. Springer, 2018.

S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, et al. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34, 2015.

I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016a.

I. Osband, B. Van Roy, and Z. Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, 2016b.

P. Y. Oudeyer and F. Kaplan. How can we define intrinsic motivation? In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, 2008.

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In International Conference on Robotics and Automation, 2018.

M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. In International Conference on Learning Representations, 2018.

G. Pólya. Über eine aufgabe der wahrscheinlichkeitsrechnung betreffend die irrfahrt im straßennetz. Mathematische Annalen, 1921.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2008.

H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.

J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.

B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814, 2015.

I. Szita and A. L˝orincz. The many faces of optimism: a unifying approach. In International Conference on Machine learning, 2008.

F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. In International Joint Conference on Artificial Intelligence, 2018.

C. Urmson and R. Simmons. Approaches for heuristically biasing RRT growth. In International Conference on Intelligent Robots and Systems, 2003.

N. A. Vien, H. Zimmermann, and M. Toussaint. Bayesian functional optimization. In AAAI Conference on Artificial Intelligence, 2018.

A. Wilson, A. Fern, and P. Tadepalli. Using trajectory data to improve bayesian optimization for reinforcement learning. The Journal of Machine Learning Research, 2014.

Reinforcement Learning with Probabilistically Complete Exploration: Supplementary Material

In this section, we provide pseudo-code of the classic RRT algorithm.

This appendix proves Theorem 1 which shows that planning using RRT under differential constraints is probabilistically complete. The following proof is a modification of Theorem 2 from Kleinbort et al. (2019), where completeness of RRT in the RL setting is maintained without the need to explicitly sample a duration for every action.

Equation 3 defines the environment’s differential constraints. In practice, Eq. (3) is approximated by an Euler integration step. With the interval divided into equal time intervals of duration h with . Eq. (3) is then approximated by an Euler integration step, where the transition between consecutive time steps is given by:

Furthermore, we define as a ball with a radius r centered at s for any given state

We assume that the planning environment is Lipschitz continuous in both state and action, constraining the rate of change of Eq. (S1). Formally, there exists two positive constants , such that

Lemma 1. For two trajectories , where and such that with a positive constant. Suppose that for each trajectory a piecewise constant action is applied, so that and is fixed during a time period . Then

The proof for Lemma 1 is given in Lemma 2 of (Kleinbort et al., 2019). Intuitively, this bound is derived from compounding worst-case divergence between and at every Euler step along T which leads to an overall exponential dependence.

Using Lemma 1, we want to provide a lower bound on the probability of choosing an action that will expand the tree successfully. We note that this scenario assumes that actions are drawn uniformly from A, i.e. there is no steering function 2. When better estimations of the steering function are available, as described in 3.3, the performance of RRT significantly improves.

Definition 3. A trajectory is defined as -clear if for for all

Lemma 2. Suppose that is a valid trajectory from to with a duration of and a clearance of . Without loss of generality, we assume that actions are fixed for all such that

Suppose that RRT expands the tree from a state to a state and we can define the following bound:

Here, is the Lebesgue measure for a unit circle in S.

Proof. We denote a trajectory that starts from and is expanded with an applied random action . According to Lemma 1,

where . Now, we want to find such that the distance between the goal points of these trajectories, i.e. in the worst-case scenario, is bounded:

After rearranging this formula, we can obtain a bound for

Assuming that is drawn out of a uniform distribution, the probability of choosing the proper action is

where is used to account for the degeneracy in action selection due to the dimensionality of S. We note that guarantees , thus a valid probability>

Equation S4 provides a lower bound for the probability of choosing the suitable action. The following lemma provides a bound on the probability of randomly drawing a state that will expand the tree toward the goal.

Lemma 3. Let be a state with clearance , i.e. . Suppose that for an RRT tree T there exist a vertex such that . Following the definition in Section 3.2, we denote as the closest vertex to . Then, the probability that is at least

Proof. Let . Therefore the distance between and v is upper-bounded by . If there exists a vertex such that and , then . Hence, by choosing , we are guaranteed . As is drawn uniformly, the probability for is

We can now prove the main theorem.

Theorem 1. Suppose that there exists a valid trajectory as defined in definition 1, with a corresponding piecewise constant control. The probability that RRT fails to reach iterations is bounded by , for some constants a, b > 0.

Proof. Lemma 2 puts bound on the probability to find actions that expand the tree from one state to another in a given time. As we assume that a valid trajectory exists, we can assume that the probability defined in Lemma 2 is non-zero, i.e.

where we set and as was also done in (Kleinbort et al., 2019). We additionally require that , which is typically defined as an RL environment parameter, is chosen accordingly so to ensure that Eq. (S5) holds, i.e.

We cover with balls of radius . The balls are spaced equally in time with the center of the ball is in , where . Therefore, and . We now examine the probability of RRT propagating along . Suppose that there exists a vertex , we need to bound the probability p that by taking a random sample , there will be a vertex such that . Lemma 3 provides a lower bound for the probability that , given that there exists a vertex , of . Lemma 2 provide a lower bound for the probability of choosing an an action from

. Consequently,

For RRT to recover , the transition between consecutive circles must be repeated m times. This stochastic process can be described as a binomial distribution, where we perform k trials (randomly choosing ), with m successes (transition between circles) and a transition success probability p. The probability mass function of a binomial distribution is Pr(X = m) = Pr(m; k, p) =

for failure, i.e. the process was unable to perform m steps, which can be expressed as:

Using Chernoff’s inequality we derive the tail bounds of the CDF when

In the other case, where p < m/k < 1, the upper bound is given by (Arratia & Gordon, 1989):

where D is the relative entropy such that

where (S14) is justified for worst-case scenario where p = m/k, (S15) uses the fact that p < m/k < . The last step, (S16) is derived from the first term of the Taylor expansion of

As p and m are fixed and independent of k, we show that the expression for Pr(X < m) decays to zero exponentially with k, therefore RRT is probabilistically complete.

It worth noting that as expected the failure probability Pr(X < m) depends on problem-specific properties, which give rise to the values of a and b. Intuitively, a depends on the scale of the problem such as volume of the goal set and how complex and long the solution needs to be, as evident in Eq. (S9). More importantly, b depends on the probability p. Therefore, it is a function of the dimensionality of S (through the probability of sampling ) and other environment parameters such as clearance (defined by ) and dynamics (via ), as specified in Eq. (S4).

All experiments were run using a single 2.2GHz core and a GeForce GTX 1080 Ti GPU.

All environments are made available in supplementary code. Environments are based on Gym (Brock- man et al., 2016), with modified sparse reward functions and state spaces. All environments emit a reward per timestep unless noted otherwise. The environments have been further changed from Gym as follows:

cart position, is cart linear velocity, is pole angle (measuring from the angular velocity. Actions are force applied on the cart along the x-axis. The goal space . Note that reaching the goal space does not terminate an episode, but yields a reward of . Time horizon is H = 500. Reaching the bounds of the rail does not cause failure but arrests the linear movement of the cart.

• MountainCar- The state space consists of states is car position and is car velocity. Actions are force applied by the car engine. The goal space . Time horizon is H = 200.

• Acrobot- The state space consists of states angles of the joints (measuring from the y-axis and from the vector parallel to the link, respectively) and are their angular velocities. Actions are torque applied on the joint. The goal space is . In other words, the set of states where the end of the second link is at a height y > 1.9. Time horizon is H = 500.

(measured from the y-axis) and is the joint angular velocity. Actions are torque applied on the joint. The goal space is . Note that reaching the goal space does not terminate an episode, but yields a reward of . Time horizon is H = 100.

are the angles of the joints, (x, y) are the coordinates of the target and are the joint angular velocities. Actions are torques applied at the 2 joints. The goal space is the set of states where the end-effector is within a distance of 0.01 from the target. Time horizon is H = 50.

• Fetch Reach- A high-dimensional robotic task where the state space consists of states s = [gripper_pos, finger_pos, gripper_state, finger_state, goal_pos] where gripper_pos, gripper_vel are the Cartesian coordinates and velocities of the Fetch robot’s gripper, finger_state, finger_vel are the two-dimensional position and velocity of the gripper fingers, and goal_pos are the Cartesian coordinates of the goal. Actions are relative target positions of the gripper and fingers, which the MuJoCo controller will try to achieve. The goal space is the set of states where the end-effector is within a distance of 0.05 from goal_pos. Time horizon is H = 50. Note that this problem is harder than the original version in OpenAI Gym, as we only sample goal_pos that are far from the gripper’s initial position.

• Hand Reach- A high-dimensional robotic task where the state space consists of states s = [joint_pos, joint_vel, fingertip_pos, goal_pos] where joint_pos, joint_vel are the angles and angular velocities of the Shadow hand’s 24 joints, fingertip_pos are the Cartesian coordinates of the 5 fingertips, and goal_pos are the Cartesian coordinates of the goal positions for each fingertip. Actions are absolute target angles of the 20 controllable joints, which the MuJoCo controller will try to achieve. The goal space the set of states where all fingertips are simultaneously within a distance of 0.02 from their respective goals. Time horizon is H = 50.

All experiments feature a policy with 2 fully-connected hidden layers of 32 units each with tanh activation, with the exception of Reacher, for which a policy network of 4 fully-connected hidden layers of 128 units each with relu activation is used. For all environments we use a linear feature baseline for TRPO.

Default values are used for most hyperparameters. A discount factor of is used in all environments. For VIME, hyperparameters values reported in the original paper are used, and the implementation published by the authors was used.

For TRPO, default hyperparameter values and implementation from Garage are used: KL divergence constraint , and Gaussian action noise

In comparisons with VIME-TRPO and vanilla TRPO, the R3L goal sampling probability is set to 0.05, as proposed in Urmson & Simmons (2003). Goal sets are defined in Appendix S3.1 for each environment. In all experiments, the local policy learned during R3L exploration uses Bayesian linear regression with prior precision 0.1 and noise precision 1.0, as well as 300 random Fourier features (Rahimi & Recht, 2008) approximating a square exponential kernel with lengthscale 0.3.