Sub-Goal Trees -- a Framework for Goal-Based Reinforcement Learning

2020·Arxiv

Abstract

Abstract

Many AI problems, in robotics and other domains, are goal-based, essentially seeking trajectories leading to various goal states. Reinforcement learning (RL), building on Bellman’s optimality equation, naturally optimizes for a single goal, yet can be made multi-goal by augmenting the state with the goal. Instead, we propose a new RL framework, derived from a dynamic programming equation for the all pairs shortest path (APSP) problem, which naturally solves multi-goal queries. We show that this approach has computational benefits for both standard and approximate dynamic programming. Interestingly, our formulation prescribes a novel protocol for computing a trajectory: instead of predicting the next state given its predecessor, as in standard RL, a goal-conditioned trajectory is constructed by first predicting an intermediate state between start and goal, partitioning the trajectory into two. Then, recursively, predicting intermediate points on each sub-segment, until a complete trajectory is obtained. We call this trajectory structure a sub-goal tree. Building on it, we additionally extend the policy gradient methodology to recursively predict sub-goals, resulting in novel goal-based algorithms. Finally, we apply our method to neural motion planning, where we demonstrate signifi-cant improvements compared to standard RL on navigating a 7-DoF robot arm between obstacles.

1. Introduction

Many AI problems can be characterized as learning or optimizing goal-based trajectories of a dynamical system, for example, robot skill learning and motion planning (M¨ulling et al., 2013; Gu et al., 2017; LaValle, 2006; Qureshi et al., 2018). The reinforcement learning (RL) formulation, a popular framework for trajectory optimization based on Bellman’s dynamic programming (DP) equation (Bertsekas & Tsitsiklis, 1996), naturally addresses the case of a single goal, as specified by a single reward function. RL formulations for multiple goals have been proposed (Schaul et al., 2015; Andrychowicz et al., 2017), typically by augmenting the state space to include the goal as part of the state, but without changing the underlying DP structure.

On the other hand, in deterministic shortest path problems, multi-goal trajectory optimization is most naturally represented by the all-pairs shortest path (APSP) problem (Rus- sell & Norvig, 2010). In this formulation, augmenting the state and using Bellman’s equation is known to be suboptimal1, and dedicated APSP algorithms such as FloydWarshall (Floyd, 1962) build on different DP principles. Motivated by this, we propose a goal-based RL framework that builds on an efficient APSP solution, and is also applicable to large or continuous state spaces using function approximation.

Our key idea is that a goal-based trajectory can be constructed in a divide-and-conquer fashion. First, predict a sub-goal between start and goal, partitioning the trajectory into two. Then, recursively, predict intermediate sub-goals on each sub-segment, until a complete trajectory is obtained. We call this trajectory structure a sub-goal tree (Figure 1), and we develop a DP equation for APSP that builds on it and is compatible with function approximation. We further bound the error of following the sub-goal tree trajectory in the presence of approximation errors, and show favorable results compared to the conventional RL method, intuitively, due to the sub-goal tree’s lower sensitivity to drift.

The sub-goal tree can also be seen as a general parametric structure for a trajectory, where a parametric model, e.g., a neural network, is used to predict each sub-goal given its predecessors. Based on this view, we develop a policy gradient framework for sub-goal trees, and show that conventional policy-gradient techniques such as control variates and trust regions (Greensmith et al., 2004; Schulman et al., 2015) naturally apply here as well.

Finally, we present an application of our approach to neu-

Figure 1. Trajectory prediction methods. Upper row: a conventional Sequential representation. Lower row: Sub-Goal Tree representation. Solid arrows indicate predicted segments, while dashed arrows indicate segments that still require to be predicted. By concurrently predicting sub-goals, a Sub-Goal Tree only requires 3 sequential computations, while the sequential requires 8.

ral motion planning – learning to navigate a robot between obstacles (Qureshi et al., 2018; Jurgenson & Tamar, 2019). Using our policy gradient approach, we demonstrate navigating a 7-DoF continuous robot arm safely between obstacles, obtaining marked improvement in performance compared to conventional RL approaches.

2. Related work

In RL, the idea of sub-goals has mainly been investigated under the options framework (Sutton et al., 1999). In this setting, the goal is typically fixed (i.e., given by the reward in the MDP), and useful options are discovered using some heuristic such as bottleneck states (McGovern & Barto, 2001; Menache et al., 2002) or changes in the value function (Konidaris et al., 2012). While hierarchy using options seem intuitive, theoretical results for their advantage are notoriously difficult to establish, and current results require non-trivial assumptions on the option structure (Mann & Mannor, 2014; Fruit et al., 2017). Interestingly, by investigating the APSP setting, we obtain general and strong results for the advantage of sub-goals.

Universal value functions (Schaul et al., 2015; Andrychow- icz et al., 2017) learn a goal-conditioned value function using the Bellman equation. In contrast, in this work we propose a principled motivation for sub-goals based on the APSP problem, and develop a new RL formulation based on this principle. A connection between RL and the APSP has been suggested by Kaelbling (1993); Dhiman et al. (2018), based on the Floyd-Warshall algorithm. However, as discussed in Section 4.2, these approaches become unstable once function approximation is introduced.

Sub-goal trees can be seen as a form of trajectory representation (as we discuss in Section 5.2), which was investigated for learning robotic skills (M¨ulling et al., 2013; Sung et al., 2018) and navigation (Qureshi et al., 2018). A popular approach, Dynamical movement primitives (Ijspeert et al., 2013), represents a continuous trajectory as a dynamical system with an attractor at the goal, and was successfully used for RL (Kober et al., 2013; M¨ulling et al., 2013; Peters & Schaal, 2008). The temporal segment approach of Mishra et al. (2017), on the other hand, predicts segments of a trajectory sequentially. In the context of video prediction, Jayaraman et al. (2019) proposed to predict salient frames in a goal-conditioned setting by a supervised learning loss that focuses on the ‘best’ frames. This was used to predict a list of sub-goals for a tracking controller. In contrast, we investigate the RL problem, and predict sub-goals recursively.

3. Problem Formulation and Notation

We are interested in optimizing goal-conditioned tasks in dynamical systems. We restrict ourselves to deterministic, stationary, and finite time systems, and consider both discrete and continuous state formulations.2

In the discrete state setting, we study the all-pairs shortest path (APSP) problem on a graph. Consider a directed and weighted graph with N nodes , and denote by the weight of edge .3 By definition, we set c(s, s) = 0, such that the shortest path from a node to itself is zero. To simplify notation, we can replace unconnected edges by edges with weight . Thus, without loss of generality, throughout this work we assume that the graph is complete. The APSP problem seeks the shortest paths (i.e., a path with minimum sum of costs) from any start node s to any goal node g in the graph:

In the continuous state case, we consider a deterministic controlled dynamical system in discrete time, defined over a state space S and an action space the states at time t and t+1, respectively, is the control at time t, and f is a stationary state transition function. Given an initial state and a goal g, an optimal trajectory reaches g while minimizing the sum of non-negative costs

To unify our treatment of problems (1) and (2), in the remainder of this paper we abstract away the actions as follows. Let , and if no transition from s to is possible. Then, for any start and goal states s, g, Eq. (2) is equivalent to:4

Note the similarity of Problem (3) to the APSP problem (1), where feasible transitions of the dynamical system are represented by edges between states in the graph. In the rest of this paper, we study solutions for (1) and (3), for all start and goal pairs s, g. We are particularly interested in large problems, where exact solutions are intractable, and approximations are required.

Notation: Let denote a state trajectory. Also, denote , the cost of several subse- quent states and as the total trajectory cost. Let denote the segment of starting at and ending at . We denote the trajectory concatenation as , where it is implied that pears only once. For a stochastic trajectory distribution conditioned on start and goal, we denote shorthand for

4. Approximate Dynamic Programming

In this section we study a dynamic programming principle for the APSP problem (1). We will later extend our solution to the continuous case (3) using function approximation.

One way to solve (1) is to solve a single-goal problem for every possible goal state. The RL equivalent of this approach is UVFA, suggested by Schaul et al. (2015), where the state space is augmented with the goal state, and a goal-based value function is introduced. The UVFA is optimized using standard (single-goal) RL, by learning over different goals. Alternatively, our approach builds on an efficient APSP dynamic programming algorithm, as we propose next.

4.1. Sub-Goal Tree Dynamic Programming

By our definition of non-negative costs, the shortest path between any two states is of length at most N. The main idea in our approach is that a trajectory between s and g can be constructed in a divide-and-conquer fashion: first predict a sub-goal between start and goal that optimally partitions the trajectory to two segments of length N/2 or less: and . Then, recursively, predict intermediate sub-goals on each sub-segment, until a complete trajectory is obtained. We term this composition a sub-goal tree (SGT), and we formalize it as follows.

Let denote the shortest path from or less. Note that by our convention about unconnected edges above, if there is no such trajectory in the graph then . We observe that obeys the following dynamic programming relation, which we term sub-goal tree dynamic programming (SGTDP).

Theorem 1. Consider a weighted graph with N nodes and no negative cycles. Let denote the cost of the shortest path from steps or less, and let denote the cost of the shortest path from s to . Then, can be computed according to the following equations:

V

V

V

Furthermore, for we have that

The proof of Theorem 1, along with all other proofs in this paper, are reported in the supplementary material.

Given prescribes the following recipe for computing a shortest path for every s, g, which we term the greedy SGT trajectory:5

The SGT can be seen as a general divide-and-conquer representation of a trajectory, by recursively predicting sub-goals. In contrast, the construction of a greedy trajectory in conventional RL (using the Bellman equation) proceeds sequentially, by predicting state (or action) after state. This is illustrated in Figure 1. We next investigate the advantages of the SGT structure compared to the sequential approach.

In conventional RL, it is well known that in the presence of errors, following a greedy policy may suffer from drift – accumulating errors that may hinder performance (Ross et al., 2011). In the goal-based setting, intuitively, the error magnitude scales with the distance from the goal. Our main observation is that the SGT is less sensitive to drift, as the sub-goals break the trajectory into smaller segments with exponentially decreasing errors. We next formalize this idea, by analyzing an approximate DP formulation.

4.2. SGT Approximate DP

Similar to standard dynamic programming algorithms, the SGTDP algorithm iterates over all states in the graph, which becomes infeasible for large, or continuous state spaces. For such cases, inspired by the approximate dynamic programming (ADP) literature (Bertsekas, 2005), we investigate ADP methods based on SGTDP.

Let T denote the SGTDP operator, . From Theorem 1, we have that . Similarly to standard ADP analysis (Bertsekas, 2005), we assume an -approximate SGTDP operator that generates a sequence of approximate value functions that satisfy:

where . The next result provides an error propagation bound for SGTDP.6

Proposition 1. For the sequence satisfying (6), we have that

A similar error bound is known for ADP based on the approximate Bellman operator (Bertsekas & Tsitsiklis, 1996, Pg. 332). Thus, Proposition 1 shows that given the same value function approximation method, we can expect a similar error in the value function between the SGT and sequential approaches. However, the main importance of the value function is in deriving a policy, in this case, a trajectory from start to goal. As we show next, the shortest path derived from the approximate SGTDP value function can be significantly more accurate than a path derived using the Bellman value function.

We first discuss how to compute a greedy trajectory. Given a sequence of -approximate SGTDP value functions as described above, one can compute the greedy SGT trajectory by plugging the approximate value functions in (5) instead of their respective accurate value functions. We term this the approximate greedy SGT trajectory, and provide an error bound for it.

Proposition 2. For a start and goal pair s, g, and sequence of value functions generated by an -approximate SGTDP operator, let denote the approximate greedy SGT trajectory as described above. We have that

Thus, the error of the greedy SGT trajectory is . In contrast, for the greedy trajectory according to the standard finite-horizon Bellman equation, a tight bound holds (Ross et al., 2011, we also provide an independent proof in the supplementary material). Thus, the SGT approach provides a strong improvement in handling errors compared to the sequential method. In addition, the SGT approach requires us to compute and store only different value functions. In comparison, a standard finite horizon approach would require storing N value functions, which can be limiting for large N. Finally, during trajectory prediction, sub-goal predictions for different branches of the tree are independent and can be computed concurrently, allowing an prediction time, whereas sequentially predicting sub-goals is O(N). We thus conclude that the SGT provides significant benefits in both approximation accuracy, prediction time, and space.

Why not use the Floyd-Warshall Algorithm? At this point, the reader may question why we do not build on the Floyd-Warshall (FW) algorithm for the APSP problem. The FW method maintains a value function of the shortest path from s to , and updates the value using the relaxation . If the updates are performed over all (in that sequence), will converge to the shortest path, requiring computations, compared to for SGTDP (Rus- sell & Norvig, 2010). One can also perform relaxations in an arbitrary order, as was suggested by Kaelbling (1993), and more recently by Dhiman et al. (2018), to result in an RL style algorithm. However, as was already observed by Kaelbling (1993), the FW relaxation requires that the values always over-estimate the optimal costs, and any underestimation error, due to noise or function approximation, gets propagated through the algorithm without any way of recovery, leading to instability. Our SGT approach avoids this problem by updating based on , resembling finite horizon dynamic programming, and, as we proved, maintains stability in presence of errors. Furthermore, both Kaelbling (1993); Dhiman et al. (2018) showed results only for table-lookup value functions. In our experiments, we have found that replacing the SGTDP update with a FW relaxation (described in the supplementary) leads to instability when used with function approximation.

5. Sub-Goal Tree RL Algorithms

Motivated by the theoretical results of the previous section, we present algorithms for learning SGT policies. We start with a value-based batch-RL algorithm with function approximation in Section 5.1. Then, in Section 5.2, we present a policy gradient approach for SGTs.

5.1. Batch RL with Sub-Goal Trees

We now describe a batch RL algorithm with function approximation based on the SGTDP algorithm above. Our approach is inspired by the fitted-Q iteration (FQI) algorithm for finite horizon Markov decision processes (Tsitsik- lis & Van Roy 2001; see also Ernst et al. 2005; Riedmiller 2005 for the discounted case). Similar to FQI, we are given a data set of M random state transitions and their costs , and we want to estimate for ar- bitrary pairs . Assume that we have some estimate of the value function of depth k in SGTDP. Then, for any pair of start and goal states s, g, we can estimate

Thus, if our data consisted of start and goal pairs, we could use (7) to generate regression targets for the next value function, and use any regression algorithm to fit This is the essence of the Fitted SGTDP algorithm (Algorithm 1). Since our data does not contain explicit goal states, we simply define goal states to be randomly selected states from within the data.

The first iteration k = 0 in Fitted SGTDP, however, requires special attention. We need to fit the cost function for connected states in the graph, yet make sure that states which are not reachable in a single transition have a high cost. To this end, we fit the observed costs c to the observed transitions in the data, and a high cost to transitions from the observed states to randomly selected states.

We also need a method to approximately solve the minimization problem in (7). In our experiments, we discretized the state space and performed a simple grid search. Other methods could be used in general. For example, if is represented as a neural network, then one can use gradient descent. Naturally, the quality of Fitted SGTDP will depend on the quality of solving this minimization problem.

5.2. Sub-Goal Trees Policy-Gradient

The minimization over sub-goal states in Fitted SGTDP can be difficult for high-dimensional state spaces. A similar problem arises in standard RL with continuous actions (Bert- sekas, 2005; Kalashnikov et al., 2018), and has motivated the study of policy search methods, which directly optimize the policy (Deisenroth et al., 2013). Following a similar motivation, we propose a policy search approach based on SGTs. Inspired by policy gradient (PG) algorithms in conventional RL (Sutton et al., 2000; Deisenroth et al., 2013), we propose a parametrized stochastic policy that approximates the optimal sub-goal prediction, and we develop a corresponding PG theorem for training the policy.

Stochastic SGT policies: A stochastic SGT policy is a stochastic mapping from two endpoint states to a predicted sub-goal . Given a start state and goal , the likelihood of under policy is defined recursively by:

where the base of the recursion is 1 for all . This formulation assumes that sub-goal predictions within a segment depend only on states within the segment, and not on states before or after it. This is analogous to the Markov property in conventional RL, where the next state prediction depends only on the previous state, but adapted to a goal-conditioned setting.

Note that this recursive decomposition can be interpreted as a tree. Without loss of generality, we assume this tree has a depth of D thus (repeating states to make the tree a full binary tree does not incur extra cost as c(s, s) = 0).

We now define the PG objective. Let denote a distribution over start and goal pairs . our goal is to learn a stochastic policy , characterized by a parameter vector , that minimizes the expected trajectory costs:

where the trajectory distribution is defined by first drawing , and then recursively drawing intermediate states as defined by Eq. (8). Our next result is a PG theorem for SGTs.

Theorem 2. Let be a stochastic SGT policy, trajectory distribution defined above, and

where , and is the sum of costs from of . Furthermore, let the baseline be any fixed function , then in (10) can be replaced with

The main difference between Theorem 2 and the standard PG theorem (Sutton et al., 2000) is in the calculation of , which in our case builds on the tree-based construction of the SGT trajectory. The summations over i and d in Theorem 2, are used to explicitly state individual sub-goal predictions: d iterates over the depth of the tree, and i iterates between sub-goals of the same depth.

Using Theorem 2, we can sample a trajectory using , obtain its cost , and estimate of the gradient of and the gradients of individual decisions of

In the policy gradients literature, variance reduction using control variates, and trust-region methods such as TRPO and PPO play a critical role (Greensmith et al., 2004; Schulman et al., 2015; 2017). The baseline reduction in Theorem 2 allows similar results to be derived for SGT, and we similarly find them to be important in practice. Due to space constraints, we report these in the Supplementary.

5.3. The SGT-PG Algorithm

Following the theoretical results in Section 4.1, where the greedy SGT trajectory used a different value function to predict sub-goals at different depths of the tree, we can

expect a stochastic SGT policy that depends on the depth in the tree (d in Theorem 2) to perform better than a depthindependent policy. Theorem 2 holds true when on d, similarly to a time-dependent policy in standard PG literature. We denote such a depth dependent policy as , for . At depth d = 0, no sub-goal is predicted resulting in (s, g) being directly connected. Next, predicts a sub-goal , segmenting the trajectory to two segments of depth 0: and . The recursive construction continues, predicts a sub-goal and calls on the resulting segments, and so on until depth D.

An observation that we found important for improving training stability, is that the policies can be trained sequentially. Namely, we first train the d-depth policy, and only then start training d + 1-depth policy, while freezing all policies of depth

Our algorithm implementation, SGT-PG, is detailed in Algorithm 2. SGT-PG predicts sub-goals, and interacts with an environment E that evaluates the cost of segments SGT-PG maintains D depth-specific policies: , each parametrized by respectively (i.e., we do not share parameters for policies at different depths, though this is possible in principle). The policies are trained in sequence, and every training cycle is comprised of on-policy data collection followed by policy update. The collect method collects on-policy data given N, the number of episodes to generate; d, the index of the policy being trained; and E,

Figure 2. Batch RL experiment. (a) A robot needs to navigate between the (hatched) obstacles. Blue - SGT prediction, green -trajectory tracking sub-goals using an inverse model. (b) Approximate values for several values of k. Note how the reachable region to the goal (non-yellow) grows with k.

the environment. We found that for reducing noise when training , it is best to sample from only and take the mean predictions of . The compute-PG method, based on Eq. (10), uses the collected data D, and estimates . We found that similar to sequential RL, adding a trust region using the PPO optimization objective (Schul- man et al., 2017) provides stable updates (see Section E.4 for specific loss function). Finally, optimizer.step updates according to any SGD optimizer algorithm, which completes a single training cycle for . We proceed to train the next policy when a pre-specified convergence criterion is met, for instance, until the expected cost of predicted trajectories stops decreasing.

6. Experiments

In this section we compare our SGT approach with the conventional sequential method (i.e. predicting the next state of the trajectory). We consider APSP problems inspired from robotic motion planning, where the goal is to find a collision-free trajectory between a pair of start and goal states. We illustrate the SGT value functions and trajectories on a simple 2D point robot domain, which we solve using Fitted SGTDP (Section 6.1). We then consider a more challenging domain with a simulated 7DoF robotic arm, and demonstrate the effectiveness of SGT-PG (Section 6.2).

6.1. Fitted SGTDP Experiments

We start by evaluating the Fitted SGTDP algorithm. We consider a 2D particle moving in an environment with obstacles, as shown in Figure 2a. The particle can move a distance of 0.025 in one of the eight directions, and suffers a constant cost of 0.025 in free space, and a large cost of 10 on collisions. Its task is reaching from any starting point to within a 0.15 distance of any goal point without collision. This simple domain is a continuous-state optimal control Problem (2), and for distant start and goal points, as shown in Figure 2a, it requires relatively long-horizoned planning, making it suitable for studying batch RL algorithms.

To generate data, we sampled states and actions uniformly and independently, resulting in tuples. As for function approximation, we opted for simplicity, and used K-nearest neighbors (KNN) for all our experiments, with K. To solve the minimization over states in Fitted SGTDP, we discretized the state space and searched over a grid of points.

A natural baseline in this setting is FQI (Ernst et al., 2005; Riedmiller, 2005). We verified that for a fixed goal, FQI obtains near perfect results with our data. Then, to make it goal-conditioned, we used a universal Q-function (Schaul et al., 2015), requiring only a minor change in the algorithm (see supplementary for pseudo-code).

To evaluate the different methods, we randomly chose 200 start and goal points, and measured the distance from the goal the policies reach, and whether they collide with obstacles along the way. For FQI, we used the greedy policy with respect to the learned Q function. The approximate SGTDP method, however, does not automatically provide a policy, but only a state trajectory to follow. Thus, we experimented with two methods for extracting a policy from the learned sub-goal tree. The first is training an inverse model – a mapping from , using our data, and the same KNN function approximation. To reach a sub-goal g from state s, we simply run until we are close enough to g (we set the threshold to 0.15). An alternative method is first using FQI to learn a goal-based policy, as described above, and then running this policy on the sub-goals. The idea here is that the sub-goals learned by approximate SGTDP can help FQI overcome the long-horizon planning required in this task. Note that all methods use exactly the same data, and the same function approximation (KNN), making for a fair comparison.

In Table 1 we report our results. FQI only succeed in reaching the very closest goals, resulting in a high average distance to goal. Fitted SGTDP (approx. SGTDP+IM), on the other hand, computed meaningful sub-goals for almost all test cases, resulting in a low average distance to goal when tracked by the inverse model. Figure 2 shows an example sub-goal tree and a corresponding tracked trajectory. The FQI policy did learn not to hit obstacles, resulting in the lowest collision rate. This is expected, as colliding leads to an immediate high cost, while the inverse model is not trained to take cost into account. Interestingly, combining the FQI policy with the sub-goals improves both long-horizon planning and short horizoned collision avoidance. In Figure 2b we plot the approximate value function for different k and a specific goal at the top right corner. Note how the reachable parts of the state space to the goal expand with k.

Figure 3. Motion planning with SGT-PG. (a) Illustration of start (red) and goal (blue) positions of the end effector in wall domain. (b) SGT Trajectory. Note a non-trivial rotation between (marked with orange circles) allowing linear motion from supplementary material we also show a 7 sub-goals path.

Table 1. Results for controllers of batch-RL experiments.

6.2. Neural Motion Planning

An interesting application domain for our approach is neural motion planning (NMP, Qureshi et al., 2018) – learning to predict collision-free trajectories for a robot among obstacles. Here, we study NMP for the 7DoF Franka Panda robotic arm. Due to lack of space, full technical details of this section appear in Supplementary E.1.

We follow an RL approach to NMP (Jurgenson & Tamar, 2019), using a cost function that incentivizes short, collision-free trajectories that reach the goal. The state space is the robot’s 7 joint angles, and actions correspond to predicting the next state in the plan. Given the next predicted state, the robot moves by running a fixed PID controller in simulation, tracking a linear joint motion from current to next state, and a cost is incurred based on the resulting motion.

We formulate NMP as approximate APSP as follows. A model predicts sub-goals, resulting in T motion segments. Those segments are executed and evaluated independently, i.e. when evaluating segment the robot first resets to s, and the resulting cost is based on its travel to

Our experiments include two scenarios: self-collision and wall. In self-collision, there are no obstacles and the challenge is to generate a minimal distance path while avoiding self-collisions between the robot links. The more challenging wall workspace contains a wall that partitions the space in front of the robot (see Figure 3). In wall, the shortest path is often nonmyopic, requiring to first move away from the goal in order to pass the obstacle.

We compare SGT-PG with a sequential baseline, Sequential sub-goals (SeqSG), which prescribes the sub-goals predictions sequentially. For appropriate comparison, both models use a PPO objective (Schulman et al., 2017), and a fixed

Table 2. Success rates for the NMP scenarios.

architecture neural network to model the policy. All other hyper-parameters were specifically tuned for each model.

Table 2 compares SGT-PG and SeqSG, on predictions of 1, 3, and 7 sub-goals. We evaluate success rate (reaching goal without collision) on 100 random start-goal pairs that were held out during training (see Figure 3a). Each experiment was repeated 3 times and the mean and range are reported. Note that only 1 sub-goal is required to solve self-collision, and both models obtain perfect scores. For wall, on the other hand, more sub-goals are required, and here SGT significantly outperforms SeqSG, which was not able to accurately predict several sub-goals.

7. Conclusion

We presented a framework for multi-goal RL that is derived from a novel first principle – the SGT dynamic programming equation. For deterministic domains, we showed that SGTs are less prone to drift due to approximation errors, reducing error accumulation from to O(N log N). We further developed value-based and policy gradient RL algorithms for SGTs, and demonstrated that, in line with their theoretical advantages, SGT demonstrate improved performance in practice.

Our work opens exciting directions for future research, including: (1) can our approach be extended to stochastic environments? (2) how to explore effectively based on SGTs? (3) can SGTs be extended to image-based tasks? Finally, we believe that our ideas will be important for robotics and autonomous driving applications, and other domains where goal-based predictions are important.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bertsekas, D. Dynamic programming and optimal control: Volume II. 2005.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Bishop, C. M. Mixture density networks. Technical report, Citeseer, 1994.

Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends Rin Robotics, 2(1–2):1–142, 2013.

Dhiman, V., Banerjee, S., Siskind, J. M., and Corso, J. J. Floyd-warshall reinforcement learning learning from past experiences to reach new goals. arXiv preprint arXiv:1809.09318, 2018.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.

Fruit, R., Pirotta, M., Lazaric, A., and Brunskill, E. Regret minimization in mdps with options without prior knowledge. In Advances in Neural Information Processing Systems, pp. 3166–3176, 2017.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduc- tion techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep rein- forcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389– 3396. IEEE, 2017.

Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., and Schaal, S. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.

Jayaraman, D., Ebert, F., Efros, A. A., and Levine, S. Time- agnostic prediction: Predicting predictable video frames. In ICLR, 2019.

Jurgenson, T. and Tamar, A. Harnessing reinforcement learning for neural motion planning. In Robotics: Science and Systems (RSS), 2019.

Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Konidaris, G., Kuindersma, S., Grupen, R., and Barto, A. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31 (3):360–375, 2012.

LaValle, S. M. Planning algorithms. Cambridge university press, 2006.

Mann, T. and Mannor, S. Scaling up approximate value iteration with options: Better policies with fewer iterations. In International conference on machine learning, pp. 127–135, 2014.

McGovern, A. and Barto, A. G. Automatic discovery of subgoals in reinforcement learning using diverse density. 2001.

Menache, I., Mannor, S., and Shimkin, N. Q-cutdynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295–306. Springer, 2002.

Mishra, N., Abbeel, P., and Mordatch, I. Prediction and control with temporal segment models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2459–2468. JMLR. org, 2017.

M¨ulling, K., Kober, J., Kroemer, O., and Peters, J. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

Munos, R. and Szepesv´ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (May):815–857, 2008.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682– 697, 2008.

Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In NIPS, pp. 305–313, 1989.

Qureshi, A. H., Bency, M. J., and Yip, M. C. Motion planning networks. arXiv preprint arXiv:1806.05767, 2018.

Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317– 328. Springer, 2005.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imita- tion learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.

Russell, S. J. and Norvig, P. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

S¸ucan, I. A. and Kavraki, L. E. Kinodynamic motion planning by interior-exterior cell exploration. In Algorithmic Foundation of Robotics VIII, pp. 449–464. Springer, 2009.

S¸ucan, I. A., Moll, M., and Kavraki, L. E. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA. 2012.2205651. http://ompl.kavrakilab.org.

Sung, J., Jin, S. H., and Saxena, A. Robobarista: Object part based transfer of manipulation trajectories from crowdsourcing in 3d pointclouds. In Robotics Research, pp. 701–720. Springer, 2018.

Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.

Tsitsiklis, J. N. and Van Roy, B. Regression methods for pricing complex american-style options. IEEE Transactions on Neural Networks, 12(4):694–703, 2001.

Zhang, H., Heiden, E., Julian, R., He, Z., Lim, J. J., and Sukhatme, G. S. Auto-conditioned recurrent mixture density networks for complex trajectory generation. arXiv preprint arXiv:1810.00146, 2018.

A. Proofs

Proof of Theorem 1

Proof. First, by definition, the shortest path from s to itself is 0. In the following, therefore, we assume that

We show by induction that each in Algorithm (4) is the cost of the shortest path from steps or less.

Let denote a shortest path from s to in steps or less, and let denote its corresponding cost. Our induction hypothesis is that is the cost of the shortest path from steps or less. We will show that

Assume by contradiction that there was some . Then, the concatenated trajectory would have steps or less, contradicting the fact that is a shortest path from s to in steps or less. So we have that . Since the graph is complete, can be split into two trajectories of length steps or less. Let be a midpoint in such a split. Then we have that . So equality can be obtained and thus we must have

To complete the induction argument, we need to show that . This holds since for k = 0, for each , the only possible trajectory between them of length 1 is the edge

Finally, since there are no negative cycles in the graph, for any , the shortest path has at most N steps. Thus, for , we have that is the cost of the shortest path in N steps or less, which is the shortest path between s and

Proof of Proposition 1

Proof. The SGTDP Operator T is non-linear and not a contraction, but it is monotonic:

To show (11), let

Back to the proof. By denoting e = (1, 1, 1, 1, ..., 1) we can write (6) as:

Since T is monotonic, we can apply it on the inequalities:

Focusing on the left-hand side expression (the right-hand side is symmetric) we obtain:

T

leading to:

Using (6):

Proceeding similarly we obtain for

And for every

For we obtain:

Proof of Proposition 2

Proof. For every iteration k, the middle state of any greedy SGT path with length . Thereby fulfils the identity , for every iteration k, where T is the SGT operator. The next two relations then follow:

Combining the two inequalities yields:

Explicitly writing down relation (15) for all the different sub-paths we obtain:

Figure 4. Example graph with a start and goal states s, g where the optimal path has a total cost of -approximated Bellman might form a path with total cost of infinity

Summing all the inequalities, along with the triangle inequality, lead to the following:

The identities complete the proof.

Proposition 2 provides a bound on the approximate shortest path using SGTDP. In contrast, we show that a similar bound for the sequential Bellman approach does not hold. For a start and goal pair s, g, and an approximate Bellman value function , let denote the greedy shortest path according to the Bellman operator, i.e., and for . Note that the greedy Bellman update is not guaranteed to generate a trajectory that reaches the goal, therefore we add g as the last state in the trajectory, following our convention that the graph is fully connected. When evaluating the greedy trajectory error in the sequential approach, we deal with two different implementation cases: Using one approximated value function or N approximated value functions. The next proposition shows that when using one value function the greedy trajectory can be arbitrarily bad, regardless of . Propositions (4) and (5) show that the error of the greedy trajectory, using N value functions, have a tight

Proposition 3. For any , there exists a graph and an approximate value function satisfying , such that

Proof. We show a graph example (Figure 4), where a -approximate Bellman value function might form the path with a total cost of infinity. (Reminder: The graph is fully connected, all the non-drawn edges have infinity cost), while the optimal path has a total cost of

As is independent of the current time-step k, and the optimal values for s and are and respectively, due to approximation error (for every time-step) might suggest that the value of is lower than the value of : . The resulting policy will choose to stay in for the first steps. Since the maximum path length is N and the path must end at the goal, the last step will be directly from , resulting in a cost of infinity.

Proposition 4. For a finite state graph, start and goal pair s, g, and sequence of -approximate Bellman value functions satisfying and , let denote the greedy Bellman path, that is: etc. We have that

Proof. Denote as the Bellman value function at iteration -approximated Bellman value function and the Bellman operator. Using the properties and (Bertsekas & Tsitsiklis, 1996, Pg. 332), the following relation holds:

Figure 5. Example graph with a start and goal states s, g. The edges costs are marked in black labels, the Bellman value function at iteration k is marked in green and the -approximated Bellman value is marked in red. The optimal path has a total cost of while the -approximated Bellman might form the path going through all the states with a total cost of

for

for

Summing all the path costs:

Proposition 5. The error bound stated in Proposition(4), , is indeed a tight bound.

Proof. We show a graph example (Figure 5),where a sequence of -approximate Bellman value functions might form the path from the start to the goal state, going through all the available states, with a total cost of , while the optimal path has a total cost of

The Bellman value of every state , where is the time horizon from state to the goal. Since the -approximate Bellman value function at every state might have a maximum approximation error of , it might suggest that for every . The Bellman value for state is 0 and the -approximate Bellman value might be . The greedy Bellman policy has to determine at every state , whether to go to or to . Since both possible actions have an immediate cost of , the decision will only be based on the evaluation of -approximate Bellman value function: . Therefore, the greedy bellman policy will decide at every state , forming the path with a total cost of

B. Policy Gradient Theorem for Sub-goal Trees

In this section we provide the mathematical framework and tool we used in the Policy Gradient Theorem 2 in Section 5.2 which allows us to formulate the SGT-PG algorithm (Section 5.2). The SGT the prediction process (Eq.(8)) no-longer decomposes sequentially like a MDP, therefore, we provide a different mathematical construct:

Finite-depth Markovian sub-goal tree (FD-MSGT) is a process predicting sub-goals (or intermediate states) of a trajectory in a dynamical system as described in (3). The process evolves trajectories recursively (as we describe in detail below), inducing a tree-like decomposition for the probability of a trajectory as described Eq. (8). In this work we consider trees with fixed levels D (corresponding to finite-horizon MDP in sequential prediction RL with horizon ) 7. Formally, a FD-MSGT is comprised of where S is the state space, is the initial start-goal pairs distribution, c is a non-negative cost function obeying Eq. (3), and D is the depth of the tree 8.

Recursive evolution of FD-MSGT: Formally, an initial pair is sampled, defining the root of the tree. Next, a policy is used to predict the sub-goal , creating two new tree nodes of depth D corresponding to the segments . Recursively each segment is again partitioned using resulting in four tree nodes of level corresponding to segments and . The process continues recursively until the depth of the tree is 1. This results in Eq.(8), namely,

FD-MSGT results in sub-goals, and overall states (including c defines the cost of the prediction , and is evaluated on the leaves of the FD-MSGT tree according to Eq. (3). Figure 3b illustrates a FD-MSGT of D = 2. The objective of FD-MSGT is to find a policy minimizing the expected cost of a trajectory

Non-recursive formulation for trajectory likelihood: We next derive a non-recursive formulation of Eq. (8), using the notations defined in Theorem 2, namely, , and is the sum of costs from . We re-arrange the terms in (8) grouping by depth resulting in a non-recursive formula,

For example consider a FD-MSGT with D = 3 (T = 8). Given and , the sub-goals will be . We enumerate the two products in Eq. (16) resulting in:

Allowing us to assert the equivalence of the explicit and recursive formulations in this case.

B.1. Proof of Theorem 2

Let be a policy with parameters , we next prove a policy gradient theorem (Theorem 3) for computing , using the following proposition:

Proposition 6. Let be a policy with parameters of a FD-MSGT with depth D, then

Proof. First, we express obtaining,

Next, by taking the log we have,

Proposition 6 shows that the gradient of a trajectory w.r.t. does not depend on the initial distribution. This allows us to derive a policy gradient theorem:

Theorem 3. Let be a stochastic SGT policy, be a trajectory distribution defined above, and

Proof. To obtain we write Eq. (9) as an explicit expectation and use

This proves the first equality. For the second equality we substitute according to Eq. (17).

The policy gradient theorem allows estimating from on policy data collected using . We next show how to improve the estimate in Theorem 3 using control variates (baselines). We start with the following baseline-reduction proposition.

Proposition 7. Let be a policy with parameters of a FD-MSGT with depth be any fixed function

Then we expand the left-hand side of Eq. (18), and use Eq. (19) in the last row:

which concludes the proof.

We define segment-dependent baseline if b is a fixed function that operates on a pair of states s and g. The last proposition allows us to reduce segment-dependent baselines, , from the estimations of without bias. Finally, in the next proposition we show that instead of estimating we can instead use as follows:

Proposition 8. Let be a policy with parameters of a FD-MSGT with depth D, then:

Proof. We have that:

The first transition is the expectation of sums, and the second is the smoothing theorem. We next expand the inner expectation in Eq. (20), and partition the sum of cost into three sums: between indices 0 to to , and finally from

Next, by the definition of FD-MSGT the costs between to and to , are independent of the policy predictions between indices – making them constants. As we showed in Proposition 7, we obtain that

this expectation is 0. Thus the expression for simplifies to:

Finally, combining Propositions 7 and 8 in Theorem 3 provides us Theorem 2 in the main text.

C. Batch RL Baselines

In Algorithm 3 we present a goal-based versions of fitted-Q iteration (FQI) (Ernst et al., 2005) using universal function approximation (Schaul et al., 2015), which we used as a baseline in our experiments.

Next, in Algorithm 4 we present an approximate dynamic programming version of Floyd-Warshall RL (Kaelbling, 1993) that corresponds to the batch RL setting we investigate. This algorithm was not stable in our experiments, as the value function converged to zero for all states (when removing the self transition fitting in line 7 of the algorithm, the values converged to a constant value).

Algorithm 4 Approximate Floyd Warshall

Algorithm

1 Input: dataset , Maximum path cost 2 Create transition data set and targets taken from D 3 Create random transition data set and targets with randomly chosen from states in D

4 Create self transition data set and targets taken from D 5 Fit to data in for k : 1...K do

6 Create random goal and mid-point data set and targets

D. Learning SGT with Supervised Learning

In this section we describe how to learn SGT policies for deterministic, stationary, and finite time dynamical systems from labeled data generated by an expert. This setting is commonly referred to as Imitation Learning (IL). Specifically, we are given a dataset of N trajectory demonstrations, provided by an expert policy . Each trajectory demonstration from the dataset , is a sequence of states, i.e. .9 In this work, we assume a goal-based setting, that is, we assume that the expert policy generates trajectories that lead to a goal state which is the last state in each trajectory demonstration. Our goal is to learn a policy that, given a pair of current state and goal state (s, g), predicts the trajectory that would have chosen to reach g from s.

D.1. Imitation learning with SGT

In this section, we describe the learning objectives for sequential and sub-goal tree prediction approaches under the IL settings. We focus on the Behavioral Cloning (BC)(Pomerleau, 1989) approach to IL, where a parametric model for a policy with parameters is learned by maximizing the log-likelihood of observed trajectories in

Denote the horizon T as the maximal number of states in a trajectory. For ease of notation we assume T to be the same for all trajectories10. Also, let and denote the start and goal states for . We ask – how to best represent the distribution

The sequential trajectory representation (Pomerleau, 1989; Ross et al., 2011; Zhang et al., 2018; Qureshi et al., 2018), a popular approach, decomposes by sequentially predicting states in the trajectory conditioned on previous predictions. Concretely, let denote the history of the trajectory at time index t, the decomposition assumed by the sequential representation is Using this decomposition, (22) becomes:

We can learn using a batch stochastic gradient descent (SGD) method. To generate a sample in a training batch, a trajectory is sampled from , and an observed state is further sampled from . Next, the history and goal are extracted from . After learning, sampling a trajectory between s and g is a straight-forward iterative process, where the first prediction is given by , and every subsequent prediction is given by . This iterative process stops once some stopping condition is met (such as a target prediction horizon is reached, or a prediction is ’close-enough’ to g).

The Sub-Goal Tree Representation: In SGT we instead predict the middle-state in a divide-and-conquer approach. Following Eq. (8) we can redefine the maximum-likelihood objective as

To organize our data for optimizing Eq. (24), we first sample a trajectory for each sample in the batch. From we sample two states and obtain their midpoint . Pseudo-code for sub-goal tree prediction and learning are provided in Algorithm 5 and Algorithm 6.

D.2. Imitation Learning Experiments

We compare the sequential and Sub-Goal Tree (SGT) approaches for BC. Consider a point-robot motion-planning problem in a 2D world, with two obstacle scenarios termed simple and hard, as shown in Figure 6. In simple, the distribution of possible motions from left to right is uni-modal, while in hard, at least 4 modalities are possible.

Figure 6. IL Experiment domains and results. The simple domain (a) and hard domain (b). A point robot should move only on the white free-space from a room on one side to the room on the other side while avoiding the blue obstacles. SGT plan (blue) executed successfully, sequential plan (green) collides with the obstacles.

Table 3. Results for IL experiments.

For both scenarios we collected a set of 111K (100K train + 10K validation + 1K test) expert trajectories from random start and goal states using a state-of-the-art motion planner (OMPL’s(S¸ucan et al., 2012) Lazy Bi-directional KPIECE (S¸ucan & Kavraki, 2009) with one level of discretization). To account for different trajectory modalities, we chose a Mixture Density Network (MDN)(Bishop, 1994) as the parametric distribution of the predicted next state, both for the sequential and the SGT representations. We train the MDN by maximizing likelihood using Adam (Kingma & Ba, 2014). To ensure the same model capacity for both representations, we used the same network architecture, and both representations were trained and tested with the same data. Since the dynamical system is Markovian, the current and goal states are sufficient for predicting the next state in the plan, so we truncated the state history in the sequential model’s input to contain only the current state.

For simple, the MDNs had a uni-modal multivariate-Gaussian distribution, while for hard, we experimented with 2 and 4 modal multivariate-Gaussian distributions, denoted as hard-2G, and hard-4G, respectively. As we later show in our results the SGT representation captures the demonstrated distribution well even in the hard-2G scenario, by modelling a bi-modal sub-goal distribution.

We evaluate the models using unseen start-goal pairs taken from the test set. To generate a trajectory, we predict sub-goals, and connect them using linear interpolation. We call a trajectory successful if it does not collide with an obstacle en-route to the goal. For a failed trajectory, we further measure the severity of collision by the percentage of the trajectory being in collision.

Table 3 summarizes the results for both representations. The SGT representation is superior in all three evaluation criteria - motion planning success rate, trajectory prediction times (total time in seconds for the 1K trajectory predictions), and severity. Upon closer look at the two hard scenarios, the Sub Goal Tree with a bi-modal MDN outperforms sequential with 4-modal MDN, suggesting that the SGT trajectory decomposition better accounts for multi-modal trajectories.

Table 4. Success rate on the NMP poles scenario

E. Technical Details for NMP Experiments Section 6.2

In this appendix we provide the full technical details of our policy gradient experiments presented in Section 6.2. We start by providing the explicit formulation of the NMP environment. We further provide a sequential-baseline not presented in the main text and its NMP environment and experimental results related to this baseline. We also present another challenging scenario – poles, and the results of SGT-PG and SeqSG in it. Finally, we conclude with modeling-related technical details 11.

E.1. Neural Motion Planning Formulation

In Section 6.2, we investigated the performance of SGT-PG and SeqSG executed in a NMP problem for the 7DoF Franka Panda robotic arm. That NMP formulation, denoted here as with-reset, allows for independent evaluations of motion segments, a requirement for FD-MSGT. The state describes the robot’s 7 joint angles and 2 gripper positions, normalized to . The models, SGT-PG and SeqSG, are tasked with predicting the states (sub-goals) in the plan.

A segment is evaluated by running a PID controller tracking a linear joint motion from s to . Let denote the state after the PID controller was executed, and let and be hyperparameters. We next define a cost function that encourages shorter paths, and discourages collisions; if a collision occurred during the PID execution, or 5000 simulation steps were not enough to reach , a cost of is returned. Otherwise, the motion was successful, and the segment cost is . In our experiments we set show the training curves for the self-collision and wall scenarios respectively.

E.2. The SeqAction Baseline

We also evaluated a sequential baseline we denoted as SeqAction, which is more similar to classical RL approaches. Instead of predicting the next state, SeqAction predicts the next action to take (specifically the joint angles difference), and a position controller is executed for a single time-step instead of the PID controller of the previous methods.

SeqAction operates on a finite-horizon MDP (with horizon of T = 5000 steps). We follow the sequential NMP formulation of Jurgenson & Tamar (2019); the agent gets a high cost on collisions, high reward when reaching close to the goal, or a small cost otherwise (to encourage shorter paths). We denote this NMP variant no-reset as segments are evaluated sequentially, without resetting the state between evaluations. To make the problem easier, after executing an action in no-reset, the velocity of all the links is manually set to 0.

Similarly to Jurgenson & Tamar (2019), we found that sequential RL agents are hard to train for neural motion planning with varying start and goal positions without any prior knowledge or inductive bias. We trained SeqAction in the self-collision scenario using PPO (Schulman et al., 2017), and although we achieved 100% success rate when fixing both the start and the goal (to all zeros and all ones respectively), the model was only able to obtain less than 0.1 success rate when varying the goal state, and showed no signs of learning when varying both the start and the goal.

E.3. The poles Scenario

We tested both SGT-PG and SeqSG on another scenario we call poles where the goal is to navigate the robot between poles as shown in Figures 7 and 8. Similar to the previous scenarios, Table 4 shows the success rates of both models for 1, 3, and 7 sub-goals. Again, we notice that SGT-PG attains much better results and improves as more sub-goals are added, while the performance of SeqSG deteriorates. Moreover, even for a single sub-goal, SGT-PG obtains better success rates compared to

Figure 7. Robot trajectory poles scenario predicted with SGT-PG with 7 sub-goals. Top left start state, bottom right goal state. Notice that the two final segments are longer than the rest, but are made possible by the robot getting into position in state Note that SGT-PG failed to find a trajectory for this (s, g) pair with less sub-goals, indicating the hardness of the path.

Figure 8. Start (blue) and goal (red) states in the test set of the poles scenario. Note that the agents are not exposed to these particular states during training.

SeqSG. The high range displayed by SGT-PG is due to one random seed that obtained significantly lower scores.

E.4. Neural Network Architecture and Training Procedure

We next provide full technical details regarding the architecture and training procedure of SGT-PG and SeqSG described in Section 6.2.

Architecture and training of SGT-PG: The start and goal states, s and g, are fed to a neural network with 3 hidden-layers of 20 neurons each and tanh activations. The network learns a multivariate Gaussian-distribution over the predicted sub-goal with a diagonal covariance matrix. Since the covariance matrix is diagonal, the size of the output layer is , 9 elements learn the bias, and 9 learn the standard deviation in each direction. To predict the bias, a value of is added to the prediction of the network. To obtain the final standard deviation, we add two additional elements to the standard deviation predicted by the neural network : (1) a small value (0.05), to prevent learning too narrow distributions, and (2) a learnable coefficient in each of the directions that depends on the distance between s and g, which allows large segments to predict a wider spread of sub-goals. These two changes were incorporated in order to mitigate log-likelihood evaluation issues.

SGT-PG trains on 30 (s, g) pairs per training cycle, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.005 on the PPO objective (Schulman et al., 2017) augmented with a max-entropy loss term with coefficient 1, and a PPO trust region of . As mentioned in Section 5.2, when training level d for a single (s, g) pair, we sample M sub-goals (in our experiments M = 10) from the multivariate Gaussian described above, take the mean predictions of policies of depth and lower, and take the mean of costs as a (s, g)-dependent baseline. The repetition allows us to obtain a stable baseline without incorporating another learnable module – a value function. We found this method to be more stable than training on 300 different start-goal pairs (no repetitions). During test time we always take the mean prediction generated by the policies.

Architecture of SeqSG: This model has two components, a policy and a value function. The policy architecture is identical to that of SGT-PG. Note that only a single policy is maintained as the optimal policy is independent of the remaining

Figure 9. Success rates over training episodes of the self-collision scenario.

horizon12. We also found that learning the covaraince matrix causes stability issues for the algorithm, so the multivariate Gaussian only learns a bias term. The value function of SeqSG has 3 layers with 20 neurons each, and Elu activations. It predicts a single scalar which approximates V (s, g), and is used as a state-dependent baseline to reduce the variance of the gradient estimation during training.

In order to compare apples-to-apples, SeqSG also trains on 30 (s, g) pairs with 10 repetitions each. We trained the policy with the PPO objective (Schulman et al., 2017) with trust region of . We used the Adam optimizer (Kingma & Ba, 2014), with learning rate of 0.005 but we clipped the gradient with L2 norm of 10 to stabilize training. The value function is trained to predict the value with mean-squared error loss. We also trained it with Adam with a learning rate of 0.005 but clipped the gradient L2 norm to 100.

We note that finding stable parameters for SeqSG was more challenging than for SGT-PG, as apparent by the gradient clippings, the lower trust region value, and our failure to learn the mutivariate Gaussian distribution covariance matrix.

Figure 10. Success rates over training episodes of the wall scenario.