b

DiscoverSearch
About
My stuff
Sub-Goal Trees -- a Framework for Goal-Based Reinforcement Learning
2020·arXiv
Abstract
Abstract

Many AI problems, in robotics and other domains, are goal-based, essentially seeking trajectories leading to various goal states. Reinforcement learning (RL), building on Bellman’s optimality equation, naturally optimizes for a single goal, yet can be made multi-goal by augmenting the state with the goal. Instead, we propose a new RL framework, derived from a dynamic programming equation for the all pairs shortest path (APSP) problem, which naturally solves multi-goal queries. We show that this approach has computational benefits for both standard and approximate dynamic programming. Interestingly, our formulation prescribes a novel protocol for computing a trajectory: instead of predicting the next state given its predecessor, as in standard RL, a goal-conditioned trajectory is constructed by first predicting an intermediate state between start and goal, partitioning the trajectory into two. Then, recursively, predicting intermediate points on each sub-segment, until a complete trajectory is obtained. We call this trajectory structure a sub-goal tree. Building on it, we additionally extend the policy gradient methodology to recursively predict sub-goals, resulting in novel goal-based algorithms. Finally, we apply our method to neural motion planning, where we demonstrate signifi-cant improvements compared to standard RL on navigating a 7-DoF robot arm between obstacles.

Many AI problems can be characterized as learning or optimizing goal-based trajectories of a dynamical system, for example, robot skill learning and motion planning (ulling et al., 2013; Gu et al., 2017; LaValle, 2006; Qureshi et al., 2018). The reinforcement learning (RL) formulation, a popular framework for trajectory optimization based on Bellman’s dynamic programming (DP) equation (Bertsekas & Tsitsiklis, 1996), naturally addresses the case of a single goal, as specified by a single reward function. RL formulations for multiple goals have been proposed (Schaul et al., 2015; Andrychowicz et al., 2017), typically by augmenting the state space to include the goal as part of the state, but without changing the underlying DP structure.

On the other hand, in deterministic shortest path problems, multi-goal trajectory optimization is most naturally represented by the all-pairs shortest path (APSP) problem (Rus- sell & Norvig, 2010). In this formulation, augmenting the state and using Bellman’s equation is known to be suboptimal1, and dedicated APSP algorithms such as FloydWarshall (Floyd, 1962) build on different DP principles. Motivated by this, we propose a goal-based RL framework that builds on an efficient APSP solution, and is also applicable to large or continuous state spaces using function approximation.

Our key idea is that a goal-based trajectory can be constructed in a divide-and-conquer fashion. First, predict a sub-goal between start and goal, partitioning the trajectory into two. Then, recursively, predict intermediate sub-goals on each sub-segment, until a complete trajectory is obtained. We call this trajectory structure a sub-goal tree (Figure 1), and we develop a DP equation for APSP that builds on it and is compatible with function approximation. We further bound the error of following the sub-goal tree trajectory in the presence of approximation errors, and show favorable results compared to the conventional RL method, intuitively, due to the sub-goal tree’s lower sensitivity to drift.

The sub-goal tree can also be seen as a general parametric structure for a trajectory, where a parametric model, e.g., a neural network, is used to predict each sub-goal given its predecessors. Based on this view, we develop a policy gradient framework for sub-goal trees, and show that conventional policy-gradient techniques such as control variates and trust regions (Greensmith et al., 2004; Schulman et al., 2015) naturally apply here as well.

Finally, we present an application of our approach to neu-

image

Figure 1. Trajectory prediction methods. Upper row: a conventional Sequential representation. Lower row: Sub-Goal Tree representation. Solid arrows indicate predicted segments, while dashed arrows indicate segments that still require to be predicted. By concurrently predicting sub-goals, a Sub-Goal Tree only requires 3 sequential computations, while the sequential requires 8.

ral motion planning – learning to navigate a robot between obstacles (Qureshi et al., 2018; Jurgenson & Tamar, 2019). Using our policy gradient approach, we demonstrate navigating a 7-DoF continuous robot arm safely between obstacles, obtaining marked improvement in performance compared to conventional RL approaches.

In RL, the idea of sub-goals has mainly been investigated under the options framework (Sutton et al., 1999). In this setting, the goal is typically fixed (i.e., given by the reward in the MDP), and useful options are discovered using some heuristic such as bottleneck states (McGovern & Barto, 2001; Menache et al., 2002) or changes in the value function (Konidaris et al., 2012). While hierarchy using options seem intuitive, theoretical results for their advantage are notoriously difficult to establish, and current results require non-trivial assumptions on the option structure (Mann & Mannor, 2014; Fruit et al., 2017). Interestingly, by investigating the APSP setting, we obtain general and strong results for the advantage of sub-goals.

Universal value functions (Schaul et al., 2015; Andrychow- icz et al., 2017) learn a goal-conditioned value function using the Bellman equation. In contrast, in this work we propose a principled motivation for sub-goals based on the APSP problem, and develop a new RL formulation based on this principle. A connection between RL and the APSP has been suggested by Kaelbling (1993); Dhiman et al. (2018), based on the Floyd-Warshall algorithm. However, as discussed in Section 4.2, these approaches become unstable once function approximation is introduced.

Sub-goal trees can be seen as a form of trajectory representation (as we discuss in Section 5.2), which was investigated for learning robotic skills (ulling et al., 2013; Sung et al., 2018) and navigation (Qureshi et al., 2018). A popular approach, Dynamical movement primitives (Ijspeert et al., 2013), represents a continuous trajectory as a dynamical system with an attractor at the goal, and was successfully used for RL (Kober et al., 2013; ulling et al., 2013; Peters & Schaal, 2008). The temporal segment approach of Mishra et al. (2017), on the other hand, predicts segments of a trajectory sequentially. In the context of video prediction, Jayaraman et al. (2019) proposed to predict salient frames in a goal-conditioned setting by a supervised learning loss that focuses on the ‘best’ frames. This was used to predict a list of sub-goals for a tracking controller. In contrast, we investigate the RL problem, and predict sub-goals recursively.

We are interested in optimizing goal-conditioned tasks in dynamical systems. We restrict ourselves to deterministic, stationary, and finite time systems, and consider both discrete and continuous state formulations.2

In the discrete state setting, we study the all-pairs shortest path (APSP) problem on a graph. Consider a directed and weighted graph with N nodes  s1, . . . , sN, and denote by c(s, s′) ≥ 0the weight of edge  s → s′.3 By definition, we set c(s, s) = 0, such that the shortest path from a node to itself is zero. To simplify notation, we can replace unconnected edges by edges with weight  ∞. Thus, without loss of generality, throughout this work we assume that the graph is complete. The APSP problem seeks the shortest paths (i.e., a path with minimum sum of costs) from any start node s to any goal node g in the graph:

image

In the continuous state case, we consider a deterministic controlled dynamical system in discrete time, defined over a state space S and an action space  U: where st, st+1 ∈ S arethe states at time t and t+1, respectively,  ut ∈ Uis the control at time t, and f is a stationary state transition function. Given an initial state  s0and a goal g, an optimal trajectory reaches g while minimizing the sum of non-negative costs ¯c(s, u) ≥ 0:

image

To unify our treatment of problems (1) and (2), in the remainder of this paper we abstract away the actions as follows. Let c(s, s′) = minu ¯c(s, u) s.t. s′ = f(s, u), and  c(s, s′) = ∞if no transition from s to  s′is possible. Then, for any start and goal states s, g, Eq. (2) is equivalent to:4

image

Note the similarity of Problem (3) to the APSP problem (1), where feasible transitions of the dynamical system are represented by edges between states in the graph. In the rest of this paper, we study solutions for (1) and (3), for all start and goal pairs s, g. We are particularly interested in large problems, where exact solutions are intractable, and approximations are required.

Notation: Let  τ =s0, . . . , sTdenote a state trajectory. Also, denote  ci:j = �j−1t=i c(st, st+1), the cost of several subse- quent states and  cτ = c0:Tas the total trajectory cost. Let τ(s0, sm)=s0, . . . , smdenote the segment of  τstarting at s0and ending at  sm. We denote the trajectory concatenation as  τ=[τ(s0, sm), τ(sm, sT )], where it is implied that  sm ap-pears only once. For a stochastic trajectory distribution conditioned on start and goal, we denote  Prπ[si, . . . , sj|s, g] asshorthand for  Prπ[si, . . . , sj|si =s,sj =g].

In this section we study a dynamic programming principle for the APSP problem (1). We will later extend our solution to the continuous case (3) using function approximation.

One way to solve (1) is to solve a single-goal problem for every possible goal state. The RL equivalent of this approach is UVFA, suggested by Schaul et al. (2015), where the state space is augmented with the goal state, and a goal-based value function is introduced. The UVFA is optimized using standard (single-goal) RL, by learning over different goals. Alternatively, our approach builds on an efficient APSP dynamic programming algorithm, as we propose next.

4.1. Sub-Goal Tree Dynamic Programming

By our definition of non-negative costs, the shortest path between any two states is of length at most N. The main idea in our approach is that a trajectory between s and g can be constructed in a divide-and-conquer fashion: first predict a sub-goal between start and goal that optimally partitions the trajectory to two segments of length N/2 or less:  s, . . . , sN/2and  sN/2, . . . , g. Then, recursively, predict intermediate sub-goals on each sub-segment, until a complete trajectory is obtained. We term this composition a sub-goal tree (SGT), and we formalize it as follows.

Let  Vk(s, s′)denote the shortest path from  s to s′ in 2k stepsor less. Note that by our convention about unconnected edges above, if there is no such trajectory in the graph then Vk(s, s′) = ∞. We observe that  Vkobeys the following dynamic programming relation, which we term sub-goal tree dynamic programming (SGTDP).

Theorem 1. Consider a weighted graph with N nodes and no negative cycles. Let  Vk(s, s′)denote the cost of the shortest path from  s to s′ in 2k steps or less, and let  V ∗(s, s′)denote the cost of the shortest path from s to  s′. Then,  Vkcan be computed according to the following equations:

V0(s, s′) = c(s, s′), ∀s, s′ : s ̸= s′;

Vk(s, s) = 0, ∀s;

Vk(s, s′) =minsm {Vk−1(s, sm)+Vk−1(sm, s′)}, ∀s,s′:s̸=s′.

Furthermore, for  k ≥ log2(N)we have that  Vk(s, s′) =V ∗(s, s′) for all s, s′.

The proof of Theorem 1, along with all other proofs in this paper, are reported in the supplementary material.

Given  V0, . . . , Vlog2(N), Theorem 1prescribes the following recipe for computing a shortest path for every s, g, which we term the greedy SGT trajectory:5

image

image

The SGT can be seen as a general divide-and-conquer representation of a trajectory, by recursively predicting sub-goals. In contrast, the construction of a greedy trajectory in conventional RL (using the Bellman equation) proceeds sequentially, by predicting state (or action) after state. This is illustrated in Figure 1. We next investigate the advantages of the SGT structure compared to the sequential approach.

In conventional RL, it is well known that in the presence of errors, following a greedy policy may suffer from drift – accumulating errors that may hinder performance (Ross et al., 2011). In the goal-based setting, intuitively, the error magnitude scales with the distance from the goal. Our main observation is that the SGT is less sensitive to drift, as the sub-goals break the trajectory into smaller segments with exponentially decreasing errors. We next formalize this idea, by analyzing an approximate DP formulation.

4.2. SGT Approximate DP

Similar to standard dynamic programming algorithms, the SGTDP algorithm iterates over all states in the graph, which becomes infeasible for large, or continuous state spaces. For such cases, inspired by the approximate dynamic programming (ADP) literature (Bertsekas, 2005), we investigate ADP methods based on SGTDP.

Let T denote the SGTDP operator,  (TV )(s, s′) =minsm {V (s, sm) + V (sm, s′)}. From Theorem 1, we have that  V ∗ = T log2(N)V0. Similarly to standard ADP analysis (Bertsekas, 2005), we assume an  ϵ-approximate SGTDP operator that generates a sequence of approximate value functions ˆV0, . . . , ˆVlog2(N)that satisfy:

image

where  ∥x∥∞ = maxs,s′ |x(s, s′)|. The next result provides an error propagation bound for SGTDP.6

Proposition 1. For the sequence ˆV0, . . . , ˆVlog2(N)satisfying (6), we have that  ∥ ˆVlog2(N) − V ∗∥∞ ≤ ϵ(2N − 1).

A similar  O(ϵN)error bound is known for ADP based on the approximate Bellman operator (Bertsekas & Tsitsiklis, 1996, Pg. 332). Thus, Proposition 1 shows that given the same value function approximation method, we can expect a similar error in the value function between the SGT and sequential approaches. However, the main importance of the value function is in deriving a policy, in this case, a trajectory from start to goal. As we show next, the shortest path derived from the approximate SGTDP value function can be significantly more accurate than a path derived using the Bellman value function.

We first discuss how to compute a greedy trajectory. Given a sequence of  ϵ-approximate SGTDP value functions ˆV0, . . . , ˆVlog2(N)as described above, one can compute the greedy SGT trajectory by plugging the approximate value functions in (5) instead of their respective accurate value functions. We term this the approximate greedy SGT trajectory, and provide an error bound for it.

Proposition 2. For a start and goal pair s, g, and sequence of value functions ˆV0, . . . , ˆVlog2(N)generated by an ϵ-approximate SGTDP operator, let  s0, . . . , sNdenote the approximate greedy SGT trajectory as described above. We have that �N−1i=0 c(si, si+1) ≤ V ∗(s, g) + 4N log2 (N)ϵ.

Thus, the error of the greedy SGT trajectory is O(N log2 (N)). In contrast, for the greedy trajectory according to the standard finite-horizon Bellman equation, a tight  O(N 2)bound holds (Ross et al., 2011, we also provide an independent proof in the supplementary material). Thus, the SGT approach provides a strong improvement in handling errors compared to the sequential method. In addition, the SGT approach requires us to compute and store only  log2(N)different value functions. In comparison, a standard finite horizon approach would require storing N value functions, which can be limiting for large N. Finally, during trajectory prediction, sub-goal predictions for different branches of the tree are independent and can be computed concurrently, allowing an  O(log2(N))prediction time, whereas sequentially predicting sub-goals is O(N). We thus conclude that the SGT provides significant benefits in both approximation accuracy, prediction time, and space.

Why not use the Floyd-Warshall Algorithm? At this point, the reader may question why we do not build on the Floyd-Warshall (FW) algorithm for the APSP problem. The FW method maintains a value function  VF W (s, s′)of the shortest path from s to  s′, and updates the value using the relaxation  VF W (s, s′) :=min {VF W (s, s′), VF W (s, sm) + VF W (sm, s′)}. If the updates are performed over all  sm, s, and s′ (in that sequence), VF Wwill converge to the shortest path, requiring  O(N 3)computations, compared to  O(N 3 log N)for SGTDP (Rus- sell & Norvig, 2010). One can also perform relaxations in an arbitrary order, as was suggested by Kaelbling (1993), and more recently by Dhiman et al. (2018), to result in an RL style algorithm. However, as was already observed by Kaelbling (1993), the FW relaxation requires that the values always over-estimate the optimal costs, and any underestimation error, due to noise or function approximation, gets propagated through the algorithm without any way of recovery, leading to instability. Our SGT approach avoids this problem by updating  Vkbased on  Vk−1, resembling finite horizon dynamic programming, and, as we proved, maintains stability in presence of errors. Furthermore, both Kaelbling (1993); Dhiman et al. (2018) showed results only for table-lookup value functions. In our experiments, we have found that replacing the SGTDP update with a FW relaxation (described in the supplementary) leads to instability when used with function approximation.

Motivated by the theoretical results of the previous section, we present algorithms for learning SGT policies. We start with a value-based batch-RL algorithm with function approximation in Section 5.1. Then, in Section 5.2, we present a policy gradient approach for SGTs.

5.1. Batch RL with Sub-Goal Trees

We now describe a batch RL algorithm with function approximation based on the SGTDP algorithm above. Our approach is inspired by the fitted-Q iteration (FQI) algorithm for finite horizon Markov decision processes (Tsitsik- lis & Van Roy 2001; see also Ernst et al. 2005; Riedmiller 2005 for the discounted case). Similar to FQI, we are given a data set of M random state transitions and their costs {(si, s′i, ci)}Mi=1, and we want to estimate  Vk(s, s′)for ar- bitrary pairs  (s, s′). Assume that we have some estimate ˆVk(s, s′)of the value function of depth k in SGTDP. Then, for any pair of start and goal states s, g, we can estimate

image

Thus, if our data consisted of start and goal pairs, we could use (7) to generate regression targets for the next value function, and use any regression algorithm to fit ˆVk+1(s, s′).This is the essence of the Fitted SGTDP algorithm (Algorithm 1). Since our data does not contain explicit goal states, we simply define goal states to be randomly selected states from within the data.

The first iteration k = 0 in Fitted SGTDP, however, requires special attention. We need to fit the cost function for connected states in the graph, yet make sure that states which are not reachable in a single transition have a high cost. To this end, we fit the observed costs c to the observed transitions  s, s′in the data, and a high cost  Cmaxto transitions from the observed states to randomly selected states.

We also need a method to approximately solve the minimization problem in (7). In our experiments, we discretized the state space and performed a simple grid search. Other methods could be used in general. For example, if  Vkis represented as a neural network, then one can use gradient descent. Naturally, the quality of Fitted SGTDP will depend on the quality of solving this minimization problem.

image

5.2. Sub-Goal Trees Policy-Gradient

The minimization over sub-goal states in Fitted SGTDP can be difficult for high-dimensional state spaces. A similar problem arises in standard RL with continuous actions (Bert- sekas, 2005; Kalashnikov et al., 2018), and has motivated the study of policy search methods, which directly optimize the policy (Deisenroth et al., 2013). Following a similar motivation, we propose a policy search approach based on SGTs. Inspired by policy gradient (PG) algorithms in conventional RL (Sutton et al., 2000; Deisenroth et al., 2013), we propose a parametrized stochastic policy that approximates the optimal sub-goal prediction, and we develop a corresponding PG theorem for training the policy.

Stochastic SGT policies: A stochastic SGT policy π(s′|s1, s2)is a stochastic mapping from two endpoint states  s1, s2 ∈ Sto a predicted sub-goal  s′ ∈ S. Given a start state  s0 = sand goal  sT = g, the likelihood of τ =s0, s1, . . . , sTunder policy  πis defined recursively by:

image

where the base of the recursion is  Prπ[st, st+1|st, st+1] =1 for all  t ∈ [0, T − 1]. This formulation assumes that sub-goal predictions within a segment depend only on states within the segment, and not on states before or after it. This is analogous to the Markov property in conventional RL, where the next state prediction depends only on the previous state, but adapted to a goal-conditioned setting.

Note that this recursive decomposition can be interpreted as a tree. Without loss of generality, we assume this tree has a depth of D thus  T = 2D(repeating states to make the tree a full binary tree does not incur extra cost as c(s, s) = 0).

We now define the PG objective. Let  ρ0denote a distribution over start and goal pairs  s, g ∈ S. our goal is to learn a stochastic policy  πθ(s′|s1, s2), characterized by a parameter vector  θ, that minimizes the expected trajectory costs:

image

where the trajectory distribution  ρ(πθ)is defined by first drawing  (s, g) ∼ ρ0, and then recursively drawing intermediate states as defined by Eq. (8). Our next result is a PG theorem for SGTs.

Theorem 2. Let  πθbe a stochastic SGT policy,  ρ(πθ) be atrajectory distribution defined above, and  T = 2D. Then

image

where  si,d = s(i−1)·2d, si,dm = s(2i−1)·2d−1, gi,d = si·2d, and  Ci,dτ = c(i−1)·2d:i·2dis the sum of costs from  si,d to gi,dof  τ. Furthermore, let the baseline  bi,d = b(si,d, gi,d)be any fixed function  bi,d : S2 → R, then  Ci,dτin (10) can be replaced with  Ci,dτ − bi,d.

The main difference between Theorem 2 and the standard PG theorem (Sutton et al., 2000) is in the calculation of ∇θ log Prρ(πθ)[τ], which in our case builds on the tree-based construction of the SGT trajectory. The summations over i and d in Theorem 2, are used to explicitly state individual sub-goal predictions: d iterates over the depth of the tree, and i iterates between sub-goals of the same depth.

Using Theorem 2, we can sample a trajectory  τusing  πθ, obtain its cost  cτ, and estimate of the gradient of  J(θ) usingcτand the gradients of individual decisions of  πθ.

In the policy gradients literature, variance reduction using control variates, and trust-region methods such as TRPO and PPO play a critical role (Greensmith et al., 2004; Schulman et al., 2015; 2017). The baseline reduction in Theorem 2 allows similar results to be derived for SGT, and we similarly find them to be important in practice. Due to space constraints, we report these in the Supplementary.

5.3. The SGT-PG Algorithm

Following the theoretical results in Section 4.1, where the greedy SGT trajectory used a different value function to predict sub-goals at different depths of the tree, we can

image

expect a stochastic SGT policy that depends on the depth in the tree (d in Theorem 2) to perform better than a depthindependent policy. Theorem 2 holds true when  π dependson d, similarly to a time-dependent policy in standard PG literature. We denote such a depth dependent policy as πd, for  d ∈ 1, . . . , D. At depth d = 0, no sub-goal is predicted resulting in (s, g) being directly connected. Next, π1predicts a sub-goal  sm, segmenting the trajectory to two segments of depth 0:  (s, sm)and  (sm, g). The recursive construction continues,  π2predicts a sub-goal and calls  π1on the resulting segments, and so on until depth D.

An observation that we found important for improving training stability, is that the policies can be trained sequentially. Namely, we first train the d-depth policy, and only then start training d + 1-depth policy, while freezing all policies of depth  ≤ d.

Our algorithm implementation, SGT-PG, is detailed in Algorithm 2. SGT-PG predicts sub-goals, and interacts with an environment E that evaluates the cost of segments  (s, s′).SGT-PG maintains D depth-specific policies:  {πi}Di=1, each parametrized by  {θi}Di=1respectively (i.e., we do not share parameters for policies at different depths, though this is possible in principle). The policies are trained in sequence, and every training cycle is comprised of on-policy data collection followed by policy update. The collect method collects on-policy data given N, the number of episodes to generate; d, the index of the policy being trained; and E,

image

Figure 2. Batch RL experiment. (a) A robot needs to navigate between the (hatched) obstacles. Blue - SGT prediction, green -trajectory tracking sub-goals using an inverse model. (b) Approximate values ˆVk(s, g = [0.9, 0.9])for several values of k. Note how the reachable region to the goal (non-yellow) grows with k.

the environment. We found that for reducing noise when training  πd, it is best to sample from  πdonly and take the mean predictions of  {πi}d−1i=1. The compute-PG method, based on Eq. (10), uses the collected data D, and estimates ∇θdJ(θd). We found that similar to sequential RL, adding a trust region using the PPO optimization objective (Schul- man et al., 2017) provides stable updates (see Section E.4 for specific loss function). Finally, optimizer.step updates  θdaccording to any SGD optimizer algorithm, which completes a single training cycle for  πd. We proceed to train the next policy  πd+1when a pre-specified convergence criterion is met, for instance, until the expected cost of predicted trajectories  E [cτ]stops decreasing.

In this section we compare our SGT approach with the conventional sequential method (i.e. predicting the next state of the trajectory). We consider APSP problems inspired from robotic motion planning, where the goal is to find a collision-free trajectory between a pair of start and goal states. We illustrate the SGT value functions and trajectories on a simple 2D point robot domain, which we solve using Fitted SGTDP (Section 6.1). We then consider a more challenging domain with a simulated 7DoF robotic arm, and demonstrate the effectiveness of SGT-PG (Section 6.2).

6.1. Fitted SGTDP Experiments

We start by evaluating the Fitted SGTDP algorithm. We consider a 2D particle moving in an environment with obstacles, as shown in Figure 2a. The particle can move a distance of 0.025 in one of the eight directions, and suffers a constant cost of 0.025 in free space, and a large cost of 10 on collisions. Its task is reaching from any starting point to within a 0.15 distance of any goal point without collision. This simple domain is a continuous-state optimal control Problem (2), and for distant start and goal points, as shown in Figure 2a, it requires relatively long-horizoned planning, making it suitable for studying batch RL algorithms.

To generate data, we sampled states and actions uniformly and independently, resulting in  125K (s, u, c, s′)tuples. As for function approximation, we opted for simplicity, and used K-nearest neighbors (KNN) for all our experiments, with Kneighbors = 5. To solve the minimization over states in Fitted SGTDP, we discretized the state space and searched over a  50 × 50grid of points.

A natural baseline in this setting is FQI (Ernst et al., 2005; Riedmiller, 2005). We verified that for a fixed goal, FQI obtains near perfect results with our data. Then, to make it goal-conditioned, we used a universal Q-function (Schaul et al., 2015), requiring only a minor change in the algorithm (see supplementary for pseudo-code).

To evaluate the different methods, we randomly chose 200 start and goal points, and measured the distance from the goal the policies reach, and whether they collide with obstacles along the way. For FQI, we used the greedy policy with respect to the learned Q function. The approximate SGTDP method, however, does not automatically provide a policy, but only a state trajectory to follow. Thus, we experimented with two methods for extracting a policy from the learned sub-goal tree. The first is training an inverse model fIM(s, s′)– a mapping from  s, s′ to u, using our data, and the same KNN function approximation. To reach a sub-goal g from state s, we simply run  fIM(s, g)until we are close enough to g (we set the threshold to 0.15). An alternative method is first using FQI to learn a goal-based policy, as described above, and then running this policy on the sub-goals. The idea here is that the sub-goals learned by approximate SGTDP can help FQI overcome the long-horizon planning required in this task. Note that all methods use exactly the same data, and the same function approximation (KNN), making for a fair comparison.

In Table 1 we report our results. FQI only succeed in reaching the very closest goals, resulting in a high average distance to goal. Fitted SGTDP (approx. SGTDP+IM), on the other hand, computed meaningful sub-goals for almost all test cases, resulting in a low average distance to goal when tracked by the inverse model. Figure 2 shows an example sub-goal tree and a corresponding tracked trajectory. The FQI policy did learn not to hit obstacles, resulting in the lowest collision rate. This is expected, as colliding leads to an immediate high cost, while the inverse model is not trained to take cost into account. Interestingly, combining the FQI policy with the sub-goals improves both long-horizon planning and short horizoned collision avoidance. In Figure 2b we plot the approximate value function ˆVkfor different k and a specific goal at the top right corner. Note how the reachable parts of the state space to the goal expand with k.

image

Figure 3. Motion planning with SGT-PG. (a) Illustration of start (red) and goal (blue) positions of the end effector in wall domain. (b) SGT Trajectory. Note a non-trivial rotation between  s2 to s3(marked with orange circles) allowing linear motion from  s3 to g. In thesupplementary material we also show a 7 sub-goals path.

image

Table 1. Results for controllers of batch-RL experiments.

6.2. Neural Motion Planning

An interesting application domain for our approach is neural motion planning (NMP, Qureshi et al., 2018) – learning to predict collision-free trajectories for a robot among obstacles. Here, we study NMP for the 7DoF Franka Panda robotic arm. Due to lack of space, full technical details of this section appear in Supplementary E.1.

We follow an RL approach to NMP (Jurgenson & Tamar, 2019), using a cost function that incentivizes short, collision-free trajectories that reach the goal. The state space is the robot’s 7 joint angles, and actions correspond to predicting the next state in the plan. Given the next predicted state, the robot moves by running a fixed PID controller in simulation, tracking a linear joint motion from current to next state, and a cost is incurred based on the resulting motion.

We formulate NMP as approximate APSP as follows. A model predicts  T − 1sub-goals, resulting in T motion segments. Those segments are executed and evaluated independently, i.e. when evaluating segment  (s, s′)the robot first resets to s, and the resulting cost is based on its travel to  s′.

Our experiments include two scenarios: self-collision and wall. In self-collision, there are no obstacles and the challenge is to generate a minimal distance path while avoiding self-collisions between the robot links. The more challenging wall workspace contains a wall that partitions the space in front of the robot (see Figure 3). In wall, the shortest path is often nonmyopic, requiring to first move away from the goal in order to pass the obstacle.

We compare SGT-PG with a sequential baseline, Sequential sub-goals (SeqSG), which prescribes the sub-goals predictions sequentially. For appropriate comparison, both models use a PPO objective (Schulman et al., 2017), and a fixed

image

Table 2. Success rates for the NMP scenarios.

architecture neural network to model the policy. All other hyper-parameters were specifically tuned for each model.

Table 2 compares SGT-PG and SeqSG, on predictions of 1, 3, and 7 sub-goals. We evaluate success rate (reaching goal without collision) on 100 random start-goal pairs that were held out during training (see Figure 3a). Each experiment was repeated 3 times and the mean and range are reported. Note that only 1 sub-goal is required to solve self-collision, and both models obtain perfect scores. For wall, on the other hand, more sub-goals are required, and here SGT significantly outperforms SeqSG, which was not able to accurately predict several sub-goals.

We presented a framework for multi-goal RL that is derived from a novel first principle – the SGT dynamic programming equation. For deterministic domains, we showed that SGTs are less prone to drift due to approximation errors, reducing error accumulation from  O(N 2)to O(N log N). We further developed value-based and policy gradient RL algorithms for SGTs, and demonstrated that, in line with their theoretical advantages, SGT demonstrate improved performance in practice.

Our work opens exciting directions for future research, including: (1) can our approach be extended to stochastic environments? (2) how to explore effectively based on SGTs? (3) can SGTs be extended to image-based tasks? Finally, we believe that our ideas will be important for robotics and autonomous driving applications, and other domains where goal-based predictions are important.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O. P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bertsekas, D. Dynamic programming and optimal control: Volume II. 2005.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Bishop, C. M. Mixture density networks. Technical report, Citeseer, 1994.

Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends R⃝in Robotics, 2(1–2):1–142, 2013.

Dhiman, V., Banerjee, S., Siskind, J. M., and Corso, J. J. Floyd-warshall reinforcement learning learning from past experiences to reach new goals. arXiv preprint arXiv:1809.09318, 2018.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.

Fruit, R., Pirotta, M., Lazaric, A., and Brunskill, E. Regret minimization in mdps with options without prior knowledge. In Advances in Neural Information Processing Systems, pp. 3166–3176, 2017.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduc- tion techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov): 1471–1530, 2004.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep rein- forcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389– 3396. IEEE, 2017.

Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., and Schaal, S. Dynamical movement primitives: learning attractor models for motor behaviors. Neural computation, 25(2):328–373, 2013.

Jayaraman, D., Ebert, F., Efros, A. A., and Levine, S. Time- agnostic prediction: Predicting predictable video frames. In ICLR, 2019.

Jurgenson, T. and Tamar, A. Harnessing reinforcement learning for neural motion planning. In Robotics: Science and Systems (RSS), 2019.

Kaelbling, L. P. Learning to achieve goals. In IJCAI, pp. 1094–1099. Citeseer, 1993.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.

Konidaris, G., Kuindersma, S., Grupen, R., and Barto, A. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 31 (3):360–375, 2012.

LaValle, S. M. Planning algorithms. Cambridge university press, 2006.

Mann, T. and Mannor, S. Scaling up approximate value iteration with options: Better policies with fewer iterations. In International conference on machine learning, pp. 127–135, 2014.

McGovern, A. and Barto, A. G. Automatic discovery of subgoals in reinforcement learning using diverse density. 2001.

Menache, I., Mannor, S., and Shimkin, N. Q-cutdynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning, pp. 295–306. Springer, 2002.

Mishra, N., Abbeel, P., and Mordatch, I. Prediction and control with temporal segment models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 2459–2468. JMLR. org, 2017.

M¨ulling, K., Kober, J., Kroemer, O., and Peters, J. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 32(3):263–279, 2013.

Munos, R. and Szepesv´ari, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 (May):815–857, 2008.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682– 697, 2008.

Pomerleau, D. A. ALVINN: An autonomous land vehicle in a neural network. In NIPS, pp. 305–313, 1989.

Qureshi, A. H., Bency, M. J., and Yip, M. C. Motion planning networks. arXiv preprint arXiv:1806.05767, 2018.

Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317– 328. Springer, 2005.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imita- tion learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.

Russell, S. J. and Norvig, P. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

ucan, I. A. and Kavraki, L. E. Kinodynamic motion planning by interior-exterior cell exploration. In Algorithmic Foundation of Robotics VIII, pp. 449–464. Springer, 2009.

ucan, I. A., Moll, M., and Kavraki, L. E. The Open Motion Planning Library. IEEE Robotics & Automation Magazine, 19(4):72–82, December 2012. doi: 10.1109/MRA. 2012.2205651. http://ompl.kavrakilab.org.

Sung, J., Jin, S. H., and Saxena, A. Robobarista: Object part based transfer of manipulation trajectories from crowdsourcing in 3d pointclouds. In Robotics Research, pp. 701–720. Springer, 2018.

Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2): 181–211, 1999.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.

Tsitsiklis, J. N. and Van Roy, B. Regression methods for pricing complex american-style options. IEEE Transactions on Neural Networks, 12(4):694–703, 2001.

Zhang, H., Heiden, E., Julian, R., He, Z., Lim, J. J., and Sukhatme, G. S. Auto-conditioned recurrent mixture density networks for complex trajectory generation. arXiv preprint arXiv:1810.00146, 2018.

Proof of Theorem 1

Proof. First, by definition, the shortest path from s to itself is 0. In the following, therefore, we assume that  s ̸= s′.

We show by induction that each  Vk(s, s′)in Algorithm (4) is the cost of the shortest path from  s to s′ in 2k steps or less.

Let  τk(s, s′)denote a shortest path from s to  s′in  2ksteps or less, and let  Vk(s, s′)denote its corresponding cost. Our induction hypothesis is that  Vk−1(s, s′)is the cost of the shortest path from  s to s′ in 2k−1 steps or less. We will show that Vk(s, s′) = minsm {Vk−1(s, sm) + Vk−1(sm, s′)}.

Assume by contradiction that there was some  s∗ such that Vk−1(s, s∗) + Vk−1(s∗, s′) < Vk(s, s′). Then, the concatenated trajectory  [τk−1(s, s∗), τk−1(s∗, s′)]would have  2ksteps or less, contradicting the fact that  τk(s, s′)is a shortest path from s to  s′in  2ksteps or less. So we have that  Vk(s, s′) ≤ {Vk−1(s, sm) + Vk−1(sm, s′)} ∀sm. Since the graph is complete,  τk(s, s′)can be split into two trajectories of length  2k−1steps or less. Let  smbe a midpoint in such a split. Then we have that  Vk(s, s′) = {Vk−1(s, sm) + Vk−1(sm, s′)}. So equality can be obtained and thus we must have Vk(s, s′) = minsm {Vk−1(s, sm) + Vk−1(sm, s′)}.

To complete the induction argument, we need to show that  V0(s, s′) = c(s, s′). This holds since for k = 0, for each  s, s′, the only possible trajectory between them of length 1 is the edge  s, s′.

Finally, since there are no negative cycles in the graph, for any  s, s′, the shortest path has at most N steps. Thus, for k = log2(N), we have that  Vk(s, s′)is the cost of the shortest path in N steps or less, which is the shortest path between s and  s′.

Proof of Proposition 1

Proof. The SGTDP Operator T is non-linear and not a contraction, but it is monotonic:

image

To show (11), let  s′m = arg minsm {Vβ(s, sm) + Vβ(sm, g)}. Then:

image

Back to the proof. By denoting e = (1, 1, 1, 1, ..., 1) we can write (6) as:

image

Since T is monotonic, we can apply it on the inequalities:

Focusing on the left-hand side expression (the right-hand side is symmetric) we obtain:

T(V0 − eϵ)(s, s′) = minsm {(V0 − eϵ)(s, sm) + (V0 − eϵ)(sm, s′)} = minsm {V0(s, sm) + V0(sm, s′) − 2ϵ} = TV0(s, s′) − 2ϵ

leading to:

image

Using (6):

image

Proceeding similarly we obtain for ˆV2:

image

And for every  k ≥ 1:

image

For  k = log2 Nwe obtain:

image

Proof of Proposition 2

Proof. For every iteration k, the middle state of any greedy SGT path with length 2k is ˆsm =arg minsm�ˆVk−1(s, sm) + ˆVk−1(sm, g)�. Thereby  ˆsmfulfils the identity ˆVk−1(s, ˆsm) + ˆVk−1(ˆsm, g) = T ˆVk−1(s, g), for every iteration k, where T is the SGT operator. The next two relations then follow:

image

Combining the two inequalities yields:

Explicitly writing down relation (15) for all the different sub-paths we obtain:

image

image

Figure 4. Example graph with a start and goal states s, g where the optimal path has a total cost of  Nϵ while ϵ-approximated Bellman might form a path with total cost of infinity

image

Summing all the inequalities, along with the triangle inequality, lead to the following:

image

The identities  c(si, si+1) = V0(si, si+1) and V ∗(s0, sN) = Vlog2 N(s0, sN)complete the proof.

Proposition 2 provides a bound on the approximate shortest path using SGTDP. In contrast, we show that a similar bound for the sequential Bellman approach does not hold. For a start and goal pair s, g, and an approximate Bellman value function ˆV B, let  sB0 , . . . , sBNdenote the greedy shortest path according to the Bellman operator, i.e.,  s0 = s, sN = gand for  1 ≤ k < N: sk+1 = arg minsm�c(sk, sm) + ˆV B(sm, g)�. Note that the greedy Bellman update is not guaranteed to generate a trajectory that reaches the goal, therefore we add g as the last state in the trajectory, following our convention that the graph is fully connected. When evaluating the greedy trajectory error in the sequential approach, we deal with two different implementation cases: Using one approximated value function or N approximated value functions. The next proposition shows that when using one value function the greedy trajectory can be arbitrarily bad, regardless of  ϵ. Propositions (4) and (5) show that the error of the greedy trajectory, using N value functions, have a tight  O(N 2) bound.

Proposition 3. For any  ϵ, β, there exists a graph and an approximate value function ˆV Bsatisfying  ∥ ˆV B − V ∗∥∞ ≤ ϵN, such that �N−1i=0 c(si, si+1) ≥ V ∗(s, g) + β.

Proof. We show a graph example (Figure 4), where a  ϵ-approximate Bellman value function ˆV Bmight form the path s0, s0, . . . , s0, gwith a total cost of infinity. (Reminder: The graph is fully connected, all the non-drawn edges have infinity cost), while the optimal path  s0, s1, . . . , sN−1, ghas a total cost of  Nϵ.

As ˆV Bis independent of the current time-step k, and the optimal values for s and  s1are  Nϵand  (N − 1)ϵrespectively, due to approximation error ˆV B(for every time-step) might suggest that the value of  s0is lower than the value of  s1: ˆV B(s0, g) − Nϵ ≤ ˆV B(s1, g) + Nϵ. The resulting policy will choose to stay in  s0for the first  N − 1steps. Since the maximum path length is N and the path must end at the goal, the last step will be directly from  s0 to g, resulting in a cost of infinity.

image

Proposition 4. For a finite state graph, start and goal pair s, g, and sequence of  ϵ-approximate Bellman value functions ˆV B0 , . . . , ˆV BNsatisfying  ∥ ˆV Bk+1 − T B ˆV Bk ∥∞ ≤ ϵand  ∥ ˆV B0 − V0∥∞ ≤ ϵ, let  s0, . . . , sNdenote the greedy Bellman path, that is:  s0 = s, sN = g, sk+1 = arg minsk+1 c(sk, sk+1) + ˆV BN−k−2(sk+1, g),etc. We have that  �N−1i=0 c(si, si+1) ≤V ∗(s, g) + (N 2 − N)ϵ = V ∗(s, g) + O(N 2).

Proof. Denote  Vkas the Bellman value function at iteration  k, ˆVk as the ϵ-approximated Bellman value function and  TB asthe Bellman operator. Using the properties  ∥TB ˆVk−1 − ˆVk∥∞ ≤ ϵand  ∥Vk − ˆVk∥∞ ≤ kϵ(Bertsekas & Tsitsiklis, 1996, Pg. 332), the following relation holds:

image

Figure 5. Example graph with a start and goal states s, g. The edges costs are marked in black labels, the Bellman value function at iteration k is marked in green and the  ϵ-approximated Bellman value is marked in red. The optimal path has a total cost of  V ∗(s, g) = (N − 1)ϵwhile the  ϵ-approximated Bellman might form the path going through all the states with a total cost of  N 2 − Nϵ/2 = V ∗(s, g) + O(N 2).

for  0 ≤ k ≤ N − 2:

image

for  k = N − 1:

image

Summing all the path costs:

Proposition 5. The error bound stated in Proposition(4), �N−1i=0 c(si, si+1) − V ∗(s, g) ≤ O(N 2), is indeed a tight bound.

Proof. We show a graph example (Figure 5),where a sequence of  ϵ-approximate Bellman value functions ˆV0, . . . , ˆVN−1might form the path from the start to the goal state, going through all the available states, with a total cost of  (N 2 − N)/2 =O(N 2), while the optimal path  (s0, sN−1, sN)has a total cost of  V ∗(s, g) = (N − 1)ϵ.

The Bellman value of every state  sk, 0 ≤ k ≤ sN−3: VN−k−1(sk, g) = (N − k − 1)ϵ, where  N − k − 1is the time horizon from state  skto the goal. Since the  ϵ-approximate Bellman value function at every state  skmight have a maximum approximation error of  (N − k − 1)ϵ, it might suggest that ˆVN−k−1(sk, g) = 0for every  1 ≤ k ≤ N − 2. The Bellman value for state  sN−1is 0 and the  ϵ-approximate Bellman value might be  δ for some 0 < δ ≤ ϵ. The greedy Bellman policy has to determine at every state  sk, 0 ≤ k ≤ sN−3, whether to go to  sk+1or to  sN−1. Since both possible actions have an immediate cost of  (N − k − 1)ϵ, the decision will only be based on the evaluation of  sk+1 and sN−1 with the ϵ-approximate Bellman value function: ˆVN−k−1(sk, g) = 0 < δ = ˆV0(sN−1, g). Therefore, the greedy bellman policy will decide at every state  sk to go to sk+1, forming the path with a total cost of �N−1k=1 kϵ = N 2 − Nϵ/2

image

In this section we provide the mathematical framework and tool we used in the Policy Gradient Theorem 2 in Section 5.2 which allows us to formulate the SGT-PG algorithm (Section 5.2). The SGT the prediction process (Eq.(8)) no-longer decomposes sequentially like a MDP, therefore, we provide a different mathematical construct:

Finite-depth Markovian sub-goal tree (FD-MSGT) is a process predicting sub-goals (or intermediate states) of a trajectory in a dynamical system as described in (3). The process evolves trajectories recursively (as we describe in detail below), inducing a tree-like decomposition for the probability of a trajectory as described Eq. (8). In this work we consider trees with fixed levels D (corresponding to finite-horizon MDP in sequential prediction RL with horizon  T = 2D) 7. Formally, a FD-MSGT is comprised of  (S, ρ0, c, D)where S is the state space,  ρ0is the initial start-goal pairs distribution, c is a non-negative cost function obeying Eq. (3), and D is the depth of the tree 8.

Recursive evolution of FD-MSGT: Formally, an initial pair  (s0 = s, sT = g) ∼ ρ0is sampled, defining the root of the tree. Next, a policy  π(s T2 |s0, sT )is used to predict the sub-goal  s T2 , creating two new tree nodes of depth D corresponding to the segments  (s0, s T2 ) and (s T2 , sT ). Recursively each segment is again partitioned using  πresulting in four tree nodes of level  D − 1corresponding to segments  (s0, s T4 ), (s T4 , s T2 ), (s T2 , s 3T4 )and  (s 3T4 , sT ). The process continues recursively until the depth of the tree is 1. This results in Eq.(8), namely,

image

FD-MSGT results in �D−1d=0 2d = 2D − 1sub-goals, and  2D + 1overall states (including  s and g), setting T = 2D. Finally,c defines the cost of the prediction  c(τ) = c0:T, and is evaluated on the leaves of the FD-MSGT tree according to Eq. (3). Figure 3b illustrates a FD-MSGT of D = 2. The objective of FD-MSGT is to find a policy  π : S2 → Sminimizing the expected cost of a trajectory  Jπ = Eτ∼ρ(π) [cτ].

Non-recursive formulation for trajectory likelihood: We next derive a non-recursive formulation of Eq. (8), using the notations defined in Theorem 2, namely,  si,d = s(i−1)·2d, si,dm = s(2i−1)·2d−1, gi,d = si·2d, and  Ci,dτ = c(i−1)·2d:i·2dis the sum of costs from  si,d to gi,d of τ. We re-arrange the terms in (8) grouping by depth resulting in a non-recursive formula,

image

For example consider a FD-MSGT with D = 3 (T = 8). Given  s0 = sand  s8 = g, the sub-goals will be  s1, . . . s7. We enumerate the two products in Eq. (16) resulting in:

image

Allowing us to assert the equivalence of the explicit and recursive formulations in this case.

B.1. Proof of Theorem 2

Let  πθbe a policy with parameters  θ, we next prove a policy gradient theorem (Theorem 3) for computing  ∇θJπθ, using the following proposition:

Proposition 6. Let  πθbe a policy with parameters  θof a FD-MSGT with depth D, then

image

Proof. First, we express  Prρ(πθ)[τ] using Eq. 16 and ρ0obtaining,

image

Next, by taking the log we have,

Proposition 6 shows that the gradient of a trajectory w.r.t.  θdoes not depend on the initial distribution. This allows us to derive a policy gradient theorem:

Theorem 3. Let  πθbe a stochastic SGT policy,  ρ(πθ)be a trajectory distribution defined above, and  T = 2D. Then

�cτ ·

image

Proof. To obtain  ∇θJ(θ)we write Eq. (9) as an explicit expectation and use  ∇xf(x) = f(x) · ∇x log f(x):

image

This proves the first equality. For the second equality we substitute  ∇θ log Prρ(πθ)[τ]according to Eq. (17).

The policy gradient theorem allows estimating  ∇J(θ)from on policy data collected using  πθ. We next show how to improve the estimate in Theorem 3 using control variates (baselines). We start with the following baseline-reduction proposition.

Proposition 7. Let  πθbe a policy with parameters  θof a FD-MSGT with depth  D, and let bi,d = b(si,d, gi,d)be any fixed function  bi,d : S2 → R, then

image

Then we expand the left-hand side of Eq. (18), and use Eq. (19) in the last row:

image

which concludes the proof.

We define  b : S2 → R as asegment-dependent baseline if b is a fixed function that operates on a pair of states s and g. The last proposition allows us to reduce segment-dependent baselines,  bi,d, from the estimations of  ∇J(θ)without bias. Finally, in the next proposition we show that instead of estimating  ∇J(θ) using cτwe can instead use  Ci,d as follows:

Proposition 8. Let  πθbe a policy with parameters  θof a FD-MSGT with depth D, then:

image

Proof. We have that:

The first transition is the expectation of sums, and the second is the smoothing theorem. We next expand the inner expectation in Eq. (20), and partition the sum of cost  cτinto three sums: between indices 0 to  (i − 1) · 2d, (i − 1) · 2dto  i · 2d, and finally from  i · 2d to T.

image

Next, by the definition of FD-MSGT the costs between  s0to  s(i−1)·2dand  si·2dto  sT, are independent of the policy predictions between indices  (i − 1) · 2d to i · 2d – making them constants. As we showed in Proposition 7, we obtain that

this expectation is 0. Thus the expression for  ∇θJ(θ)simplifies to:

image

Finally, combining Propositions 7 and 8 in Theorem 3 provides us Theorem 2 in the main text.

In Algorithm 3 we present a goal-based versions of fitted-Q iteration (FQI) (Ernst et al., 2005) using universal function approximation (Schaul et al., 2015), which we used as a baseline in our experiments.

image

Next, in Algorithm 4 we present an approximate dynamic programming version of Floyd-Warshall RL (Kaelbling, 1993) that corresponds to the batch RL setting we investigate. This algorithm was not stable in our experiments, as the value function converged to zero for all states (when removing the self transition fitting in line 7 of the algorithm, the values converged to a constant value).

Algorithm 4 Approximate Floyd Warshall

Algorithm

1 Input: dataset  D = {s, u, c, s′}, Maximum path cost  Cmax2 Create transition data set  Dtrans = {s, s′}and targets  Ttrans = {c} with s, s′, ctaken from D 3 Create random transition data set  Drandom = {s, srand}and targets  Trandom = {Cmax}with  s, srandrandomly chosen from states in D

4 Create self transition data set  Dself = {s, s}and targets  Tself = {0} with staken from D 5 Fit ˆV (s, s′)to data in  Dtrans, Drandom, Dselffor k : 1...K do

6 Create random goal and mid-point data set  Dgoal = {s, g}and targets

In this section we describe how to learn SGT policies for deterministic, stationary, and finite time dynamical systems from labeled data generated by an expert. This setting is commonly referred to as Imitation Learning (IL). Specifically, we are given a dataset  Dπ∗ = {τi}Ni=1of N trajectory demonstrations, provided by an expert policy  π∗. Each trajectory demonstration  τifrom the dataset  Dπ∗, is a sequence of  Tistates, i.e.  τi = si0, si1 . . . siTi.9 In this work, we assume a goal-based setting, that is, we assume that the expert policy generates trajectories that lead to a goal state which is the last state in each trajectory demonstration. Our goal is to learn a policy that, given a pair of current state and goal state (s, g), predicts the trajectory that  π∗ would have chosen to reach g from s.

D.1. Imitation learning with SGT

In this section, we describe the learning objectives for sequential and sub-goal tree prediction approaches under the IL settings. We focus on the Behavioral Cloning (BC)(Pomerleau, 1989) approach to IL, where a parametric model for a policy ˆπwith parameters  θis learned by maximizing the log-likelihood of observed trajectories in  Dπ∗, i.e.,

image

Denote the horizon T as the maximal number of states in a trajectory. For ease of notation we assume T to be the same for all trajectories10. Also, let  si = si0and  gi = siTidenote the start and goal states for  τi. We ask – how to best represent the distribution  Pˆπ(τ|s, g; θ)?

The sequential trajectory representation (Pomerleau, 1989; Ross et al., 2011; Zhang et al., 2018; Qureshi et al., 2018), a popular approach, decomposes  Pˆπ(s0, s1, . . . sT |s, g; θ)by sequentially predicting states in the trajectory conditioned on previous predictions. Concretely, let  ht = s0, s1, . . . , stdenote the history of the trajectory at time index t, the decomposition assumed by the sequential representation is  Pˆπ(s0, s1, . . . sT |s, g; θ) = Pˆπ(s1|h0, g; θ)Pˆπ(s2|h1, g; θ) . . . Pˆπ(sT |hT −1, g; θ).Using this decomposition, (22) becomes:

image

We can learn  Pˆπusing a batch stochastic gradient descent (SGD) method. To generate a sample  (st, ht−1, g)in a training batch, a trajectory  τiis sampled from  Dπ∗, and an observed state  sitis further sampled from  τi. Next, the history  hit−1and goal  giare extracted from  τi. After learning, sampling a trajectory between s and g is a straight-forward iterative process, where the first prediction is given by  s1 ∼ Pˆπ(s|h0 = s, g; θ), and every subsequent prediction is given by st+1 ∼ Pˆπ(s|ht = (s, s1, . . . st), g; θ). This iterative process stops once some stopping condition is met (such as a target prediction horizon is reached, or a prediction is ’close-enough’ to g).

The Sub-Goal Tree Representation: In SGT we instead predict the middle-state in a divide-and-conquer approach. Following Eq. (8) we can redefine the maximum-likelihood objective as

image

To organize our data for optimizing Eq. (24), we first sample a trajectory  τi from Dπ∗ for each sample in the batch. From  τiwe sample two states  sia and sib and obtain their midpoint  sia+b2 . Pseudo-code for sub-goal tree prediction and learning are provided in Algorithm 5 and Algorithm 6.

D.2. Imitation Learning Experiments

We compare the sequential and Sub-Goal Tree (SGT) approaches for BC. Consider a point-robot motion-planning problem in a 2D world, with two obstacle scenarios termed simple and hard, as shown in Figure 6. In simple, the distribution of possible motions from left to right is uni-modal, while in hard, at least 4 modalities are possible.

image

Figure 6. IL Experiment domains and results. The simple domain (a) and hard domain (b). A point robot should move only on the white free-space from a room on one side to the room on the other side while avoiding the blue obstacles. SGT plan (blue) executed successfully, sequential plan (green) collides with the obstacles.

image

Table 3. Results for IL experiments.

For both scenarios we collected a set of 111K (100K train + 10K validation + 1K test) expert trajectories from random start and goal states using a state-of-the-art motion planner (OMPL’s(ucan et al., 2012) Lazy Bi-directional KPIECE (ucan & Kavraki, 2009) with one level of discretization). To account for different trajectory modalities, we chose a Mixture Density Network (MDN)(Bishop, 1994) as the parametric distribution of the predicted next state, both for the sequential and the SGT representations. We train the MDN by maximizing likelihood using Adam (Kingma & Ba, 2014). To ensure the same model capacity for both representations, we used the same network architecture, and both representations were trained and tested with the same data. Since the dynamical system is Markovian, the current and goal states are sufficient for predicting the next state in the plan, so we truncated the state history in the sequential model’s input to contain only the current state.

For simple, the MDNs had a uni-modal multivariate-Gaussian distribution, while for hard, we experimented with 2 and 4 modal multivariate-Gaussian distributions, denoted as hard-2G, and hard-4G, respectively. As we later show in our results the SGT representation captures the demonstrated distribution well even in the hard-2G scenario, by modelling a bi-modal sub-goal distribution.

We evaluate the models using unseen start-goal pairs taken from the test set. To generate a trajectory, we predict sub-goals, and connect them using linear interpolation. We call a trajectory successful if it does not collide with an obstacle en-route to the goal. For a failed trajectory, we further measure the severity of collision by the percentage of the trajectory being in collision.

Table 3 summarizes the results for both representations. The SGT representation is superior in all three evaluation criteria - motion planning success rate, trajectory prediction times (total time in seconds for the 1K trajectory predictions), and severity. Upon closer look at the two hard scenarios, the Sub Goal Tree with a bi-modal MDN outperforms sequential with 4-modal MDN, suggesting that the SGT trajectory decomposition better accounts for multi-modal trajectories.

image

Table 4. Success rate on the NMP poles scenario

In this appendix we provide the full technical details of our policy gradient experiments presented in Section 6.2. We start by providing the explicit formulation of the NMP environment. We further provide a sequential-baseline not presented in the main text and its NMP environment and experimental results related to this baseline. We also present another challenging scenario – poles, and the results of SGT-PG and SeqSG in it. Finally, we conclude with modeling-related technical details 11.

E.1. Neural Motion Planning Formulation

In Section 6.2, we investigated the performance of SGT-PG and SeqSG executed in a NMP problem for the 7DoF Franka Panda robotic arm. That NMP formulation, denoted here as with-reset, allows for independent evaluations of motion segments, a requirement for FD-MSGT. The state  s ∈ R9describes the robot’s 7 joint angles and 2 gripper positions, normalized to  [−1, 1]. The models, SGT-PG and SeqSG, are tasked with predicting the states (sub-goals) in the plan.

A segment  (s, s′)is evaluated by running a PID controller tracking a linear joint motion from s to  s′. Let  sP IDdenote the state after the PID controller was executed, and let  αfreeand  αcollisionbe hyperparameters. We next define a cost function that encourages shorter paths, and discourages collisions; if a collision occurred during the PID execution, or 5000 simulation steps were not enough to reach  s′, a cost of  αfree · ||s − sP ID||2 + αcollision · ||s′ − sP ID||2is returned. Otherwise, the motion was successful, and the segment cost is  αfree · ||s − s′||2. In our experiments we set  αfree = 1 andαcollision = 100. Figures 9 and 10show the training curves for the self-collision and wall scenarios respectively.

E.2. The SeqAction Baseline

We also evaluated a sequential baseline we denoted as SeqAction, which is more similar to classical RL approaches. Instead of predicting the next state, SeqAction predicts the next action to take (specifically the joint angles difference), and a position controller is executed for a single time-step instead of the PID controller of the previous methods.

SeqAction operates on a finite-horizon MDP (with horizon of T = 5000 steps). We follow the sequential NMP formulation of Jurgenson & Tamar (2019); the agent gets a high cost on collisions, high reward when reaching close to the goal, or a small cost otherwise (to encourage shorter paths). We denote this NMP variant no-reset as segments are evaluated sequentially, without resetting the state between evaluations. To make the problem easier, after executing an action in no-reset, the velocity of all the links is manually set to 0.

Similarly to Jurgenson & Tamar (2019), we found that sequential RL agents are hard to train for neural motion planning with varying start and goal positions without any prior knowledge or inductive bias. We trained SeqAction in the self-collision scenario using PPO (Schulman et al., 2017), and although we achieved 100% success rate when fixing both the start and the goal (to all zeros and all ones respectively), the model was only able to obtain less than 0.1 success rate when varying the goal state, and showed no signs of learning when varying both the start and the goal.

E.3. The poles Scenario

We tested both SGT-PG and SeqSG on another scenario we call poles where the goal is to navigate the robot between poles as shown in Figures 7 and 8. Similar to the previous scenarios, Table 4 shows the success rates of both models for 1, 3, and 7 sub-goals. Again, we notice that SGT-PG attains much better results and improves as more sub-goals are added, while the performance of SeqSG deteriorates. Moreover, even for a single sub-goal, SGT-PG obtains better success rates compared to

image

Figure 7. Robot trajectory poles scenario predicted with SGT-PG with 7 sub-goals. Top left start state, bottom right goal state. Notice that the two final segments  s6 to s7 and s7 to gare longer than the rest, but are made possible by the robot getting into position in state  s6.Note that SGT-PG failed to find a trajectory for this (s, g) pair with less sub-goals, indicating the hardness of the path.

image

Figure 8. Start (blue) and goal (red) states in the test set of the poles scenario. Note that the agents are not exposed to these particular states during training.

SeqSG. The high range displayed by SGT-PG is due to one random seed that obtained significantly lower scores.

E.4. Neural Network Architecture and Training Procedure

We next provide full technical details regarding the architecture and training procedure of SGT-PG and SeqSG described in Section 6.2.

Architecture and training of SGT-PG: The start and goal states, s and g, are fed to a neural network with 3 hidden-layers of 20 neurons each and tanh activations. The network learns a multivariate Gaussian-distribution over the predicted sub-goal with a diagonal covariance matrix. Since the covariance matrix is diagonal, the size of the output layer is  18 = 2 · 9, 9 elements learn the bias, and 9 learn the standard deviation in each direction. To predict the bias, a value of s+g2is added to the prediction of the network. To obtain the final standard deviation, we add two additional elements to the standard deviation predicted by the neural network : (1) a small value (0.05), to prevent learning too narrow distributions, and (2) a learnable coefficient in each of the directions that depends on the distance between s and g, which allows large segments to predict a wider spread of sub-goals. These two changes were incorporated in order to mitigate log-likelihood evaluation issues.

SGT-PG trains on 30 (s, g) pairs per training cycle, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.005 on the PPO objective (Schulman et al., 2017) augmented with a max-entropy loss term with coefficient 1, and a PPO trust region of  ϵ = 0.2. As mentioned in Section 5.2, when training level d for a single (s, g) pair, we sample M sub-goals (in our experiments M = 10) from the multivariate Gaussian described above, take the mean predictions of policies of depth  d − 1and lower, and take the mean of costs as a (s, g)-dependent baseline. The repetition allows us to obtain a stable baseline without incorporating another learnable module – a value function. We found this method to be more stable than training on 300 different start-goal pairs (no repetitions). During test time we always take the mean prediction generated by the policies.

Architecture of SeqSG: This model has two components, a policy and a value function. The policy architecture is identical to that of SGT-PG. Note that only a single policy is maintained as the optimal policy is independent of the remaining

image

Figure 9. Success rates over training episodes of the self-collision scenario.

horizon12. We also found that learning the covaraince matrix causes stability issues for the algorithm, so the multivariate Gaussian only learns a bias term. The value function of SeqSG has 3 layers with 20 neurons each, and Elu activations. It predicts a single scalar which approximates V (s, g), and is used as a state-dependent baseline to reduce the variance of the gradient estimation during training.

In order to compare apples-to-apples, SeqSG also trains on 30 (s, g) pairs with 10 repetitions each. We trained the policy with the PPO objective (Schulman et al., 2017) with trust region of  ϵ = 0.05. We used the Adam optimizer (Kingma & Ba, 2014), with learning rate of 0.005 but we clipped the gradient with L2 norm of 10 to stabilize training. The value function is trained to predict the value with mean-squared error loss. We also trained it with Adam with a learning rate of 0.005 but clipped the gradient L2 norm to 100.

We note that finding stable parameters for SeqSG was more challenging than for SGT-PG, as apparent by the gradient clippings, the lower trust region value, and our failure to learn the mutivariate Gaussian distribution covariance matrix.

image

Figure 10. Success rates over training episodes of the wall scenario.


Designed for Accessibility and to further Open Science