b

DiscoverSearch
About
My stuff
Robust Model Predictive Shielding for Safe Reinforcement Learning with Stochastic Dynamics
2019·arXiv
Abstract
Abstract

This paper proposes a framework for safe reinforcement learning that can handle stochastic nonlinear dynamical systems. We focus on the setting where the nominal dynamics are known, and are subject to additive stochastic disturbances with known distribution. Our goal is to ensure the safety of a control policy trained using reinforcement learning, e.g., in a simulated environment. We build on the idea of model predictive shielding (MPS), where a backup controller is used to override the learned policy as needed to ensure safety. The key challenge is how to compute a backup policy in the context of stochastic dynamics. We propose to use a tube-based robust NMPC controller as the backup controller. We estimate the tubes using sampled trajectories, leveraging ideas from statistical learning theory to obtain high-probability guarantees. We empirically demonstrate that our approach can ensure safety in stochastic systems, including cart-pole and a non-holonomic particle with random obstacles.

There has been much recent progress in reinforcement learning (RL) [1], [2], [3], [4], [5]. As a consequence, there has been interest in using RL to design control policies for solving complex robotics tasks [6], [7], [8], including grasping [9] and multi-agent planning [10], [11], [12]. In particular, learning-enabled controllers (LECs) have the potential to outperform optimization-based controllers [13]. In addition, optimization-based controllers can often only be used under strong assumptions about the system dynamics, system constraints, and objective functions [14], [15], which limits their applicability to complex robotics tasks.

However, safety concerns prevent LECs from being widely used in real-world tasks, which are often safety-critical in nature. In particular, unlike optimization-based controllers, it is typically infeasible to impose hard safety constraints on LECs. For example, even a simple safety property such as having a robot avoid static obstacles is nontrivial to impose. More complex safety properties, such as ensuring that a cart-pole or walking robot does not fall over, pose additional difficulty. Additionally, the real world typically has stochastic disturbances such as friction or wind, or the model used to simulate the dynamics may have errors; if an LEC is not robust to these perturbations, then it may fail catastrophically [16]. The issue is that the LEC is typically a black box machine learning model such as a deep neural network (DNN), and reasoning about the closed-loop behavior of the DNN poses computational challenges.

image

As a consequence, safe reinforcement learning has become an increasingly important area of research [17], [18], [19], [20], [21], [22], [23]. Many methods in this area leverage optimization tools to prove that a learned neural network policy satisfies a given safety constraint [24], [25], [26], [27], [19], [20], [23]. A related approach is shielding, which verifies a simpler backup controller, and then overrides the LEC using the backup controller when it can no longer ensure that using the LEC is safe [17], [18], [21], [28]. While these methods provide strong mathematical guarantees, they suffer from a number of shortcomings. For example, many of these methods scale exponentially in the state dimension, so they do not scale well to high-dimensional systems. Those that do typically rely on overapproximating reachable set of states, which can become very imprecise.

We build on a recently proposed idea called model predictive shielding (MPS), which has been used to ensure safety of learned control policies [29], [28], including extensions to the multi-agent setting [30]. The basic idea is that rather than checking which states are safe ahead-of-time, we can dynamically check whether we can maintain safety if we use the LEC, and only use the LEC if we can do so. However, existing approach are limited—there has been work considering nonlinear dynamics, but assuming they are deterministic [28], [30], and there has been work considering stochastic or nondeterministic dynamics, but assuming they are linear [29]. Nonlinearity is important because many tasks where LECs have the most promise are highly nonlinear. Stochasticity is important for a number of reasons. For instance, there are often small perturbations in real-world dynamical systems. Similarly, it can be used to model estimation error in the robot’s state (e.g., uncertainty in its position). Finally, LECs are often learned in simulation using a model of the dynamics; there are often errors in the model that need to be robustly accounted for.

We propose an approach, called robust MPS (RMPS), that extends MPS to the setting of stochastic nonlinear dynamics. Our approach uses robust nonlinear model-predictive control (NMPC) as the backup controller. The reason for using NMPC is that the goals of the backup controller are qualitatively different from the goals of the LEC. For example, consider the problem of building a robot that can run. The LEC tries to make the robot run as fast as possible. It may be able to outperform the robust NMPC, since the robust NMPC treats the stochastic perturbations conservatively. However, because RL lacks theoretical guarantees, the LEC cannot guarantee safety. Thus, we want to use the LEC as often as possible, but override it using a backup controller if we are not sure whether it is safe to use the LEC. The NMPC is an effective choice for the backup controller, where the goal is to stop the system and bring it to an equilibrium point, after which a feedback controller can be used to stabilize it. Continuing our example, the NMPC might bring the robot to a halt (e.g., where it is standing).

To achieve our goals, we build on algorithms for robust NMPC [31], [32], which aims to control a dynamical system in the presence of stochastic or nondeterministic perturbations, and on sampling-based estimation of forward reachable sets [33], [34]; in particular, we build closely on tube-based robust NMPC where sampling is used to estimate a tube within which the NMPC is guaranteed to stay (i.e., the tube is the forward reachable set of the NMPC) [35], [36]. However, these approaches do not provide any finite sample probabilistic guarantees on their estimates of the tubes.

We propose to use results from statistical learning theory to obtain provable probabilistic guarantees on our estimates of the sizes of the tubes [37]. We develop a practical algorithm based on these theoretical results. There has been recent work that provides theoretical guarantees on the estimated forward reachable set [38]; however, they study the problem of obtaining underapproximations of the forward reachable set, as opposed to overapproximations that are needed for checking safety. Thus, their guarantees require qualitatively different techniques than the ones we use. To the best of our knowledge, we are the first to use these to construct estimates for forward reachable sets of nonlinear dynamical systems that come with probabilistic guarantees.

Contributions. Our key contributions are: (i) an extension of the MPS algorithm to stochastic nonlinear dynamical systems (Section IV), (ii) a novel algorithm for estimating tubes for RMPC with high-probability guarantees (Section IV), and, as well as heuristic modifications that enable us to scale this algorithm to real-world systems, and (iii) experiments demonstrating how our approach ensures safety for LECs for cart-pole and for a single particle with non-holonomic dynamics and random obstacles (Section V).

Dynamics. We consider stochastic nonlinear dynamics

image

where k is the time step,  x(k) ∈ X ⊆ RnXis the state, u(k) ∈ U ⊆ RnUis the control input, and  w(k) ∈ W ⊆ RnXis a zero-mean stochastic perturbation with distribution  PW.

Policy. A policy is a map  π : X → U. A trajectory generated using  πfrom initial state  x ∈ Xis x = (x(0), x(1), ...), where x(0) = x and  x(k+1) = f (π)(x(k))+w(k)and where f (π)(x) = f(x, π(x)). Since w(k) is random, x is a random sequence; we use  p(x | π, x)to denote the probability of x.

Objective. We consider a cost function  ℓ : X × U → Rand a discount factor  γ ∈ (0, 1). Given a distribution p(x) over initial states, the cost of a policy  πis

image

image

Fig. 1: An illustration of MPS. If using  ˆπ(dashed orange line), the robot (blue) may unsafely crash into the obstacle (black). Using  πbackup(solid red line) ensures safety.

Safety constraint. We consider a set of safe states  Xsafe ⊆X, with the goal of ensuring that the system stays in  Xsafe. A trajectory x is safe if  x(k) ∈ Xsafefor all  k ≥ 0.

Shielding problem. Our goal is to construct a policy  πthat achieves low cost while ensuring safety. In general, since the dynamics are stochastic, it is impossible to guarantee safety. Instead, our goal is to ensure that safety holds with high probability. We establish a theoretical safety guarantee in Theorem 1; we interpret this theorem in Section IV-C.

Our approach is based on shielding [17], [18], [21], [28]. This approach takes as input a policy  ˆπthat optimizes  ℓ(π), but does not ensure safety (though a penalty for violating safety may be encoded in  ℓ). We call  ˆπthe learned policy, since our key motivation is when  ˆπis a neural network policy trained using RL—e.g., we use DDPG [39]. Nevertheless, our approach can be used with any controller  ˆπ.

Then, the shielding problem is to construct a policy πshieldthat overrides  ˆπas needed to ensure safety. The key challenges are (i) determining how to override  ˆπ, and (ii) minimizing how often  ˆπis overridden.

Notation. For  k ∈ N, we let [k] = {1, ..., k}. We let  Snbe the positive semidefinite matrices of dimension n. For  x ∈Rnand  M ∈ Sn, we let  ∥x∥M = x⊤Mx. For  A, B ⊆ Rn,

image

denote their Minkowski sum and difference, respectively.

Model predictive shielding (MPS) is a shielding algorithm that dynamically checks when to override  ˆπ. The key idea is to maintain an invariant that it can always use a recovery policy  πrecto safely transition to an equilibrium point [29], [28], [30]; a state  x ∈ Xthat satisfies this invariant is recoverable (denoted  x ∈ Xrec). Near the equilibrium point, we assume a linear feedback controller  πstablecan be used to ensure safety for an infinite horizon. Thus, as long as the system remains in  Xrec, then MPS can guarantee safety. The combination of  πrecand  πstableis the backup controller  πbackupused to override  ˆπ. This approach is shown in Fig. 1.

For example, consider a driving robot. In this setting, x is recoverable if the robot can safely apply the brakes to come

image

to a stop. If x is recoverable, but  f (ˆπ)(x)is not, then MPS uses  πbackup. Since x is recoverable, using  πrecis guaranteed to keep the system safe. Thus, MPS ensures safety for an infinite horizon from any recoverable state.

The key challenge of implementing MPS is how to check whether a state x is recoverable. When the dynamics are deterministic, this check can be performed by simulating the dynamics [28]. However, for stochastic dynamics, each simulation will result in different realizations of w(k), so this approach no longer applies. When the dynamics are linear and the perturbations w(k) are Gaussian, we can analytically check a high-probability variant of recoverability [29], but this approach does not generalize to nonlinear dynamics.

Our approach is to based on robust control. First, we use tracking NMPC as the recovery policy [35]. This choice ensures that the system reach its goal with high probability even with stochastic perturbations. Second, we check recoverability by estimating the reachable set of  πbackup. In particular, we use the reachable set to ensure that the trajectory generated using  πbackup(i) is safe, and (ii) reaches an invariant set around an equilibrium point. A key innovation in our approach is that we use tools from statistical learning theory to obtain provable guarantees on our check for recoverability.

A. The Backup Controller

We use a standard robust NMPC as the backup controller [35]. At a high level, this controller first computes a reference trajectory that transitions the system to an equilibrium point. Then, it uses NMPC to track this reference trajectory. Finally, once the trajectory has reached the invariant set  Gearound equilibrium point  ze ∈ X × U, it uses a feedback controller  πstableto stabilize the system within  Ge.

Stabilization near equilibrium points. We assume given a mapping  ρ : X → X × U, where  ze = (xe, ue) = ρ(x)is an equilibrium point—i.e.,  xe = f(xe, ue). Intuitively,  ρ(x)returns the equilibrium point  ze“closest” to x. Then,  πrectries to transition the system from x to  ze. Near  ze, we can use an LQR  πstableto stabilize the system and ensure safety.

We assume we can compute a safe invariant set  Ge ⊆ Xsafearound  ze—i.e., for  x ∈ Ge, the trajectory x generated using πstablefrom x(0) = x (i) is safe, and (ii) remains in  Ge. Since the dynamics are stochastic, we cannot guarantee this property with probability 1 (unless w(k) is bounded). Nevertheless, in our experiments, we find that  πstableis effective at ensuring safety and stability inside  Ge. We discuss how we compute  Gein Section IV-E.

Reference trajectory. We denote the nominal dynamics by  ¯x(k + 1) = f(¯x(k), ¯u(k)), where  ¯x(k)is the nominal state and  ¯u(k)is the nominal control input. Given an initial state  x ∈ Xand a time horizon T, we compute a nominal trajectory to transition the system to an equilibrium point ze = ρ(x)by solving the following:

image

where  ℓ(x, u) = ∥x∥2Q + ∥u∥2Rfor some  Q ∈ SnXand R ∈ SnU. Furthermore, ¯Xsafe ⊆ Xsafecan be specified by the user to improve robustness; we describe heuristics for computing these sets in Section IV-E. We denote the solution to (1) by  (¯x0, ¯u0) ∈ X T +1 × UT +1. Since  (xe, ue)is a nominal equilibrium, the infinite horizon trajectory

image

where  ◦is concatenation, is safe for the nominal dynamics.

Tracking NMPC. Once we have a reference trajectory ¯x∗, ¯u∗, we use NMPC to track this reference trajectory and try to reach the equilibrium  ze. In particular, if we are at state x(t) after t steps, this controller solves the following:

image

image

where  Vf(x)is the cost-to-go function of the LQR for the linearization of the nominal dynamics f around  ze[35]. We let  ˜x0, ˜u0 ∈ X T +1 × UT +1be the solution of (3).

Backup controller. Given an state x, an equilibrium point ze ∈ X × U, and a time horizon T, our backup controller πbackupfirst computes the reference trajectory  ¯x0, ¯u0using (1), with corresponding infinite horizon reference trajectory ¯x∗, ¯u∗. Then, for each step  t ∈ {0, ..., T − 1}, πbackupsolves (3) for the current state x(t) to obtain  ˜x0, ˜u0, and chooses control input  u(t) = ˜u0(0). Finally, for  t ≥ T, it chooses control input  u(t) = πstable(x(t)).

This procedure for computing the backup controller is summarized in Algorithm 2. Note that  πbackupactually needs to keep internal state consisting of the target equilibrium point  ze = ρ(x), its corresponding invariant set  Ge, the reference trajectory  ¯x∗, ¯u∗to the equilibrium point, and the number of steps t taken so far using the backup controller. This internal state is initialized in the context of a given state x ∈ Xby the function call InitializeBackup(x).

B. Checking Robust Recoverability via Sampling

Building on ideas from tube NMPC [15], we use sampling to determine whether  πreccan safely reach the invariant set  Gefrom the current state x. In particular, we sample N trajectories according to the stochastic dynamics, and fit boxes B(t) that cover all the states sampled on each given step  t ∈ {0, ..., T}. Intuitively, if we take N to be sufficiently large, then the realized trajectory will lie in B(t) at each step t with high probability. In contrast to prior work, we make this intuition precise using tools from statistical learning theory. Finally, to check if x is recoverable, we check that it is robustly safe according to the uncertainty in these boxes, and furthermore that it robustly enters the invariant set  Ge.

Robust recoverability. Our goal is to ensure that  πreccan always transition the system safely from the current state x to the invariant set  Gearound  ze = ρ(x). Due to the random perturbation w(k) in the dynamics, we cannot make an absolute guarantee that this property holds. Following prior work [17], [18], [19], we instead aim to guarantee that this property holds with high probability.

Algorithm 4 Estimates the reachable sets B(t) after t steps using Monte Carlo sampling.

procedure EstimateReachableSets(x, πbackup)

image

Definition 1: Let  ϵ ∈ R>0and  x ∈ Xbe given. Let  ze =ρ(x), let  Gebe an invariant set for  πstablearound  ze, and let x0 = (x(0), ..., x(T))be a trajectory generated using  πbackupfrom x(0) = x + w, where  w ∼ PW. We say x is  ϵrobustly recoverable if with probability at least  1 − ϵ, (i) x(t) is safe for every  t ∈ {0, ..., T}, and (ii)  x(T) ∈ Ge.

In other words,  x0safely transitions the system from x to Gewith probability at least  1−ϵ. Algorithm 3 checks whether a state x is robustly recoverable using sampling. Prior work on safe reinforcement learning [17], [18], [19] has relied on thresholding the perturbation distribution and then using verification to obtain similar bounds. This approach can guarantee that x is robustly recoverable with probability 1. In contrast, using our approach, there is an additional chance δ(for any given  δ ∈ R>0) that our algorithm incorrectly concludes that x is robustly recoverable when it is not. The difference is that the  ϵerror in robust recoverability is due to the noise in the dynamics, whereas our  δerror is due to noise in the sampled trajectories taken by our algorithm.

We believe this added potential for error is reasonable for two reasons. First, there is already an  ϵchance of error, so the added error  δdoes not affect the nature of our guarantee. Second, the dependence of the running time of our algorithm 1/δis logarithmic, so we can choose  δto be very small.

Estimating reachable sets. Our approach is to compute sets B(t) for  t ∈ {0, ..., T}such that the trajectory x0satisfies  x(t) − ¯x∗(t) ∈ B(t)with probability at least  1 − ϵ, where  ¯x∗is the reference trajectory used by  πbackup—i.e.,

image

where the probability is taken over the randomness in the dynamics. To this end, a box constraint is a set

image

where  [a, b] ⊆ Rdenotes the closed interval from a to b. We use B to denote the set of all possible boxes. Now, we have the following theoretical guarantee.

Lemma 1: Let d be a distribution over  Rnand  ϵ, δ ∈ R>0be given. Consider N i.i.d. samples  x1, ..., xN ∼ d, where

image

image

Fig. 2: An example of a tube (red region) estimated using Algorithm 4 for  πbackup. We guarantee the realized trajectory lies in this tube with high probability.

and let  B ∈ Bbe any box satisfying  xi ∈ Bfor all  i ∈{1, ..., N}. Then, with probability at least  1 − δ, we have

Prx∼d(x ∈ B) ≥ 1 − ϵ.(6) Intuitively, (6) says that at least a  1 − ϵfraction of states (weighted by d) fall inside B, and this guarantee holds with probability at least  1 − δ. The proof, based on tools from statistical learning theory, is given in Appendix A. Algorithm 4 takes N samples of the trajectory  x0by simulating the dynamics, and fits a box B(t) based on the sampled states on each step  t ∈ {0, ..., T}. The following guarantee follows from Lemma 1 via a union bound:

Lemma 2: Let B be the sequence of boxes returned by Algorithm 4. With probability at least  1−(T +1)δ, we have

image

As before, the  1−(T +1)δprobability is according to the samples used by our algorithm, whereas the  1 − (T + 1)ϵprobability is according to the randomness in the dynamics.

The sets B(t) computed using Algorithm 4 can be thought of as an estimate of a tube in which the trajectories are guaranteed to stay [35]. In contrast to prior work, we have used results from statistical learning theory to obtain probabilistic guarantees on the correctness of these tubes [37]. An example of an estimated tube is shown in Fig. 2.

Checking recoverability. Given the boxes B(t) for  t ∈{0, ..., T}, Algorithm 3 checks both properties required for robust recovery: (i) to check if  x(t) ∈ Xsafewith high probability, it checks if

image

which is equivalent to  ¯x∗(t) ∈ Xsafe ⊖B(t), and (ii) to check if  x(T) ∈ Gewith high probability, it checks if

image

which is equivalent to  x(T) ∈ Ge ⊖ B(T). These checks ensure robust recoverability because Corollary 2 ensures that x(t) ∈ B(t)with high probability for every  t ∈ {0, ..., T}. Thus, we have the following guarantee:

Lemma 3: Given a state  x ∈ X, if Algorithm 3 returns true, then x is  (T +1)ϵrobustly recoverable with probability 1−(T +1)δ(according to the randomness in the algorithm).

C. Robust Model Predictive Shielding

Our robust model predictive shielding (RMPS) algorithm is shown in Algorithm 1. At state  x ∈ X, this algorithm computes a control input (denoted  πshield(x)) by checking whether next state  f (ˆπ)(x)is robustly recoverable (with high probability) in simulation. Otherwise, it takes a step according to  πbackup. One subtlety is that if  πbackuphas already been initialized, it actually needs to check if  f (πbackup)(x)is robustly recoverable. The issue is that robust recoverability is defined with respect to a freshly initialized backup policy, not the backup policy after it has taken some number of steps. We have the following guarantee:

Theorem 1: If  x ∈ Xis  1−(T +1)ϵrobustly recoverable, then  f (πshield)(x)+w(where  w ∼ PW) is  1−(T +1)ϵrobustly recoverable with probability at least  1 − 2(T + 1)δ.

See Appendix VI-B for a proof. A key shortcoming of this guarantee is that it does not ensure safety of the infinite horizon trajectory. Given our assumptions, a stronger guarantee is impossible, since on every step there is a chance that the additive perturbation is large, causing the system to leave  Xsafe. However, this guarantee is still useful since it helps guide the design of our algorithm. In practice, we find that the bounds can be tighter than the theory suggests, since the robust NMPC is actually conservatively overapproximating the reachable set. In other words, the robust NMPC ensures safety much more robustly than the probabilities in Theorem 1 would suggest.

D. Running Time

We briefly discuss the running time of Algorithm 1. First, it is straightforward to see that the number of samples N is ˜O(1/ϵ2)(where ˜Oindicates that we have suppressed log factors including  log(1/δ)). For each sample, we need to simulate the closed-loop dynamics  f (πrec)over a horizon of length T. Thus, assuming the cost of simulating  f (πrec)for one step is  τ, the running time of Algorithm 1 is ˜O(Tτ/ϵ2).

E. Practical Modifications

We describe several practical modifications to our algorithm designed to improve either performance or computational tractability. These modifications may weaken our safety guarantees, but as we show in our experiments, they do not affect safety very much empirically.

Computing  Ge. We use a heuristic to compute  Gefrom [35]. In particular, we sample trajectories over a long horizon and estimate the reachable set B the same way as in Algorithm 4. This approach does not provide any guarantees that the estimated set  Geis actually invariant, but it works well in our experiments.

Precomputing B. Computing the sets B(t) (for  t ∈{0, ..., T}) on-the-fly can be prohibitively expensive, since we might need a large number of samples N for Lemma 1 to apply. Instead, we precompute these sets from a fixed

image

Fig. 3: Left: Safety probabilities for no shielding (light brown), non-robust MPS (dark brown), LRMPS (light green), and RMPS (dark green). Middle: Probability of reaching goal for the same policies. Right: Percentage of time using the learned policy  ˆπ(blue) compared to the backup policy  πbackup(orange) by RMPS (left) and LRMPS (right).

initial state  x0. Then, we reuse the same states rather than recomputing them at each step. Intuitively, this approach works well in practice since the dynamics of the tracking NMPC are usually fairly similar for different initial states.

Using tighter constraints for NMPC. In the optimization problem (1) used to compute the reference trajectory, we noted that we can use tighter state and input constraints ¯x(k) ∈ ¯Xsafethan needed. In particular, by doing so, we can improve the robustness of the tracking NMPC—we use the “tightened set” ¯Xsafe = Xsafe ⊖B(t). Unlike the previous two heuristics, this one does not affect our theoretical guarantees, since our guarantees hold for an arbitrary backup policy. While our approach would ensure safety even without the tighter state and input constraints, using tighter constraints improves the chances that the backup policy recovers the system from a given state x—i.e., it improves the size of Xrec. In other words, it increases the probability that the robot reaches its goal without diminishing the probability of safety.

We perform experiments to demonstrate how our system can ensure safety of stochastic systems with nonlinear dynamics and/or nonconvex constraints.

A. Setting

We perform experiments using three environments: (i) cart-pole, which has nonlinear dynamics and polytopic constraints (i.e., the pole should not fall below a certain height), a particle with holonomic dynamics and obstacles (which has linear dynamics but nonconvex constraints), and a particle with non-holonomic dynamics and obstacles (which has both nonlinear dynamics and nonconvex constraints).

For the cart-pole, the states are  z = (x, v, θ, ω) ∈ R4, where x is the cart position, v is the cart velocity,  θis the pole angle from upright position, and  ωis the pole angular velocity, and the control inputs are  u ∈ R, with the goal of reaching a target position  xtarget[40]. The safety constraint is that the pole should not fall down while moving the cart. We define the cost function to be

image

where  γ ∈ R>0is a hyperparameter. Finally, disturbances are uniform noise  wi(k) ∼Uniform([−0.01, 0.01])for the velocity and angular velocity, and zero otherwise.

For the single particle with holonomic dynamics, the states are  z = (x, y, vx, vy) ∈ R4, where (x, y) is position and (vx, vy)is velocity, and the control inputs are  (ux, uy) ∈ R2, where  (ux, uy)is the acceleration. The system dynamics are z(k + 1) = Az(k) + Bu(k), where

image

The cost function is

image

where  g ∈ R2is the goal,  oi (i ∈ [N]) are the obstacles, and d(x, y) = ∥x−y∥2, and  γis a hyperparameter. Disturbances are uniform noise  wi(k) ∼Uniform([−0.01, 0.01])for the velocity and angular velocity, and zero otherwise.

For the single particle with non-holonomic dynamics, the states are  z = (x, y, v, h) ∈ R4, where (x, y) is the position, v is the velocity, and h is the heading, and the control inputs are  (a, ω) ∈ R2, where a is acceleration and  ωis angular acceleration. The system dynamics are

image

where  ℓis the particle radius. Costs and disturbances are the same as for the holonomic particle.

We compare RMPS with: (i) using the learned policy  ˆπwithout any shielding, (ii) using shielding without robust control [28], which we call non-robust MPS, and (iii) using RMPS with a linear robust controller [40], which we call LRMPS. In particular, for LRMPS, we linearize the dynamics around the origin, and then use the corresponding linear robust MPC proposed in [40] to try and recover the system. For each experiment, we run 50 scenarios with 3 different random seeds and compute both the probability of safety and the probability of reaching the goal. Also, we use N = 1500 to estimate the tightened constraints in NMPC.

B. Results

The safety and performance of the four algorithms are shown in Fig. 3. First, as can be seen, RMPS and LRMPS achieve a safety probability of 1.0. In contrast, the learned

image

Fig. 4: Pairs  (θ, ω)for which nonlinear MPC (red dot) and linear MPC (blue cross) can steer the pole to upright position.

policy is frequently unsafe, demonstrating the importance of shielding. Furthermore, RMPS only slightly diminishes performance compared to the learned policy. Similarly, the non-robust MPS performs slightly better than the learned policy alone in terms of safety, but still cannot guarantee that safety holds. In contrast, RMPS guarantees safety in each environment. Its performance is slightly diminished— for the particle with non-holonomic dynamics (the hardest environment), the probability of reaching the goal drops by about 20%. Thus, RMPS is much more suitable for safety-critical systems where safety must be guaranteed.

Next, except for the holonomic particle (which has linear dynamics), RMPS achieves much better performance than the LRMPS. This improvement in performance comes from the fact that our NMPC can stabilize the system in many more states compared to the linear robust MPC. For example, as shown in Fig. 4, for a fixed cart speed, NMPC can recover the pole to the upright position for more  (θ, ω)pairs compared to linear MPC. Thus, RMPS is less conservative than LRMPS and has better performance. Fig. 3 (right) also supports these results—LRMPS uses backup controller more frequently than RMPS, showing that it is more conservative. These results demonstrate the benefit of accounting for the nonlinearity in the dynamics.

We have proposed an algorithm for ensuring the safety of a learned policy for a stochastic nonlinear dynamical system. We use a sampling-based approach to estimate the reachable set of the backup policy, and use results from statistical learning theory to provide theoretical guarantees on our estimates. In our experiments, we show that our approach can ensure safety without sacrificing very much performance despite these modifications. Thus, our approach is a promising way to ensure safety in safety-critical systems.

A key direction for future work is reducing the running time of our algorithm. While our approach is already computationally tractable, it is still too slow for production systems. As we described, our current implementation uses two heuristics to improve performance, but these heuristics break our theoretical guarantees. Thus, improving the performance of our algorithm would enable our algorithm to be used with its theoretical guarantees intact. Additional directions for future work include enabling safe exploration and extending our results to partially observable systems, as well as demonstrating our approach on real robotics systems.

image

A. Proof of Lemma 1

Given  B ∈ B, define  fB : X → Y, where Y = {0, 1}, by fB(x) = I(x ∈ B)—i.e.,  fBindicates whether x is contained in B. Note that  fBis a binary classifier, and we can consider the family of classifiers  F = {fB | B ∈ B}. also, define the distribution ˜don  X × Yby ˜d(x, y) = d(x) · I(y = 1)— i.e., all labels are 1. Thus, sampling  x ∼ dis equivalent to sampling  (x, y) ∼ ˜d, where y = 1. Recall that we choose B so all samples  x1, ..., xN ∼ dsatisfy  xi ∈ B. Equivalently, (x1, y1), ..., (xN, yN) ∼ ˜d, where  yi = 1for all  i ∈ [N], so fB(xi) = yifor all  i ∈ [N]. Thus, we can think of choosing B as choosing  fB ∈ Fsuch that the training error

image

on a set of i.i.d. samples  (xi, yi) ∼ d. Thus, we can apply results from statistical learning theory to bound the test error

image

where the last probability is the one we are seeking to bound. In particular, it is straightforward to check that the VC dimension of  fBfor boxes  B ⊆ Rnis VC(F) = n. By the VC dimension bound [37], for all  B ∈ B, we have

image

with probability at least  1 − δ. The claim follows by setting ϵequal to the right-hand side of (7).  ■

B. Proof of Theorem 1

If  πshielduses  ˆπ, by Lemma 3,  f (ˆπ)(x)is  1 − (T + 1)ϵrobustly recoverable with probability  ≥ 1 − (T + 1)δ. If πshielduses the already initialized version of  πbackup, by Lemma 3,  f (πbackup)(x)is  1 − (T + 1)ϵrobustly recoverable with probability  ≥ 1 − (T + 1)δ. If  πshieldinitializes  πbackupand uses it, since x is  1 − (T + 1)ϵrobustly recoverable,

image

holds with probability at least  1 − (T + 1)ϵ. The robust recoverability condition for  x(1) = f (πbackup)(x)is

image

In particular, note that  φ0 ⇒ φ1, so

image

so  f (πshield)(x)is  1−(T +1)ϵrobustly recoverable. By a union bound,  f (πshield)(x)is robustly recoverable with probability ≥ 1 − 2(T + 1)δ, as claimed.  ■

[1] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461

[2] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477

[3] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/ 1707.06347

[4] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783

[5] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–503, 2016. [Online]. Available: http://www.nature.com/nature/journal/v529/ n7587/full/nature16961.html

[6] R. Mahjourian, N. Jaitly, N. Lazic, S. Levine, and R. Miikkulainen, “Hierarchical policy design for sample-efficient learning of robot table tennis through self-play,” CoRR, vol. abs/1811.12927, 2018. [Online]. Available: http://arxiv.org/abs/1811.12927

[7] N. O. Lambert, D. S. Drew, J. Yaconelli, R. Calandra, S. Levine, and K. S. J. Pister, “Low level control of a quadrotor with deep model-based reinforcement learning,” CoRR, vol. abs/1901.03737, 2019. [Online]. Available: http://arxiv.org/abs/1901.03737

[8] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for selfdriving cars,” CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316

[9] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al., “Learning dexterous in-hand manipulation,” arXiv preprint arXiv:1808.00177, 2018.

[10] A. Khan, C. Zhang, S. Li, J. Wu, B. Schlotfeldt, S. Y. Tang, A. Ribeiro, O. Bastani, and V. Kumar, “Learning safe unlabeled multirobot planning with motion constraints,” CoRR, vol. abs/1907.05300, 2019. [Online]. Available: http://arxiv.org/abs/1907.05300

[11] A. Khan, E. Tolstaya, A. Ribeiro, and V. Kumar, “Graph policy gradients for large scale robot control,” in CoRL, 2019.

[12] E. Tolstaya, F. Gama, J. Paulos, G. Pappas, V. Kumar, and A. Ribeiro, “Learning decentralized controllers for robot swarms with graph neural networks,” in CoRL, 2019.

[13] C. Gehring, S. Coros, M. Hutler, D. Bellicoso, H. Heijnen, R. Diethelm, M. Bloesch, P. Fankhauser, J. Hwangbo, M. Hoepflinger, and R. Siegwart, “Practice makes perfect: An optimization-based approach to controlling agile motions for a quadruped robot,” IEEE Robotics & Automation Magazine, pp. 1–1, 02 2016.

[14] C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: Theory and practice - a survey,” Automatica, vol. 25, pp. 335–348, 1989.

[15] J. B Rawlings and D. Mayne, Model Predictive Control: Theory and Design, 01 2009.

[16] J. Ho and S. Ermon, “Generative adversarial imitation learning,” CoRR, vol. abs/1606.03476, 2016. [Online]. Available: http://arxiv. org/abs/1606.03476

[17] J. H. Gillula and C. J. Tomlin, “Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor,” in 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 2723–2730.

[18] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1424–1431.

[19] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in neural information processing systems, 2017, pp. 908– 918.

[20] O. Bastani, Y. Pu, and A. Solar-Lezama, “Verifiable reinforcement learning via policy extraction,” in Advances in Neural Information Processing Systems, 2018, pp. 2494–2504.

[21] M. Alshiekh, R. Bloem, R. Ehlers, B. K¨onighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.

[22] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin, “Bridging hamilton-jacobi safety analysis and reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8550–8556.

[23] R. Ivanov, J. Weimer, R. Alur, G. J. Pappas, and I. Lee, “Verisig: verifying safety properties of hybrid systems with neural network controllers,” in Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control. ACM, 2019, pp. 169– 178.

[24] M. Fazlyab, M. Morari, and G. J. Pappas, “Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming,” CoRR, vol. abs/1903.01287, 2019. [Online]. Available: https://arxiv.org/abs/1903.01287

[25] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas, “Efficient and accurate estimation of lipschitz constants for deep neural networks,” CoRR, vol. abs/1906.04893, 2019. [Online]. Available: http://arxiv.org/abs/1906.04893

[26] D. Tran, “Safety verification of cyber-physical systems with reinforcement learning control, emsoft 2019,” 07 2019.

[27] W. Xiang, D. M. Lopez, P. Musau, and T. T. Johnson, “Reachable set estimation and verification for neural network models of nonlinear dynamic systems,” CoRR, vol. abs/1802.03557, 2018. [Online]. Available: http://arxiv.org/abs/1802.03557

[28] O. Bastani, “Safe planning via model predictive shielding,” CoRR, vol. abs/1905.10691, 2019. [Online]. Available: http://arxiv.org/abs/ 1905.10691

[29] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” CoRR, vol. abs/1803.08552, 2018. [Online]. Available: http://arxiv.org/abs/1803.08552

[30] W. Zhang and O. Bastani, “Mamps: Safe multi-agent reinforcement learning via model predictive shielding.” [Online]. Available: https://obastani.github.io/docs/mamps.pdf

[31] Y. Gao, A. Gray, H. E. Tseng, and F. Borrelli, “A tube-based robust nonlinear predictive control approach to semiautonomous ground vehicles,” Vehicle System Dynamics, vol. 52, no. 6, pp. 802–823, 2014. [Online]. Available: https://doi.org/10.1080/00423114.2014.902537

[32] S. Yu, H. Chen, and F. Allgwer, “Tube mpc scheme based on robust control invariant set with application to lipschitz nonlinear systems,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference, Dec 2011, pp. 2650–2655.

[33] R. E. Allen, A. A. Clark, J. A. Starek, and M. Pavone, “A machine learning approach for real-time reachability analysis,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014, pp. 2202–2208.

[34] P. Reist, P. Preiswerk, and R. Tedrake, “Feedback-motion-planning with simulation-based lqr-trees,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1393–1416, 2016.

[35] D. Q. Mayne, E. C. Kerrigan, E. J. van Wyk, and P. Falugi, “Tube-based robust nonlinear model predictive control,” International Journal of Robust and Nonlinear Control, vol. 21, no. 11, pp. 1341–1353, 2011. [Online]. Available: https://onlinelibrary.wiley.com/ doi/abs/10.1002/rnc.1758

[36] D. Mayne, “Robust and stochastic model predictive control: Are we going in the right direction?” Annual Reviews in Control, vol. 41, pp. 184–192, 2016.

[37] V. Vapnik, The nature of statistical learning theory. Springer science & business media, 2013.

[38] L. Liebenwein, C. Baykal, I. Gilitschenski, S. Karaman, and D. Rus, “Sampling-based approximation algorithms for reachability analysis with provable guarantees,” in RSS, 2018.

[39] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ser. ICML’14. JMLR.org, 2014, pp. I–387–I–395. [Online]. Available: http: //dl.acm.org/citation.cfm?id=3044805.3044850

[40] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” CoRR, vol. abs/1606.01540, 2016. [Online]. Available: http://arxiv.org/abs/1606.01540


Designed for Accessibility and to further Open Science