b

DiscoverSearch
About
My stuff
Learning-based Model Predictive Control for Safe Exploration
2018·arXiv
Abstract
Abstract

Learning-based methods have been successful in solving complex control tasks without significant prior knowledge about the system. However, these methods typically do not provide any safety guarantees, which prevents their use in safety-critical, real-world applications. In this paper, we present a learning-based model predictive control scheme that can provide provable high-probability safety guarantees. To this end, we exploit regularity assumptions on the dynamics in terms of a Gaussian process prior to construct provably accurate confidence intervals on predicted trajectories. Unlike previous approaches, we do not assume that model uncertainties are independent. Based on these predictions, we guarantee that trajectories satisfy safety constraints. Moreover, we use a terminal set constraint to recursively guarantee the existence of safe control actions at every iteration. In our experiments, we show that the resulting algorithm can be used to safely and efficiently explore and learn about dynamic systems.

In model-based reinforcement learning (RL, [1]), we aim to learn the dynamics of an unknown system from data, and based on the model, derive a policy that optimizes the long-term behavior of the system. Crucial to the success of such methods is the ability to efficiently explore the state space in order to quickly improve our knowledge about the system. While empirically successful, current approaches often use exploratory actions during learning, which lead to unpredictable and possibly unsafe behavior of the system, e.g., in exploration approaches based on the optimism in the face of uncertainty principle [2]. Such approaches are not applicable to real-world safety-critical systems.

In this paper we introduce SAFEMPC, a safe model predictive control (MPC) scheme that guarantees the existence of feasible return trajectories to a safe region of the state space at every time step with high-probability. These return trajectories are identified through a novel uncertainty propagation method that, in combination with constrained MPC, allows for formal safety guarantees in learning control. Related Work: One area that has considered safety guarantees is robust MPC. There, we iteratively optimize the performance along finite-length trajectories at each time step, based on a known model that incorporates uncertainties and disturbances acting on the system [3]. In a constrained robust MPC setting, we optimize these local trajectories under

This work was supported by SNSF grant 200020 159557, a fellowship within the FITweltweit program of the German Academic Exchange Service (DAAD), the Vector Institute, an Open Philantropy Project AI fellowship, and the Max Planck ETH Center for Learning Systems.

Torsten Koller is with the Department of Computer Science, University of Freiburg, Germany. Email: kollert@informatik.uni-freiburg.de

Felix Berkenkamp, Matteo Turchetta and Andreas Krause are with the Learning & Adaptive Systems Group, Department of Computer Science, ETH Zurich, Switzerland. Email: {befelix, matteotu, krausea}@inf.ethz.ch

image

image

Fig. 1. Propagation of uncertainty over multiple time steps based on a wellcalibrated statistical model of the unknown system. We iteratively compute ellipsoidal over-approximations (purple) of the intractable image (green) of the learned model for uncertain ellipsoidal inputs.

additional state and control constraints. Safety is typically defined in terms of recursive feasibility and robust constraint satisfaction. In [4], this definition is used to safely control urban traffic flow, while [5] guarantees safety by switching between a standard and a safety mode. However, these methods are conservative since they do not update the model.

In contrast, learning-based control approaches adapt their models online based on observations of the system. This allows the controller to improve over time, given limited prior knowledge of the system. Theoretical safety guarantees in learning-based MPC (LBMPC) are established in [6]. A safety mechanism for general learning-based controllers using robust MPC is proposed in [7]. Both approaches require a known nominal linear model. The former approach requires deviations from the system dynamics to be bounded in an pre-specified polytope, the latter relies on sampling.

MPC based on Gaussian process (GP, [8]) models is proposed in a number of works, e.g. [9], [10]. The difficulty here is that trajectories have complex dependencies on states and unbounded stochastic uncertainties. Safety through probabilistic chance constraints is considered in [11]–[13] based on approximate uncertainty propagation. While often being empirically successful, these approaches do not theoretically guarantee safety of the underlying system.

Another area that has considered learning for control is model-based RL. There, we aim to learn global policies based on data-driven modeling techniques, e.g., by explicitly trading-off between finding locally optimal policies (exploitation) and learning the behavior of the system globally (exploration) [1]. This results in data-efficient learning of policies in unknown systems [14]. In contrast to MPC, where we optimize finite-length trajectories, in RL we typically aim to find an infinite horizon optimal policy. Hence, enforcing hard constraints in RL is challenging. Controltheoretic safety properties such as Lyapunov stability or robust constraint satisfaction are only considered in a few

c ⃝2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

works [15]. In [16], safety is guaranteed by optimizing parametric policies under stability constraints, while [17] guarantees safety in terms of constraint satisfaction through reachability analysis.

Our Contribution: We combine ideas from robust control and GP-based RL to design a MPC scheme that recursively guarantees the existence of a safety trajectory that satisfies the constraints of the system. In contrast to previous approaches, we use a novel uncertainty propagation technique that can reliably propagate the confidence intervals of a GP-model forward in time. We use results from statistical learning theory to guarantee that these trajectories contain the system with high probability jointly for all time steps. In combination with a constrained MPC approach and a terminal set constraint, we then prove the safety of the system. We apply the algorithm to safely explore the dynamics of an inverted pendulum simulation.

We consider a nonlinear, discrete-time dynamical system

image

where  xt ∈ Rnxis the state and  ut ∈ Rnuis the control input to the system at time step  t ∈ N. We assume that we have access to a twice continuously differentiable prior model  h(xt, ut), which could be based on a first principles physics model. The model error  g(xt, ut)is a priori unknown and we use a statistical model to learn it by collecting observations from the system during operation. In order to provide guarantees, we need reliable estimates of the model-error. In general, this is impossible for arbitrary functions g. We make the following additional regularity assumptions.

We assume that the model-error g is of the form g(z) = �∞l=0 αik(z, zl), αl ∈ R, z = (x, u) ∈ Rnx × Rnu, a weighted sum of distances between inputs z and representer points  zl = (xl, ul) ∈ Rnx × Rnuas defined through a symmetric, positive definite kernel k. This class of functions is well-behaved in the sense that they form a reproducing kernel Hilbert space (RKHS, [18])  Hkequipped with an inner-product  ⟨·, ·⟩k. The induced norm  ||g||2k = ⟨g, g⟩kis a measure of the complexity of a function  g ∈ Hk. Consequently, the following assumption can be interpreted as a requirement on the smoothness of the model-error g w.r.t. the kernel k.

Assumption 1 The unknown function g has bounded norm in the RKHS  Hk, induced by the continuously differentiable kernel k, i.e.  ||g||k ≤ Bg.

In the case of a multi-dimensional output  nx > 1, we follow [19] and redefine g as a single-output function  ˜gsuch that ˜g(·, j) = gj(·)and assume that  ||˜g||k ≤ Bg.

We further assume that the system is subject to polytopic state and control constraints

image

which are bounded. For example, in an autonomous driving scenario, the state region could correspond to a highway lane and the control constraints could represent the physical limits on acceleration and steering angle of the car.

Lastly, we assume access to a backup controller that guarantees that we remain inside a given safe subset of the state space once we enter it. In the autonomous driving example, this could be a simple linear controller that stabilizes the car in a small region in the center of the lane at slow speeds.

Assumption 2 We are given a controller  πsafe(·)and a polytopic safe region

image

which is (robust) control positive invariant (RCPI) under  πsafe(·). Moreover, the controller satisfies the control constraints inside  Xsafe, i.e.  πsafe(x) ∈ U ∀x ∈ Xsafe.

This assumption allows us to gather initial data from the system inside the safe region even in the presence of significant model errors, since the system remains safe under the controller  πsafe. Moreover, we can still guarantee constraint satisfaction asymptotically outside of  Xsafe, if we can show that a finite sequence of control inputs eventually steers the system back to the safe set  Xsafe. This idea and a similar definition of a safe set was introduced concurrently in [7]. A set and corresponding controller which fulfill Assumption 2 for general dynamical systems is difficult to find. However, there has been recent progress in finding stability regions for systems of the form (1), which are RCPI by design, that could, under additional considerations (e.g. through polytopic inner-approximations [20]), satisfy the assumptions.

Given a controller  π, ideally we want to enforce the stateand control constraints at every time step,

image

where  xt+1 = fπ(xt) = f(xt, π(xt))denotes the closed-loop system under  π. Apart from  πsafe, which trivially and conservatively fulfills this, it is in general impossible to design a controller that enforces (5) without additional assumptions. Instead, we slightly relax this requirement to safety with high probability throughout its operation time.

Definition 1 Let  π : Rnx → Rnube a controller for (1) with the corresponding closed-loop system  fπ. Let  x0 ∈ Xand  δ ∈ (0, 1]. A system is  δ−safe under the controller  πiff:

image

Based on Definition 1, the goal is to design a control scheme that guarantees  δ-safety of the system (1). At the same time, we want to improve our model by learning from observations collected outside of the initial safe set  Xsafeduring operation, which increase the performance of the controller over time.

In this section, we introduce the necessary background on GPs and set-theoretic properties of ellipsoids that we use to model our system and perform multi-step ahead predictions.

A. Gaussian Processes (GPs)

We want to learn the unknown model-error g from data using a GP model. A GP(m, k) is a distribution over functions, which is fully specified through a mean function m : Rd → Rand a covariance function  k : Rd × Rd → R, where  d = nx + nu. Given a set of n noisy observations yi = f(zi) + wi, wi ∼ N(0, λ2), i = 1, . . . , n, λ ∈ R, we choose a zero-mean prior on g as  m ≡ 0and regard the differences  ˜yn = [y1−h(z1), . . . , yn−h(zn)]Tbetween prior model h and observed system response at input locations Z = [z1, .., zn]T. The posterior distribution at z is then given as a Gaussian  N(µn(z), σ2n(z))with mean and variance

image

where  [Kn]ij = k(zi, zj), [kn(z)]j = k(z, zj), and  Inis the  n−dimensional identity matrix. In the case of multiple outputs  nx > 1, we model each output dimension with an independent GP,  GP(mj, kj), j = 1, .., nx. We then redefine (7) and (8) as  µn(·) = (µn,1(·), .., µn,nx(·))and  σn(·) =(σn,1(·), .., σn,nx(·))corresponding to the predictive mean and variance functions of the individual models.

Based on Assumption 1, we can use GPs to model the unknown part of the system (1), which provides us with reliable confidence intervals on the model-error g.

Lemma 1 [16, Lemma 2]: Assume  ||g||k ≤ Bgand that measurements are corrupted by  λ-sub-Gaussian noise. Let  βn = Bg + 4λ�γn + 1 + ln(1/δ), where  γnis the information capacity associated with the kernel k. Then with probability at least  1 − δwe have for all  1 ≤ j ≤ nx, z ∈X × Uthat  |µn−1,j(z) − gj(z)| ≤ βn · σn−1,j(z).

In combination with the prior model h(z), this allows us to construct reliable confidence intervals around the true dynamics of the system (1). The scaling  βndepends on the number of data points n that we gather from the system through the information capacity,  γn =maxA⊂ ˜Z,|A|=˜n I(˜gA; g), ˜Z = X × U × I, ˜n = n · nx, i.e. the maximum mutual information  I(˜gA, g)between a finite set of samples A and the function g. Exact evaluation of  γnis NP-hard in general, but it can be greedily approximated and has sublinear dependence on n for many commonly used kernels [21].

The regularity assumption Assumption 1 on our model-error and the smoothness assumption on the covariance function k additionally imply that the function g is Lipschitz.

B. Ellipsoids

We use ellipsoids to give an outer bound on the uncertainty of our system when making multi-step ahead predictions. Due to appealing geometric properties, ellipsoids are widely used in the robust control community to compute reachable sets [22], [23]. These sets intuitively provide an outer approximation on the next state of a system considering all possible realizations of uncertainties when applying a controller to the system at a given set-valued input. We briefly review some of these properties and refer to [24] for an exhaustive introduction to ellipsoids and to the derivations for the following properties.

We use the basic definition of an ellipsoid,

image

with center  p ∈ Rnand a symmetric positive definite (s.p.d) shape matrix  Q ∈ Rn×n. Ellipsoids are invariant under affine subspace transformations such that for  A ∈ Rr×n, r ≤ nwith full row rank and  b ∈ Rr, we have that

image

The Minkowski sum  E(p1, Q1) ⊕ E(p2, Q2), i.e. the pointwise sum between two arbitrary ellipsoids, is in general not an ellipsoid anymore, but we have that

image

for all c > 0. Moreover, the minimizer of the trace of the resulting shape matrix is analytically given as c = �Tr(Q1)/Tr(Q2). A particular problem that we encounter is finding the maximum distance r to the center of an ellipsoid E := E(0, Q) under a special transformation, i.e.

image

where  S ∈ Rm×nwith full column rank. This is a generalized eigenvalue problem of the pair  (Q, ST S)and the optimizer is given as the square-root of the largest generalized eigenvalue.

In this section, we use the assumptions in Sec. II to design a control scheme that fulfills our safety requirements in Definition 1. We construct reliable, multi-step ahead predictions based on our GP model and use MPC to actively optimize over these predicted trajectories under safety constraints. Using Assumption 2, we use a terminal set constraint to theoretically prove the safety of our method.

A. Multi-step Ahead Predictions

From Lemma 1 and our prior model  h(xt, ut), we directly obtain high-probability confidence intervals on  f(xt, ut)uniformly for all  t ∈ N. We extend this to over-approximate the system after a sequence of inputs  (ut, ut+1, ..). The result is a sequence of set-valued confidence regions that contain the true dynamics of the system with high probability. a) One-step ahead predictions: We compute an ellipsoidal confidence region that contains the next state of the system with high probability when applying a control input, given that the current state is contained in an ellipsoid. In order to approximate the system, we linearize our prior model h(xt, ut)and use the affine transformation property (10) to compute the ellipsoidal next state of the linearized model. Next, we approximate the unknown model-error  g(xt, ut)using the confidence intervals of our GP model. We finally apply Lipschitz arguments to outer-bound the approximation

image

Fig. 2. Decomposition of the over-approximated image of the system (1) under an ellipsoidal input  R0. The exact, unknown image of f (right, green area) is approximated by the linearized model ˜fµ(center, top) and the remainder term ˜d, which accounts for the confidence interval and the linearization errors of the approximation (center, bottom). The resulting ellipsoid  R1is given by the Minkowski sum of the two individual approximations.

errors. We sum up these individual approximations, which result in an ellipsoidal approximation of the next state of the system. This is illustrated in Fig. 2. We formally derive the necessary equations in the following paragraphs. The reader may choose to skip the technical details of these approximations, which result in Lemma 2.

We first regard the system f in (1) for a single input vector z = (x, u), f(z) = h(z) + g(z). We linearly approximate f around  ¯z = (¯x, ¯u)via

image

where  Jh(¯z) = [A, B]is the Jacobian of h at  ¯z.

Next, we use the Lagrangian remainder theorem [25] on the linearization of h and apply a continuity argument on our locally constant approximation of g. This results in an upper-bound on the approximation error,

image

where  fj(z)is the ith component of  f, 1 ≤ j ≤ nx, L∇h,jis the Lipschitz constant of the gradient  ∇hj, and  Lgis the Lipschitz constant of g, which exists by ??.

The function ˜fdepends on the unknown model error g. We approximate g with the statistical GP model,  µn(¯z) ≈ g(¯z). From Lemma 1 we have

image

with high probability. We combine (14) and (15) to obtain

image

where  1 ≤ j ≤ nxand ˜fµ(z) = h(¯z)+Jh(¯z)(z−¯z)+µn(¯z).We can interpret (16) as the edges of the confidence hyper-rectangle

image

where  L∇h = [L∇h,1, .., L∇h,nx]and we use the shorthand notation  a ± b := [a1 ± b1] × [anx ± bnx], a, b ∈ Rnx.We are now ready to compute a confidence region based on an ellipsoidal state  R = E(p, Q) ⊂ Rnxand a fixed input  u ∈ Rnu, by over-approximating the output of the system  f(R, u) = {f(x, u)|x ∈ R}for ellipsoidal inputs R. Here, we choose p as the linearization center of the state and choose  ¯u = u, i.e.  ¯z = (p, u). Since the function ˜fµis affine, we can make use of (10) to compute

image

resulting again in an ellipsoid. This is visualized in Fig. 2 by the upper ellipsoid in the center. To upper-bound the confidence hyper-rectangle on the right hand side of (17), we upper-bound the term  ∥z − ¯z∥2by

image

which leads to

image

Due to our choice of  z, ¯z, we have that  ||z(x) − ¯z||2 =||x − p||2and we can use (12) to get  l(R, u) = r(Q, Inx),which corresponds to the largest eigenvalue of  Q−1. Using (19), we can now over-approximate the right side of (17) for inputs R by an ellipsoid

image

where we obtain  Q ˜d(R, u)by over-approximating the hyper-rectangle ˜d(R, u)with the ellipsoid  E(0, Q ˜d(R, u))through a ± b ⊂ E(a, √nx · diag([b1, .., bnx])), ∀a, b ∈ Rnx. This is illustrated in Fig. 2 by the lower ellipsoid in the center. Combining the previous results, we can compute the final over-approximation using (11),

image

Since we carefully incorporated all approximation errors and extended the confidence intervals around our model predictions to set-valued inputs, we get the following generalization of Lemma 1.

Lemma 2 Let  δ ∈ (0, 1]and choose  βnas in Lemma 1. Then, with probability greater than  1 − δ, we have that:

image

uniformly for all  R = E(p, Q) ⊂ X, u ∈ U.

Proof: Define  m(x, u) = h(x, u) + µn(x, u) ±βnσn−1(x, u). From Lemma 1 we have  ∀ R ⊂ X, u ∈ Uthat, with high probability, �x∈R f(x, u) ⊂ �x∈R m(x, u). Due to the over-approximations, we have �x∈R m(x, u) ⊂˜m(R, u).

Lemma 2 allows us to compute confidence ellipsoid around the next state of the system, given that the current state of the system is given through an ellipsoidal belief. b) Multi-step ahead predictions: We now use the previous results to compute a sequence of ellipsoids that contain a trajectory of the system with high-probability, by iteratively applying the one-step ahead predictions (22).

Given an initial ellipsoid  R0 ⊂ Rnxand control input ut ∈ U, we iteratively compute confidence ellipsoids as

image

We can directly apply Lemma 2 to get the following result.

Corollary 1 Let  δ ∈ (0, 1]and choose  βnas in Lemma 1. Choose  x0 ∈ R0 ⊂ X. Then the following holds jointly for all  t ≥ 0with probability at least  1 − δ: xt ∈ Rt, where zt = (xt, ut) ∈ X ×U, R0, R1, ..is computed as in (24) and xtis the state of the system (1) at time step t.

Proof: Since Lemma 2 holds uniformly for all ellipsoids  R ⊂ Xand  u ∈ U,this is a special case that holds uniformly for all control inputs  ut, t ∈ Nand for all ellipsoids  Rt, t ∈ Nobtained through (24).

Corollary 1 guarantees that, with high probability, the system is always contained in the propagated ellipsoids (24). Thus, if we provide safety guarantees for these sequences of ellipsoids, we obtain high-probability safety guarantees for the system (1). c) Predictions under state-feedback control laws: When applying multi-step ahead predictions under a sequence of feed-forward inputs  ut ∈ X, the individual sets of the corresponding reachability sequence can quickly grow unreasonably large. This is because these open loop input sequences do not account for future control inputs that could correct deviations from the model predictions. Hence, we extend (22) to affine state-feedback control laws of the form

image

where  Kt ∈ Rnu×nxis a feedback matrix and  ut ∈ Rnuis the open-loop input. The parameter  ptis determined through the center of the current ellipsoid  Rt = E(pt, Qt). Given an appropriate choice of  Kt, the control law actively contracts the ellipsoids towards their center. Similar to the derivations (13)-(22), we can compute the function  ˜mfor affine feedback controllers (25)  πtand ellipsoids  Rt = E(pt, Qt). The resulting ellipsoid is

image

where  ¯zt = (pt, ut)and  Ht = At + BtKt. The set E(0, Q ˜d(Rt, πt))is obtained similarly to (19) as the ellipsoidal over-approximation of

image

with  St = [Inx, KTt ]and  l(Rt, St) = maxx∈Rt ||St(z(x) −¯zt)||2. The theoretical results of Lemma 2 and Corollary 1 directly apply to the case of the uncertainty propagation technique (26).

B. Safety constraints

The derived multi-step ahead prediction technique provides a sequence of ellipsoidal confidence regions around trajectories of the true system f through Corollary 1. We can guarantee that the system is safe by verifying that the computed confidence ellipsoids are contained inside the polytopic constraints (2) and (3). That is, given a sequence of feedback controllers  πt, t = 0, .., T − 1we need to verify

image

where  (R0, .., RT )is given through (24).

Since our constraints are polytopes, we have that  X =�mxi=1 Xi, Xi = {x ∈ Rnx|[Hx]i,·x − hxi ≤ 0},where [Hx]i,·is the ith row of  Hx. We can now formulate the state constraints through the condition  Rt = E(pt, Qt) ⊂ Xas mxindividual constraints  Rt ⊂ Xi, i = 1, .., mx, for which an analytical formulation exists [26],

image

Moreover, we can use the fact that  πtis affine in x to obtain  πt(Rt) = E(kt, KtQt, KTt ), using (10). The corre- sponding control constraint  πt(Rt) ⊂ Uis then equivalently given by

[Hu]i,·ut+�[Hu]i,·KtQtKTt [Hu]Ti,· ≤ hui,  ∀i ∈ {1, .., mu}.

image

C. The SafeMPC algorithm

Based on the previous results, we formulate a MPC scheme that optimizes the long-term performance of our system, while satisfying the safety condition in Definition 1:

image

where  R0 := {xt}is the current state of the system and the intermediate state and control constraints are defined in (29), (30). The terminal set constraint  RT ⊂ Xsafehas the same form as (29) and can be formulated accordingly. The objective  Jtcan be chosen to suit the given control task.

Due to the terminal constraint  RT ⊂ Xsafe, a solution to (31) provides a sequence of feedback controllers  π0, .., πTthat steer the system back to the safe set  Xsafe. We cannot directly show that a solution to MPC problem (31) exists at every time step (this property is known as recursive feasibility) without imposing additional assumption, e.g. on the safety controller  πsafe. However, employing a control scheme similar to standard robust MPC, we guarantee that such a sequence of feedback controllers exists at every time step as follows: Given a feasible solution  Πt = (π0t , .., πT −1t )to (31) at time t, we apply the first feed-back control  π0t. In case we do not find a feasible solution to (31) at the next time step, we shift the previous solution in a receding horizon fashion and append  πsafeto the sequence to obtain Πt+1 = (π1t , .., πT −1t , πsafe). We repeat this process until a new feasible solution exists that replaces the previous input sequence. This procedure is summarized in Algorithm 1. We now state the main result of the paper that guarantees the safety of our system under the proposed algorithm.

Theorem 2 Let  πbe the controller defined through algorithm Algorithm 1 and  x0 ∈ Xsafe. Then the system (1) is δ−safe under the controller  π.

image

Proof: From Corollary 1, the ellipsoidal outer approximations (and by design of the MPC problem, also the constraints (2)) hold uniformly with high probability for all closed-loop systems  fΠ, where  Πis a feasible solution to (31), over the corresponding time horizon T. Hence we can show uniform high probability safety by induction. Base case: If (31) is infeasible, we are  δ-safe using the backup controller  πsafeof Assumption 2, since  x0 ∈ Xsafe. Otherwise the controller returned from (31) is  δ-safe as a consequence of Corollary 1 and the terminal set constraint that leads to xt+T ∈ Xsafe. Induction step: let the previous controller  πtbe  δ-safe. At time step t+1, if (31) is infeasible then  Πtleads to a state  xt+T ∈ Xsafe, from which the backup-controller is δ-safe by Assumption 2. If (31) is feasible, then the return path is  δ-safe by Corollary 1.

D. Optimizing long-term behavior

While the proposed MPC problem (31) yields a safe return strategy, we are often interested in a controller that optimizes performance over a possibly much longer horizon. In the autonomous driving example, a safety trajectory that stabilizes the car towards the center of the lane can be much shorter than for planning a steering maneuver before entering a turn. We hence propose to simultaneously plan a performance trajectory  s0, .., sHunder a sequence of inputs πperf0 , .., πperfH−1using a performance-model  mperfalong with the return strategy that we obtain when solving (31). We do not make any assumptions on the performance model which could be given by one of the approximate uncertainty propagation methods proposed in the literature (see, e.g. [11] for an overview). In order to maintain the safety of our system, we enforce that the first  r ∈ {1, ., min{T, H}}controls are the same for both trajectories, i.e. we have that πk = πperfk , k = 0, .., r − 1. This extended MPC problem is

image

subject to (31b)  −(31e), t = 0, .., T − 1

image

where we replace (31) with this problem in Algorithm 1. The safety guarantees of Theorem 2 directly translate to this setting, since we can always fall back to the return strategy.

E. Discussion

Algorithm Algorithm 1 theoretically guarantees that the system remains safe, while actively optimizing for performance via the MPC problem (32). This problem can be solved by commonly used, nonlinear programming (NLP) solvers, such as the Interior Point OPTimizer (Ipopt, [27]). Due to the solution of the eigenvalue problem (12) that is required to compute (22), our uncertainty propagation scheme is not analytic. However, we can still obtain exact function values and derivative information by means of algorithmic differentiation, which is at the core of many state-of-the-art optimization software libraries [28].

One way to further reduce the conservatism of the multi-step ahead predictions is to linearize the GP mean prediction µn(xt, ut), which we omitted for clarity.

In this section, we evaluate the proposed SAFEMPC algorithm to safely explore the dynamics of an inverted pendulum system.

The continuous-time dynamics of the pendulum are given by  ml2¨θ = gml sin(θ) − η ˙θ + u, where m = 0.15kg and l = 0.5m are the mass and length of the pendulum, respectively,  η = 0.1Nms/radis a friction parameter, and g = 9.81m/s2is the gravitational constant. The state of the system  x = (θ, ˙θ)consists of the angle  θand angular velocity ˙θof the pendulum. The system is controlled by a torque u that is applied to the pendulum. The origin of the system corresponds to the pendulum standing upright.

The system is underactuated with control constraints U = {u ∈ R| − 1 ≤ u ≤ 1}. Due to these limits, the pendulum becomes unstable and falls down beyond a certain angle. We do not impose state constraints,  X = R2. However the terminal set constraint (31e) of the MPC problem (31) acts as a stability constraint and prevents the pendulum from falling. Apart from being smooth, we do not make any assumptions on our prior model h and we choose it to be a linearized and discretized approximation to the true system with a lower mass and neglected friction as in [16]. The safety controller  πsafeis a discrete-time, infinite horizon linear quadratic regulator (LQR, [29]) of the approximated system h with cost matrices Q = diag([1, 2]), R = 20. The corresponding safety region  XSafeis given by a conservative polytopic inner-approximation of the true region of attraction of  πsafe. We use the same mixture of linear and Mat´ern kernel functions for both output dimensions, albeit with different hyperparameters. We initially train our model with a dataset  (Z0, ˜y0)sampled inside the safe set using the backup controller  πSafe. That is, we gather  n0 = 25initial samples Z0 = {z01, .., z0n0}with  z0i = (xi, πsafe(xi)), xi ∈ Xsafe, i =1, .., n and observed next states  ˜y0 = {y00, .., y0n0} ⊂ Xsafe. The theoretical choice of the scaling parameter  βnfor the confidence intervals in Lemma 1 can be conservative and we choose a fixed value of  βn = 2instead, following [16].

We aim to iteratively collect the most informative samples of the system, while preserving its safety. To evaluate the exploration performance, we use the mutual information

image

Fig. 3. Visualization of the samples acquired in the static exploration setting in Sec. V-A for  T ∈ {1, 4, 5}. The algorithm plans informative paths to the safe set  Xsafe(red polytope in the center). The baseline sample set for T = 1 (left) is dense around origin of the system. For T = 4 (center) we get the optimal trade-off between cautiousness due to a long horizon and limited length of the return trajectory due to a short horizon. The exploration for T = 5 (right) is too cautious, since the propagated uncertainty at the final state is too large.

I(gZn, g)between the collected samples  Zn = {z0, .., zn} ∪Z0and the GP prior on the unknown model-error g, which can be computed in closed-form [21].

A. Static Exploration

For a first experiment, we assume that the system is static, so that we can reset the system to an arbitrary state  xn ∈ R2in every iteration. In the static case and without terminal set constraints, a provably close-to-optimal exploration strategy is to, at each iteration n, select state-action pair  zn+1with the largest predictive standard deviation [21]

image

where  σ2n,j(·)is the predictive variance (8) of the jth GP(0, kj)at the nth iteration. Inspired by this, at each iteration we collect samples by solving the MPC problem (31) with cost function  Jn = − �nxj=1 σn,j, where we additionally optimize over the initial state  xn ∈ X. Hence, we visit high-uncertainty states, but only allow for state-action pairs  znthat are part of a feasible return trajectory to the safe set  XSafe.

Since optimizing the initial state is highly non-convex, we solve the problem iteratively with 25 random initializations to obtain a good approximation of the global minimizer. After every iteration, we update the sample set  Zn+1 = Zn∪{zn}, collect an observation  (zn, yn)and update the GP models. We apply this procedure for varying horizon lengths.

The resulting sample sets are visualized for varying horizon lengths  T ∈ {1, .., 5}with 300 iterations in Fig. 3, while Fig. 4 shows how the mutual information of the sample sets Zi, i = 0, .., nfor the different values of T. For short time horizons (T = 1), the algorithm can only slowly explore, since it can only move one step outside of the safe set. This is also reflected in the mutual information gained, which levels off quickly. For a horizon length of T = 4, the algorithm is able to explore a larger part of the state-space, which means that more information is gained. For larger horizons, the predictive uncertainty of the final state is too large to explore effectively, which slows down exploration initially, when we do not have much information about our system. The results suggest that our approach could further benefit from adaptively choosing the horizon during operation, e.g.

image

Fig. 4. Mutual information  I(gZn, g), n = 1, .., 200for horizon lengths T ∈ {1, .., 5}. Exploration settings with shorter horizon gather more informative samples at the beginning, but less informative samples in the long run. Longer horizon lengths result in less informative samples at the beginning, due to uncertainties being propagated over long horizons. However, after having gathered some knowledge they quickly outperform the smaller horizon settings. The best trade off is found for T = 4.

by employing a variable horizon MPC approach [30], or by increasing the horizon when the mutual information saturates for the current horizon.

B. Dynamic Exploration

As a second experiment, we collect informative samples during operation; without resetting the system at every iteration. Starting at  x0 ∈ Xsafe, we apply the SAFEMPC, Algorithm 1, over 200 iterations. We consider two settings. In the first, we solve the MPC problem (31) with  −Jngiven by (33), similar to the previous experiments. In the second setting, we additionally plan a performance trajectory as proposed in Sec. IV-D. We define the states of the performance trajectory as Gaussians  st = N(mt, St) ∈Rnx × Rnx×nxand the next state is given by the predictive mean and variance of the current state  mtand applied action ut. That is,  st+1 = N(mt+1, St+1)with

image

where  Σn(·) = diag(σ2n(·))and  m0 = xn. This simple approximation technique is known as mean-equivalent uncertainty propagation. We define the cost-function  −Jt =�Ht=0 trace(S1/2t )−�Tt=1(mt −pt)T Qperf(mt −pt), which maximizes the sum of predictive confidence intervals along the trajectory  s1, .., sH, while penalizing deviation from the safety trajectory. We choose r = 1 in the problem (32),

image

Fig. 5. Comparison of the information gathered from the system after 200 iterations for the standard setting (blue) and the setting where we plan an additional performance trajectory (green).

i.e. the first action of the safety trajectory and performance trajectory are the same. As in the static setting, we update our GP models after every iteration.

We evaluate both settings for varying  T ∈ {1, .., 5}and fixed H = 5 in terms of their mutual information in Fig. 5. We observe a similar behavior as in the static exploration experiments and get the best exploration performance for T = 4 with a slight degradation of performance for T = 5. We can see that, except for T = 1, the performance trajectory decomposition setting consistently outperforms the standard setting. Planning a performance trajectory (green) provides the algorithm with an additional degree of freedom, which leads to drastically improved exploration performance.

We introduced SAFEMPC, a learning-based MPC scheme that can safely explore partially unknown systems. The algorithm is based on a novel uncertainty propagation technique that uses a reliable statistical model of the system. As we gather more data from the system and update our statistical mode, the model becomes more accurate and control performance improves, all while maintaining safety guarantees throughout the learning process.

[1] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” IEEE Transactions on Neural Networks, vol. 9, no. 5, pp. 1054–1054, 1998.

[2] C. Xie, S. Patil, T. Moldovan, S. Levine, and P. Abbeel, “Modelbased reinforcement learning with parametrized physical models and optimism-driven exploration,” in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 504–511.

[3] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory and Design. Nob Hill Pub., 2009.

[4] S. Sadraddini and C. Belta, “A provably correct MPC approach to safety control of urban traffic networks,” in American Control Conference (ACC), 2016, pp. 1679–1684.

[5] J. M. Carson, B. Ac¸ıkmes¸e, R. M. Murray, and D. G. MacMartin, “A robust model predictive control algorithm augmented with a reactive safety mode,” Automatica, vol. 49, no. 5, pp. 1251–1260, 2013.

[6] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.

[7] K. P. Wabersich and M. N. Zeilinger, “Linear model predictive safety certification for learning-based control,” in Proc. of the Conference on Decision and Control (CDC), 2018.

[8] C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning. MIT Press, Cambridge MA, 2006.

[9] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard, “Gaussian process model based predictive control,” in Proc. of the American Control Conference (ACC), vol. 3, 2004, pp. 2214–2219.

[10] G. Cao, E. M.-K. Lai, and F. Alam, “Gaussian process model predictive control of an unmanned quadrotor,” Journal of Intelligent & Robotic Systems, vol. 88, no. 1, pp. 147–162, 2017.

[11] L. Hewing, A. Liniger, and M. N. Zeilinger, “Cautious NMPC with gaussian process dynamics for autonomous miniature race cars,” in In Proc. of the European Control Conference (ECC), 2018.

[12] A. Jain, T. X. Nghiem, M. Morari, and R. Mangharam, “Learning and control using Gaussian processes: Towards bridging machine learning and controls for physical systems,” in Proc. of the International Conference on Cyber-Physical Systems, 2018, pp. 140–149.

[13] C. J. Ostafew, A. P. Schoellig, and T. D. Barfoot, “Robust constrained learning-based NMPC enabling reliable mobile robot path tracking,” The International Journal of Robotics Research, vol. 35, no. 13, pp. 1547–1563, 2016.

[14] M. P. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proc. of the International Conference on Machine Learning, 2011, pp. 465–472.

[15] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, “Reinforcement learning versus model predictive control: A comparison on a power system problem,” In IEEE Transactions on Systems, Man, and Cybernetics, vol. 39, no. 2, pp. 517–529, 2009.

[16] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” Proc. of Neural Information Processing Systems (NIPS), vol. 1705, 2017.

[17] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in In Proc. of the IEEE Conference on Decision and Control (CDC), 2014, pp. 1424–1431.

[18] G. Wahba, Spline Models for Observational Data. Siam, 1990, vol. 59.

[19] F. Berkenkamp, A. Krause, and A. P. Schoellig, “Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics,” arXiv:1602.04450 [cs], 2016.

[20] E. M. Bronstein, “Approximation of convex sets by polytopes,” Journal of Mathematical Sciences, vol. 153, no. 6, pp. 727–762, 2008.

[21] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in Proc. of the International Conference on Machine Learning (ICML), 2010, pp. 1015–1022.

[22] T. F. Filippova, “Ellipsoidal estimates of reachable sets for control systems with nonlinear terms,” Proc. of the International Federation of Automatic Control (IFAC), vol. 50, no. 1, pp. 15 355–15 360, 2017.

[23] L. Asselborn, D. Gross, and O. Stursberg, “Control of uncertain nonlinear systems using ellipsoidal reachability calculus,” Proc. of the International Federation of Automatic Control (IFAC), vol. 46, no. 23, pp. 50–55, 2013.

[24] A. B. Kurzhanskii and I. V´alyi, Ellipsoidal Calculus for Estimation and Control. Boston, MA : Birkh¨auser, 1997.

[25] L. Breiman and A. Cutler, “A deterministic algorithm for global optimization,” Mathematical Programming, vol. 58, no. 1-3, pp. 179– 199, 1993.

[26] D. H. van Hessem and O. H. Bosgra, “Closed-loop stochastic dynamic process optimization under input and state constraints,” in Proc. of the American Control Conference (ACC), vol. 3, 2002, pp. 2023–2028.

[27] A. W¨achter and L. T. Biegler, “On the implementation of an interiorpoint filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.

[28] J. Andersson, “A general-purpose software framework for dynamic optimization,” PhD Thesis, Arenberg Doctoral School, KU Leuven, Leuven, Belgium, 2013.

[29] H. Kwakernaak and R. Sivan, Linear Optimal Control Systems. Wiley-interscience New York, 1972, vol. 1.

[30] Richards Arthur and How Jonathan P., “Robust variable horizon model predictive control for vehicle maneuvering,” International Journal of Robust and Nonlinear Control, vol. 16, no. 7, pp. 333–351, 2006.


Designed for Accessibility and to further Open Science