My stuff
Scalable synthesis of safety certificates from data with application to learning-based control

The control of complex systems faces a tradeoff between high performance and safety guarantees, which in particular restricts the application of learning-based methods to safety-critical systems. A recently proposed framework to address this issue is the use of a safety controller, which guarantees to keep the system within a safe region of the state space. This paper introduces efficient techniques for the synthesis of a safe set and control law, which offer improved scalability properties by relying on approximations based on convex optimization problems. The first proposed method requires only an approximate linear system model and Lipschitz continuity of the unknown nonlinear dynamics. The second method extends the results by showing how a Gaussian process prior on the unknown system dynamics can be used in order to reduce conservatism of the resulting safe set. We demonstrate the results with numerical examples, including an autonomous convoy of vehicles.

Digitalization opens new perspectives for control engineering and automation by making large amounts of data from experiments and numerical models available. Learning-based control exploits this cumulated knowledge and potentially also performs autonomous exploration of unseen system behavior in order to find an optimal control policy. An example is deep reinforcement learning (RL), providing prominent results, one of which is the application to Atari Arcade video-games [1].

Compared with traditional control techniques, learning-based methods offer the potential to reduce modeling and controller design effort. However, many industrial applications are safety-critical systems, i.e. systems with physical constraints that have to be satisfied. This essentially limits the application of most available learning-based control algorithms, which do not provide safety certificates. In order to address this limitation, we present efficient and scalable methods for the synthesis of a safety strategy consisting of a safe set and corresponding safe control law, which are cheap to implement and can be applied together with existing controllers or modern learning techniques to enhance them with safety guarantees.

Contributions: We consider dynamical systems with unknown Lipschitzian nonlinearity and formulate the safe set and safe controller synthesis as convex optimization problems, which directly employ available data. Our analysis considers an approximate linear model of the system and

Kim Wabersich (wabersich@kimpeter.de) and Melanie N. Zeilinger (mzeilinger@ethz.ch) are with the Institute for Dynamic Systems and Control, ETH Zurich, Switzerland. This work was supported by the Swiss National Science Foundation under grant no. PP00P2 157601/1.


uses data to incorporate the unknown nonlinear effects. The computations are based on Lyapunov’s method and result in two optimization problems: The first optimization problem defines a quadratic approximation of the nonlinearity in the Lyapunov conditions and the second one describes the computation of the safe set and controller. Similar to [2], [3] the framework can be used to augment any desired controller, which is lacking safety guarantees, in particular one which is based on learning.

We extend the technique to reduce conservatism of the safe set by putting a prior on the unknown dynamics in the form of a Gaussian process model, which is beneficial, especially in case of high dimensional systems and sparse data. Due to its less conservative nature, this extension favors safe exploration beyond the system behavior seen so far and is well suited for iteratively learning in closed-loop.

We illustrate the approach using examples, including a convoy of partly non-cooperative autonomous cars.

Related work: Given its relevance in industrial applications, there has been a growing interest in safe learning methods in the past years. Extensions of existing RL methods have been developed to enable safe RL with respect to different notions of safety, see [4] for a survey. A detailed literature review regarding RL, focusing on safety with respect to state and input constraints as also considered in this work, can be found in [3]. There are few results for efficient controller tuning from data with respect to best worst-case performance (also worst-case stability under physical constraints) by Bayesian min-max optimization, see e.g. [5], or by safety constrained Bayesian optimization as e.g. in [6], [7]. In [8] a method was developed that allows to analyze a given closed-loop system (under an arbitrary RL algorithm) with respect to safety.

Recent developments include the concept of a supervisory framework, which consists of a safe set in the state space and a safety controller. As long as the system state is in the interior of the safe set, any control law (e.g. unsafe learning-based control) can be applied. The safety controller only interferes if necessary, in case that the system reaches the boundary of the safe set, see e.g. [2], [3]. Such a framework allows for certifying an arbitrary learning-based control algorithm with safety guarantees. Previously proposed techniques are based on a differential game formulation, which results in solving a min-max optimal control problem. An active field of research aims at extending these techniques to larger scale systems, mostly by considering special cases as described, e.g., in [9], [10], [11]. For some relevant cases, the existence of analytic solutions [12] has been shown.


c⃝2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


Fig. 1: Illustration of Assumption II.1. Red dots display observations  (xi, d(xi)).

The results presented in this paper are based on the concept of a safety framework, but compared to previous work we focus on approximation techniques to improve scalability with respect to the system dimension.

Structure of the paper: In Section II we state the problem and in Section III we present our main result for safe set and controller computation. We then show an extension using a stronger assumption on the unknown system dynamics by considering Gaussian processes in Section IV in order to reduce conservatism of the safe set. The results are demonstrated on numerical examples within the respective sections. We conclude the paper in Section V.

Notation: The set of symmetric matrices of dimension n is  Sn, the set of positive (semi-) definite matrices is (Sn+) Sn++, the set of integers in the interval  [a, b] ⊂ Ris  I[a,b], the set of integers in the interval  [a, ∞) ⊂ Ris  I≥a, and for ǫ > 0let  Bǫ(¯x) = {x ∈ R| ∥x − ¯x∥2 ≤ ǫ}. The boundary of an arbitrary compact set  C ⊂ Rnis  ∂C. Given a set D = {(xi, yi)}Ni=1, let  Dx = {xi}Ni=1and  Dy = {yi}Ni=1. Define ∆Dx : Rn → Dxas  ∆Dx(x) = argmin¯x∈Dx ∥¯x − x∥2, which picks the closest element in  Dxwith respect to  x ∈ Rnunder the 2-norm. Given a set  A ⊂ Rnand a locally Lipschitz continuous function  f : Rn → Rm, the local Lipschitz constant  L ≤ |f(x) − f(y)|/ ∥x − y∥2for all x, y ∈ Ais denoted by  Lf(x)(A). The Minkowski sum of two sets  A1, A2 ⊂ Ris denoted by  A1 ⊕ A2.

We consider deterministic nonlinear systems of the form


where  A ∈ Rn×n, B ∈ Rn×mand  d : Rn → Rnis locally Lipschitz continuous. The system is subject to polytopic state constraints  x(t) ∈ X := {x ∈ Rn|Axx ≤ bx}, Ax ∈ Rnx×n, bx ∈ Rnxand polytopic input constraints  u(t) ∈ U := {u ∈Rm|Auu ≤ bu}, Au ∈ Rnu×m, bu ∈ Rnu. The origin is contained in X, (A, B) is controllable, and the system state is fully observable.

The explicit form of the nonlinearity d is unknown, hence d is assumed to be a memoryless black-box function, which will be identified from system measurements. However, we assume that B, i.e. the influences of the control input, are known. The matrix A incorporates system knowledge in form of a linear model, which will be used in the design procedure in Section III-D. The linear model can, e.g., be selected as an approximate system model. For identification of d(x), we have access to finitely many observations D = {(xi, d(xi))}Ni=1that fulfill the following property (Figure 1).

Assumption II.1. Given a set  D = {(xi, d(xi))}Ni=1with N data tuples, where  xi ∈ X, there exists a non-trivial subset Aδ ⊆ Xsuch that for any  x ∈ Aδthere exists an  xi ∈ Dxsuch that  ||x − xi||2 ≤ δ.

Remark II.2. Assumption II.1 implies that there exists a region, in which the collected data samples are dense. Intuitively, one can think of  δtogether with the Lipschitz constant L, as a ‘measure of knowledge’ that we have about d(x) inside  Aδ. The ‘knowledge’ increases as  δgets smaller.

Remark II.3. For simplicity, we assume noise-free data D. It would, however, be possible to incorporate bounded or stochastic noise with only minor changes.

We consider the problem of providing a safety certificate for an arbitrary control law by means of a safe set and controller, as proposed in [2], [3]. Consider a potentially unsafe control strategy  ¯u(t), obtained for example by application of RL (which often cannot guarantee constraint satisfaction). In order to achieve minimal interference with the desired control  ¯u(t), the goal is to compute a set of states S, for which we know a control strategy  uS(t)such that input and state constraints will be satisfied for all future times, in particular considering that d(x) is unknown. The control ¯u(t)can then safely be applied in the interior of S, until it becomes necessary to take a safety-ensuring action  uSon the boundary of S which guarantees that we stay in S, i.e. that we can still provide a safe control strategy in the future. More formally:

Definition II.4. A set  S ⊆ Xis called a safe set for system (1) if there exists a safe control law  uS : S → Usuch that for an arbitrary (learning-based) policy  ¯u(t), the safe controller


guarantees that the state x(t) is contained in S for all t > 0 if  x(0) ∈ S.

In particular, we aim at finding an algorithm that scales well in computational complexity with respect to the dimen- sionality of system (1) as well as the number of measurements.

We first introduce the class of safe-sets considered. Afterwards, we motivate the proposed method and highlight its basic idea using an example, in order to then introduce the algorithm for safe set and controller computation in the remainder of the section.

A. Ellipsoidal safe set

In order to provide a scalable optimization-based approach, we restrict the form of the safe set to an ellipsoidal

set of the form


with  P ∈ Sn++, γ ∈ Rn, γ > 0, and the safe controller to the class of linear state feedback control laws  uS = Kxwith  K ∈ Rm×n. To construct  SP (γ), we leverage Lyapunov’s direct method, with a quadratic Lyapunov function V (x) = x⊤(γ−1P)x. By standard Lyapunov arguments, the following sufficient conditions ensure, analogously to [2, Lemma 1], that  SP (γ)fulfills Definition II.4:


for all  x ∈ ∂SP (γ).

B. Motivating Example

Consider the system  ˙x = x + d(x) + uwith  d(x) = −x3subject to input constraints  |u| ≤ 2and state constraints |x| ≤ 2. The task is to find a safe interval S = [a, b] and a corresponding safe control law  uS : [−2, 2] →[−2, 2]according to Definition II.4. The nonlinearity d(x) is unknown, but we are given a set of noise-free observations D = {xi, d(xi)}Ni=1such that the convex hull  conv({xi}Ni=1)equals the state space  [−2, 2](this will not be a necessary assumption later, see also Assumption II.1). We consider a linear state feedback  uS = kx, k ∈ R.

Robust approach: Without any knowledge of d(x), a robust approach is to consider the dynamics  ˙x = x + w + u, with |w| ≤ 8, where the bound on w is estimated from D. In this case, there does not exist a feasible controller gain k w.r.t. input constraints for any state and  ∀|w| ≤ 8, such that  ˙x ≤ 0, i.e. there does not exist a safe set.

Proposed approach: Let  V (x) = px2, p > 0be our Lyapunov candidate function. We analyze ˙V (x) = 2p(x2 +kx2+xd(x)) ≤ 0. At the boundary of the state space we have the measurements  2d(2) = −2d(−2) = −16, providing that ˙V (x) ≤ 0. By standard Lyapunov arguments, this implies that for k = 0 we have that for all  x(0) ∈ ∂S = {−2, 2}, x(t) ∈ Sfor all t > 0. We conclude that  uS(t) = 0is a safe controller for which the state constraint set constitutes a safe set.

This example highlights that rather than taking uniform bounds on the unknown dynamics, we can provide less conservative safe sets by quantifying the effect of the unknown dynamics in the form of state-dependent disturbances, which can be inferred from the available data D. In the following, we will exploit this concept for safe set and controller computations and conclude the section with two examples.

C. Computation of the safe set and controller

Given a matrix P (see Section III-D) that determines the shape of the safe set, we write the problem of finding the size of the safe set  SP (γ)and a corresponding safe controller  uSas two consecutive convex optimization problems.

Remark III.1. Given the shape P of the safe set (3), there exists a  ¯γ > 0small enough such that Assumption II.1 is satisfied on the sub-level set  SP (¯γ), i.e.


The proposed procedure first uses data to bound the effects of the nonlinearity on the Lyapunov decrease by a quadratic form in the largest possible safe set, i.e. over  γ ∈ (0, ¯γ]. This quadratic bound is then used as input to the second optimization problem, which computes the controller and set size in order to take into account the nonlinearities in addition to the linear system dynamics. The restriction to a quadratic bound of the nonlinearity is motivated by the fact that it can be treated efficiently by means of a convex problem.

In order to reduce conservatism, we bound the nonlinearity on sub-regions of the safe set, described by intervals


which are defined such that  SP (γ) ⊆ Aδfor any  γ ∈ Γi. Note that the selection of sub-intervals is possible as we will use the quadratic bound for upper bounding (4c), which is only required to hold on  ∂SP (γ). For every interval Γi, we then formulate two convex optimization problems in order to determine the volume of the safe set and the safe controller. In case no solution exists, the interval can be reduced. In general, the smaller the intervals are chosen, the less conservative the bound will be.

Bounding the nonlinear effects: Given an interval  Γi, consider the neighborhood  R(Γi) = {SP (γ2i )\SP (γ1i )}⊕Bδ(0). The indices of data samples inside the set  R(Γi)are given by  IR(Γi) = {k ∈ I≥1|xk ∈ Dx, xk ∈ R(Γi)}. We seek to find a quadratic bound on the nonlinearity arising in the Lyapunov decrease (4c) for all  ¯x ∈ SP (γ2i ) \ SP (γ1i ), i.e. to find a  Q(Γi)such that


The first optimization problem for bounding the nonlinearity over each interval is given by


with  pk = x⊤k Pd(xk) + δLx⊤P d(x)(R(Γi))and  δas defined in Assumption II.1.

Lemma III.2. Let Assumption II.1 hold and let  ¯γsatisfy (5). Consider an interval  Γiaccording to (6). If (8) attains a solution, then for all  ¯x ∈ SP (γ2i ) \ SP (γ1i )it holds that


Proof. We prove that for all  ¯x ∈ SP (γ2i ) \ SP (γ1i )the implication (8b), (8c)  ⇒(9) holds. First note that for any ¯x ∈ SP (γ2i ) \ SP (γ1i )there exists a  k ∈ IR(Γi)such that  ||¯x − xk|| ≤ δby the definition of the intervals, i.e. γ2i ≤ ¯γ, see also Remark III.1. For notational ease let  f(¯x) =d(¯x)⊤P ¯x. Equation (9) reads  f(¯x)− ¯x⊤Q(Γi)¯x ≤ 0. For all k ∈ IR(Γi)and for all  ¯xk ∈ Bδ(xk)we have therefore by Lipschitz continuity  f(¯xk) − f(∆Dx(¯xk)) + f(∆Dx(¯xk)) −¯x⊤k Q(Γi)¯xk ≤ pk − ¯x⊤k Q(Γi)¯xk. Note that by the definition of the intervals and Remark III.1 the relation  {SP (γ2i ) \SP (γ1i )} ⊂ �i∈IR(Γi) Bδ(xk)holds. As a consequence, if for all  k ∈ IR(Γi)and for all  ¯xk ∈ Bδ(xk)we have that pk − ¯x⊤k Q(Γi)¯xk ≤ 0, then the quadratic bound (9) holds for all  ¯x ∈ SP (γ2i ) \ SP (γ1i ). Finally, using the S-Lemma (see [13]) the condition  ¯x ∈ Bδ(xk) ⇒ pi − ¯x⊤Q(γ)¯x ≤ 0is equal to (8b),(8c) which completes the proof.

Remark III.3. The optimization problem in (8) is a convex semidefinite programming problem. In case that there are more observations (N ≫ 0) than the optimization algorithm can handle in (8), one can iteratively calculate  Q(Γi): Solve (8) using a subset of D in order to obtain  Q1(Γi), in the next iteration choose another disjoint subset of D and add the constraint ˜Q ⪰ Q1(Γi)to (8) in order to obtain  Q2(Γi). Repeat until all subsets of D are processed which yields a feasible, possibly suboptimal solution to (8).

Problem (8) provides a bound on the nonlinear effect in the Lyapunov decrease by means of the Lipschitz constant of  x⊤Pd(x). In practice, a local Lipschitz constant can e.g. be obtained from data as


Since  Q(Γ1)does not have to be positive definite, stabilizing effects of the nonlinearity can be considered in (7), i.e.  ¯x⊤Q(Γi)¯xcan be negative and therefore contribute to rendering ˙V (x)negative on the boundary of the safe set. This is demonstrated in Section III-E, see also Figure 2 (right).

Calculation of safe level set and safe controller: By using the quadratic bound from the first optimization problem, we are able to state the second optimization problem for safe set and controller design satisfying the conditions (4a)-(4c) as follows. Let  γ ∈ R, E = P −1, Y ∈ Rm×n, Γibe chosen according to (6). The optimization problem is given by


s.t. γ ∈ Γi(10b)


Theorem III.4. Let Assumption II.1 hold and let  ¯γsatisfy (5). Consider an interval  Γiaccording to (6). If (10) attains a solution  {γ∗, Y ∗}, then  SP (γ∗)is a safe set for system (1) according to Definition II.4 with  uS(x) = Kx, K =γ∗−1Y ∗E−1.

Proof. We prove the result in two steps: 1) Conditions (10d)-(10e) imply (4a) and (4b). By [14, Section 5.2.2] we can rewrite (4b) as  Au,iK(γ−1P)−1K⊤A⊤u,i ≤ b2u,iwhich equals (10e). The matrix inequality for the states can be derived similarly. 2) Conditions (10b)-(10c) ensure that  SP (γ∗)fulfills (4c). For all  x ∈ ∂SP (γ)we have to fulfill (4c) which is implied by (7). A sufficient condition for (4c) is therefore γ−1P(A+ BK)+ γ−1(A+ BK)⊤P + 2γ−1Q(Γi) ⪯ 0, i.e. that ˙V (x) ≤ 0for all  x ∈ Rn. Multiplying from left and right by  γP −1yields (10c). We have shown that  SP (γ∗)is a safe set according to Definition II.4. The objective in (10) yields the largest safe set under these sufficient conditions.

Note that optimizing over P and K in (10) is not possible, because the bound obtained in the first optimization step depends on P.

Problem (8) and (10) provide (semidefinite) convex optimization problems for computing a safe set and controller by directly employing data points. Such problems can be solved efficiently even for higher dimensions, see e.g. [15], [16]. While this offers a general approach with favorable scalability properties, the limitation is that the resulting safe set cannot be larger than the convex hull of the data points plus a  δ-neighborhood. Exploration is therefore limited. Nevertheless, initially collected data can be iteratively extended inside  Aδsuch that  δfrom Assumption II.1 gets smaller over time. Recomputation of the safe set can then reduce conservatism. We present an extension in Section IV which further reduces conservatism and improves exploration.

D. Shape of the safe set

Using the linear model of the system dynamics in (1), we define an approximate initial shape matrix P of the safe set by neglecting the unknown nonlinearity. Assume that  Aδis given by  {x ∈ Rn|x⊤Aδx ≤ 1}with  Aδ ∈ Sn++, which can be e.g. calculated as the minimum volume covering ellipse of the data points D, see [17, p. 222]. We can find a safe set for system (1) by setting d(x) = 0, resulting in the following optimization problem with  E ∈ Sn++and  Y ∈ Rm×n

min E,Y −logdet(E) (11a)

s.t. A−1δ ⪰E (11b) AE + EAT +BY + Y ⊤B⊤ ⪯ 0(11c)


If (11) attains a solution, then analogously to Theorem III.4, we obtain a safe set for system (1) according to Defini-tion II.4 with  P = E−1, uS(x) = Kx, and  K = Y E−1, but for d(x) = 0. Constraint (11b) ensures that the safe set is a subset of  Aδ, i.e. the set in which Assumption II.1 is satisfied and therefore the data is dense. This implies that (0, 1], i.e.  ¯γ = 1, is the maximum set size such


Fig. 2: Left: Sample trajectories under the safe control law, when starting on the boundary of the safe set  SP. Red dots: Data points, black lines: Closed-loop system trajectories, elliptic ring  R(Γ1), which contains  ∂SP. Dashed set: Safe set using  QGPfrom Section IV. Right: Quadratic bound x⊤Q(Γi)xaccording to Lemma III.2 (blue) of non-linear term  x⊤Pd(x)(red).

that Assumption II.1 as well as all constraints are fulfilled. The optimization problem (10) then aims at improving the initial approximation obtained via (11) with respect to the nonlinearity by designing a different control law and safe set size.

E. Illustrative numerical example

Consider a nonlinear system of the form (1), where A = � −1 2−3 4�, B = � 0.5−2�, d(x) = � 0.5x410.35−1.5x32�with input constraints  |u| ≤ 3and state constraints  |x| ≤ 2. The origin is not a stable equilibrium, and is neither an equilibrium point. For simplicity we consider a grid of data points as illustrated in Figure 2, however any data set fulfilling Assumption II.1 could be used. The data was taken inside the set  {x ∈ Rn|x⊤Px ≤ 1}with  P = ( 0.7651 0.21620.2162 0.6481 )obtained by solving (11), which also corresponds to  Aδwith  δ = 0.15. We apply Theorem III.4 by solving (8) and (10) for  Γ1 = [0.9, 1], Lx⊤P d(x)(R(Γ1)) ≈ 6.02and obtain K∗ = (0.5261, 2.2953), γ∗ = 1. The results are illustrated in Figure 2. Note that in general the quadratic bound must only hold on  ∂S(γ)and could therefore be violated around the origin.

F. Simulation: Safety for autonomous convoys

Consider a convoy of cars or trucks as depicted in Figure 3. Given a target velocity  vtarand a possibly small target distance  xtar, the goal is to drive closely behind each other, in order to leverage slipstream effects for efficiency. We assume that it is possible to overwrite the local controllers, i.e. the acceleration of car 1, car 3 and car 4 in a centralized way if necessary to ensure safety. During a supervised observation phase, initial data about the system is collected. We consider the problem of finding a safe, centralized control law, and a safe set such that the cars will not crash, even if we cannot determine the acceleration of car 2 and car 5.

Let  zi+1→i = xtar − xi+1→ibe the difference between the target distance  xtarand the actual distance  xi+1→iof car i + 1 and car i. Let  vibe the difference between the


Fig. 3: Illustration of the autonomous car convoy. The acceleration of red cars cannot be controlled.

target velocity of the convoy  vtarand the velocity  ¯viof car i. The dynamics of all cars i = 1, ..., 5 are given as  ˙zi+1→i = vi+1 − vi, ˙vi = uiwhere  uiis the applied acceleration. The control law of car 1 is given by u1(v1) = −v1, of cars 3, 4 by  ui(zi→i−1, vi) = 0.1zi→i−1−0.3viand of cars 2, 5 (which we cannot overwrite) by  u2(z2→1, v2) = max {min {z2→1 − v2, 0.9}, −0.9}, u5(z5→4, v5) = max {min {z5→4 − v5, 0.9}, −0.9}, i.e. they apply a saturated, stabilizing state feedback law and are therefore nonlinear. The target distance between the cars is 1 meter. In order to avoid a crash the state constraints are given by  xi+1→i ≥ 0. The maximum acceleration of cars 1, 3, and 4 is  3 [m/s2], i.e.  |ui| ≤ 3, i = 1, 3, 4. We are given observations of  ˙z2→1, ˙v2and  ˙z5→4, ˙v5in the interval [−0.8, 0.8]with  δ = 0.013.

In Figure 4, a numerical simulation under the resulting safe control law (2) is shown, starting from the boundary of the safe set with  v2(0) = 0.02 [m/s], x2→1(0) = −0.76 [m]and the remaining states equal to zero, which represents the situation that the second car is too close to the first one and its velocity is slightly higher than the reference velocity. As we can see in Figure 4, the first car has to accelerate quickly several times during the first two seconds for safety reasons, since the second car (which cannot be controlled) would decelerate more as car 3 would be able to compensate. The same situation occurs at 4.2 seconds between car 3 and car 4 and at around 10 seconds between car 1 and car 2 again. After six safety interventions in total, the local controllers of the cars are able to stabilize the overall system.

The previous sections are based on Lipschitz continuity of the unknown nonlinearity d(x), which was incorporated into the quadratic upper bound in (10), see Lemma III.2. A shortcoming of using only Lipschitz continuity as ‘prior’ knowledge is the requirement of relatively dense and structured data, see Assumption II.1. This implies that almost no ‘exploration’ can be made beyond the data, observed so far. By putting a stronger prior on the class of functions for modeling the nonlinearity d(x), we develop a less conservative quadratic bound, which can then be used in (10) and is generally expected to yield a larger safe set. In addition, it improves exploration beyond the data points allowing to iteratively improve the safe set during closed-loop operation, where initially few data points are generally available.

A. Gaussian Processes We use a Gaussian process model (GP) in order to perform Bayesian inference on the unknown nonlinearity (see. e.g.


Fig. 4: Sample trajectory of the car convoy under the safety framework, when starting on the boundary of  SP (γ). Grey lines indicate times when the safe control law  uSis applied.

[18]) for each element  di(x)of d(x). The GP is defined by a mean function  µi(x), together with a covariance (kernel) function  ki(x, x′), denoted with  GP(ciµ, ki)for the GP prior on  di(x)in short. We set the mean prior function to be constant, i.e.  µi = ciµ. Given observations  yiN = [yi1, .., yiN]Tat locations  XN = [x1, .., xN]Twhere  yij = di(xj), the posterior distribution of  di(x)is given by


where  kiN(x) = [ki(x1, x), .., ki(xN, x)]T,  kiN(x, x′)is the posterior covariance,  σiN2(x)is the variance,  KiN =[ki(x, x′)]x,x′∈XNis the positive definite covariance matrix matrix and  ciµis a vector of N elements, each equal to  ciµ. The posterior mean for the vector d(x) is µN(x) = [µ0N(x), .., µnN(x)]⊤and the variance  σ2N(x) =[σ0N2(x), .., σnN2(x)]⊤.

B. GP-based bounding of nonlinearity

The GP model provides a measure for the posterior variance of d(x), which is used to improve the bound on the effect of the nonlinearity on the Lyapunov decrease. Instead of using Lipschitz-based arguments, as in Section III, we calculate a strict quadratic bound on highly probable, worst-case realizations of the nonlinear term  x⊤Pd(x)for all  x ∈ SP (γ2i )\SP (γ1i ), given an interval  Γi. The intervals Γiin this section are not limited to a dense data region  Aδ. Therefore one can drop the constraint (11b) in the computation of P in (11) and choose  ¯γ = 1for the construction of the


intervals (6). By relying on the GP, the largest interval  ¯γcan be chosen independent of Assumption II.1 and Remark III.1.

Algorithm 1 summarizes the calculation of the quadratic bound, implementing the following idea. Let f(x) be a function that has to be quadratically upper bounded by  x⊤Qxfor all  x ∈ {SP (γ2i ) \ SP (γ1i )}. In order to enforce the infinite dimensional constraint  ∀x ∈ {SP (γ2i ) \ SP (γ1i )} :f(x) ≤ x⊤Qx, we proceed iteratively by starting with a finite approximation, which will be improved until  f(x) ≤x⊤Qxholds for all  x ∈ {SP (γ2i ) \ SP (γ1i )}.


fined, which returns the maximum value of the nonlinear term  x⊤Pd(x)with a chosen probability, e.g. with 99.73% by letting c = 3. Starting with a finite number of samples of f(x) for  x ∈ SP (γ2i ) \ SP (γ1i )(lines 3,4) we compute an initial guess for a quadratic bound on  x⊤Pd(x)given by  x⊤Q1xin line 5, where


s.t. for all (xi, yi) ∈ P : yi ≤ x⊤i ˜Qxi, which yields a quadratic upper bound on  {yi}Ni=1.

We search for potential violations of the current bound in line 8 and add it to the set of data points in lines 9 and 10. After that we update the quadratic bound in line 11. The algorithm iterates until there is no violation. This way for all  x ∈ SP (γ2i ) \ SP (γ1i ), the quadratic form  x⊤QGP(Γi)xwill be a bound on  x⊤Pd(x)with high probability.

Remark IV.1. The optimization problem in line 8 is continuous (compare for example [19, Chapter 2G]), but nonconvex. An alternative approach is to build a discretization of  R(Γi), which we denote by  Dx, with grid size  δ, |Dx| = N, and evaluate the posterior mean and covariance  µN(x), σN(x)for each  x ∈ Dx. By selecting  pk =x⊤k PµN(xk) + βN�ni=1 σiN(x) + δLx⊤P d(x)(R(Γi))with βNas defined in [20, Lemma 3], the bound  QGPcan be approximated using (8) as a convex optimization problem.

For the second step of calculating a safe set size and controller, we can simply use the bound  QGP(Γi)instead of  Q(Γi)in (10) in order to obtain a set, which is safe with the selected probability. By construction,  QGP (Γi)will be less conservative, or equal than  Q(Γi). This is due to the fact that we put a prior on the unknown nonlinearity d(x),


Fig. 5: Safe RL using the proposed safety framework together with the PGSD [21] algorithm. The distance of the state to the origin, as well as the volume of the safe set is shown over time. Grey-shaded time points depict application of the safe control law  uS.

which allows for Bayesian inference and therefore improved extra- and interpolation based on the data, as opposed to using estimates based on Lipschitz continuity. In Figure 2 (left), the benefit of using  QGPover the Lipschitz based bound Q is illustrated. Moreover, the main advantage is that we do not require particular assumptions on the data and the safe set is not limited to a subset of  Aδ, as it is the case in Lemma III.2.

C. Numerical example: Exploration

Consider a nonlinear system of the form (1) with A = � −1 2−3 4�, B = � 0.5−2�, d(x) = �0.5x21 sin(6x1)−0.8x32 �, input constraints  |u| ≤ 4, and state constraints  |x| ≤ 4. We use a squared exponential kernel as defined in [18] with  σ1f =σ2f = 0.05and  l1 = l2 = 0.2. Given initial data in [−0.2, 0.2]2with  δ = 0.05, we solve (10) using the high probability (c = 3, 99.73%) bound  QGP(Γi)with an initial P obtained via (11) and  Γi = [1 − i0.1 − 0.1, 1 − i0.1]for i = 1, 2, .., 8, i.e. we start with i = 1 and iterate i = 2, 3, .. until we find a feasible solution. We assume that the desired control input  ¯u(t)is given by the policy gradient with signed derivative (PGSD) algorithm (see [21]), which is a policy search RL method without any safety guarantees. During closed-loop operation under the safe control law (2) we collect data D. Every 0.2 seconds we recompute the safe level set, i.e.  γ(t), where the safe set size converges after 2.5 seconds. In Figure 5 the evolution of the safe set size (volume of the ellipse) is shown as well as the distance of the system state to the origin, which has to be minimized by the PGSD learning based control law. The unsafe RL input is ‘overwritten’ three times indicated by the grey lines to ensure safety, until it begins to converge.

This paper presents a safety framework that allows to en- hance arbitrary learning-based and unsafe control strategies, for nonlinear and potentially larger scale systems with safety certificates. The nonlinearity is assumed to be unknown and we only require a possibly inaccurate linear system model and observations of the system. A key feature is that the proposed method directly exploits the available data, without the need of an additional learning mechanism. By relying on convex optimization problems, the proposed method is scalable with respect to the system dimension and number of data points. In order to reduce conservatism of the safe set calculations, the approach was extended using a Gaussian process model as prior on the nonlinearity. This modification enables safe exploration and thereby iterative computation of the safe set during closed-loop operation. The results were demonstrated using several numerical example problems, showing that the safety framework can be used to certify arbitrary and in particular learning-based controllers.

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.

[2] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes,” in 53rd IEEE Conference on Decision and Control (CDC). IEEE, 2014, pp. 1424–1431.

[3] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, “A general safety framework for learning-based control in uncertain robotic systems,” arXiv preprint arXiv:1705.01292, 2017.

[4] J. Garcıa and F. Fern´andez, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[5] K. P. Wabersich and M. Toussaint, “Automatic testing and minimax optimization of system parameters for best worst-case performance,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 5533–5539.

[6] F. Berkenkamp and A. P. Schoellig, “Safe and robust learning control with gaussian processes,” in European Control Conference (ECC), July 2015, pp. 2496–2501.

[7] F. Berkenkamp, A. P. Schoellig, and A. Krause, “Safe controller optimization for quadrotors with gaussian processes,” in Proc. of the International Conference on Robotics and Automation (ICRA), 2016, pp. 491–496.

[8] F. Berkenkamp, R. Moriconi, A. P. Schoellig, and A. Krause, “Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes,” in 55th IEEE Conference on Decision and Control (CDC), Dec 2016, pp. 4661–4666.

[9] M. Chen, J. F. Fisac, S. Sastry, and C. J. Tomlin, “Safe sequential path planning of multi-vehicle systems via double-obstacle hamilton-jacobi-isaacs variational inequality,” in European Control Conference (ECC), July 2015, pp. 3304–3309.

[10] S. Kaynama, I. M. Mitchell, M. Oishi, and G. A. Dumont, “Scalable safety-preserving robust control synthesis for continuous-time linear systems,” IEEE Transactions on Automatic Control, vol. 60, no. 11, pp. 3065–3070, 2015.

[11] J. F. Fisac and S. S. Sastry, “The pursuit-evasion-defense differential game in dynamic constrained environments,” in 54th IEEE Conference on Decision and Control (CDC), Dec 2015, pp. 4549–4556.

[12] J. Darbon and S. Osher, “Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere,” Research in the Mathematical Sciences, vol. 3, no. 1, p. 19, 2016.

[13] I. P´olik and T. Terlaky, “A survey of the S-lemma,” SIAM review, vol. 49, no. 3, pp. 371–418, 2007.

[14] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan, Linear matrix inequalities in system and control theory. SIAM, 1994.

[15] K.-C. Toh, M. J. Todd, and R. H. T¨ut¨unc¨u, “SDPT3a matlab software package for semidefinite programming, version 1.3,” Optimization methods and software, vol. 11, no. 1-4, pp. 545–581, 1999.

[16] M. ApS, The MOSEK optimization toolbox for MATLAB manual. Version 7.1 (Revision 28)., 2016. [Online]. Available: http://docs.mosek.com/7.1/toolbox/index.html

[17] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.

[18] C. E. Rasmussen and C. K. Williams, Gaussian processes for machine learning. MIT press Cambridge, 2006, vol. 1.

[19] A. L. Dontchev and R. T. Rockafellar, “Implicit functions and solution mappings,” Springer Monogr. Math., 2009.

[20] F. Berkenkamp, M. Turchetta, A. P. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Proc. of the Conference on Neural Information Processing Systems (NIPS), 2017.

[21] J. Z. Kolter and A. Y. Ng, “Policy search via the signed derivative.” in Robotics: science and systems, 2009, p. 34.

Designed for Accessibility and to further Open Science