A predictive safety filter for learning-based control of constrained nonlinear dynamical systems

2018·arXiv

Abstract

1 Introduction

Reinforcement learning (RL) has demonstrated its success in solving complex and high-dimensional control tasks, see for example Levine et al. (2016). These results motivate a more widespread transfer to real-world applications to enable automated design of high performance controllers with little need for expert knowledge. In physical systems, such as mechanical, thermal, biological, or chemical systems, physical limitations naturally arise as constraints, such as limited torque in the case of a robot arm or a limited power supply in building control. In addition to physical constraints, many relevant applications in industry require satisfaction of safety specifications, preventing, e.g., an autonomous car or aircraft from crashing, which can typically be formulated in terms of constraints on the system state. The simultaneous satisfaction of safety constraints under physical limitations during RL constitutes one of the main open problems in AI safety as discussed e.g. in Amodei et al. (2016, Section 3).

Significant progress in the safe operation of constrained systems has been made through model predictive control techniques, which provide rigorous constraint satisfaction, see, e.g., Mayne (2014). While model-based RL techniques such as Kamthe and Deisenroth (2017) are conceptually closely related to model predictive control, so far relatively few methods have considered safety guarantees. Learning-based model predictive control aims to combine the benefits of both fields, see for example Hewing et al. (2020) for an overview. In addition to the fact that designing such algorithms with rigorous safety guarantees is rather challenging, often conservative, and requires a considerable amount of expert knowledge, the approach is inherently restricted to a model-based control policy. More precisely, at each time step, a finite-horizon optimal control problem is solved in a receding horizon fashion in order to approximate a potentially infinite horizon optimal control policy.

⋆ This work was supported by the Swiss National Science Foundation under grant no. PP00P2 157601 / 1. The material in this paper was not presented at any conference.

Email addresses: wkim@ethz.ch (Kim P. Wabersich), mzeilinger@ethz.ch (Melanie N. Zeilinger).

Concept: We propose a model predictive control (MPC) variant as a predictive safety filter (PSF), that can turn highly nonlinear and safety-critical dynamical systems into inherently safe systems, and to which any RL algorithm without safety certificates can be applied ‘out-of-the-box’, see also Figure 1. Compared with a standard use of MPC, the PSF verifies if the input proposed by the RL algorithm is safe, otherwise it is entitled to modify the input as little as necessary to maintain safe operation at all future times. This means that the PSF only needs to keep the system safe instead of controlling it well with respect to a certain objective (e.g. comfort or economic criteria). The problem of finding a safety filter is therefore in general less complex than finding a desired optimal policy with respect to some objective and subject to constraints, motivating the combination of a predictive safety filter with an RL algorithm to safely optimize performance.

Fig. 1. Concept of predictive safety filter: Based on the current state x(k), a learning-based algorithm provides a control input , which is processed by the safety filter and applied to the real system.

In contrast to recently proposed related concepts presented in Gillula and Tomlin (2011); Wabersich and Zeilinger (2018a); Ames et al. (2019), we use the notion of a safe system in Figure 1, as similarly introduced in Wieland and Allgöwer (2007) within the context of safety barrier functions. The concept emphasizes the possibility that any RL algorithm that would have been used to control the original system can be applied to the safe system instead, yielding a certified safe RL application. The predictive safety filter provides safety at a desired level of probability, modularity in terms of the employed RL controller, and minimal intervention by filtering RL input signals only if we cannot guarantee safety at the specified probability level, similar to Fisac et al. (2019).

Contributions: Based on a probabilistic model of the system dynamics, which is inferred from data, this paper presents a predictive safety filter that builds on concepts from MPC for constrained nonlinear systems, and thereby generalizes the safety certification method for linear systems proposed by Wabersich and Zeilinger (2018a). Safety of an RL input is thereby enforced in real-time by searching for a safe backup trajectory for the next time step towards a known set of safe states. If necessary, to ensure safety for all times in the future, the search process for a backup trajectory is allowed to modify (filter) the RL input. As MPC typically outperforms non-optimization based techniques, e.g. based on control Lyapunov functions or sliding-mode controllers by solving an approximate optimal control problem on-line, the proposed PSF provides similar advantages compared with methods utilizing, e.g., control barrier functions for safety (Ames et al.; 2019). More precisely, the PSF formulation provides an implicit representation of the set of safe state and input pairs, approximating the largest set of admissible states and inputs using finite-horizon MPC techniques. At the same time, the implicit safe set representation enables favorable scalability properties compared to, e.g., Hamilton-Jacobi-Bellman safety frameworks (Fisac et al.; 2019) by avoiding offline computations that scale exponentially with the number of state dimensions. Clearly, these advantages come at the price of solving an optimization problem online, for which highly efficient tools are, however, available (Domahidi et al.; 2012).

The application to nonlinear and probabilistic system descriptions, obtained, e.g., through machine learning techniques, is enabled via a nonlinear predictive safety filter formulation that is robust in probability and supports state and input dependent uncertainty information. Robustness with respect to non-uniform model uncertainties is enabled by restricting predicted backup trajectories to confident subsets of the state and input space. The corresponding online optimization problem has similar computational complexity to nominal MPC problems, while being less conservative compared to other robust MPC approaches (Hewing et al.; 2020, Section 3.1) that are often based on a uniform uncertainty bound. The proposed formulation leads to a theoretical analysis that rigorously relates parameters of the predictive safety filter and accuracy of its system model to safety in probability. Depending on the desired constraint satisfaction probability, this enables safe exploration beyond available data.

We illustrate the approach using a simulated pendulum swing-up task, in which only little initial data around the stable downward position is available and overshoots of the upward position are prohibited, imposing challenging safety constraints on the system. Scalability and practical implementation are demonstrated in a quadrotor simulation example, considering the task of learning to quickly approach a landing position for a full-scale model with 12 states and 4 inputs.

Discussion: While the focus of this paper is the certification of RL algorithms, the concept can also be used together with, e.g., human inputs. For example, in the case of autonomous driving, the safety filter could be used to ensure safety of either an RL-based controller or a human driver, and can be viewed as a driver assistance system that is able to overrule the student driver (or RL algorithm), if necessary for safety.

2 Related Work

Safe model-free reinforcement learning: There is a growing awareness of safety questions in the domain of artificial intelligence (Amodei et al.; 2016), and several safe reinforcement learning techniques have been proposed, see e.g. García and Fernández (2015) for an overview. Achiam et al. (2017), e.g., provide safety in expectation based on a trust-region approach with respect to the policy gradient.

Most notions of safety considered in this line of research, e.g. one-step constraint satisfaction in expectation, tend to be less strict compared with the probabilistic safety requirements at all time steps in the future as considered in this paper. More importantly, since most techniques are policy-based, safety is coupled to a specific policy and therefore potentially also to a specific task, limiting generalization of the safety certificates.

Learning-based model predictive control: Originating from concepts in robust model predictive control (MPC), extensions of MPC schemes to safe learning-based methods have been proposed, see e.g. Hewing et al. (2020) for a review. In addition, various results have investigated combinations of MPC with learning-based online model identification techniques (Ostafew et al.; 2016; Limon et al.; 2017; Koller et al.; 2018; Soloperto et al.; 2018), also in an adaptive manner (Tanaskovic et al.; 2013). In the context of robotics, similar concepts exist, which are often referred to as funneling, see e.g. Majumdar and Tedrake (2017) and references therein, as well as so called LQR-trees (Tedrake et al.; 2010).

While some of these techniques have been demonstrated to work well in practice (Bouffard et al.; 2012; Ostafew et al.; 2016; Hewing et al.; 2018), they typically either lack rigorous theoretical safety guarantees, tend to be overly conservative by relying on Lipschitz-based estimates in the prediction of the uncertain system evolution, or are restricted to a very specific system class.

Safety barrier functions and safety frameworks: Conceptually, the idea of safety architectures, as illustrated in Figure 1, was originally proposed by Seto et al. (1998), where a principled switching between a safety controller and basic/experimental controllers enables safe controller tuning online. Based on this concept, control theoretic frameworks have been developed (Prajna and Jadbabaie; 2004), related to control Lyapunov functions (Wieland and Allgöwer; 2007), and have also become known as control barrier functions, see Ames et al. (2019) for an overview. Recent developments also consider the combination with learning tasks through data-driven models (Ohnishi et al.; 2019). While such frameworks inherit strong theoretical results from control Lyapunov function theory, they require the explicit availability of a control barrier function, which is difficult to compute in general. In particular, to the best of the authors’ knowledge, only approximate approaches exist using ellipsoidal or sum-of-squares computations (Wang et al.; 2018) to design a safety barrier function with respect to given, e.g. polytopic, state and input constraints.

In the case of partially known system dynamics, the concept of a control barrier function can be combined with Bayesian model estimates from data that validate the resulting closed-loop system (Berkenkamp et al.; 2016). The techniques share similar limitations with safe model-free RL methods, namely that they are tailored to a specific task. A task-independent learning-based safety framework has been introduced in Gillula and Tomlin (2011), which generalizes the concept of explicitly knowing a barrier function. It consists of a model-based safe set of system states and computes a corresponding safe control policy, which is entitled to override a potentially unsafe RL algorithm to ensure invariance with respect to the safe set of system states i.e. containment within the safe set at all times. This concept was further developed in several papers, providing principled methods to compute the safe set as well as the corresponding safe policy (Fisac et al.; 2019; Wabersich and Zeilinger; 2018b), which build the foundation of the safety filter presented in this paper.

The aforementioned techniques related to both safety barrier functions and safety frameworks either suffer from limited scalability to higher dimensional and complex systems, or only provide principled design computations for specific constrained model classes, such as linear or polynomial models. While also building on the same high-level concept, this paper addresses these limitations by 1) considering a stochastic nonlinear system model belief, which is well-suited to learning-based control of highly nonlinear and unstable system dynamics, and 2) a unified MPC-inspired formulation for the safety policy (predictive safety filter), which avoids the explicit computation of a safety barrier function or a safe set.

In particular, compared to similar MPC-inspired safety mechanisms such as Wabersich and Zeilinger (2018a); Mannucci et al. (2018); Li and Bastani (2019), we consider nonlinear system models with stochastic parameter uncertainties and provide a predictive safety filter formulation that is capable of leveraging the resulting state and input dependent uncertainty estimates to reduce conservatism.

3 Problem Statement

Notation: The set of integers in the interval is , and the set of integers in the interval is . The i-th row and i-th column of a matrix is denoted by and . By we denote the vector of ones with length

Consider deterministic discrete-time systems of the form

with dynamics parametrized by and stochastic initial condition with known distribution . The system is subject to polyhedral state and input constraints and , originating from physical limitations and safety requirements. We consider the case of unknown ‘real’ parameters , but assume the availability of a distribution

which can be estimated from data. The overall objective is to safely find a policy that either minimizes an episodic, finite-time or infinite horizon objective

with , and stochastic stage cost consisting of a deterministic part and zero mean i.i.d. stochastic noise . In order to prescribe a desired level of caution and desired conservatism in exploration, we consider safety in terms of constraint satisfaction at a desired probability level as

This paper addresses the problem of implementing a safety filter as shown in Figure 1, which ensures closed-loop safety according to (4). The filter enables application of any RL algorithm to the virtual input of the safe system, i.e. , with the goal of minimizing the objective, while ensuring safety by selecting the input to the real system as . In other words, the approach turns a safety-critical task into an unconstrained task with respect to the safe system dynamics such that any RL algorithm can be safely applied, see Figure 1. To further specify the desired properties of , consider the following definition of a safety certified learning-based control input.

Definition 3.1. An input is certified as safe for system (1) at time step and state with respect to a , if and application of for implies safety for all times according to (4).

Following this definition, the goal is to provide a safety filter that restricts learning as little as possible by certifying a large set of learning inputs for a given state x(k). If the learning input cannot be certified as safe, the safety filter provides an alternative safe input, i.e. , where the filter aims at the smallest possible modification by, e.g., minimizing . The following section introduces the mechanisms of the proposed predictive safety filter, which builds on a predictive constrained control formulation, planning safe trajectories based on a probabilistic model belief to ensure safe system operation at all times according to (4).
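The certification idea can be made concrete with a minimal sketch for a scalar integrator x(k+1) = x(k) + u(k). The system, the bounds, and the function name `safety_filter` are illustrative assumptions and not part of the paper's setup. Because u = 0 keeps any admissible state admissible in this toy case, checking the next state is sufficient, and the filter reduces to projecting the learning input onto the interval of safe inputs.

```python
def safety_filter(x: float, u_learn: float,
                  x_max: float = 1.0, u_max: float = 0.5) -> float:
    """Return the input closest to u_learn with |x + u| <= x_max and |u| <= u_max.

    Assumes |x| <= x_max, so the interval of safe inputs is non-empty (u = 0 works).
    """
    lo = max(-u_max, -x_max - x)      # smallest safe input
    hi = min(u_max, x_max - x)        # largest safe input
    return min(hi, max(lo, u_learn))  # projection = minimal modification


# A safe learning input passes unchanged; an unsafe one is modified as little as possible.
unchanged = safety_filter(0.0, 0.3)
modified = safety_filter(0.8, 0.5)    # reduced so that x + u does not exceed x_max
```

For general nonlinear systems a one-step check is not enough, which is why the filter developed in the following sections plans a complete backup trajectory instead.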

4 Predictive safety filter

We first develop an intuitive understanding of the predictive safety filter by considering a simplified setting and assuming perfect model knowledge in Section 4.1, which is then extended in Section 4.2 to an uncertain model (1), (2) inferred from data, for which rigorous proofs are provided. As will be shown, the presented method establishes safety by relying on controllability of (1) along system trajectories, in combination with an efficient mechanism that forces the system to enter uncertain areas of the state and input space only cautiously.

4.1 Nominal (simplified) predictive safety filter

Consider the simplified situation where the real system dynamics (1) are perfectly known for some subset of the state and input space, as specified in the following.

Assumption 4.1. There exists a set , such that for all and some it holds that .

Similarly to Wabersich and Zeilinger (2018a), we propose a predictive safety filter that is not pre-computed, but defined via an optimization problem and computed on-the-fly. The main working mechanism is the construction of safe backup plans that, if applied, would keep the system provably safe in the future, see Figure 2 (left) for an illustration. The backup plans are defined via (5), where denote the planned states computed at the current time step k and predicted i time steps into the future with planning horizon N using the corresponding input sequence . One of the key challenges in computing the backup plans is to deal with the fact that a good model is not known in unexplored regions of the state-space, i.e. , shown as red (unconfident model) sets in Figure 2. In the nominal setting, we simply address this problem by enforcing the system to strictly stay within the confident model subset via (5e). One of the main problems addressed in the next section will be to relax this constraint to enable cautious exploration of such unconfident subsets. The purpose of the remaining constraints in (5) is to construct backup plans that lead the system within state and input constraints X and U (5c), (5d) into a safe terminal set in N steps (5f).

The objective of constructing the backup plans in (5) is to minimize the deviation between the first element of the input sequence and the input requested by the RL algorithm, such that if is safe. Conceptually, this mechanism is similar to QP-based barrier function methods (Ames et al.; 2017), where the input is adjusted to remain inside an explicitly known invariant set, with the key difference that here the safe set is implicitly defined. The resulting nominal predictive safety filter is then given by , with being the optimal first control input obtained from (5) based on a prediction horizon of length N. To ensure constraint satisfaction beyond the planning horizon, (5) utilizes a mechanism common in predictive control (see e.g. Chen and Allgöwer (1998)), by requiring the last state of the sequence to lie in a safe terminal set of system states , for which a locally valid safety filter is known.

Fig. 2. The basic idea of the predictive safety filter explained using a nominal, simplified version in the left column and the final method on the right. The illustrations show the system state at time k with safe backup plan for a shorter horizon obtained from the solution at time , depicted in brown, and areas with poor model quality in red. An arbitrary learning input is certified if a feasible solution towards the terminal safe set can be found, as shown in green. If this new backup solution cannot be found and the planning problem (5)/(6) is infeasible, the system can be driven to the safe set along the brown previously computed trajectory. Left (NPSF): By assuming perfect system knowledge, the computed backup plans correspond exactly to the true state dynamics and constraints are guaranteed to be satisfied using the nominal backup trajectory. Right (PSF): Backup plans are computed w.r.t. the nominal expected state . The true state trajectory lies within a growing tube around the nominal state with probability , which needs to be considered using tightened constraints according to (9). Panels in the left column: Nominal online problem; Algorithm 1 (Nominal PSF); Illustration of nominal PSF.

Assumption 4.2. There exists a terminal safe set , with Lipschitz continuous with Lipschitz constant , and a corresponding terminal safety filter , such that if , then application of implies that and for all .

A terminal safe set and the corresponding controller can be chosen, e.g., as a classical terminal set for nonlinear (robust) MPC (Chen and Allgöwer; 1998), regions around stable steady-states of system (1), or using expert system knowledge as is demonstrated in Section 5.
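The structure of problem (5) can be sketched from the constraint labels cited in the text. The following is one reading under assumed notation, not a verbatim reproduction: $x_{i|k}$ and $u_{i|k}$ denote planned states and inputs, $u_L(k)$ the learning input, $\mathcal{Z}_c$ the confident set of Assumption 4.1, and $\mathcal{S}^t$ the terminal safe set of Assumption 4.2. The choice of norm, the use of label (5b) for the dynamics, and the index ranges are assumptions as well.

```latex
\begin{align}
\min_{u_{0|k},\dots,u_{N-1|k}} \quad & \left\| u_L(k) - u_{0|k} \right\| \tag{5a} \\
\text{s.t.} \quad
  & x_{0|k} = x(k), \quad x_{i+1|k} = f(x_{i|k}, u_{i|k}),
    && i = 0,\dots,N-1, \tag{5b} \\
  & x_{i|k} \in X, && i = 0,\dots,N-1, \tag{5c} \\
  & u_{i|k} \in U, && i = 0,\dots,N-1, \tag{5d} \\
  & (x_{i|k}, u_{i|k}) \in \mathcal{Z}_c, && i = 0,\dots,N-1, \tag{5e} \\
  & x_{N|k} \in \mathcal{S}^t. \tag{5f}
\end{align}
```

In this reading, a zero objective value means that the learning input is applied unchanged, and any feasible solution is a backup plan that reaches the terminal safe set in N steps.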

Based on problem (5), the predictive safety filter is defined by Algorithm 1 (Figure 2, left). At every time step, we attempt to solve optimization problem (5). If problem (5) is feasible at time k, safety, i.e., , , directly follows from (5c), (5d). Due to the generality of the terminal safe set, however, problem (5) may become infeasible for some state x(k), even after being feasible at the previous time step . Algorithm 1 implements a shrinking horizon mechanism similar to Thomas et al. (1994); Grüne and Palma (2014) to also provide a feasible safe trajectory and input sequence towards the terminal safe set for this case, as detailed in the following:

Assume that (5) was feasible at time with corresponding optimal input sequence . Application of results in a safe state x(k) as depicted in Figure 2 (left), because by (5e) and therefore X by (5c). At the next time step k, if (5) is not feasible, we can still solve (5) with a reduced planning horizon . This can be easily verified by noting that for , i.e. the tail of the previously computed feasible trajectory from time step , is a feasible solution as depicted by the brown trajectory in Figure 2 (left). Feasibility of (5) for a reduced horizon again directly provides .

The same holds true in the case that j < N steps were consecutively infeasible for planning horizon N, i.e. (5) will then be feasible with horizon until we reach the safe terminal set. This shortening of the horizon is implemented in lines 6-7 of Algorithm 1. If the horizon length reaches 0, the state is in the terminal set and can be applied to ensure (line 9). Note again that if (5) is feasible at time k (lines 3-4), can be applied, which ideally results in (i.e. objective (5a) is zero) as shown in Figure 2 (left) together with the optimal backup plan in green. Algorithm 1 therefore ensures constraint satisfaction at all time steps, realizing a predictive safety filter in a receding horizon fashion with varying prediction length. The next section will extend the previously introduced basic concept of the predictive safety filter to consider a data-driven approximate system belief, represented by (1), (2), subject to probabilistic constraint satisfaction (4).
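The shrinking-horizon logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the class `NominalPSF`, the toy planner `integrator_solver` for the scalar integrator x+ = x + u, the bounds, and the `flaky` wrapper that artificially forces full-horizon infeasibility are all hypothetical.

```python
from typing import Callable, List, Optional

Plan = List[float]  # input sequence u_{0|k}, ..., u_{h-1|k}
Solver = Callable[[float, float, int], Optional[Plan]]


class NominalPSF:
    """Shrinking-horizon logic described in the text (names are hypothetical)."""

    def __init__(self, solve: Solver, terminal_filter: Callable[[float], float],
                 horizon: int) -> None:
        self.solve = solve
        self.terminal_filter = terminal_filter
        self.horizon = horizon
        self.infeasible_steps = 0  # consecutive steps without a full-horizon plan

    def __call__(self, x: float, u_learn: float) -> float:
        plan = self.solve(x, u_learn, self.horizon)
        if plan is not None:                  # full horizon feasible
            self.infeasible_steps = 0
            return plan[0]
        self.infeasible_steps += 1
        h = self.horizon - self.infeasible_steps
        if h <= 0:                            # horizon used up: terminal set reached
            return self.terminal_filter(x)
        plan = self.solve(x, u_learn, h)      # tail of the old plan is feasible
        if plan is None:
            raise RuntimeError("no backup plan: initial feasibility was violated")
        return plan[0]


def integrator_solver(x_max=1.0, u_max=0.5, s_max=0.2, tol=1e-9) -> Solver:
    """Brute-force planner for x+ = x + u with terminal set |x| <= s_max."""
    def solve(x: float, u_learn: float, h: int) -> Optional[Plan]:
        clipped = min(u_max, max(-u_max, u_learn))
        grid = [clipped] + [u_max * (i / 20 - 1) for i in range(41)]
        for u0 in sorted(grid, key=lambda u: abs(u - u_learn)):
            plan, z, ok = [u0], x + u0, True
            for _ in range(h - 1):
                ok = ok and abs(z) <= x_max + tol   # state constraint
                u = min(u_max, max(-u_max, -z))     # steer toward the origin
                plan.append(u)
                z += u
            if ok and abs(z) <= s_max + tol:        # terminal constraint
                return plan
        return None
    return solve


base = integrator_solver()
force_infeasible = {"on": False}


def flaky(x: float, u_learn: float, h: int) -> Optional[Plan]:
    # Demo only: pretend the full-horizon problem fails when the flag is set.
    return None if (force_infeasible["on"] and h == 3) else base(x, u_learn, h)


psf = NominalPSF(flaky, terminal_filter=lambda x: -x, horizon=3)
x, states = 0.9, []
for k in range(4):
    force_infeasible["on"] = k >= 1
    x += psf(x, 0.5)   # the learning input always pushes toward the constraint
    states.append(x)
```

In this run the first step is filtered with the full horizon, the next two steps fall back to horizons 2 and 1, and the last step applies the terminal filter, so the state stays within the constraint and ends in the terminal set.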

4.2 Predictive safety filter

A key goal of the safety filter is to support exploration beyond available data via the learning policy , in which case Assumption 4.1 does not necessarily hold. While fast approximate computation of the backup trajectories can still be performed online using the mean estimate of the parameter , we need to safely handle the resulting non-vanishing model error

In the following, we first treat uncertainty via a uniform error bound to introduce the safety filter for uncertain systems, which is then extended to consider a less conservative bound and impose it as a constraint in the filter planning problem, in order to reduce conservatism.

Uniformly bounded model error: Assume that the model error with respect to the point estimate can be bounded as

with and Lipschitz continuous and linearly bounded from below, i.e. constants and exist such that , which implies compactness of E. In this case, the filter can still compute backup plans using the point estimate , however, in contrast to the nominal case in Section 4.1, the constraints in (5) are modified such that prediction errors induced by (7) are compensated to ensure constraint satisfaction.

We denote the nominal (expected) system states as , corresponding to the nominal input sequence according to . Due to the model error (7), we need to address the fact that potentially , i.e. , when applying the nominal input , even though the corresponding nominal predicted state satisfies . A common strategy for achieving robustness in predictive control is to tighten the constraints by leveraging controllability along any possible predicted state sequence (Mayne; 2014). Intuitively speaking, controllability enables efficient compensation of deviations via feedback control. More precisely, the possible deviations can be bounded by a decay constant, expressed by a parameter , at which a controller can compensate disturbances of a certain magnitude, defined proportionally to a parameter . Using these two measures, deviations from the planned nominal trajectory can be compensated via an iterative tightening of the constraints. This allows a flexible response to upcoming disturbances at the desired probability level during consecutive time steps via replanning, thereby enabling overall constraint satisfaction. Following Köhler et al. (2018b), we tighten the constraints (5c), (5d), and (5f) in the computation of the backup plans as

implementing a trade-off between compensation and magnitude of disturbances via the converging recursion

with design parameter and parameter that depends on system (1) as follows.

Assumption 4.3. There exists a control policy , a function , which is continuous in its first argument and satisfies V (x, x, v) = 0 for all , and parameters , such that for a given the following properties hold for all :

and if in addition then

$$V\big(f(x, \pi(x, \mu, v)),\; f(\mu, v),\; v^{+}\big) \le \rho\, V(x, \mu, v).$$

Informally, Assumption 4.3 defines how well the uncertain system can be controlled in a neighborhood of predicted nominal backup plans . Intuitively speaking, considering the task of tracking a reference trajectory as an optimal control problem with value function V (using for example a linear quadratic regulator in the linear dynamics setting), parameter defines ‘how fast’ a reference can be reached, measured in terms of the contraction rate of the optimal tracking cost V . Interestingly, this translates into a system-theoretic requirement on system (1), or more precisely to local incremental stabilizability, which can be formally verified based on a system linearization, as discussed in Köhler et al. (2018a, Prop. 1). The condition can also be found in Appendix A.6 and provides explicit choices for V and . It is, however, important to note that the final algorithm only requires existence of the policy and the corresponding function V , rather than their explicit form.

These concepts lead to a robustified version of the nominal predictive safety filter defined in (6) and Algorithm 2 (Figure 2, right), where we omit (6e) in the case of uniformly bounded errors (8). Assumption 4.3 ties the model uncertainty (8) to the constraint tightening (9) to ensure the existence of a safe backup plan at all times and allows extension of the arguments for the nominal case to a probabilistic model belief. If (6) is feasible at time and the error bound E according to (8), i.e. , is sufficiently small with respect to (see also Section 4.3 for a detailed discussion) with probability , then at time k, the input sequence based on the plan computed at time step

for with according to Assumption 4.3, and according to (6b), provides a feasible solution to (6) with planning horizon (Algorithm 2, line 6) with probability . Again, the tracking policy is only used in order to show that a solution to (6) exists, but it is not needed for implementation of the approach. The same argument holds true for all until the terminal set is reached (line 10), which allows us to establish safety at all times similarly to the nominal case. A formal proof will be given in the following for the more general case including a constraint on model confidence.
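The iterative constraint tightening discussed in this part can be illustrated with a small sketch. The exact recursion is not reproduced here; the code assumes a geometric form eps_0 = 0, eps_{i+1} = eps_i + eps * q^i, where `eps` plays the role of the design parameter and `q` in (0, 1) stands in for a contraction factor derived from the parameter of Assumption 4.3. It further assumes polytopic constraints written as A x <= 1, tightened to A x <= (1 - eps_i). Both function names are hypothetical.

```python
from typing import List


def tightening_sequence(eps: float, q: float, horizon: int) -> List[float]:
    """eps_0 = 0, eps_{i+1} = eps_i + eps * q**i (assumed geometric recursion)."""
    if not (eps > 0 and 0 < q < 1):
        raise ValueError("need eps > 0 and 0 < q < 1")
    seq = [0.0]
    for i in range(horizon):
        seq.append(seq[-1] + eps * q**i)
    return seq


def in_tightened_set(A: List[List[float]], x: List[float], eps_i: float) -> bool:
    """Check A x <= (1 - eps_i) * 1 for a polytope written as A x <= 1."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= 1.0 - eps_i for row in A)


eps_seq = tightening_sequence(eps=0.05, q=0.8, horizon=10)
limit = 0.05 / (1 - 0.8)   # the sequence converges to eps / (1 - q)
box = [[1.0], [-1.0]]      # |x| <= 1 written as A x <= 1
```

Under these assumptions the tightening grows along the prediction horizon but stays below `limit`, so a state that is admissible early in the plan, such as x = 0.9 at step 1, can be excluded at a later prediction step.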

Planning in confident subspaces: To reduce conservatism introduced by uniformly overbounding the uncertainty in (8), a central novelty in the proposed safety filter is the ability to restrict planning to regions in the state and input space (see also Figure 2) where we are sufficiently confident about the system dynamics. More precisely, we restrict predictions to subspaces where the model error (7) is contained in a pre-specified, reduced allowable error set of the form

with scaling factor , which allows easy adjustment of the maximum error magnitude due to the relation , see proof of Lemma A.4. A simple approach would be to compute the region offline and add it as an additional state and input constraint, as was similarly done for the case of linear dynamics with state dependent uncertainties by Soloperto et al. (2018). However, it is in general difficult to compute analytically and in addition, the set needs to be recomputed once the model belief (1), (2) is updated based on observed data. We therefore reformulate the requirement to stay inside as an implicit constraint, avoiding the explicit computation of , and include it in the online predictive safety filter problem (6) using the following definition:

Definition 4.4. A set-valued map mapping states and inputs from to subsets of E with is a set-valued model confidence map associated with (1), (2), for a given at probability level , if

holds, with as defined in (7).

While E must include all sufficiently common model errors, must only include errors that are sufficiently common at (x, u). Note that according to Definition 4.4 it is not sufficient to guarantee that (13) holds for some k, but it has to hold for all to ensure safety for all times, including also the case . In practice, it might be challenging to select a parametric system class and to infer a representative parameter distribution from a data set D that allows construction of a set-valued model confidence map. In the following we therefore briefly discuss an example of how to design (13) from data using Bayesian regression and refer to Hewing et al. (2020, Section 3) for a review of data-driven prediction models that provide bounds on the model uncertainty.

Data-driven set-valued model confidence map: Consider a Bayesian description of (1) with prior distribution and posterior estimate , inferred from available system data . Define a confidence region at probability level of the random parameters as . A set-valued model confidence map according to Definition 4.4 is then given by

x, u) = (14) x, u, θx, u, θ ,

as it follows from the definition of that

with as shorthand for the random event introduced in (13). Note that similar set-valued model confidence maps can be obtained when using non-parametric Gaussian process regression, by assuming that the system dynamics (1) have bounded norm in a reproducing kernel Hilbert space (Chowdhury and Gopalan; 2017, Theorem 2). In case of large amounts of available data on the whole state and input space a uniform confidence map can be selected using, e.g., Lipschitz arguments similar to Limon et al. (2017), i.e., , reducing to the special case (8).
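The construction of a confidence map from a parameter confidence region can be illustrated for a scalar toy model x+ = theta * x + u. The model, the Gaussian prior, the assumed measurement noise, and the function names `posterior` and `confidence_map` are illustrative assumptions, not the paper's setup. The sketch shows the key property: the error interval at (x, u) is obtained by mapping one fixed parameter region through the model, so its size depends on the state.

```python
from statistics import NormalDist
from typing import List, Tuple


def posterior(xs: List[float], ys: List[float], prior_mean: float,
              prior_std: float, noise_std: float) -> Tuple[float, float]:
    """Conjugate Gaussian posterior (mean, std) of theta for y = theta * x + noise."""
    precision = 1.0 / prior_std**2 + sum(x * x for x in xs) / noise_std**2
    mean = (prior_mean / prior_std**2
            + sum(x * y for x, y in zip(xs, ys)) / noise_std**2) / precision
    return mean, precision ** -0.5


def confidence_map(x: float, theta_std: float, p: float) -> Tuple[float, float]:
    """Interval of model errors (theta - mean) * x over the level-p parameter region."""
    z = NormalDist().inv_cdf(0.5 * (1.0 + p))   # two-sided Gaussian quantile
    radius = z * theta_std * abs(x)
    return -radius, radius


# Data only near the origin, with y = x_next - u (assumed noisy observations).
xs = [0.10, -0.10, 0.05]
ys = [0.09, -0.088, 0.046]
mean, std = posterior(xs, ys, prior_mean=1.0, prior_std=1.0, noise_std=0.01)
near = confidence_map(0.1, std, p=0.95)   # small error interval close to the data
far = confidence_map(1.0, std, p=0.95)    # larger interval away from the data
```

Because the probabilistic statement is made once about the parameter, the resulting error intervals hold jointly for all states and inputs, which mirrors the requirement that the confidence map be valid at all time steps.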

As discussed for the case of uniformly bounded errors, the tightened constraints (9) ensure safety, if (8) holds for small enough. Since is unknown, and we cannot simply impose in (6) to restrict planning to confident subsets, we make use of the model confidence map in Definition 4.4 to enforce

implying with probability . To this end, we impose (6e) on the nominal plan , where constraint (15) is tightened similarly to (9) using

The tightening again ensures the existence of a feasible solution when replanning with a shorter horizon (Algorithm 2, line 6). In order for the filter to ensure safety in probability using (6e), we additionally require that small changes of the nominal predicted trajectory must not lead to arbitrarily large changes in the model confidence by assuming that the set-valued model confidence map is Lipschitz continuous in terms of the Hausdorff metric (see Definition A.1 and A.2 in the appendix).

Assumption 4.5. There exists a set-valued model confidence map associated with (1), (2), which is Lipschitz continuous with Lipschitz constant under the Hausdorff metric with respect to .

Note that for common models, such as Gaussian processes, Assumption 4.5 is generally fulfilled, compare also with Fisac et al. (2019, Proposition 11). The above assumptions allow us to extend the ideas from the uniform error bound to make use of a potentially reduced error bound that is ensured by imposing (6e) on the backup plan, and thereby again characterize the relation between the tightening in (9), (16) and the specified tolerated model error (12). This leads us to the main result of the paper, showing that the proposed predictive safety filter guarantees safety in probability at all times according to (4).

Theorem 4.6. Let Assumptions 4.2, 4.3 and 4.5 hold and select a tightening factor . If in Assumption 4.5 is sufficiently small, i.e. if for a sufficiently small constant c > 0, then one can always select a sufficiently small such that the initial feasibility of (6) for x(0) implies that as defined in Algorithm 2 ensures safe system operation according to (4).

The proof is provided in the appendix. Theorem 4.6 implies that for sufficiently small one can specify a constraint tightening through and impose a corresponding sufficiently small admissible error set scaling in (12), such that if (6) is initially feasible for x(0), application of Algorithm 2 will keep the system safe in probability. Thereby, the upper bound on results from the linear lower bound on in (8), (12) and intuitively means that the set-valued model confidence map estimate in Assumption 4.5 can only change at a specific rate with changing states or inputs that is linearly bounded in terms of the tightening fraction .

While the exact values of the bounds derived in the proof of Theorem 4.6 might be difficult to compute explicitly for design of the PSF, the corresponding analysis in Appendix A.2 unveils inner relations of all design parameters that can be used for efficient practical tuning guidelines as presented in Section 4.3. A specific choice of parameters can then be verified as described in Appendix A.3.

Remark 4.7. While the combination of the proposed safety filter with a learning-based controller naturally restricts exploration, the probabilistic model together with probabilistic constraints provide a principled way to adjust the probability associated with the confident subset, and thereby allow for some exploration beyond the available data as illustrated in the numerical example in Section 5. Large model uncertainties might, however, cause infeasibility of the PSF problem (6) at the initial condition of the system. This would either require the enlargement of the prediction horizon N or the lowering of the probability level for safety in (4), see also Section 4.3 for practical tuning guidelines.

4.3 PSF design parameters

In the following, we provide a more detailed discussion of the design parameters and how to select them.

: Minimum contraction rate (‘speed’) at which the system can reduce the distance (in terms of an appropriate energy function) to a nominal reference trajectory. For example, consider the extreme case of a deadbeat controller that can steer the system to any given reference in one time step. According to Assumption 4.3 this translates into , which renders the constraint tightening (9) constant after one time step. In contrast, systems with very slow convergence rates are characterized by , corresponding to the worst case in terms of the constraint tightening (9). A cautious choice is therefore .

: Constraint tightening factor along predicted backup plans. While depends on intrinsic system properties, is a design parameter that allows a trade off of the maximum tolerated prediction model errors (12), i.e. the magnitude of against the conservatism of the predictive safety filter, i.e. the constraint tightening. This can be seen explicitly through the sufficient bounds on in the proof of Lemma A.4, (A.8), which are linear in , i.e. . To satisfy the lower bound according to Theorem 4.6 while preventing the tightened sets (9) from being empty at the end of the planning horizon, a cautious initial choice is given by , where the prediction horizon length N can additionally be reduced to account for larger values of .

: From the definition of in (8) and (12) it follows that holds, i.e. linearly affects the maximum allowable uncertainty in the confident subset of the state space. From Theorem 4.6 it follows that for any valid exists, such that initial feasibility of the predictive safety filter implies chance constraint satisfaction according to (4). The bound on is provided in (A.8) in the appendix. Since smaller values of render the set-valued model confidence map constraint (6e) more conservative, the goal during tuning is to find the largest tolerable uncertainty for a given configuration .

: Desired probability level of safety according to (4). Depending on the application, one might consider lowering the probability level for an efficient exploration phase, before enforcing larger values for cautious long term operation. More precisely, the limit case allows selection of , which virtually disables the set-valued model confidence map (13) and backup plans are not restricted to confident subsets anymore. In turn, selecting results in a robust version of the predictive safety filter and therefore limits exploration.

In summary, the small number of design parameters and their interpretability allow for an efficient design of the PSF without more involved and potentially conservative design procedures to formally satisfy the required assumptions, e.g., Assumption 4.3 (see also Köhler et al. (2018a)) or Assumption 4.5. A cautious initial selection of the design parameters is given by , and possibly small for a required probability level and planning horizon N. A practical choice of is discussed in Section 5. The set of parameters can then be verified offline as described in Appendix A.3. If these conservative design parameters cannot be verified, then either the planning horizon N can be reduced to increase , Assumption 4.3 does not hold, more data needs to be collected, or the prior information about needs to be refined to render the set-valued model confidence map according to Assumption 4.5 less conservative.

Fig. 3. Comparison of closed-loop swing-up trajectories during 120 learning episodes under challenging position constraints (dashed lines). Top: Red lines show different learning episodes based on an unsafe learning policy. Middle: Closed-loop learning trajectories using a predictive safety filter with 10 data points (green) and 18000 data points (blue). Bottom: Resulting safe optimal closed-loop performance. The circle radii indicate the relative magnitude of safety-ensuring modifications of the learning policy.

5 Application to numerical examples

5.1 Swing-up: Safe exploration beyond initial data

We consider the classical control problem of swinging up a pendulum from the downward position with angle to the upward position () with limited input authority, unknown system parameters, and under challenging safety constraints of the form , such that the pendulum is not allowed to tip over once the upward position has been reached. The discretized dynamics x(k + 1) = f(x(k), u(k)) are simulated using and , where is the angle, is the angular velocity at time step k, h = 0.02 [s] is the discretization interval, is the gravity constant, l = 0.5 [m] is the length, m = 0.15 [kg] is the mass, is the friction and the input torque u is restricted to .
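As a rough illustration of the simulated system, the following Python sketch integrates a damped pendulum with a forward-Euler step using the stated values h = 0.02 [s], l = 0.5 [m], and m = 0.15 [kg]. The exact update equations, the friction coefficient, the gravity constant, and the torque bound are not reproduced in the text above, so the point-mass model, the values of `GRAVITY`, `FRICTION`, and `U_MAX`, and the angle convention (0 rad hanging down) are assumptions made only for this example.

```python
import math

H = 0.02        # discretization interval [s] (from the text)
LENGTH = 0.5    # pendulum length [m] (from the text)
MASS = 0.15     # pendulum mass [kg] (from the text)
GRAVITY = 9.81  # assumed gravity constant [m/s^2]
FRICTION = 0.1  # assumed viscous friction coefficient [N m s/rad]
U_MAX = 0.6     # assumed torque bound [N m]


def step(state, torque):
    """One forward-Euler step of an assumed point-mass pendulum model.

    state = (angle [rad], angular velocity [rad/s]); angle 0 is hanging down.
    """
    angle, velocity = state
    torque = max(-U_MAX, min(U_MAX, torque))  # saturate the input torque
    inertia = MASS * LENGTH ** 2
    acceleration = (
        -GRAVITY / LENGTH * math.sin(angle)
        + (torque - FRICTION * velocity) / inertia
    )
    return (angle + H * velocity, velocity + H * acceleration)


def simulate(state, torques):
    """Roll the model forward under an open-loop torque sequence."""
    trajectory = [state]
    for torque in torques:
        state = step(state, torque)
        trajectory.append(state)
    return trajectory
```

Calling `simulate((0.0, 0.0), torques)` with a list of torques returns the visited states, which is the kind of rollout an episodic learning loop would evaluate.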

For learning the swing-up task, we consider an episodic learning setting with horizon length and parametrize a bang-bang open-loop input signal as

with switching times and subject to and . The learning objective is defined as , where the first term describes the distance to the desired upward position given by 180 [deg], while the second term penalizes safety-ensuring interventions by the safety filter and therefore accelerates learning convergence as discussed in Akametalu et al. (2014). Efficient learning-based optimization of the parameters and is performed using Bayesian Optimization as described in Neumann-Brosig et al. (2019), selecting parameter configurations that automatically trade off exploration of the parameter space and exploitation of promising subsets. While direct application of this learned policy yields a swing-up after some episodes, it causes significant constraint violations as shown in Figure 3 (top), motivating the application of the presented safety filter in the following.

Predictive safety filter from data: The transition model (1), (2) is obtained via linear Bayesian regression (Rasmussen and Williams; 2006), i.e., f(x, u) = , with , unknown parameters with Gaussian priors, and Gaussian noise on obtained system measurements. The set-valued model confidence map according to Definition 4.4 for parametric uncertainties is given by (14) and can here be defined as where with being the posterior variance of conditioned on data and the chi-squared distribution of degree 2 (Slotani; 1964). Starting with 10 data points around the downward position, we update the model belief using the acquired data after each episode. The tightening was experimentally chosen as using sampled system realizations of the posterior distribution as described in Appendix A.3. The corresponding admissible error set is defined as the 2-norm ball with radius . Consequently, set-valued map constraints of the form according to (6e) can be efficiently implemented as for j = 1, 2, i.e. by enforcing all semi-axes of to be smaller than or equal to the radius of the admissible error set . The desired probability of chance constraint satisfaction was chosen as 0.95. As the terminal safe set, we select

with . The resulting problem (6) with planning horizon N = 50 was solved in real-time using Ipopt (Wächter and Biegler; 2006) together with the CasADi framework (Andersson et al.; 2018) for automatic differentiation.

Fig. 4. Quadrotor experiment using the Bullet Physics SDK (Coumans and Bai; 2016–2019). Top-left: Graphical interface showing the optimal safe trajectory (blue line). Top-right: Quadrotor trajectories projected on the plane using an unsafe policy search (red lines) and the safety augmented policy search (blue lines). Bottom: Zoom-in of top-right plot, where red dots represent states with < 0.01 [m] minimum distance to the ground, which we classify as ground contact.
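The semi-axis condition used to implement the set-valued map constraint can be sketched in a few lines of Python. For a chi-squared distribution with two degrees of freedom, the quantile at level p has the closed form -2 ln(1 - p), so no statistics library is needed. The function names, the treatment of the two output dimensions as independent with variances `variances[j]`, and the numerical values mentioned afterwards are assumptions for illustration and are not the values used in the experiment.

```python
import math


def chi2_quantile_2dof(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return -2.0 * math.log(1.0 - p)


def semi_axes(variances, p):
    """Semi-axes of the axis-aligned confidence ellipse at level p."""
    scale = chi2_quantile_2dof(p)
    return [math.sqrt(scale * var) for var in variances]


def is_confident(variances, p, radius):
    """True if every semi-axis fits inside the 2-norm ball of given radius."""
    return all(axis <= radius for axis in semi_axes(variances, p))
```

With hypothetical posterior variances `[1e-4, 4e-4]` and p = 0.95, the check passes for a ball of radius 0.05 and fails for radius 0.04, which shows how a smaller admissible error set restricts planning to regions with lower model uncertainty.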

Results: Combining the learning-based swing-up policy with a predictive safety filter based on only 10 initial data points around the stable downward position results in cautious closed-loop system trajectories that are displayed as green lines in Figure 3 (middle). The corresponding optimal solution after 120 learning episodes is depicted in Figure 3 (middle). We can then leverage the accumulated data from the first experiment (18000 data samples) to refine the prediction model of the safety filter. The additional data enables a significantly less conservative learning behavior, see blue trajectories in Figure 3 (middle), and supports a complete swing-up, which demonstrates safe exploration beyond available data. The corresponding optimal solution after 120 learning episodes is shown in Figure 3 (bottom), where the circle radii indicate the magnitude of safety-ensuring modifications of the learning policy.

5.2 Safe data-driven quadrotor learning control

To demonstrate the presented method for a more challenging simulation example, we consider the AscTec Hummingbird drone, simulated in the Bullet Physics SDK (Coumans and Bai; 2016–2019), see Figure 4 (Top-left), in combination with a single rotor force model (Furrer et al.; 2016). We employ a two-layer control structure, where an inner PD control loop takes desired pitch, roll and vertical acceleration in the body frame and outputs control signals to the motors. This enables modeling of the inner controlled system around the hovering equilibrium as done in Hu et al. (2018) using 10 states , three inputs , and dynamics of the form . State constraints are given by minimum height and maximal vertical velocity to avoid ground contact and to ensure the validity of the dynamics model. The inputs to the inner control loop are normalized to .

The overall goal is to approach the landing position x = 3 [m], y = 2 [m], and z = 0.2 [m] close to the ground, which is indicated by the red cube in Figure 4 (Top-left), starting from an initial hovering position at z = 3.5 [m], x = y = 0 [m] using an outer PD controller with state and input saturation, which is parametrized as

where and with PD-controller gains and . Similar to the example presented in Section 5.1, we use Bayesian Optimization (Neumann-Brosig et al.; 2019), with that heavily penalizes safety-ensuring actions by the safety filter. Direct application of the learning algorithm yields a significant number of ground contacts (in addition to violations of the maximum vertical velocity) as shown in Figure 4 by the red trajectories over a total of 240 learning episodes, where ground contacts are defined as < 0.01 [m] minimum distance to the ground and are highlighted by red dots.

Predictive safety filter from data: Similarly to the numerical example in Section 5.1, we construct a model via Bayesian regression using a Gaussian prior on the parameters together with Gaussian noise on observations. The data required for inferring the prediction model is generated through experiments at a safe altitude using 100 random step inputs that are applied for 60 [s] to the inner control loop. Based on the inferred model, the constraint tightening was experimentally chosen as using posterior samples as described in Appendix A.3 with a maximum allowable error set defined as the 2-norm ball with radius 0.02. The planning horizon is given by N = 20 and the terminal set is selected as a sufficiently high altitude of , from which we can guarantee constraint satisfaction for all future times using, e.g., suboptimal PD controller gains and . The set-valued model confidence map is set up analogously to Section 5.1 with desired chance constraint satisfaction .

Results: As shown in Figure 4, the predictive safety filter enables constraint satisfaction during all 240 learning episodes and results in a favorable optimal trajectory as shown in Figure 4 (Top-left), for which the safety filter is permanently inactive as specified via the objective function.

6 Conclusion

This paper has addressed the problem of safe RL by introducing a predictive safety filter, which enables modularity in terms of safety and the employed RL algorithm. An optimization-based formulation was proposed that provides rigorous safety guarantees using a possibly data-driven approximate system model. By its capability to consider nonlinear and complex system descriptions without being overly conservative, we believe that the proposed approach is an important step towards safe RL for realistic applications.

References

Achiam, J., Held, D., Tamar, A. and Abbeel, P. (2017). Constrained policy optimization, International Conference on Machine Learning, pp. 22–31.

Akametalu, A. K., Kaynama, S., Fisac, J. F., Zeilinger, M. N., Gillula, J. H. and Tomlin, C. J. (2014). Reachability-based safe learning with Gaussian processes, Proceedings of the IEEE Conference on Decision and Control, IEEE, pp. 1424–1431.

Ames, A. D., Coogan, S., Egerstedt, M., Notomista, G., Sreenath, K. and Tabuada, P. (2019). Control barrier functions: Theory and applications, 2019 18th European Control Conference, ECC 2019 pp. 3420–3431.

Ames, A. D., Xu, X., Grizzle, J. W. and Tabuada, P. (2017). Control Barrier Function Based Quadratic Programs for Safety Critical Systems, IEEE Transactions on Automatic Control 62(8): 3861–3876.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J. and Mané, D. (2016). Concrete problems in AI safety, arXiv preprint arXiv:1606.06565 .

Andersson, J. A. E., Gillis, J., Horn, G., Rawlings, J. B. and Diehl, M. (2018). CasADi – A software framework for nonlinear optimization and optimal control, Mathematical Programming Computation .

Berkenkamp, F., Moriconi, R., Schoellig, A. P. and Krause, A. (2016). Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian Processes, 55th IEEE Conference on Decision and Control (CDC), pp. 4661–4666.

Bouffard, P., Aswani, A. and Tomlin, C. (2012). Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results, 2012 IEEE International Conference on Robotics and Automation, pp. 279–284.

Chen, H. and Allgöwer, F. (1998). A quasi-infinite horizon nonlinear model predictive control scheme with guaranteed stability, Automatica 34(10): 1205–1217.

Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits, International Conference on Machine Learning (ICML), pp. 844–853.

Coumans, E. and Bai, Y. (2016–2019). Pybullet, a python module for physics simulation for games, robotics and machine learning.

Domahidi, A., Zgraggen, A. U., Zeilinger, M. N., Morari, M. and Jones, C. N. (2012). Efficient interior point methods for multistage problems arising in receding horizon control, 51st IEEE Conference on Decision and Control (CDC), pp. 668–674.

Fisac, J. F., Akametalu, A. K., Zeilinger, M. N., Kaynama, S., Gillula, J. and Tomlin, C. J. (2019). A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems, IEEE Transactions on Automatic Control 64(7): 2737–2752.

Furrer, F., Burri, M., Achtelik, M. and Siegwart, R. (2016). RotorS-A modular gazebo MAV simulator framework, Studies in Computational Intelligence 625: 595–625.

García, J. and Fernández, F. (2015). A comprehensive survey on safe reinforcement learning, Journal of Machine Learning Research 16: 1437–1480.

Gillula, J. H. and Tomlin, C. J. (2011). Guaranteed safe online learning of a bounded system, Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, IEEE, pp. 2979–2984.

Grüne, L. and Palma, V. G. (2014). On the Benefit of Re-optimization in Optimal Control under Perturbations, Proceedings of the 21st International Symposium on Mathematical Theory of Networks and Systems – MTNS 2014 pp. 439–446.

Hertneck, M., Köhler, J., Trimpe, S. and Allgöwer, F. (2018). Learning an Approximate Model Predictive Controller with Guarantees, IEEE Control Systems Letters 2(3): 543–548.

Hewing, L., Liniger, A. and Zeilinger, M. N. (2018). Cautious NMPC with Gaussian process dynamics for autonomous miniature race cars, 2018 European Control Conference (ECC), IEEE, pp. 1341–1348.

Hewing, L., Wabersich, K. P., Menner, M. and Zeilinger, M. N. (2020). Learning-Based Model Predictive Control: Toward Safe Learning in Control, Annual Review of Control, Robotics, and Autonomous Systems 3(1).

Hu, H., Feng, X., Quirynen, R., Villanueva, M. E. and Houska, B. (2018). Real-Time Tube MPC Applied to a 10-State Quadrotor Model, Proceedings of the American Control Conference, IEEE, pp. 3135–3140.

Kamthe, S. and Deisenroth, M. P. (2017). Data-efficient reinforcement learning with probabilistic model predictive control, arXiv preprint arXiv:1706.06491 .

Karg, B., Alamo, T. and Lucia, S. (2019). Probabilistic performance validation of deep learning-based robust NMPC controllers, arXiv preprint arXiv:1910.13906 .

Köhler, J., Müller, M. A. and Allgöwer, F. (2018a). Nonlinear reference tracking: An economic model predictive control perspective, IEEE Transactions on Automatic Control pp. 1–1.

Köhler, J., Müller, M. A. and Allgöwer, F. (2018b). A novel constraint tightening approach for nonlinear robust model predictive control, 2018 Annual American Control Conference (ACC), pp. 728–734.

Koller, T., Berkenkamp, F., Turchetta, M. and Krause, A. (2018). Learning-based model predictive control for safe exploration, 2018 IEEE Conference on Decision and Control (CDC), IEEE, pp. 6059–6066.

Levine, S., Finn, C., Darrell, T. and Abbeel, P. (2016). End-to-end training of deep visuomotor policies, The Journal of Machine Learning Research 17: 1334–1373.

Li, S. and Bastani, O. (2019). Robust model predictive shielding for safe reinforcement learning with stochastic dynamics, arXiv preprint arXiv:1910.10885 .

Limon, D., Calliess, J. and Maciejowski, J. (2017). Learning-based nonlinear model predictive control, IFAC-PapersOnLine 50(1): 7769–7776.

Majumdar, A. and Tedrake, R. (2017). Funnel libraries for real-time robust feedback motion planning, The International Journal of Robotics Research 36(8): 947–982.

Mannucci, T., van Kampen, E. J., de Visser, C. and Chu, Q. (2018). Safe exploration algorithms for reinforcement learning controllers, IEEE Transactions on Neural Networks and Learning Systems PP: 1–13.

Mayne, D. Q. (2014). Model predictive control: Recent developments and future promise, Automatica 50(12): 2967–2986.

Neumann-Brosig, M., Marco, A., Schwarzmann, D. and Trimpe, S. (2019). Data-Efficient Autotuning With Bayesian Optimization: An Industrial Control Study, IEEE Transactions on Control Systems Technology pp. 1–11.

Ohnishi, M., Wang, L., Notomista, G. and Egerstedt, M. (2019). Barrier-Certified Adaptive Reinforcement Learning With Applications to Brushbot Navigation, IEEE Transactions on Robotics 35(5): 1186–1205.

Ostafew, C. J., Schoellig, A. P. and Barfoot, T. D. (2016). Robust constrained learning-based NMPC enabling reliable mobile robot path tracking, The International Journal of Robotics Research 35(13): 1547–1563.

Prajna, S. and Jadbabaie, A. (2004). Safety verification of hybrid systems using barrier certificates, Lecture Notes in Computer Science 2993: 477–492.

Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press.

Seto, D., Krogh, B., Sha, L. and Chutinan, A. (1998). The simplex architecture for safe on-line control system upgrades, Proceedings of the American Control Conference, Vol. 6, IEEE, pp. 3504–3508.

Slotani, M. (1964). Tolerance regions for a multivariate normal population, Annals of the Institute of Statistical Mathematics 16(1): 135–153.

Soloperto, R., Müller, M. A., Trimpe, S. and Allgöwer, F. (2018). Learning-based robust model predictive control with state-dependent uncertainty, IFAC-PapersOnLine 51(20): 442–447.

Tanaskovic, M., Fagiano, L., Smith, R., Goulart, P. and Morari, M. (2013). Adaptive model predictive control for constrained linear systems, Control Conference (ECC), 2013 European, IEEE, pp. 382–387.

Tedrake, R., Manchester, I. R., Tobenkin, M. and Roberts, J. W. (2010). LQR-trees: Feedback motion planning via sums-of-squares verification, The International Journal of Robotics Research 29(8): 1038–1052.

Thomas, M. M., Kardos, J. L. and Joseph, B. (1994). Shrinking horizon model predictive control applied to autoclave curing of composite laminate materials, Proceedings of the American Control Conference.

von Luxburg, U. and Schölkopf, B. (2011). Statistical Learning Theory: Models, Concepts, and Results.

Wabersich, K. P. and Zeilinger, M. N. (2018a). Linear model predictive safety certification for learning-based control, 2018 IEEE Conference on Decision and Control (CDC), IEEE, pp. 7130–7135.

Wabersich, K. P. and Zeilinger, M. N. (2018b). Scalable synthesis of safety certificates from data with application to learning-based control, 2018 European Control Conference (ECC), pp. 1691–1697.

Wächter, A. and Biegler, L. T. (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming, Mathematical Programming 106(1): 25–57.

Wang, L., Han, D. and Egerstedt, M. (2018). Permissive Barrier Certificates for Safe Stabilization Using Sum-of-squares, Proceedings of the American Control Conference 2018-June: 585–590.

Wieland, P. and Allgöwer, F. (2007). Constructive safety using control barrier functions, IFAC Proceedings Volumes 40(12): 462–467.

A Appendix

A.1 Lipschitz continuity w.r.t. Hausdorff metric

Definition A.1. The Hausdorff metric between two sets A and B in a metric space with metric d is defined as d_H(A, B) = max{ sup_{a ∈ A} inf_{b ∈ B} d(a, b), sup_{b ∈ B} inf_{a ∈ A} d(a, b) }.

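Definition A.2. A set-valued map E mapping vectors from to subsets of is called Lipschitz continuous with Lipschitz constant under the Hausdorff metric with respect to the 2-norm, if for all pairs of arguments the Hausdorff distance between the two image sets is bounded by the Lipschitz constant times the 2-norm of the difference of the arguments.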
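For finite point sets, the Hausdorff distance and an empirical Lipschitz estimate can be computed directly. The Python sketch below assumes the standard definition of the Hausdorff metric (the larger of the two directed distances, each a worst-case nearest-neighbour distance) together with the Euclidean norm. The helper names are hypothetical, and the code only illustrates Definitions A.1 and A.2 on finite sets; it is not part of the verification procedure of the paper.

```python
import math


def directed_distance(set_a, set_b):
    """Largest distance from a point in set_a to its nearest point in set_b."""
    return max(min(math.dist(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Hausdorff distance between two finite, non-empty point sets."""
    return max(directed_distance(set_a, set_b), directed_distance(set_b, set_a))


def empirical_lipschitz(set_map, points):
    """Largest ratio d_H(E(p), E(q)) / ||p - q|| over distinct sample pairs.

    This is only a lower bound on the true Lipschitz constant of set_map.
    """
    best = 0.0
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            gap = math.dist(p, q)
            if gap > 0.0:
                best = max(best, hausdorff(set_map(p), set_map(q)) / gap)
    return best
```

Because the estimate is taken over sampled pairs only, it can indicate that a candidate confidence map violates a desired Lipschitz constant, but it cannot certify that the constant holds everywhere.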

A.2 Proof of Theorem 4.6

We begin by deriving a bound on the amount at which small changes in the planned nominal trajectory affect the set membership constraint (6e) in Lemma A.3. Based on Assumption 4.3 together with Lipschitz continuity of the state, input, and terminal constraints, as well as the aforementioned bound on the set membership constraint, we then show that feasibility of (6) for planning horizon N at time k together with implies the existence of a feasible solution at time in Lemma A.4. Finally, we iteratively apply Lemma A.4, to prove Theorem 4.6.

In the following, we consider a model error set , a safe terminal set according to Assumption 4.2 with and , both Lipschitz continuous functions with constants . In the predictive safety filter optimization problem (6) the constraints are defined according to the tightening (9) and (16). We denote an optimal solution of (6) at time k with planning horizon N as nominal input sequence with corresponding nominal state sequence (6b).

Lemma A.3. Let Assumption 4.5 hold. If , and , then , where and .

Proof. The essential observation is that all can be written as with for which we have by Assumption 4.5 that

This allows us to derive

since by definition . Therefore, for all , (A.1) holds, which implies , completing the proof.

Lemma A.4. Let Assumptions 4.3 and 4.5 hold. For every , there exist corresponding such that if 1) , 2) Problem (6) is feasible at time k with prediction horizon N > 0, 3) is applied to (1), and 4) with probability 1, then the input sequence

with according to Assumption 4.3, corresponding nominal state sequence with and , is a feasible solution to (6) at time k + 1 with prediction horizon .

Proof. The following proof is a modified and extended version of Köhler et al. (2018b, Proposition 5), which considers nonlinear systems with additive disturbances of the form x(k + 1) = f(x(k), u(k)) + w(k), to address model (1), (2) in combination with the set-valued model confidence and terminal safe set constraints (6e) and (6f). The proof makes use of the conditions in Assumption 4.3 to derive bounds on the difference between the optimal plan at time k and the constructed plan at time k + 1, which is in turn used to show that the constraint tightening implies that the constructed plan is a feasible solution for (6) with planning horizon . In order to streamline notation, note that Assumption 4.3 holds for all which allows us, together with the fact that for all due to feasibility at time k, to omit the third argument of V in the following analysis.

We start by bounding errors in terms of the scaling factor as defined in (12). Using the lower bound we obtain the relation , and it follows that

with . In the next step, we derive a bound on such that holds. Select and note that by assumption and constraint (6e) we obtain

and combined with Assumption 4.3 it follows

Next, we show that when is small enough, is a feasible candidate input sequence to (6) with planning horizon in two steps. In Step 1 we show that holds for all by induction, which allows us in Step 2 to construct sufficient bounds on that imply feasibility via the tightening sequence (10) of the remaining constraints in (6).

Step 1: For the induction start i = 0 we show using the row sum norm and the fact that for all to get

Since we have with (A.3), (A.4) that

therefore

Selecting

In order to show the induction step for all we use Assumption 4.3 and derive

c, µ, µ..

and consequently

V , µδ for all i

since . By Assumption 4.3 we have and

Similar to (A.5) we can conclude with that

for all , which proves constraint satisfaction of the candidate state sequence with respect to state constraints by induction.

Step 2: Regarding the terminal constraint (6f), note that due to we can use Assumption 4.3 similarly to before in order to obtain

Let γ

showing terminal constraint satisfaction of (6f). Next we consider input constraints. Let , yielding together with Assumption 4.3 and (A.6)

providing that

showing input constraint satisfaction (6d) of the candidate input sequence. For the uncertainty constraint (6e) we have by Lemma A.3 that holds with and

for all . For a sufficiently small Lipschitz constant in Assumption 4.5, such that with

holds, we obtain

⊆ {} ⊆ {} ⊆

which shows feasibility of (6e). Therefore, if satisfies for a selected with c as defined in (A.7), then there always exists a

such that is a feasible solution to (6) at time k + 1 with planning horizon , proving the desired statement.

Note that the bounds for i = 1, 2, 3, 4 unveil an inner relationship between the constraint tightening fraction and the maximal allowable error set that is defined through the scaling parameter . Since the bound on takes the form it follows that a bound exists on the error magnitude that can be tolerated depending on the size of the region, for which local incremental stabilizability according to Assumption 4.3 holds. Therefore, increasing the constraint tightening beyond will not necessarily allow an increase in the tolerable uncertainty magnitude. We can now utilize Lemma A.4 to prove Theorem 4.6, i.e. that application of Algorithm 2 implies safe system operation according to (4).

Proof of Theorem 4.6. Select according to Lemma A.4. By considering the different cases in Algorithm 2, we show safety according to (4) by utilizing the relation

where we use instead of . Since by Assumption 4.5, relation (A.9) allows us to prove (3) by establishing

Pr()) = 1. (A.10)

The proof therefore reduces to the deterministic case showing that , given , at any time step k, which implies directly (A.10) for all times k and therefore via (A.9) chance constraint satisfaction according to (4), i.e., safe system operation with respect to (4). In order to show and , note that if (6) is feasible at time k for any planning horizon it follows due to the state and input constraints (6c), (6d) that and . This implies directly that and for any time step k for which (6) is feasible for horizon N, as well as for all , for which (6) is infeasible for horizon N, since feasibility of (6) with horizon is obtained from iteratively applying Lemma A.4 due to the condition that . For all , it follows from containment in the terminal safe set via (6f) and Assumption 4.2 that and . This shows (A.10) and therefore via (A.9) that , completing the proof.

A.3 Offline design verification

The intuitive interpretation of the few parameters that need to be chosen allows us to propose a selection and to certify it using sampling. Related statistical verification methods have been presented, e.g., in Hertneck et al. (2018); Karg et al. (2019). More precisely, to verify the underlying assumptions of Theorem 4.6 and to ensure that the admissible error, i.e. in (12), is selected sufficiently small, we make use of a statistical offline verification procedure for finite horizon tasks, i.e. in (3), for parameter design given a particular realization of a PSF together with a learning policy . We sample and simulate a sufficiently large number of system parameter realizations as well as initial conditions according to their distributions and and provide a statistical bound on the number of successful simulations that ensure safety according to (4). Let and for be i.i.d. samples and define an indicator function for safe execution as

with sampled system dynamics of the form and initial condition . The probability of satisfying constraints X and U can therefore be expressed as with

via the law of large numbers. To determine a sufficiently large but finite number to ensure we use Hoeffding’s inequality (von Luxburg and Schölkopf, 2011, Section 4.1) as similarly proposed in Hertneck et al. (2018): for an error margin . By defining it follows that holds at confidence level . This allows us to state a formal lower bound on the total number of offline simulations as follows:

Proposition A.5. Consider a specific PSF problem parametrization, learning policy , and confidence level . If holds and

then the chance constraints (4) are fulfilled under application of to the real system (1).

For example, if we have and for a specific choice of design parameters, then needs to be greater than 1656 to provide a sound PSF parametrization with 99% confidence. Note that (A.11) is independent of the complexity of the system and that offline simulations can be executed in parallel. Naturally, small required values of can lead to a rapid increase of . In case of infeasibility one needs to re-adjust the design parameters as described in Section 4.3 or more data needs to be collected.
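The verification procedure above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation. It assumes the two-sided Hoeffding bound 2 exp(-2 n eps^2) <= delta, which with an assumed margin eps = 0.04 and delta = 0.01 gives a sample count consistent with the figure of 1656 quoted above. The scalar toy system, the sampling distributions, the clipped feedback standing in for the PSF-filtered learning policy, and all function names are hypothetical.

```python
import math
import random


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Smallest n with 2*exp(-2*n*eps**2) <= delta (two-sided Hoeffding bound)."""
    if not (eps > 0.0 and 0.0 < delta < 1.0):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def simulate_is_safe(rng, horizon=20, x_max=1.0, u_max=0.5):
    """Indicator (1 or 0) of one safe closed-loop run of a hypothetical toy system.

    Dynamics: x+ = a*x + u. The uncertain parameter a and the initial state are
    drawn from assumed uniform distributions. The clipped linear feedback is a
    stand-in for the learning policy filtered by the PSF.
    """
    a = rng.uniform(0.8, 1.6)   # sampled system parameter (assumed distribution)
    x = rng.uniform(-0.9, 0.9)  # sampled initial condition (assumed distribution)
    for _ in range(horizon):
        u = max(-u_max, min(u_max, -0.7 * x))  # input constraint holds by clipping
        x = a * x + u
        if abs(x) > x_max:      # state constraint violated
            return 0
    return 1


def verify_offline(p_target, eps, delta, seed=0):
    """Simulate n runs and accept if the empirical safety rate minus eps
    still reaches the target probability (assumed acceptance rule)."""
    rng = random.Random(seed)
    n = hoeffding_sample_size(eps, delta)
    p_hat = sum(simulate_is_safe(rng) for _ in range(n)) / n
    return n, p_hat, p_hat - eps >= p_target


n, p_hat, accepted = verify_offline(p_target=0.9, eps=0.04, delta=0.01)
print(n, p_hat, accepted)
```

Because the runs are independent, the loop in `verify_offline` could be distributed across workers, in line with the remark that the offline simulations can be executed in parallel. Halving the margin roughly quadruples the required number of runs, which illustrates the rapid growth noted above.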

A.4 Sufficient condition for Assumption 4.3

Different from Köhler et al. (2018a,b) we require the first condition in Assumption 4.3 to also hold for the case that , which is used in the proof of Lemma A.4. Nevertheless, Köhler et al. (2018a, Prop. 1) still implies that the following verifiable assumption is sufficient for Assumption 4.3 and relates it to local stabilizability of system (1).

Assumption A.6. Let and define the linearization . For any , the pair is stabilizable, i.e. there exist positive definite and continuous in r, such that

Furthermore, there exists a constant , such that for any with , the corresponding matrix satisfies:

Given Assumption A.6, we can choose in Assumption 4.3, with (A.13) bounding the rate at which can possibly change in any time step when applying . Note that, according to Köhler et al. (2018a, Prop. 1), Assumption A.6 is only sufficient for Assumption 4.3, and therefore stabilizability of the linearization along any reference might not be needed in practice. However, it allows us to analytically express conditions that can be verified in principled steps. While (A.12) is always satisfied as long as the system is locally stabilizable, the second equation accounts for linearization errors as the system evolves and can therefore be seen as an intrinsic consequence of a local linearization-based analysis. As a practical consequence, the condition might be too conservative for nonlinear systems with quickly changing linearizations along references, i.e. large . See, e.g., Köhler et al. (2018a,b) for concrete examples that demonstrate how Assumption A.6 can be verified for different nonlinear systems.
