Bayesian Optimization with Safety Constraints: Safe and Automatic Parameter Tuning in Robotics

2016·arXiv

Abstract

1 Introduction

Safety and the ability to operate within constraints imposed by an environment are critical prerequisites for any algorithm that is applied on a real robotic system. Especially in robotics, where systems often face large prior uncertainties, failures can cause serious damage to the robot and its environment (Schaal and Atkeson 2010). To avoid unsafe behavior, safety is typically guaranteed with respect to a model of the robot’s dynamics and environment. When accurate models are not available or when the robotic system contains elements that are

difficult to model, such as computer vision components, the parameters of the algorithms are either tuned manually in experiments on the real system or tuned based on massive amounts of experimental data (Lillicrap et al. 2015). Both methods are time-consuming and potentially safety-critical: the engineer must either carefully select parameters that are safe or collect enough representative data that leads to safe behavior.

In this paper, we present a method to automatically optimize parameters of robotics algorithms while respecting safety constraints during the optimization. The resulting algorithm can be used to optimize parameters on the real robot without failures, since no unsafe parameters are evaluated during the optimization. We expand the theoretical framework of SAFEOPT (Safe Optimization) by Sui et al. (2015) to the more general setting with multiple constraints. We show that our algorithm, SAFEOPT-MC (for multiple constraints), enjoys strong theoretical guarantees about safety and performance, and works well in practice.

Related work In control theory, guaranteeing safety in the presence of unmodeled dynamics is often interpreted as the problem of ensuring stability of the underlying control law with respect to an uncertain dynamic system model (Zhou and Doyle 1998). In this setting, controllers can be gradually improved by estimating the unmodeled dynamics and updating the control law based on this estimate. Safety can be guaranteed by ensuring that either the controller is robustly stable for all possible models within the uncertainty specification (Berkenkamp and Schoellig 2015; Berkenkamp et al. 2017) or the system never leaves a safe subset of the state space (Ostafew et al. 2016; Aswani et al. 2013; Akametalu et al. 2014; Moldovan and Abbeel 2012; Turchetta et al. 2016). Both methods require a system model and uncertainty specification to be known a priori, which must be accurate enough to guarantee stability. In contrast, in our setting we do not assume to have access to a model of the system, but aim to directly optimize the parameters of a control algorithm, without violating safety constraints.

In the robotics literature, optimization algorithms have previously been applied with the goal of maximizing a user-defined performance function through iterative experiments. This is especially powerful when no prior model of the robot is available. However, typical algorithms in the literature do not consider safety of the optimization process, and make other restrictive assumptions such as requiring gradients (Killingsworth and Krsti´c 2006; ˚Astr¨om et al. 1993), which are difficult to obtain from noisy data, or an impractical number of experiments (Davidor 1991).

The objective of learning optimal policies has been extensively studied in the reinforcement learning community (Sutton and Barto 1998). In particular, the area of policy search considers the objective of optimizing the parameters of control algorithms (Kober and Peters 2014). The state of the art methods are based on estimating the gradients of the performance function (Peters and Schaal 2006, 2008). As a result, typically multiple evaluations of very similar parameters are conducted on the real system in order to estimate the gradients, which means the approaches are often not data-efficient and converge to local optima. Safety in gradient-based policy search has only been considered by disallowing large steps along the gradient into areas of the parameter space that have not been explored before (Achiam et al. 2017). Guarantees there hold only in expectation. In contrast, our method is gradient-free, so that it can explore the parameter space globally in a more data-efficient manner. At the same time, we provide high-probability worst-case guarantees for not violating safety constraints during the optimization process.

One class of optimization algorithms that has been successfully applied to robotics is Bayesian optimization (Mockus 2012). In Bayesian optimization, rather than considering the objective function as a black-box about which we can only obtain point-wise information, regularity assumptions are made. These are used to actively learn a model of the objective function. The resulting algorithms are practical and provably find the global optimum of the objective function while evaluating the function at only few parameters (Bull 2011; Srinivas et al. 2012). Bayesian optimization methods often model the unknown function as a Gaussian process (GP) (Rasmussen and Williams 2006). These models are highly flexible, allow to encode as much prior knowledge as desired, and explicitly model noise in the performance function evaluations. The GP models are used to guide function evaluations to locations that are informative about the optimum of the unknown function (Mockus 2012; Jones 2001). Example applications of Bayesian optimization in robotics include gait optimization of legged robots (Calandra et al. 2014a; Lizotte et al. 2007) and the optimization of the controller parameters of a snake-like robot (Tesch et al. 2011). Marco et al. (2017) optimize the weighting matrices of an LQR controller for an inverted pendulum by exploiting additional information from a simulator. Several different Bayesian optimization methods are compared by Calandra et al. (2014b) for the case of bipedal locomotion. While these examples illustrate the potential of Bayesian optimization methods in robotics, none of these examples explicitly considers safety as a requirement.

Recently, the concept of constraints has been incorporated into Bayesian optimization. Gelbart et al. (2014) introduce an algorithm to optimize an unknown function subject to an unknown constraint. However, this constraint was not considered to be safety-critical; that is, the algorithm is allowed to evaluate unsafe parameters. The case of finding a safe subset of the parameters without violating safety constraints was considered by Schreiter et al. (2015), while Sui et al. (2015) presented a similar algorithm to safely optimize an objective function. However, the algorithm of Sui et al. (2015) considers safety as a minimum performance requirement. In robotics, safety constraints are typically functions of the states or inputs that are independent of the performance.

Our contributions In this paper, we present an algorithm that considers multiple, arbitrary safety constraints decoupled from the performance objective. This generalization retains the desirable sample-efficient properties of normal Bayesian optimization, but carefully explores the parameter space in order to maximize performance while guaranteeing safety with high probability. We extend the theory of SAFEOPT (Sui et al. 2015) to account for these additional constraints and show that similar theoretical guarantees can be obtained for the more general setting. We then relax the assumptions used in the proofs to obtain a more practical version of the algorithm and additionally show that the safety guarantees carry over to this case. Next to the theoretical contributions, the second main contribution is an extensive experimental evaluation of the method, where we consider the problem of safely optimizing linear and nonlinear laws on a quadrotor vehicle. The experiments demonstrate that the proposed approach is able to safely optimize parameters of a control algorithms while respecting safety constraints with high probability. Moreover, we show how ideas from contextbased optimization (Krause and Ong 2011) can be used to safely transfer knowledge in order to obtain environmentdependent control laws. For example, in our experiments we optimize a control law for different flying speeds of a quadrotor vehicle. Early results with only a safety constraint on performance and without additional theoretical results were presented in (Berkenkamp et al. 2016).

The rest of the paper is structured as follows: in Section 2, we define the problem of safely optimizing the parameters of control algorithms and provide an overview on GPs and Bayesian optimization in Section 3. In Section 4, we introduce our algorithm, and analyze its theoretical properties in Section 4.2. We evaluate the performance of the algorithm in experiments on a quadrotor vehicle in Section 5 and draw conclusions in Section 6. The proofs of the theorems are provided in Section 7.

2 Problem Statement

We consider a given algorithm that is used to accomplish a certain task with a robot. In general, this algorithm is arbitrary and may contain several components including vision, state estimation, planning, and control laws. The algorithm depends on tuning parameters in some specified, domain .

The goal is to find the parameters within A that maximize a given, scalar performance measure, f. For example, this performance measure may represent the negative tracking error of a robot (Berkenkamp et al. 2016), the average walking speed of a bipedal robot (Calandra et al. 2014a), or any other quantity that can be computed over a finite time horizon. We can only evaluate the performance measure for any parameter set a on finite-time trajectories from experiments on the real robot. The functional dependence of f on a is not known a priori. In the following, we write the performance measure as a function of the parameters , even though measuring performance requires an experiment on the physical robot and typically depends on a trajectory of states, control inputs, and external signals.

We assume that the underlying system is safety-critical; that is, there are constraints that the system must respect when evaluating parameters. Similarly to the performance measure, f(a), these constraints can represent any quantity and may depend on states, inputs, or even environment variables. There are q safety constraints of the form , which together define the safety conditions. This is without loss of generality, since any constraint function can be shifted by a constant in order to obtain this form. The functions are unknown a priori but can be estimated through (typically noisy) experiments for a given parameter set a. For example, in order to encode a state constraint on an obstacle for a robot, the safety function can return the smallest distance to the obstacle along a trajectory of states when using algorithm parameters a. Note that if the functions were known in advance, we could simply exclude unsafe parameters from the set A.

The overall optimization problem can be written as

The goal is to iteratively find the global maximum of this constrained optimization problem by, at each iteration n, selecting parameters and evaluating (up to noise) the corresponding function values and until the optimal parameters are found. In particular, since the constraints define the safety of the underlying system, only parameters that are inside the feasible region of (1) are allowed to be evaluated; that is, only parameters that fulfill these safety requirements on the real system.

Since the functions f and in (1) are unknown a priori, it is not generally possible to solve the corresponding optimization problem without violating the constraints. The first problem is that we do not know how to select a first, safe parameter to evaluate. In the following, we assume that an initial safe set of parameters is known for which the constraints are fulfilled. These serve as a starting point for the exploration of the safe region in (1). In robotics, safe initial parameters with poor performance can often be obtained from a simulation or domain knowledge.

Secondly, in order to safely explore the parameter space beyond , we must be able to infer whether parameters a that we have not evaluated yet are safe to use on the real system. To this end, we make regularity assumptions about the functions f and in (1). We discuss these assumptions in more detail in Section 4.2. However, broadly speaking we make assumptions that allow us to model the functions f and as a GP, construct reliable confidence intervals over the domain A, and imply Lipschitz continuity properties. Using these properties, we are able to generalize safety beyond the initial, safe parameters . Given the model assumptions, we require that the safety constraints hold with high probability over the entire sequence of experiments.

As a consequence of the safety requirements, it is not generally possible to find the global optimum of (1). Instead we aim to find the optimum in the part of the feasible region that is safely reachable from . We formalize this precisely in Section 4.

Lastly, whenever we evaluate parameters on the real system, we only obtain noisy estimates of both the performance function and the constraints, since both depend on noisy sensor data along trajectories. That is, for each parameter a the we evaluate, we obtain measurements and , where , is zero-mean, -sub-Gaussian noise. Note that while is a random variable, we use to denote the measurement obtained at iteration n. In general, the noise variables may be correlated, but we do not consider this case in our theoretical analysis in Section 4.2. We only want to evaluate parameters where all safety constraints are fulfilled, so that for all and .

3 Background

In this section, we review Gaussian processes (GPs) and Bayesian optimization, which form the foundation of our safe Bayesian optimization algorithm in Section 4. The introduction to GPs is standard and based on (Berkenkamp et al. 2016) and (Rasmussen and Williams 2006).

3.1 Gaussian Process (GP)

Both the function f(a) and the safety constraints in Section 2 are unknown a priori. We use GPs as a nonparametric model to approximate these unknown functions over their common domain A. In the following, we focus on a single function, the performance function. We extend this model to multiple functions in order to represent both performance and constraints in Section 3.1.1.

GPs are a popular choice for nonparametric regression in machine learning, where the goal is to find an approximation of a nonlinear map, , from an input vector to the function value f(a). This is accomplished by assuming that the function values f(a), associated with different values of a, are random variables and that any finite number of these random variables have a joint Gaussian distribution (Rasmussen and Williams 2006).

A GP is parameterized by a prior mean function and a covariance function , which defines the covariance of any two function values f(a) and . The latter is also known as the kernel. In this work, the mean is assumed to be zero, without loss of generality. The choice of kernel function is problem-dependent and encodes assumptions about smoothness and rate of change of the unknown function. A review of potential kernels can be found in (Rasmussen and Williams 2006) and more information about the kernels used in this paper is given in Section 5.

The GP framework can be used to predict the function value for an arbitrary parameter based on a set of n past observations, , at the chosen parameters . The GP model assumes that observations are noisy measurements of the true function value f(a); that is, with . Conditioned on these observations, the posterior distribution is a GP again with mean and variance

where is the vector of observed, noisy function values, the covariance matrix has entries , and the vector contains the covariances between the new input and the observed data points in . The matrix denotes the identity matrix.

3.1.1 GPs with multiple outputs So far, we have focused

on GPs that model a single scalar function. In order to model not only the performance, f(a), but also the safety constraints, , we have to consider multiple, possibly correlated functions. In the GP literature, these are usually treated by considering a matrix of kernel functions, which models the correlation between different functions ( ´Alvarez et al. 2012). Here instead, we use an equivalent representation by considering a surrogate function,

which returns either the performance function or the individual safety constraints depending on the additional input with I = {0, . . . , q}, where are the indices belonging to the constraints. The function is a single-output function and can be modeled as a GP with scalar output over the extended parameter space . For example, the kernel for the performance function f(a) and one safety constraint g(a) may look like this:

where is the Kronecker delta. This kernel models the functions f(a) and g(a) with independent kernels and respectively, but also introduces a covariance function that models similarities between the two function outputs. By extending the training data by the extra parameter i, we can use the normal GP framework and predict function values and corresponding uncertainties using (2) and (3). When observing the function values, the index i is added to the parameter set a for each observation. Including noise parameters inside the kernel allows to model noise correlation between the individual functions.

Importantly, using this surrogate function rather than the framework of ´Alvarez et al. (2012) enables us to lift theoretical results of Sui et al. (2015) to the more general case with multiple constraints and provide theoretical guarantees for our algorithm in Section 4.2.

In the setting with multiple outputs, at every iteration n, we obtain |I| = q + 1 measurements; one for each function. For ease of notation, we continue to write and , even though we have obtained measurements at locations in the extended parameter space.

3.2 Bayesian Optimization

Bayesian optimization aims to find the global maximum of an unknown function (Mockus 2012). The framework assumes that evaluating the function is expensive, while computational resources are relatively cheap. This fits our problem in Section 2, where each evaluation of the performance function corresponds to an experiment on the real system, which takes time and causes wear in the robotic system.

In general, Bayesian optimization models the objective function as a random function and uses this model to determine informative sample locations. A popular approach is to model the underlying function as a GP, see Section 3.1. GP-based methods use the posterior mean and variance predictions in (2) and (3) to compute the next sample location. For example, according to the GP-UCB (GP-Upper Confidence Bound) algorithm by Srinivas et al. (2012), the next sample location is

where is an iteration-dependent scalar that reflects the confidence interval of the GP. Intuitively, (5) selects new evaluation points at locations where the upper bound of the confidence interval of the GP estimate is maximal. Repeatedly evaluating the system at locations given by (5) improves the mean estimate of the underlying function and decreases the uncertainty at candidate locations for the maximum, such that the global maximum is provably found eventually (Srinivas et al. 2012).

While (5) is also an optimization problem, its solution does not require any evaluations on the real system and only uses the GP model. This reflects the assumption of cheap computational resources. In practice, Bayesian optimization typically focuses on low-dimensional problems. However, this can be scaled up by discovering a low-dimensional subspace of A for Bayesian optimization (Djolonga et al. 2013; Wang et al. 2013) or encoding additional structure in the kernel (Duvenaud et al. 2011).

3.3 Contextual Bayesian Optimization

Contextual Bayesian optimization is a conceptually straightforward extension of Bayesian optimization (Krause and Ong 2011). It enables optimization of functions that depend on additional, external variables, which are called contexts. For example, the performance of a robot may depend on its battery level or the weather conditions, both of which cannot be influenced directly. Alternatively, contexts can also represent different tasks that the robot has to solve, which are specified externally by a user. The idea is to include the functional dependence on the context in the GP model, but to consider them fixed when selecting the next parameters to evaluate.

For example, given a context that is fixed by the environment, we can model how the performance and constraint functions change with respect to different contexts by multiplying the kernel function over the parameters, with another kernel over the contexts,

This kernel structure implies that function values are correlated when both parameters and the contexts are similar. For example, we would expect selecting the same parameters a for a control algorithm to lead to similar performance values if the context (e.g., the battery level) is similar. Since contexts are not part of the optimization criterion, a modified version of (5) has to be used. It was shown by Krause and Ong (2011) that an algorithm that evaluates the GP-UCB criterion given a fixed context ,

enjoys similar convergence guarantees as normal Bayesian optimization in Section 3.2. Specifically, after seeing a particular context often enough, the criterion (7) will query parameters that are close-to-optimal.

3.4 Safe Bayesian Optimization (SAFEOPT)

In this paper, we extend the safe optimization algorithm SAFEOPT (Sui et al. 2015) to multiple constraints. SAFEOPT is a Bayesian optimization algorithm, see Section 3.2. However, instead of optimizing the underlying performance function f(a) globally, it restricts itself to a safe set of parameters that achieve a certain minimum performance with high probability. This safe set is not known initially, but is estimated after each function evaluation. In this setting, the challenge is to find an appropriate evaluation strategy similar to (5), which at each iteration n not only aims to find the global maximum within the currently known safe set (exploitation), but also aims to increase the set of controllers that are known to be safe (exploration). SAFEOPT trades off between these two sets by choosing for the next experiment the parameters inside the safe set about whose performance we are the most uncertain.

4 SAFEOPT-MC (Multiple Constraints)

In this section, we introduce the SAFEOPT-MC algorithm for multiple constraints and discuss its theoretical properties. The goal of the algorithm is to solve (1) by evaluating different parameters from the domain A without violating the safety constraints. To this end, any algorithm has to consider two important properties:

(i) Expanding the region of the optimization problem that is known to be feasible or safe as much as possible without violating the constraints,

(ii) Finding the optimal parameters within the current safe set.

For objective i), we need quantify the size of the safe set. To do this in a tractable manner, we focus on finite sets A in the following. However, heuristic extensions to continuous domains exist (Duivenvoorden et al. 2017).

The theoretical guarantees of the algorithm rely on the continuity of the underlying function. Many commonly used kernels, such as the squared exponential (Gaussian) kernel, encode Lipschitz-continuous functions with high probability (Ghosal and Roy 2006). We make more specific assumptions that ensure deterministic Lipschitz constants in Section 4.2. For now, we assume that f(a) and are Lipschitz continuous with Lipschitz constant L with respect to some norm .

Since we only observe noisy estimates of both the performance function and the constraints, we cannot expect to find the entire safe region encoded by the constraints within a finite number of evaluations. Instead, we follow Sui et al. (2015) and consider learning the safety constraint up to some accuracy . This assumption is equivalent to a minimum slack of on the constraints in (1).

As mentioned in Section 2, we assume that we have access to initial, safe parameters , for which we know that the safety constraints are satisfied a priori. Starting from these initial parameters, we ask what the best that any safe optimization algorithm could hope to achieve is. In particular, if we knew the safety constraint functions up to accuracy within some safe set of parameters S, we could exploit the continuity properties to expand the safe set to

where represents the number of parameters that can be classified as safe given that we know g up to -error inside S and exploiting the Lipschitz continuity to generalize to new parameters outside of S. The baseline that we compare against is the limit of repeatedly applying this operator on ; that is, with and the baseline is . This set contains all the parameters in A that could be classified as safe starting from if we knew the function up to error. This set does not include all the parameters that potentially fulfill the constraints in (1), but is the best we can do without violating the safety constraints. Hence the optimal value that we compare against is not the one in (1), but

which is the maximum performance value over the set that we could hope to classify as safe starting from the initial safe set, .

4.1 The Algorithm

In this section, we present the SAFEOPT-MC algorithm that guarantees convergence to the previously set baseline. The most critical aspect of the algorithm is safety. However, once safety is ensured, the second challenge is to find an evaluation criterion that enables trading off between exploration, trying to further expand the current estimate of the safe set, and exploitation, trying to improving the estimate of the best parameters within the current set.

To ensure safety, we construct confidence intervals that contain the true functions f and with high probability. In particular, we use the posterior GP estimate given the data observed so far. The confidence intervals for the surrogate function in (4) are defined as

where is a scalar that determines the desired confidence interval. This set contains all possible function values between the lower and upper confidence interval based on the GP posterior. The probability of the true function value lying within this interval depends on the choice of , as well as on the assumptions made about the functions. We provide more details about this choice in Section 4.2, Lemma 1, and Section 4.3.

Rather than defining the lower and upper bounds based on (10), the following analysis requires that consecutive

Figure 2. Optimization with the SAFEOPT-MC algorithm after 1, 2 and 10 parameter evaluations. Based on the mean estimate (blue) and the confidence interval (light blue), the algorithm selects evaluation points for which (black dashed) from the safe set (red), which are either potential maximizers (green) or expanders (magenta). It then learns about the

function by drawing noisy samples from the unknown, underlying function (light gray). This way, we expand the safe region (red) as much as possible and, simultaneously, find the global optimum of the unknown function (17) (cyan circle).

estimates of the lower and upper bounds are contained within each other. This assumption ensures that the safe set does not shrink from one iteration to the next, which we require to prove our results. We relax this assumption in Section 4.3. We define the contained set at iteration n as , where for all and R otherwise. This ensures that parameters in the initial safe set remain safe according to the GP model after additional observations. The lower and upper bounds on this set are defined as and , respectively. For notational clarity, we write and for the performance bounds.

Based on these confidence intervals for the function values and a current safe set , we can enlargen the safe set using the Lipschitz continuity properties,

(11) The set contains all points in , as well as all additional parameters that fulfill the safety constraints given the GP confidence intervals and the Lipschitz constant. With the set of safe parameters defined, the last remaining challenge is to trade off between exploration and exploitation. One could, similar to Schreiter et al. (2015), simply select the most uncertain element over the

entire set. However, this approach is not sample-efficient, since it involves learning about the entire function rather than restricting evaluations to the relevant parameters. To avoid this, we first define subsets of that correspond to parameters that could either improve the estimate of the maximum or could expand the safe set. The set of potential maximizers is defined as

which contains all parameters for which the upper bound of the current performance estimate is above the best lower bound. The parameters in are candidates for the optimum, since they could obtain performance values above the current conservative estimate of the optimal performance.

Similarly, an optimistic set of parameters that could potentially enlarge the safe set is

The function enumerates the number of parameters that could additionally be classified as safe if a safety function obtained a measurement equal to its upper confidence bound. Thus, the set is an optimistic set of parameters that could potentially expand the safe set.

We trade off between the two sets, and , by selecting the most uncertain element across all performance and safety functions; that is, at each iteration n we select

as the next parameter set to be evaluated on the real system. The implications of this selection criterion will become more apparent in the next section, but from a high-level view this criterion leads to a behavior that focuses almost exclusively on exploration initially, as the most uncertain points will typically lie on the boundary of the safe set for many commonly used kernels. This changes once the constraint evaluations return results closer to the safety constraints. At this point, the algorithm keeps switching between selecting parameters that are potential maximizers, and parameters that could expand the safe set and lead to new areas in the parameter space with even higher function values. Pseudocode for the algorithm is found in Algorithm 1.

We show an example run of the algorithm in Figure 2. It starts from an initial safe parameter at which we obtain a measurement in Figure 2d. Based on this, the algorithms uses the continuity properties of the safety function and the GP in order to determine nearby parameters as safe (red set). This corresponds to the region where the high-probability confidence intervals of the GP model (blue shaded) are above the safety threshold (grey dashed line). At the next iteration in Figure 2e, the algorithm evaluates parameters that are close to the boundary of the safe set, in order to expand the set of safe parameters. Eventually the algorithm converges to the optimal parameters in Figure 2c, which obtain the largest performance value that is possible without violating the safety constraints. A local optimization approach, e.g. based on estimated gradients, would have gotten stuck in the local optimum at the initial parameter .

At any iteration, we can obtain an estimate for the current best parameters from

which returns the best, safe lower-bound on the performance function f.

4.2 Theoretical Results

In this section, we show that the same theoretical framework from the SAFEOPT algorithm (Sui et al. 2015) can be extended to multiple constraints and the evaluation criterion (15). Here, we only provide the results and high-level ideas of the proofs. The mathematical details are provided in Section 7. For simplicity, we assume that all

function evaluations are corrupted by the same -subGaussian noise in this section.

In order to provide guarantees for safety, we need the confidence intervals in (10) to hold for all iterations and functions. In the following, we assume that the surrogate function h(a, i) has bounded norm in a reproducing kernel Hilbert space (RKHS, c.f., Christmann and Steinwart (2008)). A RKHS corresponding to a kernel includes functions of the form with and representer points . The bounded norm property implies that the coefficients decay sufficiently fast as j increases. Intuitively, these functions are well-behaved, in that they are regular with respect to the choice of kernel.

The following Lemma allows us to choose a scaling factor for (10), so that we achieve a specific probability of the true function being contained in the confidence intervals for all iterations.

Lemma 1. (based on Chowdhury and Gopalan (2017)). Assume that h(a, i) has RKHS norm bounded by B and that measurements are corrupted by -sub-Gaussian noise. If , then the following holds for all parameters , function indices , and iterations jointly with probability at least :

Moreover, if the kernel is continuously differentiable, then the corresponding functions are Lipschitz continuous (Christmann and Steinwart 2008). Note that Lemma 1 does not make probabilistic assumptions on h – in fact, h could be chosen adversarially, as long as it has bounded norm in the RKHS. Similar results can be be obtained for the Bayesian setting where the function h is assumed to be drawn from the GP prior (Srinivas et al. 2012).

The scaling factor in Lemma 1 depends on the information capacity associated with the kernel k. It is the maximum amount of mutual information that we can obtain about the GP model of from n noisy measurements at parameters ,

Intuitively, it quantifies a best case scenario where we can select the measurements in the most informative manner possible. The information capacity has a sublinear dependence on n for many commonly-used kernels and can numerically approximated up to a small constant factor for any given kernel Srinivas et al. (2012).

Since the confidence intervals hold with probability and the safe set is not empty starting from , it is possible to prove that parameters within the safe set are always safe with high probability. In order for the algorithm to compete with our baseline, we must additionally ensure that the algorithm learns the true function up to confidence in both the sets and . The number of measurements required to achieve this depends on the information capacity , since it encodes how much information can be obtained about the true function from n measurements. We use the sublinearity of in order to bound the number of samples required to estimate the function up to accuracy. We have the following result:

Theorem 1. Assume that h(a, i) has bounded norm in an RKHS and that the measurement noise is -sub-Gaussian. Also, assume that and for all and . Choose as in Lemma 1, define as in (17), and let be the smallest positive integer satisfying

where . For any and , when running Algorithm 1 the following inequalities jointly hold with probability at least :

1. Safety: ∀∈ I

2. Optimality:

Proof. Main idea: safety follows from Lemma 1, since accurate confidence intervals imply that we do not evaluate unsafe parameters. For the optimality, the main idea is that, since we evaluate the most uncertain element in , the uncertainty about the maximum is bounded by . The result follows from showing that, after a finite number of evaluations, either the safe set expands or the maximum uncertainty within shrinks to . See Section 7 for derivations and details.

Theorem 1 states that, given the assumptions we made about the underlying function, Algorithm 1 explores the state space without violating the safety constraints and, after at most samples, finds an estimate that is -close to the optimal value over the safely reachable region. The information capacity , grows at a faster rate of |I|n compared to n in SAFEOPT, since we obtain |I| observations at the same parameters set a, while the SAFEOPT analysis assumes every measurement is optimized independently. However, remains sublinear in n, see Section 7.

4.2.1 Contexts In this section, we show how the theoretical guarantees derived in the previous section transfer to contextual Bayesian optimization. In this setting, part of the variables that influence the performance, the contexts, are fixed by an external process that we do not necessarily control. In normal Bayesian optimization, it was shown by Krause and Ong (2011) that an algorithm that optimizes the GP-UCB criterion in (7) for a fixed context converges to the global optimum after repeatedly seeing the corresponding context.

Intuitively, the fact that part of the variables that influence the performance, the contexts, are now fixed by an external process should not impact the algorithm on a fundamental level. However, safety is a critical issue in our experiments and, in general, one could always select an adversarial context for which we do not have sufficient knowledge to determine safe controller parameters. As a consequence, we make the additional assumption that only ‘safe’ contexts are visited; that is, we assume the following:

Assumption 1. For any , the context is selected such that .

Here, denotes the safe set for the given context . Intuitively, Assumption 1 ensures that for every context there exists at least one parameter choice that is known to satisfy all safety constraints. This assumption includes the case where safe initial parameters for all contexts are known a priori and the case where the algorithm terminates and asks for help from a domainexpert whenever a context leads to an empty safe set.

A trivial extension of SAFEOPT-MC to contexts is to run |Z| independent instances of Algorithm 1, one for each context. This way, it is sufficient to repeatedly see a context several times to apply the previous results to the safe contextual optimization case. One can apply the previous

Figure 3. Example run of the context-dependent SAFEOPT-MC algorithm. For the first six samples, the algorithm only sees the context z = 0 (left function) and obtains measurements there (red crosses). However, by exploiting correlations between different contexts, the algorithm can transfer knowledge about the shape of the function and safe set over to a different context, z = 1 (right function). This enables the algorithm to be significantly more data-efficient.

analysis to this setting, but it would fail to yield guarantes that hold jointly for all contexts.

In order to obtain stronger results that hold jointly across all contexts in Z, we adapt the information capacity (worst-case mutual information) to consider contexts,

Unlike in (19), the mutual information is maximized across contexts in (21). As a result, we can use Lemma 1 to obtain confidence intervals that hold jointly across all contexts.

A second challenge is that contexts are chosen in an arbitrary order. This is in stark contrast to the parameters , which are chosen according to (15) in order to be informative. This means that any tight finite sample bound on Algorithm 1 must necessarily depend on the order of contexts. The following theorem accounts for both of these challenges.

Theorem 2. Under the assumptions of Theorem 1

and Assumption 1. Choose as in Lemma 1, where is now the worst-case mutual information over contexts as in (21). For a given context order and any context , let

be the number of times that we have observed the context z up to iteration and 1 is the indicator function. Let be the smallest positive integers such that

where . We note the information capacity for a fixed context . That is, with it is defined as . For any and , let . Then, when running Algorithm 1 the following inequalities jointly hold with probability at least :

1. ∀∈ I

2. ∀z ∈ Z

Proof. For a fixed context, , we have n(z) and the results follow directly as in Theorem 1. Otherwise, the only difference in the proofs is that depends on the information capacity over contexts, making sure that the confidence intervals are valid across contexts. By visiting contexts in Z \ {z}, we obtain more measurements and increases, which in turn increases the upper bound on the number of samples required at context z.

Theorem 2 states that the contextual variant of Algorithm 1 enjoys the same safety guarantees as the noncontextual version. Additionally, it shows that, after gaining enough information about a particular context, it can identify the optimal parameters. In practice, this upper bound is conservative, since it does not acount for knowledge transfer accross contexts. In practice, correlations between contexts significantly speed up the learning process. For example, in Figure 3 we show a contextual safe optimization problem with two contexts. Even though the algorithm has only been able to explore the parameter space at the first context (z = 0, left function), the correlation between the functions means that information can be transferred to the as-of-yet unobserved context (z = 1, right function). This knowledge transfer significantly improves data-efficiency and the number of evaluations required by the algorithm.

4.3 Practical Implementation

In this section, we discuss possible changes to Algorithm 1 that make the algorithm more practical, at the expense of loosing some of the theoretical guarantees. One challenge in applying Algorithm 1 in practice, is defining a suitable Lipschitz constant. In particular, specifying the wrong constant can lead to conservativeness or unsafe parameters being evaluated. Moreover, smoothness assumptions are already encoded by the kernel choice, which is more intuitive to specify than Lipschitz constants on their own. In practice, we use only the GP model to ensure safety (Berkenkamp et al. 2016); that is, we define and in terms of the confidence intervals of the GP directly. Thus, we can define the safe set without a Lipschitz constant as

While it is difficult to prove the full exploration of the safely reachable set as in Theorem 1, the resulting algorithm remains safe:

Lemma 2. With the assumptions of Lemma 1, , and for all and , when running Algorithm 1 with the safe set defined as in (24), the following holds with probability at least :

Proof. The confidence intervals hold with probability following Lemma 1. Since in (24) is defined as the set of parameters that fulfill the safety constraint and the safe set is never empty since , the claim follows.

Similarly, the set of expanders can be defined in terms of the GP directly, by adding optimistic measurements and counting the number of new parameters that are classified as safe, see (Berkenkamp et al. 2016) for details. However, this potentially adds a large computational burden.

The parameter , which determines the GP’s confidence interval in Lemma 1, may be impractically conservative for experiments. The theoretical safety results also hold when we replace in by the empirical mutual information gained so far, . Empirically, depending on the application, one may also consider setting to a constant value. This roughly corresponds to bounding the failure probability per iteration, rather than over all iterations.

Learning all the different functions, f and , up to the same accuracy may be restrictive if they are scaled differently. A possible solution is to either scale the observed data, or to scale the uncertainties in (15) by the prior variances of the kernels for that specific output. This enables (15) to make more homogeneous decisions across different scales.

5 Quadrotor Experiments

In this section, we demonstrate Algorithm 1 (with the changes discussed in Section 4.3) in experiments on a quadrotor vehicle, a Parrot AR.Drone 2.0.

A Python implementation of the SAFEOPT-MC algorithm that builds on (GPy 2012), a GP library, is available at http://github.com/befelix/SafeOpt. Videos of the experiments can be found at

• Section 5.3: http://tiny.cc/icra16 video

• Section 5.4: https://youtu.be/rLmwYtoE3yg

• Section 5.5: https://youtu.be/4xC4OSiIDGk

5.1 Experimental Setup

During the experiments, measurements of all vehicle states are estimated from position and pose data provided by an overhead motion capture camera system. The quadrotor’s dynamics can be described by six states: positions x = (x, y, z), velocities , ZYX Euler angles , and body angular velocities . The control inputs u are the desired roll and pitch angles and , the desired z-velocity , and the desired yaw angular velocity , which in turn are inputs to an unknown, proprietary, on-board controller.

The position dynamics in the global coordinate frame are

where is the rotation matrix from the body frame to the inertial frame, is the mass-normalized thrust, and is the gravitational force. The goal of the controller is to track a reference signal. We assume that z-position and the yaw angle are controlled by fixed control laws and focus on the position control in x- and y direction. We use two different control laws in the following experiments.

The most simple control law that can be used for this setting is a PD-controller, defined as

where are the two tuning parameters. Intuitively, a larger parameter encourages tracking reference changes more aggressively, while is a damping factor that encourages moderate velocities.

A more sophisticated approach to control quadrotor vehicles is to use estimates of the angles and accelerations to solve for the thrust c. We use loop shaping on the horizontal position control loops so that they behave in the manner of a second-order systems with time constant and damping ratio . Based on a given desired reference trajectory, commanded accelerations are computed as

From the commanded accelerations, we then obtain the control inputs for the desired roll and pitch angles by solving (26) for the angles. Here, the tuning parameters are . For details regarding the controllers see (Schoellig et al. 2012; Lupashin et al. 2014).

The quadrotor was controlled using the ardrone autonomy and vicon bridge packages in ROS Hydro. Computations for the SAFEOPT-MC algorithm in Algorithm 1 were conducted on a regular laptop and took significantly less than one second per iteration. As a result, experiments could be conducted continuously without interruptions or human interventions.

5.2 Kernel Selection

Algorithm 1 critically depends on the GP model for the performance and constraint functions. In this section, we review the kernel used in our experiments and the kind of models that they encode. A more thorough review of kernel properties can be found in (Rasmussen and Williams 2006).

In our experiments, we use the Mat`ern kernel with parameter (Rasmussen and Williams 2006),

which encodes that mean functions are one-times differentiable. This is in contrast to the commonly used squared exponential kernels, which encode smooth (infinitely differentiable) functions. With the Mat`ern kernel, the GP model is parameterized by three hyperparameters: measurement noise in (2) and (3), the kernel’s prior variance , and positive lengthscales , which are the diagonal elements of the diagonal matrix M, M = diag(l). These hyperparameters have intuitive interpretations: the variance of the measurement noise corresponds to the noise in the observations, which includes any randomness in the algorithm and initial conditions, and random disturbances. The prior variance determines the expected magnitude of function values; that is, with probability 0.68 according to the GP prior. Lastly, the lengthscales l determine how quickly the covariance between neighboring values deteriorates with their distance. The smaller the length-scales, the faster the function values can change from one parameter set to the next. In particular, the high-probability Lipschitz constant encoded by this kernel depends on the ratio between the prior variance and the lengthscales, .

When using GPs to model dynamic systems, typically a maximum likelihood estimate of the hyperparameters is used based on data; see (Ostafew et al. 2016) for an example. For Bayesian optimization, the GP model is used to actively acquire data, rather than only using it for regression based on existing data. This dependence between the kernel hyperparameters and the acquired data is known to lead to poor results in Bayesian optimization when using a maximum likelihood estimate of the hyperparameters during data acquisition (Bull 2011). In particular, the corresponding GP estimate is not guaranteed to contain the true function as in Lemma 1. In this work, we critically rely on these confidence bounds to guarantee safety. As a consequence, we do not adapt the hyperparameters as more data becomes available, but treat the kernel as a prior over functions in the true Bayesian sense; that is, the kernel hyperparameters encode our prior knowledge about the functions that we model and are fixed before experiments begin. While this requires intuition about the process, intuitively the less knowledge we encode in the prior, the more data and evaluations on the real system are required in order to determine the best parameters.

5.3 Linear Control

In this section, we use SAFEOPT-MC to optimize the parameters of the linear control law in (27). The entire control algorithm consists of this control law together with the on-board controller and state estimation.

The goal is to find controller parameters that maximize the performance during a 1-meter reference position change. For an experiment with parameters at iteration n, the performance function is defined as

where, to compute the cost c, the states and the input u are weighted by positive semi-definite matrices Q and R. The subscript k indicates the state measurement at time step k in the trajectory and the time horizon is 5 s (N = 350). Performance is defined as the cost improvement relative to 95% of the initial controller cost. The safety constraint is defined only in terms of the performance; that is, the constraint is , which encodes that we do not want to evaluate controller parameters that perform significantly worse than the initial parameters.

While the optimal controller parameters could be easily computed given an accurate model of the system, we do not have a model of the dynamics of the proprietary, on-board controller and the time delays in the system. Moreover, we want to optimize the performance for the real, nonlinear quadrotor system, which is difficult to model accurately. An inaccurate model of the system could be used to improve the prior GP model of the performance function, with the goal of achieving faster convergence. In this case, the uncertainty in the GP model of the performance function would account for inaccuracies in the system model.

We define the domain of the controller parameters as , explicitly including positive controller parameters that certainly lead to crashes. In practice, one

Figure 4. GP mean estimate of the performance function after 30 evaluations. The algorithm adaptively decides which parameters to evaluate based on safety and informativeness. In the bottom-left corner, there is the magnified section of the first three samples, which are close together to determine the location of the initial, safe region. The maximum, magnified in the top-left corner, also has more samples to determine the precise location of the maximum. Other areas are more coarsely sampled to expand the safe region.

would exclude parameters that are known to be unsafe a priori. The initial controller parameters are , which result in a controller with poor performance. Decreasing the controller gains further leads to unstable controllers.

The parameters for the experiments were set as follows: the lengt