b

DiscoverSearch
About
My stuff
Regret Bounds for Safe Gaussian Process Bandit Optimization
2020·arXiv
Abstract
Abstract

Many applications require a learner to make sequential decisions given uncertainty regarding both the system’s payoff function and safety constraints. In safety-critical systems, it is paramount that the learner’s actions do not violate the safety constraints at any stage of the learning process. In this paper, we study a stochastic bandit optimization problem where the unknown payoff and constraint functions are sampled from Gaussian Processes (GPs) first considered in [Srinivas et al., 2010]. We develop a safe variant of GP-UCB called SGP-UCB, with necessary modifications to respect safety constraints at every round. The algorithm has two distinct phases. The first phase seeks to estimate the set of safe actions in the decision set, while the second phase follows the GP-UCB decision rule. Our main contribution is to derive the first sub-linear regret bounds for this problem. We numerically compare SGP-UCB against existing safe Bayesian GP optimization algorithms.

Stochastic bandit optimization has received significant attention in applications where a learner must repeatedly deal with an unknown random environment and observations are costly to obtain. At each round, the learner chooses an action x and observes a noise-perturbed version of an otherwise unknown reward function f(x). The goal is to minimize the so-called cumulative pseudo-regret, i.e., the difference between the expected T -period reward generated by the algorithm and the optimal expected reward if f was known to the learner. The most well-studied case is when the unknown function f comes from a finite dimensional linear model, with regret bounds provided by [Dani et al., 2008, Abbasi-Yadkori et al., 2011, Rusmevichientong and Tsitsiklis, 2010] for Upper Confidence Bound (UCB) based algorithms. In a more general setting, the expected reward is a sample from a Gaussian Process (GP), with regret bounds first provided by [Srinivas et al., 2010]. GPs are a popular choice for modelling reward function in Bayesian optimization methods as well as experimental design with applications in medical trials and robotics, e.g., [Berkenkamp et al., 2016b, Akametalu et al., 2014, Ostafew et al., 2016, Berkenkamp et al., 2016a]. In a closely related line of work, [Srinivas et al., 2010, Valko et al., 2013, Chowdhury and Gopalan, 2017] proposed kernelized UCB algorithms for settings where the reward is an unknown arbitrary function in a reproducing kernel Hilbert space (RKHS), and provided regret bounds through a frequentist analysis. A variant of this problem considers an environment that is also subject to a number of unknown safety constraints. The application of stochastic bandit optimization in safety-critical systems requires the learner to select actions that satisfy these safety constraints at each round, in spite of uncertainty regarding the safety requirements.

In this paper, we consider a stochastic bandit optimization problem where both the reward function f and the constraint function g are samples from Gaussian Processes. We require that the learner’s chosen actions respect safety constraints at every round in spite of uncertainty about safe actions. This setting was first studied in [Sui et al., 2015] in the specific case of a single safety constraint of the form  f(x) ≥ hand later in [Sui et al., 2018], in the more general case of  g(x) ≥ has adopted in our paper. In this case, the learner hopes to overcome the two-fold challenge of keeping the cumulative regret as small as possible while ensuring that selected actions respect the safety constraints at each round of the algorithm. We present SGP-UCB, which is a safety-constrained variant of GP-UCB proposed by [Srinivas et al., 2010]. To ensure constraint satisfaction, SGP-UCB restricts the learner to choose actions from a conservative inner-approximation of the safe decision set that is known to satisfy safety constraints with high probability given the algorithm’s history. The cumulative regret bound of our proposed algorithm (given in Section 3 as our main theoretical result) implies that SGP-UCB is a no-regret algorithm. This is the main difference of our results compared to the algorithms studied in [Sui et al., 2015, Sui et al., 2018] that only come with convergence-but, no regret- guarantees. Throughout the paper, we discuss in detail the differentiating features of our algorithm from existing ones.

Notation. We use lower-case letters for scalars, lower-case bold letters for vectors, and upper-case bold letters for matrices. The Euclidean norm of a vector x is denoted by  ∥x∥2. We denote the transpose of any column vector x by xT. Let A be a positive definite  d × dmatrix and  v ∈ Rd. The weighted 2-norm of v with respect to A is defined by ∥v∥A=√vT Av. We denote the minimum and maximum eigenvalue of A by  λmin(A)and  λmax(A). The maximum of two numbers  α, βis denoted  α ∨ β. For a positive integer n, [n] denotes the set {1, 2, . . ., n}.

1.1 Problem Statement

The learner is given a finite decision set  D0 ⊂ Rd. At each round t, she chooses an action  xt ∈ D0and observes a noise-perturbed value of an unknown reward function  f : D0 → R, i.e.  yt := f(xt) + ηt. At every round, the learner must ensure that the chosen action  xtsatisfies the following safety constraint:

image

where  g : D0 → Ris an unknown function and h is a known constant.1 We define the safe set from which the learner is allowed to take action as:

image

Since g is unknown, the learner cannot identify  Ds0. As such, the best she can do is to choose actions  xtthat are in  Ds0with high probability. We assume that at every round, the learner also receives noise-perturbed feedback on the safety constraint, i.e.  zt := g(xt) + nt.

Goal. Since our knowledge of g comes from noisy observations, we are not able to fully identify the true safe set  Ds0and infer g(x) exactly, but only up to some statistical confidence  g(x) ± ǫfor some  ǫ > 0. Hence, we consider the optimal action through an  ǫ-reachable safe set for some  ǫ > 0:

image

as our benchmark. A natural performance metric in this context is cumulative pseudo-regret [Audibert et al., 2009] over the course of T rounds, which is defined by  RT = �Tt=1 f(x∗ǫ) − f(xt),where  x∗ǫis the optimal safe action that maximizes the reward in expectation over the  Dsǫ, i.e.,  x∗ǫ ∈ arg maxx∈Dsǫ f(x).For the rest of this paper, we simply use regret to refer to the pseudo-regret  RTand drop the subscript  ǫfrom  x∗ǫ.

A desirable asymptotic property of a learning algorithm is that  RT /T → 0as fast as possible as T grows, especially when actions are costly. Algorithms with this property are called no-regret. Thus, the goal of the learner is to follow a no-regret algorithm while ensuring all actions she chooses are safe with high probability.

Regularity Assumptions. The above specified goal cannot be achieved unless certain assumptions are made on f and g. In what follows, we assume that these functions have a certain degree of smoothness. This assumption allows us to model the reward function f and the constraint function g as a sample from a Gaussian Process (GP) [Williams and Rasmussen, 2006]. We now present necessary standard terminology and notations on GPs. A GP(µ(x), k(x, x′))is a probability distribution across a class of smooth functions, which is parameterized by a kernel function  k(x, x′)that characterizes the smoothness of the function. The Bayesian algorithm we analyze uses GP(0, kf(x, x′))and  GP(0, kg(x, x′))as prior distributions over f and g, respectively, where  kfand  kgare positive semi-definite kernel functions. Moreover, we assume bounded variance  kf(x, x) ≤ 1and  kg(x, x) ≤ 1. For a noisy sample  yt = [y1, . . . , yt]T, with i.i.d Gaussian noise  ηt ∼ N(0, σ2)the posterior over f is also a GP with the mean µf,t(x)and variance  σ2f,t(x):

image

where  kf,t(x, x′) = kf(x, x′)− kf,t(x)T (Kf,t + σ2I)−1kf,t(x′), kf,t(x) = [kf(x1, x), . . . , kf(xt, x)]Tand  Kf,tis the positive definite kernel matrix  [kf(x, x′)]x,x′∈{x1,...,xt}. Associated with g, the mean  µg,t(x)and variance  σ2g,t(x)are defined similarly.

1.2 Related work

As mentioned in the introduction, the most closely related works to this paper are [Sui et al., 2015, Sui et al., 2018]. With the objective function denoted by f(x), [Sui et al., 2015] adopts a single constraint of the form  f(x) ≥ h, whereas [Sui et al., 2018, Berkenkamp et al., 2016a] consider the more general constraint set  gi(x) ≥ hi, i ∈ [m]. As is the case in our paper, the objective and constraint are modeled by Gaussian Processes. For algorithmic design purposes, [Sui et al., 2015, Sui et al., 2018] further assume Lipschitzness on reward and constraint functions. These assumptions are not required in our framework. Moreover, both of the aforementioned papers seek to identify a safe decision with the highest possible reward given a limited number of trials; i.e., their goal is to provide best-arm identification with convergence guarantees. Instead, our paper focuses on a long-term performance characterized through cumulative regret bounds. A more detailed comparison to algorithms and guarantees of [Sui et al., 2015, Sui et al., 2018] is given in Section 4.

There exist other contexts where safety constraints have been applied to stochastic bandit optimization frameworks. To name a few, the recent work of [Usmanova et al., 2019] studies a safe variant of the Frank-Wolfe algorithm to solve a smooth optimization problem with unknown convex objective and unknown linear constraints that are accessed by the learner via stochastic zeroth-order feedback. The analysis aims at providing sample complexity results and convergence guarantees, whereas we aim to provide regret bounds. In contrast to our setting, the paper [Usmanova et al., 2019] requires multiple measurements of the constraint at each round of the algorithm. Other closely related works of [Amani et al., 2019, Amani et al., 2020] study the problem of safe linear and generalized linear stochastic bandit where the constraint and loss functions depend linearly (directly or via a link function) on an unknown parameter. In fact, our algorithm can be seen as an extension of Safe-LUCB proposed by [Amani et al., 2019] to safe GPs. Specifically, in Section 3.1, we show that our algorithm and guarantees are similar to those in [Amani et al., 2019] for linear kernels. While [Amani et al., 2019] studies a frequentist setting, our results hold for a rich class of kernels beyond linear kernel.

In a broader sense, the problem of safe learning has received significant attention in reinforcement learning and controls. For example, [Berkenkamp et al., 2017] combines classical reinforcement learning with stability requirements by applying a Gaussian process prior to learn about system dynamics and shows improvement in both control performance and safe region expansion. Another notable work is [Schreiter et al., 2015], which presents an active learning framework that uses Gaussian Processes to learn the safe decision set. In [Turchetta et al., 2016], the authors address the problem of safely exploring finite Markov decision processes (MDP), where state-action pairs are associated with safety features that are modeled by Gaussian processes and must lie above a threshold. Also in the MDP setting, [Moldovan and Abbeel, 2012] proposes an algorithm that allows safe exploration in order to avoid fatal absorbing states that must never be visited during the exploration process. By considering constrained MDPs that are augmented with a set of auxiliary cost functions and replacing them with surrogates that are easy to estimate, [Achiam et al., 2017] proposes a policy search algorithm for constrained reinforcement learning with guarantees for near constraint satisfaction at each iteration. Furthermore, [Wachi et al., 2018] presents a reinforcement learning approach to explore and optimize a safety-constrained MDP where the safety values of states are modeled by GPs. From a control theoretic point of view, the recent work [Liu et al., 2019] studies an algorithmic framework for safe exploration in model-based control which comes with convergence guarantees, but no regret bound. Other notable work in this area include [Gillulay and Tomlin, 2011] that combines reachability analysis and machine learning for autonomously learning the dynamics of a target vehicle and [Aswani et al., 2013] that designs a learning-based MPC scheme that provides deterministic guarantees on robustness when the underlying system model is linear and has a known level of uncertainty.

We start with a description of SGP-UCB, which is summarized in Algorithm 1. Similar to a number of previous works (e.g., [Sui et al., 2018, Amani et al., 2019]), SGP-UCB proceeds in two phases to balance the goal of expanding the safe set and controlling the regret. Prior to designing the decision rule, the algorithm requires a proper expansion of  Ds0. Hence, in the first phase, it takes actions at random from a given safe seed set  Dwuntil the safe set has sufficiently expanded (discussion on other suitable sampling strategies in the first phase is provided in Appendix G). In the second phase, the algorithm exploits GP properties to make predictions of f from past noisy observations  yt. It then follows the Upper Confidence Bound (UCB) machinery to select the action. In the absence of constraint (1), UCB-based algorithms select action  xtsuch that  f(xt)is a high probability upper bound on  f(x∗). Specifically, SGPUCB constructs appropriate confidence intervals  Qf,t(x)of f for  x ∈ D0(see (6)). However, the safety constraint (1) requires the algorithm to have a more delicate sampling rule as follows.

The algorithm exploits the noisy constraint observations  ztto similarly establish confidence intervals  Qg,t(x)for the unknown constraint function such that  g(x) ∈ Qg,t(x)for  x ∈ D0. These confidence intervals allow us to design an inner approximation  Dstof the safe set (see (9)). The chosen actions belong to  Dstwhich guarantees that the safety constraint (1) is met with high probability. In sections 2.1 and 2.2, we explain the first and second phases of the algorithm in detail.

2.1 First Phase: Exploration phase

The exploration phase aims to reach a sufficiently expanded safe subset of  D0. The stopping criterion for this phase is to reach an approximate safe set within which  x∗lies with high probability. The algorithm starts exploring by choosing actions from  Dwat random (see Appendix G for discussion on alternative suitable sampling rules in the first phase). After  T ′rounds of exploration, where  T ′is passed as an input to the algorithm, SGP-UCB exploits the collected observations  zt, t ∈ [T ′]to obtain a reasonable estimate of the unknown function g and consequently to establish an expanded safe set which contains  x∗with high probability.

image

2.2 Second Phase: Exploration-Exploitation phase

In the second phase, the algorithm follows an approach similar to GP-UCB [Srinivas et al., 2010] in order to balance exploration and exploitation and guarantee the no-regret property. At rounds  t = T ′ + 1, . . . , T, SGP-UCB uses previous observations to estimate  Ds0and predict f. It creates the following confidence interval for f(x):

image

where,

image

Confidence intervals  Qg,t(x)corresponding to g(x) are defined in a similar way. We choose  βtaccording to Theorem 1 to guarantee  f(x) ∈ Qf,t(x)and  g(x) ∈ Qg,t(x)for all  x ∈ D0and t > 0 with high probability. Theorem 1 (Confidence Intervals, [Srinivas et al., 2010]). Pick  δ ∈ (0, 1)and set  βt = 2 log((2)|D0|t2π2/6δ), then:

image

with probability at least  1 − δ.

Using the above defined confidence intervals  Qf,t(x)and  Qg,t(x), the algorithm is able to act conservatively to ensure that safety constraint (1) is satisfied. Specifically, at the beginning of each round  t = T ′ + 1, . . . , T, SGP-UCB forms the following so-called safe decision sets based on the mentioned confidence bounds:

image

Recall that  g(x) ≥ ℓg,t(x)for all t > 0 with high probability. Therefore,  Dstis guaranteed to be a set of safe actions with the same probability. After creating safe decision sets in the second phase, the algorithm follows a similar decision rule as in GP-UCB algorithm in [Srinivas et al., 2010]. Specifically,  xtis chosen such that:

image

Consider the following decomposition on the cumulative regret:

image

where  rt = f(x∗) − f(xt)is the instantaneous regret at round t.

Bounding Term II. The main challenge in the analysis of SGP-UCB compared to the classical GP-UCB is that x∗may not lie within the estimated safe set  Dstat all rounds of the algorithm’s second phase if  T ′is not properly chosen.

In the following sections, we show how  T ′is appropriately chosen such that  x∗ ∈ Dstwith high probability for all  t ≥ T ′ + 1. Having said that, we bound the second term of (11) using the standard regret analysis which appears in [Srinivas et al., 2010]. Specifically, the bound depends on the so-called information gain  γtwhich quantifies how fast f can be learned in an information theoretic sense. Concretely,  γt := max|A|≤t I(f; yA), is the maximal mutual information that can be obtained about the GP prior from t samples. Information gain is a problem dependent quantity: its value depends on the given decision set  D0and kernel function  kf. For any finite  D0, it holds [Srinivas et al., 2010]:

image

While  γtis generally bounded by  O(|D0|log t|D0|), it has a sublinear dependence on  |D0|for commonly used kernels (e.g. Gaussian kernel).

Bounding Term I. Since for the first  T ′rounds actions are selected at random, the bound on Term I is linear in  T ′. In other words, the upper bound on the first term is of the form  BT ′, where  B := C√2ℓddiam(D0)/δfor some C > 0 if  kfis an RBF kernel with parameter  ℓ, otherwise  B := 2�2 log(2|D0|)/δsuch that (see Lemma 6 in Appendix D for details):

image

Next, we need to find the value of  T ′such that with high probability  x∗ ∈ Dstfor all  t ≥ T ′ +1. The following lemma, proved in Appendix A, establishes a sufficient condition for  x∗ ∈ Dstwhich is more convenient to work with.

image

From Lemma 1, it suffices to establish an appropriate upper bound on the RHS of (14) to determine the duration of the first phase, i.e.,  T ′.

A positive semi-definite kernel function  kg : Rd × Rd → Ris associated with a feature map  ϕg : Rd → Hkgthat maps the vectors in the primary space to a reproducing kernel Hilbert space (RKHS). In terms of the mapping  ϕg, the kernel function  kgis defined by:

image

Let  dgdenote the dimension of  Hkg(potentially infinite) and define the  t×dgmatrices  Φg,t := [ϕg(x1), . . . , ϕg(xt)]Tat each round t. Using this notation, we can rewrite  σ2g,tas (see Appendix E for details):

image

In the following two subsections, we discuss how this expression helps us control  σ2g,t−1(x∗)for  t ≥ T ′+1. Depending on the type of kernel functions  kgand their corresponding  dg, we derive different expressions for  T ′in Theorems 2 and 3.

3.1 Constraint with finite-dimensional RKHS

In this section we consider g with finite dimensional RKHS. Linear and polynomial kernels are special cases of these types of functions. For a linear kernel  kg(x, y) = xT yand a polynomial kernel  kg(x, y) = (xT y + 1)p, the corresponding  dgis d and�d+pd �, respectively [Pham and Pagh, 2013].

Let  ¯x ∼Unif(Dw)be a d-dimensional random vector uniformly distributed in  Dw. At rounds  t ∈ [T ′], SGPUCB chooses safe iid actions  xt iid∼ ¯x. We denote the covariance matrix of  ϕg(¯x)by  Σg = E[ϕg(¯x)ϕg(¯x)T ] ∈Rdg×dg. A key quantity in our analysis is the minimum eigenvalue of  Σgdenoted by:

image

Regarding the definition of  σ2g,t(x)in (16), we show that if  λ− > 0, σ2g,t−1(x∗)can be controlled for all  t ≥ T ′ + 1by appropriately lower bounding the minimum eigenvalue of the Gram matrix  ΦTg,T ′Φg,T ′, which is possible due to the randomness of chosen actions in the first phase.

Lemma 2. Assume  dg < ∞, λ− > 0, and  x ∈ D0. Then, for any  δ ∈ (0, 1), provided  T ′ ≥ tδ := 8λ− log( dgδ ), the following holds with probability at least  1 − δ,

image

Consequently,  σ2g,t−1(x∗) ≤ 2σ22σ2+λ−T ′ ,for all  t ≥ T ′ + 1.

We present the proof in Appendix B.

Combining Lemmas 1 and 2 gives the desired value of  T ′that guarantees  x∗ ∈ Dstfor all  t ≥ T ′ + 1with high probability. Putting these together, we conclude the following regret bound for constraint with corresponding finite-dimensional RKHS.

Theorem 2 (Regret bound for g with finite dimensional RKHS). Let the same assumptions as in Lemma 2 hold. Let tǫ := 8σ2βTλ−ǫ2and define  T ′ := tǫ ∨ tδ. Then for sufficiently large T and any  δ ∈ (0, 1/3), with probability at least 1 − 3δ:

image

See Appendix D for proof details.

Linear Kernels. We highlight the setting where f and g are associated with linear kernels as a special case. In this setting the primal space  Rdand the corresponding RKHS are the same. Let k be a linear kernel with mapping ϕg : Rd → Hk = Rd, Xt = Φt = [x1, . . . , xt]T, and y be the corresponding observation vector. Therefore, we have µt(x) = xT ˆθtwhere ˆθt = (XTt Xt + σ2I)−1XTt y. We drive the following from (16):

image

where  At = XTt Xt + σ2I. Thus, we observe the close relation in these notations with that in Linear stochas- tic bandits settings, (e.g. see [Dani et al., 2008, Abbasi-Yadkori et al., 2011]). As such, our setting is an extension to [Amani et al., 2019], where linear loss and constraint functions have been studied (albeit in a frequentist setting).

3.2 Constraint with infinite-dimensional RKHS

Now we provide regret guarantees for a more general case where the underlying RKHS corresponding to g can be infinite-dimensional.

In the infinite-dimensional RKHS setting, controlling  σg,t−1(x∗)for  t ≥ T ′ + 1can be challenging. To address this issue, we focus on stationary kernels, i.e.,  kg(x, y) = kg(x − y)2, and apply a finite basis approximation in our analysis. Particularly, we consider  ˜ϕg : Rd → RDgwhich maps the input to a lower-dimensional Euclidean inner product space with dimension  Dgsuch that:

image

Definition 1 ((ǫ0, Dg)-uniform approximation). Let  kg : Rd × Rd → Rbe a stationary kernel, then the inner product  ˜ϕg(x)T ˜ϕg(y)in  RDg, (ǫ0, Dg)-uniformly approximates  kg(x, y)if and only if:

image

Due to the infinite dimensionality of  Hkg, there is no notion for minimum eigenvalue of  ΦTg,T ′Φg,T ′. Hence, we adopt an  (ǫ0, Dg)-unifrom approximation to bound  σ2g,t−1(x∗)for all  t ≥ T ′+1by lower bounding the minimum eigenvalue of the approximated  Dg × Dgmatrix ˜ΦTg,T ′ ˜Φg,T ′instead. The argument follows the same procedure as in Lemma 2, other than an error bound on  σ2g,t−1(x∗)caused by the  (ǫ0, Dg)-unifromly approximation is required.

We consider  ˜ϕg(.)to be an (ǫ0, Dg)-uniform approximation and denote the covariance matrix of  ˜ϕg(¯x)by ˜Σg =E[ ˜ϕg(¯x) ˜ϕg(¯x)T ] ∈ RDg×Dgwith minimum eigenvalue:

image

Lemma 3. Assume that  dg = ∞, kgis a stationary kernel, and ˜λ−defined in (22) is positive. Fix  δ, ǫ0 ∈ (0, 1). Then, it holds with probability at least  1 − δfor all  t ≥ T ′ + 1,

image

provided that  T ′ ≥ ˜tδ := 8˜λ− log( Dgδ ).

Technical details on how  ˜ϕganalytically helps us obtain this upper bound on  σ2g,t−1(x∗)for all  t ≥ T ′ + 1by lower bounding the minimum eigenvalue of ˜ΦTg,T ′ ˜Φg,T ′are deferred to Appendix C. Putting these together, we obtain the regret bound for constraint with corresponding infinite-dimensional RKHS in the following theorem.

image

Figure 1: Illustration of SGP-UCB: (a) The blue and red solid lines denote the unknown reward function f and constraint function g, respectively. The dashed green line represents the threshold  h + ǫ, the gray bar shows the safe seed set  Dw, and the red star is the optimum value of f through  Dsǫ. (b,c) The dashed purple and orange lines are the estimated GP mean functions corresponding to f and g, respectively at rounds 50 and 200.

Theorem 3 (Regret bound for g with infinite dimensional RKHS). Assume there exists an  (ǫ0, Dg)-uniform approximation of stationary kernel  kgwith  0 < ǫ0 ≤ ǫ2σ232T 3βTfor which  ˜λ−defined in (22) is positive. Let  ˜tǫ := 16σ2βT˜λ−ǫ2and

˜tδ := 8˜λ− log( Dgδ )and define  T ′ := ˜tǫ ∨ ˜tδ. Then, for sufficiently large T and any  δ ∈ (0, 1/3), with probability at least  1 − 3δ:

image

Complete proof is given in Appendix D.

Depending on the feature map approximation  ˜ϕg, the dimension  Dgcan be appropriately chosen as a function of the algorithm’s inputs  ǫ, δand d to control the accuracy of the approximation. We emphasize that our analysis is not restricted to specific approximations (see Appendix F for details). We focus on the Quadrature Fourier features (QFF) studied by [Mutny and Krause, 2018] who show that for any stationary kernel k on  Rdwhose inverse Fourier transform decomposes product-wise, i.e.,  p(ω) = �di=1 pj(ωj), we can use Gauss-Hermite quadrature [Hildebrand, 1987] to approximate it. The results in [Mutny and Krause, 2018] imply that the QFF uniform approximation error  ǫ0decreases exponentially with  Dg. More concretely, in this case,  Dg = O�(d + log(d/ǫ0))d�features are required to obtain an ǫ0-accurate approximation of the SE kernel  kg.

A few remarks regarding the differences between our algorithm and existing work on safe-GP optimization are in order. We first remark on the assumptions placed on the safe seed set  Dwin our work, which might appear restrictive when compared to those in the closely related works of [Sui et al., 2015, Sui et al., 2018]. Specifically, our theoretical guarantees require the safe seed set  Dwto satisfy assumptions put forth in Theorems 2 and 3, which would ensure that  λ− > 0and ˜λ− > 0. For instances with dimension  dg < ∞discussed in Section 3.1, a sufficient condition that guarantees  λ− > 0is that  Dwcontains at least  dgactions, such that their maps  ϕg(.)into each corresponding RKHS form linearly independent vectors. For example, for linear constraints the size of the seed set needs only to be linear in the dimension d. Hence, our analysis suggests that Safe GP learning is easy when the safety constraint is simple (e.g. linear/polynomial kernels). However, for instances with dimension  dg = ∞discussed in Section 3.2, the assumption ˜λ− > 0holds if at least  Dgactions with linear independent corresponding  ˜ϕg(.)exist in  Dw. As a corollary, employing QFF for SE kernels in the analysis requires  Dwto contain at least  O��d + log(T/ǫ)�d�

actions.

image

Figure 2: Regret comparison

In comparison, the safe-GP algorithms proposed by [Sui et al., 2015, Sui et al., 2018, Berkenkamp et al., 2016a] can start from a safe seed set of arbitrary size; However, this comes at a costs: 1) There is no guarantee that they are able to explore the entire space to reach a sufficiently expanded safe set that includes  x∗; 2) They require the functions f and g to be Lipschitz continuous with known constants. More specifically, in all the above mentioned work a onestep reachablity operator for a single constraint function g is defined as follows:  Rǫ(S0) := S0 ∪ {x ∈ D0|∃x′ ∈S0, g(x′) − ǫ − Ld(x, x′) ≥ h}, where L is the Lipschitz constant corresponding to the constraint function g. Then they define an  ǫ-reachable safe set by ¯Rǫ(S0) := limn→∞ Rnǫ (S0), where  Rnǫ (S0) := Rǫ(Rǫ . . . (Rǫ

image

n-step reachability operator starting from the safe seed set  S0[Sui et al., 2015]. The above stated  ǫ-reachable safe set clearly depends on  S0. The optimization benchmark in [Sui et al., 2015, Berkenkamp et al., 2016a] is  f ∗ǫ (S0) =maxx∈ ¯Rǫ(S0) f(x), which varies on a case by case basis depending on  S0.

Instead, our benchmark is  f ∗ǫ = maxx∈Dsǫ f(x)which satisfies  f ∗ǫ ≥ f ∗ǫ (S0)since  ¯Rǫ(S0) ⊆ Dsǫfor any choice of  S0. Similarly, the optimization goal of StageOpt [Sui et al., 2018] is approaching  arg maxx∈Rt∗ǫ (S0)for an arbitrary safe seed set  S0, where  t∗is the round at which the first phase of StageOpt, ends under the condition  max wt(x)x∈Gt ≤ ǫ, where  Gtis the set of potential expander points that is created at each round t and  wt(x)is the width of confidence interval of constraint function g(x) at round t (see [Sui et al., 2018]). Hence, as pointed out in [Sui et al., 2018], given an arbitrary seed set, it is not guaranteed that they will be able to discover the globally optimal decision  x∗, e.g. if the safe region around  x∗is topologically separate from that of  S0. On the other hand, given its stronger assumptions on the safe seed set, our algorithm does not suffer from this issue.

Our second remark is concerning the fundamentally different goals of our algorithm versus that of [Sui et al., 2015, Sui et al., 2018, Berkenkamp et al., 2016a]. Unlike SGP-UCB, the proposed algorithms in the latter works are not focused on regret minimization; rather, their focus is on best arm identification through safe exploration, i.e., providing convergence guarantees to the reachable optimal solution defined in the previous remark. We refer the reader to Appendix G for more clarifications on the algorithmic design differences of SGP-UCB and the algorithm studied by [Sui et al., 2015] and their implications on cumulative regret analysis.

In this section, we analyze SGP-UCB through numerical evaluations on synthetic data. We first give an illustration of how our algorithm performs by depicting the estimated f and g and the expanded safe sets at certain rounds. The second experiment seeks to compare SGP-UCB’s performance against a number of other existing algorithms.

In Figure 1, we give an illustration of SGP-UCB’s performance. For the sake of visualization, we implement the algorithm in a 1-dimensional space and connect the data points since we find it instructive to also depict estimates of f and g as well as the growth of the safe sets. The algorithm starts the first phase by sampling actions at random from a given safe seed set. After 50 rounds, in Figure 1b, the safe set has sufficiently expanded such that the optimal action x∗lies within the  Ds50. Figure 1c shows the expansion of the safe set after 200 rounds, which still includes  x∗.

Figure 2 compares the average per-step regret of SGP-UCB against a number of closely related algorithms over 30 realizations (we include the error bars in figures presented in Appendix H). In particular, we compare against 1) StageOpt [Sui et al., 2018]; 2) SafeOpt-MC [Berkenkamp et al., 2016a] that generalizes SafeOpt [Sui et al., 2015] to settings with multiple constraints possibly different than the objective function f; 3) A heuristic variant of GP-UCB, which proceeds the same as SGP-UCB except that there is no exploration phase, i.e.,  T ′ = 0; 4) The standard GP-UCB with oracle access to the safe set.

We evaluated regret on synthetic settings with reward and constraint functions corresponding to SE kernels with hyperparameters 1 and 0.1, respectively. Parameters  T = 500, δ = 0.01, ǫ = 0.01and  σ = 0.1have been chosen in all settings. The decision sets are time-independent sets of 100 actions sampled uniformly from the unit ball in  R2. We implemented all algorithms by starting from the same seed set, i.e.,  Dw = S0. Figure 2 highlights the key role of the seed set’s size that is discussed in detail in Section 4. Figure 2c shows that once the safe seed set contains enough actions, SGP-UCB outperforms SafeOpt-MP and StageOpt whose  ǫ-reachable set (i.e., ¯Rǫ(Dw)in SafeOpt-MP and Rt∗ǫ (Dw)in StageOpt) do not include the true globally optimal  x∗considered in this paper. We also implemented SGP-UCB for settings where  Dwhas relatively small number of safe actions. The results given in Figure 2a show the poor performance of SGP-UCB which is expected since  Dwis not large enough to explore the whole space for the purpose of safe set expansion. In Figure 2b, the regret curves are plotted for instances where  Dwis not large enough to reach the sufficiently expanded safe set including  x∗, but it is also not too small to get the expansion process stuck. In these instances, StageOpt performs well compared to SGP-UCB on average. What is common in all figures is the poor performance of Naive SGP-UCB (almost linear regret) compared to the others since it is never able to expand the safe set properly. When implementing SafeOpt-MC, we took the results of Theorem 1 in [Berkenkamp et al., 2016a] into account. We found  t∗numerically and modified the sampling rule after  t∗as follows: xt := arg maxx∈St ℓf,t(x).

Another issue worth highlighting regrading implementation of SafeOpt-MC and StageOpt is construction of the safe sets  St. In our experiments, we relied on the exact definition of  Stsuggested in [Berkenkamp et al., 2016a, Sui et al., 2018], which depends on the Lipschitz constant of g. While we numerically calculate the Lipschitz constant to have a fair comparison, [Berkenkamp et al., 2016a, Sui et al., 2018] use only the GP model to ensure safety in their numerical experiments. As such, they construct  Stin the same way as we form  Dstin (9). However, since the provided guarantees in these works are obtained with respect to the optimal action through an  ǫ-reachable set, i.e., ¯Rǫ(S0), which clearly depends on  S0, this modification disregards the role of  S0in the provided theoretical results.

As is the case in our proposed algorithm, StageOpt also proceeds in two distinct phases and the duration of the first phase is an input to the algorithm which needs to be specified. An interesting observation is that there are similarities between the first phase duration suggested for StageOpt and that introduced in our paper. These similarities mostly come from their dependence on parameters such as  βTand  ǫ. In our experiments, we did not rely on the value of T ′that the theoretical results suggest. For both implementations, we stopped the first phase when the safe region plateaued for at least 20 iterations, and also hard capped  T ′at 100 iterations (a similar approach was adopted by [Sui et al., 2018]).

We studied a safe stochastic bandit optimization problem where the unknown payoff and constraint functions are sampled from GPs. We proposed SGP-UCB which is comprised of two phases: (i) a pure-exploration phase that speeds up learning of the safe set; (ii) a safe exploration-exploitation phase that focuses on regret minimization. We balanced the two-fold challenge of minimizing regret and expanding the safe set by properly choosing the duration of the first phase  T ′. Our analysis suggests that the type of kernels associated with the constraint functions plays a critical role in tuning  T ′and consequently affects the regret bounds. We used QFF [Mutny and Krause, 2018] as a tool to facilitate our analysis in settings with constraint function with infinite-dimensional RKHS. Beyond analysis, it is interesting to employ such approximations or other approaches like variational inference introduced by [Huggins et al., 2019] to further overcome computational associated with solving (10).

Several issues remain to be studied. While our algorithm is the first providing regret guarantees for safe GP optimization, it is not clear whether it is the best to apply. The answer could depend on the application. Hence, numerical comparisons on real application-specific data is worth investigating. More importantly, the other issue that needs to be addressed is that the existing guarantees (either in terms of cumulative regret, simple regret or optimization gap) for all safe-GP optimization algorithms, suffer from loose constants that make such comparisons hard. Indeed evaluating the performances of all these four algorithms in numerical experiments requires us to resort to empirical tuning of parameters like  T ′, which is an important challenge to overcome.

This research is supported by UCOP grant LFR-18-548175 and NSF grant 1847096.

[Abbasi-Yadkori et al., 2011] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.

[Achiam et al., 2017] Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31. JMLR. org.

[Akametalu et al., 2014] Akametalu, A. K., Fisac, J. F., Gillula, J. H., Kaynama, S., Zeilinger, M. N., and Tomlin, C. J. (2014). Reachability-based safe learning with gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–1431. IEEE.

[Amani et al., 2019] Amani, S., Alizadeh, M., and Thrampoulidis, C. (2019). Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems, pages 9252–9262.

[Amani et al., 2020] Amani, S., Alizadeh, M., and Thrampoulidis, C. (2020). Generalized linear bandits with safety constraints. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3562–3566. IEEE.

[Aswani et al., 2013] Aswani, A., Gonzalez, H., Sastry, S. S., and Tomlin, C. (2013). Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226.

[Audibert et al., 2009] Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902.

[Berkenkamp et al., 2016a] Berkenkamp, F., Krause, A., and Schoellig, A. P. (2016a). Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450.

[Berkenkamp et al., 2016b] Berkenkamp, F., Schoellig, A. P., and Krause, A. (2016b). Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 491–496. IEEE.

[Berkenkamp et al., 2017] Berkenkamp, F., Turchetta, M., Schoellig, A., and Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918.

[Chowdhury and Gopalan, 2017] Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 844–853. JMLR. org.

[Contal et al., 2013] Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel gaussian process opti- mization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer.

[Dani et al., 2008] Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.

[Gillulay and Tomlin, 2011] Gillulay, J. H. and Tomlin, C. J. (2011). Guaranteed safe online learning of a bounded system. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2979–2984. IEEE.

[Hildebrand, 1987] Hildebrand, F. B. (1987). Introduction to numerical analysis. Courier Corporation.

[Huggins et al., 2019] Huggins, J., Campbell, T., Kasprzak, M., and Broderick, T. (2019). Scalable gaussian pro- cess inference with finite-data mean and variance guarantees. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 796–805.

[Liu et al., 2019] Liu, A., Shi, G., Chung, S.-J., Anandkumar, A., and Yue, Y. (2019). Robust regression for safe exploration in control. arXiv preprint arXiv:1906.05819.

[Moldovan and Abbeel, 2012] Moldovan, T. M. and Abbeel, P. (2012). Safe exploration in markov decision processes. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1451– 1458. Omnipress.

[Mutny and Krause, 2018] Mutny, M. and Krause, A. (2018). Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Advances in Neural Information Processing Systems, pages 9005– 9016.

[Ostafew et al., 2016] Ostafew, C. J., Schoellig, A. P., and Barfoot, T. D. (2016). Robust constrained learning-based nmpc enabling reliable mobile robot path tracking. The International Journal of Robotics Research, 35(13):1547– 1563.

[Pham and Pagh, 2013] Pham, N. and Pagh, R. (2013). Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 239–247. ACM.

[Rahimi and Recht, 2008] Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184.

[Rigollet, 2015] Rigollet, P. (2015). 18. s997: High dimensional statistics. Lecture Notes), Cambridge, MA, USA: MIT Open-CourseWare.

[Rusmevichientong and Tsitsiklis, 2010] Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

[Schreiter et al., 2015] Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert, H., and Toussaint, M. (2015). Safe exploration for active learning with gaussian processes. In Joint European conference on machine learning and knowledge discovery in databases, pages 133–149. Springer.

[Srinivas et al., 2010] Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 1015–1022. Omnipress.

[Sui et al., 2018] Sui, Y., Burdick, J., Yue, Y., et al. (2018). Stagewise safe bayesian optimization with gaussian processes. In International Conference on Machine Learning, pages 4788–4796.

[Sui et al., 2015] Sui, Y., Gotovos, A., Burdick, J. W., and Krause, A. (2015). Safe exploration for optimization with gaussian processes. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 997–1005. JMLR.org.

[Tropp et al., 2015] Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends R⃝in Machine Learning, 8(1-2):1–230.

[Turchetta et al., 2016] Turchetta, M., Berkenkamp, F., and Krause, A. (2016). Safe exploration in finite markov decision processes with gaussian processes. In Advances in Neural Information Processing Systems, pages 4312– 4320.

[Usmanova et al., 2019] Usmanova, I., Krause, A., and Kamgarpour, M. (2019). Safe convex learning under uncertain constraints. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2106–2114.

[Valko et al., 2013] Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence.

[Vershynin, 2018] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.

[Wachi et al., 2018] Wachi, A., Sui, Y., Yue, Y., and Ono, M. (2018). Safe exploration and optimization of constrained mdps using gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Williams and Rasmussen, 2006] Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA.

In this section we prove the Lemma 1, which states the condition under which it is guaranteed that with probability at least  1 − δit holds that  x∗ ∈ Dst.

image

If we lower-bound the LHS of (25) using the definition of confidence interval  Qg,t(x∗), we obtain:

image

Since each confidence interval  Qg,t(x)is built to contain the g(x) with high probability, it is clear that (25) is satisfied whenever (27) is true.

In order to bound the minimum eigenvalue of the Gram matrices  ΦTg,T ′Φg,T ′, we use the Matrix Chernoff Inequality [Tropp et al., 2015].

Theorem 4 (Matrix Chernoff Inequality, [Tropp et al., 2015]). Consider a finite sequence  {Xk}of independent, random, symmetric matrices in  Rd. Assume that  λmin(Xk) ≥ 0and  λmax(Xk) ≤ Lfor each index k. Introduce the random matrix  Y = �k Xk. Let  µmindenote the minimum eigenvalue of the expectation E[Y],

image

Then, for any  ε ∈ (0, 1), it holds,

image

Proof of Lemma 2. Let  Xt = ϕg(xt)ϕg(xt)Tfor  t ∈ [T ′], such that each  Xtis a symmetric matrix with  λmin(Xt) ≥0 and  λmax(Xt) ≤ 1. In this notation,  ΦTg,T ′Φg,T ′ + σ2I = �T ′t=1 Xt + σ2I.We compute:

image

Thus, Theorem 4 implies the following for any  ε ∈ [0, 1):

image

To complete the proof of the lemma, simply choose  ε = 0.5and  T ′ ≥ 8L2λ− log( dgδ )in (28). This gives

image

Using this high probability lower bound on  λmin�ΦTg,T ′Φg,T ′ + σ2I�and the fact that  λmin�ΦTg,t−1Φg,t−1 + σ2I�≥

λmin�ΦTg,T ′Φg,T ′ + σ2I�for all  t ≥ T ′ + 1, we can easily obtain the desired bound on  σ2g,t−1(x∗)for all  t ≥ T ′ + 1as follows:

image

In this section, we present the proof of Lemma 3.

First, we bound the  σ2g,t(x) − ˜σ2g,t(x)([Mutny and Krause, 2018]), where  ˜σ2g,t(x)is the approximated posterior vari- ance.

Lemma 4 (Approximation of posterior variance, [Mutny and Krause, 2018]). Let the  ˜ϕg(.)T ˜ϕg(.) (ǫ0, Dg)-uniformly approximates the kernel  kg. Then,

image

Completing the proof of Lemma 3.

Proof. In order to complete the proof of Lemma 3, we employ a similar technique as in the proof of Lemma 2. In this direction, we bound  λmin�˜ΦTg,T ′ ˜Φg,T ′ + σ2I�using Theorem 4 such that with probability at least  1 − δ:

image

Note that for a QFF and RFF map  ∥ ˜ϕg(x∗)∥= 1.

Now we combine (31) and (33) to conclude that for  T ′ ≥ 8˜λ− log( Dgδ )and all  t ≥ T ′ + 1with probability at least 1 − δ:

image

as desired.

image

First, we decompose the cumulative regret  RTas follows:

image

The bound on the second term is standard in the literature (see for example [Srinivas et al., 2010]), but we provide the necessary details for completeness.

Lemma 5. Let  βtbe defined as in Theorem 1 and  x∗ ∈ Dstfor  t ≥ T ′ + 1. Then, for any  δ ∈ (0, 1)with probability at least  1 − δ:

image

Proof. The proof is mostly adapted from [Srinivas et al., 2010]. We start with analysis of  rt. If  x∗ ∈ Dstfor  t ≥ T ′+1, the following holds for all  t ≥ T ′ + 1with probability at least  1 − δ:

image

where the second inequality follows from the definition of decision rule in (10) and the fact that  x∗ ∈ Dstfor  t ≥ T ′+1. For the last inequality we used the definition of the confidence interval  Qf,tin (6).

It follows from (37) that with probability at least  1 − δ:

image

where in (38) we used  s2 ≤ C2 log(1+σ−2)for any  s ∈ [0, σ−2]when  C2 = σ−2/log(1+σ−2) ≥ 1and (39) follows from the definition of the mutual information between function values  fTand the observations  yT.

Lemma 6. With probability at least  1 − δ:

image

where  B := C√2ℓddiam(D0)/δfor some positive universal constant C > 0 if the function f is associated with an RBF kernel with parameter  ℓ, otherwise  B := 2�2 log(2|D0|)/δ.

Proof. By the GP assumption, f(x) is a random gaussian  |D0|×1vector with mean 0 and covariance matrix K = [kf x,x′∈D0]. First, we assume  kfis an RBF kernel. For an RBF kernel  kfwith parameter  ℓ, [Vershynin, 2018] implies that for some universal constant  C1 > 0:

image

Let  w(D0)be the gaussian width [Vershynin, 2018] of  D0. Then for some universal constant C > 0:

image

The first inequality follows from the Markov inequality. The second inequality is an application of [Vershynin, 2018] combined with (42). Finally, the last inequality holds due to the fact that  w(D0) ≤diam(D0)2 √d. Therefore, (43) implies that:

image

For a general setting, where  kfis not an RBF kernel, we have:

image

For the second inequality see [Rigollet, 2015] and the last inequality holds because  kf(x, x′) ≤ 1for all  x, x′ ∈ D0.

Hence, we deduce that:

image

We finalize the proof as follows:

image

provided that  B ≥ C√2ℓddiam(D0)/δif  kfis RBF kernel with parameter  ℓand  B ≥ 2�2 log(2|D0|)/δ.

Next, we show that when the Theorems 2 and 3 choices of  T ′are combined with Lemmas 2 and 3, respectively, (14) holds with probability at least  1 − δfor all  t ≥ T ′ + 1.

Completing the proof of Theorem 2. Let T be sufficiently large such that  T ≥ T ′. For all  t ≥ T ′ + 1, Lemma 2 implies:

image

where the last inequality follows from the the definition of  T ′and the fact that  T ′ ≥ 8σ2βTλ−ǫ2. Hence, as stated in Lemma 1, (48) is equivalent to  x∗ ∈ Dstfor all  t ≥ T ′ + 1with probability at least  1 − δ.

We are now ready to complete the proof of Theorem 2. Let T sufficiently large such that:

image

We combine Lemmas 2, 5 and 6 to conclude that with probability at least  1 − 3δ:

image

Completing the proof of Theorem 3. Let T be sufficiently large such that  T ≥ T ′. Employing Lemma 3, for t ≥ T + 1we have:

image

where, in the first inequality we used  βT ′+1 ≤ βT. Note that assumptions  ǫ0 ≤ ǫ2σ232T 3βTand  T ′ ≥ 16σ2βT˜λ−ǫ2combined with (50) guarantees (14). Hence with probability at least  1 − δ, x∗ ∈ Dstfor all  t ≥ T ′ + 1. This fact allows us to bound Term II in (11) using Lemma 5.

Hence, we combine Lemmas 2, 5 and 6 to conclude that with probability at least  1 − 3δ:

image

In this section we show why Eqn. (16) holds. Since the matrices  ΦTg,tΦg,t + σ2Iand  Φg,tΦTg,t + σ2Iare positive definite and  (ΦTg,tΦg,t + σ2I)ΦTg,t = ΦTg,t(Φg,tΦTg,t + σ2I), we get:

image

Also from the definition of  kg,t(x) := Φg,tϕg(x)for all  x ∈ D0, we deduce:

image

Combining (51) and (52), we get:

image

which gives:

image

At the final step, (54) implies:

image

Definition 2 (QFF approximation). For SE kernel the Fourier transform is  p(ω) =� l√2π

the SE kernel  kgis approximated as follows. Choose ¯Dg ∈ Nand  Dg = ¯Ddg, and construct the  2Dg-dimensional feature map:

image

Lemma 7. [Mutny and Krause, 2018] Let  kgbe the SE kernel and  D0 = [0, 1]d. Then, for an  (ǫ0, Dg)-uniform QFF approximation in  RDgit holds that:

image

Lemma 7 implies that if ¯Dg > 1/l2, the uniform approximation error  ǫ0decreases exponentially with  Dg.

Our analysis and results are not limited to only stationary kernels whose Fourier transform decomposes product-wise, i.e.,  p(ω) = �di=1 pj(ωj). There exists other uniform approximations adapted for a broader range of kernels. Specifically, for any stationary kernel, we can adapt the so-called Random Fourier Features (RFF) mapping introduced by [Rahimi and Recht, 2008].

Definition 3 (RFF approximation). For any stationary kernel  kgif  D0 = [0, 1]d, the  Dg-dimensional feature map is constructed by:

image

where  ωj, j ∈ [dg/2]are i.i.d random variables in  Rd.

In the following theorem, we see how  Dgis chosen for an  (ǫ0, Dg)-uniform RFF approximation. Theorem 5. [Rahimi and Recht, 2008] Let  kgbe an stationary kernel  D0 = [0, 1]d. Then, for Random Fourier Features mapping  ˜ϕgand any  δ ∈ (0, 1)it holds with probability at least  1 − δ:

image

provided that ,

image

While the uniform approximation error of QFF decreases exponentially with the size of the linear basis, applying the standard RFF [Rahimi and Recht, 2008] for any stationary kernels implies  Dg = O�dǫ20�number of features are required for an  (ǫ0, Dg)-uniform RFF approximation; in other words, the uniform approximation error of RFF decreases with the inverse square root of the basis dimension. See Appendix F for details on QFF and RFF.

In comparison, QFF scales unfavorably with the dimensionality of the model. Hence, on one hand, QFF is unsuitable for an arbitrary high dimensional kernel approximation. The strengths of QFF manifest on problems with a low dimension or a low effective dimension. On the other hand, adapting RFF results in drastically large  Dgwhen a very small  ǫ0is required to control the accuracy of the approximation.

image

Figure 3: Average regret of SGP-UCB and SafeOpt with linear kernel.

In this section we address the design of safe-GP optimization algorithms for the purpose of best arm identification (studied in [Sui et al., 2015, Sui et al., 2018]) versus that of regret minimization (studied in this paper) and highlight why regret guarantees are not the focus of the former.

A popular sampling criteria for best arm identification, which is also adopted for the safe-GP optimization setting by [Sui et al., 2015], relies on a purely exploratory approach referred to as the uncertainty sampling (a.k.a. maximum variance) rule. Here, the decision maker would select actions from a safe subset of true safe set,  Gt ∪ Mt, with the highest variance of the GP estimate:

image

where  Gt ⊆ Stand  Mt ⊆ Stare the set of potential expanders and maximizers, respectively, and  Stis the expanded safe set at round t that is constructed based on the knowledge of Lipschitz constant of f (see [Sui et al., 2015] for more details).

A general observation about the uncertainty sampling rule adopted in (60) is that while it is a provably good way to explore a function, it is not well suited to control regret, i.e., identifying points x where f(x) is large in order to concentrate sampling there without unnecessarily exploration. A relevant paper that has also highlighted this aspect is [Contal et al., 2013], which has applied the maximum variance rule (60) as a pure exploration approach for GP optimization (in the unconstrained setting), and yet proves theoretical upper bounds on the regret with batches of size K. To do so, they introduce the Gaussian Process Upper Confidence Bound and Pure Exploration algorithm (GP-UCB-PE) which combines the UCB strategy and Pure Exploration. GP-UCB-PE combines the benefits of the UCB policy with pure exploration queries which is based on uncertainty sampling rule in the same batch of K evaluations of f. While only relying on the maximum variance as the sampling criterion results in a greedy pure exploratory algorithm, the UCB strategy has been used in parallel with the pure explorative rule to obtain regret bound.

For the purpose of further clarification, we would also like to numerically highlight the unsuitability of the uncertainty sampling rule (60) for regret minimization. Consider the following example. In a relaxed version of Safe GP, let the objective function f be associated with linear kernels, the true safe set be known to the algorithm and contain all the standard basis vectors,  ei, whose only non-zero element is the i-th element which is 1. At each round t, uncertainty sampling (60) maximizes the estimated variance, which is  ∥x∥A−1t−1according to (20), over the given safe set. It can be shown that at each round t, it equivalently selects an eigenvector of  At−1corresponding to its minimum eigenvalue. Hence, the standard basis vectors,  ei, are repeatedly selected, resulting in a linear regret. In order to illustrate this issue, we implemented SafeOpt on 20 instances where the kernels are linear and the true safe sets  Dsǫare as explained above (and of course unknown to the algorithm). We also evaluated SafeOpt when the cumulative regret is obtained with respect to the benchmark considered in [Sui et al., 2015], i.e.,  f ∗ǫ = maxx∈ ¯Rǫ(Dw) f(x). Figure 3 compares the average regret curves of SGP-UCB and SafeOpt, with respect to both  f ∗ǫ = maxx∈ ¯Rǫ(Dw) f(x)and true benchmark and  f ∗ǫ = maxx∈Dsǫ f(x). Please note that in this experiment, we estimated  ¯Rǫ(Dw)by  R1000ǫ (Dw). Figure 3 highlights the poor performance of SafeOpt compared to that of SGP-UCB when the regret is obtained with respect to true  x∗. Let us however reiterate that SafeOpt was not designed for regret minimization to begin with.

image

Figure 4: Average regret of SGP-UCB: uniform and uncertainty sampling in the first phase.

image

Figure 5:  1 ≤ |Dw|≤ 10

A final remark is concerning a potential modification to SGP-UCB that can improve performance of the first phase but it is unclear how one can analyze this theoretically. Due to pure-explorative behaviour of uncertainty sampling, it might be an appropriate alternative for sampling actions form the safe seed set  Dwin the first phase to explore the function g. In Figure 4, we depict the average regret curves of SGP-UCB over 20 instances where f and g are sampled from GPs with SE kernels with hyper parameters 1 and 0.1, respectively. The curves highlight the performance of SGP-UCB when two different exploration approaches, uniform and uncertainty sampling, are applied in the first phase.

image

Figure 6:  11 ≤ |Dw|≤ 20

image

Figure 7:  21 ≤ |Dw|≤ 25

In this section, we provide the figures including standard deviation of regret curves presented in Figure 2. Figures 5, 6 and 7 highlight the standard deviation around the average regret curves depicted in Figure 2a, 2b and 2c, respectively.


Designed for Accessibility and to further Open Science