On Kernelized Multi-armed Bandits

2017·Arxiv

Abstract

Abstract

We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization – Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable gains of the proposed strategies in many cases.

1. Introduction

Optimization over large domains under uncertainty is an important subproblem arising in a variety of sequential decision making problems, such as dynamic pricing in economics (Besbes and Zeevi, 2009), reinforcement learning with continuous state/action spaces (Kaelbling et al., 1996; Smart and Kaelbling, 2000), and power control in wireless communication (Chiang et al., 2008). A typical feature of such problems is a large, or potentially infinite, domain of decision points or covariates (prices, actions, transmit powers), together with only partial and noisy observability of the associated outcomes (demand, state/reward, communication rate); reward/loss information is revealed only for decisions that are chosen. This often makes it hard to balance exploration and exploitation, as available knowledge must be transferred efficiently from a finite set of observations so far to estimates of the values of infinitely many decisions. A classic case in point is that of the canonical stochastic MAB with finitely many arms, where the effort to optimize scales with the total number of arms or decisions; the effect of this is catastrophic for large or infinite arm sets.

With suitable structure in the values or rewards of arms, however, the challenge of sequential optimization can be efficiently addressed. Parametric bandits, especially linearly parameterized bandits (Rusmevichien- tong and Tsitsiklis, 2010), represent a well-studied class of structured decision making settings. Here, every arm corresponds to a known, finite dimensional vector (its feature vector), and its expected reward is assumed to be an unknown linear function of its feature vector. This allows for a large, or even infinite, set of arms all lying in space of finite dimension, say d, and a rich line of work gives algorithms that attain sublinear regret with a polynomial dependence on the dimension, e.g., Confidence Ball (Dani et al., 2008), OFUL (Abbasi-Yadkori et al., 2011) (a strengthening of Confidence Ball) and Thompson sampling for linear bandits (Agrawal and Goyal, 2013)1 The insight here is that even though the number of arms can be large, the number of unknown parameters (or degrees of freedom) in the problem is really only d, which makes it possible to learn about the values of many other arms by playing a single arm.

A different approach to modelling bandit problems with a continuum of arms is via the framework of Gaussian processes (GPs) (Rasmussen and Williams, 2006). GPs are a flexible class of nonparametric models for expressing uncertainty over functions on rather general domain sets, which generalize multivariate Gaussian random vectors. GPs allow tractable regression for estimating an unknown function given a set of (noisy) measurements of its values at chosen domain points. The fact that GPs, being distributions on functions, can also help quantify function uncertainty makes it attractive for basing decision making strategies on them. This has been exploited to great advantage to build nonparametric bandit algorithms, such as GP-UCB (Srinivas et al., 2009), GP-EI and GP-PI (Hoffman et al., 2011). In fact, GP models for bandit optimization, in terms of their kernel maps, can be viewed as the parametric linear bandit paradigm pushed to the extreme, where each feature vector associated to an arm can have infinite dimension 2.

Against this backdrop, our work revisits the problem of bandit optimization with stochastic rewards. Specifically, we consider stochastic multiarmed bandit (MAB) problems with a continuous arm set, and whose (unknown) expected reward function is assumed to lie in a reproducing kernel Hilbert space (RKHS), with bounded RKHS norm – this effectively enforces smoothness on the function3. We make the following contributions-

• We design a new algorithm – Improved Gaussian Process-Upper Confidence Bound (IGP-UCB) – for stochastic bandit optimization. The algorithm can be viewed as a variant of GP-UCB (Srinivas et al., 2009), but uses a significantly reduced confidence interval width resulting in an order-wise improvement in regret compared to GP-UCB. IGP-UCB also shows a markedly improved numerical performance over GP-UCB.

• We develop a nonparametric version of Thompson sampling, called Gaussian Process Thompson sampling (GP-TS), and show that enjoys a regret bound of . Here, T is the total time horizon and is a quantity depending on the RKHS containing the reward function. This is, to our knowledge, the first known regret bound for Thompson sampling in the agnostic setup with nonparametric structure.

• We prove a new self-normalized concentration inequality for infinite-dimensional vector-valued martingales, which is not only key to the design and analysis of the IGP-UCB and GP-TS algorithms, but also potentially of independent interest. The inequality generalizes a corresponding self-normalized bound for martingales in finite dimension proven by Abbasi-Yadkori et al. (2011).

• Empirical comparisons of the algorithms developed above, with other GP-based algorithms, are presented, over both synthetic and real-world setups, demonstrating performance improvements of the proposed algorithms, as well as their performance under misspecification.

2. Problem Statement

We consider the problem of sequentially maximizing a fixed but unknown reward function over a (potentially infinite) set of decisions , also called actions or arms. An algorithm for this problem chooses, at each round t, an action , and subsequently observes a reward , which is a noisy version of the function value at . The action is chosen causally depending upon the arms played and rewards obtained upto round , denoted by the history . We

where is the -algebra generated by the random variables and .This is a mild assump- tion on the noise (it holds, for instance, for distributions bounded in ) and is standard in the bandit literature (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013).

Regret. The goal of an algorithm is to maximize its cumulative reward or alternatively minimize its cumulative regret – the loss incurred due to not knowing f’s maximum point beforehand. Let be a maximum point of f (assuming the maximum is attained). The instantaneous regret incurred at time t is , and the cumulative regret in a time horizon T (not necessarily known a priori) is defined to be . A sub-linear growth of in T signifies that as , or vanishing per-round regret.

Regularity Assumptions. Attaining sub-linear regret is impossible in general for arbitrary reward functions f and domains D, and thus some regularity assumptions are in order. In what follows, we assume that D is compact. The smoothness assumption we make on the reward function f is motivated by Gaussian processes4 and their associated reproducing kernel Hilbert spaces (RKHSs, see Sch¨olkopf and Smola (2002)). Specifically, we assume that f has small norm in the RKHS of functions , with positive semi-definite kernel function . This RKHS, denoted by , is completely specified by its kernel function and vice-versa, with an inner product obeying the reproducing property: for all . In other words, the kernel plays the role of delta functions to represent the evaluation map at each point via the RKHS inner product. The RKHS norm is a measure of the smoothness5 of f, with respect to the kernel function k, and satisfies: if and only if .

We assume a known bound on the RKHS norm of the unknown target function6: . Moreover, we assume bounded variance by restricting , for all . Two common kernels that satisfy bounded variance property are Squared Exponential and Matrn, defined as

where l > 0 and are hyperparameters, encodes the similarity between two points , and is the modified Bessel function. Generally the bounded variance property holds for any stationary kernel, i.e. kernels for which for all . These assumptions are required to make the regret bounds scale-free and are standard in the literature (Agrawal and Goyal, 2013). Instead if or , then our regret bounds would increase by a factor of c.

3. Algorithms

Design philosophy. Both the algorithms we propose use Gaussian likelihood models for observations, and Gaussian process (GP) priors for uncertainty over reward functions. A Gaussian process over D, denoted by , is a collection of random variables , one for each , such that every finite sub-collection of random variables is jointly Gaussian with mean and covariance . The algorithms use , as an initial prior distribution for the unknown reward function f over D, where is the kernel function associated with the RKHS in which f is assumed to have ‘small’ norm at most B. The algorithms also assume that the noise variables are drawn independently, across t, from , with . Thus, the prior distribution for each f(x), is assumed to be , . Moreover, given a set of sampling points within D, it follows under the assumption that the corresponding vector of observed rewards has the multivariate Gaussian distribution , where is the kernel matrix at time t. Then, by the properties of GPs, we have that and f(x) are jointly Gaussian given :

where . Therefore conditioned on the history , the posterior distribution over f is , where

Thus for every , the posterior distribution of f(x), given , is .

Remark. Note that the GP prior and Gaussian likelihood model described above is only an aid to algorithm design, and has nothing to do with the actual reward distribution or noise model as in the problem statement (Section 2). The reward function f is a fixed, unknown, member of the RKHS , and the true sequence of noise variables is allowed to be a conditionally R-sub-Gaussian martingale difference sequence (Equation 1). In general, thus, this represents a misspecified prior and noise model, also termed the agnostic setting by Srinivas et al. (2009).

The proposed algorithms, to follow, assume the knowledge of only the sub-Gaussianity parameter R, kernel function k and upper bound B on the RKHS norm of f. Note that are free parameters (possibly time-dependent) that can be set specific to the algorithm.

3.1 Improved GP-UCB (IGP-UCB) Algorithm

We introduce the IGP-UCB algorithm (Algorithm 1), that uses a combination of the current posterior mean and standard deviation to (a) construct an upper confidence bound (UCB) envelope for the actual function f over D, and (b) choose an action to maximize it. Specifically it chooses, at each round t, the action

with the scale parameter v set to be 1. Such a rule trades off exploration (picking points with high uncertainty

Here, denotes the mutual information between and , where and quantifies the reduction in uncertainty about f after observing at points . is a problem dependent quantity and can be found given the knowledge of domain D and kernel function k. For a compact subset D of is and , respectively, for the Squared Exponential and Matrn kernels (Srinivas et al., 2009), depending only polylogarithmically on the time T.

Discussion. Srinivas et al. (2009) have proposed the GP-UCB algorithm, and Valko et al. (2013) the KernelUCB algorithm, for sequentially optimizing reward functions lying in the RKHS . Both algorithms play an arm at time t using the rule: . GP-UCB uses the exploration parameter , with set to , where is additionally assumed to be a known, uniform (i.e., almost-sure) upper bound on all noise variables (Srinivas et al., 2009, Theorem 3). Compared to GP-UCB, IGP-UCB (Algorithm 1) reduces the width of the confidence interval by a factor roughly at every round t, and, as we will see, this small but critical adjustment leads to much better theoretical and empirical performance compared to GP-UCB. In KernelUCB, is set as , where is the exploration parameter and is the regularization constant. Thus IGP-UCB can be viewed as a special case of KernelUCB where .

3.2 Gaussian Process Thompson Sampling (GP-TS)

Our second algorithm, GP-TS (Algorithm 2), inspired by the success of Thompson sampling for standard and parametric bandits (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Gopalan et al., 2014; Agrawal and Goyal, 2013), uses the time-varying scale parameter and operates as follows. At each round t, GP-TS samples a random function from the GP with mean function and covariance function . Next, it chooses a decision set , and plays the arm that maximizes . We call it GP-Thompson-Sampling as it falls under the general framework of Thompson Sampling, i.e., (a) assume a prior on the underlying parameters of the reward distribution, (b) play the arm according to the prior probability that it is optimal, and (c) observe the outcome and update the prior. However, note that the prior is nonparametric in this case.

4. Main Results

We begin by presenting two key concentration inequalities which are essential in bounding the regret of the proposed algorithms.

Theorem 1 Let be an -valued discrete time stochastic process predictable with respect to the filtration , i.e., is -measurable . Let be a real-valued stochastic process such that for some and for all is (a) -measurable, and (b) R-sub-Gaussian conditionally on . Let be a symmetric, positive-semidefinite kernel, and let . For a given , with probability at least , the following holds simultaneously over all :

(Here, denotes the matrix and for any and , ). Moreover, if is positive definite with probability 1, then the conclusion above holds with .

Theorem 1 represents a self-normalized concentration inequality: the ‘size’ of the increasing-length sequence of martingale differences is normalized by the growing quantity that explicitly depends on the sequence. The following lemma helps provide an alternative, abstract, view of the self-normalized process of Theorem 1, based on the feature space representation induced by a kernel.

Lemma 1 Let be a symmetric, positive-semidefinite kernel, with associated feature map and the reproducing kernel Hilbert space8 (RKHS) . Letting and the (possibly infinite dimensional) matrix9 , we have, whenever is positive definite, that

Observe that is -measurable and also . The process is thus a mar-tingale with values10 in the RKHS H, which can possibly be infinite-dimensional, and moreover, whose deviation is measured by the norm weighted by , which is itself derived from . Theorem 1 represents the kernelized generalization of the finite-dimensional result of Abbasi-Yadkori et al. (2011), and we recover their result under the special case of a linear kernel: for all .

We remark that when is a mapping to a finite-dimensional Hilbert space, the argument of Abbasi- Yadkori et al. (2011, Theorem 1) can be lifted to establish Theorem 1, but it breaks down in the generalized, infinite-dimensional RKHS setting, as the self-normalized bound in their paper has an explicit, growing dependence on the feature dimension. Specifically, the method of mixtures (de la Pena et al., 2009) or Laplace method, as dubbed by Maillard (2016) (Lemma 5.2), fails to hold in infinite dimension. The primary reason for this is that the mixture distribution for finite dimensional spaces can be chosen independently of time, but in a nonparametric setup like ours, where the dimensionality of the self-normalizing factoritself grows with time, the use of (random) stopping times, precludes using time-dependent mixtures. We get around this difficulty by applying a novel ‘double mixture’ construction, in which a pair of mixtures on (a) the space of real-valued functions on , i.e., the support of a Gaussian process, and (b) on real sequences is simultaneously used to obtain a more general result, of potentially independent interest (see Section 5 and the appendix for details).

Our next result shows that how the posterior mean is concentrated around the unknown reward function f.

Theorem 2 Under the same hypotheses as those of Theorem 1, let , and be a member of the RKHS of real-valued functions on D with kernel k, with RKHS norm bounded by B. Then, with probability at least , the following holds for all and :

where is the maximum information gain after rounds and are mean and variance of posterior distribution defined as in Equation 2, 3, 4, with set to and .

Theorem 3.5 of Maillard (2016) states a similar result on the estimation of the unknown reward function from the RKHS. We improve upon it in the sense that the confidence bound in Theorem 2 is simultaneous over all , while the bound has been shown only for a single, fixed x in the Kernel Least-squares setting. We are able to achieve this result by virtue of Theorem 1.

4.1 Regret Bound of IGP-UCB

Theorem 3 Let and is conditionally R-sub-Gaussian. Running IGP-UCB for a function f lying in the RKHS , we obtain a regret bound of with high

probability. More precisely, with probability at least .

Improvement over GP-UCB. Srinivas et al. (2009), in the course of analyzing the GP-UCB algorithm, show that when the reward function lies in the RKHS , GP-UCB obtains regret

with high probability (see Theorem 3 therein for the exact bound). Furthermore, they assume that the noise is uniformly bounded by , while our sub-Gaussianity assumption (see Equation 1) is slightly more general, and we are able to obtain a multiplicative factor improvement in the final regret bound thanks to the new self-normalized inequality (Theorem 1). Additionally, in our numerical experiments, we observe a significantly improved performance of IGP-UCB over GP-UCB, both on synthetically generated function, and on real-world sensor measurement data (see Section 6).

Comparison with KernelUCB. Valko et al. (2013) show that the cumulative regret of KernelUCB is , where , defined as the effective dimension, measures, in a sense, the number of principal directions over which the projection of the data in the RKHS is spread. They show that is at least as good as , precisely and thus the regret bound of KernelUCB is roughly , which is factor better than IGP-UCB. However, KernelUCB requires the number of actions to be finite, so the regret bound is not applicable for infinite or continuum action spaces.

4.2 Regret Bound of GP-TS

For technical reasons, we will analyze the following version of GP-TS. At each round t, the decision set used by GP-TS is restricted to be a unique discretization of D with the property that for all , where is the closest point to x in . This can always be achieved by choosing a compact and convex domain and discretization with size such that

rd/BLrdt/BLtfor all x D, where L

implies, for every ,

as any is Lipschitz continuous with constant (De Freitas et al., 2012, Lemma 1).

Theorem 4 (Regret bound for GP-TS) Let be compact and convex, and a conditionally R-sub-Gaussian sequence. Running GP-TS for a function f lying in the RKHS and with decision sets chosen as above, with probability at least , the regret of GP-TS satisfies .

Comparison with IGP-UCB. Observe that regret scaling of GP-TS is which is a multiplicativefactor away from the bound obtained for IGP-UCB and similar behavior is reflected in our simulations on synthetic data. The additional multiplicative factor ofin the regret bound of GP-TS is essentially a consequence of discretization. How to remove this extra logarithmic dependency, and make the analysis discretization-independent, remains an open question.

Remark. The regret bound for GP-TS is inferior compared to IGP-UCB in terms of the dependency on dimension d, but to the best of our knowledge, Theorem 4 is the first (frequentist) regret guarantee of Thompson Sampling in the agnostic, non-parametric setting of infinite action spaces.

Linear Models and a Matching Lower Bound. If the mean rewards are perfectly linear, i.e. if there exists a such that for all , then we are in the parametric setup, and one way of casting this in the kernelized framework is by using the linear kernel . For this kernel, , and the regret scaling of IGP-UCB is and that of GP-TS is , which recovers the regret bounds of their linear, parametric analogues OFUL (Abbasi-Yadkori et al., 2011) and Linear Thompson sampling (Agrawal and Goyal, 2013), respectively. Moreover, in this case , thus the regret of IGP-UCB isfactor away from that of KernelUCB. But the regret bound of KernelUCB also depends on the number of arms N, and if N is exponential in d, then it also suffers regret. We remark that a similar factor improvement, as obtained by IGP-UCB over GP-UCB, was achieved in the linear parametric setting by Abbasi-Yadkori et al. (2011) in the OFUL algorithm, over its predecessor ConfidenceBall (Dani et al., 2008). Finally we see that the for linear bandit problem with infinitely many actions, IGP-UCB attains the information theoretic lower bound of (see Dani et al. (2008)), but GP-TS is a factor ofaway from it.

5. Overview of Techniques

We briefly outline here the key arguments for all the theorems in Section 4. Formal proofs and auxiliary lemmas required are given in the appendix.

Proof Sketch for Theorem 1. It is convenient to assume that , the induced kernel matrix at time t, is invertible, since this is where the crux of the argument lies. First we show that for any function and for all , thanks to the sub-Gaussian property (1), the processis a non-negative super-martingale with respect to the filtration , where and in fact satisfies . The chief difficulty is to handle the behavior of at a (random) stopping time, since the sizes of quantities such as at the stopping time will be random.

We next construct a mixture martingale by mixing over g drawn from an independent Gaussian process, which is a measure over a large space of functions, i.e., the space . Then, by a change of measure argument, we show that this induces a mixture distribution which is essentially over any desired finite dimension t, thus obtaining . Next from the

fact that and from Markov’s inequality, for any , we obtain

Finally, we lift this bound simultaneously for all t through a standard stopping time construction as in Abbasi- Yadkori et al. (2011).

Proof Sketch for Theorem 2. Here we sketch the special case of , i.e. . Observe that is upper bounded by sum of two terms, and Q := . Now we observe that and use this observation to show that and , which are in turn upper bounded by the terms and respectively. Then the result follows using Theorem 1, along with the assumption that and the fact that almost surely (see Lemma 3) when is invertible.

Proof Sketch for Theorem 3. First from Theorem 2 and the choice of in Algorithm 1, we show that the instantaneous regret at round t is upper bounded by with probability at least . Then the result follows by essentially upper bounding the term by (Lemma 4 in the appendix).

Proof Sketch for Theorem 4. We follow a similar approach given in Agrawal and Goyal (2013) to prove the regret bound of GP-TS. First observe that from our choice of discretization sets , the instantaneous regret at round t is given by , where and is the closest point to in . Now at each round t, after an action is chosen, our algorithm improves the confidence about true reward function f, via an update of and . However, if we play a suboptimal arm, the regret suffered can be much higher than the improvement of our knowledge. To overcome this difficulty, at any round t, we divide the arms (in the present discretization ) into two groups: saturated arms, , defined as those with and unsaturated otherwise, where is an appropriate constant (see Definition 1, 3). The idea is to show that the probability of playing a saturated arm is small and then bound the regret of playing an unsaturated arm in terms of standard deviation. This is useful because the inequality (Lemma 4) allows us to bound the total regret due to unsaturated arms.

First we lower bound the probability of playing an unsaturated arm at round t. We define a filtra-tion as the history up to round and prove that for “most” (in a high probability sense)

with concentration bounds for and f(x) (Lemma 6) and “smoothness” of f (Equation 7), allow us to show that the expected regret at round t is upper bounded in terms of , i.e. in terms of regret due to playing an unsaturated arm. More precisely, we show that for “most”

(Lemma 10), and use it to prove that

is a super-martingale difference sequence adapted to filtration (Lemma 12). Now, using the Azuma- Hoeffding inequality (Lemma 13), along with the bound on , we obtain the desired high- probability regret bound.

6. Experiments

In this section we provide numerical results on both synthetically generated test functions and functions from real-world data. We compare GP-UCB, IGP-UCB and GP-TS with GP-EI and GP-PI11.

Synthetic Test Functions. We use the following procedure to generate test functions from the RKHS. First we sample 100 points uniformly from the interval [0, 1] and use that as our decision set. Then we compute a kernel matrix K on those points and draw reward vector . Finally, the mean of the resulting posterior distribution is used as the test function f. We set noise parameter to be 1% of function range and use . We used Squared Exponential kernel with lengthscale parameter l = 0.2 and Matrn kernel with parameters . Parameters of IGP-UCB, GP-UCB and GP-TS are chosen as given in Section 3, with and set according to theoretical upper bounds for corresponding kernels. We run each algorithm for T = 30000 iterations, over 25 independent trials (samples from the RKHS) and plot the average cumulative regret along with standard deviations (Figure 1). We see a significant improvement in the performance of IGP-UCB over GP-UCB. In fact IGP-UCB performs the best in the pool of competitors, while GP-TS also fares reasonably well compared to GP-UCB and GP-EI/GP-PI.

We next sample 25 random functions from the GP(0, K) and perform the same experiment (Figure 2) for both kernels with exactly same set of parameters. The relative performance of all methods is similar to

Figure 1: Cumulative regret for functions lying in the RKHS corresponding to (a) Squared Exponential ker- nel and (b) Matrn kernel.

Figure 2: Cumulative regret for functions lying in the GP corresponding to (a) Squared Exponential kernel and (b) Matrn kernel.

that in the previous experiment, which is the arguably harder “agnostic” setting of a fixed, unknown target function.

Standard Test Functions. We consider 2 well-known synthetic benchmark functions for Bayesian Optimization: Rosenbrock and Hartman3 (see Azimi et al. (2012) for exact analytical expressions). We sample 100 d points uniformly from the domain of each benchmark function, d being the dimension of respective domain, as the decision set. We consider the Squared Exponential kernel with l = 0.2 and set all parameters exactly as in previous experiment. The cumulative regret for 25 independent trials on Rosenbrock and Hart-man3 benchmarks is shown in Figure 3. We see GP-EI/PI perform better than the rest, while IGP-UCB and GP-TS show competitive performance. Here no algorithm is aware of the underlying kernel function, hence

Figure 3: Cumulative regret for (a) Rosenbrock and (b) Hartman3 benchmark function.

we conjecture that the UCB- and TS- based algorithms are somewhat less robust on the choice of kernel than EI/PI.

Temperature Sensor Data. We use temperature data12 collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28th and April 5th, 2004 with samples collected at 30 second intervals. We tested all algorithms in the context of learning the maximum reading of the sensors collected between 8 am to 9 am. We take measurements of first 5 consecutive days (starting Feb. 28th 2004) to learn algorithm parameters. Following Srinivas et al. (2009), we calculate the empirical covariance matrix of the sensor measurements and use it as the kernel matrix in the algorithms. Here is set to be 5% of the average empirical variance of sensor readings and other algorithm parameters is set similarly as in the previous experiment with (found via cross-validation). The functions for testing consist of one set of measurements from all sensors in the two following days and the cumulative regret is plotted over all such test functions. From Figure 4, we see that IGP-UCB and GP-UCB performs the same, while GP-TS outperforms all its competitors.

Light Sensor Data. We take light sensor data collected in the CMU Intelligent Workplace in Nov 2005, which is available online as Matlab structure13 and contains locations of 41 sensors, 601 train samples and 192 test samples. We compute the kernel matrix, estimate the noise and set other algorithm parameters exactly as in the previous experiment. Here also GP-TS is found to perform better than the others, with IGP-UCB performing better than GP-EI/PI (Figure 4).

Related work. An alternative line of work pertaining to X-armed bandits Kleinberg et al. (2008); Bubeck et al. (2011); Carpentier and Valko (2015); Azar et al. (2014) studies continuum-armed bandits with smoothness structure. For instance, Bubeck et al. (2011) show that with a Lipschitzness assumption on the reward function, algorithms based on discretizing the domain yield nontrivial regret guarantees, of order

. Other Bayesian approaches to function optimization are GP-EI Moˇckus (1975), GP-PI Kushner (1964), GP-EST Wang et al. (2016) and GP-UCB, including the contextual Krause and Ong (2011), high-dimensional Djolonga et al. (2013); Wang et al. (2013), time-varying Bogunovic et al. (2016) safety-aware Gotovos et al. (2015), budget-constraint Hoffman et al. (2013) and noise-free De Freitas et al. (2012) settings. Other relevant work focuses on best arm identification problem in the Bayesian setup considering pure exploration Gr¨unew¨alder et al. (2010). For Thompson sampling (TS), Russo and Van Roy (2014) analyze the Bayesian

Figure 4: Cumulative regret plots for (a) temperature data and (b) light sensor data.

regret of TS, which includes the case where the target function is sampled from a GP prior. Our work obtains the first frequentist regret of TS for unknown, fixed functions from an RKHS.

7. Conclusion

For bandit optimization, we have improved upon the existing GP-UCB algorithm, and introduced a new GPTS algorithm. The proposed algorithms perform well in practice both on synthetic and real-world data. An interesting case is when the kernel function is also not known to the algorithms a priori and needs to be learnt adaptively. Moreover, one can consider classes of time varying functions from the RKHS, and general reinforcement learning with GP techniques. There are also important questions on computational aspects of optimizing functions drawn from GPs.

References

Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT, pages 39–1, 2012.

Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127–135, 2013.

Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In ICML, pages 1557–1565, 2014.

Javad Azimi, Ali Jalali, and Xiaoli Fern. Hybrid batch bayesian optimization. arXiv preprint arXiv:1202.5597, 2012.

Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.

Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying gaussian process bandit optimization. arXiv preprint arXiv:1601.06650, 2016.

S´ebastien Bubeck, R´emi Munos, Gilles Stoltz, and Csaba Szepesv´ari. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.

Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In ICML, pages 1133–1141, 2015.

Mung Chiang, Prashanth Hande, Tian Lan, and Chee Wei Tan. Power control in wireless cellular networks. Foundations and Trends in Networking, 2(4):381–533, 2008. ISSN 1554-057X. doi: 10.1561/1300000009.

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.

Nando De Freitas, Alex Smola, and Masrour Zoghi. Exponential regret bounds for gaussian process bandits with deterministic observations. arXiv preprint arXiv:1206.6457, 2012.

Victor H de la Pena, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes. probability and its applications, 2009.

Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025–1033, 2013.

Rick Durrett. Probability: Theory and Examples. Brooks/Cole - Thomson Learning, Belmont, CA, 2005.

Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In ICML, volume 14, pages 100–108, 2014.

Alkis Gotovos, ETHZ CH, and Joel W Burdick. Safe exploration for optimization with gaussian processes. 2015.

Steffen Gr¨unew¨alder, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for gaus- sian process bandit problems. In AISTATS, pages 273–280, 2010.

Matthew D Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for bayesian optimization. In UAI, pages 327–336, 2011.

Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas. Exploiting correlation and budget constraints in bayesian multi-armed bandit optimization. arXiv preprint arXiv:1303.6746, 2013.

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.

Emilie Kaufmann, Nathaniel Korda, and R´emi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM, 2008.

Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.

Odalric-Ambrym Maillard. Self-normalization techniques for streaming confident regression. 2016.

J Moˇckus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.

Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. 2006.

Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2): 395–411, May 2010.

Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

Bernhard Sch¨olkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.

William D Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In ICML, pages 903–910, 2000.

Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

Michal Valko, Nathaniel Korda, R´emi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with gaussian processes in bandit settings. In International Conf. on Artificial and Statistics (AISTATS), 2016.

Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, N Freitas, et al. Bayesian optimization in high di- mensions via random embeddings. AAAI Press/International Joint Conferences on Artificial Intelligence, 2013.

Fuzhen Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media, 2006.

Appendix A. Proof of Theorem 1

For a function and a sequence of reals , define for any

where the vector . We first establish the following technical result, which resembles Abbasi-Yadkori et al. (2011, Lemma 8).

Lemma 2 For fixed g and is a super-martingale with respect to the filtration .

Proof First, define

Since is -measurable and is -measurable, as well as are measurable. Also, by the conditional R-sub-Gaussianity of , we have

which in turn implies . We also have

Also observe that for all t, as

Now, let be a stopping time with respect to the filtration . By the convergence theorem for non- negative super-martingales (Durrett, 2005), exists almost surely, and thus is well-defined. Now let , be a stopped version of . By Fatou’s lemma (Durrett, 2005),

Now, let be the -algebra generated by , and let be a sequence of independent and identically distributed Gaussian random variables with mean 0 and variance , independent of . Let be a random function distributed according to the Gaussian process measure , and independent of both and .

For each , define . In words, is a mixture of super-martingales of the form , and it is not hard to see that is also a (non-negative) super-martingale w.r.t. the filtration , hence is well-defined almost surely. We can write

An argument similar to (8) also shows that for any stopping time . Now, without loss of generality, we assume R = 1 (this can always be achieved through appropriate scaling), and compute

where is the Gaussian process measure over the function space is the multivariate Gaussian distribution on with mean 0 and covariance where I is the identify, du is standard Lebesgue measure on , and f is the density of the random vector , which is distributed as the multivariate Gaussian given the sampled points up to round t, where is the induced kernel matrix at time t given by . (Note: is not positive definite and invertible when there are repetitions among , but is).

Thus, we have

Now for positive-definite matrices P and Q

Therefore,

which yields

Now as , using Markov’s inequality gives, for any ,

To complete the proof, we now employ a stopping time construction as in Abbasi-Yadkori et al. (2011). For each , define the ‘bad’ event

so that

which is the probability required to be bounded by in the statement of the theorem.

Let be the first time when the bad event happens, i.e., , with by convention. Clearly, is a stopping time, and

Therefore, we can write

by the inequality (9).

When is positive definite (and hence invertible) for each , one can use a similar construction as in Part 1, with (i.e., N is the all-zeros sequence with probability 1), to recover the corresponding conclusion (6) with .

Proof of Lemma 1

Define, for each time t, the matrix , and observe that and . With this, we can compute

completing the proof.

B. Information Theoretic Results

Lemma 3 For every , the maximum information gain , for the points chosen by Algorithm 1 and 2 satisfy, almost surely, the following :

Proof At round t after observing the reward vector at points , the information gain - by the algorithm - about the unknown reward function f is given by the mutual information between and sampled at points :

where , and . Clearly, given the randomness - as perceived by the algorithm - in are only in the noise vector and thus

Again, conditioned on reward vector observed at points , the reward at round s observed at is believed to follow the distribution , which gives

ln(2ln(2) + ln(1 + )). Now by chain rule ln(2) + ln(1 + )), and therefore

Now is a function of , the random points chosen by the algorithm and thus

Now the proof follows from Equation 10 and 11.

Lemma 4 Let be the points selected by the algorithms. The sum of predictive standard deviation at those points can be expressed in terms of the maximum information gain. More precisely,

Proof First note that, by Cauchy-Schwartz inequality, . Now since for all and by our choice of , we have , where in the last inequality we used the fact that for any . Thus we get . This implies

where the last inequality follows from Lemma 3. Now the result follows by choosing .

C. Proof of Theorem 2

First define as , where maps any point x in the primal space to the RKHS H associated with kernel function k. For any two functions , define the inner product as and the RKHS norm as. Now as the unknown reward function f lies in the RKHS , these definitions along with reproducing property of the RKHS imply and for all . Now defining , we get the kernel matrix , for all and . Since the matrices and are strictly positive definite and , we have

which gives

This implies

where the second equality uses 12, the first inequality is by Cauchy-Schwartz and the final equality is from 13. Again see that

where the second equality is from 12, the first inequality is by Cauchy-Schwartz and the fourth inequality uses both 12 and 13. Now recall that, at round t, the posterior mean function , where and . Thus we have

where we have used , where as stated in Theorem 1. Now observe that when K is invertible, . Using , we get

Now see that

Now using Theorem 1, for any , with probability at least , we obtain

from lemma 3. Now choosing we have

and hence the result follows.

D. Analysis of IGP-UCB (Theorem 3)

E. Analysis of GP-TS (Theorem 4)

Lemma 5 For any and any finite subset of D,

for all possible realizations of history .

Proof Fix and . Given history . Thus using Lemma B4 of Hoffman et al. (2013), for any , with probability at least

and now applying union bound,

holds with probability at least , given any possible realizations of history . Now setting , the result follows.

Definition 1 Define For all and , where . Clearly, increases with t.

Definition 2 Define as the event that for all ,

and as the event that for all ,

Definition 3 Define the set of saturated points in discretization at round t as

where , the difference between function values at the closest point to in