Optimization over large domains under uncertainty is an important subproblem arising in a variety of sequential decision making problems, such as dynamic pricing in economics (Besbes and Zeevi, 2009), reinforcement learning with continuous state/action spaces (Kaelbling et al., 1996; Smart and Kaelbling, 2000), and power control in wireless communication (Chiang et al., 2008). A typical feature of such problems is a large, or potentially infinite, domain of decision points or covariates (prices, actions, transmit powers), together with only partial and noisy observability of the associated outcomes (demand, state/reward, communication rate); reward/loss information is revealed only for decisions that are chosen. This often makes it hard to balance exploration and exploitation, as available knowledge must be transferred efficiently from a finite set of observations so far to estimates of the values of infinitely many decisions. A classic case in point is that of the canonical stochastic MAB with finitely many arms, where the effort to optimize scales with the total number of arms or decisions; the effect of this is catastrophic for large or infinite arm sets.
With suitable structure in the values or rewards of arms, however, the challenge of sequential optimization can be efficiently addressed. Parametric bandits, especially linearly parameterized bandits (Rusmevichien- tong and Tsitsiklis, 2010), represent a well-studied class of structured decision making settings. Here, every arm corresponds to a known, finite dimensional vector (its feature vector), and its expected reward is assumed to be an unknown linear function of its feature vector. This allows for a large, or even infinite, set of arms all lying in space of finite dimension, say d, and a rich line of work gives algorithms that attain sublinear regret with a polynomial dependence on the dimension, e.g., Confidence Ball (Dani et al., 2008), OFUL (Abbasi-Yadkori et al., 2011) (a strengthening of Confidence Ball) and Thompson sampling for linear bandits (Agrawal and Goyal, 2013)1 The insight here is that even though the number of arms can be large, the number of unknown parameters (or degrees of freedom) in the problem is really only d, which makes it possible to learn about the values of many other arms by playing a single arm.
A different approach to modelling bandit problems with a continuum of arms is via the framework of Gaussian processes (GPs) (Rasmussen and Williams, 2006). GPs are a flexible class of nonparametric models for expressing uncertainty over functions on rather general domain sets, which generalize multivariate Gaussian random vectors. GPs allow tractable regression for estimating an unknown function given a set of (noisy) measurements of its values at chosen domain points. The fact that GPs, being distributions on functions, can also help quantify function uncertainty makes it attractive for basing decision making strategies on them. This has been exploited to great advantage to build nonparametric bandit algorithms, such as GP-UCB (Srinivas et al., 2009), GP-EI and GP-PI (Hoffman et al., 2011). In fact, GP models for bandit optimization, in terms of their kernel maps, can be viewed as the parametric linear bandit paradigm pushed to the extreme, where each feature vector associated to an arm can have infinite dimension 2.
Against this backdrop, our work revisits the problem of bandit optimization with stochastic rewards. Specifically, we consider stochastic multiarmed bandit (MAB) problems with a continuous arm set, and whose (unknown) expected reward function is assumed to lie in a reproducing kernel Hilbert space (RKHS), with bounded RKHS norm – this effectively enforces smoothness on the function3. We make the following contributions-
• We design a new algorithm – Improved Gaussian Process-Upper Confidence Bound (IGP-UCB) – for stochastic bandit optimization. The algorithm can be viewed as a variant of GP-UCB (Srinivas et al., 2009), but uses a significantly reduced confidence interval width resulting in an order-wise improvement in regret compared to GP-UCB. IGP-UCB also shows a markedly improved numerical performance over GP-UCB.
• We develop a nonparametric version of Thompson sampling, called Gaussian Process Thompson sampling (GP-TS), and show that enjoys a regret bound of . Here, T is the total time horizon and
is a quantity depending on the RKHS containing the reward function. This is, to our knowledge, the first known regret bound for Thompson sampling in the agnostic setup with nonparametric structure.
• We prove a new self-normalized concentration inequality for infinite-dimensional vector-valued martingales, which is not only key to the design and analysis of the IGP-UCB and GP-TS algorithms, but also potentially of independent interest. The inequality generalizes a corresponding self-normalized bound for martingales in finite dimension proven by Abbasi-Yadkori et al. (2011).
• Empirical comparisons of the algorithms developed above, with other GP-based algorithms, are presented, over both synthetic and real-world setups, demonstrating performance improvements of the proposed algorithms, as well as their performance under misspecification.
We consider the problem of sequentially maximizing a fixed but unknown reward function over a (potentially infinite) set of decisions
, also called actions or arms. An algorithm for this problem chooses, at each round t, an action
, and subsequently observes a reward
, which is a noisy version of the function value at
. The action
is chosen causally depending upon the arms played and rewards obtained upto round
, denoted by the history
. We
where is the
-algebra generated by the random variables
and
.This is a mild assump- tion on the noise (it holds, for instance, for distributions bounded in
) and is standard in the bandit literature (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013).
Regret. The goal of an algorithm is to maximize its cumulative reward or alternatively minimize its cumulative regret – the loss incurred due to not knowing f’s maximum point beforehand. Let be a maximum point of f (assuming the maximum is attained). The instantaneous regret incurred at time t is
, and the cumulative regret in a time horizon T (not necessarily known a priori) is defined to be
. A sub-linear growth of
in T signifies that
as
, or vanishing per-round regret.
Regularity Assumptions. Attaining sub-linear regret is impossible in general for arbitrary reward functions f and domains D, and thus some regularity assumptions are in order. In what follows, we assume that D is compact. The smoothness assumption we make on the reward function f is motivated by Gaussian processes4 and their associated reproducing kernel Hilbert spaces (RKHSs, see Sch¨olkopf and Smola (2002)). Specifically, we assume that f has small norm in the RKHS of functions , with positive semi-definite kernel function
. This RKHS, denoted by
, is completely specified by its kernel function
and vice-versa, with an inner product
obeying the reproducing property:
for all
. In other words, the kernel plays the role of delta functions to represent the evaluation map at each point
via the RKHS inner product. The RKHS norm
is a measure of the smoothness5 of f, with respect to the kernel function k, and satisfies:
if and only if
.
We assume a known bound on the RKHS norm of the unknown target function6: . Moreover, we assume bounded variance by restricting
, for all
. Two common kernels that satisfy bounded variance property are Squared Exponential and Mat
rn, defined as
where l > 0 and are hyperparameters,
encodes the similarity between two points
, and
is the modified Bessel function. Generally the bounded variance property holds for any stationary kernel, i.e. kernels for which
for all
. These assumptions are required to make the regret bounds scale-free and are standard in the literature (Agrawal and Goyal, 2013). Instead if
or
, then our regret bounds would increase by a factor of c.
Design philosophy. Both the algorithms we propose use Gaussian likelihood models for observations, and Gaussian process (GP) priors for uncertainty over reward functions. A Gaussian process over D, denoted by , is a collection of random variables
, one for each
, such that every finite sub-collection of random variables
is jointly Gaussian with mean
and covariance
. The algorithms use
, as an initial prior distribution for the unknown reward function f over D, where
is the kernel function associated with the RKHS
in which f is assumed to have ‘small’ norm at most B. The algorithms also assume that the noise variables
are drawn independently, across t, from
, with
. Thus, the prior distribution for each f(x), is assumed to be
,
. Moreover, given a set of sampling points
within D, it follows under the assumption that the corresponding vector of observed rewards
has the multivariate Gaussian distribution
, where
is the kernel matrix at time t. Then, by the properties of GPs, we have that
and f(x) are jointly Gaussian given
:
where . Therefore conditioned on the history
, the posterior distribution over f is
, where
Thus for every , the posterior distribution of f(x), given
, is
.
Remark. Note that the GP prior and Gaussian likelihood model described above is only an aid to algorithm design, and has nothing to do with the actual reward distribution or noise model as in the problem statement (Section 2). The reward function f is a fixed, unknown, member of the RKHS , and the true sequence of noise variables
is allowed to be a conditionally R-sub-Gaussian martingale difference sequence (Equation 1). In general, thus, this represents a misspecified prior and noise model, also termed the agnostic setting by Srinivas et al. (2009).
The proposed algorithms, to follow, assume the knowledge of only the sub-Gaussianity parameter R, kernel function k and upper bound B on the RKHS norm of f. Note that are free parameters (possibly time-dependent) that can be set specific to the algorithm.
3.1 Improved GP-UCB (IGP-UCB) Algorithm
We introduce the IGP-UCB algorithm (Algorithm 1), that uses a combination of the current posterior mean and standard deviation
to (a) construct an upper confidence bound (UCB) envelope for the actual function f over D, and (b) choose an action to maximize it. Specifically it chooses, at each round t, the action
with the scale parameter v set to be 1. Such a rule trades off exploration (picking points with high uncertainty
Here, denotes the mutual information between
and
, where
and quantifies the reduction in uncertainty about f after observing
at points
.
is a problem dependent quantity and can be found given the knowledge of domain D and kernel function k. For a compact subset D of
is
and
, respectively, for the Squared Exponential and Mat
rn kernels (Srinivas et al., 2009), depending only polylogarithmically on the time T.
Discussion. Srinivas et al. (2009) have proposed the GP-UCB algorithm, and Valko et al. (2013) the KernelUCB algorithm, for sequentially optimizing reward functions lying in the RKHS . Both algorithms play an arm at time t using the rule:
. GP-UCB uses the exploration parameter
, with
set to
, where
is additionally assumed to be a known, uniform (i.e., almost-sure) upper bound on all noise variables
(Srinivas et al., 2009, Theorem 3). Compared to GP-UCB, IGP-UCB (Algorithm 1) reduces the width of the confidence interval by a factor roughly
at every round t, and, as we will see, this small but critical adjustment leads to much better theoretical and empirical performance compared to GP-UCB. In KernelUCB,
is set as
, where
is the exploration parameter and
is the regularization constant. Thus IGP-UCB can be viewed as a special case of KernelUCB where
.
3.2 Gaussian Process Thompson Sampling (GP-TS)
Our second algorithm, GP-TS (Algorithm 2), inspired by the success of Thompson sampling for standard and parametric bandits (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Gopalan et al., 2014; Agrawal and Goyal, 2013), uses the time-varying scale parameter and operates as follows. At each round t, GP-TS samples a random function
from the GP with mean function
and covariance function
. Next, it chooses a decision set
, and plays the arm
that maximizes
. We call it GP-Thompson-Sampling as it falls under the general framework of Thompson Sampling, i.e., (a) assume a prior on the underlying parameters of the reward distribution, (b) play the arm according to the prior probability that it is optimal, and (c) observe the outcome and update the prior. However, note that the prior is nonparametric in this case.
We begin by presenting two key concentration inequalities which are essential in bounding the regret of the proposed algorithms.
Theorem 1 Let be an
-valued discrete time stochastic process predictable with respect to the filtration
, i.e.,
is
-measurable
. Let
be a real-valued stochastic process such that for some
and for all
is (a)
-measurable, and (b) R-sub-Gaussian conditionally on
. Let
be a symmetric, positive-semidefinite kernel, and let
. For a given
, with probability at least
, the following holds simultaneously over all
:
(Here, denotes the
matrix
and for any
and
,
). Moreover, if
is positive definite
with probability 1, then the conclusion above holds with
.
Theorem 1 represents a self-normalized concentration inequality: the ‘size’ of the increasing-length sequence of martingale differences is normalized by the growing quantity
that explicitly depends on the sequence. The following lemma helps provide an alternative, abstract, view of the self-normalized process of Theorem 1, based on the feature space representation induced by a kernel.
Lemma 1 Let be a symmetric, positive-semidefinite kernel, with associated feature map
and the reproducing kernel Hilbert space8 (RKHS)
. Letting
and the (possibly infinite dimensional) matrix9
, we have, whenever
is positive definite, that
Observe that is
-measurable and also
. The process
is thus a mar-tingale with values10 in the RKHS H, which can possibly be infinite-dimensional, and moreover, whose deviation is measured by the norm weighted by
, which is itself derived from
. Theorem 1 represents the kernelized generalization of the finite-dimensional result of Abbasi-Yadkori et al. (2011), and we recover their result under the special case of a linear kernel:
for all
.
We remark that when is a mapping to a finite-dimensional Hilbert space, the argument of Abbasi- Yadkori et al. (2011, Theorem 1) can be lifted to establish Theorem 1, but it breaks down in the generalized, infinite-dimensional RKHS setting, as the self-normalized bound in their paper has an explicit, growing dependence on the feature dimension. Specifically, the method of mixtures (de la Pena et al., 2009) or Laplace method, as dubbed by Maillard (2016) (Lemma 5.2), fails to hold in infinite dimension. The primary reason for this is that the mixture distribution for finite dimensional spaces can be chosen independently of time, but in a nonparametric setup like ours, where the dimensionality of the self-normalizing factor
itself grows with time, the use of (random) stopping times, precludes using time-dependent mixtures. We get around this difficulty by applying a novel ‘double mixture’ construction, in which a pair of mixtures on (a) the space of real-valued functions on
, i.e., the support of a Gaussian process, and (b) on real sequences is simultaneously used to obtain a more general result, of potentially independent interest (see Section 5 and the appendix for details).
Our next result shows that how the posterior mean is concentrated around the unknown reward function f.
Theorem 2 Under the same hypotheses as those of Theorem 1, let , and
be a member of the RKHS of real-valued functions on D with kernel k, with RKHS norm bounded by B. Then, with probability at least
, the following holds for all
and
:
where is the maximum information gain after
rounds and
are mean and variance of posterior distribution defined as in Equation 2, 3, 4, with
set to
and
.
Theorem 3.5 of Maillard (2016) states a similar result on the estimation of the unknown reward function from the RKHS. We improve upon it in the sense that the confidence bound in Theorem 2 is simultaneous over all , while the bound has been shown only for a single, fixed x in the Kernel Least-squares setting. We are able to achieve this result by virtue of Theorem 1.
4.1 Regret Bound of IGP-UCB
Theorem 3 Let and
is conditionally R-sub-Gaussian. Running IGP-UCB for a function f lying in the RKHS
, we obtain a regret bound of
with high
probability. More precisely, with probability at least .
Improvement over GP-UCB. Srinivas et al. (2009), in the course of analyzing the GP-UCB algorithm, show that when the reward function lies in the RKHS , GP-UCB obtains regret
with high probability (see Theorem 3 therein for the exact bound). Furthermore, they assume that the noise
is uniformly bounded by
, while our sub-Gaussianity assumption (see Equation 1) is slightly more general, and we are able to obtain a
multiplicative factor improvement in the final regret bound thanks to the new self-normalized inequality (Theorem 1). Additionally, in our numerical experiments, we observe a significantly improved performance of IGP-UCB over GP-UCB, both on synthetically generated function, and on real-world sensor measurement data (see Section 6).
Comparison with KernelUCB. Valko et al. (2013) show that the cumulative regret of KernelUCB is , where
, defined as the effective dimension, measures, in a sense, the number of principal directions over which the projection of the data in the RKHS is spread. They show that
is at least as good as
, precisely
and thus the regret bound of KernelUCB is roughly
, which is
factor better than IGP-UCB. However, KernelUCB requires the number of actions to be finite, so the regret bound is not applicable for infinite or continuum action spaces.
4.2 Regret Bound of GP-TS
For technical reasons, we will analyze the following version of GP-TS. At each round t, the decision set used by GP-TS is restricted to be a unique discretization of D with the property that
for all
, where
is the closest point to x in
. This can always be achieved by choosing a compact and convex domain
and discretization
with size
such that
rd/BLrdt
/BLt
for all x
D, where L
implies, for every ,
as any is Lipschitz continuous with constant
(De Freitas et al., 2012, Lemma 1).
Theorem 4 (Regret bound for GP-TS) Let be compact and convex,
and
a conditionally R-sub-Gaussian sequence. Running GP-TS for a function f lying in the RKHS
and with decision sets
chosen as above, with probability at least
, the regret of GP-TS satisfies
.
Comparison with IGP-UCB. Observe that regret scaling of GP-TS is which is a multiplicative
factor away from the bound
obtained for IGP-UCB and similar behavior is reflected in our simulations on synthetic data. The additional multiplicative factor of
in the regret bound of GP-TS is essentially a consequence of discretization. How to remove this extra logarithmic dependency, and make the analysis discretization-independent, remains an open question.
Remark. The regret bound for GP-TS is inferior compared to IGP-UCB in terms of the dependency on dimension d, but to the best of our knowledge, Theorem 4 is the first (frequentist) regret guarantee of Thompson Sampling in the agnostic, non-parametric setting of infinite action spaces.
Linear Models and a Matching Lower Bound. If the mean rewards are perfectly linear, i.e. if there exists a such that
for all
, then we are in the parametric setup, and one way of casting this in the kernelized framework is by using the linear kernel
. For this kernel,
, and the regret scaling of IGP-UCB is
and that of GP-TS is
, which recovers the regret bounds of their linear, parametric analogues OFUL (Abbasi-Yadkori et al., 2011) and Linear Thompson sampling (Agrawal and Goyal, 2013), respectively. Moreover, in this case
, thus the regret of IGP-UCB is
factor away from that of KernelUCB. But the regret bound of KernelUCB also depends on the number of arms N, and if N is exponential in d, then it also suffers
regret. We remark that a similar
factor improvement, as obtained by IGP-UCB over GP-UCB, was achieved in the linear parametric setting by Abbasi-Yadkori et al. (2011) in the OFUL algorithm, over its predecessor ConfidenceBall (Dani et al., 2008). Finally we see that the for linear bandit problem with infinitely many actions, IGP-UCB attains the information theoretic lower bound of
(see Dani et al. (2008)), but GP-TS is a factor of
away from it.
We briefly outline here the key arguments for all the theorems in Section 4. Formal proofs and auxiliary lemmas required are given in the appendix.
Proof Sketch for Theorem 1. It is convenient to assume that , the induced kernel matrix at time t, is invertible, since this is where the crux of the argument lies. First we show that for any function
and for all
, thanks to the sub-Gaussian property (1), the process
is a non-negative super-martingale with respect to the filtration
, where
and in fact satisfies
. The chief difficulty is to handle the behavior of
at a (random) stopping time, since the sizes of quantities such as
at the stopping time will be random.
We next construct a mixture martingale by mixing
over g drawn from an independent
Gaussian process, which is a measure over a large space of functions, i.e., the space
. Then, by a change of measure argument, we show that this induces a mixture distribution which is essentially
over any desired finite dimension t, thus obtaining
. Next from the
fact that and from Markov’s inequality, for any
, we obtain
Finally, we lift this bound simultaneously for all t through a standard stopping time construction as in Abbasi- Yadkori et al. (2011).
Proof Sketch for Theorem 2. Here we sketch the special case of , i.e.
. Observe that
is upper bounded by sum of two terms,
and Q :=
. Now we observe that
and use this observation to show that
and
, which are in turn upper bounded by the terms
and
respectively. Then the result follows using Theorem 1, along with the assumption that
and the fact that
almost surely (see Lemma 3) when
is invertible.
Proof Sketch for Theorem 3. First from Theorem 2 and the choice of in Algorithm 1, we show that the instantaneous regret
at round t is upper bounded by
with probability at least
. Then the result follows by essentially upper bounding the term
by
(Lemma 4 in the appendix).
Proof Sketch for Theorem 4. We follow a similar approach given in Agrawal and Goyal (2013) to prove the regret bound of GP-TS. First observe that from our choice of discretization sets , the instantaneous regret at round t is given by
, where
and
is the closest point to
in
. Now at each round t, after an action is chosen, our algorithm improves the confidence about true reward function f, via an update of
and
. However, if we play a suboptimal arm, the regret suffered can be much higher than the improvement of our knowledge. To overcome this difficulty, at any round t, we divide the arms (in the present discretization
) into two groups: saturated arms,
, defined as those with
and unsaturated otherwise, where
is an appropriate constant (see Definition 1, 3). The idea is to show that the probability of playing a saturated arm is small and then bound the regret of playing an unsaturated arm in terms of standard deviation. This is useful because the inequality
(Lemma 4) allows us to bound the total regret due to unsaturated arms.
First we lower bound the probability of playing an unsaturated arm at round t. We define a filtra-tion as the history
up to round
and prove that for “most” (in a high probability sense)
with concentration bounds for and f(x) (Lemma 6) and “smoothness” of f (Equation 7), allow us to show that the expected regret at round t is upper bounded in terms of
, i.e. in terms of regret due to playing an unsaturated arm. More precisely, we show that for “most”
(Lemma 10), and use it to prove that
is a super-martingale difference sequence adapted to filtration (Lemma 12). Now, using the Azuma- Hoeffding inequality (Lemma 13), along with the bound on
, we obtain the desired high- probability regret bound.
In this section we provide numerical results on both synthetically generated test functions and functions from real-world data. We compare GP-UCB, IGP-UCB and GP-TS with GP-EI and GP-PI11.
Synthetic Test Functions. We use the following procedure to generate test functions from the RKHS. First we sample 100 points uniformly from the interval [0, 1] and use that as our decision set. Then we compute a kernel matrix K on those points and draw reward vector . Finally, the mean of the resulting posterior distribution is used as the test function f. We set noise parameter
to be 1% of function range and use
. We used Squared Exponential kernel with lengthscale parameter l = 0.2 and Mat
rn kernel with parameters
. Parameters
of IGP-UCB, GP-UCB and GP-TS are chosen as given in Section 3, with
and
set according to theoretical upper bounds for corresponding kernels. We run each algorithm for T = 30000 iterations, over 25 independent trials (samples from the RKHS) and plot the average cumulative regret along with standard deviations (Figure 1). We see a significant improvement in the performance of IGP-UCB over GP-UCB. In fact IGP-UCB performs the best in the pool of competitors, while GP-TS also fares reasonably well compared to GP-UCB and GP-EI/GP-PI.
We next sample 25 random functions from the GP(0, K) and perform the same experiment (Figure 2) for both kernels with exactly same set of parameters. The relative performance of all methods is similar to
Figure 1: Cumulative regret for functions lying in the RKHS corresponding to (a) Squared Exponential ker- nel and (b) Matrn kernel.
Figure 2: Cumulative regret for functions lying in the GP corresponding to (a) Squared Exponential kernel and (b) Matrn kernel.
that in the previous experiment, which is the arguably harder “agnostic” setting of a fixed, unknown target function.
Standard Test Functions. We consider 2 well-known synthetic benchmark functions for Bayesian Optimization: Rosenbrock and Hartman3 (see Azimi et al. (2012) for exact analytical expressions). We sample 100 d points uniformly from the domain of each benchmark function, d being the dimension of respective domain, as the decision set. We consider the Squared Exponential kernel with l = 0.2 and set all parameters exactly as in previous experiment. The cumulative regret for 25 independent trials on Rosenbrock and Hart-man3 benchmarks is shown in Figure 3. We see GP-EI/PI perform better than the rest, while IGP-UCB and GP-TS show competitive performance. Here no algorithm is aware of the underlying kernel function, hence
Figure 3: Cumulative regret for (a) Rosenbrock and (b) Hartman3 benchmark function.
we conjecture that the UCB- and TS- based algorithms are somewhat less robust on the choice of kernel than EI/PI.
Temperature Sensor Data. We use temperature data12 collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28th and April 5th, 2004 with samples collected at 30 second intervals. We tested all algorithms in the context of learning the maximum reading of the sensors collected between 8 am to 9 am. We take measurements of first 5 consecutive days (starting Feb. 28th 2004) to learn algorithm parameters. Following Srinivas et al. (2009), we calculate the empirical covariance matrix of the sensor measurements and use it as the kernel matrix in the algorithms. Here is set to be 5% of the average empirical variance of sensor readings and other algorithm parameters is set similarly as in the previous experiment with
(found via cross-validation). The functions for testing consist of one set of measurements from all sensors in the two following days and the cumulative regret is plotted over all such test functions. From Figure 4, we see that IGP-UCB and GP-UCB performs the same, while GP-TS outperforms all its competitors.
Light Sensor Data. We take light sensor data collected in the CMU Intelligent Workplace in Nov 2005, which is available online as Matlab structure13 and contains locations of 41 sensors, 601 train samples and 192 test samples. We compute the kernel matrix, estimate the noise and set other algorithm parameters exactly as in the previous experiment. Here also GP-TS is found to perform better than the others, with IGP-UCB performing better than GP-EI/PI (Figure 4).
Related work. An alternative line of work pertaining to X-armed bandits Kleinberg et al. (2008); Bubeck et al. (2011); Carpentier and Valko (2015); Azar et al. (2014) studies continuum-armed bandits with smoothness structure. For instance, Bubeck et al. (2011) show that with a Lipschitzness assumption on the reward function, algorithms based on discretizing the domain yield nontrivial regret guarantees, of order
. Other Bayesian approaches to function optimization are GP-EI Moˇckus (1975), GP-PI Kushner (1964), GP-EST Wang et al. (2016) and GP-UCB, including the contextual Krause and Ong (2011), high-dimensional Djolonga et al. (2013); Wang et al. (2013), time-varying Bogunovic et al. (2016) safety-aware Gotovos et al. (2015), budget-constraint Hoffman et al. (2013) and noise-free De Freitas et al. (2012) settings. Other relevant work focuses on best arm identification problem in the Bayesian setup considering pure exploration Gr¨unew¨alder et al. (2010). For Thompson sampling (TS), Russo and Van Roy (2014) analyze the Bayesian
Figure 4: Cumulative regret plots for (a) temperature data and (b) light sensor data.
regret of TS, which includes the case where the target function is sampled from a GP prior. Our work obtains the first frequentist regret of TS for unknown, fixed functions from an RKHS.
For bandit optimization, we have improved upon the existing GP-UCB algorithm, and introduced a new GPTS algorithm. The proposed algorithms perform well in practice both on synthetic and real-world data. An interesting case is when the kernel function is also not known to the algorithms a priori and needs to be learnt adaptively. Moreover, one can consider classes of time varying functions from the RKHS, and general reinforcement learning with GP techniques. There are also important questions on computational aspects of optimizing functions drawn from GPs.
Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In COLT, pages 39–1, 2012.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127–135, 2013.
Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization under correlated bandit feedback. In ICML, pages 1557–1565, 2014.
Javad Azimi, Ali Jalali, and Xiaoli Fern. Hybrid batch bayesian optimization. arXiv preprint arXiv:1202.5597, 2012.
Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.
Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying gaussian process bandit optimization. arXiv preprint arXiv:1601.06650, 2016.
S´ebastien Bubeck, R´emi Munos, Gilles Stoltz, and Csaba Szepesv´ari. X-armed bandits. Journal of Machine Learning Research, 12(May):1655–1695, 2011.
Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In ICML, pages 1133–1141, 2015.
Mung Chiang, Prashanth Hande, Tian Lan, and Chee Wei Tan. Power control in wireless cellular networks. Foundations and Trends in Networking, 2(4):381–533, 2008. ISSN 1554-057X. doi: 10.1561/1300000009.
Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
Nando De Freitas, Alex Smola, and Masrour Zoghi. Exponential regret bounds for gaussian process bandits with deterministic observations. arXiv preprint arXiv:1206.6457, 2012.
Victor H de la Pena, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes. probability and its applications, 2009.
Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025–1033, 2013.
Rick Durrett. Probability: Theory and Examples. Brooks/Cole - Thomson Learning, Belmont, CA, 2005.
Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In ICML, volume 14, pages 100–108, 2014.
Alkis Gotovos, ETHZ CH, and Joel W Burdick. Safe exploration for optimization with gaussian processes. 2015.
Steffen Gr¨unew¨alder, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for gaus- sian process bandit problems. In AISTATS, pages 273–280, 2010.
Matthew D Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for bayesian optimization. In UAI, pages 327–336, 2011.
Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas. Exploiting correlation and budget constraints in bayesian multi-armed bandit optimization. arXiv preprint arXiv:1303.6746, 2013.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
Emilie Kaufmann, Nathaniel Korda, and R´emi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.
Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM, 2008.
Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
Odalric-Ambrym Maillard. Self-normalization techniques for streaming confident regression. 2016.
J Moˇckus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. 2006.
Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2): 395–411, May 2010.
Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
Bernhard Sch¨olkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
William D Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In ICML, pages 903–910, 2000.
Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
Michal Valko, Nathaniel Korda, R´emi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with gaussian processes in bandit settings. In International Conf. on Artificial and Statistics (AISTATS), 2016.
Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, N Freitas, et al. Bayesian optimization in high di- mensions via random embeddings. AAAI Press/International Joint Conferences on Artificial Intelligence, 2013.
Fuzhen Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media, 2006.
For a function and a sequence of reals
, define for any
where the vector . We first establish the following technical result, which resembles Abbasi-Yadkori et al. (2011, Lemma 8).
Lemma 2 For fixed g and is a super-martingale with respect to the filtration
.
Proof First, define
Since is
-measurable and
is
-measurable,
as well as
are
measurable. Also, by the conditional R-sub-Gaussianity of
, we have
which in turn implies . We also have
Also observe that for all t, as
Now, let be a stopping time with respect to the filtration
. By the convergence theorem for non- negative super-martingales (Durrett, 2005),
exists almost surely, and thus
is well-defined. Now let
, be a stopped version of
. By Fatou’s lemma (Durrett, 2005),
Now, let be the
-algebra generated by
, and let
be a sequence of independent and identically distributed Gaussian random variables with mean 0 and variance
, independent of
. Let
be a random function distributed according to the Gaussian process measure
, and independent of both
and
.
For each , define
. In words,
is a mixture of super-martingales of the form
, and it is not hard to see that
is also a (non-negative) super-martingale w.r.t. the filtration
, hence
is well-defined almost surely. We can write
An argument similar to (8) also shows that for any stopping time
. Now, without loss of generality, we assume R = 1 (this can always be achieved through appropriate scaling), and compute
where is the Gaussian process measure
over the function space
is the multivariate Gaussian distribution on
with mean 0 and covariance
where I is the identify, du is standard Lebesgue measure on
, and f is the density of the random vector
, which is distributed as the multivariate Gaussian
given the sampled points
up to round t, where
is the induced kernel matrix at time t given by
. (Note:
is not positive definite and invertible when there are repetitions among
, but
is).
Thus, we have
Now for positive-definite matrices P and Q
Therefore,
which yields
Now as , using Markov’s inequality gives, for any
,
To complete the proof, we now employ a stopping time construction as in Abbasi-Yadkori et al. (2011). For each , define the ‘bad’ event
so that
which is the probability required to be bounded by in the statement of the theorem.
Let be the first time when the bad event
happens, i.e.,
, with
by convention. Clearly,
is a stopping time, and
Therefore, we can write
by the inequality (9).
When is positive definite (and hence invertible) for each
, one can use a similar construction as in Part 1, with
(i.e., N is the all-zeros sequence with probability 1), to recover the corresponding conclusion (6) with
.
Define, for each time t, the matrix
, and observe that
and
. With this, we can compute
completing the proof.
Lemma 3 For every , the maximum information gain
, for the points chosen by Algorithm 1 and 2 satisfy, almost surely, the following :
Proof At round t after observing the reward vector at points
, the information gain - by the algorithm - about the unknown reward function f is given by the mutual information between
and
sampled at points
:
where ,
and
. Clearly, given
the randomness - as perceived by the algorithm - in
are only in the noise vector
and thus
Again, conditioned on reward vector observed at points
, the reward
at round s observed at
is believed to follow the distribution
, which gives
ln(2
ln(2
) +
ln(1 +
)). Now by chain rule
ln(2
) +
ln(1 +
)), and therefore
Now is a function of
, the random points chosen by the algorithm and thus
Now the proof follows from Equation 10 and 11.
Lemma 4 Let be the points selected by the algorithms. The sum of predictive standard deviation at those points can be expressed in terms of the maximum information gain. More precisely,
Proof First note that, by Cauchy-Schwartz inequality, . Now since
for all
and by our choice of
, we have
, where in the last inequality we used the fact that for any
. Thus we get
. This implies
where the last inequality follows from Lemma 3. Now the result follows by choosing .
First define as
, where
maps any point x in the primal space
to the RKHS H associated with kernel function k. For any two functions
, define the inner product
as
and the RKHS norm
as
. Now as the unknown reward function f lies in the RKHS
, these definitions along with reproducing property of the RKHS imply
and
for all
. Now defining
, we get the kernel matrix
,
for all
and
. Since the matrices
and
are strictly positive definite and
, we have
which gives
This implies
where the second equality uses 12, the first inequality is by Cauchy-Schwartz and the final equality is from 13. Again see that
where the second equality is from 12, the first inequality is by Cauchy-Schwartz and the fourth inequality uses both 12 and 13. Now recall that, at round t, the posterior mean function , where
and
. Thus we have
where we have used , where
as stated in Theorem 1. Now observe that when K is invertible,
. Using
, we get
Now see that
Now using Theorem 1, for any , with probability at least
, we obtain
from lemma 3. Now choosing we have
and hence the result follows.
Lemma 5 For any and any finite subset
of D,
for all possible realizations of history .
Proof Fix and
. Given history
. Thus using Lemma B4 of Hoffman et al. (2013), for any
, with probability at least
and now applying union bound,
holds with probability at least , given any possible realizations of history
. Now setting
, the result follows.
Definition 1 Define For all and
, where
. Clearly,
increases with t.
Definition 2 Define as the event that for all
,
and as the event that for all
,
Definition 3 Define the set of saturated points in discretization
at round t as
where , the difference between function values at the closest point to
in
and at x. Clearly
for all t, and hence
is unsaturated at every t.
Definition 4 Define filtration as the history until time t, i.e.,
. By definition,
. Observe that given
, the set
and the event
are completely deterministic.
Lemma 6 Given any and for all possible filtrations
Proof The probability bound for the event follows from Theorem 2 by replacing
with
and for the event
follows from Lemma 5 by setting
and
.
Lemma 7 (Gaussian Anti-concentration) For a Gaussian random variable X with mean and standard deviation
, for any
,
Lemma 8 For any filtration such that
is true,
for any , where
.
Proof Fix any . Given filtration
is a Gaussian random variable with mean
and standard deviation
and since event
is true,
. Now using the anti-concentration inequality in Lemma 7, we have
where, from Definition 2, . Therefore
, and hence the result follows.
Lemma 9 For any filtration such that
is true,
Proof At round t our algorithm chooses the point , at which the highest value of
, within current decision set
, is attained. Now if
is greater than
for all saturated points at round t, i.e.,
, then one of the unsaturated points (which includes
) in
must be played and hence
. This implies
Now form Definition 3, , for all
. Also if both the events
and
are true, then from Definition 1 and 2,
, for all
. Thus for all
,
. Therefore, for any filtration
such that
is true, either
is false, or else for all
. Hence, for any
such that
is true,
> f
> f
,
where we have used Lemma 6 and Lemma 8. Now the proof follows from Equation 14.
Lemma 10 For any filtration such that
is true,
where is the instantaneous regret at round t.
Proof Let be the unsaturated point in
with smallest
, i.e.,
where we have used Equation 15 and Lemma 9. Now, if both the events and
are true, then from Definition 1 and 2,
, for all
. Using this observation along with Definition 3 and the facts that
for all
and
, we have
Therefore, for any such that
is true, either
, or
is false. Now from our assumption of bounded variance, for all
, and hence
where in the last inequality we used that , which holds trivially for
and also holds for
, as
. Now using Equation 7, we have the instantaneous regret at round t,
and then taking conditional expectation on both sides, the result follows from Equation 17.
Definition 5 Let us define , and for all t = 1, . . . , T:
Definition 6 A sequence of random variables is called a super-martingale corresponding to a filtration
, if for all
is
-measurable, and for
,
Lemma 11 (Azuma-Hoeffding Inequality) If a super-martingale , corresponding to filtration
, satisfies
for some constant
, for all t = 1, . . . , T, then for any
,
Lemma 12 is a super-martingale process with respect to filtration
.
Proof From Definition 6, we need to prove that for all and any possible
0, i.e.
Now if such that
is false, then
, and Equation 18 holds trivially. Moreover, for
such that both
is true, Equation 18 follows from Lemma 10.
Lemma 13 Given any , with probability at least
,
where T is the total number of rounds played.
Proof First note that from Definition 5 for all t = 1, . . . , T,
Now as and
, we have
which follows from the fact that and also
. Thus, we can apply Azuma-Hoeffding inequality (Lemma 11) to obtain that with probability at least
,
as by definition for all
. Now, as the event
holds holds for all t with probability at least
(see Lemma 6), then from Definition 5,
for all t with probability at least
. Now by applying union bound, the result follows.
Proof of Theorem 4
Hence, from Lemma 13, with probability at least ,
and thus with high probability,
We now describe a procedure to update the posterior mean and covariance function in a recursive fashion through the properties of Schur complement (Zhang (2006)) rather than evaluating Equation 2 and 3 at each round. Specifically for all we show the following:
These update rules make our algorithms easy to implement and we are not aware of any literature which explicitly states or uses these relations.
. Now using Schur’s complement we get
where and
. Therefore we have
where
kA
λI
,
Thus we have
Now as
Now we have
also
and
Putting all these together we get